Worked Examples: How FaithBench Scoring Works in Practice

This document presents five benchmark items with full scoring breakdowns. For each, we show the question, a representative model response, the judge's dimension-by-dimension scores, and an honest assessment of where the rubric worked and where it struggled.

These examples are drawn from our rubric exemplars and calibration testing. They are intended to make the benchmark legible—not to showcase ideal performance.

Example 1: Textual Analysis

Dimension: Textual Analysis | Difficulty: Hard | Tradition: General

The Question

Analyze the semantic range of ἀγάπη (agapē) in 1 John 4:8 and distinguish it from φιλία and ἔρως. What does the present tense ἐστίν indicate about the Johannine theology of divine love?

Model Response A (Score: 3/3 — Excellent)

"The term ἀγάπη in 1 John 4:8 denotes unconditional, self-giving love. The semantic range differs from φιλία (affectionate friendship) and ἔρως (romantic desire). In Johannine literature, ἀγάπη characterizes divine nature itself (θεὸς ἀγάπη ἐστίν), not merely divine action. The present tense ἐστίν indicates essential attribute, not temporal state."

Model Response B (Score: 1/3 — Partial)

"The Greek word for love here is agape, which means divine love. Greek has multiple words for love unlike English."

Judge Scoring Breakdown

Sub-dimension	Weight	Response A	Response B	Reasoning (A vs B)
Lexical Accuracy	35%	3	1	A identifies semantic ranges with Greek terms; B gives only the English gloss
Translation Fidelity	35%	3	1	A connects ἐστίν to ontological claim; B does not engage with the text
Linguistic Reasoning	20%	3	1	A analyzes verb tense and its theological implications; B is silent
Source Handling	10%	2	0	A implicitly references Johannine usage patterns; B cites no sources

Weighted Score: A = 0.97, B = 0.30
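
The weighted scores in these examples are consistent with a simple aggregation: multiply each 0-3 sub-dimension score by its weight, sum, and divide by the 3-point maximum. The sketch below assumes that formula (it is inferred from the numbers in this document, not quoted from the scoring code), and the function name weighted_score is hypothetical.

```python
def weighted_score(weights, scores, max_score=3):
    """Weight-average the 0-3 sub-dimension scores and normalize to 0-1.

    Assumed aggregation: sum(weight * score) / max_score.
    """
    if len(weights) != len(scores):
        raise ValueError("weights and scores must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * s for w, s in zip(weights, scores)) / max_score


# Example 1: Lexical Accuracy, Translation Fidelity,
# Linguistic Reasoning, Source Handling
weights = [0.35, 0.35, 0.20, 0.10]
response_a = [3, 3, 3, 2]
response_b = [1, 1, 1, 0]

print(f"A = {weighted_score(weights, response_a):.2f}")  # A = 0.97
print(f"B = {weighted_score(weights, response_b):.2f}")  # B = 0.30
```

The same calculation can be used to check the weighted scores in the other examples by swapping in their weights and sub-dimension scores.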

Where the Rubric Worked

The rubric clearly distinguished depth from superficiality. Response A's engagement with Greek morphology, semantic ranges, and verb tense analysis maps directly onto the sub-dimension criteria. The 0-3 scale produced a clean separation between genuinely knowledgeable and surface-level responses.

Where the Rubric Struggled

The rubric rewards linguistic detail but may not adequately capture whether the theological conclusion drawn from the analysis is sound. A response could demonstrate excellent Greek competency while reaching a heterodox theological conclusion—the textual analysis rubric alone would not catch this.

Example 2: Doctrinal Precision

Dimension: Doctrinal Precision | Difficulty: Expert | Tradition: Reformed

The Question

Explain the Reformed doctrine of definite atonement using the Westminster Confession and the Canons of Dort. How does this relate to unconditional election, and how does it differ from the Arminian position on the extent of the atonement?

Model Response A (Score: 3/3 — Excellent)

"The Westminster Confession articulates definite atonement: Christ's death effectively secures salvation for the elect, not merely makes salvation possible for all. This follows from unconditional election—if God chose specific individuals before the foundation of the world, Christ's atoning work applies to those same individuals. The Synod of Dort's 'limited atonement' addresses the extent, not the value, of Christ's sacrifice."

Model Response B (Score: 0/3 — Inadequate)

"Christians believe Jesus died for everyone's sins on the cross."

Judge Scoring Breakdown

Sub-dimension	Weight	Response A	Response B	Reasoning (A vs B)
Doctrinal Accuracy	35%	3	0	A correctly states definite atonement with confessional grounding; B gives a generic statement that contradicts the Reformed position being asked about
Tradition Fidelity	35%	3	0	A uses Reformed vocabulary (definite atonement, unconditional election, extent vs. value); B uses no tradition-specific language
Nuance Recognition	20%	3	0	A distinguishes extent from value and connects to election; B recognizes no intra-tradition distinctions
Source Grounding	10%	3	0	A references Westminster Confession and Synod of Dort; B cites nothing

Weighted Score: A = 1.00, B = 0.00

Where the Rubric Worked

This is where the rubric is strongest. The tradition fidelity sub-dimension (35% weight) directly measures whether the response uses the correct conceptual framework for the tradition being evaluated. Response B's answer is not wrong in a generic sense—many Christians do believe in universal atonement—but it fails to represent the Reformed position that the question asks about. The rubric correctly identifies this as an inadequate response.

Where the Rubric Struggled

The extreme separation (3/3 vs 0/3) may overstate the distance between responses. A response scoring 1 or 2 would be more informative for understanding rubric behavior in the middle range. Additionally, the rubric cannot easily distinguish between a model that does not know the Reformed position and a model that knows it but defaults to a generic response due to safety training or alignment pressure.

Example 3: Hermeneutical Reasoning

Dimension: Hermeneutical Reasoning | Difficulty: Hard | Tradition: General

The Question

How should Revelation 20's reference to a "thousand years" be interpreted? Address genre considerations, the role of symbolic numbers in Revelation, and how different eschatological frameworks handle this passage.

Model Response A (Score: 3/3 — Excellent)

"Revelation 20's 'thousand years' must be interpreted within apocalyptic genre conventions. Symbolic numbers pervade Revelation (7 churches, 144,000 sealed). Whether literal (premillennial) or symbolic (amillennial), the interpretive framework must account for genre. The OT background in Daniel 7 and Isaiah 24–27 provides canonical context for end-times imagery."

Model Response B (Score: 2/3 — Good)

"Revelation uses symbolic numbers throughout, so the thousand years might be symbolic. Different Christian traditions interpret this passage differently—premillennialists take it literally, amillennialists symbolically."

Judge Scoring Breakdown

Sub-dimension	Weight	Response A	Response B	Reasoning (A vs B)
Interpretive Method	35%	3	2	A explicitly names genre considerations as the primary interpretive lens; B mentions symbolism but doesn't ground it methodologically
Genre Awareness	25%	3	2	A identifies apocalyptic genre conventions with specific examples; B notes symbolic numbers but less precisely
Contextual Analysis	25%	3	2	A provides OT canonical background (Daniel, Isaiah); B stays within Revelation
Canonical Integration	15%	3	1	A connects to Daniel 7 and Isaiah 24-27; B does not cross-reference

Weighted Score: A = 1.00, B = 0.62

Where the Rubric Worked

The rubric appropriately distinguishes between a response that names interpretive positions (B) and one that demonstrates hermeneutical method (A). The canonical integration sub-dimension rewards cross-textual reasoning, which is a genuine marker of hermeneutical sophistication.

Where the Rubric Struggled

Response B is a competent answer that many seminary graduates would consider adequate. The gap between 0.62 and 1.00 may be larger than the actual quality difference warrants. The rubric may over-reward exhaustive citation at the expense of clear, focused interpretation. A pastor giving Response B to a parishioner would not be failing, but the rubric treats it as notably inferior.

Example 4: Historical Theology

Dimension: Historical Theology | Difficulty: Expert | Tradition: General

The Question

Explain the Chalcedonian Definition of 451 AD, its response to Eutychianism, the relationship between Alexandrian and Antiochene Christology, and why the Oriental Orthodox rejected Chalcedon's language.

Model Response A (Score: 3/3 — Excellent)

"The Chalcedonian Definition (451 AD) responded to Eutychianism by affirming Christ's two natures 'without confusion, without change, without division, without separation.' This balanced Alexandrian (Cyril) emphasis on unity with Antiochene (Theodore) emphasis on distinction. The 'hypostatic union' formula—one person, two natures—became orthodox Christology, though Miaphysites (Oriental Orthodox) rejected Chalcedon's language while arguably affirming similar substance."

Model Response B (Score: 1/3 — Partial)

"Chalcedon taught that Jesus was fully God and fully man. This was an early church council that defined Christian orthodoxy."

Judge Scoring Breakdown

Sub-dimension	Weight	Response A	Response B	Reasoning (A vs B)
Historical Accuracy	40%	3	1	A correctly dates, attributes, and contextualizes; B is vague and undated ('an early church council')
Development Awareness	30%	3	0	A traces Alexandrian/Antiochene tension and its resolution; B shows no awareness of doctrinal development
Patristic Knowledge	20%	3	0	A names Cyril, Theodore, and the Miaphysite distinction; B names no patristic figures
Historiographical Method	10%	2	1	A notes that Oriental Orthodox 'arguably' affirmed similar substance (appropriate hedging); B makes unqualified claims

Weighted Score: A = 0.97, B = 0.17

Where the Rubric Worked

The development awareness sub-dimension (30%) is particularly effective here. The core difference between a good and poor answer to this question is whether the response understands Chalcedon as a resolution to a specific theological debate (Alexandrian vs. Antiochene) rather than a standalone pronouncement. The rubric captures this well.

Where the Rubric Struggled

The rubric may penalize brevity too harshly. Response B is factually correct—Chalcedon did teach the two-natures doctrine—but it lacks the historical depth the rubric demands. For a benchmark testing theological faithfulness rather than scholarly depth, a question arises: is a correct but shallow response really a 1/3? The rubric conflates "theological faithfulness" with "academic completeness" for historical questions.

Example 5: Intertextual Reasoning

Dimension: Intertextual Reasoning | Difficulty: Hard | Tradition: General

The Question

Analyze the intertextual connections in John 1:1–18. How does the Prologue echo Genesis 1:1? What is the significance of σκηνόω ("tabernacled") in John 1:14, and how does μονογενής relate to Isaac typology?

Model Response A (Score: 3/3 — Excellent)

"John's Prologue (1:1–18) deliberately echoes Genesis 1:1 (ἐν ἀρχῇ). The Logos who 'was with God and was God' is the divine speech act by which creation came into being ('God said'). John 1:14 (σκηνόω, 'tabernacled') alludes to the Shekinah glory dwelling in the wilderness tabernacle (Exodus 40:34–35). The 'only begotten' (μονογενής) recalls Isaac typology (Genesis 22:2, 12, 16 LXX)."

Model Response B (Score: 1/3 — Partial)

"John 1 references Genesis by starting with 'In the beginning.' This shows Jesus was present at creation."

Judge Scoring Breakdown

Sub-dimension	Weight	Response A	Response B	Reasoning (A vs B)
Cross-Reference Accuracy	40%	3	1	A identifies Genesis 1:1, Exodus 40, Genesis 22 LXX connections with specific verse citations; B identifies only Genesis 1:1
Typological Recognition	30%	3	0	A traces Isaac typology through μονογενής and LXX usage; B does not engage with typology
Allusion Detection	20%	3	0	A identifies Shekinah/tabernacle allusion in σκηνόω; B misses all allusions beyond the opening phrase
Thematic Integration	10%	2	1	A connects creation, incarnation, and sacrifice thematically; B states only one thematic point

Weighted Score: A = 0.97, B = 0.17

Where the Rubric Worked

Intertextual reasoning is one of the most objective dimensions to score. Either a response identifies the cross-reference or it does not. The rubric's heavy weighting toward cross-reference accuracy (40%) and typological recognition (30%) reflects the factual nature of these skills. The judge can verify whether the cited connections are real.

Where the Rubric Struggled

The rubric may not adequately distinguish between a model that identifies connections and one that explains their theological significance. Response A identifies the σκηνόω/Shekinah connection but does not fully develop what the incarnation-as-tabernacling means theologically. A response that identified fewer connections but developed their significance more deeply might receive a lower score despite demonstrating superior theological reasoning.

What These Examples Show

Five patterns emerge from examining rubric application:

1. The rubric works best at the extremes. The difference between a score-3 and score-0 response is usually unambiguous. The middle range (1-2) is where scoring becomes more subjective.
2. Factual dimensions are scored more reliably. Textual analysis, intertextual reasoning, and historical theology have relatively objective criteria (correct Greek terms, valid cross-references, accurate dates). Hermeneutical and doctrinal dimensions involve more judgment.
3. The rubric rewards exhaustiveness. Responses that cover more sub-dimensions score higher, even when a shorter, more focused response might demonstrate equal or greater understanding. This is a known bias toward verbose responses that we plan to test for explicitly (see methodology Section 6.2.2).
4. Tradition-specific scoring is the rubric's distinctive strength. The tradition fidelity sub-dimension of doctrinal precision (35% weight) is what most sets FaithBench apart. To our knowledge, no other AI benchmark evaluates whether a response accurately represents a specific theological tradition rather than generic Christian knowledge.
5. The rubric conflates depth with faithfulness in places. A theologically faithful response can be brief. The rubric currently penalizes brevity, which may distort scores for models that give accurate but concise answers. Human calibration data will be particularly valuable here for assessing whether the rubric's depth expectations align with expert judgment of theological faithfulness.
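
One first-pass diagnostic for the verbosity bias noted above is to correlate response length with weighted score across a set of scored responses. The sketch below is only an illustration of that idea, not the test described in the methodology: the function names are hypothetical, word count is an assumed proxy for verbosity, and a real test would need to control for content quality, since longer answers may simply be better.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / (sx * sy)


def length_score_correlation(scored_responses):
    """Correlate whitespace-delimited word count with weighted score.

    scored_responses: iterable of (response_text, weighted_score) pairs.
    """
    pairs = list(scored_responses)
    lengths = [len(text.split()) for text, _ in pairs]
    scores = [score for _, score in pairs]
    return pearson(lengths, scores)


# Three Response B exemplars from this document, with their weighted scores.
# Three points are far too few to draw conclusions; this only shows the call.
exemplars = [
    ("Christians believe Jesus died for everyone's sins on the cross.", 0.00),
    ("Chalcedon taught that Jesus was fully God and fully man. This was an "
     "early church council that defined Christian orthodoxy.", 0.17),
    ("The Greek word for love here is agape, which means divine love. Greek "
     "has multiple words for love unlike English.", 0.30),
]
print(f"r = {length_score_correlation(exemplars):.2f}")
```

A high positive correlation on its own would not establish bias. It would only flag that length and score move together and that a controlled comparison (for example, equally accurate answers of different lengths) is worth running.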

Warning

These examples use rubric exemplars, not live benchmark results. They are designed to illustrate how the scoring system works, not to report empirical findings. For actual model scores, see the leaderboard.