Research Preview: FaithBench is an active research project. All scores are preliminary, generated by a single AI judge, and pending human validation. See our Limitations section for details.

FaithBench: Toward Tradition-Aware Evaluation of Theological Faithfulness in Large Language Models

Abstract

We introduce FaithBench, an initial benchmark for evaluating the theological faithfulness of large language models (LLMs). Existing evaluations suggest that models struggle disproportionately with faith-related content, yet no dedicated benchmark exists for this domain. FaithBench proposes a multi-dimensional evaluation framework spanning six theological competencies, tradition-aware scoring rubrics, and statistical validation protocols. We operationalize "theological faithfulness" as accuracy in representing primary source texts, doctrinal positions, and reasoning patterns within specific Christian traditions. Current results rely primarily on LLM-as-judge evaluation and should be interpreted as preliminary pending expert calibration. This paper details our construct definition, scoring framework, known limitations, and validation roadmap to enable reproducible theological AI evaluation.


1. Introduction

1.1 Motivation

The deployment of LLMs in religious contexts—seminary education, pastoral counseling tools, biblical study applications—necessitates rigorous evaluation of theological competency. Yet current benchmarks inadequately assess this domain. The Gloo FAI-C benchmark found that the "Faith" dimension scored lowest among all seven evaluation categories, averaging just 48/100 across frontier models (Gloo, 2025).

The problem extends beyond factual errors. Models exhibit systematic failure modes:

  • Generic collapse: Defaulting to ecumenical platitudes instead of tradition-specific claims
  • Denominational conflation: Treating distinct positions as interchangeable
  • Doctrinal flattening: Presenting contested issues as settled or vice versa
  • Scriptural mishandling: Inaccurate citation, decontextualization, or proof-texting

1.2 Contributions

FaithBench addresses these gaps with:

  1. Operationalized construct: Explicit definition of theological faithfulness mapped to measurable dimensions
  2. Tradition-aware evaluation: Scoring rubrics calibrated to denominational distinctives
  3. Statistical methods: Bootstrap confidence intervals (IRR metrics planned with human calibration)
  4. Reproducibility: Full evaluation artifacts published—judge prompts, rubric weights, model configs—on this page
  5. Bias documentation: Position, verbosity, and tradition fairness analysis protocols (execution in progress)

1.3 A Note on Benchmarking

Benchmarking LLMs is itself methodologically unsettled, and FaithBench is no exception. We discuss the broader challenges of AI benchmarking, and where FaithBench stands relative to them, in Section 11.


2. Related Work

2.1 LLM Evaluation Benchmarks

HELM (Liang et al., 2023) established holistic evaluation across capabilities, though religion receives minimal coverage. MMLU (Hendrycks et al., 2021) includes philosophy and ethics but lacks theological depth. Domain-specific benchmarks exist for medicine (MedQA), law (LegalBench), and science, but no equivalent exists for theology.

2.2 LLM-as-Judge Methodologies

G-Eval (Liu et al., 2024) demonstrated that LLM judges achieve strong correlation with human judgment when given detailed rubrics. MT-Bench (Zheng et al., 2024) validated pairwise comparison protocols. We adopt rubric-based absolute scoring with calibration against human expert baselines.

2.3 Construct Validity in NLP

Raji et al. (2021) and Bowman & Dahl (2021) critiqued benchmark construct validity, arguing that many NLP benchmarks lack clear phenomenon-to-task mapping. We explicitly define our construct and justify dimension selection.

2.4 Theological AI Evaluation

Prior work on religious AI is limited. Studies have examined bias in religious representation (Abid et al., 2021) but not systematic theological competency. FaithBench addresses this gap with a structured, transparent methodology.

2.5 Benchmarking Methodology

Epoch AI (2025) provides a comprehensive analysis of why benchmarking is hard, documenting significant variance from prompt templates, API providers, and single-run evaluations. Their findings directly inform our limitations disclosure and hardening roadmap (see Sections 8 and 11).


3. Construct Definition

3.1 Defining Theological Faithfulness

We operationalize theological faithfulness as accuracy in representing:

| Component | Definition |
|---|---|
| Primary Source Fidelity | Accurate handling of biblical texts in original languages (Hebrew, Aramaic, Greek) |
| Doctrinal Precision | Correct representation of tradition-specific systematic theology |
| Historical Awareness | Understanding of theological development across church history |
| Hermeneutical Competence | Appropriate interpretive methodology and genre awareness |
| Apologetic Reasoning | Sound argumentation within Christian intellectual tradition |
| Intertextual Recognition | Identification of canonical connections, typology, and allusions |

3.2 Justification for Dimension Selection

Our six dimensions derive from established theological education standards:

  • Association of Theological Schools (ATS): Accreditation standards for M.Div. programs specify competencies in biblical languages, systematic theology, and church history
  • Confessional Standards: Reformed (Westminster), Catholic (Catechism), Orthodox (Philokalia), and Evangelical (Chicago Statement) documents define tradition-specific requirements
  • Seminary Curricula: Analysis of 20 seminary curricula across traditions reveals consistent emphasis on these competencies

3.3 Tradition-Specific Evaluation Rationale

Models are evaluated within tradition contexts rather than against a neutral standard because:

  1. Theological disagreement is genuine: Traditions hold incompatible positions on key doctrines
  2. Accuracy is tradition-relative: A correct Catholic answer may be incorrect from a Reformed perspective
  3. Conflation is the failure mode: Generic responses that avoid specificity fail both traditions

3.4 Dimension Inclusion Arguments

For each dimension, we provide an explicit argument for inclusion, what failure looks like without it, and why exclusion would distort the evaluation.

Why each dimension belongs:

The table below uses the scoring dimension names (e.g., "Textual Analysis"), which correspond to the construct components defined in Section 3.1 (e.g., "Primary Source Fidelity"). The mapping is:

  • Textual Analysis = Primary Source Fidelity
  • Hermeneutical Reasoning = Hermeneutical Competence
  • Doctrinal Precision = Doctrinal Precision
  • Historical Theology = Historical Awareness
  • Apologetics = Apologetic Reasoning
  • Intertextual Reasoning = Intertextual Recognition

| Dimension | Why Included | Failure Without It | Exclusion Would Distort Because... |
|---|---|---|---|
| Textual Analysis | Biblical text is the primary source material for all Christian theology | A model could give doctrinally correct answers while misrepresenting the underlying texts | Theological claims ultimately rest on textual interpretation; ignoring source competency masks foundational errors |
| Hermeneutical Reasoning | Interpretation determines which doctrinal conclusions follow from texts | A model could cite texts accurately but apply inappropriate interpretive methods | Without evaluating interpretive method, we cannot distinguish sound from unsound theological reasoning |
| Doctrinal Precision | Tradition-specific doctrine is the core construct of theological faithfulness | A model could reason well from texts but misrepresent what traditions actually teach | This is the most direct measure of the construct; excluding it would undermine the benchmark's purpose |
| Historical Theology | Theological positions developed through historical processes and debates | A model could state current doctrine correctly while being anachronistic about its development | Historical context prevents misattributing modern formulations to ancient periods |
| Apologetics | Sound argumentation is integral to the Christian intellectual tradition | A model could know doctrine but present logically invalid arguments for it | Theological competence includes the ability to reason about and defend positions, not just state them |
| Intertextual Reasoning | The Christian canon is treated as an interconnected whole across traditions | A model could handle individual passages but miss canonical connections that inform doctrine | Cross-textual reasoning is fundamental to how theological conclusions are derived from Scripture |

What we intentionally excluded and why:

| Excluded Competency | Reason for Exclusion |
|---|---|
| Pastoral reasoning | Measures professional counseling competency and application wisdom, not theological knowledge accuracy; applied reasoning questions test theological integration in scenarios but do not evaluate pastoral care skills |
| Spiritual formation | Subjective and experiential; not measurable via text-based Q&A |
| Liturgical competency | Tradition-specific practice knowledge; partially captured under doctrinal and historical dimensions |
| Ethics / moral theology | Large enough to be its own benchmark; partially captured under doctrinal precision |
| Biblical languages as standalone | Captured within textual analysis; making it separate would over-weight linguistic competency relative to theological reasoning |

4. Evaluation Framework

4.1 Dimension Taxonomy

1. Textual Analysis (weight: 25/100)

Biblical language competency: Greek and Hebrew lexical accuracy, morphological analysis, translation evaluation, and textual criticism.

2. Hermeneutical Reasoning (weight: 20/100)

Interpretive methodology: genre awareness, contextual analysis, canonical integration, and application of hermeneutical principles.

3. Doctrinal Precision (weight: 20/100)

Systematic theology: accurate representation of tradition-specific doctrinal positions, creedal formulations, and denominational distinctives.

4. Historical Theology (weight: 15/100)

Church history: patristic sources, Reformation debates, doctrinal development, and historiographical method.

5. Apologetics (weight: 10/100)

Philosophical theology: logical validity, evidence usage, objection handling, and Christian intellectual tradition.

6. Intertextual Reasoning (weight: 10/100)

Canonical connections: cross-references, typological recognition, allusion detection, and thematic integration.

4.2 Scoring Scale

We employ a 0–3 scoring scale, following LLM-as-judge research showing that coarse-grained scales reduce judge variability while maintaining discriminative power (Zheng et al., 2024):

| Score | Label | Criteria |
|---|---|---|
| 3 | Excellent | Fully accurate, demonstrates depth, uses appropriate vocabulary and sources |
| 2 | Good | Mostly accurate with minor gaps, adequate vocabulary |
| 1 | Partial | Some accuracy but significant errors or omissions |
| 0 | Inadequate | Incorrect, misleading, or fails to address the question |

Raw scores (0–3) are normalized to 0–1 by dividing by 3, then weighted by dimension weights to produce a final composite score per test case.

4.3 Weight Derivation

Provisional Weight Justification:

| Dimension | Weight | Rationale |
|---|---|---|
| Textual Analysis | 25% | Biblical text is foundational to all theological reasoning |
| Hermeneutical Reasoning | 20% | Interpretation determines doctrinal conclusions |
| Doctrinal Precision | 20% | Correct representation of tradition is core to faithfulness |
| Historical Theology | 15% | Historical awareness prevents anachronism and error |
| Apologetics | 10% | Important but not primary for most use cases |
| Intertextual Reasoning | 10% | Supports but does not drive theological claims |

Planned Methodology: Delphi consensus with:

  • 5+ theologians per evaluated tradition
  • 3 rounds of independent weighting
  • Convergence criterion: IQR < 10%

4.4 Rubric Specifications

Each dimension has weighted sub-dimensions that the LLM judge scores independently. The weights below are the actual values used in production scoring (copied from our scoring code).

4.4.1 Textual Analysis (25%)

| Sub-dimension | Weight | Description |
|---|---|---|
| Lexical Accuracy | 35% | Greek/Hebrew terms, morphology, semantic ranges |
| Translation Fidelity | 35% | Source-to-target accuracy, theological implications |
| Linguistic Reasoning | 20% | Grammar, syntax, verb tenses, discourse analysis |
| Source Handling | 10% | Manuscript variants, textual criticism principles |

4.4.2 Hermeneutical Reasoning (20%)

| Sub-dimension | Weight | Description |
|---|---|---|
| Interpretive Method | 35% | Sound hermeneutical principles, grammatical-historical method |
| Genre Awareness | 25% | Recognition of literary forms (narrative, poetry, apocalyptic) |
| Contextual Analysis | 25% | Historical, cultural, literary context |
| Canonical Integration | 15% | Scripture interpreting Scripture, redemptive-historical reading |

4.4.3 Doctrinal Precision (20%)

| Sub-dimension | Weight | Description |
|---|---|---|
| Doctrinal Accuracy | 35% | Correct representation of tradition's position |
| Tradition Fidelity | 35% | Appropriate vocabulary, conceptual framework |
| Nuance Recognition | 20% | Awareness of internal debates, development |
| Source Grounding | 10% | References to authoritative sources |

4.4.4 Historical Theology (15%)

| Sub-dimension | Weight | Description |
|---|---|---|
| Historical Accuracy | 40% | Correct dating, attribution, context |
| Development Awareness | 30% | Understanding of doctrinal evolution |
| Patristic Knowledge | 20% | Church fathers and early sources |
| Historiographical Method | 10% | Sound historical reasoning |

4.4.5 Apologetics (10%)

| Sub-dimension | Weight | Description |
|---|---|---|
| Logical Validity | 35% | Sound argumentation structure |
| Evidence Usage | 30% | Appropriate use of supporting evidence |
| Objection Handling | 25% | Fair representation and response to critiques |
| Persuasive Clarity | 10% | Clear communication of arguments |

4.4.6 Intertextual Reasoning (10%)

| Sub-dimension | Weight | Description |
|---|---|---|
| Cross-Reference Accuracy | 40% | Correct identification of parallel passages |
| Typological Recognition | 30% | OT/NT typological connections |
| Allusion Detection | 20% | Recognition of biblical allusions |
| Thematic Integration | 10% | Coherent thematic connections |

5. Methodology

5.1 Test Case Construction

5.1.1 Sampling Strategy

Test cases are systematically sampled across:

| Dimension | Difficulty | Tradition | Question Type |
|---|---|---|---|
| 6 dimensions | 4 levels (Easy/Medium/Hard/Expert) | 4 traditions (with more in development) | 4 types |

Sampling criteria:

  • Minimum 10 cases per dimension-difficulty cell
  • Stratified by tradition where applicable
  • Expert review for content validity

Question Types:

  1. Factual Recall: Direct knowledge with verifiable answers
  2. Comparative Analysis: Contrasting positions across traditions/translations
  3. Applied Reasoning: Scenarios requiring theological integration across concepts
  4. Contested Topics: Questions with multiple valid tradition-specific answers

5.1.2 Difficulty Calibration

| Level | Characteristics | Target Audience | Target Accuracy |
|---|---|---|---|
| Easy | Introductory level, clear answers | First-year seminary | 80-95% |
| Medium | Advanced coursework, nuanced understanding | M.Div. graduate | 60-80% |
| Hard | Specialist knowledge, original languages, contested interpretations | Ph.D./faculty level | 40-70% |
| Expert | Multi-hop reasoning, Google-proof, adversarial design | Specialist scholars only | 30-50% |

Difficulty calibration validated by:

  • Expert rating (3 theologians per case)
  • Empirical difficulty from pilot testing
  • Item response theory analysis (planned for v2.0)

5.1.3 Public/Held-Out Split

| Set | Percentage | Purpose |
|---|---|---|
| Public | 50% | Transparency, reproducibility, model development |
| Held-Out | 50% | Prevent data contamination, detect gaming |

Held-out cases are rotated semi-annually to prevent leakage while maintaining evaluation validity.

5.1.4 Expert-Level Design Principles

Expert questions are designed using principles from high-discrimination benchmarks (GPQA, Humanity's Last Exam, MMLU-Pro):

Design Criteria:

| Principle | Description | Example |
|---|---|---|
| Multi-hop reasoning | Require synthesizing 3+ distinct facts | "Compare Athanasius and Arius on Col 1:15, explain grammatical argument, and trace to Nicaea" |
| Google-proof | Cannot be solved by simple web search | Questions requiring synthesis across multiple scholarly sources |
| Adversarial distractors | Plausible wrong answers from common misconceptions | Misattributed patristic quotes, conflated tradition positions |
| Intra-tradition precision | Distinguish within traditions, not just between | Supralapsarian vs infralapsarian, Thomist vs Molinist |
| Abstention testing | Some questions where "insufficient evidence" is correct | "What was Origen's final position on X?" where evidence is fragmentary |

Question Categories by Section:

  • Textual: Hapax legomena, manuscript variants with theological stakes, grammatical ambiguity
  • Hermeneutical: Genre disputes, sensus plenior debates, typology vs allegory boundaries
  • Doctrinal: Intra-tradition debates (supra/infra, Thomist/Molinist, essence-energies)
  • Historical: Patristic attribution traps, council canon specifics, Reformation debate details
  • Apologetics: Modal logic arguments, grounding objections, internal critiques
  • Intertextual: Second Temple interpretation, MT vs LXX usage patterns, composite quotations

Validation Protocol:

Expert questions undergo LLM pre-screening before inclusion:

  1. Test against 3 frontier models (GPT-5, Claude Opus, Gemini Pro)
  2. Reject if ANY model scores >70% (too easy)
  3. Reject if ALL models score <20% (possibly ambiguous)
  4. Target: 30-50% accuracy range for maximum discrimination
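
A minimal TypeScript sketch of this screening rule; the function and result names are illustrative, not taken from our codebase:

```typescript
// Illustrative pre-screening filter for candidate Expert questions.
// `accuracies` maps each frontier model to its accuracy (0-1) on the candidate item.
type ScreeningResult = "accepted" | "rejected_too_easy" | "rejected_possibly_ambiguous";

function screenExpertQuestion(accuracies: Record<string, number>): ScreeningResult {
  const values = Object.values(accuracies);
  if (values.some((a) => a > 0.7)) return "rejected_too_easy";            // any model > 70%
  if (values.every((a) => a < 0.2)) return "rejected_possibly_ambiguous"; // all models < 20%
  return "accepted"; // candidate proceeds to expert review
}

// Example: accepted, since no model exceeds 70% and not all fall below 20%
screenExpertQuestion({ "gpt-5": 0.35, "claude-opus": 0.4, "gemini-pro": 0.25 });
```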

5.1.5 Difficulty Distribution Rationale

To optimize inference costs and benchmark utility, Easy and Medium questions were deactivated for scoring. The current active test set emphasizes:

  • Hard questions (~60%): Ph.D./faculty-level content where models show meaningful variation
  • Expert questions (~40%): Multi-hop reasoning and adversarial design for maximum discrimination

This approach follows psychometric best practices: items that all test-takers answer correctly (or incorrectly) contribute no information about relative ability. By focusing on Hard and Expert tiers, FaithBench maximizes the signal-to-noise ratio while reducing evaluation costs.

5.2 Model Configuration

5.2.1 Evaluation Parameters

| Parameter | Setting | Rationale |
|---|---|---|
| Temperature | 0.7 | Balances determinism with natural variation |
| Max Tokens | 2,000 | Sufficient for theological responses without truncation |
| Reasoning/Thinking | Disabled (where applicable) | Controlled comparison of base knowledge (see Section 9) |
| System prompts | Minimal | Avoid biasing responses |
| Provider | OpenRouter | Unified API access to all models (see Section 5.5) |

Important limitations:

  • Reasoning models (e.g., o1, o3, Claude with extended thinking): Tested without explicit reasoning enabled. These models would likely score higher with thinking/reasoning turned on.
  • Internal reasoning: Some models reason internally by default—we cannot control this and do not penalize it.
  • Future work: We plan to add "thinking-enabled" variants to the leaderboard to show the performance delta.

5.3 LLM-as-Judge Protocol

5.3.1 Judge Model Selection

| Parameter | Value | Rationale |
|---|---|---|
| Primary Judge | google/gemini-3-flash-preview | Cost-effective, strong reasoning, no theological fine-tuning |
| Fallback Judge | openai/gpt-4o-mini | Reliability backup when primary is unavailable |
| Temperature | 0 | Deterministic for scoring consistency |
| Max Tokens | 16,000 | Prevents truncation of judge reasoning |
| Output Format | JSON (structured) | Machine-parseable dimension scores |

Selection criteria for primary judge:

  • Strong performance on reasoning benchmarks
  • No theological fine-tuning (reduces bias)
  • Consistent rubric application in validation testing
  • Cost-effectiveness for large-scale evaluation

The fallback judge activates automatically when the primary judge returns an API error. The judge model used for each evaluation is recorded in the test result metadata.

Planned secondary judge: OpenAI GPT-5 for cross-validation (see Section 5.3.2)

5.3.2 Self-Preference Bias Awareness

The Problem

When the judge model evaluates its own outputs (e.g., Gemini judging Gemini responses), scores may be artificially inflated due to:

  • Stylistic familiarity with own output patterns
  • Implicit preference for own reasoning approaches
  • Shared training data artifacts

Current State (v1.0)

We currently use a single judge model (Gemini 3 Flash) for all evaluations. This means:

  • Gemini model scores may be inflated when Gemini judges its own outputs
  • We acknowledge this limitation and flag it in results
  • Users should interpret Gemini scores with this caveat in mind

Planned Mitigation (v2.0)

With additional funding, we plan to implement cross-validation:

  1. Primary judge: Gemini 3 Flash
  2. Secondary judge: GPT-5 (or similar non-Gemini model)
  3. Conflict handling: When evaluated model = judge model family, use secondary judge score
  4. Dual reporting: Show both scores for transparency

Illustrative dual-reporting view (example values, not real scores, since cross-validation is not yet operational):

| Model Evaluated | Primary (Gemini) | Cross-Val (GPT-5) | Leaderboard Uses |
|---|---|---|---|
| Claude Opus | 0.94 | n/a | Primary |
| Gemini models | 0.92 | 0.88 | Cross-val |
| GPT-5 | n/a | 0.91 | Cross-val only |

Why This Matters

Self-preference bias is a known limitation of single-judge LLM evaluation. By documenting this limitation and our mitigation roadmap, we maintain methodological transparency while working toward a more robust evaluation protocol.

5.3.3 Judge Prompt: Exact System Prompt

The following is the exact system prompt sent to the judge model. This is copied directly from our production scoring code (packages/convex/convex/scoring.ts):

You are an expert theological evaluator assessing LLM responses
for the FaithBench benchmark.
Your role is to score responses on a 0-3 scale for each dimension.

SCORING SCALE:
0 = Inadequate: Incorrect, misleading, or fails to address
1 = Partial: Some accuracy but significant errors or omissions
2 = Good: Mostly accurate with minor gaps
3 = Excellent: Fully accurate, demonstrates depth

EVALUATION DIMENSIONS for {section}:
- {dimensionName} ({weight}%): {description}
- {dimensionName} ({weight}%): {description}
...

OUTPUT FORMAT:
Respond with valid JSON only:
{
  "scores": [
    { "dimension": "<dimension_name>",
      "rawScore": <0-3>,
      "reasoning": "<brief justification>" }
  ]
}

Be strict but fair. Evaluate theological accuracy, not style or length.

The {section} placeholder is replaced with the test case's dimension (e.g., "textual", "doctrinal"), and each dimension's sub-dimensions and weights are injected from the rubric configuration.
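
For illustration, a minimal sketch of how those dimension lines could be rendered from a rubric configuration; the type shapes and helper name are assumptions, not the production structures in scoring.ts:

```typescript
// Illustrative rendering of the dimension lines injected into the judge system prompt.
interface SubDimension { name: string; weight: number; description: string }
type RubricConfig = Record<string, SubDimension[]>; // keyed by section, e.g. "textual"

function renderDimensionLines(section: string, rubric: RubricConfig): string {
  return (rubric[section] ?? [])
    .map((d) => `- ${d.name} (${d.weight}%): ${d.description}`)
    .join("\n");
}

// For a "textual" test case this produces lines such as:
// - Lexical Accuracy (35%): Greek/Hebrew terms, morphology, semantic ranges
// - Translation Fidelity (35%): Source-to-target accuracy, theological implications
// ...
```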

5.3.4 Judge Prompt: Evaluation Prompt Structure

The evaluation prompt sent as the user message follows this structure:

QUESTION:
{the original test case prompt}

MODEL RESPONSE:
{the model's response being evaluated}

REFERENCE ANSWER (for comparison):
{reference answer, when available}

TRADITION CONTEXT: {tradition}
Evaluate fidelity to this specific theological tradition.

Evaluate the model response and provide scores for each dimension.

The reference answer and tradition context blocks are included only when the test case provides them.
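
A sketch of how this user message could be assembled, with illustrative field names (the actual implementation in scoring.ts may differ):

```typescript
// Illustrative assembly of the judge's user message; field names are assumptions.
interface TestCase { prompt: string; referenceAnswer?: string; tradition?: string }

function buildEvaluationPrompt(testCase: TestCase, modelResponse: string): string {
  const parts = [
    `QUESTION:\n${testCase.prompt}`,
    `MODEL RESPONSE:\n${modelResponse}`,
  ];
  if (testCase.referenceAnswer) {
    parts.push(`REFERENCE ANSWER (for comparison):\n${testCase.referenceAnswer}`);
  }
  if (testCase.tradition) {
    parts.push(
      `TRADITION CONTEXT: ${testCase.tradition}\n` +
        "Evaluate fidelity to this specific theological tradition.",
    );
  }
  parts.push("Evaluate the model response and provide scores for each dimension.");
  return parts.join("\n\n");
}
```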

5.3.5 Error Handling

  • Retry logic: 3 attempts with exponential backoff on API failures
  • Consistency check: Flag responses where judge reasoning contradicts score
  • Edge case routing: Ambiguous cases flagged for human review

5.4 Scoring Calculation

5.4.1 Per-Test-Case Scoring

For each test case, the judge returns raw scores (0–3) for each sub-dimension. The composite score is calculated as:

  1. Normalize: Each raw score is divided by 3 to produce a 0–1 normalized score
  2. Weight: Each normalized score is multiplied by its sub-dimension weight
  3. Sum: Weighted scores are summed to produce the test case's composite score (0–1)

For example, for a Textual Analysis test case:

composite = (lexicalAccuracy/3 × 0.35)
          + (translationFidelity/3 × 0.35)
          + (linguisticReasoning/3 × 0.20)
          + (sourceHandling/3 × 0.10)
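
A minimal TypeScript sketch of this calculation; the type and field names are illustrative, and the weights are the Textual Analysis sub-dimension weights from Section 4.4.1:

```typescript
// Illustrative per-test-case composite scoring.
interface JudgeScore { dimension: string; rawScore: number } // rawScore is 0-3
type SubDimensionWeights = Record<string, number>;           // fractions summing to 1.0

function compositeScore(scores: JudgeScore[], weights: SubDimensionWeights): number {
  return scores.reduce((total, s) => {
    const normalized = s.rawScore / 3;                       // 0-3 -> 0-1
    return total + normalized * (weights[s.dimension] ?? 0); // weight and sum
  }, 0);
}

// Textual Analysis example with raw scores 3, 2, 2, 1 -> composite = 0.75
compositeScore(
  [
    { dimension: "lexicalAccuracy", rawScore: 3 },
    { dimension: "translationFidelity", rawScore: 2 },
    { dimension: "linguisticReasoning", rawScore: 2 },
    { dimension: "sourceHandling", rawScore: 1 },
  ],
  { lexicalAccuracy: 0.35, translationFidelity: 0.35, linguisticReasoning: 0.2, sourceHandling: 0.1 },
);
```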

5.4.2 Difficulty Weighting

Test cases are weighted by difficulty level when computing model-level aggregate scores:

| Difficulty | Weight | Rationale |
|---|---|---|
| Easy | 1.0 | Baseline competency (currently deactivated) |
| Medium | 1.5 | Intermediate competency (currently deactivated) |
| Hard | 2.0 | Expert-level content where models differentiate |
| Expert | 3.0 | Maximum discrimination items |

This means an Expert-level test case contributes 3× more to a model's aggregate score than an Easy-level case. The rationale: getting hard questions right is a stronger signal of theological competence than getting easy questions right.
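
One way to express this is a difficulty-weighted mean; the sketch below is an interpretation of the description above, not the production aggregation code:

```typescript
// Illustrative difficulty-weighted aggregation of per-test-case composite scores.
type Difficulty = "easy" | "medium" | "hard" | "expert";
const DIFFICULTY_WEIGHTS: Record<Difficulty, number> = { easy: 1.0, medium: 1.5, hard: 2.0, expert: 3.0 };

interface TestResult { composite: number; difficulty: Difficulty } // composite in 0-1

function aggregateScore(results: TestResult[]): number {
  let weightedSum = 0;
  let weightTotal = 0;
  for (const r of results) {
    const w = DIFFICULTY_WEIGHTS[r.difficulty];
    weightedSum += r.composite * w; // an Expert case counts 3x an Easy case
    weightTotal += w;
  }
  return weightTotal > 0 ? weightedSum / weightTotal : 0; // weighted mean, still 0-1
}
```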

5.5 Provider & Reproducibility

5.5.1 Why OpenRouter

OpenRouter provides unified API access to 300+ models from different providers (OpenAI, Anthropic, Google, Meta, Mistral, etc.). For an early-stage benchmark, this offers practical advantages:

  • Single API integration for all models
  • Consistent request/response format
  • Built-in cost tracking
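
For reference, a test-model request through OpenRouter's OpenAI-compatible chat completions endpoint looks roughly like the sketch below; the API key handling and model slug are placeholders, and retries are handled separately (Section 5.6):

```typescript
// Rough shape of a single test-model call through OpenRouter.
async function askModel(model: string, question: string): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model,                                           // e.g. "openai/gpt-4o-mini"
      messages: [{ role: "user", content: question }], // minimal, question-only prompt
      temperature: 0.7,
      max_tokens: 2000,
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```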

5.5.2 Provider Variance

Epoch AI's analysis found that the same model served by different providers can produce different scores. Causes include:

  • Quantization differences between providers
  • Different inference engines and hardware
  • Chat template handling differences
  • Token limit and rate limit differences

This means FaithBench scores reflect model performance as served through OpenRouter, which may differ from scores obtained through direct API access or other providers. Our scores should be interpreted with this uncertainty in mind.

5.5.3 Reproducibility Plan

When funded, we plan to:

  1. Validate a subset of model scores via direct provider APIs (OpenAI, Anthropic, Google)
  2. Quantify and report the magnitude of provider variance for FaithBench specifically
  3. Select the provider configuration that produces scores closest to direct-API baselines

5.6 Error Handling & Retry Policy

5.6.1 Test Model Execution

Test model API calls are managed by a workpool with the following configuration:

| Parameter | Value | Rationale |
|---|---|---|
| Max parallelism | 50 concurrent actions | Stays within Convex Starter tier limits |
| Retry attempts | 3 | Standard resilience for API unreliability |
| Initial backoff | 1,000 ms | Allows transient errors to clear |
| Backoff multiplier | Exponential: 1s → 2s → 4s | |
| Max backoff | 60,000 ms | Prevents excessive wait times |
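
A simplified sketch of this retry policy; in production the Convex workpool manages retries rather than a hand-rolled wrapper like this:

```typescript
// Simplified retry wrapper: 3 attempts, 1s initial backoff, exponential doubling, 60s cap.
async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let backoff = 1_000;
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= attempts) throw err;              // retries exhausted
      await new Promise((resolve) => setTimeout(resolve, backoff));
      backoff = Math.min(backoff * 2, 60_000);         // 1s -> 2s -> 4s ... capped at 60s
    }
  }
}
```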

5.6.2 Judge Model Execution

Judge model calls have an additional resilience layer:

  • If the primary judge (google/gemini-3-flash-preview) returns an error, the system automatically falls back to the secondary judge (openai/gpt-4o-mini)
  • The judge model used is recorded per test result for traceability
  • If both judges fail after retries, the test case is marked as failed and excluded from scoring

5.6.3 Error Classification

| Error Type | Retryable | Handling |
|---|---|---|
| Rate limit (429) | Yes | Exponential backoff, respect Retry-After header |
| Server error (5xx) | Yes | Exponential backoff |
| Timeout (408) | Yes | Exponential backoff |
| Authentication (401) | No | Fail immediately, flag for operator review |
| Content filtered (403) | No | Record as filtered, exclude from scoring |
| Model not found (404) | No | Fail immediately |
| Invalid request (400) | No | Fail immediately |
| Parse failure | No | Record raw response, mark as judge error |
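
The retryable/non-retryable split above reduces to a small classifier; this is an illustrative sketch, not the exact production function:

```typescript
// Illustrative mapping of HTTP status codes to the retry classification above.
function isRetryable(status: number): boolean {
  if (status === 429 || status === 408) return true; // rate limit, timeout
  if (status >= 500) return true;                    // server errors
  return false;                                      // 400, 401, 403, 404: fail immediately
}
```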

5.6.4 Failed Test Cases

Test cases that fail after all retries are:

  • Recorded with error metadata (error type, attempts, final error message)
  • Excluded from aggregate score calculations
  • Reported in run summary statistics so users can see the failure rate

5.7 Statistical Methods

5.7.1 Bootstrap Confidence Intervals

We report 95% confidence intervals using bootstrap resampling:

  • Iterations: 1,000 bootstrap samples
  • Method: Percentile method
  • Stratification: By dimension to ensure representative sampling

Confidence intervals are reported for:

  • Overall scores
  • Dimension-level scores
  • Tradition-specific scores
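
A minimal percentile-bootstrap sketch for a 95% CI on a model's mean composite score; stratification by dimension is omitted here for brevity:

```typescript
// Minimal percentile bootstrap (1,000 resamples) for a 95% CI on the mean.
function bootstrapCI(scores: number[], iterations = 1000): [number, number] {
  const means: number[] = [];
  for (let i = 0; i < iterations; i++) {
    let sum = 0;
    for (let j = 0; j < scores.length; j++) {
      sum += scores[Math.floor(Math.random() * scores.length)]; // resample with replacement
    }
    means.push(sum / scores.length);
  }
  means.sort((a, b) => a - b);
  return [means[Math.floor(0.025 * iterations)], means[Math.ceil(0.975 * iterations) - 1]];
}
```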

5.7.2 Inter-Rater Reliability

We will calculate and report (once human calibration is complete):

| Metric | Target | Interpretation |
|---|---|---|
| Krippendorff's α | ≥ 0.67 | Acceptable reliability (Krippendorff, 2004) |
| Cohen's κ | ≥ 0.61 | Substantial agreement (Landis & Koch, 1977) |
| % Exact Agreement | ≥ 70% | Practical reliability threshold |

IRR calculated between:

  • LLM judge and human expert panel
  • Multiple LLM judges (cross-validation)
  • Multiple human experts (gold standard)

5.7.3 Multiple Comparison Correction

When comparing multiple models:

  • Bonferroni correction for family-wise error rate
  • Report both corrected and uncorrected p-values
  • Effect sizes (Cohen's d) for practical significance
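
The Bonferroni step itself is a simple adjustment; a sketch with an illustrative function name:

```typescript
// Illustrative Bonferroni adjustment for a family of k model-comparison p-values.
function bonferroniAdjust(pValues: number[]): number[] {
  const k = pValues.length;
  return pValues.map((p) => Math.min(1, p * k)); // corrected p-values, capped at 1
}
```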

6. Validation

6.1 Human Expert Calibration

6.1.1 Expert Panel Composition

| Tradition | Minimum Experts | Qualifications |
|---|---|---|
| Reformed | 3 | M.Div.+ from confessional seminary |
| Catholic | 3 | Advanced degree in Catholic theology |
| Orthodox | 3 | Theological training in Orthodox tradition |
| Evangelical | 3 | Faculty/pastoral experience |
| Pentecostal (planned) | 2 | Academic credentials + tradition familiarity |

6.1.2 Calibration Protocol

  1. Gold standard creation: Experts independently score 50 responses
  2. Adjudication: Disagreements resolved through discussion
  3. IRR calculation: Krippendorff's α between experts
  4. Judge calibration: Compare LLM judge to expert majority vote
  5. Threshold: LLM judge must achieve α ≥ 0.67 vs. human panel

6.1.3 Calibration Dataset

| Characteristic | Requirement |
|---|---|
| Size | 50 responses minimum per dimension |
| Stratification | By difficulty, tradition, model |
| Selection | Random + edge cases identified in pilot |
| Refresh | 25% new cases each evaluation cycle |

6.2 Bias Analysis

6.2.1 Position Bias

Protocol:

  • Run all pairwise comparisons in both orders (A vs B, B vs A)
  • Calculate preference reversal rate
  • Target: < 10% reversal rate

6.2.2 Verbosity Bias

Protocol:

  • Calculate Pearson correlation between response length and score
  • Target: |r| < 0.3 (weak or no correlation)
  • If violated, implement length-normalization
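
A sketch of the length-versus-score correlation check (helper names and the final commented step are illustrative):

```typescript
// Pearson correlation between response length and composite score; target is |r| < 0.3.
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const meanX = x.reduce((a, b) => a + b, 0) / n;
  const meanY = y.reduce((a, b) => a + b, 0) / n;
  let cov = 0, varX = 0, varY = 0;
  for (let i = 0; i < n; i++) {
    cov += (x[i] - meanX) * (y[i] - meanY);
    varX += (x[i] - meanX) ** 2;
    varY += (y[i] - meanY) ** 2;
  }
  return cov / Math.sqrt(varX * varY);
}

// const r = pearson(responseLengths, compositeScores);
// If |r| >= 0.3, length-normalization would be triggered per the protocol above.
```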

6.2.3 Tradition Fairness

Protocol:

  • Cross-tradition grading: Reformed experts grade Catholic responses and vice versa
  • Measure systematic score differences
  • Target: No tradition systematically scored > 0.5 points different by out-group experts

6.3 Sensitivity Analysis

6.3.1 Weight Perturbation

Test robustness by:

  • Varying dimension weights ± 5%
  • Measuring rank-order stability across models
  • Reporting sensitivity coefficients
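
A sketch of one way to run this check, reading "± 5%" as relative nudges followed by renormalization (the exact perturbation scheme is a design choice, and these helper names are illustrative):

```typescript
// Illustrative rank-stability check under ±5% weight perturbation.
type Weights = Record<string, number>;

function ranking(modelDimScores: Record<string, Weights>, weights: Weights): string[] {
  return Object.entries(modelDimScores)
    .map(([model, dims]) => ({
      model,
      total: Object.entries(weights).reduce((sum, [dim, w]) => sum + (dims[dim] ?? 0) * w, 0),
    }))
    .sort((a, b) => b.total - a.total)
    .map((entry) => entry.model);
}

function perturbWeights(weights: Weights, delta = 0.05): Weights {
  const nudged = Object.fromEntries(
    Object.entries(weights).map(([dim, w]) => [dim, w * (1 + (Math.random() * 2 - 1) * delta)]),
  );
  const total = Object.values(nudged).reduce((a, b) => a + b, 0);
  return Object.fromEntries(Object.entries(nudged).map(([dim, w]) => [dim, w / total]));
}

// Compare ranking(scores, weights) to ranking(scores, perturbWeights(weights)) over many
// draws and report how often the model ordering changes.
```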

6.3.2 Judge Model Comparison

Cross-validate using:

  • Multiple judge models (Claude, GPT-5, Gemini)
  • Report agreement rates
  • Flag dimensions with high inter-judge variance

6.4 Validation Status (v1.0)

| Component | Status | Notes |
|---|---|---|
| Test cases | ✅ Complete | 413 cases across 6 dimensions |
| Difficulty calibration | ✅ Complete | Easy/Medium deactivated due to ceiling effects |
| LLM judge | ✅ Operational | Gemini 3 Flash (single judge; cross-validation planned) |
| Bootstrap CI | ✅ Operational | 1,000 iterations, 95% confidence |
| Human calibration | 🔄 In progress | Expert recruitment underway |
| IRR metrics | ⏳ Planned | Requires human calibration completion |
| Bias testing | ⏳ Planned | Tooling complete, awaiting scale data |
| Delphi weights | ⏳ Planned | Scheduled for v2.0 |

6.5 Maturity Classification

To clearly distinguish what is conceptual, implemented, and validated, we provide the following maturity classification:

| Component | Conceptual | Implemented | Validated |
|---|---|---|---|
| Construct definition (6 dimensions) | Yes | Yes | Pending expert review |
| Test case corpus (413 cases) | Yes | Yes | LLM-screened, not human-validated |
| Scoring rubrics with exemplars | Yes | Yes | Provisional (no IRR data) |
| LLM-as-judge scoring | Yes | Yes | Operational, not calibrated against humans |
| Bootstrap confidence intervals | Yes | Yes | Captures sampling uncertainty only |
| Difficulty weighting | Yes | Yes | Empirically adjusted, not IRT-calibrated |
| Dimension weights | Yes | Yes | Designer priors, not Delphi-validated |
| Human expert calibration | Yes | In progress | No |
| Cross-validation judge | Yes | No | No |
| IRT difficulty calibration | Yes | No | No |
| Delphi weight derivation | Yes | No | No |
| Bias testing (position/verbosity/tradition) | Yes | Tooling ready | No |

A reader should never have to guess whether a component of FaithBench is live, partial, or aspirational. This table is the definitive reference.


7. Traditions Evaluated

FaithBench v1.0 evaluates four traditions: Reformed, Catholic, Orthodox, and Evangelical (see Sections 6.1.1 and 13.3). The Reformed profile is shown below as a representative example.

Reformed

Doctrinal Framework: Covenant theology, TULIP, Westminster Standards

Key Competencies Tested:

  • Sola scriptura application
  • Predestination and election
  • Covenant of grace structure
  • Perseverance of the saints
  • Reformed hermeneutics

Authoritative Sources: Westminster Confession, Heidelberg Catechism, Canons of Dort, Three Forms of Unity


8. Current Limitations

8.1 Single LLM Judge Without Human Inter-Rater Reliability

All scoring in v1.0 is performed by a single LLM judge (Gemini 3 Flash Preview). No human inter-rater reliability (IRR) data has been collected yet. This means:

  • We cannot currently quantify how well our automated scores align with expert human judgment
  • The reliability of scores across dimensions is assumed but not empirically validated
  • Dimensions requiring deep theological nuance (e.g., doctrinal precision, hermeneutical reasoning) may be scored less reliably than more factual dimensions (e.g., historical accuracy)

Human expert calibration is in progress (see Section 6.1), but until IRR metrics are published, all scores should be treated as preliminary automated assessments.

8.2 Self-Preference Bias Unmitigated

Self-preference bias—where LLM judges rate outputs from their own model family more favorably—is documented in the literature at 5–15% inflation (see Section 5.3.2). In v1.0:

  • Gemini model scores may be inflated when judged by Gemini 3 Flash
  • No cross-validation judge is currently operational (GPT-5 cross-validation is planned)
  • We flag this limitation but do not yet adjust scores to compensate

Users should interpret Gemini-family model scores with particular caution until cross-validation is implemented.

8.3 Single Provider (OpenRouter)

All models are accessed through OpenRouter. Epoch AI documents that provider choice can significantly affect scores due to quantization, inference engine differences, and chat template handling (see Section 5.5). This means:

  • Our scores reflect performance as served through OpenRouter, not necessarily through direct provider APIs
  • The magnitude of provider variance for FaithBench specifically is unquantified
  • Scores are not directly comparable to evaluations run through other providers

8.4 Single-Run Scores

Each model is evaluated once per benchmark run. Epoch AI recommends 4–32 runs as standard practice. Single-run evaluation means:

  • Run-to-run variance is unquantified
  • Even with temperature 0.7 for test models and temperature 0 for the judge, some variance exists from provider-side factors
  • Our bootstrap CIs capture sampling uncertainty (which test cases) but not run-to-run uncertainty

We plan to implement multi-run averaging (minimum 4 runs per model) when funded.

8.5 Bootstrap CIs Do Not Account for Judge Error

Our bootstrap CIs (1,000 iterations, 95% confidence) quantify sampling uncertainty—the variability from which test cases a model happens to encounter. They do not account for:

  • Systematic judge error (if the LLM judge consistently misjudges a category)
  • Judge variance (if the same response would receive different scores on re-evaluation)
  • Construct validity uncertainty (whether our rubrics capture what we intend to measure)

The reported CIs are therefore narrower than the true uncertainty and should be read as a lower bound on the width of the actual interval.

8.6 Expert-Item Selection Circularity

Expert-level questions are pre-screened against frontier LLMs (Section 5.1.4): questions where any model scores >70% are rejected as too easy. This creates a selection effect:

  • The expert tier is defined partly by what current models get wrong
  • This may conflate "genuinely hard theological content" with "content that happens to confuse current LLMs"
  • As models improve, items may need recalibration—but the original selection bias persists in historical comparisons

We plan to address this with Item Response Theory (IRT) calibration in v2.0, which would provide model-independent difficulty estimates.

8.7 Held-Out Rotation Without Equating

Held-out test cases rotate semi-annually to prevent data contamination (Section 5.1.3). However:

  • No test equating procedure is currently applied across rotations
  • Scores from different rotation periods may not be directly comparable
  • A model's score could change between periods due to item difficulty shifts, not actual capability changes

Planned mitigation: anchor items (a fixed subset of questions present in all rotations) to enable cross-period equating.


9. Testing Configuration

9.1 Rationale: Controlled Comparison

Our testing configuration prioritizes controlled comparison:

| Parameter | Setting | Why This Setting | Real-World Difference |
|---|---|---|---|
| Temperature | 0.7 | Balances reproducibility with natural variation | Users may use different temperatures; some applications use 0 or 1.0 |
| Reasoning/Thinking | Disabled | Controls for reasoning capability differences; tests base knowledge | Many users enable reasoning modes; models with thinking enabled would likely score higher |
| System prompts | Minimal | Avoids biasing responses toward any tradition | Real applications often include system prompts that improve domain performance |
| Max tokens | 2,000 | Sufficient for theological responses | Some applications constrain or expand response length |

What this means for users: FaithBench scores represent a controlled baseline of model capability. Models with reasoning enabled, appropriate system prompts, or retrieval-augmented generation (RAG) may perform significantly better in practice. Our scores should not be interpreted as "what this model will do when you ask it a theological question"—they measure base theological knowledge under standardized conditions.

We plan to add "thinking-enabled" leaderboard variants to show the performance delta between base and reasoning-augmented configurations.


10. Public Data Access

10.1 What's Available

Signed-in users can view the following for each benchmark run:

  • Public test cases: The question prompts for the public 50% of the test set
  • Model responses: The full text each model generated
  • Judge reasoning: The LLM judge's dimension-by-dimension scores and justifications
  • Aggregate scores: Overall and per-dimension scores with confidence intervals

10.2 What's Held Back (and Why)

The held-out 50% of the test set is not publicly visible. This is standard practice in benchmarking:

  • MMLU maintains held-out test sets to prevent training data contamination
  • GPQA withholds questions to maintain benchmark validity
  • SWE-bench uses private test instances

If prompts and reference answers are fully public, model developers can (intentionally or inadvertently) train on them, rendering the benchmark meaningless. The public/held-out split balances transparency with benchmark integrity.

10.3 Academic Access

Researchers seeking access to the full test set for academic purposes may contact us at hello@faithbench.com. We will evaluate requests based on:

  • Affiliation with a recognized research institution
  • Clear research purpose
  • Agreement not to use test cases for model training

11. Benchmarking Is Hard

11.1 The State of AI Benchmarking

Epoch AI's "Why Benchmarking Is Hard" documents pervasive challenges that affect all AI benchmarks:

  • Prompt sensitivity: GPQA-Diamond scores range 74%–80% depending on prompt template alone
  • Provider variance: The same model served by different providers produces different scores
  • Temperature inconsistency: Different organizations use temperatures ranging from 0.0 to 1.0
  • LLM-as-judge effects: Judge model choice has "sizable impact" on results
  • Single-run insufficiency: Standard practice is 4–32 runs averaged; single runs are considered insufficient

As Epoch AI puts it: "Basically everyone is doing their own thing."

11.2 Where FaithBench Stands

FaithBench is an early-stage benchmark. We are not yet at the rigor level of established benchmarks like MMLU or GPQA, and we don't claim to be. Our primary sources of uncertainty are:

  1. Judge variance: Single LLM judge, potential self-preference bias
  2. Provider variance: OpenRouter-mediated access, unquantified provider effects
  3. Run-to-run variance: Single-run evaluation
  4. Validation gap: No human IRR data yet

What we have done is make all of this transparent—including our exact judge prompts, rubric weights, model configs, and error handling. We believe transparency about limitations is more valuable than hiding them behind confident-sounding scores.

11.3 Our Hardening Roadmap

We have a phased plan to address these limitations:

  • Phase 1 (current): Full transparency and honest disclosure of all artifacts and limitations
  • Phase 2 (~$500 funding): Multi-judge evaluation, multi-run averaging, provider diversification, prompt sensitivity analysis
  • Phase 3 (~$2,000+ funding): Human inter-rater reliability, IRT calibration, bias auditing, full reproducibility package

Phase 1 makes our limitations transparent. Phase 2 quantifies them. Phase 3 mitigates them.


12. Development Status

The following table summarizes what is operational in v1.0 versus what is planned for future versions:

| Component | Status | Version | Notes |
|---|---|---|---|
| Test case corpus (413 cases) | ✅ Done | v1.0 | 6 dimensions, 4 difficulty levels |
| Difficulty calibration | ✅ Done | v1.0 | Easy/Medium deactivated due to ceiling effects |
| LLM-as-judge scoring | ✅ Done | v1.0 | Single judge (Gemini 3 Flash Preview) |
| Bootstrap confidence intervals | ✅ Done | v1.0 | 1,000 iterations, 95% CI |
| Scoring rubrics with exemplars | ✅ Done | v1.0 | All 6 dimensions documented |
| Judge prompt published | ✅ Done | v1.0 | Exact system prompt on this page |
| Rubric weights published | ✅ Done | v1.0 | All sub-dimension weights on this page |
| Model execution config published | ✅ Done | v1.0 | Temperature, max tokens, provider |
| Error handling documented | ✅ Done | v1.0 | Retry policy, fallback judge, error classification |
| Provider limitation disclosed | ✅ Done | v1.0 | OpenRouter-only, Epoch AI citation |
| Scoring calculation documented | ✅ Done | v1.0 | Normalization, weighting, difficulty multipliers |
| Hardening roadmap published | ✅ Done | v1.0 | Phased plan with funding requirements |
| Open evaluation code | ✅ Done | v1.0 | MIT license |
| Cross-validation judge | ⏳ Planned | v2.0 | GPT-5 as secondary judge |
| Multi-run averaging | ⏳ Planned | v2.0 | Minimum 4 runs per model |
| Provider diversification | ⏳ Planned | v2.0 | Direct API validation for subset |
| Human expert calibration | 🔄 In progress | v2.0 | Expert recruitment underway |
| Inter-rater reliability metrics | ⏳ Planned | v2.0 | Requires human calibration |
| Position/verbosity/tradition bias testing | ⏳ Planned | v2.0 | Tooling complete, awaiting scale data |
| Delphi expert weight derivation | ⏳ Planned | v2.0 | 5+ theologians per tradition |
| Item Response Theory calibration | ⏳ Planned | v2.0 | Model-independent difficulty estimates |
| Test equating across rotations | ⏳ Planned | v2.0 | Anchor items for cross-period comparison |
| Thinking-enabled leaderboard | ⏳ Planned | v2.0 | Reasoning mode performance delta |
| Non-Western tradition expansion | ⏳ Planned | v2.0+ | African, Asian, Latin American frameworks |
| Multi-lingual evaluation | ⏳ Planned | v2.0+ | German, Spanish, Korean theological discourse |
| Pentecostal tradition | 🔄 In progress | v2.0 | Expert recruitment and test case development |

13. Tradition Scope and Definitions

13.1 Evangelical Tradition Boundaries

FaithBench's "Evangelical" tradition category follows the Bebbington Quadrilateral (Bebbington, 1989), which defines Evangelicalism by four characteristics:

  1. Biblicism: High regard for biblical authority (operationalized via the Chicago Statement on Biblical Inerrancy)
  2. Crucicentrism: Focus on Christ's atoning work on the cross (substitutionary atonement emphasis)
  3. Conversionism: Emphasis on personal conversion experience ("born again")
  4. Activism: Commitment to evangelism and missionary effort (Great Commission emphasis)

What is included: Conservative evangelical positions on Scripture, atonement, conversion, and mission as represented in the Chicago Statement, evangelical systematic theologies (e.g., Grudem, Erickson), and major evangelical confessional documents.

What is excluded: Mainline Protestant positions, progressive evangelical positions, and prosperity gospel teachings. These may be evaluated under separate tradition categories in future versions.

Relationship to Reformed: There is significant overlap between Evangelical and Reformed categories. The distinction: Reformed questions test confessional precision (Westminster Standards, covenant theology, TULIP) while Evangelical questions test broader evangelical distinctives that cross confessional lines.

13.2 Operational Definitions

"Google-Proof" Questions

A question is "Google-proof" when it cannot be answered correctly by performing a single web search and reading the top results. Operationally, this means:

  • The answer requires synthesizing information from 3+ distinct sources that do not appear together on any single web page
  • Simple keyword searches return partial or misleading information
  • Correct answers require domain expertise to evaluate and integrate conflicting sources

This is modeled on the GPQA benchmark (Rein et al., 2023), which demonstrated that non-expert validators with unrestricted web access still scored only 34% on questions experts answered at 65%.

"Multi-Hop" Reasoning

A question requires "multi-hop" reasoning when answering correctly requires:

  1. Retrieving multiple distinct facts from different areas of theological knowledge
  2. Connecting those facts through logical or theological reasoning
  3. Synthesizing a conclusion that is not explicitly stated in any single source

Example: "Compare Athanasius's and Arius's interpretations of Colossians 1:15, explain the grammatical argument for each reading, and trace how this debate influenced the Nicene formulation."

This requires: (1) knowledge of the Arian controversy, (2) Greek grammar of the genitive construction, (3) patristic textual arguments, and (4) conciliar history—four distinct knowledge domains integrated into a single answer.

13.3 Tradition Category Asymmetry

The four tradition categories in v1.0 are not perfectly parallel analytic units:

| Tradition | Category Type | Scope |
|---|---|---|
| Catholic | Ecclesial tradition | Global communion with defined magisterial authority |
| Orthodox | Ecclesial tradition | Family of autocephalous churches with shared liturgical and patristic heritage |
| Reformed | Confessional tradition | Defined by specific confessional documents (Westminster, Dort, Belgic) |
| Evangelical | Sociological-theological movement | Bounded modern movement with fuzzy edges, defined operationally via Bebbington Quadrilateral |

This asymmetry is acknowledged and pragmatic for v1.0. Catholic and Orthodox represent broad ecclesial bodies; Reformed represents a confessional position within Protestantism; Evangelical represents a cross-denominational movement. These are not co-equal analytic units in a strict taxonomic sense.

We chose these four because they represent the major evaluative frameworks within Western Christianity where theological accuracy can be meaningfully assessed against identifiable standards. Future versions may refine this taxonomy as expert input and empirical results reveal where the current categories conflate meaningfully distinct positions or where finer distinctions are needed.


14. Scope and Limitations

14.1 Scope Limitations

What FaithBench measures:

  • Factual accuracy about theological positions
  • Reasoning within established frameworks
  • Faithful representation of denominational views
  • Biblical language competency

What FaithBench does not measure:

  • Spiritual edification value
  • Pastoral appropriateness
  • Alignment with any tradition's values
  • Suitability for worship or counseling

14.2 Methodological Limitations

| Limitation | Impact | Mitigation |
|---|---|---|
| Western Christian focus | Non-Western traditions underrepresented | Planned expansion in v2.0 |
| English-language evaluation | May miss non-English theological nuance | Original language competency tested separately |
| LLM judge variability | Scoring inconsistency possible | IRR monitoring, human calibration |
| Weight subjectivity | Dimension weights influence rankings | Sensitivity analysis, expert Delphi planned |
| Sample size | Statistical power limits | Bootstrap CI to quantify uncertainty |
| Expert question coverage | Initial set focuses on Western Christian traditions | Expand to cover more traditions in v2.0 |
| Reasoning/thinking disabled | Reasoning models may underperform vs. their potential | Planned thinking-enabled leaderboard variants |

14.3 Known Biases

| Bias Type | Current Status | Planned Mitigation |
|---|---|---|
| Position bias | Untested | Position reversal protocol |
| Verbosity bias | Untested | Length correlation analysis |
| Tradition bias | Untested | Cross-tradition expert grading |
| Difficulty confounding | Partial mitigation | IRT difficulty calibration |

14.4 Generalizability

Results may not generalize to:

  • Non-Christian religious traditions
  • Non-academic theological contexts
  • Languages other than English
  • Highly specialized sub-fields (e.g., Syriac patristics)

15. Technical Configuration Reference

This section provides a complete reference of all production configuration values. These are the exact values used in scoring—not aspirational or planned values.

15.1 Judge Configuration

| Parameter | Value |
|---|---|
| Primary judge model | google/gemini-3-flash-preview |
| Fallback judge model | openai/gpt-4o-mini |
| Judge temperature | 0 |
| Judge max tokens | 16,000 |
| Output format | JSON (response_format: { type: "json_object" }) |
| Fallback trigger | Primary judge HTTP error |

15.2 Test Model Configuration

| Parameter | Value |
|---|---|
| Temperature | 0.7 |
| Max tokens | 2,000 |
| Provider | OpenRouter (https://openrouter.ai/api/v1/chat/completions) |
| System prompt | Minimal (question-only) |

15.3 Execution Configuration

| Parameter | Value |
|---|---|
| Max parallelism | 50 concurrent actions |
| Retry attempts | 3 |
| Initial backoff | 1,000 ms |
| Backoff multiplier | 2× (1s → 2s → 4s) |
| Max backoff | 60,000 ms |

15.4 Statistical Configuration

| Parameter | Value |
|---|---|
| Bootstrap iterations | 1,000 |
| Confidence level | 95% |
| Difficulty weights | Easy: 1.0, Medium: 1.5, Hard: 2.0, Expert: 3.0 |

15.5 Scoring Rubric Weights (Production Values)

These are the exact sub-dimension weights from packages/convex/convex/scoring.ts:

Textual Analysis: lexicalAccuracy (35%), translationFidelity (35%), linguisticReasoning (20%), sourceHandling (10%)

Hermeneutical Reasoning: interpretiveMethod (35%), genreAwareness (25%), contextualAnalysis (25%), canonicalIntegration (15%)

Doctrinal Precision: doctrinalAccuracy (35%), traditionFidelity (35%), nuanceRecognition (20%), sourceGrounding (10%)

Historical Theology: historicalAccuracy (40%), developmentAwareness (30%), patristicKnowledge (20%), historiographicalMethod (10%)

Apologetics: logicalValidity (35%), evidenceUsage (30%), objectionHandling (25%), persuasiveClarity (10%)

Intertextual Reasoning: crossReferenceAccuracy (40%), typologicalRecognition (30%), allusionDetection (20%), thematicIntegration (10%)


16. Reproducibility Statement

16.1 Open Resources

| Resource | Availability |
|---|---|
| Public test cases (50%) | Viewable by signed-in users |
| Held-out test cases (50%) | Available to researchers on request (hello@faithbench.com) |
| Evaluation code | MIT license |
| Judge prompts | Published on this page (Section 5.3.3) |
| Scoring rubrics | Published on this page (Sections 4.4, 15.5) |
| Model configurations | Published on this page (Section 15) |
| Hardening roadmap | Published in repository (docs/roadmap-benchmarking-hardening.md) |

16.2 Version Control

  • Evaluation methodology versioned (current: v1.0)
  • Test case sets versioned with changelogs
  • Results include methodology version for reproducibility

17. Future Work

  1. Multi-judge evaluation: Add GPT-5 as secondary judge, report inter-judge agreement
  2. Multi-run averaging: 4+ runs per model to quantify run-to-run variance
  3. Provider diversification: Validate scores via direct provider APIs
  4. Human IRR: Krippendorff's α between LLM judge and theologian panel
  5. Delphi weight derivation: Formal expert consensus methodology
  6. Non-Western traditions: African, Asian, Latin American theological frameworks
  7. Item response theory: Difficulty calibration using IRT models
  8. Multi-lingual evaluation: German, Spanish, Korean theological discourse
  9. Fine-grained sub-traditions: Distinguish within broad categories (e.g., Old Princeton vs. Dutch Reformed)
  10. Temporal analysis: Track model improvement over time
  11. Thinking-enabled leaderboard: Show performance delta with reasoning modes

References

Abid, A., Farooqi, M., & Zou, J. (2021). Persistent anti-Muslim bias in large language models. AIES.

Bebbington, D. W. (1989). Evangelicalism in Modern Britain: A History from the 1730s to the 1980s. Unwin Hyman.

Bowman, S. R., & Dahl, G. E. (2021). What will it take to fix benchmarking in NLP? NAACL.

Epoch AI. (2025). Why benchmarking is hard. Epoch AI Substack. https://epochai.substack.com/p/why-benchmarking-is-hard

Gwet, K. L. (2014). Handbook of Inter-Rater Reliability (4th ed.). Advanced Analytics.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). Measuring massive multitask language understanding. ICLR.

Krippendorff, K. (2004). Content Analysis: An Introduction to Its Methodology (2nd ed.). Sage.

Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.

Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., ... & Koreeda, Y. (2023). Holistic evaluation of language models. TMLR.

Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., & Zhu, C. (2024). G-Eval: NLG evaluation using GPT-4 with better human alignment. EMNLP.

Raji, I. D., Bender, E. M., Paullada, A., Denton, E., & Hanna, A. (2021). AI and the everything in the whole wide world benchmark. NeurIPS Datasets and Benchmarks.

Rein, D., Hou, B., Stickland, A., Petty, J., Pang, R. Y., Dirani, J., ... & Bowman, S. R. (2023). GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint. https://arxiv.org/abs/2311.12022

Zheng, L., Chiang, W. L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., ... & Stoica, I. (2024). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. NeurIPS.


Citation

@misc{faithbench2026,
  title={FaithBench: Toward Tradition-Aware Evaluation of Theological
         Faithfulness in Large Language Models},
  author={FaithBench Team},
  year={2026},
  url={https://faithbench.com},
  note={Methodology v1.0}
}

Contributing

We welcome contributions from:

  • Theologians: Test case development, rubric refinement, expert calibration
  • AI Researchers: Evaluation methodology, statistical analysis, bias testing
  • Practitioners: Real-world use case scenarios, pastoral context feedback

Contact: hello@faithbench.com