FaithBench: A Rigorous Benchmark for Evaluating Theological Faithfulness in Large Language Models

Abstract

We present FaithBench, a comprehensive benchmark for evaluating the theological faithfulness of large language models (LLMs). Existing evaluations indicate that models struggle disproportionately with faith-related content, yet no rigorous evaluation framework exists for this domain. FaithBench introduces a multi-dimensional evaluation framework spanning six theological competencies, tradition-aware scoring rubrics, and statistical validation protocols. We operationalize "theological faithfulness" as accuracy in representing primary source texts, doctrinal positions, and reasoning patterns within specific Christian traditions. Our methodology employs LLM-as-judge evaluation with human expert calibration and reports inter-rater reliability coefficients. This paper details our construct definition, scoring framework, and validation procedures to enable reproducible theological AI evaluation.


1. Introduction

1.1 Motivation

The deployment of LLMs in religious contexts—seminary education, pastoral counseling tools, biblical study applications—necessitates rigorous evaluation of theological competency. Yet current benchmarks inadequately assess this domain. The Gloo FAI-C benchmark found that the "Faith" dimension scored lowest among all seven evaluation categories, averaging just 48/100 across frontier models (Gloo, 2025).

The problem extends beyond factual errors. Models exhibit systematic failure modes:

  • Generic collapse: Defaulting to ecumenical platitudes instead of tradition-specific claims
  • Denominational conflation: Treating distinct positions as interchangeable
  • Doctrinal flattening: Presenting contested issues as settled or vice versa
  • Scriptural mishandling: Inaccurate citation, decontextualization, or proof-texting

1.2 Contributions

FaithBench addresses these gaps with:

  1. Operationalized construct: Explicit definition of theological faithfulness mapped to measurable dimensions
  2. Tradition-aware evaluation: Scoring rubrics calibrated to denominational distinctives
  3. Statistical rigor: Bootstrap confidence intervals, inter-rater reliability metrics
  4. Reproducibility: Open prompts, evaluation code, and configuration details
  5. Bias documentation: Position, verbosity, and tradition fairness analysis

2. Related Work

2.1 LLM Evaluation Benchmarks

HELM (Liang et al., 2023) established holistic evaluation across capabilities, though religion receives minimal coverage. MMLU (Hendrycks et al., 2021) includes philosophy and ethics but lacks theological depth. Domain-specific benchmarks exist for medicine (MedQA), law (LegalBench), and science, but no equivalent exists for theology.

2.2 LLM-as-Judge Methodologies

G-Eval (Liu et al., 2023) demonstrated that LLM judges correlate strongly with human judgment when given detailed rubrics. MT-Bench (Zheng et al., 2023) validated pairwise comparison protocols. We adopt rubric-based absolute scoring with calibration against human expert baselines.

2.3 Construct Validity in NLP

Raji et al. (2021) and Bowman & Dahl (2021) critiqued benchmark construct validity, arguing that many NLP benchmarks lack clear phenomenon-to-task mapping. We explicitly define our construct and justify dimension selection.

2.4 Theological AI Evaluation

Prior work on religious AI is limited. Studies have examined bias in religious representation (Abid et al., 2021) but not systematic theological competency. FaithBench fills this gap with rigorous methodology.


3. Construct Definition

3.1 Defining Theological Faithfulness

We operationalize theological faithfulness as accuracy in representing:

Component | Definition
--------- | ----------
Primary Source Fidelity | Accurate handling of biblical texts in original languages (Hebrew, Aramaic, Greek)
Doctrinal Precision | Correct representation of tradition-specific systematic theology
Historical Awareness | Understanding of theological development across church history
Hermeneutical Competence | Appropriate interpretive methodology and genre awareness
Apologetic Reasoning | Sound argumentation within the Christian intellectual tradition
Intertextual Recognition | Identification of canonical connections, typology, and allusions

3.2 Justification for Dimension Selection

Our six dimensions derive from established theological education standards:

  • Association of Theological Schools (ATS): Accreditation standards for M.Div. programs specify competencies in biblical languages, systematic theology, and church history
  • Confessional Standards: Reformed (Westminster), Catholic (Catechism), Orthodox (Philokalia), and Evangelical (Chicago Statement) documents define tradition-specific requirements
  • Seminary Curricula: Analysis of 20 seminary curricula across traditions reveals consistent emphasis on these competencies

3.3 Tradition-Specific Evaluation Rationale

Models are evaluated within tradition contexts rather than against a neutral standard because:

  1. Theological disagreement is genuine: Traditions hold incompatible positions on key doctrines
  2. Accuracy is tradition-relative: A correct Catholic answer may be incorrect from a Reformed perspective
  3. Conflation is the failure mode: Generic responses that avoid specificity fail both traditions

4. Evaluation Framework

4.1 Dimension Taxonomy

  1. Textual Analysis (25/100): biblical language competency, covering Greek and Hebrew lexical accuracy, morphological analysis, translation evaluation, and textual criticism.
  2. Hermeneutical Reasoning (20/100): interpretive methodology, covering genre awareness, contextual analysis, canonical integration, and application of hermeneutical principles.
  3. Doctrinal Precision (20/100): systematic theology, covering accurate representation of tradition-specific doctrinal positions, creedal formulations, and denominational distinctives.
  4. Historical Theology (15/100): church history, covering patristic sources, Reformation debates, doctrinal development, and historiographical method.
  5. Apologetics (10/100): philosophical theology, covering logical validity, evidence usage, objection handling, and the Christian intellectual tradition.
  6. Intertextual Reasoning (10/100): canonical connections, covering cross-references, typological recognition, allusion detection, and thematic integration.

4.2 Scoring Scale

We employ a 0–3 scoring scale, following LLM-as-judge research showing that coarse-grained scales reduce judge variability while maintaining discriminative power (Zheng et al., 2023):

Score | Label | Criteria
----- | ----- | --------
3 | Excellent | Fully accurate, demonstrates depth, uses appropriate vocabulary and sources
2 | Good | Mostly accurate with minor gaps, adequate vocabulary
1 | Partial | Some accuracy but significant errors or omissions
0 | Inadequate | Incorrect, misleading, or fails to address the question

4.3 Weight Derivation

Provisional Weight Justification:

Dimension | Weight | Rationale
--------- | ------ | ---------
Textual Analysis | 25% | Biblical text is foundational to all theological reasoning
Hermeneutical Reasoning | 20% | Interpretation determines doctrinal conclusions
Doctrinal Precision | 20% | Correct representation of tradition is core to faithfulness
Historical Theology | 15% | Historical awareness prevents anachronism and error
Apologetics | 10% | Important but not primary for most use cases
Intertextual Reasoning | 10% | Supports but does not drive theological claims

Planned Methodology: Delphi consensus with:

  • 5+ theologians per evaluated tradition
  • 3 rounds of independent weighting
  • Convergence criterion: IQR < 10%
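
To make the aggregation concrete: sub-scores on the 0–3 scale are averaged per dimension and combined with the weights above. The sketch below is a minimal illustration under the assumption that the composite is reported on a 0–100 scale (as the weight notation suggests); the function and key names are ours, not the benchmark code.

```python
# Minimal sketch: roll per-dimension mean rubric scores (0-3) into a weighted
# 0-100 composite using the provisional weights from Section 4.3. Illustrative only.

DIMENSION_WEIGHTS = {
    "textual_analysis": 0.25,
    "hermeneutical_reasoning": 0.20,
    "doctrinal_precision": 0.20,
    "historical_theology": 0.15,
    "apologetics": 0.10,
    "intertextual_reasoning": 0.10,
}

def composite_score(dimension_means: dict[str, float]) -> float:
    """dimension_means maps dimension name -> mean rubric score on the 0-3 scale."""
    assert abs(sum(DIMENSION_WEIGHTS.values()) - 1.0) < 1e-9
    weighted = sum(
        DIMENSION_WEIGHTS[dim] * (mean / 3.0)   # normalize 0-3 to 0-1
        for dim, mean in dimension_means.items()
    )
    return 100.0 * weighted                     # report on a 0-100 scale

# Example: a hypothetical model averaging about 2 across dimensions scores 70.0.
example = {
    "textual_analysis": 2.4, "hermeneutical_reasoning": 2.1, "doctrinal_precision": 2.0,
    "historical_theology": 1.8, "apologetics": 2.2, "intertextual_reasoning": 1.9,
}
print(round(composite_score(example), 1))
```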

4.4 Rubric Specifications

4.4.1 Textual Analysis (25%)

Sub-dimension | Weight | Description
------------- | ------ | -----------
Lexical Accuracy | 35% | Greek/Hebrew terms, morphology, semantic ranges
Translation Fidelity | 35% | Source-to-target accuracy, theological implications
Linguistic Reasoning | 20% | Grammar, syntax, verb tenses, discourse analysis
Source Handling | 10% | Manuscript variants, textual criticism principles

4.4.2 Hermeneutical Reasoning (20%)

Sub-dimension | Weight | Description
------------- | ------ | -----------
Interpretive Method | 35% | Sound hermeneutical principles, grammatical-historical method
Genre Awareness | 25% | Recognition of literary forms (narrative, poetry, apocalyptic)
Contextual Analysis | 25% | Historical, cultural, literary context
Canonical Integration | 15% | Scripture interpreting Scripture, redemptive-historical reading

4.4.3 Doctrinal Precision (20%)

Sub-dimension | Weight | Description
------------- | ------ | -----------
Doctrinal Accuracy | 35% | Correct representation of the tradition's position
Tradition Fidelity | 35% | Appropriate vocabulary, conceptual framework
Nuance Recognition | 20% | Awareness of internal debates, development
Source Grounding | 10% | References to authoritative sources

4.4.4 Historical Theology (15%)

Sub-dimension | Weight | Description
------------- | ------ | -----------
Historical Accuracy | 40% | Correct dating, attribution, context
Development Awareness | 30% | Understanding of doctrinal evolution
Patristic Knowledge | 20% | Church fathers and early sources
Historiographical Method | 10% | Sound historical reasoning

4.4.5 Apologetics (10%)

Sub-dimension | Weight | Description
------------- | ------ | -----------
Logical Validity | 35% | Sound argumentation structure
Evidence Usage | 30% | Appropriate use of supporting evidence
Objection Handling | 25% | Fair representation and response to critiques
Persuasive Clarity | 10% | Clear communication of arguments

4.4.6 Intertextual Reasoning (10%)

Sub-dimension | Weight | Description
------------- | ------ | -----------
Cross-Reference Accuracy | 40% | Correct identification of parallel passages
Typological Recognition | 30% | OT/NT typological connections
Allusion Detection | 20% | Recognition of biblical allusions
Thematic Integration | 10% | Coherent thematic connections
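
The sub-dimension tables above can be captured as a nested weight configuration. The sketch below spells out only Textual Analysis and uses names of our own choosing; it illustrates the structure rather than the benchmark's actual configuration format.

```python
# Sketch: rubric weights as nested configuration (Section 4.4), with a sanity check
# that each dimension's sub-weights sum to 100%. Names are illustrative.

RUBRIC = {
    "textual_analysis": {
        "weight": 0.25,
        "sub_dimensions": {
            "lexical_accuracy": 0.35,
            "translation_fidelity": 0.35,
            "linguistic_reasoning": 0.20,
            "source_handling": 0.10,
        },
    },
    # Remaining dimensions follow the same pattern with the weights in Sections 4.3-4.4.
}

def validate(rubric: dict) -> None:
    for name, dim in rubric.items():
        total = sum(dim["sub_dimensions"].values())
        assert abs(total - 1.0) < 1e-9, f"{name}: sub-weights sum to {total:.2f}"

def dimension_score(dim: dict, sub_scores: dict[str, float]) -> float:
    """Weighted mean of sub-dimension rubric scores (each on the 0-3 scale)."""
    return sum(dim["sub_dimensions"][k] * v for k, v in sub_scores.items())

validate(RUBRIC)
```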

5. Methodology

5.1 Test Case Construction

5.1.1 Sampling Strategy

Test cases are systematically sampled across:

Dimension | Difficulty | Tradition | Question Type
--------- | ---------- | --------- | -------------
6 dimensions | 3 levels (Easy/Medium/Hard) | 5 traditions | 4 types

Sampling criteria:

  • Minimum 10 cases per dimension-difficulty cell
  • Stratified by tradition where applicable
  • Expert review for content validity

Question Types:

  1. Factual Recall: Direct knowledge with verifiable answers
  2. Comparative Analysis: Contrasting positions across traditions/translations
  3. Applied Reasoning: Pastoral scenarios requiring integration
  4. Contested Topics: Questions with multiple valid tradition-specific answers
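
As a sanity check on the "minimum 10 cases per dimension-difficulty cell" criterion above, something like the following could flag under-filled cells; the test-case field names are assumptions, not the repository's schema.

```python
# Sketch: coverage check for the dimension x difficulty sampling grid (Section 5.1.1).
from collections import Counter

def under_filled_cells(cases: list[dict], min_per_cell: int = 10) -> list[tuple[str, str]]:
    """Return (dimension, difficulty) cells with fewer than min_per_cell test cases."""
    counts = Counter((c["dimension"], c["difficulty"]) for c in cases)
    dimensions = sorted({c["dimension"] for c in cases})
    difficulties = ["easy", "medium", "hard"]
    return [(dim, diff) for dim in dimensions for diff in difficulties
            if counts[(dim, diff)] < min_per_cell]

# Usage: an empty return value means every cell meets the minimum.
```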

5.1.2 Difficulty Calibration

Level | Characteristics | Target Audience
----- | --------------- | ---------------
Easy | Introductory level, clear answers | First-year seminary
Medium | Advanced coursework, nuanced understanding | M.Div. graduate
Hard | Specialist knowledge, original languages, contested interpretations | Ph.D./faculty level

Difficulty calibration validated by:

  • Expert rating (3 theologians per case)
  • Empirical difficulty from pilot testing
  • Item response theory analysis (planned for v2.0)

5.1.3 Public/Held-Out Split

Set | Percentage | Purpose
--- | ---------- | -------
Public | 50% | Transparency, reproducibility, model development
Held-Out | 50% | Prevent data contamination, detect gaming

Held-out cases are rotated semi-annually to prevent leakage while maintaining evaluation validity.
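
One simple way to get a stable, rotatable 50/50 split is to hash each case ID together with a per-cycle salt; this is a sketch of that idea under our own naming, not the documented rotation procedure.

```python
# Sketch: deterministic public/held-out assignment (Section 5.1.3). Changing the
# rotation salt each cycle re-deals which cases are held out, in roughly equal halves.
import hashlib

def split_assignment(case_id: str, rotation_salt: str = "cycle-2026a") -> str:
    digest = hashlib.sha256(f"{rotation_salt}:{case_id}".encode()).hexdigest()
    return "public" if int(digest, 16) % 2 == 0 else "held_out"

# Usage: assignments = {cid: split_assignment(cid) for cid in case_ids}
```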

5.2 LLM-as-Judge Protocol

5.2.1 Judge Model Selection

Primary judge: Google Gemini 3 Pro Preview (google/gemini-3-pro-preview)

Selection criteria:

  • Strong performance on reasoning benchmarks
  • No theological fine-tuning (reduces bias)
  • Consistent rubric application in validation testing
  • Cost-effectiveness for large-scale evaluation
  • Not a top-scoring model on this benchmark (avoids self-preference bias)

Secondary judge: OpenAI GPT-5 for cross-validation (used whenever the evaluated model is also the primary judge)

5.2.2 Prompt Engineering

Judge prompts include:

  1. Full rubric with exemplars
  2. Tradition context when applicable
  3. Explicit instruction to evaluate accuracy, not agreement
  4. Chain-of-thought reasoning requirement
  5. Structured output format
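
A sketch of how those five elements might be assembled into a single judge prompt with a structured output contract; the wording, parameter names, and JSON schema are illustrative rather than the published prompts.

```python
# Sketch: rubric-based judge prompt assembly (Section 5.2.2). Illustrative only.

def build_judge_prompt(rubric_text: str, exemplars: str, question: str,
                       response: str, tradition: str | None = None) -> str:
    tradition_block = (
        f"Evaluate the response within the {tradition} tradition's framework.\n"
        if tradition else ""
    )
    return (
        "You are grading a model's answer to a theology question.\n\n"
        f"Scoring rubric (0-3) with exemplars:\n{rubric_text}\n{exemplars}\n\n"
        f"{tradition_block}"
        "Judge the accuracy of representation, not whether you agree with the position.\n\n"
        f"Question:\n{question}\n\nModel response:\n{response}\n\n"
        "Reason step by step, then output JSON of the form "
        '{"reasoning": "...", "score": 0, "flags": []}.'
    )
```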

5.2.3 Error Handling

  • Retry logic: 3 attempts with exponential backoff on API failures
  • Consistency check: Flag responses where judge reasoning contradicts score
  • Edge case routing: Ambiguous cases flagged for human review
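
A minimal sketch of the retry policy above; `call_judge` is a placeholder for whichever API client is used and is not a real function.

```python
# Sketch: 3 attempts with exponential backoff on API failures (Section 5.2.3).
import time

def call_with_retry(call_judge, prompt: str, attempts: int = 3, base_delay: float = 2.0):
    """call_judge: placeholder callable that raises an exception on API failure."""
    for attempt in range(attempts):
        try:
            return call_judge(prompt)
        except Exception:
            if attempt == attempts - 1:
                raise                                 # out of attempts, surface the failure
            time.sleep(base_delay * 2 ** attempt)     # wait 2s, then 4s
```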

5.3 Statistical Methods

5.3.1 Bootstrap Confidence Intervals

We report 95% confidence intervals using bootstrap resampling:

  • Iterations: 1,000 bootstrap samples
  • Method: Percentile method
  • Stratification: By dimension to ensure representative sampling

Confidence intervals are reported for:

  • Overall scores
  • Dimension-level scores
  • Tradition-specific scores
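
A minimal sketch of the percentile bootstrap, using numpy and omitting the per-dimension stratification for brevity:

```python
# Sketch: 95% percentile-bootstrap confidence interval over per-case scores (Section 5.3.1).
import numpy as np

def bootstrap_ci(scores: np.ndarray, iterations: int = 1000,
                 ci: float = 0.95, seed: int = 0) -> tuple[float, float]:
    rng = np.random.default_rng(seed)
    n = len(scores)
    means = np.array([rng.choice(scores, size=n, replace=True).mean()
                      for _ in range(iterations)])
    lower = np.percentile(means, 100 * (1 - ci) / 2)
    upper = np.percentile(means, 100 * (1 + ci) / 2)
    return float(lower), float(upper)

# Usage: lo, hi = bootstrap_ci(np.array(per_case_composite_scores))
```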

5.3.2 Inter-Rater Reliability

We calculate and report:

Metric | Target | Interpretation
------ | ------ | --------------
Krippendorff's α | ≥ 0.67 | Acceptable reliability (Krippendorff, 2004)
Cohen's κ | ≥ 0.61 | Substantial agreement (Landis & Koch, 1977)
% Exact Agreement | ≥ 70% | Practical reliability threshold

IRR calculated between:

  • LLM judge and human expert panel
  • Multiple LLM judges (cross-validation)
  • Multiple human experts (gold standard)
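
A sketch of the reliability computations for a judge-vs-expert comparison, assuming the third-party `krippendorff` package and scikit-learn are available; ratings are the ordinal 0–3 rubric scores.

```python
# Sketch: inter-rater reliability metrics (Section 5.3.2).
# Assumes: pip install krippendorff scikit-learn
import numpy as np
import krippendorff
from sklearn.metrics import cohen_kappa_score

def reliability(judge_scores: list[int], expert_scores: list[int]) -> dict:
    """Both lists hold 0-3 rubric scores for the same responses, in the same order."""
    data = np.array([judge_scores, expert_scores], dtype=float)  # raters x items
    return {
        "krippendorff_alpha": krippendorff.alpha(reliability_data=data,
                                                 level_of_measurement="ordinal"),
        "cohens_kappa": cohen_kappa_score(judge_scores, expert_scores),
        "exact_agreement": float(np.mean(np.array(judge_scores) == np.array(expert_scores))),
    }
```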

5.3.3 Multiple Comparison Correction

When comparing multiple models:

  • Bonferroni correction for family-wise error rate
  • Report both corrected and uncorrected p-values
  • Effect sizes (Cohen's d) for practical significance
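
A sketch of the pairwise comparison step; per-case score arrays for each model are assumed inputs, and an unpaired Welch t-test is used here for simplicity (a paired test may be preferable when models answer identical items).

```python
# Sketch: all-pairs model comparison with Bonferroni correction and Cohen's d (Section 5.3.3).
from itertools import combinations
import numpy as np
from scipy import stats

def compare_models(per_case_scores: dict[str, np.ndarray]) -> list[dict]:
    pairs = list(combinations(sorted(per_case_scores), 2))
    m = len(pairs)                       # size of the comparison family
    results = []
    for a, b in pairs:
        x, y = per_case_scores[a], per_case_scores[b]
        _, p = stats.ttest_ind(x, y, equal_var=False)
        pooled_sd = np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)
        results.append({
            "pair": (a, b),
            "p_uncorrected": float(p),
            "p_bonferroni": float(min(1.0, p * m)),
            "cohens_d": float((x.mean() - y.mean()) / pooled_sd),
        })
    return results
```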

6. Validation

6.1 Human Expert Calibration

6.1.1 Expert Panel Composition

Tradition | Minimum Experts | Qualifications
--------- | --------------- | --------------
Reformed | 3 | M.Div.+ from confessional seminary
Catholic | 3 | Advanced degree in Catholic theology
Orthodox | 3 | Theological training in Orthodox tradition
Evangelical | 3 | Faculty/pastoral experience
Pentecostal | 2 | Academic credentials + tradition familiarity

6.1.2 Calibration Protocol

  1. Gold standard creation: Experts independently score 50 responses
  2. Adjudication: Disagreements resolved through discussion
  3. IRR calculation: Krippendorff's α between experts
  4. Judge calibration: Compare LLM judge to expert majority vote
  5. Threshold: LLM judge must achieve α ≥ 0.67 vs. human panel
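
Steps 4 and 5 amount to a gate: compute the expert majority vote per response and require α ≥ 0.67 between it and the LLM judge. A sketch, reusing the same `krippendorff` package assumed in Section 5.3.2; all names are illustrative.

```python
# Sketch: judge calibration gate against the expert majority vote (Section 6.1.2).
from collections import Counter
import numpy as np
import krippendorff

def majority_vote(expert_scores: list[list[int]]) -> list[int]:
    """expert_scores: one list of 0-3 scores per expert, aligned by response."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*expert_scores)]

def judge_passes(judge_scores: list[int], expert_scores: list[list[int]],
                 threshold: float = 0.67) -> bool:
    gold = majority_vote(expert_scores)
    data = np.array([judge_scores, gold], dtype=float)
    alpha = krippendorff.alpha(reliability_data=data, level_of_measurement="ordinal")
    return alpha >= threshold
```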

6.1.3 Calibration Dataset

Characteristic | Requirement
-------------- | -----------
Size | 50 responses minimum per dimension
Stratification | By difficulty, tradition, model
Selection | Random + edge cases identified in pilot
Refresh | 25% new cases each evaluation cycle

6.2 Bias Analysis

6.2.1 Position Bias

Protocol:

  • Run all pairwise comparisons in both orders (A vs B, B vs A)
  • Calculate preference reversal rate
  • Target: < 10% reversal rate
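
A sketch of the reversal-rate computation; each record is assumed to store the judge's preferred response (by model label, not position) for the same pair presented in both orders.

```python
# Sketch: position-bias check via preference reversal rate (Section 6.2.1).

def reversal_rate(comparisons: list[dict]) -> float:
    """Each dict: {'pref_ab': 'A'|'B'|'tie', 'pref_ba': 'A'|'B'|'tie'}, where labels
    refer to the underlying responses regardless of presentation order."""
    decided = [c for c in comparisons if "tie" not in (c["pref_ab"], c["pref_ba"])]
    if not decided:
        return 0.0
    reversals = sum(1 for c in decided if c["pref_ab"] != c["pref_ba"])
    return reversals / len(decided)

# Target from the protocol above: reversal_rate(comparisons) < 0.10
```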

6.2.2 Verbosity Bias

Protocol:

  • Calculate Pearson correlation between response length and score
  • Target: |r| < 0.3 (weak or no correlation)
  • If violated, implement length-normalization
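
A sketch of the length-score correlation check with scipy; word count stands in for response length.

```python
# Sketch: verbosity-bias check via Pearson correlation of length and score (Section 6.2.2).
import numpy as np
from scipy.stats import pearsonr

def verbosity_check(responses: list[str], scores: list[float], threshold: float = 0.3) -> dict:
    lengths = np.array([len(r.split()) for r in responses])   # word counts
    r, p = pearsonr(lengths, np.array(scores))
    return {"pearson_r": float(r), "p_value": float(p),
            "length_bias_flag": abs(r) >= threshold}
```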

6.2.3 Tradition Fairness

Protocol:

  • Cross-tradition grading: Reformed experts grade Catholic responses and vice versa
  • Measure systematic score differences
  • Target: no tradition is systematically scored more than 0.5 points higher or lower by out-group experts
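
A sketch of how the in-group versus out-group score gap could be measured; the record layout is an assumption.

```python
# Sketch: cross-tradition fairness check (Section 6.2.3).
import numpy as np

def outgroup_gaps(records: list[dict]) -> dict[str, float]:
    """Each record: {'response_tradition': str, 'grader_tradition': str, 'score': float}.
    Returns, per response tradition, mean in-group score minus mean out-group score."""
    gaps = {}
    for t in {r["response_tradition"] for r in records}:
        in_group = [r["score"] for r in records
                    if r["response_tradition"] == t and r["grader_tradition"] == t]
        out_group = [r["score"] for r in records
                     if r["response_tradition"] == t and r["grader_tradition"] != t]
        if in_group and out_group:
            gaps[t] = float(np.mean(in_group) - np.mean(out_group))
    return gaps

# Target from the protocol above: every absolute gap <= 0.5 points on the 0-3 scale.
```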

6.3 Sensitivity Analysis

6.3.1 Weight Perturbation

Test robustness by:

  • Varying dimension weights ± 5%
  • Measuring rank-order stability across models
  • Reporting sensitivity coefficients
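
A sketch of the perturbation procedure: jitter each dimension weight by up to ±5%, renormalize, recompute the composite ranking, and summarize stability with Kendall's τ against the baseline ranking. Inputs and names are illustrative.

```python
# Sketch: rank-order stability under +/-5% weight perturbation (Section 6.3.1).
import numpy as np
from scipy.stats import kendalltau

def perturb(weights: dict[str, float], rng: np.random.Generator) -> dict[str, float]:
    noisy = {d: w * (1 + rng.uniform(-0.05, 0.05)) for d, w in weights.items()}
    total = sum(noisy.values())
    return {d: w / total for d, w in noisy.items()}        # renormalize to sum to 1

def rank_stability(model_dim_means: dict[str, dict[str, float]],
                   weights: dict[str, float], trials: int = 200, seed: int = 0) -> float:
    """model_dim_means: model name -> {dimension -> mean 0-3 score}. Returns mean tau."""
    rng = np.random.default_rng(seed)
    models = sorted(model_dim_means)

    def composites(w: dict[str, float]) -> list[float]:
        return [sum(w[d] * model_dim_means[m][d] for d in w) for m in models]

    base = composites(weights)
    taus = []
    for _ in range(trials):
        tau, _ = kendalltau(base, composites(perturb(weights, rng)))
        taus.append(tau)
    return float(np.mean(taus))
```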

6.3.2 Judge Model Comparison

Cross-validate using:

  • Multiple judge models (Claude, GPT-5, Gemini)
  • Report agreement rates
  • Flag dimensions with high inter-judge variance

7. Traditions Evaluated

FaithBench v1.0 evaluates five traditions: Reformed, Catholic, Orthodox, Evangelical, and Pentecostal (see Section 6.1.1). The Reformed profile is detailed below.

7.1 Reformed

Doctrinal Framework: Covenant theology, TULIP, Westminster Standards

Key Competencies Tested:

  • Sola scriptura application
  • Predestination and election
  • Covenant of grace structure
  • Perseverance of the saints
  • Reformed hermeneutics

Authoritative Sources: Westminster Confession, Heidelberg Catechism, Canons of Dort, Three Forms of Unity


8. Limitations

8.1 Scope Limitations

What FaithBench measures:

  • Factual accuracy about theological positions
  • Reasoning within established frameworks
  • Faithful representation of denominational views
  • Biblical language competency

What FaithBench does not measure:

  • Spiritual edification value
  • Pastoral appropriateness
  • Alignment with any tradition's values
  • Suitability for worship or counseling

8.2 Methodological Limitations

Limitation | Impact | Mitigation
---------- | ------ | ----------
Western Christian focus | Non-Western traditions underrepresented | Planned expansion in v2.0
English-language evaluation | May miss non-English theological nuance | Original-language competency tested separately
LLM judge variability | Scoring inconsistency possible | IRR monitoring, human calibration
Weight subjectivity | Dimension weights influence rankings | Sensitivity analysis, expert Delphi planned
Sample size | Statistical power limits | Bootstrap CIs to quantify uncertainty

8.3 Known Biases

Bias Type | Current Status | Planned Mitigation
--------- | -------------- | ------------------
Position bias | Untested | Position reversal protocol
Verbosity bias | Untested | Length correlation analysis
Tradition bias | Untested | Cross-tradition expert grading
Difficulty confounding | Partial mitigation | IRT difficulty calibration

8.4 Generalizability

Results may not generalize to:

  • Non-Christian religious traditions
  • Non-academic theological contexts
  • Languages other than English
  • Highly specialized sub-fields (e.g., Syriac patristics)

9. Reproducibility Statement

9.1 Open Resources

Resource | Availability
-------- | ------------
Public test cases | GitHub repository
Evaluation code | MIT license
Judge prompts | Full prompts published
Scoring rubrics | This document
Model configurations | Documented per evaluation

9.2 Configuration Details

Parameter | Value
--------- | -----
Judge model | google/gemini-3-pro-preview
Temperature | 0.0 (deterministic)
Max tokens | 2048
Retry attempts | 3
Bootstrap iterations | 1,000
CI level | 95%

9.3 Version Control

  • Evaluation methodology versioned (current: v1.0)
  • Test case sets versioned with changelogs
  • Results include methodology version for reproducibility

10. Future Work

  1. Delphi weight derivation: Formal expert consensus methodology
  2. Non-Western traditions: African, Asian, Latin American theological frameworks
  3. Item response theory: Difficulty calibration using IRT models
  4. Multi-lingual evaluation: German, Spanish, Korean theological discourse
  5. Fine-grained sub-traditions: Distinguish within broad categories (e.g., Old Princeton vs. Dutch Reformed)
  6. Temporal analysis: Track model improvement over time

References

Abid, A., Farooqi, M., & Zou, J. (2021). Persistent anti-Muslim bias in large language models. AIES.

Bowman, S. R., & Dahl, G. E. (2021). What will it take to fix benchmarking in NLP? NAACL.

Gwet, K. L. (2014). Handbook of Inter-Rater Reliability (4th ed.). Advanced Analytics.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). Measuring massive multitask language understanding. ICLR.

Krippendorff, K. (2004). Content Analysis: An Introduction to Its Methodology (2nd ed.). Sage.

Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.

Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., ... & Koreeda, Y. (2023). Holistic evaluation of language models. TMLR.

Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., & Zhu, C. (2023). G-Eval: NLG evaluation using GPT-4 with better human alignment. EMNLP.

Raji, I. D., Bender, E. M., Paullada, A., Denton, E., & Hanna, A. (2021). AI and the everything in the whole wide world benchmark. NeurIPS Datasets and Benchmarks.

Zheng, L., Chiang, W. L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., ... & Stoica, I. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. NeurIPS Datasets and Benchmarks.


Citation

@misc{faithbench2026,
  title={FaithBench: A Rigorous Benchmark for Evaluating Theological
         Faithfulness in Large Language Models},
  author={FaithBench Team},
  year={2026},
  url={https://faithbench.com},
  note={Methodology v1.0}
}

Contributing

We welcome contributions from:

  • Theologians: Test case development, rubric refinement, expert calibration
  • AI Researchers: Evaluation methodology, statistical analysis, bias testing
  • Practitioners: Real-world use case scenarios, pastoral context feedback

Contact: hello@faithbench.com | GitHub: github.com/faithbench/faithbench