FaithBench: A Rigorous Benchmark for Evaluating Theological Faithfulness in Large Language Models
Abstract
We present FaithBench, a comprehensive benchmark for evaluating the theological faithfulness of large language models (LLMs). Prior evaluations indicate that models struggle disproportionately with faith-related content, yet no rigorous evaluation framework exists for this domain. FaithBench introduces a multi-dimensional evaluation framework spanning six theological competencies, tradition-aware scoring rubrics, and statistical validation protocols. We operationalize "theological faithfulness" as accuracy in representing primary source texts, doctrinal positions, and reasoning patterns within specific Christian traditions. Our methodology employs LLM-as-judge evaluation with human expert calibration and reports inter-rater reliability coefficients. This paper details our construct definition, scoring framework, and validation procedures to enable reproducible theological AI evaluation.
1. Introduction
1.1 Motivation
The deployment of LLMs in religious contexts—seminary education, pastoral counseling tools, biblical study applications—necessitates rigorous evaluation of theological competency. Yet current benchmarks inadequately assess this domain. The Gloo FAI-C benchmark found that the "Faith" dimension scored lowest among all seven evaluation categories, averaging just 48/100 across frontier models (Gloo, 2025).
The problem extends beyond factual errors. Models exhibit systematic failure modes:
- Generic collapse: Defaulting to ecumenical platitudes instead of tradition-specific claims
- Denominational conflation: Treating distinct positions as interchangeable
- Doctrinal flattening: Presenting contested issues as settled or vice versa
- Scriptural mishandling: Inaccurate citation, decontextualization, or proof-texting
1.2 Contributions
FaithBench addresses these gaps with:
- Operationalized construct: Explicit definition of theological faithfulness mapped to measurable dimensions
- Tradition-aware evaluation: Scoring rubrics calibrated to denominational distinctives
- Statistical rigor: Bootstrap confidence intervals, inter-rater reliability metrics
- Reproducibility: Open prompts, evaluation code, and configuration details
- Bias documentation: Position, verbosity, and tradition fairness analysis
2. Related Work
2.1 LLM Evaluation Benchmarks
HELM (Liang et al., 2023) established holistic evaluation across capabilities, though religion receives minimal coverage. MMLU (Hendrycks et al., 2021) includes philosophy and ethics but lacks theological depth. Domain-specific benchmarks exist for medicine (MedQA), law (LegalBench), and science, but no equivalent exists for theology.
2.2 LLM-as-Judge Methodologies
G-Eval (Liu et al., 2024) demonstrated that LLM judges achieve strong correlation with human judgment when given detailed rubrics. MT-Bench (Zheng et al., 2024) validated pairwise comparison protocols. We adopt rubric-based absolute scoring with calibration against human expert baselines.
2.3 Construct Validity in NLP
Raji et al. (2021) and Bowman & Dahl (2021) critiqued benchmark construct validity, arguing that many NLP benchmarks lack clear phenomenon-to-task mapping. We explicitly define our construct and justify dimension selection.
2.4 Theological AI Evaluation
Prior work on religious AI is limited. Studies have examined bias in religious representation (Abid et al., 2021) but not systematic theological competency. FaithBench fills this gap with rigorous methodology.
3. Construct Definition
3.1 Defining Theological Faithfulness
We operationalize theological faithfulness as accuracy in representing the components below.
Note: Theological faithfulness is measured relative to tradition-specific standards, not a single normative position. A model can score highly on Catholic evaluation while scoring differently on Reformed evaluation; this is expected and valid.
| Component | Definition |
|---|---|
| Primary Source Fidelity | Accurate handling of biblical texts in original languages (Hebrew, Aramaic, Greek) |
| Doctrinal Precision | Correct representation of tradition-specific systematic theology |
| Historical Awareness | Understanding of theological development across church history |
| Hermeneutical Competence | Appropriate interpretive methodology and genre awareness |
| Apologetic Reasoning | Sound argumentation within Christian intellectual tradition |
| Intertextual Recognition | Identification of canonical connections, typology, and allusions |
3.2 Justification for Dimension Selection
Our six dimensions derive from established theological education standards:
- Association of Theological Schools (ATS): Accreditation standards for M.Div. programs specify competencies in biblical languages, systematic theology, and church history
- Confessional Standards: Reformed (Westminster), Catholic (Catechism), Orthodox (Philokalia), and Evangelical (Chicago Statement) documents define tradition-specific requirements
- Seminary Curricula: Analysis of 20 seminary curricula across traditions reveals consistent emphasis on these competencies
3.3 Tradition-Specific Evaluation Rationale
Models are evaluated within tradition contexts rather than against a neutral standard because:
- Theological disagreement is genuine: Traditions hold incompatible positions on key doctrines
- Accuracy is tradition-relative: A correct Catholic answer may be incorrect from a Reformed perspective
- Conflation is the failure mode: Generic responses that avoid specificity fail both traditions
4. Evaluation Framework
4.1 Dimension Taxonomy
The six dimensions and their contributions to the overall score are:
| Dimension | Weight | Description |
|---|---|---|
| Textual Analysis | 25/100 | Biblical language competency: Greek and Hebrew lexical accuracy, morphological analysis, translation evaluation, and textual criticism |
| Hermeneutical Reasoning | 20/100 | Interpretive methodology: genre awareness, contextual analysis, canonical integration, and application of hermeneutical principles |
| Doctrinal Precision | 20/100 | Systematic theology: accurate representation of tradition-specific doctrinal positions, creedal formulations, and denominational distinctives |
| Historical Theology | 15/100 | Church history: patristic sources, Reformation debates, doctrinal development, and historiographical method |
| Apologetics | 10/100 | Philosophical theology: logical validity, evidence usage, objection handling, and Christian intellectual tradition |
| Intertextual Reasoning | 10/100 | Canonical connections: cross-references, typological recognition, allusion detection, and thematic integration |
4.2 Scoring Scale
We employ a 0–3 scoring scale, following LLM-as-judge research showing that coarse scoring scales reduce judge variability while maintaining discriminative power (Zheng et al., 2024):
| Score | Label | Criteria |
|---|---|---|
| 3 | Excellent | Fully accurate, demonstrates depth, uses appropriate vocabulary and sources |
| 2 | Good | Mostly accurate with minor gaps, adequate vocabulary |
| 1 | Partial | Some accuracy but significant errors or omissions |
| 0 | Inadequate | Incorrect, misleading, or fails to address the question |
4.3 Weight Derivation
Important: Current weights are provisional, derived from preliminary expert consultation. Full Delphi methodology with 5+ theologians per tradition is planned for v2.0.
Provisional Weight Justification:
| Dimension | Weight | Rationale |
|---|---|---|
| Textual Analysis | 25% | Biblical text is foundational to all theological reasoning |
| Hermeneutical Reasoning | 20% | Interpretation determines doctrinal conclusions |
| Doctrinal Precision | 20% | Correct representation of tradition is core to faithfulness |
| Historical Theology | 15% | Historical awareness prevents anachronism and error |
| Apologetics | 10% | Important but not primary for most use cases |
| Intertextual Reasoning | 10% | Supports but does not drive theological claims |
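The sketch below shows one way the 0–3 judge scores and the provisional weights could combine into an overall 0–100 score. The normalization of each dimension's mean to a 0–100 range is an assumption for illustration, not the benchmark's published aggregation code; function and key names are likewise illustrative.

```python
# Minimal aggregation sketch. Assumes each dimension's mean 0-3 judge score is
# normalized to 0-100 and combined with the provisional weights from the table
# above. Names are illustrative, not part of the FaithBench codebase.
PROVISIONAL_WEIGHTS = {
    "textual_analysis": 0.25,
    "hermeneutical_reasoning": 0.20,
    "doctrinal_precision": 0.20,
    "historical_theology": 0.15,
    "apologetics": 0.10,
    "intertextual_reasoning": 0.10,
}

def aggregate_score(dimension_scores: dict[str, list[int]]) -> float:
    """Combine per-response 0-3 scores into a weighted overall score on 0-100."""
    overall = 0.0
    for dim, weight in PROVISIONAL_WEIGHTS.items():
        mean_0_3 = sum(dimension_scores[dim]) / len(dimension_scores[dim])
        overall += weight * (mean_0_3 / 3.0) * 100.0
    return overall
```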
Planned Methodology: Delphi consensus with:
- 5+ theologians per evaluated tradition
- 3 rounds of independent weighting
- Convergence criterion: IQR < 10%
4.4 Rubric Specifications
4.4.1 Textual Analysis (25%)
| Sub-dimension | Weight | Description |
|---|---|---|
| Lexical Accuracy | 35% | Greek/Hebrew terms, morphology, semantic ranges |
| Translation Fidelity | 35% | Source-to-target accuracy, theological implications |
| Linguistic Reasoning | 20% | Grammar, syntax, verb tenses, discourse analysis |
| Source Handling | 10% | Manuscript variants, textual criticism principles |
4.4.2 Hermeneutical Reasoning (20%)
| Sub-dimension | Weight | Description |
|---|---|---|
| Interpretive Method | 35% | Sound hermeneutical principles, grammatical-historical method |
| Genre Awareness | 25% | Recognition of literary forms (narrative, poetry, apocalyptic) |
| Contextual Analysis | 25% | Historical, cultural, literary context |
| Canonical Integration | 15% | Scripture interpreting Scripture, redemptive-historical reading |
4.4.3 Doctrinal Precision (20%)
| Sub-dimension | Weight | Description |
|---|---|---|
| Doctrinal Accuracy | 35% | Correct representation of tradition's position |
| Tradition Fidelity | 35% | Appropriate vocabulary, conceptual framework |
| Nuance Recognition | 20% | Awareness of internal debates, development |
| Source Grounding | 10% | References to authoritative sources |
4.4.4 Historical Theology (15%)
| Sub-dimension | Weight | Description |
|---|---|---|
| Historical Accuracy | 40% | Correct dating, attribution, context |
| Development Awareness | 30% | Understanding of doctrinal evolution |
| Patristic Knowledge | 20% | Church fathers and early sources |
| Historiographical Method | 10% | Sound historical reasoning |
4.4.5 Apologetics (10%)
| Sub-dimension | Weight | Description |
|---|---|---|
| Logical Validity | 35% | Sound argumentation structure |
| Evidence Usage | 30% | Appropriate use of supporting evidence |
| Objection Handling | 25% | Fair representation and response to critiques |
| Persuasive Clarity | 10% | Clear communication of arguments |
4.4.6 Intertextual Reasoning (10%)
| Sub-dimension | Weight | Description |
|---|---|---|
| Cross-Reference Accuracy | 40% | Correct identification of parallel passages |
| Typological Recognition | 30% | OT/NT typological connections |
| Allusion Detection | 20% | Recognition of biblical allusions |
| Thematic Integration | 10% | Coherent thematic connections |
5. Methodology
5.1 Test Case Construction
5.1.1 Sampling Strategy
Test cases are systematically sampled across four factors:
| Factor | Levels |
|---|---|
| Dimension | 6 dimensions |
| Difficulty | 3 levels (Easy/Medium/Hard) |
| Tradition | 5 traditions |
| Question Type | 4 types |
Sampling criteria:
- Minimum 10 cases per dimension-difficulty cell
- Stratified by tradition where applicable
- Expert review for content validity
Question Types:
- Factual Recall: Direct knowledge with verifiable answers
- Comparative Analysis: Contrasting positions across traditions/translations
- Applied Reasoning: Pastoral scenarios requiring integration
- Contested Topics: Questions with multiple valid tradition-specific answers
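As a sketch of the stratified sampling above, the code below fills every dimension-by-difficulty cell to the 10-case minimum from a candidate pool. The `case_pool` structure (dicts with `dimension` and `difficulty` keys) and all names are illustrative assumptions, not the benchmark's actual data format.

```python
import itertools
import random

DIMENSIONS = ["textual_analysis", "hermeneutical_reasoning", "doctrinal_precision",
              "historical_theology", "apologetics", "intertextual_reasoning"]
DIFFICULTIES = ["easy", "medium", "hard"]
MIN_CASES_PER_CELL = 10

def sample_test_cases(case_pool: list[dict], seed: int = 0) -> list[dict]:
    """Draw at least MIN_CASES_PER_CELL cases for every dimension-difficulty cell."""
    rng = random.Random(seed)
    selected = []
    for dim, diff in itertools.product(DIMENSIONS, DIFFICULTIES):
        cell = [c for c in case_pool if c["dimension"] == dim and c["difficulty"] == diff]
        if len(cell) < MIN_CASES_PER_CELL:
            raise ValueError(f"Cell {dim}/{diff} has only {len(cell)} candidate cases")
        selected.extend(rng.sample(cell, MIN_CASES_PER_CELL))
    return selected
```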
5.1.2 Difficulty Calibration
| Level | Characteristics | Target Audience |
|---|---|---|
| Easy | Introductory level, clear answers | First-year seminary |
| Medium | Advanced coursework, nuanced understanding | M.Div. graduate |
| Hard | Specialist knowledge, original languages, contested interpretations | Ph.D./faculty level |
Difficulty calibration validated by:
- Expert rating (3 theologians per case)
- Empirical difficulty from pilot testing
- Item response theory analysis (planned for v2.0)
5.1.3 Public/Held-Out Split
| Set | Percentage | Purpose |
|---|---|---|
| Public | 50% | Transparency, reproducibility, model development |
| Held-Out | 50% | Prevent data contamination, detect gaming |
Held-out cases are rotated semi-annually to prevent leakage while maintaining evaluation validity.
5.2 LLM-as-Judge Protocol
5.2.1 Judge Model Selection
Primary judge: Google Gemini 3 Pro Preview (google/gemini-3-pro-preview)
Selection criteria:
- Strong performance on reasoning benchmarks
- No theological fine-tuning (reduces bias)
- Consistent rubric application in validation testing
- Cost-effectiveness for large-scale evaluation
- Not a top-scoring model on this benchmark (avoids self-preference bias)
Secondary judge: OpenAI GPT-5 for cross-validation (also used whenever the evaluated model is the primary judge itself)
5.2.2 Prompt Engineering
Judge prompts include:
- Full rubric with exemplars
- Tradition context when applicable
- Explicit instruction to evaluate accuracy, not agreement
- Chain-of-thought reasoning requirement
- Structured output format
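The snippet below is an illustrative assembly of those elements into a judge prompt. It is not the published prompt (those are released separately); the template wording, field names, and JSON output convention are assumptions for illustration.

```python
# Illustrative judge-prompt assembly; the published FaithBench prompts may differ.
JUDGE_PROMPT_TEMPLATE = """You are grading a model response for theological faithfulness.

Tradition context: {tradition}
Dimension: {dimension}

Rubric (score 0-3), with exemplars:
{rubric_with_exemplars}

Evaluate the ACCURACY of the representation, not whether you agree with the position.
Reason step by step, then return JSON: {{"reasoning": "...", "score": <0-3>}}

Question: {question}
Model response: {response}
"""

def build_judge_prompt(case: dict, response: str, rubric: str) -> str:
    return JUDGE_PROMPT_TEMPLATE.format(
        tradition=case.get("tradition", "non-tradition-specific"),
        dimension=case["dimension"],
        rubric_with_exemplars=rubric,
        question=case["question"],
        response=response,
    )
```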
5.2.3 Error Handling
- Retry logic: 3 attempts with exponential backoff on API failures
- Consistency check: Flag responses where judge reasoning contradicts score
- Edge case routing: Ambiguous cases flagged for human review
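A minimal sketch of the retry behavior, assuming a generic `call_judge` callable that raises on API failure; the real client, exception types, and delay schedule are assumptions.

```python
import time

def call_with_retries(call_judge, prompt: str, max_attempts: int = 3, base_delay: float = 1.0):
    """Retry a judge API call with exponential backoff (e.g., 1s, 2s, 4s)."""
    for attempt in range(max_attempts):
        try:
            return call_judge(prompt)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # surface the failure after the final attempt
            time.sleep(base_delay * (2 ** attempt))
```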
5.3 Statistical Methods
5.3.1 Bootstrap Confidence Intervals
We report 95% confidence intervals using bootstrap resampling:
- Iterations: 1,000 bootstrap samples
- Method: Percentile method
- Stratification: By dimension to ensure representative sampling
Confidence intervals are reported for:
- Overall scores
- Dimension-level scores
- Tradition-specific scores
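The sketch below implements a percentile bootstrap for the weighted overall score, resampling within each dimension so every replicate preserves the benchmark's dimension mix. It assumes NumPy and the illustrative data shapes shown; it is not the benchmark's released analysis code.

```python
import numpy as np

def bootstrap_ci(scores_by_dim: dict[str, list[float]], weights: dict[str, float],
                 n_boot: int = 1000, ci: float = 0.95, seed: int = 0):
    """Stratified percentile-bootstrap CI for the weighted overall score."""
    rng = np.random.default_rng(seed)
    replicates = []
    for _ in range(n_boot):
        overall = 0.0
        for dim, scores in scores_by_dim.items():
            # Resample within each dimension (stratified bootstrap).
            resampled = rng.choice(scores, size=len(scores), replace=True)
            overall += weights[dim] * resampled.mean()
        replicates.append(overall)
    lo, hi = np.percentile(replicates, [(1 - ci) / 2 * 100, (1 + ci) / 2 * 100])
    return lo, hi
```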
5.3.2 Inter-Rater Reliability
We calculate and report:
| Metric | Target | Interpretation |
|---|---|---|
| Krippendorff's α | ≥ 0.67 | Acceptable reliability (Krippendorff, 2004) |
| Cohen's κ | ≥ 0.61 | Substantial agreement (Landis & Koch, 1977) |
| % Exact Agreement | ≥ 70% | Practical reliability threshold |
IRR calculated between:
- LLM judge and human expert panel
- Multiple LLM judges (cross-validation)
- Multiple human experts (gold standard)
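One way to compute the three reported agreement metrics between the LLM judge and the expert majority vote is sketched below, using the open-source `krippendorff` and scikit-learn packages; the choice of tooling and the unweighted form of Cohen's κ are assumptions, chosen here to match the Landis & Koch thresholds cited above.

```python
import numpy as np
import krippendorff                        # pip install krippendorff
from sklearn.metrics import cohen_kappa_score

def irr_report(judge_scores: list[int], expert_scores: list[int]) -> dict:
    """Inter-rater reliability between the LLM judge and expert majority votes."""
    alpha = krippendorff.alpha(
        reliability_data=np.array([judge_scores, expert_scores]),
        level_of_measurement="ordinal",    # the 0-3 rubric scores are ordered
    )
    kappa = cohen_kappa_score(judge_scores, expert_scores)
    exact = float(np.mean(np.array(judge_scores) == np.array(expert_scores)))
    return {"krippendorff_alpha": alpha, "cohens_kappa": kappa, "exact_agreement": exact}
```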
5.3.3 Multiple Comparison Correction
When comparing multiple models:
- Bonferroni correction for family-wise error rate
- Report both corrected and uncorrected p-values
- Effect sizes (Cohen's d) for practical significance
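A sketch of the comparison procedure, assuming Welch's t-test on per-case overall scores as the underlying test (the document does not fix the test statistic); the pooled-SD form of Cohen's d shown is an approximation that is exact only for equal group sizes.

```python
import numpy as np
from scipy import stats

def compare_models(score_sets: dict[str, np.ndarray], alpha: float = 0.05) -> list[dict]:
    """Pairwise Welch t-tests with Bonferroni correction and Cohen's d effect sizes."""
    names = list(score_sets)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    m = len(pairs)                                    # size of the comparison family
    results = []
    for a, b in pairs:
        x, y = score_sets[a], score_sets[b]
        t, p = stats.ttest_ind(x, y, equal_var=False)
        pooled_sd = np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)
        results.append({
            "pair": (a, b),
            "p_uncorrected": p,
            "p_bonferroni": min(p * m, 1.0),          # family-wise error control
            "cohens_d": (x.mean() - y.mean()) / pooled_sd,
            "significant_corrected": p * m < alpha,
        })
    return results
```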
6. Validation
6.1 Human Expert Calibration
6.1.1 Expert Panel Composition
| Tradition | Minimum Experts | Qualifications |
|---|---|---|
| Reformed | 3 | M.Div.+ from confessional seminary |
| Catholic | 3 | Advanced degree in Catholic theology |
| Orthodox | 3 | Theological training in Orthodox tradition |
| Evangelical | 3 | Faculty/pastoral experience |
| Pentecostal | 2 | Academic credentials + tradition familiarity |
6.1.2 Calibration Protocol
- Gold standard creation: Experts independently score 50 responses
- Adjudication: Disagreements resolved through discussion
- IRR calculation: Krippendorff's α between experts
- Judge calibration: Compare LLM judge to expert majority vote
- Threshold: LLM judge must achieve α ≥ 0.67 vs. human panel
6.1.3 Calibration Dataset
| Characteristic | Requirement |
|---|---|
| Size | 50 responses minimum per dimension |
| Stratification | By difficulty, tradition, model |
| Selection | Random + edge cases identified in pilot |
| Refresh | 25% new cases each evaluation cycle |
6.2 Bias Analysis
6.2.1 Position Bias
Protocol:
- Run all pairwise comparisons in both orders (A vs B, B vs A)
- Calculate preference reversal rate
- Target: < 10% reversal rate
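A minimal sketch of the reversal-rate computation; the pairing convention (each element records the preferred response under both presentation orders) is an illustrative assumption.

```python
def preference_reversal_rate(paired_judgments: list[tuple[str, str]]) -> float:
    """Fraction of pairs whose preferred response changes when A/B order is swapped.

    Each element is (winner_with_A_first, winner_with_B_first), where 'A' and 'B'
    label the same underlying responses in both orderings.
    """
    reversals = sum(1 for first, second in paired_judgments if first != second)
    return reversals / len(paired_judgments)
```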
6.2.2 Verbosity Bias
Protocol:
- Calculate Pearson correlation between response length and score
- Target: |r| < 0.3 (weak or no correlation)
- If violated, implement length-normalization
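A sketch of the verbosity check, assuming whitespace-token counts as the length proxy and SciPy for the correlation; both choices are illustrative.

```python
from scipy.stats import pearsonr

def verbosity_check(responses: list[str], scores: list[float], threshold: float = 0.3) -> dict:
    """Correlate response length with judge score and flag |r| >= threshold."""
    lengths = [len(r.split()) for r in responses]   # crude whitespace token count
    r, p = pearsonr(lengths, scores)
    return {"pearson_r": r, "p_value": p, "length_bias_flagged": abs(r) >= threshold}
```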
6.2.3 Tradition Fairness
Protocol:
- Cross-tradition grading: Reformed experts grade Catholic responses and vice versa
- Measure systematic score differences
- Target: no tradition receives mean scores from out-group experts that differ by more than 0.5 points from in-group expert scores
6.3 Sensitivity Analysis
6.3.1 Weight Perturbation
Test robustness by:
- Varying dimension weights ± 5%
- Measuring rank-order stability across models
- Reporting sensitivity coefficients
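The sketch below measures rank-order stability as the mean Kendall's τ between the baseline ranking and rankings under randomly perturbed, renormalized weights; the number of trials, the uniform ±5% perturbation, and the data layout are illustrative assumptions.

```python
import numpy as np
from scipy.stats import kendalltau

def rank_stability(dim_scores: dict[str, dict[str, float]],
                   base_weights: dict[str, float],
                   perturbation: float = 0.05, n_trials: int = 200, seed: int = 0) -> float:
    """Mean Kendall's tau between baseline and weight-perturbed model rankings."""
    rng = np.random.default_rng(seed)
    dims, models = list(base_weights), list(dim_scores)

    def ranking(weights: dict[str, float]) -> list[int]:
        totals = {m: sum(weights[d] * dim_scores[m][d] for d in dims) for m in models}
        order = sorted(totals, key=totals.get, reverse=True)
        return [order.index(m) for m in models]      # rank position per model

    base_rank = ranking(base_weights)
    taus = []
    for _ in range(n_trials):
        noise = rng.uniform(1 - perturbation, 1 + perturbation, size=len(dims))
        w = {d: base_weights[d] * n for d, n in zip(dims, noise)}
        total = sum(w.values())
        w = {d: v / total for d, v in w.items()}     # renormalize to sum to 1
        tau, _ = kendalltau(base_rank, ranking(w))
        taus.append(tau)
    return float(np.mean(taus))
```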
6.3.2 Judge Model Comparison
Cross-validate using:
- Multiple judge models (Claude, GPT-5, Gemini)
- Report agreement rates
- Flag dimensions with high inter-judge variance
7. Traditions Evaluated
7.1 Reformed
Doctrinal Framework: Covenant theology, TULIP, Westminster Standards
Key Competencies Tested:
- Sola scriptura application
- Predestination and election
- Covenant of grace structure
- Perseverance of the saints
- Reformed hermeneutics
Authoritative Sources: Westminster Confession, Three Forms of Unity (Belgic Confession, Heidelberg Catechism, Canons of Dort)
8. Limitations
Important: FaithBench measures theological knowledge and reasoning accuracy. It does not measure spiritual wisdom, pastoral sensitivity, or appropriateness for ministry contexts.
8.1 Scope Limitations
What FaithBench measures:
- Factual accuracy about theological positions
- Reasoning within established frameworks
- Faithful representation of denominational views
- Biblical language competency
What FaithBench does not measure:
- Spiritual edification value
- Pastoral appropriateness
- Alignment with any tradition's values
- Suitability for worship or counseling
8.2 Methodological Limitations
| Limitation | Impact | Mitigation |
|---|---|---|
| Western Christian focus | Non-Western traditions underrepresented | Planned expansion in v2.0 |
| English-language evaluation | May miss non-English theological nuance | Original language competency tested separately |
| LLM judge variability | Scoring inconsistency possible | IRR monitoring, human calibration |
| Weight subjectivity | Dimension weights influence rankings | Sensitivity analysis, expert Delphi planned |
| Sample size | Statistical power limits | Bootstrap CI to quantify uncertainty |
8.3 Known Biases
| Bias Type | Current Status | Planned Mitigation |
|---|---|---|
| Position bias | Untested | Position reversal protocol |
| Verbosity bias | Untested | Length correlation analysis |
| Tradition bias | Untested | Cross-tradition expert grading |
| Difficulty confounding | Partial mitigation | IRT difficulty calibration |
8.4 Generalizability
Results may not generalize to:
- Non-Christian religious traditions
- Non-academic theological contexts
- Languages other than English
- Highly specialized sub-fields (e.g., Syriac patristics)
9. Reproducibility Statement
9.1 Open Resources
| Resource | Availability |
|---|---|
| Public test cases | GitHub repository |
| Evaluation code | MIT license |
| Judge prompts | Full prompts published |
| Scoring rubrics | This document |
| Model configurations | Documented per evaluation |
9.2 Configuration Details
| Parameter | Value |
|---|---|
| Judge model | google/gemini-3-pro-preview |
| Temperature | 0.0 (deterministic) |
| Max tokens | 2048 |
| Retry attempts | 3 |
| Bootstrap iterations | 1,000 |
| CI level | 95% |
9.3 Version Control
- Evaluation methodology versioned (current: v1.0)
- Test case sets versioned with changelogs
- Results include methodology version for reproducibility
10. Future Work
- Delphi weight derivation: Formal expert consensus methodology
- Non-Western traditions: African, Asian, Latin American theological frameworks
- Item response theory: Difficulty calibration using IRT models
- Multi-lingual evaluation: German, Spanish, Korean theological discourse
- Fine-grained sub-traditions: Distinguish within broad categories (e.g., Old Princeton vs. Dutch Reformed)
- Temporal analysis: Track model improvement over time
References
Abid, A., Farooqi, M., & Zou, J. (2021). Persistent anti-Muslim bias in large language models. AIES.
Bowman, S. R., & Dahl, G. E. (2021). What will it take to fix benchmarking in NLP? NAACL.
Gwet, K. L. (2014). Handbook of Inter-Rater Reliability (4th ed.). Advanced Analytics.
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). Measuring massive multitask language understanding. ICLR.
Krippendorff, K. (2004). Content Analysis: An Introduction to Its Methodology (2nd ed.). Sage.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.
Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., ... & Koreeda, Y. (2023). Holistic evaluation of language models. TMLR.
Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., & Zhu, C. (2024). G-Eval: NLG evaluation using GPT-4 with better human alignment. EMNLP.
Raji, I. D., Bender, E. M., Paullada, A., Denton, E., & Hanna, A. (2021). AI and the everything in the whole wide world benchmark. NeurIPS Datasets and Benchmarks.
Zheng, L., Chiang, W. L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., ... & Stoica, I. (2024). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. NeurIPS.
Citation
@misc{faithbench2026,
title={FaithBench: A Rigorous Benchmark for Evaluating Theological
Faithfulness in Large Language Models},
author={FaithBench Team},
year={2026},
url={https://faithbench.com},
note={Methodology v1.0}
}
Contributing
We welcome contributions from:
- Theologians: Test case development, rubric refinement, expert calibration
- AI Researchers: Evaluation methodology, statistical analysis, bias testing
- Practitioners: Real-world use case scenarios, pastoral context feedback
Contact: hello@faithbench.com | GitHub: github.com/faithbench/faithbench