FaithBench Benchmark Card
This document follows the benchmark card format adapted from Gebru et al. (2021) "Datasheets for Datasets" and Mitchell et al. (2019) "Model Cards for Model Reporting." It provides a standardized summary of FaithBench for AI researchers and evaluators.
1. Benchmark Overview
| Field | Value |
|---|---|
| Name | FaithBench |
| Version | v1.0 |
| URL | faithbench.com |
| License | MIT (evaluation code) |
| Maintainer | FaithBench Team |
| Contact | hello@faithbench.com |
One-sentence summary: FaithBench is an initial benchmark framework for evaluating the theological faithfulness of large language models across Christian traditions, using LLM-as-judge scoring with planned human expert calibration.
2. Intended Use
FaithBench is designed for:
- Comparing theological knowledge across LLMs under controlled conditions (standardized prompts, temperature, token limits)
- Identifying dimension-specific strengths and weaknesses (e.g., a model may handle historical theology well but struggle with textual analysis)
- Identifying tradition-specific performance differences (e.g., a model may represent Catholic doctrine more accurately than Reformed doctrine)
- Informing model selection for theological applications (seminary tools, Bible study apps, pastoral AI)
- Research into LLM evaluation methodology for subjective, interpretive domains
3. Out-of-Scope Use
FaithBench should not be used for:
- Certifying theological safety. A high FaithBench score does not mean a model is safe for unsupervised theological use.
- Measuring spiritual wisdom or pastoral sensitivity. FaithBench measures knowledge accuracy, not wisdom, empathy, or pastoral appropriateness.
- Making deployment decisions without additional evaluation. FaithBench tests models under controlled conditions; real-world performance depends on system prompts, RAG, temperature, and user interaction patterns.
- Evaluating non-Christian traditions. The benchmark covers only Christian theology in v1.0.
- Comparing models across different benchmark versions. Held-out rotation and methodology changes may affect cross-version comparability.
- Ranking models with small score differences. Models within each other's confidence intervals should be treated as statistically indistinguishable.
4. Benchmark Composition
| Characteristic | Value |
|---|---|
| Total test cases | 413 |
| Dimensions | 6 (textual, hermeneutical, doctrinal, historical, apologetics, intertextual) |
| Difficulty levels | 4 (easy, medium, hard, expert) |
| Active difficulty levels | 2 (hard, expert — easy/medium deactivated due to ceiling effects) |
| Traditions | 4 (Reformed, Catholic, Orthodox, Evangelical) |
| Question types | 4 (factual recall, comparative analysis, applied reasoning, contested topics) |
| Public/held-out split | 50% / 50% |
| Held-out rotation | Semi-annual |
| Minimum cases per cell | 10 per dimension-difficulty pair |
Active test set composition:
- Hard questions: ~60% of scored items (Ph.D./faculty-level content)
- Expert questions: ~40% of scored items (multi-hop reasoning, adversarial design)
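Combined with the difficulty weights in Section 5 (Hard = 2.0, Expert = 3.0), this mix implies that the two active tiers contribute roughly equally to the weighted score. A quick arithmetic check:

```python
# Approximate share of each tier's contribution to the difficulty-weighted
# score, given the stated mix (~60% hard, ~40% expert) and the Section 5 weights.
mix = {"hard": 0.60, "expert": 0.40}
weight = {"hard": 2.0, "expert": 3.0}

contrib = {tier: mix[tier] * weight[tier] for tier in mix}   # both 1.2
total = sum(contrib.values())                                # 2.4
share = {tier: contrib[tier] / total for tier in contrib}    # both 0.5
```

Under the stated proportions, hard and expert items each account for about half of the difficulty-weighted total.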
5. Evaluation Protocol
Test Model Configuration
| Parameter | Value |
|---|---|
| Temperature | 0.7 |
| Max tokens | 2,000 |
| Provider | OpenRouter |
| Reasoning/thinking | Disabled (where configurable) |
| System prompt | Minimal |
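As a minimal sketch, the configuration above could be expressed as an OpenRouter-style chat-completions request payload. The model slug and system-prompt wording here are illustrative, not part of the benchmark specification:

```python
def build_test_request(question, model="example/model-slug"):
    """Build a chat-completions payload with the benchmark's fixed
    decoding parameters (illustrative sketch; slug is a placeholder)."""
    return {
        "model": model,
        "messages": [
            # "Minimal" system prompt; exact wording is an assumption.
            {"role": "system", "content": "Answer the question directly."},
            {"role": "user", "content": question},
        ],
        "temperature": 0.7,   # test-model temperature per protocol
        "max_tokens": 2000,   # response cap per protocol
    }

payload = build_test_request("Summarize the Chalcedonian Definition.")
```

The payload would then be POSTed to the provider's OpenAI-compatible chat-completions endpoint with an API key.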
Judge Configuration
| Parameter | Value |
|---|---|
| Primary judge | google/gemini-3-flash-preview |
| Fallback judge | openai/gpt-4o-mini |
| Temperature | 0 |
| Max tokens | 16,000 |
| Output format | Structured JSON |
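Since the judge emits structured JSON, each verdict can be parsed and range-checked before scoring. The field names below (`scores`, `rationale`) are hypothetical; the actual schema is defined in the evaluation code:

```python
import json

def parse_judge_verdict(raw: str) -> dict:
    """Parse one judge verdict and validate scores against the 0-3 rubric.

    Hypothetical schema: {"scores": {<sub-dimension>: 0-3, ...},
    "rationale": "..."}. Field names are illustrative.
    """
    verdict = json.loads(raw)
    for sub, score in verdict["scores"].items():
        if score not in (0, 1, 2, 3):  # rubric: Inadequate..Excellent
            raise ValueError(f"out-of-range score for {sub}: {score}")
    return verdict

raw = '{"scores": {"exegesis": 2, "citation_accuracy": 3}, "rationale": "..."}'
verdict = parse_judge_verdict(raw)
```

Rejecting out-of-range values at parse time keeps judge formatting errors from silently contaminating aggregate scores.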
Scoring
- Scale: 0–3 per sub-dimension (0 = Inadequate, 1 = Partial, 2 = Good, 3 = Excellent)
- Normalization: raw scores divided by 3 to produce a 0–1 range
- Weighting: sub-dimension weights within each dimension; dimension weights for the composite
- Difficulty weighting: Easy = 1.0, Medium = 1.5, Hard = 2.0, Expert = 3.0 (easy/medium inactive in v1.0)
- Confidence intervals: bootstrap, 1,000 iterations, 95% confidence
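The normalization, difficulty weighting, and bootstrap steps can be sketched as follows. The item representation is illustrative, and this uses a plain percentile bootstrap; the production pipeline may differ in detail:

```python
import random

DIFFICULTY_WEIGHT = {"easy": 1.0, "medium": 1.5, "hard": 2.0, "expert": 3.0}

def weighted_score(items):
    """Normalize raw 0-3 scores to 0-1 and apply difficulty weights.

    items: list of (raw_score, difficulty) pairs.
    """
    num = sum(DIFFICULTY_WEIGHT[d] * (raw / 3) for raw, d in items)
    den = sum(DIFFICULTY_WEIGHT[d] for _, d in items)
    return num / den

def bootstrap_ci(items, iters=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI over test items (sampling uncertainty only;
    judge error is not captured, per the limitations in Section 7)."""
    rng = random.Random(seed)
    stats = sorted(
        weighted_score([rng.choice(items) for _ in items])
        for _ in range(iters)
    )
    lo = stats[int(alpha / 2 * iters)]
    hi = stats[int((1 - alpha / 2) * iters) - 1]
    return lo, hi

items = [(3, "hard"), (2, "expert"), (1, "hard"), (2, "hard")]
point = weighted_score(items)   # 2/3 for this toy set
ci = bootstrap_ci(items)
```

Resampling whole test items (rather than sub-dimension scores) keeps correlated sub-scores within an item together in each bootstrap replicate.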
Score Interpretation Hierarchy
- Primary: Per-dimension performance profiles
- Secondary: Tradition-specific scores
- Tertiary: Overall composite score
The composite score is useful for leaderboard simplicity but may obscure meaningful dimension-level variation.
6. Tradition Scope
| Tradition | Category Type | Defining Standards |
|---|---|---|
| Reformed | Confessional tradition | Westminster Confession, Heidelberg Catechism, Canons of Dort |
| Catholic | Ecclesial tradition | Catechism of the Catholic Church, conciliar documents, papal encyclicals |
| Orthodox | Ecclesial tradition | Nicene Creed, Ecumenical Councils, Philokalia, Church Fathers |
| Evangelical | Sociological-theological movement | Chicago Statement, Lausanne Covenant, Bebbington Quadrilateral |
Note
These categories are not perfectly parallel analytic units. Catholic and Orthodox are broad ecclesial traditions; Reformed is a confessional position within Protestantism; Evangelical is a cross-denominational movement with fuzzy boundaries. This asymmetry is pragmatic for v1.0 and acknowledged as a limitation. See methodology Section 13.3 for full discussion.
Scope limitations:
- Western Christian focus
- English-language evaluation only
- Non-Western traditions (African, Asian, Latin American) planned for v2.0+
7. Known Limitations
| Limitation | Impact | Severity |
|---|---|---|
| Single LLM judge | All scores reflect one judge model's assessment | High |
| No human calibration data | Cannot quantify alignment with expert judgment | High |
| Self-preference bias | Gemini model scores may be inflated when judged by Gemini | Medium |
| Single provider (OpenRouter) | Scores may differ from direct-API evaluation | Medium |
| Single-run evaluation | Run-to-run variance unquantified | Medium |
| Bootstrap CIs exclude judge error | Reported CIs are narrower than true uncertainty | Medium |
| Designer prior weights | Dimension and difficulty weights not empirically validated | Medium |
| English-only | Misses non-English theological discourse | Low-Medium |
| Western Christian focus | Non-Western traditions unrepresented | Low-Medium |
8. Known Failure Modes
| Failure Mode | Description | Mitigation Status |
|---|---|---|
| Verbosity bias | Judge may reward longer responses regardless of quality | Planned (length-correlation analysis) |
| Style-substance conflation | Judge may score academic style over theological accuracy | Unmitigated |
| Judge tradition bias | Judge may have its own theological leanings that affect scoring | Unmitigated |
| Expert item circularity | Expert questions defined partly by what current LLMs get wrong | Acknowledged; IRT calibration planned |
| Generic-response penalty | Models trained for safety may default to generic answers, receiving low scores even when they "know" the specific answer | Unmitigated |
| Depth-faithfulness conflation | Rubric may penalize accurate but brief responses | Acknowledged; human calibration will assess |
9. Validation Status
| Component | Conceptual | Implemented | Validated |
|---|---|---|---|
| Construct definition (6 dimensions) | — | Yes | Pending expert review |
| Test case corpus (413 cases) | — | Yes | LLM-screened, not human-validated |
| Scoring rubrics with exemplars | — | Yes | Provisional (no IRR data) |
| LLM-as-judge scoring | — | Yes | Operational, not calibrated against humans |
| Bootstrap confidence intervals | — | Yes | Captures sampling uncertainty only |
| Difficulty weighting | — | Yes | Empirically adjusted, not IRT-calibrated |
| Dimension weights | — | Yes | Designer priors, not Delphi-validated |
| Human expert calibration | Yes | In progress | No |
| Cross-validation judge | Yes | No | No |
| IRT difficulty calibration | Yes | No | No |
| Delphi weight derivation | Yes | No | No |
| Bias testing (position/verbosity/tradition) | Yes | Tooling ready | No |
10. Recommended Citation
@misc{faithbench2026,
  title={FaithBench: Toward Tradition-Aware Evaluation of Theological Faithfulness in Large Language Models},
  author={FaithBench Team},
  year={2026},
  url={https://faithbench.com},
  note={Methodology v1.0}
}
11. Contact & Access
| Resource | Access |
|---|---|
| Public test cases (50%) | Sign in at faithbench.com |
| Held-out test cases (50%) | Academic request to hello@faithbench.com |
| Evaluation code | MIT license, open source |
| Methodology | faithbench.com/methodology |
| Worked examples | faithbench.com/methodology/worked-examples |
References
Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86–92.
Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., ... & Gebru, T. (2019). Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* '19), 220–229.