Research Preview: FaithBench is an active research project. All scores are preliminary, generated by a single AI judge, and pending human validation. See our Limitations section for details.

FaithBench Benchmark Card

This document follows the benchmark card format adapted from Gebru et al. (2021) "Datasheets for Datasets" and Mitchell et al. (2019) "Model Cards for Model Reporting." It provides a standardized summary of FaithBench for AI researchers and evaluators.


1. Benchmark Overview

| Field | Value |
|---|---|
| Name | FaithBench |
| Version | v1.0 |
| URL | faithbench.com |
| License | MIT (evaluation code) |
| Maintainer | FaithBench Team |
| Contact | hello@faithbench.com |

One-sentence summary: FaithBench is an initial benchmark framework for evaluating the theological faithfulness of large language models across Christian traditions, using LLM-as-judge scoring with planned human expert calibration.


2. Intended Use

FaithBench is designed for:

  • Comparing theological knowledge across LLMs under controlled conditions (standardized prompts, temperature, token limits)
  • Identifying dimension-specific strengths and weaknesses (e.g., a model may handle historical theology well but struggle with textual analysis)
  • Identifying tradition-specific performance differences (e.g., a model may represent Catholic doctrine more accurately than Reformed doctrine)
  • Informing model selection for theological applications (seminary tools, Bible study apps, pastoral AI)
  • Research into LLM evaluation methodology for subjective, interpretive domains

3. Out-of-Scope Use

FaithBench should not be used for:

  • Certifying theological safety. A high FaithBench score does not mean a model is safe for unsupervised theological use.
  • Measuring spiritual wisdom or pastoral sensitivity. FaithBench measures knowledge accuracy, not wisdom, empathy, or pastoral appropriateness.
  • Making deployment decisions without additional evaluation. FaithBench tests controlled conditions; real-world performance depends on system prompts, RAG, temperature, and user interaction patterns.
  • Evaluating non-Christian traditions. The benchmark covers only Christian theology in v1.0.
  • Comparing models across different benchmark versions. Held-out rotation and methodology changes may affect cross-version comparability.
  • Ranking models with small score differences. Models within each other's confidence intervals should be treated as statistically indistinguishable.

4. Benchmark Composition

| Characteristic | Value |
|---|---|
| Total test cases | 413 |
| Dimensions | 6 (textual, hermeneutical, doctrinal, historical, apologetics, intertextual) |
| Difficulty levels | 4 (easy, medium, hard, expert) |
| Active difficulty levels | 2 (hard, expert; easy/medium deactivated due to ceiling effects) |
| Traditions | 4 (Reformed, Catholic, Orthodox, Evangelical) |
| Question types | 4 (factual recall, comparative analysis, applied reasoning, contested topics) |
| Public/held-out split | 50% / 50% |
| Held-out rotation | Semi-annual |
| Minimum cases per cell | 10 per dimension-difficulty pair |

Active test set composition:

  • Hard questions: ~60% of scored items (Ph.D./faculty-level content)
  • Expert questions: ~40% of scored items (multi-hop reasoning, adversarial design)
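
To make the composition above concrete, the following sketch shows one way a test case and the 10-case-per-cell minimum could be represented and checked. The `TestCase` class and its field names are illustrative assumptions, not the actual FaithBench schema.

```python
from collections import Counter
from dataclasses import dataclass

# Illustrative schema only; the published FaithBench case format may differ.
@dataclass
class TestCase:
    case_id: str
    dimension: str      # one of the 6 dimensions, e.g. "doctrinal"
    difficulty: str     # "easy" | "medium" | "hard" | "expert"
    tradition: str      # "Reformed" | "Catholic" | "Orthodox" | "Evangelical"
    question_type: str  # e.g. "contested topics"
    held_out: bool      # True for the private 50% split

def cells_below_minimum(cases, minimum=10):
    """Return any active dimension-difficulty cell with fewer than `minimum` cases."""
    active = [c for c in cases if c.difficulty in ("hard", "expert")]
    counts = Counter((c.dimension, c.difficulty) for c in active)
    return {cell: n for cell, n in counts.items() if n < minimum}
```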

5. Evaluation Protocol

Test Model Configuration

| Parameter | Value |
|---|---|
| Temperature | 0.7 |
| Max tokens | 2,000 |
| Provider | OpenRouter |
| Reasoning/thinking | Disabled (where configurable) |
| System prompt | Minimal |
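
As a rough illustration of this configuration, the snippet below sketches a test-model call through OpenRouter's OpenAI-compatible endpoint. The client setup, model slug, and prompt text are placeholder assumptions; this is not FaithBench's actual harness code.

```python
from openai import OpenAI

# Assumes OpenRouter's OpenAI-compatible API; the model slug below is a placeholder.
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_OPENROUTER_KEY")

question = "Summarize the Chalcedonian definition in one paragraph."  # illustrative item

response = client.chat.completions.create(
    model="example-provider/example-model",
    messages=[
        {"role": "system", "content": "Answer the question directly."},  # minimal system prompt
        {"role": "user", "content": question},
    ],
    temperature=0.7,
    max_tokens=2000,
)
answer = response.choices[0].message.content
```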

Judge Configuration

| Parameter | Value |
|---|---|
| Primary judge | google/gemini-3-flash-preview |
| Fallback judge | openai/gpt-4o-mini |
| Temperature | 0 |
| Max tokens | 16,000 |
| Output format | Structured JSON |
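
Under the same assumption of an OpenRouter-style, OpenAI-compatible client, a judge call with primary/fallback routing and JSON output might look roughly like this. The `rubric_prompt` content and the use of `response_format` are assumptions; providers differ in how, or whether, they enforce structured output.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_OPENROUTER_KEY")
JUDGE_MODELS = ["google/gemini-3-flash-preview", "openai/gpt-4o-mini"]  # primary, then fallback

def judge(rubric_prompt: str) -> dict:
    """Try each judge in order at temperature 0; return the first parseable JSON verdict."""
    for model in JUDGE_MODELS:
        try:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": rubric_prompt}],
                temperature=0,
                max_tokens=16000,
                response_format={"type": "json_object"},  # structured JSON, where supported
            )
            return json.loads(resp.choices[0].message.content)
        except Exception:
            continue  # fall back to the next judge on API or parse errors
    raise RuntimeError("No judge returned a parseable verdict")
```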

Scoring

  • Scale: 0–3 per sub-dimension (Inadequate / Partial / Good / Excellent)
  • Normalization: Raw scores divided by 3 to produce 0–1 range
  • Weighting: sub-dimension weights combine sub-scores within each dimension; dimension weights combine dimension scores into the composite
  • Difficulty weighting: Easy=1.0, Medium=1.5, Hard=2.0, Expert=3.0
  • Confidence intervals: Bootstrap, 1,000 iterations, 95% confidence
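
The bullets above can be read as a small aggregation pipeline. The sketch below shows one plausible composition of the normalization, difficulty weighting, and percentile bootstrap; only the difficulty multipliers come from this card, while the function names and the equal-tailed percentile method are assumptions.

```python
import random

DIFFICULTY_WEIGHT = {"easy": 1.0, "medium": 1.5, "hard": 2.0, "expert": 3.0}

def item_score(raw_subscores, sub_weights):
    """Normalize 0-3 sub-dimension scores to 0-1, then take their weighted mean."""
    total = sum(w * (s / 3.0) for s, w in zip(raw_subscores, sub_weights))
    return total / sum(sub_weights)

def weighted_mean(items):
    """Difficulty-weighted mean over (score, difficulty) pairs."""
    num = sum(score * DIFFICULTY_WEIGHT[diff] for score, diff in items)
    den = sum(DIFFICULTY_WEIGHT[diff] for _, diff in items)
    return num / den

def bootstrap_ci(items, iters=1000, alpha=0.05):
    """Percentile bootstrap over test items; reflects sampling uncertainty only."""
    means = sorted(
        weighted_mean(random.choices(items, k=len(items))) for _ in range(iters)
    )
    return means[int(iters * alpha / 2)], means[int(iters * (1 - alpha / 2)) - 1]
```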

Score Interpretation Hierarchy

  1. Primary: Per-dimension performance profiles
  2. Secondary: Tradition-specific scores
  3. Tertiary: Overall composite score

The composite score is useful for leaderboard simplicity but may obscure meaningful dimension-level variation.


6. Tradition Scope

| Tradition | Category Type | Defining Standards |
|---|---|---|
| Reformed | Confessional tradition | Westminster Confession, Heidelberg Catechism, Canons of Dort |
| Catholic | Ecclesial tradition | Catechism of the Catholic Church, conciliar documents, papal encyclicals |
| Orthodox | Ecclesial tradition | Nicene Creed, Ecumenical Councils, Philokalia, Church Fathers |
| Evangelical | Sociological-theological movement | Chicago Statement, Lausanne Covenant, Bebbington Quadrilateral |

Scope limitations:

  • Western Christian focus
  • English-language evaluation only
  • Non-Western traditions (African, Asian, Latin American) planned for v2.0+

7. Known Limitations

| Limitation | Impact | Severity |
|---|---|---|
| Single LLM judge | All scores reflect one judge model's assessment | High |
| No human calibration data | Cannot quantify alignment with expert judgment | High |
| Self-preference bias | Gemini model scores may be inflated when judged by Gemini | Medium |
| Single provider (OpenRouter) | Scores may differ from direct-API evaluation | Medium |
| Single-run evaluation | Run-to-run variance unquantified | Medium |
| Bootstrap CIs exclude judge error | Reported CIs are narrower than true uncertainty | Medium |
| Designer prior weights | Dimension and difficulty weights not empirically validated | Medium |
| English-only | Misses non-English theological discourse | Low-Medium |
| Western Christian focus | Non-Western traditions unrepresented | Low-Medium |

8. Known Failure Modes

| Failure Mode | Description | Mitigation Status |
|---|---|---|
| Verbosity bias | Judge may reward longer responses regardless of quality | Planned (length-correlation analysis) |
| Style-substance conflation | Judge may score academic style over theological accuracy | Unmitigated |
| Judge tradition bias | Judge may have its own theological leanings that affect scoring | Unmitigated |
| Expert item circularity | Expert questions defined partly by what current LLMs get wrong | Acknowledged; IRT calibration planned |
| Generic-response penalty | Models trained for safety may default to generic answers, receiving low scores even when they "know" the specific answer | Unmitigated |
| Depth-faithfulness conflation | Rubric may penalize accurate but brief responses | Acknowledged; human calibration will assess |
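
For the planned length-correlation analysis against verbosity bias, a minimal check could look like the sketch below: a rank correlation between response length and judge score. This is only an illustration of the idea, not FaithBench's bias-testing tooling, and a weak correlation would not by itself rule verbosity bias out.

```python
from scipy.stats import spearmanr

def length_score_correlation(responses, scores):
    """Spearman rank correlation between response word count and judge score.

    A strong positive rho suggests the judge may be rewarding length
    rather than substance and warrants closer inspection.
    """
    word_counts = [len(text.split()) for text in responses]
    rho, p_value = spearmanr(word_counts, scores)
    return rho, p_value
```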

9. Validation Status

| Component | Conceptual | Implemented | Validated |
|---|---|---|---|
| Construct definition (6 dimensions) | Yes | Yes | Pending expert review |
| Test case corpus (413 cases) | Yes | Yes | LLM-screened, not human-validated |
| Scoring rubrics with exemplars | Yes | Yes | Provisional (no IRR data) |
| LLM-as-judge scoring | Yes | Yes | Operational, not calibrated against humans |
| Bootstrap confidence intervals | Yes | Yes | Captures sampling uncertainty only |
| Difficulty weighting | Yes | Yes | Empirically adjusted, not IRT-calibrated |
| Dimension weights | Yes | Yes | Designer priors, not Delphi-validated |
| Human expert calibration | Yes | In progress | No |
| Cross-validation judge | Yes | No | No |
| IRT difficulty calibration | Yes | No | No |
| Delphi weight derivation | Yes | No | No |
| Bias testing (position/verbosity/tradition) | Yes | Tooling ready | No |

10. Recommended Citation

@misc{faithbench2026,
  title={FaithBench: Toward Tradition-Aware Evaluation of Theological
         Faithfulness in Large Language Models},
  author={FaithBench Team},
  year={2026},
  url={https://faithbench.com},
  note={Methodology v1.0}
}

11. Contact & Access

| Resource | Access |
|---|---|
| Public test cases (50%) | Sign in at faithbench.com |
| Held-out test cases (50%) | Academic request to hello@faithbench.com |
| Evaluation code | MIT license, open source |
| Methodology | faithbench.com/methodology |
| Worked examples | faithbench.com/methodology/worked-examples |

References

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86–92.

Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., ... & Gebru, T. (2019). Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* '19), 220–229.