Research Preview: FaithBench is an active research project. All scores are preliminary, generated by a single AI judge, and pending human validation. See our Limitations section for details.

FaithBench Benchmark Card

This document follows the benchmark card format adapted from Gebru et al. (2021) "Datasheets for Datasets" and Mitchell et al. (2019) "Model Cards for Model Reporting." It provides a standardized summary of FaithBench for AI researchers and evaluators.


1. Benchmark Overview

| Field | Value |
|---|---|
| Name | FaithBench |
| Version | v1.0 |
| URL | faithbench.com |
| License | MIT (evaluation code) |
| Maintainer | FaithBench Team |
| Contact | hello@faithbench.com |

One-sentence summary: FaithBench is an initial benchmark framework for evaluating the theological faithfulness of large language models across Christian traditions, using LLM-as-judge scoring with planned human expert calibration.


2. Intended Use

FaithBench is designed for:

  • Comparing theological knowledge across LLMs under controlled conditions (standardized prompts, temperature, token limits)
  • Identifying dimension-specific strengths and weaknesses (e.g., a model may handle historical theology well but struggle with textual analysis)
  • Identifying tradition-specific performance differences (e.g., a model may represent Catholic doctrine more accurately than Reformed doctrine)
  • Informing model selection for theological applications (seminary tools, Bible study apps, pastoral AI)
  • Research into LLM evaluation methodology for subjective, interpretive domains

3. Out-of-Scope Use

FaithBench should not be used for:

  • Certifying theological safety. A high FaithBench score does not mean a model is safe for unsupervised theological use.
  • Measuring spiritual wisdom or pastoral sensitivity. FaithBench measures knowledge accuracy, not wisdom, empathy, or pastoral appropriateness.
  • Making deployment decisions without additional evaluation. FaithBench tests controlled conditions; real-world performance depends on system prompts, RAG, temperature, and user interaction patterns.
  • Evaluating non-Christian traditions. The benchmark covers only Christian theology in v1.0.
  • Comparing models across different benchmark versions. Held-out rotation and methodology changes may affect cross-version comparability.
  • Ranking models with small score differences. Models within each other's confidence intervals should be treated as statistically indistinguishable.

4. Benchmark Composition

| Characteristic | Value |
|---|---|
| Total test cases | 413 |
| Dimensions | 6 (textual, hermeneutical, doctrinal, historical, apologetics, intertextual) |
| Difficulty levels | 4 (easy, medium, hard, expert) |
| Active difficulty levels | 2 (hard, expert; easy/medium deactivated due to ceiling effects) |
| Traditions | 4 (Reformed, Catholic, Orthodox, Evangelical) |
| Question types | 4 (factual recall, comparative analysis, applied reasoning, contested topics) |
| Public/held-out split | 50% / 50% |
| Held-out rotation | Semi-annual |
| Minimum cases per cell | 10 per dimension-difficulty pair |

Active test set composition:

  • Hard questions: ~60% of scored items (Ph.D./faculty-level content)
  • Expert questions: ~40% of scored items (multi-hop reasoning, adversarial design)
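
To make the composition above concrete, the following sketch shows one way a test case and the 10-case-per-cell minimum could be represented and checked. The `TestCase` class and its field names are illustrative assumptions, not the actual FaithBench schema.

```python
from collections import Counter
from dataclasses import dataclass

# Illustrative schema only; the published FaithBench case format may differ.
@dataclass
class TestCase:
    case_id: str
    dimension: str      # one of the 6 dimensions, e.g. "doctrinal"
    difficulty: str     # "easy" | "medium" | "hard" | "expert"
    tradition: str      # "Reformed" | "Catholic" | "Orthodox" | "Evangelical"
    question_type: str  # e.g. "contested topics"
    held_out: bool      # True for the private 50% split

def cells_below_minimum(cases, minimum=10):
    """Return any active dimension-difficulty cell with fewer than `minimum` cases."""
    active = [c for c in cases if c.difficulty in ("hard", "expert")]
    counts = Counter((c.dimension, c.difficulty) for c in active)
    return {cell: n for cell, n in counts.items() if n < minimum}
```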

5. Evaluation Protocol

Test Model Configuration

| Parameter | Value |
|---|---|
| Temperature | 0.7 |
| Max tokens | 2,000 |
| Provider | OpenRouter |
| Reasoning/thinking | Disabled (where configurable) |
| System prompt | Minimal |
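
As a rough illustration of this configuration, the snippet below sketches a test-model call through OpenRouter's OpenAI-compatible endpoint. The client setup, model slug, and prompt text are placeholder assumptions; this is not FaithBench's actual harness code.

```python
from openai import OpenAI

# Assumes OpenRouter's OpenAI-compatible API; the model slug below is a placeholder.
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_OPENROUTER_KEY")

question = "Summarize the Chalcedonian definition in one paragraph."  # illustrative item

response = client.chat.completions.create(
    model="example-provider/example-model",
    messages=[
        {"role": "system", "content": "Answer the question directly."},  # minimal system prompt
        {"role": "user", "content": question},
    ],
    temperature=0.7,
    max_tokens=2000,
)
answer = response.choices[0].message.content
```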

Judge Configuration

| Parameter | Value |
|---|---|
| Primary judge | google/gemini-3-flash-preview |
| Fallback judge | openai/gpt-4o-mini |
| Temperature | 0 |
| Max tokens | 16,000 |
| Output format | Structured JSON |
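
Under the same assumption of an OpenRouter-style, OpenAI-compatible client, a judge call with primary/fallback routing and JSON output might look roughly like this. The `rubric_prompt` content and the use of `response_format` are assumptions; providers differ in how, or whether, they enforce structured output.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_OPENROUTER_KEY")
JUDGE_MODELS = ["google/gemini-3-flash-preview", "openai/gpt-4o-mini"]  # primary, then fallback

def judge(rubric_prompt: str) -> dict:
    """Try each judge in order at temperature 0; return the first parseable JSON verdict."""
    for model in JUDGE_MODELS:
        try:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": rubric_prompt}],
                temperature=0,
                max_tokens=16000,
                response_format={"type": "json_object"},  # structured JSON, where supported
            )
            return json.loads(resp.choices[0].message.content)
        except Exception:
            continue  # fall back to the next judge on API or parse errors
    raise RuntimeError("No judge returned a parseable verdict")
```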

Scoring

  • Scale: 0–3 per sub-dimension (Inadequate / Partial / Good / Excellent)
  • Normalization: Raw scores divided by 3 to produce 0–1 range
  • Weighting: sub-dimension weights combine sub-scores within each dimension; dimension weights combine dimension scores into the composite
  • Difficulty weighting: Easy=1.0, Medium=1.5, Hard=2.0, Expert=3.0
  • Confidence intervals: Bootstrap, 1,000 iterations, 95% confidence
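
The bullets above can be read as a small aggregation pipeline. The sketch below shows one plausible composition of the normalization, difficulty weighting, and percentile bootstrap; only the difficulty multipliers come from this card, while the function names and the equal-tailed percentile method are assumptions.

```python
import random

DIFFICULTY_WEIGHT = {"easy": 1.0, "medium": 1.5, "hard": 2.0, "expert": 3.0}

def item_score(raw_subscores, sub_weights):
    """Normalize 0-3 sub-dimension scores to 0-1, then take their weighted mean."""
    total = sum(w * (s / 3.0) for s, w in zip(raw_subscores, sub_weights))
    return total / sum(sub_weights)

def weighted_mean(items):
    """Difficulty-weighted mean over (score, difficulty) pairs."""
    num = sum(score * DIFFICULTY_WEIGHT[diff] for score, diff in items)
    den = sum(DIFFICULTY_WEIGHT[diff] for _, diff in items)
    return num / den

def bootstrap_ci(items, iters=1000, alpha=0.05):
    """Percentile bootstrap over test items; reflects sampling uncertainty only."""
    means = sorted(
        weighted_mean(random.choices(items, k=len(items))) for _ in range(iters)
    )
    return means[int(iters * alpha / 2)], means[int(iters * (1 - alpha / 2)) - 1]
```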

Score Interpretation Hierarchy

  1. Primary: Per-dimension performance profiles
  2. Secondary: Tradition-specific scores
  3. Tertiary: Overall composite score

The composite score is useful for leaderboard simplicity but may obscure meaningful dimension-level variation.


6. Tradition Scope

| Tradition | Category Type | Defining Standards |
|---|---|---|
| Reformed | Confessional tradition | Westminster Confession, Heidelberg Catechism, Canons of Dort |
| Catholic | Ecclesial tradition | Catechism of the Catholic Church, conciliar documents, papal encyclicals |
| Orthodox | Ecclesial tradition | Nicene Creed, Ecumenical Councils, Philokalia, Church Fathers |
| Evangelical | Sociological-theological movement | Chicago Statement, Lausanne Covenant, Bebbington Quadrilateral |

Scope limitations:

  • Western Christian focus
  • English-language evaluation only
  • Non-Western traditions (African, Asian, Latin American) planned for v2.0+

7. Known Limitations

| Limitation | Impact | Severity |
|---|---|---|
| Single LLM judge | All scores reflect one judge model's assessment | High |
| No human calibration data | Cannot quantify alignment with expert judgment | High |
| Self-preference bias | Gemini model scores may be inflated when judged by Gemini | Medium |
| Single provider (OpenRouter) | Scores may differ from direct-API evaluation | Medium |
| Single-run evaluation | Run-to-run variance unquantified | Medium |
| Bootstrap CIs exclude judge error | Reported CIs are narrower than true uncertainty | Medium |
| Designer prior weights | Dimension and difficulty weights not empirically validated | Medium |
| English-only | Misses non-English theological discourse | Low-Medium |
| Western Christian focus | Non-Western traditions unrepresented | Low-Medium |

8. Known Failure Modes

| Failure Mode | Description | Mitigation Status |
|---|---|---|
| Verbosity bias | Judge may reward longer responses regardless of quality | Planned (length-correlation analysis) |
| Style-substance conflation | Judge may score academic style over theological accuracy | Unmitigated |
| Judge tradition bias | Judge may have its own theological leanings that affect scoring | Unmitigated |
| Expert item circularity | Expert questions defined partly by what current LLMs get wrong | Acknowledged; IRT calibration planned |
| Generic-response penalty | Models trained for safety may default to generic answers, receiving low scores even when they "know" the specific answer | Unmitigated |
| Depth-faithfulness conflation | Rubric may penalize accurate but brief responses | Acknowledged; human calibration will assess |
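
For the planned length-correlation analysis against verbosity bias, a minimal check could look like the sketch below: a rank correlation between response length and judge score. This is only an illustration of the idea, not FaithBench's bias-testing tooling, and a weak correlation would not by itself rule verbosity bias out.

```python
from scipy.stats import spearmanr

def length_score_correlation(responses, scores):
    """Spearman rank correlation between response word count and judge score.

    A strong positive rho suggests the judge may be rewarding length
    rather than substance and warrants closer inspection.
    """
    word_counts = [len(text.split()) for text in responses]
    rho, p_value = spearmanr(word_counts, scores)
    return rho, p_value
```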

9. Validation Status

| Component | Conceptual | Implemented | Validated |
|---|---|---|---|
| Construct definition (6 dimensions) | Yes | Yes | Pending expert review |
| Test case corpus (413 cases) | Yes | Yes | LLM-screened, not human-validated |
| Scoring rubrics with exemplars | Yes | Yes | Provisional (no IRR data) |
| LLM-as-judge scoring | Yes | Yes | Operational, not calibrated against humans |
| Bootstrap confidence intervals | Yes | Yes | Captures sampling uncertainty only |
| Difficulty weighting | Yes | Yes | Empirically adjusted, not IRT-calibrated |
| Dimension weights | Yes | Yes | Designer priors, not Delphi-validated |
| Human expert calibration | Yes | In progress | No |
| Cross-validation judge | Yes | No | No |
| IRT difficulty calibration | Yes | No | No |
| Delphi weight derivation | Yes | No | No |
| Bias testing (position/verbosity/tradition) | Yes | Tooling ready | No |

10. Recommended Citation

@misc{faithbench2026,
  title={FaithBench: Toward Tradition-Aware Evaluation of Theological
         Faithfulness in Large Language Models},
  author={FaithBench Team},
  year={2026},
  url={https://faithbench.com},
  note={Methodology v1.0}
}

11. Contact & Access

| Resource | Access |
|---|---|
| Public test cases (50%) | Sign in at faithbench.com |
| Held-out test cases (50%) | Academic request to hello@faithbench.com |
| Evaluation code | MIT license, open source |
| Methodology | faithbench.com/methodology |
| Worked examples | faithbench.com/methodology/worked-examples |

References

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86–92.

Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., ... & Gebru, T. (2019). Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* '19), 220–229.