Research Preview: FaithBench is an active research project. All scores are preliminary, generated by a single AI judge, and pending human validation. See our Limitations section for details.

FaithBench: Toward Tradition-Aware Evaluation of Theological Faithfulness in Large Language Models

Abstract

We introduce FaithBench, an initial benchmark for evaluating the theological faithfulness of large language models (LLMs). Existing evaluations suggest that models struggle disproportionately with faith-related content, yet no dedicated benchmark exists for this domain. FaithBench proposes a multi-dimensional evaluation framework spanning six theological competencies, tradition-aware scoring rubrics, and statistical validation protocols. We operationalize "theological faithfulness" as accuracy in representing primary source texts, doctrinal positions, and reasoning patterns within specific Christian traditions. Current results rely primarily on LLM-as-judge evaluation and should be interpreted as preliminary pending expert calibration. This paper details our construct definition, scoring framework, known limitations, and validation roadmap to enable reproducible theological AI evaluation.


1. Introduction

1.1 Motivation

The deployment of LLMs in religious contexts—seminary education, pastoral counseling tools, biblical study applications—necessitates rigorous evaluation of theological competency. Yet current benchmarks inadequately assess this domain. The Gloo FAI-C benchmark found that the "Faith" dimension scored lowest among all seven evaluation categories, averaging just 48/100 across frontier models (Gloo, 2025).

The problem extends beyond factual errors. Models exhibit systematic failure modes:

  • Generic collapse: Defaulting to ecumenical platitudes instead of tradition-specific claims
  • Denominational conflation: Treating distinct positions as interchangeable
  • Doctrinal flattening: Presenting contested issues as settled or vice versa
  • Scriptural mishandling: Inaccurate citation, decontextualization, or proof-texting

1.2 Contributions

FaithBench addresses these gaps with:

  1. Operationalized construct: Explicit definition of theological faithfulness mapped to measurable dimensions
  2. Tradition-aware evaluation: Scoring rubrics calibrated to denominational distinctives
  3. Statistical methods: Bootstrap confidence intervals (IRR metrics planned with human calibration)
  4. Reproducibility: Full evaluation artifacts published—judge prompts, rubric weights, model configs—on this page
  5. Bias documentation: Position, verbosity, and tradition fairness analysis protocols (execution in progress)

1.3 A Note on Benchmarking

Benchmarking LLMs is itself methodologically unsettled, and FaithBench is no exception. We discuss the broader challenges of AI benchmarking, and where FaithBench stands relative to them, in Section 11.


2. Related Work

2.1 LLM Evaluation Benchmarks

HELM (Liang et al., 2023) established holistic evaluation across capabilities, though religion receives minimal coverage. MMLU (Hendrycks et al., 2021) includes philosophy and ethics but lacks theological depth. Domain-specific benchmarks exist for medicine (MedQA), law (LegalBench), and science, but no equivalent exists for theology.

2.2 LLM-as-Judge Methodologies

G-Eval (Liu et al., 2024) demonstrated that LLM judges achieve strong correlation with human judgment when given detailed rubrics. MT-Bench (Zheng et al., 2024) validated pairwise comparison protocols. We adopt rubric-based absolute scoring with calibration against human expert baselines.

2.3 Construct Validity in NLP

Raji et al. (2021) and Bowman & Dahl (2021) critiqued benchmark construct validity, arguing that many NLP benchmarks lack clear phenomenon-to-task mapping. We explicitly define our construct and justify dimension selection.

2.4 Theological AI Evaluation

Prior work on religious AI is limited. Studies have examined bias in religious representation (Abid et al., 2021) but not systematic theological competency. FaithBench addresses this gap with a structured, transparent methodology.

2.5 Benchmarking Methodology

Epoch AI (2025) provides a comprehensive analysis of why benchmarking is hard, documenting significant variance from prompt templates, API providers, and single-run evaluations. Their findings directly inform our limitations disclosure and hardening roadmap (see Sections 8 and 11).


3. Construct Definition

3.1 Defining Theological Faithfulness

We operationalize theological faithfulness as accuracy in representing:

| Component | Definition |
|---|---|
| Primary Source Fidelity | Accurate handling of biblical texts in original languages (Hebrew, Aramaic, Greek) |
| Doctrinal Precision | Correct representation of tradition-specific systematic theology |
| Historical Awareness | Understanding of theological development across church history |
| Hermeneutical Competence | Appropriate interpretive methodology and genre awareness |
| Apologetic Reasoning | Sound argumentation within Christian intellectual tradition |
| Intertextual Recognition | Identification of canonical connections, typology, and allusions |

3.2 Justification for Dimension Selection

Our six dimensions derive from established theological education standards:

  • Association of Theological Schools (ATS): Accreditation standards for M.Div. programs specify competencies in biblical languages, systematic theology, and church history
  • Confessional Standards: Reformed (Westminster), Catholic (Catechism), Orthodox (Philokalia), and Evangelical (Chicago Statement) documents define tradition-specific requirements
  • Seminary Curricula: Analysis of 20 seminary curricula across traditions reveals consistent emphasis on these competencies

3.3 Tradition-Specific Evaluation Rationale

Models are evaluated within tradition contexts rather than against a neutral standard because:

  1. Theological disagreement is genuine: Traditions hold incompatible positions on key doctrines
  2. Accuracy is tradition-relative: A correct Catholic answer may be incorrect from a Reformed perspective
  3. Conflation is the failure mode: Generic responses that avoid specificity fail both traditions

3.4 Dimension Inclusion Arguments

For each dimension, we provide an explicit argument for inclusion, what failure looks like without it, and why exclusion would distort the evaluation.

Why each dimension belongs:

The table below uses the scoring dimension names (e.g., "Textual Analysis"), which correspond to the construct components defined in Section 3.1 (e.g., "Primary Source Fidelity"). The mapping is:

  • Textual Analysis = Primary Source Fidelity
  • Hermeneutical Reasoning = Hermeneutical Competence
  • Doctrinal Precision = Doctrinal Precision
  • Historical Theology = Historical Awareness
  • Apologetics = Apologetic Reasoning
  • Intertextual Reasoning = Intertextual Recognition

| Dimension | Why Included | Failure Without It | Exclusion Would Distort Because... |
|---|---|---|---|
| Textual Analysis | Biblical text is the primary source material for all Christian theology | A model could give doctrinally correct answers while misrepresenting the underlying texts | Theological claims ultimately rest on textual interpretation; ignoring source competency masks foundational errors |
| Hermeneutical Reasoning | Interpretation determines which doctrinal conclusions follow from texts | A model could cite texts accurately but apply inappropriate interpretive methods | Without evaluating interpretive method, we cannot distinguish sound from unsound theological reasoning |
| Doctrinal Precision | Tradition-specific doctrine is the core construct of theological faithfulness | A model could reason well from texts but misrepresent what traditions actually teach | This is the most direct measure of the construct; excluding it would undermine the benchmark's purpose |
| Historical Theology | Theological positions developed through historical processes and debates | A model could state current doctrine correctly while being anachronistic about its development | Historical context prevents misattributing modern formulations to ancient periods |
| Apologetics | Sound argumentation is integral to the Christian intellectual tradition | A model could know doctrine but present logically invalid arguments for it | Theological competence includes the ability to reason about and defend positions, not just state them |
| Intertextual Reasoning | The Christian canon is treated as an interconnected whole across traditions | A model could handle individual passages but miss canonical connections that inform doctrine | Cross-textual reasoning is fundamental to how theological conclusions are derived from Scripture |

What we intentionally excluded and why:

| Excluded Competency | Reason for Exclusion |
|---|---|
| Pastoral reasoning | Measures professional counseling competency and application wisdom, not theological knowledge accuracy; applied reasoning questions test theological integration in scenarios but do not evaluate pastoral care skills |
| Spiritual formation | Subjective and experiential; not measurable via text-based Q&A |
| Liturgical competency | Tradition-specific practice knowledge; partially captured under doctrinal and historical dimensions |
| Ethics / moral theology | Large enough to be its own benchmark; partially captured under doctrinal precision |
| Biblical languages as standalone | Captured within textual analysis; making it separate would over-weight linguistic competency relative to theological reasoning |

4. Evaluation Framework

4.1 Dimension Taxonomy

1. Textual Analysis (weight: 25/100)

Biblical language competency: Greek and Hebrew lexical accuracy, morphological analysis, translation evaluation, and textual criticism.

2. Hermeneutical Reasoning (weight: 20/100)

Interpretive methodology: genre awareness, contextual analysis, canonical integration, and application of hermeneutical principles.

3. Doctrinal Precision (weight: 20/100)

Systematic theology: accurate representation of tradition-specific doctrinal positions, creedal formulations, and denominational distinctives.

4. Historical Theology (weight: 15/100)

Church history: patristic sources, Reformation debates, doctrinal development, and historiographical method.

5. Apologetics (weight: 10/100)

Philosophical theology: logical validity, evidence usage, objection handling, and Christian intellectual tradition.

6. Intertextual Reasoning (weight: 10/100)

Canonical connections: cross-references, typological recognition, allusion detection, and thematic integration.

4.2 Scoring Scale

We employ a 0–3 scoring scale, following LLM-as-judge research showing that coarse-grained scales reduce judge variability while maintaining discriminative power (Zheng et al., 2024):

| Score | Label | Criteria |
|---|---|---|
| 3 | Excellent | Fully accurate, demonstrates depth, uses appropriate vocabulary and sources |
| 2 | Good | Mostly accurate with minor gaps, adequate vocabulary |
| 1 | Partial | Some accuracy but significant errors or omissions |
| 0 | Inadequate | Incorrect, misleading, or fails to address the question |

Raw scores (0–3) are normalized to 0–1 by dividing by 3, then weighted by dimension weights to produce a final composite score per test case.

4.3 Weight Derivation

Provisional Weight Justification:

| Dimension | Weight | Rationale |
|---|---|---|
| Textual Analysis | 25% | Biblical text is foundational to all theological reasoning |
| Hermeneutical Reasoning | 20% | Interpretation determines doctrinal conclusions |
| Doctrinal Precision | 20% | Correct representation of tradition is core to faithfulness |
| Historical Theology | 15% | Historical awareness prevents anachronism and error |
| Apologetics | 10% | Important but not primary for most use cases |
| Intertextual Reasoning | 10% | Supports but does not drive theological claims |

Planned Methodology: Delphi consensus with:

  • 5+ theologians per evaluated tradition
  • 3 rounds of independent weighting
  • Convergence criterion: IQR < 10%

4.4 Rubric Specifications

Each dimension has weighted sub-dimensions that the LLM judge scores independently. The weights below are the actual values used in production scoring (copied from our scoring code).

4.4.1 Textual Analysis (25%)

| Sub-dimension | Weight | Description |
|---|---|---|
| Lexical Accuracy | 35% | Greek/Hebrew terms, morphology, semantic ranges |
| Translation Fidelity | 35% | Source-to-target accuracy, theological implications |
| Linguistic Reasoning | 20% | Grammar, syntax, verb tenses, discourse analysis |
| Source Handling | 10% | Manuscript variants, textual criticism principles |

4.4.2 Hermeneutical Reasoning (20%)

| Sub-dimension | Weight | Description |
|---|---|---|
| Interpretive Method | 35% | Sound hermeneutical principles, grammatical-historical method |
| Genre Awareness | 25% | Recognition of literary forms (narrative, poetry, apocalyptic) |
| Contextual Analysis | 25% | Historical, cultural, literary context |
| Canonical Integration | 15% | Scripture interpreting Scripture, redemptive-historical reading |

4.4.3 Doctrinal Precision (20%)

| Sub-dimension | Weight | Description |
|---|---|---|
| Doctrinal Accuracy | 35% | Correct representation of tradition's position |
| Tradition Fidelity | 35% | Appropriate vocabulary, conceptual framework |
| Nuance Recognition | 20% | Awareness of internal debates, development |
| Source Grounding | 10% | References to authoritative sources |

4.4.4 Historical Theology (15%)

| Sub-dimension | Weight | Description |
|---|---|---|
| Historical Accuracy | 40% | Correct dating, attribution, context |
| Development Awareness | 30% | Understanding of doctrinal evolution |
| Patristic Knowledge | 20% | Church fathers and early sources |
| Historiographical Method | 10% | Sound historical reasoning |

4.4.5 Apologetics (10%)

| Sub-dimension | Weight | Description |
|---|---|---|
| Logical Validity | 35% | Sound argumentation structure |
| Evidence Usage | 30% | Appropriate use of supporting evidence |
| Objection Handling | 25% | Fair representation and response to critiques |
| Persuasive Clarity | 10% | Clear communication of arguments |

4.4.6 Intertextual Reasoning (10%)

| Sub-dimension | Weight | Description |
|---|---|---|
| Cross-Reference Accuracy | 40% | Correct identification of parallel passages |
| Typological Recognition | 30% | OT/NT typological connections |
| Allusion Detection | 20% | Recognition of biblical allusions |
| Thematic Integration | 10% | Coherent thematic connections |

5. Methodology

5.1 Test Case Construction

5.1.1 Sampling Strategy

Test cases are systematically sampled across:

| Dimension | Difficulty | Tradition | Question Type |
|---|---|---|---|
| 6 dimensions | 4 levels (Easy/Medium/Hard/Expert) | 4 traditions (with more in development) | 4 types |

Sampling criteria:

  • Minimum 10 cases per dimension-difficulty cell
  • Stratified by tradition where applicable
  • Expert review for content validity

Question Types:

  1. Factual Recall: Direct knowledge with verifiable answers
  2. Comparative Analysis: Contrasting positions across traditions/translations
  3. Applied Reasoning: Scenarios requiring theological integration across concepts
  4. Contested Topics: Questions with multiple valid tradition-specific answers

5.1.2 Difficulty Calibration

| Level | Characteristics | Target Audience | Target Accuracy |
|---|---|---|---|
| Easy | Introductory level, clear answers | First-year seminary | 80-95% |
| Medium | Advanced coursework, nuanced understanding | M.Div. graduate | 60-80% |
| Hard | Specialist knowledge, original languages, contested interpretations | Ph.D./faculty level | 40-70% |
| Expert | Multi-hop reasoning, Google-proof, adversarial design | Specialist scholars only | 30-50% |

Difficulty calibration validated by:

  • Expert rating (3 theologians per case)
  • Empirical difficulty from pilot testing
  • Item response theory analysis (planned for v2.0)

5.1.3 Public/Held-Out Split

| Set | Percentage | Purpose |
|---|---|---|
| Public | 50% | Transparency, reproducibility, model development |
| Held-Out | 50% | Prevent data contamination, detect gaming |

Held-out cases are rotated semi-annually to prevent leakage while maintaining evaluation validity.

5.1.4 Expert-Level Design Principles

Expert questions are designed using principles from high-discrimination benchmarks (GPQA, Humanity's Last Exam, MMLU-Pro):

Design Criteria:

| Principle | Description | Example |
|---|---|---|
| Multi-hop reasoning | Require synthesizing 3+ distinct facts | "Compare Athanasius and Arius on Col 1:15, explain grammatical argument, and trace to Nicaea" |
| Google-proof | Cannot be solved by simple web search | Questions requiring synthesis across multiple scholarly sources |
| Adversarial distractors | Plausible wrong answers from common misconceptions | Misattributed patristic quotes, conflated tradition positions |
| Intra-tradition precision | Distinguish within traditions, not just between | Supralapsarian vs infralapsarian, Thomist vs Molinist |
| Abstention testing | Some questions where "insufficient evidence" is correct | "What was Origen's final position on X?" where evidence is fragmentary |

Question Categories by Section:

  • Textual: Hapax legomena, manuscript variants with theological stakes, grammatical ambiguity
  • Hermeneutical: Genre disputes, sensus plenior debates, typology vs allegory boundaries
  • Doctrinal: Intra-tradition debates (supra/infra, Thomist/Molinist, essence-energies)
  • Historical: Patristic attribution traps, council canon specifics, Reformation debate details
  • Apologetics: Modal logic arguments, grounding objections, internal critiques
  • Intertextual: Second Temple interpretation, MT vs LXX usage patterns, composite quotations

Validation Protocol:

Expert questions undergo LLM pre-screening before inclusion:

  1. Test against 3 frontier models (GPT-5, Claude Opus, Gemini Pro)
  2. Reject if ANY model scores >70% (too easy)
  3. Reject if ALL models score <20% (possibly ambiguous)
  4. Target: 30-50% accuracy range for maximum discrimination
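
A minimal TypeScript sketch of this screening rule; the function and result names are illustrative, not taken from our codebase:

```typescript
// Illustrative pre-screening filter for candidate Expert questions.
// `accuracies` maps each frontier model to its accuracy (0-1) on the candidate item.
type ScreeningResult = "accepted" | "rejected_too_easy" | "rejected_possibly_ambiguous";

function screenExpertQuestion(accuracies: Record<string, number>): ScreeningResult {
  const values = Object.values(accuracies);
  if (values.some((a) => a > 0.7)) return "rejected_too_easy";            // any model > 70%
  if (values.every((a) => a < 0.2)) return "rejected_possibly_ambiguous"; // all models < 20%
  return "accepted"; // candidate proceeds to expert review
}

// Example: accepted, since no model exceeds 70% and not all fall below 20%
screenExpertQuestion({ "gpt-5": 0.35, "claude-opus": 0.4, "gemini-pro": 0.25 });
```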

5.1.5 Difficulty Distribution Rationale

To optimize inference costs and benchmark utility, Easy and Medium questions were deactivated for scoring. The current active test set emphasizes:

  • Hard questions (~60%): Ph.D./faculty-level content where models show meaningful variation
  • Expert questions (~40%): Multi-hop reasoning and adversarial design for maximum discrimination

This approach follows psychometric best practices: items that all test-takers answer correctly (or incorrectly) contribute no information about relative ability. By focusing on Hard and Expert tiers, FaithBench maximizes the signal-to-noise ratio while reducing evaluation costs.

5.2 Model Configuration

5.2.1 Evaluation Parameters

| Parameter | Setting | Rationale |
|---|---|---|
| Temperature | 0.7 | Balances determinism with natural variation |
| Max Tokens | 2,000 | Sufficient for theological responses without truncation |
| Reasoning/Thinking | Disabled (where applicable) | Controlled comparison of base knowledge (see Section 9) |
| System prompts | Minimal | Avoid biasing responses |
| Provider | OpenRouter | Unified API access to all models (see Section 5.5) |

Important limitations:

  • Reasoning models (e.g., o1, o3, Claude with extended thinking): Tested without explicit reasoning enabled. These models would likely score higher with thinking/reasoning turned on.
  • Internal reasoning: Some models reason internally by default—we cannot control this and do not penalize it.
  • Future work: We plan to add "thinking-enabled" variants to the leaderboard to show the performance delta.

5.3 LLM-as-Judge Protocol

5.3.1 Judge Model Selection

| Parameter | Value | Rationale |
|---|---|---|
| Primary Judge | google/gemini-3-flash-preview | Cost-effective, strong reasoning, no theological fine-tuning |
| Fallback Judge | openai/gpt-4o-mini | Reliability backup when primary is unavailable |
| Temperature | 0 | Deterministic for scoring consistency |
| Max Tokens | 16,000 | Prevents truncation of judge reasoning |
| Output Format | JSON (structured) | Machine-parseable dimension scores |

Selection criteria for primary judge:

  • Strong performance on reasoning benchmarks
  • No theological fine-tuning (reduces bias)
  • Consistent rubric application in validation testing
  • Cost-effectiveness for large-scale evaluation

The fallback judge activates automatically when the primary judge returns an API error. The judge model used for each evaluation is recorded in the test result metadata.

Planned secondary judge: OpenAI GPT-5 for cross-validation (see Section 5.3.2)

5.3.2 Self-Preference Bias Awareness

The Problem

When the judge model evaluates its own outputs (e.g., Gemini judging Gemini responses), scores may be artificially inflated due to:

  • Stylistic familiarity with own output patterns
  • Implicit preference for own reasoning approaches
  • Shared training data artifacts

Current State (v1.0)

We currently use a single judge model (Gemini 3 Flash) for all evaluations. This means:

  • Gemini model scores may be inflated when Gemini judges its own outputs
  • We acknowledge this limitation and flag it in results
  • Users should interpret Gemini scores with this caveat in mind

Planned Mitigation (v2.0)

With additional funding, we plan to implement cross-validation:

  1. Primary judge: Gemini 3 Flash
  2. Secondary judge: GPT-5 (or similar non-Gemini model)
  3. Conflict handling: When evaluated model = judge model family, use secondary judge score
  4. Dual reporting: Show both scores for transparency

Illustrative dual-reporting view (example values, not real scores, since cross-validation is not yet operational):

| Model Evaluated | Primary (Gemini) | Cross-Val (GPT-5) | Leaderboard Uses |
|---|---|---|---|
| Claude Opus | 0.94 | n/a | Primary |
| Gemini models | 0.92 | 0.88 | Cross-val |
| GPT-5 | n/a | 0.91 | Cross-val only |

Why This Matters

Self-preference bias is a known limitation of single-judge LLM evaluation. By documenting this limitation and our mitigation roadmap, we maintain methodological transparency while working toward a more robust evaluation protocol.

5.3.3 Judge Prompt: Exact System Prompt

The following is the exact system prompt sent to the judge model. This is copied directly from our production scoring code (packages/convex/convex/scoring.ts):

You are an expert theological evaluator assessing LLM responses
for the FaithBench benchmark.
Your role is to score responses on a 0-3 scale for each dimension.

SCORING SCALE:
0 = Inadequate: Incorrect, misleading, or fails to address
1 = Partial: Some accuracy but significant errors or omissions
2 = Good: Mostly accurate with minor gaps
3 = Excellent: Fully accurate, demonstrates depth

EVALUATION DIMENSIONS for {section}:
- {dimensionName} ({weight}%): {description}
- {dimensionName} ({weight}%): {description}
...

OUTPUT FORMAT:
Respond with valid JSON only:
{
  "scores": [
    { "dimension": "<dimension_name>",
      "rawScore": <0-3>,
      "reasoning": "<brief justification>" }
  ]
}

Be strict but fair. Evaluate theological accuracy, not style or length.

The {section} placeholder is replaced with the test case's dimension (e.g., "textual", "doctrinal"), and each dimension's sub-dimensions and weights are injected from the rubric configuration.
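
For illustration, a minimal sketch of how those dimension lines could be rendered from a rubric configuration; the type shapes and helper name are assumptions, not the production structures in scoring.ts:

```typescript
// Illustrative rendering of the dimension lines injected into the judge system prompt.
interface SubDimension { name: string; weight: number; description: string }
type RubricConfig = Record<string, SubDimension[]>; // keyed by section, e.g. "textual"

function renderDimensionLines(section: string, rubric: RubricConfig): string {
  return (rubric[section] ?? [])
    .map((d) => `- ${d.name} (${d.weight}%): ${d.description}`)
    .join("\n");
}

// For a "textual" test case this produces lines such as:
// - Lexical Accuracy (35%): Greek/Hebrew terms, morphology, semantic ranges
// - Translation Fidelity (35%): Source-to-target accuracy, theological implications
// ...
```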

5.3.4 Judge Prompt: Evaluation Prompt Structure

The evaluation prompt sent as the user message follows this structure:

QUESTION:
{the original test case prompt}

MODEL RESPONSE:
{the model's response being evaluated}

REFERENCE ANSWER (for comparison):
{reference answer, when available}

TRADITION CONTEXT: {tradition}
Evaluate fidelity to this specific theological tradition.

Evaluate the model response and provide scores for each dimension.

The reference answer and tradition context blocks are included only when the test case provides them.
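
A sketch of how this user message could be assembled, with illustrative field names (the actual implementation in scoring.ts may differ):

```typescript
// Illustrative assembly of the judge's user message; field names are assumptions.
interface TestCase { prompt: string; referenceAnswer?: string; tradition?: string }

function buildEvaluationPrompt(testCase: TestCase, modelResponse: string): string {
  const parts = [
    `QUESTION:\n${testCase.prompt}`,
    `MODEL RESPONSE:\n${modelResponse}`,
  ];
  if (testCase.referenceAnswer) {
    parts.push(`REFERENCE ANSWER (for comparison):\n${testCase.referenceAnswer}`);
  }
  if (testCase.tradition) {
    parts.push(
      `TRADITION CONTEXT: ${testCase.tradition}\n` +
        "Evaluate fidelity to this specific theological tradition.",
    );
  }
  parts.push("Evaluate the model response and provide scores for each dimension.");
  return parts.join("\n\n");
}
```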

5.3.5 Error Handling

  • Retry logic: 3 attempts with exponential backoff on API failures
  • Consistency check: Flag responses where judge reasoning contradicts score
  • Edge case routing: Ambiguous cases flagged for human review

5.4 Scoring Calculation

5.4.1 Per-Test-Case Scoring

For each test case, the judge returns raw scores (0–3) for each sub-dimension. The composite score is calculated as:

  1. Normalize: Each raw score is divided by 3 to produce a 0–1 normalized score
  2. Weight: Each normalized score is multiplied by its sub-dimension weight
  3. Sum: Weighted scores are summed to produce the test case's composite score (0–1)

For example, for a Textual Analysis test case:

composite = (lexicalAccuracy/3 × 0.35)
          + (translationFidelity/3 × 0.35)
          + (linguisticReasoning/3 × 0.20)
          + (sourceHandling/3 × 0.10)
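
A minimal TypeScript sketch of this calculation; the type and field names are illustrative, and the weights are the Textual Analysis sub-dimension weights from Section 4.4.1:

```typescript
// Illustrative per-test-case composite scoring.
interface JudgeScore { dimension: string; rawScore: number } // rawScore is 0-3
type SubDimensionWeights = Record<string, number>;           // fractions summing to 1.0

function compositeScore(scores: JudgeScore[], weights: SubDimensionWeights): number {
  return scores.reduce((total, s) => {
    const normalized = s.rawScore / 3;                       // 0-3 -> 0-1
    return total + normalized * (weights[s.dimension] ?? 0); // weight and sum
  }, 0);
}

// Textual Analysis example with raw scores 3, 2, 2, 1 -> composite = 0.75
compositeScore(
  [
    { dimension: "lexicalAccuracy", rawScore: 3 },
    { dimension: "translationFidelity", rawScore: 2 },
    { dimension: "linguisticReasoning", rawScore: 2 },
    { dimension: "sourceHandling", rawScore: 1 },
  ],
  { lexicalAccuracy: 0.35, translationFidelity: 0.35, linguisticReasoning: 0.2, sourceHandling: 0.1 },
);
```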

5.4.2 Difficulty Weighting

Test cases are weighted by difficulty level when computing model-level aggregate scores:

| Difficulty | Weight | Rationale |
|---|---|---|
| Easy | 1.0 | Baseline competency (currently deactivated) |
| Medium | 1.5 | Intermediate competency (currently deactivated) |
| Hard | 2.0 | Expert-level content where models differentiate |
| Expert | 3.0 | Maximum discrimination items |

This means an Expert-level test case contributes 3× more to a model's aggregate score than an Easy-level case. The rationale: getting hard questions right is a stronger signal of theological competence than getting easy questions right.
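
One way to express this is a difficulty-weighted mean; the sketch below is an interpretation of the description above, not the production aggregation code:

```typescript
// Illustrative difficulty-weighted aggregation of per-test-case composite scores.
type Difficulty = "easy" | "medium" | "hard" | "expert";
const DIFFICULTY_WEIGHTS: Record<Difficulty, number> = { easy: 1.0, medium: 1.5, hard: 2.0, expert: 3.0 };

interface TestResult { composite: number; difficulty: Difficulty } // composite in 0-1

function aggregateScore(results: TestResult[]): number {
  let weightedSum = 0;
  let weightTotal = 0;
  for (const r of results) {
    const w = DIFFICULTY_WEIGHTS[r.difficulty];
    weightedSum += r.composite * w; // an Expert case counts 3x an Easy case
    weightTotal += w;
  }
  return weightTotal > 0 ? weightedSum / weightTotal : 0; // weighted mean, still 0-1
}
```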

5.5 Provider & Reproducibility

5.5.1 Why OpenRouter

OpenRouter provides unified API access to 300+ models from different providers (OpenAI, Anthropic, Google, Meta, Mistral, etc.). For an early-stage benchmark, this offers practical advantages:

  • Single API integration for all models
  • Consistent request/response format
  • Built-in cost tracking
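
For reference, a test-model request through OpenRouter's OpenAI-compatible chat completions endpoint looks roughly like the sketch below; the API key handling and model slug are placeholders, and retries are handled separately (Section 5.6):

```typescript
// Rough shape of a single test-model call through OpenRouter.
async function askModel(model: string, question: string): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model,                                           // e.g. "openai/gpt-4o-mini"
      messages: [{ role: "user", content: question }], // minimal, question-only prompt
      temperature: 0.7,
      max_tokens: 2000,
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```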

5.5.2 Provider Variance

Epoch AI's analysis found that the same model served by different providers can produce different scores. Causes include:

  • Quantization differences between providers
  • Different inference engines and hardware
  • Chat template handling differences
  • Token limit and rate limit differences

This means FaithBench scores reflect model performance as served through OpenRouter, which may differ from scores obtained through direct API access or other providers. Our scores should be interpreted with this uncertainty in mind.

5.5.3 Reproducibility Plan

When funded, we plan to:

  1. Validate a subset of model scores via direct provider APIs (OpenAI, Anthropic, Google)
  2. Quantify and report the magnitude of provider variance for FaithBench specifically
  3. Select the provider configuration that produces scores closest to direct-API baselines

5.6 Error Handling & Retry Policy

5.6.1 Test Model Execution

Test model API calls are managed by a workpool with the following configuration:

| Parameter | Value | Rationale |
|---|---|---|
| Max parallelism | 50 concurrent actions | Stays within Convex Starter tier limits |
| Retry attempts | 3 | Standard resilience for API unreliability |
| Initial backoff | 1,000 ms | Allows transient errors to clear |
| Backoff multiplier | Exponential: 1s → 2s → 4s | |
| Max backoff | 60,000 ms | Prevents excessive wait times |
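
A simplified sketch of this retry policy; in production the Convex workpool manages retries rather than a hand-rolled wrapper like this:

```typescript
// Simplified retry wrapper: 3 attempts, 1s initial backoff, exponential doubling, 60s cap.
async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let backoff = 1_000;
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= attempts) throw err;              // retries exhausted
      await new Promise((resolve) => setTimeout(resolve, backoff));
      backoff = Math.min(backoff * 2, 60_000);         // 1s -> 2s -> 4s ... capped at 60s
    }
  }
}
```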

5.6.2 Judge Model Execution

Judge model calls have an additional resilience layer:

  • If the primary judge (google/gemini-3-flash-preview) returns an error, the system automatically falls back to the secondary judge (openai/gpt-4o-mini)
  • The judge model used is recorded per test result for traceability
  • If both judges fail after retries, the test case is marked as failed and excluded from scoring

5.6.3 Error Classification

| Error Type | Retryable | Handling |
|---|---|---|
| Rate limit (429) | Yes | Exponential backoff, respect Retry-After header |
| Server error (5xx) | Yes | Exponential backoff |
| Timeout (408) | Yes | Exponential backoff |
| Authentication (401) | No | Fail immediately, flag for operator review |
| Content filtered (403) | No | Record as filtered, exclude from scoring |
| Model not found (404) | No | Fail immediately |
| Invalid request (400) | No | Fail immediately |
| Parse failure | No | Record raw response, mark as judge error |
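
The retryable/non-retryable split above reduces to a small classifier; this is an illustrative sketch, not the exact production function:

```typescript
// Illustrative mapping of HTTP status codes to the retry classification above.
function isRetryable(status: number): boolean {
  if (status === 429 || status === 408) return true; // rate limit, timeout
  if (status >= 500) return true;                    // server errors
  return false;                                      // 400, 401, 403, 404: fail immediately
}
```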

5.6.4 Failed Test Cases

Test cases that fail after all retries are:

  • Recorded with error metadata (error type, attempts, final error message)
  • Excluded from aggregate score calculations
  • Reported in run summary statistics so users can see the failure rate

5.7 Statistical Methods

5.7.1 Bootstrap Confidence Intervals

We report 95% confidence intervals using bootstrap resampling:

  • Iterations: 1,000 bootstrap samples
  • Method: Percentile method
  • Stratification: By dimension to ensure representative sampling

Confidence intervals are reported for:

  • Overall scores
  • Dimension-level scores
  • Tradition-specific scores
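
A minimal percentile-bootstrap sketch for a 95% CI on a model's mean composite score; stratification by dimension is omitted here for brevity:

```typescript
// Minimal percentile bootstrap (1,000 resamples) for a 95% CI on the mean.
function bootstrapCI(scores: number[], iterations = 1000): [number, number] {
  const means: number[] = [];
  for (let i = 0; i < iterations; i++) {
    let sum = 0;
    for (let j = 0; j < scores.length; j++) {
      sum += scores[Math.floor(Math.random() * scores.length)]; // resample with replacement
    }
    means.push(sum / scores.length);
  }
  means.sort((a, b) => a - b);
  return [means[Math.floor(0.025 * iterations)], means[Math.ceil(0.975 * iterations) - 1]];
}
```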

5.7.2 Inter-Rater Reliability

We will calculate and report (once human calibration is complete):

| Metric | Target | Interpretation |
|---|---|---|
| Krippendorff's α | ≥ 0.67 | Acceptable reliability (Krippendorff, 2004) |
| Cohen's κ | ≥ 0.61 | Substantial agreement (Landis & Koch, 1977) |
| % Exact Agreement | ≥ 70% | Practical reliability threshold |

IRR calculated between:

  • LLM judge and human expert panel
  • Multiple LLM judges (cross-validation)
  • Multiple human experts (gold standard)

5.7.3 Multiple Comparison Correction

When comparing multiple models:

  • Bonferroni correction for family-wise error rate
  • Report both corrected and uncorrected p-values
  • Effect sizes (Cohen's d) for practical significance
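
The Bonferroni step itself is a simple adjustment; a sketch with an illustrative function name:

```typescript
// Illustrative Bonferroni adjustment for a family of k model-comparison p-values.
function bonferroniAdjust(pValues: number[]): number[] {
  const k = pValues.length;
  return pValues.map((p) => Math.min(1, p * k)); // corrected p-values, capped at 1
}
```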

6. Validation

6.1 Human Expert Calibration

6.1.1 Expert Panel Composition

| Tradition | Minimum Experts | Qualifications |
|---|---|---|
| Reformed | 3 | M.Div.+ from confessional seminary |
| Catholic | 3 | Advanced degree in Catholic theology |
| Orthodox | 3 | Theological training in Orthodox tradition |
| Evangelical | 3 | Faculty/pastoral experience |
| Pentecostal (planned) | 2 | Academic credentials + tradition familiarity |

6.1.2 Calibration Protocol

  1. Gold standard creation: Experts independently score 50 responses
  2. Adjudication: Disagreements resolved through discussion
  3. IRR calculation: Krippendorff's α between experts
  4. Judge calibration: Compare LLM judge to expert majority vote
  5. Threshold: LLM judge must achieve α ≥ 0.67 vs. human panel

6.1.3 Calibration Dataset

| Characteristic | Requirement |
|---|---|
| Size | 50 responses minimum per dimension |
| Stratification | By difficulty, tradition, model |
| Selection | Random + edge cases identified in pilot |
| Refresh | 25% new cases each evaluation cycle |

6.2 Bias Analysis

6.2.1 Position Bias

Protocol:

  • Run all pairwise comparisons in both orders (A vs B, B vs A)
  • Calculate preference reversal rate
  • Target: < 10% reversal rate

6.2.2 Verbosity Bias

Protocol:

  • Calculate Pearson correlation between response length and score
  • Target: |r| < 0.3 (weak or no correlation)
  • If violated, implement length-normalization
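
A sketch of the length-versus-score correlation check (helper names and the final commented step are illustrative):

```typescript
// Pearson correlation between response length and composite score; target is |r| < 0.3.
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const meanX = x.reduce((a, b) => a + b, 0) / n;
  const meanY = y.reduce((a, b) => a + b, 0) / n;
  let cov = 0, varX = 0, varY = 0;
  for (let i = 0; i < n; i++) {
    cov += (x[i] - meanX) * (y[i] - meanY);
    varX += (x[i] - meanX) ** 2;
    varY += (y[i] - meanY) ** 2;
  }
  return cov / Math.sqrt(varX * varY);
}

// const r = pearson(responseLengths, compositeScores);
// If |r| >= 0.3, length-normalization would be triggered per the protocol above.
```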

6.2.3 Tradition Fairness

Protocol:

  • Cross-tradition grading: Reformed experts grade Catholic responses and vice versa
  • Measure systematic score differences
  • Target: No tradition systematically scored > 0.5 points different by out-group experts

6.3 Sensitivity Analysis

6.3.1 Weight Perturbation

Test robustness by:

  • Varying dimension weights ± 5%
  • Measuring rank-order stability across models
  • Reporting sensitivity coefficients
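
A sketch of one way to run this check, reading "± 5%" as relative nudges followed by renormalization (the exact perturbation scheme is a design choice, and these helper names are illustrative):

```typescript
// Illustrative rank-stability check under ±5% weight perturbation.
type Weights = Record<string, number>;

function ranking(modelDimScores: Record<string, Weights>, weights: Weights): string[] {
  return Object.entries(modelDimScores)
    .map(([model, dims]) => ({
      model,
      total: Object.entries(weights).reduce((sum, [dim, w]) => sum + (dims[dim] ?? 0) * w, 0),
    }))
    .sort((a, b) => b.total - a.total)
    .map((entry) => entry.model);
}

function perturbWeights(weights: Weights, delta = 0.05): Weights {
  const nudged = Object.fromEntries(
    Object.entries(weights).map(([dim, w]) => [dim, w * (1 + (Math.random() * 2 - 1) * delta)]),
  );
  const total = Object.values(nudged).reduce((a, b) => a + b, 0);
  return Object.fromEntries(Object.entries(nudged).map(([dim, w]) => [dim, w / total]));
}

// Compare ranking(scores, weights) to ranking(scores, perturbWeights(weights)) over many
// draws and report how often the model ordering changes.
```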

6.3.2 Judge Model Comparison

Cross-validate using:

  • Multiple judge models (Claude, GPT-5, Gemini)
  • Report agreement rates
  • Flag dimensions with high inter-judge variance

6.4 Validation Status (v1.0)

| Component | Status | Notes |
|---|---|---|
| Test cases | ✅ Complete | 413 cases across 6 dimensions |
| Difficulty calibration | ✅ Complete | Easy/Medium deactivated due to ceiling effects |
| LLM judge | ✅ Operational | Gemini 3 Flash (single judge; cross-validation planned) |
| Bootstrap CI | ✅ Operational | 1,000 iterations, 95% confidence |
| Human calibration | 🔄 In progress | Expert recruitment underway |
| IRR metrics | ⏳ Planned | Requires human calibration completion |
| Bias testing | ⏳ Planned | Tooling complete, awaiting scale data |
| Delphi weights | ⏳ Planned | Scheduled for v2.0 |

6.5 Maturity Classification

To clearly distinguish what is conceptual, implemented, and validated, we provide the following maturity classification:

| Component | Conceptual | Implemented | Validated |
|---|---|---|---|
| Construct definition (6 dimensions) | Yes | Yes | Pending expert review |
| Test case corpus (413 cases) | Yes | Yes | LLM-screened, not human-validated |
| Scoring rubrics with exemplars | Yes | Yes | Provisional (no IRR data) |
| LLM-as-judge scoring | Yes | Yes | Operational, not calibrated against humans |
| Bootstrap confidence intervals | Yes | Yes | Captures sampling uncertainty only |
| Difficulty weighting | Yes | Yes | Empirically adjusted, not IRT-calibrated |
| Dimension weights | Yes | Yes | Designer priors, not Delphi-validated |
| Human expert calibration | Yes | In progress | No |
| Cross-validation judge | Yes | No | No |
| IRT difficulty calibration | Yes | No | No |
| Delphi weight derivation | Yes | No | No |
| Bias testing (position/verbosity/tradition) | Yes | Tooling ready | No |

A reader should never have to guess whether a component of FaithBench is live, partial, or aspirational. This table is the definitive reference.


7. Traditions Evaluated

FaithBench v1.0 evaluates four traditions: Reformed, Catholic, Orthodox, and Evangelical (see Sections 6.1.1 and 13.3). The Reformed profile is shown below as a representative example.

Reformed

Doctrinal Framework: Covenant theology, TULIP, Westminster Standards

Key Competencies Tested:

  • Sola scriptura application
  • Predestination and election
  • Covenant of grace structure
  • Perseverance of the saints
  • Reformed hermeneutics

Authoritative Sources: Westminster Confession, Heidelberg Catechism, Canons of Dort, Three Forms of Unity


8. Current Limitations

8.1 Single LLM Judge Without Human Inter-Rater Reliability

All scoring in v1.0 is performed by a single LLM judge (Gemini 3 Flash Preview). No human inter-rater reliability (IRR) data has been collected yet. This means:

  • We cannot currently quantify how well our automated scores align with expert human judgment
  • The reliability of scores across dimensions is assumed but not empirically validated
  • Dimensions requiring deep theological nuance (e.g., doctrinal precision, hermeneutical reasoning) may be scored less reliably than more factual dimensions (e.g., historical accuracy)

Human expert calibration is in progress (see Section 6.1), but until IRR metrics are published, all scores should be treated as preliminary automated assessments.

8.2 Self-Preference Bias Unmitigated

Self-preference bias—where LLM judges rate outputs from their own model family more favorably—is documented in the literature at 5–15% inflation (see Section 5.3.2). In v1.0:

  • Gemini model scores may be inflated when judged by Gemini 3 Flash
  • No cross-validation judge is currently operational (GPT-5 cross-validation is planned)
  • We flag this limitation but do not yet adjust scores to compensate

Users should interpret Gemini-family model scores with particular caution until cross-validation is implemented.

8.3 Single Provider (OpenRouter)

All models are accessed through OpenRouter. Epoch AI documents that provider choice can significantly affect scores due to quantization, inference engine differences, and chat template handling (see Section 5.5). This means:

  • Our scores reflect performance as served through OpenRouter, not necessarily through direct provider APIs
  • The magnitude of provider variance for FaithBench specifically is unquantified
  • Scores are not directly comparable to evaluations run through other providers

8.4 Single-Run Scores

Each model is evaluated once per benchmark run. Epoch AI recommends 4–32 runs as standard practice. Single-run evaluation means:

  • Run-to-run variance is unquantified
  • Even with temperature 0.7 for test models and temperature 0 for the judge, some variance exists from provider-side factors
  • Our bootstrap CIs capture sampling uncertainty (which test cases) but not run-to-run uncertainty

We plan to implement multi-run averaging (minimum 4 runs per model) when funded.

8.5 Bootstrap CIs Do Not Account for Judge Error

Our bootstrap CIs (1,000 iterations, 95% confidence) quantify sampling uncertainty—the variability from which test cases a model happens to encounter. They do not account for:

  • Systematic judge error (if the LLM judge consistently misjudges a category)
  • Judge variance (if the same response would receive different scores on re-evaluation)
  • Construct validity uncertainty (whether our rubrics capture what we intend to measure)

The reported CIs are therefore narrower than the true uncertainty and should be read as a lower bound on the width of the actual interval.

8.6 Expert-Item Selection Circularity

Expert-level questions are pre-screened against frontier LLMs (Section 5.1.4): questions where any model scores >70% are rejected as too easy. This creates a selection effect:

  • The expert tier is defined partly by what current models get wrong
  • This may conflate "genuinely hard theological content" with "content that happens to confuse current LLMs"
  • As models improve, items may need recalibration—but the original selection bias persists in historical comparisons

We plan to address this with Item Response Theory (IRT) calibration in v2.0, which would provide model-independent difficulty estimates.

8.7 Held-Out Rotation Without Equating

Held-out test cases rotate semi-annually to prevent data contamination (Section 5.1.3). However:

  • No test equating procedure is currently applied across rotations
  • Scores from different rotation periods may not be directly comparable
  • A model's score could change between periods due to item difficulty shifts, not actual capability changes

Planned mitigation: anchor items (a fixed subset of questions present in all rotations) to enable cross-period equating.


9. Testing Configuration

9.1 Rationale: Controlled Comparison

Our testing configuration prioritizes controlled comparison:

| Parameter | Setting | Why This Setting | Real-World Difference |
|---|---|---|---|
| Temperature | 0.7 | Balances reproducibility with natural variation | Users may use different temperatures; some applications use 0 or 1.0 |
| Reasoning/Thinking | Disabled | Controls for reasoning capability differences; tests base knowledge | Many users enable reasoning modes; models with thinking enabled would likely score higher |
| System prompts | Minimal | Avoids biasing responses toward any tradition | Real applications often include system prompts that improve domain performance |
| Max tokens | 2,000 | Sufficient for theological responses | Some applications constrain or expand response length |

What this means for users: FaithBench scores represent a controlled baseline of model capability. Models with reasoning enabled, appropriate system prompts, or retrieval-augmented generation (RAG) may perform significantly better in practice. Our scores should not be interpreted as "what this model will do when you ask it a theological question"—they measure base theological knowledge under standardized conditions.

We plan to add "thinking-enabled" leaderboard variants to show the performance delta between base and reasoning-augmented configurations.


10. Public Data Access

10.1 What's Available

Signed-in users can view the following for each benchmark run:

  • Public test cases: The question prompts for the public 50% of the test set
  • Model responses: The full text each model generated
  • Judge reasoning: The LLM judge's dimension-by-dimension scores and justifications
  • Aggregate scores: Overall and per-dimension scores with confidence intervals

10.2 What's Held Back (and Why)

The held-out 50% of the test set is not publicly visible. This is standard practice in benchmarking:

  • MMLU maintains held-out test sets to prevent training data contamination
  • GPQA withholds questions to maintain benchmark validity
  • SWE-bench uses private test instances

If prompts and reference answers are fully public, model developers can (intentionally or inadvertently) train on them, rendering the benchmark meaningless. The public/held-out split balances transparency with benchmark integrity.

10.3 Academic Access

Researchers seeking access to the full test set for academic purposes may contact us at hello@faithbench.com. We will evaluate requests based on:

  • Affiliation with a recognized research institution
  • Clear research purpose
  • Agreement not to use test cases for model training

11. Benchmarking Is Hard

11.1 The State of AI Benchmarking

Epoch AI's "Why Benchmarking Is Hard" documents pervasive challenges that affect all AI benchmarks:

  • Prompt sensitivity: GPQA-Diamond scores range 74%–80% depending on prompt template alone
  • Provider variance: The same model served by different providers produces different scores
  • Temperature inconsistency: Different organizations use temperatures ranging from 0.0 to 1.0
  • LLM-as-judge effects: Judge model choice has "sizable impact" on results
  • Single-run insufficiency: Standard practice is 4–32 runs averaged; single runs are considered insufficient

As Epoch AI puts it: "Basically everyone is doing their own thing."

11.2 Where FaithBench Stands

FaithBench is an early-stage benchmark. We are not yet at the rigor level of established benchmarks like MMLU or GPQA, and we don't claim to be. Our primary sources of uncertainty are:

  1. Judge variance: Single LLM judge, potential self-preference bias
  2. Provider variance: OpenRouter-mediated access, unquantified provider effects
  3. Run-to-run variance: Single-run evaluation
  4. Validation gap: No human IRR data yet

What we have done is make all of this transparent—including our exact judge prompts, rubric weights, model configs, and error handling. We believe transparency about limitations is more valuable than hiding them behind confident-sounding scores.

11.3 Our Hardening Roadmap

We have a phased plan to address these limitations:

  • Phase 1 (current): Full transparency and honest disclosure of all artifacts and limitations
  • Phase 2 (~$500 funding): Multi-judge evaluation, multi-run averaging, provider diversification, prompt sensitivity analysis
  • Phase 3 (~$2,000+ funding): Human inter-rater reliability, IRT calibration, bias auditing, full reproducibility package

Phase 1 makes our limitations transparent. Phase 2 quantifies them. Phase 3 mitigates them.


12. Development Status

The following table summarizes what is operational in v1.0 versus what is planned for future versions:

| Component | Status | Version | Notes |
|---|---|---|---|
| Test case corpus (413 cases) | ✅ Done | v1.0 | 6 dimensions, 4 difficulty levels |
| Difficulty calibration | ✅ Done | v1.0 | Easy/Medium deactivated due to ceiling effects |
| LLM-as-judge scoring | ✅ Done | v1.0 | Single judge (Gemini 3 Flash Preview) |
| Bootstrap confidence intervals | ✅ Done | v1.0 | 1,000 iterations, 95% CI |
| Scoring rubrics with exemplars | ✅ Done | v1.0 | All 6 dimensions documented |
| Judge prompt published | ✅ Done | v1.0 | Exact system prompt on this page |
| Rubric weights published | ✅ Done | v1.0 | All sub-dimension weights on this page |
| Model execution config published | ✅ Done | v1.0 | Temperature, max tokens, provider |
| Error handling documented | ✅ Done | v1.0 | Retry policy, fallback judge, error classification |
| Provider limitation disclosed | ✅ Done | v1.0 | OpenRouter-only, Epoch AI citation |
| Scoring calculation documented | ✅ Done | v1.0 | Normalization, weighting, difficulty multipliers |
| Hardening roadmap published | ✅ Done | v1.0 | Phased plan with funding requirements |
| Open evaluation code | ✅ Done | v1.0 | MIT license |
| Cross-validation judge | ⏳ Planned | v2.0 | GPT-5 as secondary judge |
| Multi-run averaging | ⏳ Planned | v2.0 | Minimum 4 runs per model |
| Provider diversification | ⏳ Planned | v2.0 | Direct API validation for subset |
| Human expert calibration | 🔄 In progress | v2.0 | Expert recruitment underway |
| Inter-rater reliability metrics | ⏳ Planned | v2.0 | Requires human calibration |
| Position/verbosity/tradition bias testing | ⏳ Planned | v2.0 | Tooling complete, awaiting scale data |
| Delphi expert weight derivation | ⏳ Planned | v2.0 | 5+ theologians per tradition |
| Item Response Theory calibration | ⏳ Planned | v2.0 | Model-independent difficulty estimates |
| Test equating across rotations | ⏳ Planned | v2.0 | Anchor items for cross-period comparison |
| Thinking-enabled leaderboard | ⏳ Planned | v2.0 | Reasoning mode performance delta |
| Non-Western tradition expansion | ⏳ Planned | v2.0+ | African, Asian, Latin American frameworks |
| Multi-lingual evaluation | ⏳ Planned | v2.0+ | German, Spanish, Korean theological discourse |
| Pentecostal tradition | 🔄 In progress | v2.0 | Expert recruitment and test case development |

13. Tradition Scope and Definitions

13.1 Evangelical Tradition Boundaries

FaithBench's "Evangelical" tradition category follows the Bebbington Quadrilateral (Bebbington, 1989), which defines Evangelicalism by four characteristics:

  1. Biblicism: High regard for biblical authority (operationalized via the Chicago Statement on Biblical Inerrancy)
  2. Crucicentrism: Focus on Christ's atoning work on the cross (substitutionary atonement emphasis)
  3. Conversionism: Emphasis on personal conversion experience ("born again")
  4. Activism: Commitment to evangelism and missionary effort (Great Commission emphasis)

What is included: Conservative evangelical positions on Scripture, atonement, conversion, and mission as represented in the Chicago Statement, evangelical systematic theologies (e.g., Grudem, Erickson), and major evangelical confessional documents.

What is excluded: Mainline Protestant positions, progressive evangelical positions, and prosperity gospel teachings. These may be evaluated under separate tradition categories in future versions.

Relationship to Reformed: There is significant overlap between Evangelical and Reformed categories. The distinction: Reformed questions test confessional precision (Westminster Standards, covenant theology, TULIP) while Evangelical questions test broader evangelical distinctives that cross confessional lines.

13.2 Operational Definitions

"Google-Proof" Questions

A question is "Google-proof" when it cannot be answered correctly by performing a single web search and reading the top results. Operationally, this means:

  • The answer requires synthesizing information from 3+ distinct sources that do not appear together on any single web page
  • Simple keyword searches return partial or misleading information
  • Correct answers require domain expertise to evaluate and integrate conflicting sources

This is modeled on the GPQA benchmark (Rein et al., 2023), which demonstrated that non-expert validators with unrestricted web access still scored only 34% on questions experts answered at 65%.

"Multi-Hop" Reasoning

A question requires "multi-hop" reasoning when answering correctly requires:

  1. Retrieving multiple distinct facts from different areas of theological knowledge
  2. Connecting those facts through logical or theological reasoning
  3. Synthesizing a conclusion that is not explicitly stated in any single source

Example: "Compare Athanasius's and Arius's interpretations of Colossians 1:15, explain the grammatical argument for each reading, and trace how this debate influenced the Nicene formulation."

This requires: (1) knowledge of the Arian controversy, (2) Greek grammar of the genitive construction, (3) patristic textual arguments, and (4) conciliar history—four distinct knowledge domains integrated into a single answer.

13.3 Tradition Category Asymmetry

The four tradition categories in v1.0 are not perfectly parallel analytic units:

| Tradition | Category Type | Scope |
|---|---|---|
| Catholic | Ecclesial tradition | Global communion with defined magisterial authority |
| Orthodox | Ecclesial tradition | Family of autocephalous churches with shared liturgical and patristic heritage |
| Reformed | Confessional tradition | Defined by specific confessional documents (Westminster, Dort, Belgic) |
| Evangelical | Sociological-theological movement | Bounded modern movement with fuzzy edges, defined operationally via Bebbington Quadrilateral |

This asymmetry is acknowledged and pragmatic for v1.0. Catholic and Orthodox represent broad ecclesial bodies; Reformed represents a confessional position within Protestantism; Evangelical represents a cross-denominational movement. These are not co-equal analytic units in a strict taxonomic sense.

We chose these four because they represent the major evaluative frameworks within Western Christianity where theological accuracy can be meaningfully assessed against identifiable standards. Future versions may refine this taxonomy as expert input and empirical results reveal where the current categories conflate meaningfully distinct positions or where finer distinctions are needed.


14. Scope and Limitations

14.1 Scope Limitations

What FaithBench measures:

  • Factual accuracy about theological positions
  • Reasoning within established frameworks
  • Faithful representation of denominational views
  • Biblical language competency

What FaithBench does not measure:

  • Spiritual edification value
  • Pastoral appropriateness
  • Alignment with any tradition's values
  • Suitability for worship or counseling

14.2 Methodological Limitations

| Limitation | Impact | Mitigation |
|---|---|---|
| Western Christian focus | Non-Western traditions underrepresented | Planned expansion in v2.0 |
| English-language evaluation | May miss non-English theological nuance | Original language competency tested separately |
| LLM judge variability | Scoring inconsistency possible | IRR monitoring, human calibration |
| Weight subjectivity | Dimension weights influence rankings | Sensitivity analysis, expert Delphi planned |
| Sample size | Statistical power limits | Bootstrap CI to quantify uncertainty |
| Expert question coverage | Initial set focuses on Western Christian traditions | Expand to cover more traditions in v2.0 |
| Reasoning/thinking disabled | Reasoning models may underperform vs. their potential | Planned thinking-enabled leaderboard variants |

14.3 Known Biases

| Bias Type | Current Status | Planned Mitigation |
|---|---|---|
| Position bias | Untested | Position reversal protocol |
| Verbosity bias | Untested | Length correlation analysis |
| Tradition bias | Untested | Cross-tradition expert grading |
| Difficulty confounding | Partial mitigation | IRT difficulty calibration |

14.4 Generalizability

Results may not generalize to:

  • Non-Christian religious traditions
  • Non-academic theological contexts
  • Languages other than English
  • Highly specialized sub-fields (e.g., Syriac patristics)

15. Technical Configuration Reference

This section provides a complete reference of all production configuration values. These are the exact values used in scoring—not aspirational or planned values.

15.1 Judge Configuration

| Parameter | Value |
|---|---|
| Primary judge model | google/gemini-3-flash-preview |
| Fallback judge model | openai/gpt-4o-mini |
| Judge temperature | 0 |
| Judge max tokens | 16,000 |
| Output format | JSON (response_format: { type: "json_object" }) |
| Fallback trigger | Primary judge HTTP error |

15.2 Test Model Configuration

| Parameter | Value |
|---|---|
| Temperature | 0.7 |
| Max tokens | 2,000 |
| Provider | OpenRouter (https://openrouter.ai/api/v1/chat/completions) |
| System prompt | Minimal (question-only) |

15.3 Execution Configuration

| Parameter | Value |
|---|---|
| Max parallelism | 50 concurrent actions |
| Retry attempts | 3 |
| Initial backoff | 1,000 ms |
| Backoff multiplier | 2× (1s → 2s → 4s) |
| Max backoff | 60,000 ms |

15.4 Statistical Configuration

| Parameter | Value |
|---|---|
| Bootstrap iterations | 1,000 |
| Confidence level | 95% |
| Difficulty weights | Easy: 1.0, Medium: 1.5, Hard: 2.0, Expert: 3.0 |

15.5 Scoring Rubric Weights (Production Values)

These are the exact sub-dimension weights from packages/convex/convex/scoring.ts:

Textual Analysis: lexicalAccuracy (35%), translationFidelity (35%), linguisticReasoning (20%), sourceHandling (10%)

Hermeneutical Reasoning: interpretiveMethod (35%), genreAwareness (25%), contextualAnalysis (25%), canonicalIntegration (15%)

Doctrinal Precision: doctrinalAccuracy (35%), traditionFidelity (35%), nuanceRecognition (20%), sourceGrounding (10%)

Historical Theology: historicalAccuracy (40%), developmentAwareness (30%), patristicKnowledge (20%), historiographicalMethod (10%)

Apologetics: logicalValidity (35%), evidenceUsage (30%), objectionHandling (25%), persuasiveClarity (10%)

Intertextual Reasoning: crossReferenceAccuracy (40%), typologicalRecognition (30%), allusionDetection (20%), thematicIntegration (10%)


16. Reproducibility Statement

16.1 Open Resources

| Resource | Availability |
|---|---|
| Public test cases (50%) | Viewable by signed-in users |
| Held-out test cases (50%) | Available to researchers on request (hello@faithbench.com) |
| Evaluation code | MIT license |
| Judge prompts | Published on this page (Section 5.3.3) |
| Scoring rubrics | Published on this page (Sections 4.4, 15.5) |
| Model configurations | Published on this page (Section 15) |
| Hardening roadmap | Published in repository (docs/roadmap-benchmarking-hardening.md) |

16.2 Version Control

  • Evaluation methodology versioned (current: v1.0)
  • Test case sets versioned with changelogs
  • Results include methodology version for reproducibility

17. Future Work

  1. Multi-judge evaluation: Add GPT-5 as secondary judge, report inter-judge agreement
  2. Multi-run averaging: 4+ runs per model to quantify run-to-run variance
  3. Provider diversification: Validate scores via direct provider APIs
  4. Human IRR: Krippendorff's α between LLM judge and theologian panel
  5. Delphi weight derivation: Formal expert consensus methodology
  6. Non-Western traditions: African, Asian, Latin American theological frameworks
  7. Item response theory: Difficulty calibration using IRT models
  8. Multi-lingual evaluation: German, Spanish, Korean theological discourse
  9. Fine-grained sub-traditions: Distinguish within broad categories (e.g., Old Princeton vs. Dutch Reformed)
  10. Temporal analysis: Track model improvement over time
  11. Thinking-enabled leaderboard: Show performance delta with reasoning modes

References

Abid, A., Farooqi, M., & Zou, J. (2021). Persistent anti-Muslim bias in large language models. AIES.

Bebbington, D. W. (1989). Evangelicalism in Modern Britain: A History from the 1730s to the 1980s. Unwin Hyman.

Bowman, S. R., & Dahl, G. E. (2021). What will it take to fix benchmarking in NLP? NAACL.

Epoch AI. (2025). Why benchmarking is hard. Epoch AI Substack. https://epochai.substack.com/p/why-benchmarking-is-hard

Gwet, K. L. (2014). Handbook of Inter-Rater Reliability (4th ed.). Advanced Analytics.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). Measuring massive multitask language understanding. ICLR.

Krippendorff, K. (2004). Content Analysis: An Introduction to Its Methodology (2nd ed.). Sage.

Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.

Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., ... & Koreeda, Y. (2023). Holistic evaluation of language models. TMLR.

Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., & Zhu, C. (2024). G-Eval: NLG evaluation using GPT-4 with better human alignment. EMNLP.

Raji, I. D., Bender, E. M., Paullada, A., Denton, E., & Hanna, A. (2021). AI and the everything in the whole wide world benchmark. NeurIPS Datasets and Benchmarks.

Rein, D., Hou, B., Stickland, A., Petty, J., Pang, R. Y., Dirani, J., ... & Bowman, S. R. (2023). GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint. https://arxiv.org/abs/2311.12022

Zheng, L., Chiang, W. L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., ... & Stoica, I. (2024). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. NeurIPS.


Citation

@misc{faithbench2026,
  title={FaithBench: Toward Tradition-Aware Evaluation of Theological
         Faithfulness in Large Language Models},
  author={FaithBench Team},
  year={2026},
  url={https://faithbench.com},
  note={Methodology v1.0}
}

Contributing

We welcome contributions from:

  • Theologians: Test case development, rubric refinement, expert calibration
  • AI Researchers: Evaluation methodology, statistical analysis, bias testing
  • Practitioners: Real-world use case scenarios, pastoral context feedback

Contact: hello@faithbench.com