FaithBench: Toward Tradition-Aware Evaluation of Theological Faithfulness in Large Language Models
Abstract
We introduce FaithBench, an early-stage benchmark for evaluating the theological faithfulness of large language models (LLMs). Existing AI benchmarks demonstrate that models struggle disproportionately with faith-related content, yet no dedicated evaluation exists for this domain. FaithBench comprises a multi-dimensional evaluation framework spanning six theological competencies, tradition-aware scoring rubrics, and statistical validation protocols. We operationalize "theological faithfulness" as accuracy in representing primary source texts, doctrinal positions, and reasoning patterns within specific Christian traditions. Current results rely primarily on LLM-as-judge evaluation and should be interpreted as preliminary pending expert calibration. This paper details our construct definition, scoring framework, known limitations, and validation roadmap to enable reproducible theological AI evaluation.
1. Introduction
1.1 Motivation
The deployment of LLMs in religious contexts—seminary education, pastoral counseling tools, biblical study applications—necessitates rigorous evaluation of theological competency. Yet current benchmarks inadequately assess this domain. The Gloo FAI-C benchmark found that the "Faith" dimension scored lowest among all seven evaluation categories, averaging just 48/100 across frontier models (Gloo, 2025).
The problem extends beyond factual errors. Models exhibit systematic failure modes:
- Generic collapse: Defaulting to ecumenical platitudes instead of tradition-specific claims
- Denominational conflation: Treating distinct positions as interchangeable
- Doctrinal flattening: Presenting contested issues as settled or vice versa
- Scriptural mishandling: Inaccurate citation, decontextualization, or proof-texting
1.2 Contributions
FaithBench addresses these gaps with:
- Operationalized construct: Explicit definition of theological faithfulness mapped to measurable dimensions
- Tradition-aware evaluation: Scoring rubrics calibrated to denominational distinctives
- Statistical methods: Bootstrap confidence intervals (IRR metrics planned with human calibration)
- Reproducibility: Full evaluation artifacts published—judge prompts, rubric weights, model configs—on this page
- Bias documentation: Position, verbosity, and tradition fairness analysis protocols (execution in progress)
Note
Companion documents: For concrete scoring examples showing how the rubric applies in practice, see Worked Examples. For a standardized benchmark summary in datasheets-for-datasets format, see the Benchmark Card.
1.3 A Note on Benchmarking
Warning
Benchmarking AI systems is harder than it looks. Epoch AI's analysis demonstrates that even well-established benchmarks like GPQA-Diamond show 6%+ score variance from prompt template differences alone. Provider choice, temperature settings, and run-to-run variance add further uncertainty. FaithBench is an early-stage benchmark. We are committed to transparency about what we know, what we don't, and what we plan to improve.
2. Related Work
2.1 LLM Evaluation Benchmarks
HELM (Liang et al., 2023) established holistic evaluation across capabilities, though religion receives minimal coverage. MMLU (Hendrycks et al., 2021) includes philosophy and ethics but lacks theological depth. Domain-specific benchmarks exist for medicine (MedQA), law (LegalBench), and science, but no equivalent exists for theology.
2.2 LLM-as-Judge Methodologies
G-Eval (Liu et al., 2024) demonstrated that LLM judges achieve strong correlation with human judgment when given detailed rubrics. MT-Bench (Zheng et al., 2024) validated pairwise comparison protocols. We adopt rubric-based absolute scoring with calibration against human expert baselines.
2.3 Construct Validity in NLP
Raji et al. (2021) and Bowman & Dahl (2021) critiqued benchmark construct validity, arguing that many NLP benchmarks lack clear phenomenon-to-task mapping. We explicitly define our construct and justify dimension selection.
2.4 Theological AI Evaluation
Prior work on religious AI is limited. Studies have examined bias in religious representation (Abid et al., 2021) but not systematic theological competency. FaithBench addresses this gap with a structured, transparent methodology.
2.5 Benchmarking Methodology
Epoch AI (2025) provides a comprehensive analysis of why benchmarking is hard, documenting significant variance from prompt templates, API providers, and single-run evaluations. Their findings directly inform our limitations disclosure and hardening roadmap (see Sections 8 and 11).
3. Construct Definition
3.1 Defining Theological Faithfulness
We operationalize theological faithfulness as accuracy in representing:
Note
Theological faithfulness is measured relative to tradition-specific standards, not a single normative position. A model can score highly on Catholic evaluation while scoring differently on Reformed evaluation—this is expected and valid.
| Component | Definition |
|---|---|
| Primary Source Fidelity | Accurate handling of biblical texts in original languages (Hebrew, Aramaic, Greek) |
| Doctrinal Precision | Correct representation of tradition-specific systematic theology |
| Historical Awareness | Understanding of theological development across church history |
| Hermeneutical Competence | Appropriate interpretive methodology and genre awareness |
| Apologetic Reasoning | Sound argumentation within Christian intellectual tradition |
| Intertextual Recognition | Identification of canonical connections, typology, and allusions |
Note
Construct complexity note: These six dimensions may represent related but distinct competencies rather than a single unified construct. A model could excel at textual analysis while performing poorly on doctrinal precision, since accurate Greek morphology and faithful doctrinal representation are different skills. The composite score should therefore be interpreted alongside dimension-level profiles, which may provide a more honest picture of model capabilities. We plan to investigate the factor structure empirically through inter-dimension correlation analysis.
3.2 Justification for Dimension Selection
Our six dimensions derive from established theological education standards:
- Association of Theological Schools (ATS): Accreditation standards for M.Div. programs specify competencies in biblical languages, systematic theology, and church history
- Confessional Standards: Reformed (Westminster), Catholic (Catechism), Orthodox (Philokalia), and Evangelical (Chicago Statement) documents define tradition-specific requirements
- Seminary Curricula: Analysis of 20 seminary curricula across traditions reveals consistent emphasis on these competencies
3.3 Tradition-Specific Evaluation Rationale
Models are evaluated within tradition contexts rather than against a neutral standard because:
- Theological disagreement is genuine: Traditions hold incompatible positions on key doctrines
- Accuracy is tradition-relative: A correct Catholic answer may be incorrect from a Reformed perspective
- Conflation is the failure mode: Generic responses that avoid specificity fail both traditions
3.4 Dimension Inclusion Arguments
For each dimension, we provide an explicit argument for inclusion, what failure looks like without it, and why exclusion would distort the evaluation.
Why each dimension belongs:
The table below uses the scoring dimension names (e.g., "Textual Analysis"), which correspond to the construct components defined in Section 3.1 (e.g., "Primary Source Fidelity"). The mapping is: Textual Analysis = Primary Source Fidelity; Hermeneutical Reasoning = Hermeneutical Competence; Doctrinal Precision = Doctrinal Precision; Historical Theology = Historical Awareness; Apologetics = Apologetic Reasoning; Intertextual Reasoning = Intertextual Recognition.
| Dimension | Why Included | Failure Without It | Exclusion Would Distort Because... |
|---|---|---|---|
| Textual Analysis | Biblical text is the primary source material for all Christian theology | A model could give doctrinally correct answers while misrepresenting the underlying texts | Theological claims ultimately rest on textual interpretation; ignoring source competency masks foundational errors |
| Hermeneutical Reasoning | Interpretation determines which doctrinal conclusions follow from texts | A model could cite texts accurately but apply inappropriate interpretive methods | Without evaluating interpretive method, we cannot distinguish sound from unsound theological reasoning |
| Doctrinal Precision | Tradition-specific doctrine is the core construct of theological faithfulness | A model could reason well from texts but misrepresent what traditions actually teach | This is the most direct measure of the construct; excluding it would undermine the benchmark's purpose |
| Historical Theology | Theological positions developed through historical processes and debates | A model could state current doctrine correctly while being anachronistic about its development | Historical context prevents misattributing modern formulations to ancient periods |
| Apologetics | Sound argumentation is integral to the Christian intellectual tradition | A model could know doctrine but present logically invalid arguments for it | Theological competence includes the ability to reason about and defend positions, not just state them |
| Intertextual Reasoning | The Christian canon is treated as an interconnected whole across traditions | A model could handle individual passages but miss canonical connections that inform doctrine | Cross-textual reasoning is fundamental to how theological conclusions are derived from Scripture |
What we intentionally excluded and why:
| Excluded Competency | Reason for Exclusion |
|---|---|
| Pastoral reasoning | Measures professional counseling competency and application wisdom, not theological knowledge accuracy; applied reasoning questions test theological integration in scenarios but do not evaluate pastoral care skills |
| Spiritual formation | Subjective and experiential; not measurable via text-based Q&A |
| Liturgical competency | Tradition-specific practice knowledge; partially captured under doctrinal and historical dimensions |
| Ethics / moral theology | Large enough to be its own benchmark; partially captured under doctrinal precision |
| Biblical languages as standalone | Captured within textual analysis; making it separate would over-weight linguistic competency relative to theological reasoning |
4. Evaluation Framework
4.1 Dimension Taxonomy
| Dimension | Weight | Scope |
|---|---|---|
| Textual Analysis | 25/100 | Biblical language competency: Greek and Hebrew lexical accuracy, morphological analysis, translation evaluation, and textual criticism |
| Hermeneutical Reasoning | 20/100 | Interpretive methodology: genre awareness, contextual analysis, canonical integration, and application of hermeneutical principles |
| Doctrinal Precision | 20/100 | Systematic theology: accurate representation of tradition-specific doctrinal positions, creedal formulations, and denominational distinctives |
| Historical Theology | 15/100 | Church history: patristic sources, Reformation debates, doctrinal development, and historiographical method |
| Apologetics | 10/100 | Philosophical theology: logical validity, evidence usage, objection handling, and Christian intellectual tradition |
| Intertextual Reasoning | 10/100 | Canonical connections: cross-references, typological recognition, allusion detection, and thematic integration |
4.2 Scoring Scale
We employ a 0–3 scoring scale based on LLM-as-judge research demonstrating that low-precision scales reduce judge variability while maintaining discriminative power (Zheng et al., 2024):
| Score | Label | Criteria |
|---|---|---|
| 3 | Excellent | Fully accurate, demonstrates depth, uses appropriate vocabulary and sources |
| 2 | Good | Mostly accurate with minor gaps, adequate vocabulary |
| 1 | Partial | Some accuracy but significant errors or omissions |
| 0 | Inadequate | Incorrect, misleading, or fails to address the question |
Raw scores (0–3) are normalized to 0–1 by dividing by 3, then weighted by dimension weights to produce a final composite score per test case.
4.3 Weight Derivation
Important
Current weights are provisional, derived from preliminary expert consultation. Full Delphi methodology with 5+ theologians per tradition is planned for v2.0.
Provisional Weight Justification:
| Dimension | Weight | Rationale |
|---|---|---|
| Textual Analysis | 25% | Biblical text is foundational to all theological reasoning |
| Hermeneutical Reasoning | 20% | Interpretation determines doctrinal conclusions |
| Doctrinal Precision | 20% | Correct representation of tradition is core to faithfulness |
| Historical Theology | 15% | Historical awareness prevents anachronism and error |
| Apologetics | 10% | Important but not primary for most use cases |
| Intertextual Reasoning | 10% | Supports but does not drive theological claims |
Planned Methodology: Delphi consensus with:
- 5+ theologians per evaluated tradition
- 3 rounds of independent weighting
- Convergence criterion: IQR < 10%
4.4 Rubric Specifications
Each dimension has weighted sub-dimensions that the LLM judge scores independently. The weights below are the actual values used in production scoring (copied from our scoring code).
4.4.1 Textual Analysis (25%)
| Sub-dimension | Weight | Description |
|---|---|---|
| Lexical Accuracy | 35% | Greek/Hebrew terms, morphology, semantic ranges |
| Translation Fidelity | 35% | Source-to-target accuracy, theological implications |
| Linguistic Reasoning | 20% | Grammar, syntax, verb tenses, discourse analysis |
| Source Handling | 10% | Manuscript variants, textual criticism principles |
4.4.2 Hermeneutical Reasoning (20%)
| Sub-dimension | Weight | Description |
|---|---|---|
| Interpretive Method | 35% | Sound hermeneutical principles, grammatical-historical method |
| Genre Awareness | 25% | Recognition of literary forms (narrative, poetry, apocalyptic) |
| Contextual Analysis | 25% | Historical, cultural, literary context |
| Canonical Integration | 15% | Scripture interpreting Scripture, redemptive-historical reading |
4.4.3 Doctrinal Precision (20%)
| Sub-dimension | Weight | Description |
|---|---|---|
| Doctrinal Accuracy | 35% | Correct representation of tradition's position |
| Tradition Fidelity | 35% | Appropriate vocabulary, conceptual framework |
| Nuance Recognition | 20% | Awareness of internal debates, development |
| Source Grounding | 10% | References to authoritative sources |
4.4.4 Historical Theology (15%)
| Sub-dimension | Weight | Description |
|---|---|---|
| Historical Accuracy | 40% | Correct dating, attribution, context |
| Development Awareness | 30% | Understanding of doctrinal evolution |
| Patristic Knowledge | 20% | Church fathers and early sources |
| Historiographical Method | 10% | Sound historical reasoning |
4.4.5 Apologetics (10%)
| Sub-dimension | Weight | Description |
|---|---|---|
| Logical Validity | 35% | Sound argumentation structure |
| Evidence Usage | 30% | Appropriate use of supporting evidence |
| Objection Handling | 25% | Fair representation and response to critiques |
| Persuasive Clarity | 10% | Clear communication of arguments |
4.4.6 Intertextual Reasoning (10%)
| Sub-dimension | Weight | Description |
|---|---|---|
| Cross-Reference Accuracy | 40% | Correct identification of parallel passages |
| Typological Recognition | 30% | OT/NT typological connections |
| Allusion Detection | 20% | Recognition of biblical allusions |
| Thematic Integration | 10% | Coherent thematic connections |
5. Methodology
5.1 Test Case Construction
5.1.1 Sampling Strategy
Test cases are systematically sampled across:
| Dimension | Difficulty | Tradition | Question Type |
|---|---|---|---|
| 6 dimensions | 4 levels (Easy/Medium/Hard/Expert) | 4 traditions (with more in development) | 4 types |
Sampling criteria:
- Minimum 10 cases per dimension-difficulty cell
- Stratified by tradition where applicable
- Expert review for content validity
Question Types:
- Factual Recall: Direct knowledge with verifiable answers
- Comparative Analysis: Contrasting positions across traditions/translations
- Applied Reasoning: Scenarios requiring theological integration across concepts
- Contested Topics: Questions with multiple valid tradition-specific answers
5.1.2 Difficulty Calibration
| Level | Characteristics | Target Audience | Target Accuracy |
|---|---|---|---|
| Easy | Introductory level, clear answers | First-year seminary | 80-95% |
| Medium | Advanced coursework, nuanced understanding | M.Div. graduate | 60-80% |
| Hard | Specialist knowledge, original languages, contested interpretations | Ph.D./faculty level | 40-70% |
| Expert | Multi-hop reasoning, Google-proof, adversarial design | Specialist scholars only | 30-50% |
Difficulty calibration validated by:
- Expert rating (3 theologians per case)
- Empirical difficulty from pilot testing
- Item response theory analysis (planned for v2.0)
5.1.3 Public/Held-Out Split
| Set | Percentage | Purpose |
|---|---|---|
| Public | 50% | Transparency, reproducibility, model development |
| Held-Out | 50% | Prevent data contamination, detect gaming |
Held-out cases are rotated semi-annually to prevent leakage while maintaining evaluation validity.
5.1.4 Expert-Level Design Principles
Expert questions are designed using principles from high-discrimination benchmarks (GPQA, Humanity's Last Exam, MMLU-Pro):
Note
Expert-level questions target 30-50% accuracy on frontier models. Questions scoring >70% are rejected as too easy; questions scoring <20% are rejected as potentially ambiguous.
Design Criteria:
| Principle | Description | Example |
|---|---|---|
| Multi-hop reasoning | Require synthesizing 3+ distinct facts | "Compare Athanasius and Arius on Col 1:15, explain grammatical argument, and trace to Nicaea" |
| Google-proof | Cannot be solved by simple web search | Questions requiring synthesis across multiple scholarly sources |
| Adversarial distractors | Plausible wrong answers from common misconceptions | Misattributed patristic quotes, conflated tradition positions |
| Intra-tradition precision | Distinguish within traditions, not just between | Supralapsarian vs infralapsarian, Thomist vs Molinist |
| Abstention testing | Some questions where "insufficient evidence" is correct | "What was Origen's final position on X?" where evidence is fragmentary |
Question Categories by Section:
- Textual: Hapax legomena, manuscript variants with theological stakes, grammatical ambiguity
- Hermeneutical: Genre disputes, sensus plenior debates, typology vs allegory boundaries
- Doctrinal: Intra-tradition debates (supra/infra, Thomist/Molinist, essence-energies)
- Historical: Patristic attribution traps, council canon specifics, Reformation debate details
- Apologetics: Modal logic arguments, grounding objections, internal critiques
- Intertextual: Second Temple interpretation, MT vs LXX usage patterns, composite quotations
Validation Protocol:
Expert questions undergo LLM pre-screening before inclusion (see the sketch after this list):
- Test against 3 frontier models (GPT-5, Claude Opus, Gemini Pro)
- Reject if ANY model scores >70% (too easy)
- Reject if ALL models score <20% (possibly ambiguous)
- Target: 30-50% accuracy range for maximum discrimination
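A minimal sketch of this filter under the stated thresholds; the function name and the 0–1 accuracy representation are our assumptions for illustration, not the production screening code.

```typescript
// Sketch of the expert-question pre-screening filter (Section 5.1.4):
// reject if ANY frontier model scores above 70%, or if ALL score below 20%.
// Accuracies are assumed to be 0-1 fractions, one per frontier model tested.
function screenExpertQuestion(
  accuracies: number[]
): "accept" | "reject_too_easy" | "reject_ambiguous" {
  if (accuracies.some((a) => a > 0.7)) return "reject_too_easy";
  if (accuracies.every((a) => a < 0.2)) return "reject_ambiguous";
  return "accept";
}

// Example: screenExpertQuestion([0.45, 0.30, 0.50]) -> "accept"
```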
5.1.5 Difficulty Distribution Rationale
Note
Initial benchmark development included balanced sampling across all difficulty levels. Pilot testing revealed ceiling effects on Easy and Medium questions—frontier models achieved >90% accuracy, providing minimal discriminative power between models.
To optimize inference costs and benchmark utility, Easy and Medium questions were deactivated for scoring. The current active test set emphasizes:
- Hard questions (~60%): Ph.D./faculty-level content where models show meaningful variation
- Expert questions (~40%): Multi-hop reasoning and adversarial design for maximum discrimination
This approach follows psychometric best practices: items that all test-takers answer correctly (or incorrectly) contribute no information about relative ability. By focusing on Hard and Expert tiers, FaithBench maximizes the signal-to-noise ratio while reducing evaluation costs.
5.2 Model Configuration
Important
All models are evaluated with temperature 0.7 and max tokens 2,000 via OpenRouter. Reasoning/thinking is disabled where configurable. These settings prioritize natural response variation while maintaining controlled comparison. See Section 9 for full rationale and caveats.
5.2.1 Evaluation Parameters
| Parameter | Setting | Rationale |
|---|---|---|
| Temperature | 0.7 | Balances determinism with natural variation |
| Max Tokens | 2,000 | Sufficient for theological responses without truncation |
| Reasoning/Thinking | Disabled (where applicable) | Controlled comparison of base knowledge (see Section 9) |
| System prompts | Minimal | Avoid biasing responses |
| Provider | OpenRouter | Unified API access to all models (see Section 5.5) |
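To make these settings concrete, the following is a minimal sketch of a test-model request through OpenRouter's OpenAI-compatible chat completions endpoint. Only the endpoint URL, temperature, and max tokens are documented production values (Sections 5.2.1 and 15.2); the function name and error handling are illustrative, not our production code.

```typescript
// Minimal sketch of a test-model call via OpenRouter (illustrative).
// Endpoint, temperature, and max_tokens mirror Sections 5.2.1 and 15.2.
async function queryTestModel(model: string, question: string): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model, // an OpenRouter model slug
      messages: [{ role: "user", content: question }], // minimal, question-only prompt
      temperature: 0.7, // balances determinism with natural variation
      max_tokens: 2000, // sufficient for theological responses without truncation
    }),
  });
  if (!res.ok) {
    throw Object.assign(new Error(`HTTP ${res.status}`), { status: res.status });
  }
  const data = await res.json();
  return data.choices[0].message.content;
}
```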
Important limitations:
- Reasoning models (e.g., o1, o3, Claude with extended thinking): Tested without explicit reasoning enabled. These models would likely score higher with thinking/reasoning turned on.
- Internal reasoning: Some models reason internally by default—we cannot control this and do not penalize it.
- Future work: We plan to add "thinking-enabled" variants to the leaderboard to show the performance delta.
5.3 LLM-as-Judge Protocol
5.3.1 Judge Model Selection
| Parameter | Value | Rationale |
|---|---|---|
| Primary Judge | google/gemini-3-flash-preview | Cost-effective, strong reasoning, no theological fine-tuning |
| Fallback Judge | openai/gpt-4o-mini | Reliability backup when primary is unavailable |
| Temperature | 0 | Deterministic for scoring consistency |
| Max Tokens | 16,000 | Prevents truncation of judge reasoning |
| Output Format | JSON (structured) | Machine-parseable dimension scores |
Selection criteria for primary judge:
- Strong performance on reasoning benchmarks
- No theological fine-tuning (reduces bias)
- Consistent rubric application in validation testing
- Cost-effectiveness for large-scale evaluation
The fallback judge activates automatically when the primary judge returns an API error. The judge model used for each evaluation is recorded in the test result metadata.
Planned secondary judge: OpenAI GPT-5 for cross-validation (see Section 5.3.2)
5.3.2 Self-Preference Bias Awareness
Important
Self-preference bias is a well-documented phenomenon where LLM judges rate their own outputs more favorably than other judges would. Research shows this can inflate self-evaluated scores by 5-15%.
The Problem
When the judge model evaluates its own outputs (e.g., Gemini judging Gemini responses), scores may be artificially inflated due to:
- Stylistic familiarity with own output patterns
- Implicit preference for own reasoning approaches
- Shared training data artifacts
Current State (v1.0)
We currently use a single judge model (Gemini 3 Flash) for all evaluations. This means:
- Gemini model scores may be inflated when Gemini judges its own outputs
- We acknowledge this limitation and flag it in results
- Users should interpret Gemini scores with this caveat in mind
Planned Mitigation (v2.0)
With additional funding, we plan to implement cross-validation:
- Primary judge: Gemini 3 Flash
- Secondary judge: GPT-5 (or similar non-Gemini model)
- Conflict handling: When evaluated model = judge model family, use secondary judge score
- Dual reporting: Show both scores for transparency
Illustrative example of planned dual reporting (hypothetical scores):
| Model Evaluated | Primary (Gemini) | Cross-Val (GPT-5) | Leaderboard Uses |
|---|---|---|---|
| Claude Opus | 0.94 | — | Primary |
| Gemini models | 0.92 | 0.88 | Cross-val |
| GPT-5 | — | 0.91 | Cross-val only |
Why This Matters
Self-preference bias is a known limitation of single-judge LLM evaluation. By documenting this limitation and our mitigation roadmap, we maintain methodological transparency while working toward a more robust evaluation protocol.
5.3.3 Judge Prompt: Exact System Prompt
The following is the exact system prompt sent to the judge model. This is copied directly from our production scoring code (packages/convex/convex/scoring.ts):
You are an expert theological evaluator assessing LLM responses
for the FaithBench benchmark.
Your role is to score responses on a 0-3 scale for each dimension.
SCORING SCALE:
0 = Inadequate: Incorrect, misleading, or fails to address
1 = Partial: Some accuracy but significant errors or omissions
2 = Good: Mostly accurate with minor gaps
3 = Excellent: Fully accurate, demonstrates depth
EVALUATION DIMENSIONS for {section}:
- {dimensionName} ({weight}%): {description}
- {dimensionName} ({weight}%): {description}
...
OUTPUT FORMAT:
Respond with valid JSON only:
{
"scores": [
{ "dimension": "<dimension_name>",
"rawScore": <0-3>,
"reasoning": "<brief justification>" }
]
}
Be strict but fair. Evaluate theological accuracy, not style or length.
The {section} placeholder is replaced with the test case's dimension (e.g., "textual", "doctrinal"), and each dimension's sub-dimensions and weights are injected from the rubric configuration.
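A sketch of how these placeholders might be filled from the rubric configuration; the helper name and types are assumptions for illustration, not the exact code in scoring.ts.

```typescript
// Illustrative sketch of filling the judge system prompt template from a
// rubric configuration. Helper name and types are assumptions, not the
// exact production code in scoring.ts.
interface SubDimension { name: string; weight: number; description: string }

function buildJudgeSystemPrompt(section: string, subDims: SubDimension[]): string {
  const dimensionLines = subDims
    .map((d) => `- ${d.name} (${d.weight}%): ${d.description}`)
    .join("\n");
  return [
    "You are an expert theological evaluator assessing LLM responses",
    "for the FaithBench benchmark.",
    // ... scoring scale and output format as shown above, omitted for brevity ...
    `EVALUATION DIMENSIONS for ${section}:`,
    dimensionLines,
    "Be strict but fair. Evaluate theological accuracy, not style or length.",
  ].join("\n");
}
```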
5.3.4 Judge Prompt: Evaluation Prompt Structure
The evaluation prompt sent as the user message follows this structure:
QUESTION:
{the original test case prompt}
MODEL RESPONSE:
{the model's response being evaluated}
REFERENCE ANSWER (for comparison):
{reference answer, when available}
TRADITION CONTEXT: {tradition}
Evaluate fidelity to this specific theological tradition.
Evaluate the model response and provide scores for each dimension.
The reference answer and tradition context blocks are included only when the test case provides them.
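A companion sketch for assembling the user message, with the optional blocks appended only when the test case provides them; the field names are assumptions for illustration.

```typescript
// Illustrative sketch of assembling the judge's user message. The reference
// answer and tradition blocks are included only when present.
interface EvalTestCase { prompt: string; referenceAnswer?: string; tradition?: string }

function buildEvaluationPrompt(testCase: EvalTestCase, modelResponse: string): string {
  const parts = [
    `QUESTION:\n${testCase.prompt}`,
    `MODEL RESPONSE:\n${modelResponse}`,
  ];
  if (testCase.referenceAnswer) {
    parts.push(`REFERENCE ANSWER (for comparison):\n${testCase.referenceAnswer}`);
  }
  if (testCase.tradition) {
    parts.push(
      `TRADITION CONTEXT: ${testCase.tradition}\n` +
        "Evaluate fidelity to this specific theological tradition."
    );
  }
  parts.push("Evaluate the model response and provide scores for each dimension.");
  return parts.join("\n\n");
}
```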
5.3.5 Error Handling
- Retry logic: 3 attempts with exponential backoff on API failures
- Consistency check: Flag responses where judge reasoning contradicts score
- Edge case routing: Ambiguous cases flagged for human review
5.4 Scoring Calculation
5.4.1 Per-Test-Case Scoring
For each test case, the judge returns raw scores (0–3) for each sub-dimension. The composite score is calculated as:
- Normalize: Each raw score is divided by 3 to produce a 0–1 normalized score
- Weight: Each normalized score is multiplied by its sub-dimension weight
- Sum: Weighted scores are summed to produce the test case's composite score (0–1)
For example, for a Textual Analysis test case:
composite = (lexicalAccuracy/3 × 0.35)
+ (translationFidelity/3 × 0.35)
+ (linguisticReasoning/3 × 0.20)
+ (sourceHandling/3 × 0.10)
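The same calculation in code form with a worked example; the weights are the Textual Analysis production values (Section 4.4.1), while the function shape and names are illustrative.

```typescript
// Sketch of per-test-case composite scoring: normalize each raw 0-3 score,
// weight it by its sub-dimension weight, and sum. Weights shown are the
// Textual Analysis production values from Section 4.4.1.
const TEXTUAL_WEIGHTS: Record<string, number> = {
  lexicalAccuracy: 0.35,
  translationFidelity: 0.35,
  linguisticReasoning: 0.2,
  sourceHandling: 0.1,
};

function compositeScore(
  rawScores: Record<string, number>, // judge's 0-3 scores keyed by sub-dimension
  weights: Record<string, number>
): number {
  return Object.entries(weights).reduce(
    (sum, [subDim, weight]) => sum + (rawScores[subDim] / 3) * weight,
    0
  );
}

// Worked example: raw scores of 3, 2, 2, 1 yield
// (3/3)(0.35) + (2/3)(0.35) + (2/3)(0.20) + (1/3)(0.10) = 0.75
compositeScore(
  { lexicalAccuracy: 3, translationFidelity: 2, linguisticReasoning: 2, sourceHandling: 1 },
  TEXTUAL_WEIGHTS
);
```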
Important
Score interpretation hierarchy: Dimension-level profiles are the primary unit of analysis. Tradition-specific scores are secondary. The overall composite score is tertiary—useful for leaderboard simplicity but potentially misleading as a single summary of multidimensional capabilities. Users should examine dimension breakdowns before drawing conclusions from composite scores.
5.4.2 Difficulty Weighting
Test cases are weighted by difficulty level when computing model-level aggregate scores:
| Difficulty | Weight | Rationale |
|---|---|---|
| Easy | 1.0 | Baseline competency (currently deactivated) |
| Medium | 1.5 | Intermediate competency (currently deactivated) |
| Hard | 2.0 | Expert-level content where models differentiate |
| Expert | 3.0 | Maximum discrimination items |
This means an Expert-level test case contributes 3× more to a model's aggregate score than an Easy-level case. The rationale: getting hard questions right is a stronger signal of theological competence than getting easy questions right.
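Read as a weighted mean, the aggregation can be sketched as follows; the weighted-mean formulation, type names, and function shape are our illustration of the rule above, not the exact production code.

```typescript
// Sketch of difficulty-weighted aggregation as a weighted mean: each case's
// composite score counts in proportion to its difficulty multiplier.
type Difficulty = "easy" | "medium" | "hard" | "expert";
const DIFFICULTY_WEIGHTS: Record<Difficulty, number> = {
  easy: 1.0,
  medium: 1.5,
  hard: 2.0,
  expert: 3.0,
};

function aggregateScore(
  results: { composite: number; difficulty: Difficulty }[]
): number {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const r of results) {
    const w = DIFFICULTY_WEIGHTS[r.difficulty];
    weightedSum += r.composite * w;
    totalWeight += w;
  }
  return totalWeight > 0 ? weightedSum / totalWeight : 0;
}
```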
5.5 Provider & Reproducibility
Warning
All models in FaithBench v1.0 are accessed via OpenRouter as a single API provider. This is a known limitation that affects reproducibility and score comparability.
5.5.1 Why OpenRouter
OpenRouter provides unified API access to 300+ models from different providers (OpenAI, Anthropic, Google, Meta, Mistral, etc.). For an early-stage benchmark, this offers practical advantages:
- Single API integration for all models
- Consistent request/response format
- Built-in cost tracking
5.5.2 Provider Variance
Epoch AI's analysis found that the same model served by different providers can produce different scores. Causes include:
- Quantization differences between providers
- Different inference engines and hardware
- Chat template handling differences
- Token limit and rate limit differences
This means FaithBench scores reflect model performance as served through OpenRouter, which may differ from scores obtained through direct API access or other providers. Our scores should be interpreted with this uncertainty in mind.
5.5.3 Reproducibility Plan
When funded, we plan to:
- Validate a subset of model scores via direct provider APIs (OpenAI, Anthropic, Google)
- Quantify and report the magnitude of provider variance for FaithBench specifically
- Select the provider configuration that produces scores closest to direct-API baselines
5.6 Error Handling & Retry Policy
5.6.1 Test Model Execution
Test model API calls are managed by a workpool with the following configuration:
| Parameter | Value | Rationale |
|---|---|---|
| Max parallelism | 50 concurrent actions | Stays within Convex Starter tier limits |
| Retry attempts | 3 | Standard resilience for API unreliability |
| Initial backoff | 1,000ms | Allows transient errors to clear |
| Backoff multiplier | 2× | Exponential: 1s → 2s → 4s |
| Max backoff | 60,000ms | Prevents excessive wait times |
5.6.2 Judge Model Execution
Judge model calls have an additional resilience layer:
- If the primary judge (google/gemini-3-flash-preview) returns an error, the system automatically falls back to the secondary judge (openai/gpt-4o-mini)
- If both judges fail after retries, the test case is marked as failed and excluded from scoring
5.6.3 Error Classification
| Error Type | Retryable | Handling |
|---|---|---|
| Rate limit (429) | Yes | Exponential backoff, respect Retry-After header |
| Server error (5xx) | Yes | Exponential backoff |
| Timeout (408) | Yes | Exponential backoff |
| Authentication (401) | No | Fail immediately, flag for operator review |
| Content filtered (403) | No | Record as filtered, exclude from scoring |
| Model not found (404) | No | Fail immediately |
| Invalid request (400) | No | Fail immediately |
| Parse failure | No | Record raw response, mark as judge error |
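A minimal sketch of this policy, combining the retry parameters from Section 5.6.1 with the classification above; the structure is illustrative and assumes errors carry an HTTP status code.

```typescript
// Sketch of the retry policy (Sections 5.6.1 and 5.6.3): up to 3 attempts
// with exponential backoff (1s -> 2s -> 4s, capped at 60s), retrying only
// on the error classes marked retryable above.
const RETRYABLE_STATUSES = new Set([408, 429, 500, 502, 503, 504]);

async function withRetry<T>(call: () => Promise<T>, maxAttempts = 3): Promise<T> {
  let backoffMs = 1000; // initial backoff
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      const status = (err as { status?: number }).status ?? 0;
      if (!RETRYABLE_STATUSES.has(status) || attempt >= maxAttempts) throw err;
      await new Promise((resolve) => setTimeout(resolve, backoffMs));
      backoffMs = Math.min(backoffMs * 2, 60_000); // 2x multiplier, 60s cap
    }
  }
}
```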
5.6.4 Failed Test Cases
Test cases that fail after all retries are:
- Recorded with error metadata (error type, attempts, final error message)
- Excluded from aggregate score calculations
- Reported in run summary statistics so users can see the failure rate
5.7 Statistical Methods
5.7.1 Bootstrap Confidence Intervals
We report 95% confidence intervals using bootstrap resampling (sketched below):
- Iterations: 1,000 bootstrap samples
- Method: Percentile method
- Stratification: By dimension to ensure representative sampling
Confidence intervals are reported for:
- Overall scores
- Dimension-level scores
- Tradition-specific scores
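The following sketch illustrates the percentile bootstrap described above; it is a simplified illustration (stratification by dimension omitted), not the production implementation.

```typescript
// Sketch of the percentile bootstrap: resample test-case composite scores
// with replacement, recompute the mean each time, and take the 2.5th and
// 97.5th percentiles of the resampled means.
function bootstrapCI(scores: number[], iterations = 1000): [number, number] {
  const means: number[] = [];
  for (let i = 0; i < iterations; i++) {
    let sum = 0;
    for (let j = 0; j < scores.length; j++) {
      sum += scores[Math.floor(Math.random() * scores.length)];
    }
    means.push(sum / scores.length);
  }
  means.sort((a, b) => a - b);
  const lower = means[Math.floor(0.025 * iterations)];
  const upper = means[Math.ceil(0.975 * iterations) - 1];
  return [lower, upper];
}
```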
5.7.2 Inter-Rater Reliability
We will calculate and report the following once human calibration is complete (two of these metrics are sketched below):
| Metric | Target | Interpretation |
|---|---|---|
| Krippendorff's α | ≥ 0.67 | Acceptable reliability (Krippendorff, 2004) |
| Cohen's κ | ≥ 0.61 | Substantial agreement (Landis & Koch, 1977) |
| % Exact Agreement | ≥ 70% | Practical reliability threshold |
IRR calculated between:
- LLM judge and human expert panel
- Multiple LLM judges (cross-validation)
- Multiple human experts (gold standard)
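The sketch below shows how two of these metrics could be computed on paired 0–3 scores; function names are illustrative, and Krippendorff's α (which also handles missing data and more than two raters) is omitted for brevity.

```typescript
// Sketch of two planned IRR metrics on the 0-3 scale: percent exact
// agreement and Cohen's kappa between two raters (e.g., LLM judge vs.
// expert majority vote).
function exactAgreement(a: number[], b: number[]): number {
  const matches = a.filter((score, i) => score === b[i]).length;
  return matches / a.length;
}

function cohensKappa(a: number[], b: number[], categories = [0, 1, 2, 3]): number {
  const n = a.length;
  const po = exactAgreement(a, b); // observed agreement
  let pe = 0; // expected agreement if raters were independent
  for (const c of categories) {
    const pA = a.filter((s) => s === c).length / n;
    const pB = b.filter((s) => s === c).length / n;
    pe += pA * pB;
  }
  return pe === 1 ? 1 : (po - pe) / (1 - pe);
}
```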
5.7.3 Multiple Comparison Correction
When comparing multiple models (see the sketch below):
- Bonferroni correction for family-wise error rate
- Report both corrected and uncorrected p-values
- Effect sizes (Cohen's d) for practical significance
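A minimal sketch of both adjustments, assuming paired samples of per-case scores for each model; names are illustrative.

```typescript
// Sketch of the multiple-comparison adjustments: Bonferroni-corrected
// p-values and Cohen's d (pooled standard deviation) for effect size.
function bonferroni(pValues: number[]): number[] {
  const m = pValues.length; // number of comparisons in the family
  return pValues.map((p) => Math.min(1, p * m));
}

function cohensD(x: number[], y: number[]): number {
  const mean = (v: number[]) => v.reduce((s, a) => s + a, 0) / v.length;
  const variance = (v: number[], mu: number) =>
    v.reduce((s, a) => s + (a - mu) ** 2, 0) / (v.length - 1);
  const mx = mean(x);
  const my = mean(y);
  const pooledSd = Math.sqrt(
    ((x.length - 1) * variance(x, mx) + (y.length - 1) * variance(y, my)) /
      (x.length + y.length - 2)
  );
  return (mx - my) / pooledSd;
}
```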
6. Validation
6.1 Human Expert Calibration
Warning
Human expert calibration is in progress. The protocol below describes our planned methodology. Current evaluations use LLM-as-judge scoring (cross-validation planned). IRR metrics will be published once calibration is complete.
6.1.1 Expert Panel Composition
| Tradition | Minimum Experts | Qualifications |
|---|---|---|
| Reformed | 3 | M.Div.+ from confessional seminary |
| Catholic | 3 | Advanced degree in Catholic theology |
| Orthodox | 3 | Theological training in Orthodox tradition |
| Evangelical | 3 | Faculty/pastoral experience |
| Pentecostal (planned) | 2 | Academic credentials + tradition familiarity |
6.1.2 Calibration Protocol
- Gold standard creation: Experts independently score 50 responses
- Adjudication: Disagreements resolved through discussion
- IRR calculation: Krippendorff's α between experts
- Judge calibration: Compare LLM judge to expert majority vote
- Threshold: LLM judge must achieve α ≥ 0.67 vs. human panel
6.1.3 Calibration Dataset
| Characteristic | Requirement |
|---|---|
| Size | 50 responses minimum per dimension |
| Stratification | By difficulty, tradition, model |
| Selection | Random + edge cases identified in pilot |
| Refresh | 25% new cases each evaluation cycle |
6.2 Bias Analysis
Warning
Bias testing protocols are implemented but not yet executed at scale. The targets below represent our validation criteria. Results will be published in a future methodology update.
6.2.1 Position Bias
Protocol:
- Run all pairwise comparisons in both orders (A vs B, B vs A)
- Calculate preference reversal rate
- Target: < 10% reversal rate
6.2.2 Verbosity Bias
Protocol (sketched below):
- Calculate Pearson correlation between response length and score
- Target: |r| < 0.3 (weak or no correlation)
- If violated, implement length-normalization
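A minimal sketch of the correlation check, assuming per-response lengths and composite scores are available; names are illustrative.

```typescript
// Sketch of the verbosity-bias check: Pearson correlation between response
// length and score. Per the protocol above, |r| >= 0.3 would trigger
// length-normalization.
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const mean = (v: number[]) => v.reduce((s, a) => s + a, 0) / n;
  const mx = mean(x);
  const my = mean(y);
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < n; i++) {
    num += (x[i] - mx) * (y[i] - my);
    dx += (x[i] - mx) ** 2;
    dy += (y[i] - my) ** 2;
  }
  return num / Math.sqrt(dx * dy);
}

// Usage (hypothetical): given per-response character counts and composite
// scores, flag verbosity bias if Math.abs(pearson(lengths, scores)) >= 0.3.
```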
6.2.3 Tradition Fairness
Protocol:
- Cross-tradition grading: Reformed experts grade Catholic responses and vice versa
- Measure systematic score differences
- Target: No tradition systematically scored > 0.5 points different by out-group experts
6.3 Sensitivity Analysis
6.3.1 Weight Perturbation
Test robustness by the following (see the sketch after this list):
- Varying dimension weights ± 5%
- Measuring rank-order stability across models
- Reporting sensitivity coefficients
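A minimal sketch of the perturbation loop; the jitter-and-renormalize scheme shown is one reasonable reading of the ±5% protocol, not the exact planned implementation.

```typescript
// Sketch of weight-perturbation sensitivity: jitter each dimension weight by
// up to +/-5%, renormalize so weights sum to 1, recompute composites, and
// check whether the model rank order changes.
function perturbWeights(weights: number[], delta = 0.05): number[] {
  const jittered = weights.map((w) => w * (1 + (Math.random() * 2 - 1) * delta));
  const total = jittered.reduce((s, w) => s + w, 0);
  return jittered.map((w) => w / total); // renormalize
}

function rankOrder(scoresByModel: Record<string, number>): string[] {
  return Object.keys(scoresByModel).sort(
    (a, b) => scoresByModel[b] - scoresByModel[a]
  );
}

// Repeat over many perturbations and report how often any model pair swaps
// positions relative to the unperturbed ranking (rank-order stability).
```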
6.3.2 Judge Model Comparison
Cross-validate using:
- Multiple judge models (Claude, GPT-5, Gemini)
- Report agreement rates
- Flag dimensions with high inter-judge variance
6.4 Validation Status (v1.0)
| Component | Status | Notes |
|---|---|---|
| Test cases | ✅ Complete | 413 cases across 6 dimensions |
| Difficulty calibration | ✅ Complete | Easy/Medium deactivated due to ceiling effects |
| LLM judge | ✅ Operational | Gemini 3 Flash (single judge; cross-validation planned) |
| Bootstrap CI | ✅ Operational | 1000 iterations, 95% confidence |
| Human calibration | 🔄 In progress | Expert recruitment underway |
| IRR metrics | ⏳ Planned | Requires human calibration completion |
| Bias testing | ⏳ Planned | Tooling complete, awaiting scale data |
| Delphi weights | ⏳ Planned | Scheduled for v2.0 |
6.5 Maturity Classification
To clearly distinguish what is conceptual, implemented, and validated, we provide the following maturity classification:
| Component | Conceptual | Implemented | Validated |
|---|---|---|---|
| Construct definition (6 dimensions) | — | Yes | Pending expert review |
| Test case corpus (413 cases) | — | Yes | LLM-screened, not human-validated |
| Scoring rubrics with exemplars | — | Yes | Provisional (no IRR data) |
| LLM-as-judge scoring | — | Yes | Operational, not calibrated against humans |
| Bootstrap confidence intervals | — | Yes | Captures sampling uncertainty only |
| Difficulty weighting | — | Yes | Empirically adjusted, not IRT-calibrated |
| Dimension weights | — | Yes | Designer priors, not Delphi-validated |
| Human expert calibration | Yes | In progress | No |
| Cross-validation judge | Yes | No | No |
| IRT difficulty calibration | Yes | No | No |
| Delphi weight derivation | Yes | No | No |
| Bias testing (position/verbosity/tradition) | Yes | Tooling ready | No |
A reader should never have to guess whether a component of FaithBench is live, partial, or aspirational. This table is the definitive reference.
7. Traditions Evaluated
Reformed
Doctrinal Framework: Covenant theology, TULIP, Westminster Standards
Key Competencies Tested:
- Sola scriptura application
- Predestination and election
- Covenant of grace structure
- Perseverance of the saints
- Reformed hermeneutics
Authoritative Sources: Westminster Confession, Heidelberg Catechism, Canons of Dort, Three Forms of Unity
8. Current Limitations
Warning
FaithBench v1.0 has significant methodological limitations that users should weigh when interpreting results. We document these transparently as part of our commitment to scientific integrity.
8.1 Single LLM Judge Without Human Inter-Rater Reliability
All scoring in v1.0 is performed by a single LLM judge (Gemini 3 Flash Preview). No human inter-rater reliability (IRR) data has been collected yet. This means:
- We cannot currently quantify how well our automated scores align with expert human judgment
- The reliability of scores across dimensions is assumed but not empirically validated
- Dimensions requiring deep theological nuance (e.g., doctrinal precision, hermeneutical reasoning) may be scored less reliably than more factual dimensions (e.g., historical accuracy)
Human expert calibration is in progress (see Section 6.1), but until IRR metrics are published, all scores should be treated as preliminary automated assessments.
8.2 Self-Preference Bias Unmitigated
Self-preference bias—where LLM judges rate outputs from their own model family more favorably—is documented in the literature at 5–15% inflation (see Section 5.3.2). In v1.0:
- Gemini model scores may be inflated when judged by Gemini 3 Flash
- No cross-validation judge is currently operational (GPT-5 cross-validation is planned)
- We flag this limitation but do not yet adjust scores to compensate
Users should interpret Gemini-family model scores with particular caution until cross-validation is implemented.
8.3 Single Provider (OpenRouter)
All models are accessed through OpenRouter. Epoch AI documents that provider choice can significantly affect scores due to quantization, inference engine differences, and chat template handling (see Section 5.5). This means:
- Our scores reflect performance as served through OpenRouter, not necessarily through direct provider APIs
- The magnitude of provider variance for FaithBench specifically is unquantified
- Scores are not directly comparable to evaluations run through other providers
8.4 Single-Run Scores
Each model is evaluated once per benchmark run. Epoch AI recommends 4–32 runs as standard practice. Single-run evaluation means:
- Run-to-run variance is unquantified
- Even with temperature 0.7 for test models and temperature 0 for the judge, some variance exists from provider-side factors
- Our bootstrap CIs capture sampling uncertainty (which test cases) but not run-to-run uncertainty
We plan to implement multi-run averaging (minimum 4 runs per model) when funded.
8.5 Bootstrap CIs Do Not Account for Judge Error
Our bootstrap CIs (1,000 iterations, 95% confidence) quantify sampling uncertainty—the variability from which test cases a model happens to encounter. They do not account for:
- Systematic judge error (if the LLM judge consistently misjudges a category)
- Judge variance (if the same response would receive different scores on re-evaluation)
- Construct validity uncertainty (whether our rubrics capture what we intend to measure)
The reported CIs are therefore narrower than the true uncertainty and should be interpreted as lower bounds on it.
8.6 Expert-Item Selection Circularity
Expert-level questions are pre-screened against frontier LLMs (Section 5.1.4): questions where any model scores >70% are rejected as too easy. This creates a selection effect:
- The expert tier is defined partly by what current models get wrong
- This may conflate "genuinely hard theological content" with "content that happens to confuse current LLMs"
- As models improve, items may need recalibration—but the original selection bias persists in historical comparisons
We plan to address this with Item Response Theory (IRT) calibration in v2.0, which would provide model-independent difficulty estimates.
8.7 Held-Out Rotation Without Equating
Held-out test cases rotate semi-annually to prevent data contamination (Section 5.1.3). However:
- No test equating procedure is currently applied across rotations
- Scores from different rotation periods may not be directly comparable
- A model's score could change between periods due to item difficulty shifts, not actual capability changes
Planned mitigation: anchor items (a fixed subset of questions present in all rotations) to enable cross-period equating.
9. Testing Configuration
9.1 Rationale: Controlled Comparison
Important
FaithBench testing parameters are chosen for controlled comparison, not to simulate how real users interact with AI models. Real-world performance may differ significantly.
Our testing configuration prioritizes controlled comparison:
| Parameter | Setting | Why This Setting | Real-World Difference |
|---|---|---|---|
| Temperature | 0.7 | Balances reproducibility with natural variation | Users may use different temperatures; some applications use 0 or 1.0 |
| Reasoning/Thinking | Disabled | Controls for reasoning capability differences; tests base knowledge | Many users enable reasoning modes; models with thinking enabled would likely score higher |
| System prompts | Minimal | Avoids biasing responses toward any tradition | Real applications often include system prompts that improve domain performance |
| Max tokens | 2,000 | Sufficient for theological responses | Some applications constrain or expand response length |
What this means for users: FaithBench scores represent a controlled baseline of model capability. Models with reasoning enabled, appropriate system prompts, or retrieval-augmented generation (RAG) may perform significantly better in practice. Our scores should not be interpreted as "what this model will do when you ask it a theological question"—they measure base theological knowledge under standardized conditions.
We plan to add "thinking-enabled" leaderboard variants to show the performance delta between base and reasoning-augmented configurations.
10. Public Data Access
10.1 What's Available
Signed-in users can view the following for each benchmark run:
- Public test cases: The question prompts for the public 50% of the test set
- Model responses: The full text each model generated
- Judge reasoning: The LLM judge's dimension-by-dimension scores and justifications
- Aggregate scores: Overall and per-dimension scores with confidence intervals
10.2 What's Held Back (and Why)
The held-out 50% of the test set is not publicly visible. This is standard practice in benchmarking:
- MMLU maintains held-out test sets to prevent training data contamination
- GPQA withholds questions to maintain benchmark validity
- SWE-bench uses private test instances
If prompts and reference answers are fully public, model developers can (intentionally or inadvertently) train on them, rendering the benchmark meaningless. The public/held-out split balances transparency with benchmark integrity.
10.3 Academic Access
Researchers seeking access to the full test set for academic purposes may contact us at hello@faithbench.com. We will evaluate requests based on:
- Affiliation with a recognized research institution
- Clear research purpose
- Agreement not to use test cases for model training
11. Benchmarking Is Hard
Note
This section contextualizes FaithBench's limitations within the broader benchmarking ecosystem. We believe honest positioning builds more trust than overstated claims.
11.1 The State of AI Benchmarking
Epoch AI's "Why Benchmarking Is Hard" documents pervasive challenges that affect all AI benchmarks:
- Prompt sensitivity: GPQA-Diamond scores range 74%–80% depending on prompt template alone
- Provider variance: The same model served by different providers produces different scores
- Temperature inconsistency: Different organizations use temperatures ranging from 0.0 to 1.0
- LLM-as-judge effects: Judge model choice has "sizable impact" on results
- Single-run insufficiency: Standard practice is 4–32 runs averaged; single runs are considered insufficient
As Epoch AI puts it: "Basically everyone is doing their own thing."
11.2 Where FaithBench Stands
FaithBench is an early-stage benchmark. We are not yet at the rigor level of established benchmarks like MMLU or GPQA, and we don't claim to be. Our primary sources of uncertainty are:
- Judge variance: Single LLM judge, potential self-preference bias
- Provider variance: OpenRouter-mediated access, unquantified provider effects
- Run-to-run variance: Single-run evaluation
- Validation gap: No human IRR data yet
What we have done is make all of this transparent—including our exact judge prompts, rubric weights, model configs, and error handling. We believe transparency about limitations is more valuable than hiding them behind confident-sounding scores.
11.3 Our Hardening Roadmap
We have a phased plan to address these limitations:
- Phase 1 (current): Full transparency and honest disclosure of all artifacts and limitations
- Phase 2 (~$500 funding): Multi-judge evaluation, multi-run averaging, provider diversification, prompt sensitivity analysis
- Phase 3 (~$2,000+ funding): Human inter-rater reliability, IRT calibration, bias auditing, full reproducibility package
Phase 1 makes our limitations transparent. Phase 2 quantifies them. Phase 3 mitigates them.
12. Development Status
The following table summarizes what is operational in v1.0 versus what is planned for future versions:
| Component | Status | Version | Notes |
|---|---|---|---|
| Test case corpus (413 cases) | ✅ Done | v1.0 | 6 dimensions, 4 difficulty levels |
| Difficulty calibration | ✅ Done | v1.0 | Easy/Medium deactivated due to ceiling effects |
| LLM-as-judge scoring | ✅ Done | v1.0 | Single judge (Gemini 3 Flash Preview) |
| Bootstrap confidence intervals | ✅ Done | v1.0 | 1,000 iterations, 95% CI |
| Scoring rubrics with exemplars | ✅ Done | v1.0 | All 6 dimensions documented |
| Judge prompt published | ✅ Done | v1.0 | Exact system prompt on this page |
| Rubric weights published | ✅ Done | v1.0 | All sub-dimension weights on this page |
| Model execution config published | ✅ Done | v1.0 | Temperature, max tokens, provider |
| Error handling documented | ✅ Done | v1.0 | Retry policy, fallback judge, error classification |
| Provider limitation disclosed | ✅ Done | v1.0 | OpenRouter-only, Epoch AI citation |
| Scoring calculation documented | ✅ Done | v1.0 | Normalization, weighting, difficulty multipliers |
| Hardening roadmap published | ✅ Done | v1.0 | Phased plan with funding requirements |
| Open evaluation code | ✅ Done | v1.0 | MIT license |
| Cross-validation judge | ⏳ Planned | v2.0 | GPT-5 as secondary judge |
| Multi-run averaging | ⏳ Planned | v2.0 | Minimum 4 runs per model |
| Provider diversification | ⏳ Planned | v2.0 | Direct API validation for subset |
| Human expert calibration | 🔄 In progress | v2.0 | Expert recruitment underway |
| Inter-rater reliability metrics | ⏳ Planned | v2.0 | Requires human calibration |
| Position/verbosity/tradition bias testing | ⏳ Planned | v2.0 | Tooling complete, awaiting scale data |
| Delphi expert weight derivation | ⏳ Planned | v2.0 | 5+ theologians per tradition |
| Item Response Theory calibration | ⏳ Planned | v2.0 | Model-independent difficulty estimates |
| Test equating across rotations | ⏳ Planned | v2.0 | Anchor items for cross-period comparison |
| Thinking-enabled leaderboard | ⏳ Planned | v2.0 | Reasoning mode performance delta |
| Non-Western tradition expansion | ⏳ Planned | v2.0+ | African, Asian, Latin American frameworks |
| Multi-lingual evaluation | ⏳ Planned | v2.0+ | German, Spanish, Korean theological discourse |
| Pentecostal tradition | 🔄 In progress | v2.0 | Expert recruitment and test case development |
13. Tradition Scope and Definitions
13.1 Evangelical Tradition Boundaries
FaithBench's "Evangelical" tradition category follows the Bebbington Quadrilateral (Bebbington, 1989), which defines Evangelicalism by four characteristics:
- Biblicism: High regard for biblical authority (operationalized via the Chicago Statement on Biblical Inerrancy)
- Crucicentrism: Focus on Christ's atoning work on the cross (substitutionary atonement emphasis)
- Conversionism: Emphasis on personal conversion experience ("born again")
- Activism: Commitment to evangelism and missionary effort (Great Commission emphasis)
What is included: Conservative evangelical positions on Scripture, atonement, conversion, and mission as represented in the Chicago Statement, evangelical systematic theologies (e.g., Grudem, Erickson), and major evangelical confessional documents.
What is excluded: Mainline Protestant positions, progressive evangelical positions, and prosperity gospel teachings. These may be evaluated under separate tradition categories in future versions.
Relationship to Reformed: There is significant overlap between Evangelical and Reformed categories. The distinction: Reformed questions test confessional precision (Westminster Standards, covenant theology, TULIP) while Evangelical questions test broader evangelical distinctives that cross confessional lines.
13.2 Operational Definitions
"Google-Proof" Questions
A question is "Google-proof" when it cannot be answered correctly by performing a single web search and reading the top results. Operationally, this means:
- The answer requires synthesizing information from 3+ distinct sources that do not appear together on any single web page
- Simple keyword searches return partial or misleading information
- Correct answers require domain expertise to evaluate and integrate conflicting sources
This is modeled on the GPQA benchmark (Rein et al., 2023), which demonstrated that non-expert validators with unrestricted web access still scored only 34% on questions experts answered at 65%.
"Multi-Hop" Reasoning
A question requires "multi-hop" reasoning when answering correctly requires:
- Retrieving multiple distinct facts from different areas of theological knowledge
- Connecting those facts through logical or theological reasoning
- Synthesizing a conclusion that is not explicitly stated in any single source
Example: "Compare Athanasius's and Arius's interpretations of Colossians 1:15, explain the grammatical argument for each reading, and trace how this debate influenced the Nicene formulation."
This requires: (1) knowledge of the Arian controversy, (2) Greek grammar of the genitive construction, (3) patristic textual arguments, and (4) conciliar history—four distinct knowledge domains integrated into a single answer.
13.3 Tradition Category Asymmetry
The four tradition categories in v1.0 are not perfectly parallel analytic units:
| Tradition | Category Type | Scope |
|---|---|---|
| Catholic | Ecclesial tradition | Global communion with defined magisterial authority |
| Orthodox | Ecclesial tradition | Family of autocephalous churches with shared liturgical and patristic heritage |
| Reformed | Confessional tradition | Defined by specific confessional documents (Westminster, Dort, Belgic) |
| Evangelical | Sociological-theological movement | Bounded modern movement with fuzzy edges, defined operationally via Bebbington Quadrilateral |
This asymmetry is acknowledged and pragmatic for v1.0. Catholic and Orthodox represent broad ecclesial bodies; Reformed represents a confessional position within Protestantism; Evangelical represents a cross-denominational movement. These are not co-equal analytic units in a strict taxonomic sense.
We chose these four because they represent the major evaluative frameworks within Western Christianity where theological accuracy can be meaningfully assessed against identifiable standards. Future versions may refine this taxonomy as expert input and empirical results reveal where the current categories conflate meaningfully distinct positions or where finer distinctions are needed.
14. Scope Limitations
Important
FaithBench measures theological knowledge and reasoning accuracy. It does not measure spiritual wisdom, pastoral sensitivity, or appropriateness for ministry contexts.
14.1 Scope Limitations
What FaithBench measures:
- Factual accuracy about theological positions
- Reasoning within established frameworks
- Faithful representation of denominational views
- Biblical language competency
What FaithBench does not measure:
- Spiritual edification value
- Pastoral appropriateness
- Alignment with any tradition's values
- Suitability for worship or counseling
14.2 Methodological Limitations
| Limitation | Impact | Mitigation |
|---|---|---|
| Western Christian focus | Non-Western traditions underrepresented | Planned expansion in v2.0 |
| English-language evaluation | May miss non-English theological nuance | Original language competency tested separately |
| LLM judge variability | Scoring inconsistency possible | IRR monitoring, human calibration |
| Weight subjectivity | Dimension weights influence rankings | Sensitivity analysis, expert Delphi planned |
| Sample size | Statistical power limits | Bootstrap CI to quantify uncertainty |
| Expert question coverage | Initial set focuses on Western Christian traditions | Expand to cover more traditions in v2.0 |
| Reasoning/thinking disabled | Reasoning models may underperform vs. their potential | Planned thinking-enabled leaderboard variants |
14.3 Known Biases
| Bias Type | Current Status | Planned Mitigation |
|---|---|---|
| Position bias | Untested | Position reversal protocol |
| Verbosity bias | Untested | Length correlation analysis |
| Tradition bias | Untested | Cross-tradition expert grading |
| Difficulty confounding | Partial mitigation | IRT difficulty calibration |
14.4 Generalizability
Results may not generalize to:
- Non-Christian religious traditions
- Non-academic theological contexts
- Languages other than English
- Highly specialized sub-fields (e.g., Syriac patristics)
15. Technical Configuration Reference
This section provides a complete reference of all production configuration values. These are the exact values used in scoring—not aspirational or planned values.
15.1 Judge Configuration
| Parameter | Value |
|---|---|
| Primary judge model | google/gemini-3-flash-preview |
| Fallback judge model | openai/gpt-4o-mini |
| Judge temperature | 0 |
| Judge max tokens | 16,000 |
| Output format | JSON (response_format: { type: "json_object" }) |
| Fallback trigger | Primary judge HTTP error |
15.2 Test Model Configuration
| Parameter | Value |
|---|---|
| Temperature | 0.7 |
| Max tokens | 2,000 |
| Provider | OpenRouter (https://openrouter.ai/api/v1/chat/completions) |
| System prompt | Minimal (question-only) |
15.3 Execution Configuration
| Parameter | Value |
|---|---|
| Max parallelism | 50 concurrent actions |
| Retry attempts | 3 |
| Initial backoff | 1,000ms |
| Backoff multiplier | 2× (1s → 2s → 4s) |
| Max backoff | 60,000ms |
15.4 Statistical Configuration
| Parameter | Value |
|---|---|
| Bootstrap iterations | 1,000 |
| Confidence level | 95% |
| Difficulty weights | Easy: 1.0, Medium: 1.5, Hard: 2.0, Expert: 3.0 |
15.5 Scoring Rubric Weights (Production Values)
These are the exact sub-dimension weights from packages/convex/convex/scoring.ts:
Textual Analysis: lexicalAccuracy (35%), translationFidelity (35%), linguisticReasoning (20%), sourceHandling (10%)
Hermeneutical Reasoning: interpretiveMethod (35%), genreAwareness (25%), contextualAnalysis (25%), canonicalIntegration (15%)
Doctrinal Precision: doctrinalAccuracy (35%), traditionFidelity (35%), nuanceRecognition (20%), sourceGrounding (10%)
Historical Theology: historicalAccuracy (40%), developmentAwareness (30%), patristicKnowledge (20%), historiographicalMethod (10%)
Apologetics: logicalValidity (35%), evidenceUsage (30%), objectionHandling (25%), persuasiveClarity (10%)
Intertextual Reasoning: crossReferenceAccuracy (40%), typologicalRecognition (30%), allusionDetection (20%), thematicIntegration (10%)
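For reference, the same values expressed as a typed record; the identifiers match Section 15.5, but the surrounding structure is a sketch of one possible representation, not scoring.ts itself.

```typescript
// Production sub-dimension weights (Section 15.5) as a typed record.
// Values are the documented production weights; the structure is illustrative.
const RUBRIC_WEIGHTS: Record<string, Record<string, number>> = {
  textual: { lexicalAccuracy: 0.35, translationFidelity: 0.35, linguisticReasoning: 0.2, sourceHandling: 0.1 },
  hermeneutical: { interpretiveMethod: 0.35, genreAwareness: 0.25, contextualAnalysis: 0.25, canonicalIntegration: 0.15 },
  doctrinal: { doctrinalAccuracy: 0.35, traditionFidelity: 0.35, nuanceRecognition: 0.2, sourceGrounding: 0.1 },
  historical: { historicalAccuracy: 0.4, developmentAwareness: 0.3, patristicKnowledge: 0.2, historiographicalMethod: 0.1 },
  apologetics: { logicalValidity: 0.35, evidenceUsage: 0.3, objectionHandling: 0.25, persuasiveClarity: 0.1 },
  intertextual: { crossReferenceAccuracy: 0.4, typologicalRecognition: 0.3, allusionDetection: 0.2, thematicIntegration: 0.1 },
};
```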
16. Reproducibility Statement
16.1 Open Resources
| Resource | Availability |
|---|---|
| Public test cases (50%) | Viewable by signed-in users |
| Held-out test cases (50%) | Available to researchers on request (hello@faithbench.com) |
| Evaluation code | MIT license |
| Judge prompts | Published on this page (Section 5.3.3) |
| Scoring rubrics | Published on this page (Sections 4.4, 15.5) |
| Model configurations | Published on this page (Section 15) |
| Hardening roadmap | Published in repository (docs/roadmap-benchmarking-hardening.md) |
16.2 Version Control
- Evaluation methodology versioned (current: v1.0)
- Test case sets versioned with changelogs
- Results include methodology version for reproducibility
17. Future Work
- Multi-judge evaluation: Add GPT-5 as secondary judge, report inter-judge agreement
- Multi-run averaging: 4+ runs per model to quantify run-to-run variance
- Provider diversification: Validate scores via direct provider APIs
- Human IRR: Krippendorff's α between LLM judge and theologian panel
- Delphi weight derivation: Formal expert consensus methodology
- Non-Western traditions: African, Asian, Latin American theological frameworks
- Item response theory: Difficulty calibration using IRT models
- Multi-lingual evaluation: German, Spanish, Korean theological discourse
- Fine-grained sub-traditions: Distinguish within broad categories (e.g., Old Princeton vs. Dutch Reformed)
- Temporal analysis: Track model improvement over time
- Thinking-enabled leaderboard: Show performance delta with reasoning modes
References
Abid, A., Farooqi, M., & Zou, J. (2021). Persistent anti-Muslim bias in large language models. AIES.
Bebbington, D. W. (1989). Evangelicalism in Modern Britain: A History from the 1730s to the 1980s. Unwin Hyman.
Bowman, S. R., & Dahl, G. E. (2021). What will it take to fix benchmarking in NLP? NAACL.
Epoch AI. (2025). Why benchmarking is hard. Epoch AI Substack. https://epochai.substack.com/p/why-benchmarking-is-hard
Gwet, K. L. (2014). Handbook of Inter-Rater Reliability (4th ed.). Advanced Analytics.
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). Measuring massive multitask language understanding. ICLR.
Krippendorff, K. (2004). Content Analysis: An Introduction to Its Methodology (2nd ed.). Sage.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.
Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., ... & Koreeda, Y. (2023). Holistic evaluation of language models. TMLR.
Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., & Zhu, C. (2024). G-Eval: NLG evaluation using GPT-4 with better human alignment. EMNLP.
Raji, I. D., Bender, E. M., Paullada, A., Denton, E., & Hanna, A. (2021). AI and the everything in the whole wide world benchmark. NeurIPS Datasets and Benchmarks.
Rein, D., Hou, B., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., & Bowman, S. R. (2023). GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint. https://arxiv.org/abs/2311.12022
Zheng, L., Chiang, W. L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., ... & Stoica, I. (2024). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. NeurIPS.
Citation
@misc{faithbench2026,
title={FaithBench: Toward Tradition-Aware Evaluation of Theological
Faithfulness in Large Language Models},
author={FaithBench Team},
year={2026},
url={https://faithbench.com},
note={Methodology v1.0}
}
Contributing
We welcome contributions from:
- Theologians: Test case development, rubric refinement, expert calibration
- AI Researchers: Evaluation methodology, statistical analysis, bias testing
- Practitioners: Real-world use case scenarios, pastoral context feedback
Contact: hello@faithbench.com