Research Preview: FaithBench is an active research project. All scores are preliminary, generated by a single AI judge, and pending human validation. See our Limitations section for details.
announcement · benchmark · methodology · theology · February 23, 2026

Introducing FaithBench: Measuring What AI Gets Wrong About Theology

By FaithBench Research

AI models score 90%+ on general knowledge benchmarks but collapse on theological questions. FaithBench is the first dedicated benchmark framework for evaluating AI theological accuracy across Christian traditions.


Ask GPT-5 to explain quantum mechanics and you'll get a competent answer. Ask it to distinguish the Reformed doctrine of definite atonement from the Arminian position on unlimited atonement—using the actual confessional sources—and things fall apart fast.

The Gloo FAI-C benchmark found that "Faith" scored lowest among all seven evaluation categories, averaging just 48/100 across frontier models. Not math. Not coding. Faith.

This isn't a niche problem. AI tools are already being used in seminary education, pastoral counseling apps, and Bible study platforms. Millions of people ask AI theological questions every day. The answers they get are often confident, articulate, and wrong in ways that matter.

The Problem: AI Flattens Theological Diversity

When models encounter theological questions, they default to a predictable set of failure modes:

Generic collapse. Ask about the Eucharist and you'll get "Christians believe communion is important" instead of the sharp distinctions between transubstantiation, real presence, and memorialism that actually define denominational identity.

Denominational conflation. Models routinely treat Reformed and Arminian positions as interchangeable, merge Catholic and Orthodox perspectives on papal authority, and present contested issues as settled consensus.

Scriptural mishandling. Inaccurate citation, decontextualized proof-texting, and inability to engage with the original Hebrew and Greek texts at the level seminary graduates would expect.

These aren't random errors. They're systematic—and they stem from training data that overrepresents popular-level summaries and underrepresents primary theological sources.

What FaithBench Is

FaithBench is a benchmark that measures how accurately AI models represent theological knowledge across Christian traditions. Not how "spiritual" the answers feel. Not how pastorally appropriate. How accurate.

We evaluate across six dimensions, weighted as follows:

  • Textual Analysis (25/100). Biblical language competency: Greek and Hebrew lexical accuracy, morphological analysis, translation evaluation.
  • Hermeneutical Reasoning (20/100). Interpretive methodology: genre awareness, contextual analysis, canonical integration.
  • Doctrinal Precision (20/100). Systematic theology: tradition-specific doctrinal positions, creedal formulations, denominational distinctives.
  • Historical Theology (15/100). Church history: patristic sources, Reformation debates, doctrinal development.
  • Apologetics (10/100). Philosophical theology: logical validity, evidence usage, objection handling.
  • Intertextual Reasoning (10/100). Canonical connections: cross-references, typological recognition, allusion detection.
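
To make the weighting concrete, here is a minimal Python sketch of how per-dimension scores (each on a 0-100 scale) could roll up into a composite. The weights come from the rubric above; the function and field names are illustrative, not our published scoring code.

# Illustrative sketch only: combine per-dimension scores (0-100) into a
# composite using the rubric weights above. Names are hypothetical, not
# the actual FaithBench schema or scoring code.
RUBRIC_WEIGHTS = {
    "textual_analysis": 25,
    "hermeneutical_reasoning": 20,
    "doctrinal_precision": 20,
    "historical_theology": 15,
    "apologetics": 10,
    "intertextual_reasoning": 10,
}  # weights sum to 100


def composite_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores, each on a 0-100 scale."""
    total_weight = sum(RUBRIC_WEIGHTS.values())
    weighted = sum(RUBRIC_WEIGHTS[d] * dimension_scores[d] for d in RUBRIC_WEIGHTS)
    return weighted / total_weight


# Example: strong on doctrine and history, weak on biblical languages.
print(composite_score({
    "textual_analysis": 40,
    "hermeneutical_reasoning": 55,
    "doctrinal_precision": 70,
    "historical_theology": 65,
    "apologetics": 60,
    "intertextual_reasoning": 50,
}))  # -> 55.75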

How It Works

Tradition-aware evaluation. A correct Catholic answer may be incorrect from a Reformed perspective. We evaluate models within specific traditions—Reformed, Catholic, Orthodox, Evangelical—because theological accuracy is tradition-relative. Generic responses that avoid specificity fail every tradition.

Difficulty tiers. Test cases range from seminary-introductory to specialist-scholar level. Easy and medium questions showed ceiling effects (models scored 90%+), so active evaluation focuses on hard and expert tiers where models actually differentiate.
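
In practice this means every test case carries its tradition, dimension, and tier so that grading stays tradition-relative. The sketch below shows one hypothetical shape for such a record; the field names are illustrative rather than our actual test-case schema.

# Hypothetical test-case record: field names are illustrative, not the
# actual FaithBench schema.
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    EASY = "easy"          # seminary-introductory
    MEDIUM = "medium"
    HARD = "hard"
    EXPERT = "expert"      # specialist-scholar


@dataclass
class TestCase:
    """One benchmark item, tagged so the judge grades it against the right tradition."""
    question: str
    tradition: str                 # e.g. "Reformed", "Catholic", "Orthodox", "Evangelical"
    dimension: str                 # one of the six rubric dimensions
    tier: Tier
    reference_sources: list[str]   # confessional or primary sources the judge checks against


case = TestCase(
    question="How does the Westminster Confession frame the extent of the atonement?",
    tradition="Reformed",
    dimension="doctrinal_precision",
    tier=Tier.HARD,
    reference_sources=["Westminster Confession of Faith 8.5, 8.8"],
)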

Statistical rigor. Bootstrap confidence intervals (1,000 iterations, 95% CI) on every score. When two models' intervals overlap, we don't claim one is better.
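
For readers who want to see what that interval construction looks like, here is a minimal sketch of a percentile bootstrap over per-item scores. The technique is standard; the helper below is illustrative and is not our published scoring code.

# Percentile bootstrap CI for the mean of per-item scores (illustrative sketch).
import random


def bootstrap_ci(scores, iterations=1000, confidence=0.95, seed=0):
    """Resample per-item scores with replacement and return a percentile CI for the mean."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(iterations))
    lo_idx = int((1 - confidence) / 2 * iterations)
    hi_idx = int((1 + confidence) / 2 * iterations) - 1
    return means[lo_idx], means[hi_idx]


# If two models' intervals overlap, no "better than" claim is made.
model_a = bootstrap_ci([72, 64, 58, 81, 69, 55, 77, 60, 66, 74])
model_b = bootstrap_ci([70, 61, 63, 79, 65, 59, 73, 62, 68, 71])
print(model_a, model_b)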

Full transparency. Our judge prompts, rubric weights, model configurations, and scoring code are all published in our methodology. We document what we know, what we don't, and what we plan to improve.

What We've Found

The results are striking. Even the best models show significant gaps:

  • Expert-tier questions (specialist-scholar level) push all models below 60% accuracy. These are the questions that matter most for theological AI applications.
  • Orthodox theology is consistently the hardest for models. The essence-energies distinction, hesychasm, and Palamite theology expose a Western-centric training data bias.
  • Doctrinal precision varies dramatically by tradition. Models handle broad evangelical concepts better than confessional Reformed or Catholic nuances.
  • Biblical language competency separates the top models from the rest. Accurately engaging with Greek and Hebrew morphology is where most models break down.

The full results are on our leaderboard, with per-dimension and per-tradition breakdowns available to signed-in users.

For Practitioners

If you're building or evaluating AI tools for theological contexts:

  1. Don't trust benchmark headlines. A model scoring 85% "overall" may score 40% on the specific theological domain you care about. Check per-tradition and per-dimension scores.
  2. Test with your tradition's hardest questions. Easy questions don't differentiate models. The questions that matter are the ones where your tradition's distinctives are at stake.
  3. Browse our test cases. The public portion of our test set is browsable. Use these as a starting point for your own evaluation (a minimal harness sketch follows this list).
  4. RAG helps, but doesn't solve everything. Retrieval-augmented generation improves factual accuracy but doesn't fix hermeneutical reasoning or doctrinal precision.
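
A minimal harness for that kind of check might look like the sketch below: filter the public test cases to your tradition and the hard/expert tiers, then report per-dimension averages instead of one headline number. Here query_model and judge_response are placeholders for your own model call and grading step; they are not FaithBench APIs.

# Illustrative sketch: tradition- and tier-filtered evaluation with
# per-dimension reporting. Supply your own query_model and judge_response.
from collections import defaultdict


def evaluate(test_cases, query_model, judge_response,
             tradition="Reformed", tiers=("hard", "expert")):
    """Per-dimension average scores for one tradition's hardest test cases.

    query_model(question) -> model answer (your own API call)
    judge_response(case, answer) -> score on a 0-100 scale (your own grading step)
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for case in test_cases:
        if case["tradition"] != tradition or case["tier"] not in tiers:
            continue
        answer = query_model(case["question"])
        score = judge_response(case, answer)
        totals[case["dimension"]] += score
        counts[case["dimension"]] += 1
    # Report dimensions separately: a strong headline average can hide a weak
    # dimension that matters for your application.
    return {dim: totals[dim] / counts[dim] for dim in totals}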

For Researchers

FaithBench is designed for reproducibility:

  • Open methodology. Every scoring rubric, judge prompt, and configuration parameter is published.
  • Public test cases. 50% of test cases are public for transparency; the other 50% are held out to prevent contamination.
  • Citable. Use the BibTeX below or on our methodology page.
@misc{faithbench2026,
  title={FaithBench: Toward Tradition-Aware Evaluation of Theological
         Faithfulness in Large Language Models},
  author={FaithBench Team},
  year={2026},
  url={https://faithbench.com},
  note={Methodology v1.0}
}

We welcome contributions: test case development, rubric refinement, expert calibration, statistical methodology. See our contributor page or partnership page.

What's Honest

We're transparent about what FaithBench doesn't do yet:

  • Single LLM judge (Gemini 3 Flash). Human expert calibration is in progress but not yet complete.
  • Single provider (OpenRouter). Provider variance is documented but unquantified.
  • Single-run evaluation. Multi-run averaging is planned but not yet funded.
  • Western Christian focus. Non-Western theological traditions are planned for v2.0.

These are real limitations. We document them because honest positioning builds more trust than overstated claims. Our hardening roadmap details the phased plan to address each one.

What's Next

  1. Human expert calibration: Theologian panels scoring responses alongside our LLM judge
  2. Cross-validation judge: Secondary judge model to detect and correct self-preference bias
  3. Tradition expansion: Pentecostal tradition in development; non-Western traditions planned
  4. Thinking-enabled leaderboard: Showing how reasoning modes change model performance
  5. Multi-run averaging: 4+ runs per model for quantified run-to-run variance

Get Involved

FaithBench is a community-driven project. We need:

  • Theologians across all traditions for test case development and expert calibration
  • AI researchers for methodology review and statistical analysis
  • Practitioners building AI tools for theological contexts

Partner with us or join the conversation on Discord.