Methodology · Benchmarks · Interpretation · Hermeneutics
January 29, 2026

Why Theology Breaks Benchmarks

By FaithBench Research

Medical AI hits 96% accuracy. Legal benchmarks have "objectively correct" answers. Theological benchmarks? 48/100 on faith. Here's why.


Medical AI achieves 96% accuracy on licensing exams—OpenAI's o1-preview on the MedQA benchmark (Chen et al., 2024). Legal benchmarks explicitly focus on "objectively correct" answers (Guha et al., 2023). But theological evaluation tells a different story—Gloo's FAI-C benchmark found models scored just 48/100 on faith, the lowest of all seven dimensions tested (Gloo, 2025).

That's not a measurement error. It's a fundamental mismatch between how benchmarks work and how theology works.

The Ground Truth Problem

Every benchmark needs a ground truth—the "correct" answer against which model outputs are measured. Different domains handle this differently:

| Domain | Ground Truth Source | Handles Interpretation? |
| --- | --- | --- |
| Medical (MedQA) | Clinical guidelines, peer review | Excluded by design |
| Legal (LegalBench) | Statutes, case precedent | Deliberately avoided |
| Theology | Scripture + 2,000 years of tradition | The entire point |

The LegalBench paper states it explicitly: "LEGALBENCH focuses entirely on the former setting (unambiguous answers), and all tasks are considered to have objectively 'correct' answers" (Guha et al., 2023, p. 3). The benchmark acknowledges it is "not helpful for evaluating legal reasoning involving degrees of correctness or tasks where 'reasonable minds may differ.'"

MedQA draws from clinical standards where disagreement can be resolved empirically—run the trial, measure the outcome. Even legal benchmarks can appeal to statutory text or binding precedent.

Theology has no such luxury.

The Same Text, Read Differently

This isn't abstract. Consider how different traditions read identical verses.

"Unless you eat the flesh of the Son of Man and drink his blood, you have no life in you" (John 6:53).

Catholic: Transubstantiation. The bread and wine become Christ's body and blood. The Greek verb trōgein (to gnaw, to chew) indicates physical consumption. This is literal.

Lutheran: Sacramental union. Christ is present "in, with, and under" the elements—not transubstantiation, but real presence.

Reformed: Spiritual presence. Calvin's middle path—the elements remain bread and wine, but Christ is truly present spiritually to faith.

Zwinglian: Pure memorial. "Do this in remembrance of me" defines the purpose. The elements symbolize; nothing metaphysical occurs.

Four readings. One verse. Each internally coherent. Each with centuries of sophisticated theological defense.

Sit with those examples: each tradition brings a coherent hermeneutical framework, cites careful scholarship, and would score its own reading as "correct" and the others as mistaken.

Which answer should the benchmark accept?
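The dilemma can be made concrete. Here is a minimal sketch of the standard single-gold-answer scorer that benchmarks like those above assume; all data and function names are invented for illustration, not drawn from any real benchmark. Whichever tradition's reading is enshrined as the gold label, the other three internally coherent answers are scored as failures:

```python
# Minimal sketch of a single-gold-answer benchmark scorer.
# All data and names here are illustrative, not from any real benchmark.

readings = {
    "Catholic":  "transubstantiation",
    "Lutheran":  "sacramental union",
    "Reformed":  "spiritual presence",
    "Zwinglian": "memorial",
}

def score(answer: str, gold: str) -> int:
    """Standard exact-match scoring: 1 if the answer equals the gold label."""
    return int(answer.strip().lower() == gold.strip().lower())

# Enshrine any one tradition's reading as "ground truth"...
gold = readings["Catholic"]

# ...and the other three coherent readings become "errors".
results = {tradition: score(answer, gold)
           for tradition, answer in readings.items()}
print(results)                               # only one tradition "passes"
print(sum(results.values()) / len(results))  # accuracy caps at 0.25
```

Swap in a different gold label and a different single tradition "wins"; the scorer's structure, not the answers, determines who fails.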

Hermeneutical Frameworks

The root issue goes deeper than individual verses. Different traditions bring fundamentally different frameworks for how to read.

These frameworks aren't just reading strategies. They're entire epistemologies: different answers to the question "what counts as understanding?"

As Kevin Vanhoozer notes in Mere Christian Hermeneutics (2024): "A cynical observer might say that the one thing Christians have never agreed on is how to interpret the Bible, or even on the meaning of the 'literal sense.'"

Why This Matters for AI

So what happens when you train an AI on all of Christian history simultaneously?

You get what we call the "linguistic average" problem. The model produces outputs that reflect the statistical center of its training data—a mushy middle that belongs to no actual tradition.

This shows up constantly in evaluation:

  • "God" becomes "a higher power"
  • "Prayer" becomes "mindfulness practice"
  • Specific confessional claims flatten into generic spirituality

The model isn't wrong in the sense of factual error. It's wrong in the sense of theological incoherence—synthesizing positions that actual believers would never hold together.

The real-world failures compound quickly:

Father Justin, an AI "priest" launched by Catholic Answers in April 2024, told users he was a real ordained minister and advised that Gatorade could be used for infant baptism. The chatbot was "defrocked" within two days of launch (Tech Times, 2024; The Register, 2024).

Hallucinated Scripture: A fabricated "trans-affirming" Bible verse went viral on social media. The Advocate cited it before discovering the passage doesn't exist (Religion News Service, 2023).

Islamic fabrications: AI systems have confidently cited "Majmoo' Fatawa Ibn Baaz (3/295)"—a reference that doesn't exist in that collection (Halevi, 2024).

Jewish AI failures: "ChatGPT will just completely make up a Gemara," reports a student at Yeshiva University (Forward, 2023).

As Beth Singler, anthropologist of AI and religion at the University of Zurich, has documented: chatbot technologies prioritize responses for conversational flow over precision—a significant problem for religions that rely heavily on textual sources and doctrinal accuracy (Singler, 2017; 2024).

The "So What?"

Here's the reframe: Theology isn't failing benchmarks because it's broken. It's "failing" because the benchmark paradigm assumes one correct answer—and theology is interpretive by design.

The plurality is the point.

When Catholics, Reformed, and Orthodox read the same verse differently, that's not a bug to fix. It's the lived reality of 2,000 years of faithful disagreement. The Stanford Encyclopedia of Philosophy puts it starkly: "The disagreement in theology about countless doctrinal points... made it only too obvious for a philosophical observer that there is no hermeneutical path toward establishing a full consensus about the interpretation of any biblical text whatsoever."

AI that erases distinctives doesn't serve anyone. A Catholic seeking sacramental theology doesn't want a Reformed answer. A Reformed believer asking about covenant theology doesn't want Catholic soteriology. And neither wants the AI's invented synthesis of both.

What we actually need: AI systems that are tradition-aware, not tradition-erasing. Systems that can say "From a Reformed perspective..." or "Catholics would argue..." rather than synthesizing a fake consensus.
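One hedged sketch of what tradition-aware evaluation could look like (the names and data below are hypothetical, not any existing benchmark's design): instead of a single gold label, the answer key is keyed by tradition, and a response is scored against the tradition it claims to speak from.

```python
# Illustrative sketch of tradition-aware scoring: gold answers are keyed
# by tradition, so no single reading is privileged as "the" truth.
# All names and data are hypothetical.

GOLD_BY_TRADITION = {
    "Catholic":  "transubstantiation",
    "Lutheran":  "sacramental union",
    "Reformed":  "spiritual presence",
    "Zwinglian": "memorial",
}

def score_tradition_aware(tradition: str, answer: str) -> int:
    """Score an answer against the tradition it claims to represent.
    An answer with no declared tradition cannot be scored at all."""
    gold = GOLD_BY_TRADITION.get(tradition)
    if gold is None:
        raise ValueError(f"Unknown or undeclared tradition: {tradition!r}")
    return int(answer.strip().lower() == gold)

# A correct Reformed answer scores 1 *as a Reformed answer*...
print(score_tradition_aware("Reformed", "spiritual presence"))  # 1
# ...and the same words, scored as a Catholic answer, score 0.
print(score_tradition_aware("Catholic", "spiritual presence"))  # 0
```

The design choice is the point: correctness only exists relative to a declared tradition, so an answer that names no perspective at all, the "linguistic average," has nothing to be scored against.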

As Rabbi Yehuda Hausman put it after testing AI for Jewish learning: "It became ever more obvious that I was dealing with an 'AI parrot' with no real understanding."

John Piper's response to AI-generated sermons: "Frankly, I'm appalled at the thought—appalled."

A Conservative Judaism rabbi, reflecting on the limits of computation: "ChatGPT cannot weep when a verse pierces the soul."

Theology resists benchmarks because it was never meant to be compressed into retrievable facts. It's a living tradition—or rather, many living traditions—of interpretation, debate, and faithful reading across centuries.

Any benchmark that ignores this isn't measuring theological competence. It's measuring something else entirely.


References

Chen, Z., et al. (2024). Toward expert-level medical question answering with large language models. Nature Medicine. https://doi.org/10.1038/s41591-024-03423-7

Gloo. (2025, December 15). Gloo unveils the first benchmark exposing how AI misses Christian worldview and values [Press release]. https://gloo.com/press/releases/gloo-unveils-the-first-benchmark-exposing-how-ai-misses-christian-worldview-and-values

Guha, N., et al. (2023). LegalBench: A collaboratively built benchmark for measuring legal reasoning in large language models. Proceedings of NeurIPS 2023. https://arxiv.org/abs/2308.11462

Pew Research Center. (2024). Americans' views on AI and religion. https://www.pewresearch.org

Singler, B. (2017). An introduction to artificial intelligence and religion for the religious studies scholar. Implicit Religion, 20(3), 215-231.

Singler, B. (2024). Religion and AI: An introduction. Cambridge University Press.

Tech Times. (2024, May 2). AI priest gets demoted after saying babies can be baptized with Gatorade. https://www.techtimes.com/articles/304222/20240502/ai-priest-demoted-saying-babies-baptized-gatorade.htm

The Register. (2024, May 3). AI Catholic 'priest' defrocked over interesting advice. https://www.theregister.com/2024/05/03/ai_catholic_priest/

Vanhoozer, K. J. (2024). Mere Christian hermeneutics: Transfiguring what it means to read the Bible theologically. Zondervan.