Methodology · Benchmarks · Interpretation · Hermeneutics
January 29, 2026

Why Theology Breaks Benchmarks

By FaithBench Research

Medical AI hits 96% accuracy. Legal benchmarks have "objectively correct" answers. Theological benchmarks? 48/100 on faith. Here's why.


Medical AI achieves 96% accuracy on licensing exams—OpenAI's o1-preview on the MedQA benchmark (Chen et al., 2024). Legal benchmarks explicitly focus on "objectively correct" answers (Guha et al., 2023). But theological evaluation tells a different story—Gloo's FAI-C benchmark found models scored just 48/100 on faith, the lowest of all seven dimensions tested (Gloo, 2025).

That's not a measurement error. It's a fundamental mismatch between how benchmarks work and how theology works.

The Ground Truth Problem

Every benchmark needs a ground truth—the "correct" answer against which model outputs are measured. Different domains handle this differently:

| Domain | Ground Truth Source | Handles Interpretation? |
| --- | --- | --- |
| Medical (MedQA) | Clinical guidelines, peer review | Excluded by design |
| Legal (LegalBench) | Statutes, case precedent | Deliberately avoided |
| Theology | Scripture + 2,000 years of tradition | The entire point |

The LegalBench paper states it explicitly: "LEGALBENCH focuses entirely on the former setting (unambiguous answers), and all tasks are considered to have objectively 'correct' answers" (Guha et al., 2023, p. 3). The benchmark acknowledges it is "not helpful for evaluating legal reasoning involving degrees of correctness or tasks where 'reasonable minds may differ.'"

MedQA draws from clinical standards where disagreement can be resolved empirically—run the trial, measure the outcome. Even legal benchmarks can appeal to statutory text or binding precedent.

Theology has no such luxury.

The Same Text, Read Differently

This isn't abstract. Consider how different traditions read identical verses.

"Unless you eat the flesh of the Son of Man and drink his blood, you have no life in you" (John 6:53).

Catholic: Transubstantiation. The bread and wine become Christ's body and blood. The Greek verb trōgein (to gnaw, to chew) indicates physical consumption. This is literal.

Lutheran: Sacramental union. Christ is present "in, with, and under" the elements—not transubstantiation, but real presence.

Reformed: Spiritual presence. Calvin's middle path—the elements remain bread and wine, but Christ is truly present spiritually to faith.

Zwinglian: Pure memorial. "Do this in remembrance of me" defines the purpose. The elements symbolize; nothing metaphysical occurs.

Four readings. One verse. Each internally coherent. Each with centuries of sophisticated theological defense.

Sit with those examples: each tradition brings a coherent hermeneutical framework, cites careful scholarship, and would score its own reading as "correct" and the others as mistaken.

Which answer should the benchmark accept?
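The dilemma can be made concrete. Here is a minimal sketch of the standard single-gold-answer scorer that benchmarks like those above assume; all data and function names are invented for illustration, not drawn from any real benchmark. Whichever tradition's reading is enshrined as the gold label, the other three internally coherent answers are scored as failures:

```python
# Minimal sketch of a single-gold-answer benchmark scorer.
# All data and names here are illustrative, not from any real benchmark.

readings = {
    "Catholic":  "transubstantiation",
    "Lutheran":  "sacramental union",
    "Reformed":  "spiritual presence",
    "Zwinglian": "memorial",
}

def score(answer: str, gold: str) -> int:
    """Standard exact-match scoring: 1 if the answer equals the gold label."""
    return int(answer.strip().lower() == gold.strip().lower())

# Enshrine any one tradition's reading as "ground truth"...
gold = readings["Catholic"]

# ...and the other three coherent readings become "errors".
results = {tradition: score(answer, gold)
           for tradition, answer in readings.items()}
print(results)                               # only one tradition "passes"
print(sum(results.values()) / len(results))  # accuracy caps at 0.25
```

Swap in a different gold label and a different single tradition "wins"; the scorer's structure, not the answers, determines who fails.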

Hermeneutical Frameworks

The root issue goes deeper than individual verses. Different traditions bring fundamentally different frameworks for how to read.

These frameworks aren't just reading strategies. They're entire epistemologies: different answers to the question "what counts as understanding?"

As Kevin Vanhoozer notes in Mere Christian Hermeneutics (2024): "A cynical observer might say that the one thing Christians have never agreed on is how to interpret the Bible, or even on the meaning of the 'literal sense.'"

Why This Matters for AI

So what happens when you train an AI on all of Christian history simultaneously?

You get what we call the "linguistic average" problem. The model produces outputs that reflect the statistical center of its training data—a mushy middle that belongs to no actual tradition.

This shows up constantly in evaluation:

  • "God" becomes "a higher power"
  • "Prayer" becomes "mindfulness practice"
  • Specific confessional claims flatten into generic spirituality

The model isn't wrong in the sense of factual error. It's wrong in the sense of theological incoherence—synthesizing positions that actual believers would never hold together.

The real-world failures compound quickly:

Father Justin, an AI "priest" launched by Catholic Answers in April 2024, told users he was a real ordained minister and advised that Gatorade could be used for infant baptism. The chatbot was "defrocked" within two days of launch (Tech Times, 2024; The Register, 2024).

Hallucinated Scripture: A fabricated "trans-affirming" Bible verse went viral on social media. The Advocate cited it before discovering the passage doesn't exist (Religion News Service, 2023).

Islamic fabrications: AI systems have confidently cited "Majmoo' Fatawa Ibn Baaz (3/295)"—a reference that doesn't exist in that collection (Halevi, 2024).

Jewish AI failures: "ChatGPT will just completely make up a Gemara," reports a student at Yeshiva University (Forward, 2023).

As Beth Singler, anthropologist of AI and religion at the University of Zurich, has documented: chatbot technologies prioritize responses for conversational flow over precision—a significant problem for religions that rely heavily on textual sources and doctrinal accuracy (Singler, 2017; 2024).

The "So What?"

Here's the reframe: Theology isn't failing benchmarks because it's broken. It's "failing" because the benchmark paradigm assumes one correct answer—and theology is interpretive by design.

The plurality is the point.

When Catholics, Reformed, and Orthodox read the same verse differently, that's not a bug to fix. It's the lived reality of 2,000 years of faithful disagreement. The Stanford Encyclopedia of Philosophy puts it starkly: "The disagreement in theology about countless doctrinal points... made it only too obvious for a philosophical observer that there is no hermeneutical path toward establishing a full consensus about the interpretation of any biblical text whatsoever."

AI that erases distinctives doesn't serve anyone. A Catholic seeking sacramental theology doesn't want a Reformed answer. A Reformed believer asking about covenant theology doesn't want Catholic soteriology. And neither wants the AI's invented synthesis of both.

What we actually need: AI systems that are tradition-aware, not tradition-erasing. Systems that can say "From a Reformed perspective..." or "Catholics would argue..." rather than synthesizing a fake consensus.
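One hedged sketch of what tradition-aware evaluation could look like (the names and data below are hypothetical, not any existing benchmark's design): instead of a single gold label, the answer key is keyed by tradition, and a response is scored against the tradition it claims to speak from.

```python
# Illustrative sketch of tradition-aware scoring: gold answers are keyed
# by tradition, so no single reading is privileged as "the" truth.
# All names and data are hypothetical.

GOLD_BY_TRADITION = {
    "Catholic":  "transubstantiation",
    "Lutheran":  "sacramental union",
    "Reformed":  "spiritual presence",
    "Zwinglian": "memorial",
}

def score_tradition_aware(tradition: str, answer: str) -> int:
    """Score an answer against the tradition it claims to represent.
    An answer with no declared tradition cannot be scored at all."""
    gold = GOLD_BY_TRADITION.get(tradition)
    if gold is None:
        raise ValueError(f"Unknown or undeclared tradition: {tradition!r}")
    return int(answer.strip().lower() == gold)

# A correct Reformed answer scores 1 *as a Reformed answer*...
print(score_tradition_aware("Reformed", "spiritual presence"))  # 1
# ...and the same words, scored as a Catholic answer, score 0.
print(score_tradition_aware("Catholic", "spiritual presence"))  # 0
```

The design choice is the point: correctness only exists relative to a declared tradition, so an answer that names no perspective at all, the "linguistic average," has nothing to be scored against.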

As Rabbi Yehuda Hausman put it after testing AI for Jewish learning: "It became ever more obvious that I was dealing with an 'AI parrot' with no real understanding."

John Piper's response to AI-generated sermons: "Frankly, I'm appalled at the thought—appalled."

A Conservative Judaism rabbi, reflecting on the limits of computation: "ChatGPT cannot weep when a verse pierces the soul."

Theology resists benchmarks because it was never meant to be compressed into retrievable facts. It's a living tradition—or rather, many living traditions—of interpretation, debate, and faithful reading across centuries.

Any benchmark that ignores this isn't measuring theological competence. It's measuring something else entirely.


References

Chen, Z., et al. (2024). Toward expert-level medical question answering with large language models. Nature Medicine. https://doi.org/10.1038/s41591-024-03423-7

Gloo. (2025, December 15). Gloo unveils the first benchmark exposing how AI misses Christian worldview and values [Press release]. https://gloo.com/press/releases/gloo-unveils-the-first-benchmark-exposing-how-ai-misses-christian-worldview-and-values

Guha, N., et al. (2023). LegalBench: A collaboratively built benchmark for measuring legal reasoning in large language models. Proceedings of NeurIPS 2023. https://arxiv.org/abs/2308.11462

Pew Research Center. (2024). Americans' views on AI and religion. https://www.pewresearch.org

Singler, B. (2017). An introduction to artificial intelligence and religion for the religious studies scholar. Implicit Religion, 20(3), 215-231.

Singler, B. (2024). Religion and AI: An introduction. Cambridge University Press.

Tech Times. (2024, May 2). AI priest gets demoted after saying babies can be baptized with Gatorade. https://www.techtimes.com/articles/304222/20240502/ai-priest-demoted-saying-babies-baptized-gatorade.htm

The Register. (2024, May 3). AI Catholic 'priest' defrocked over interesting advice. https://www.theregister.com/2024/05/03/ai_catholic_priest/

Vanhoozer, K. J. (2024). Mere Christian hermeneutics: Transfiguring what it means to read the Bible theologically. Zondervan.