
Methodology · Beginner · January 29, 2026

Why Testing AI on Theology Is So Hard

By FaithBench Research

Medical AI hits 96% accuracy on exams. Theology benchmarks? 48%. The difference reveals something fundamental about how faith works.


AI can pass medical licensing exams at 96% accuracy. Legal benchmarks have clear right answers based on statutes and precedent.

Theology? AI scores 48% on faith questions.

This isn't because theology is harder. It's because theology works differently.

The Ground Truth Problem

Every test needs a "right answer" to grade against.

Medical tests: Right answers come from clinical trials and peer-reviewed research. Run the study, measure the outcome, update the guideline.

Legal tests: Right answers come from statutes and court precedent. The law says what the law says.

Theology: Right answers... according to whom?

Consider this verse: "Unless you eat the flesh of the Son of Man and drink his blood, you have no life in you" (John 6:53).

  • Catholics say the bread literally becomes Christ's body
  • Lutherans say Christ is present "in, with, and under" the bread
  • Reformed Christians say Christ is spiritually present to faith
  • Others say it's purely symbolic

Four readings. One verse. Each position has centuries of careful scholarship behind it.

Which answer should the test accept as "correct"?
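To make the problem concrete, here is a toy sketch of how a benchmark grader works. All question IDs, labels, and answers below are invented for illustration; they are not drawn from any real benchmark. The point is structural: with one question and four tradition-specific "correct" answers, a single-key grader has no principled choice.

```python
# Toy sketch: grading one theology question against tradition-specific
# answer keys instead of a single "gold" answer. All data is invented.

ANSWER_KEY = {
    "john_6_53_eucharist": {
        "catholic": "transubstantiation",
        "lutheran": "sacramental_union",
        "reformed": "spiritual_presence",
        "memorialist": "symbolic",
    }
}

def grade(question_id, model_answer, tradition=None):
    """Grade an answer; without a tradition, grading is ambiguous."""
    answers = ANSWER_KEY[question_id]
    if tradition is None:
        # Any single "gold" answer would mark the other three wrong.
        matches = [t for t, a in answers.items() if a == model_answer]
        return f"correct within: {matches}" if matches else "incorrect"
    return "correct" if answers.get(tradition) == model_answer else "incorrect"

print(grade("john_6_53_eucharist", "transubstantiation", "catholic"))  # correct
print(grade("john_6_53_eucharist", "transubstantiation", "reformed"))  # incorrect
print(grade("john_6_53_eucharist", "transubstantiation", None))
```

The same response is "correct" for one tradition and "incorrect" for another, which is exactly why a single accuracy number undersells what is going on.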

Why AI Produces Mush

When you train AI on all Christian traditions equally, you get the average of all positions.

The average is mush.

"God" becomes "higher power." "Prayer" becomes "mindfulness." Specific claims flatten into generic spirituality.

AI's output doesn't represent any actual tradition. It represents the statistical center of everything AI has read—which is nobody's faith.
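A toy illustration of that "statistical center" (the coordinates are invented, not real model data): place four distinct positions as points, and their average is a new point that belongs to nobody.

```python
# Toy illustration with invented coordinates: four doctrinal positions
# as points. Their average sits at the center and matches none of them.
positions = {
    "catholic":    (1.0, 0.0),
    "lutheran":    (0.0, 1.0),
    "reformed":    (-1.0, 0.0),
    "memorialist": (0.0, -1.0),
}

avg = tuple(sum(c) / len(positions) for c in zip(*positions.values()))
print(avg)                        # (0.0, 0.0) — the center
print(avg in positions.values())  # False: the average is nobody's position
```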

Real Failures

This isn't theoretical. AI has:

  • Invented Bible verses that don't exist
  • Approved baptizing a baby with Gatorade (the "Father Justin" AI disaster)
  • Completely made up Jewish texts that scholars immediately recognized as fake
  • Confidently cited Islamic sources that don't exist

One rabbi tested AI for Jewish learning and concluded: "It became obvious I was dealing with an AI parrot with no real understanding."

Why This Actually Matters

The disagreement in theology isn't a bug. It's the point.

When Catholics, Reformed Christians, and Orthodox believers read the same verse differently, that's not a problem to solve. That's 2,000 years of faithful people wrestling with hard questions.

AI that erases these differences doesn't serve anyone:

  • A Catholic asking about the Eucharist doesn't want a Reformed answer
  • A Reformed believer asking about predestination doesn't want a Catholic answer
  • Neither wants AI's made-up synthesis of both

What Good AI Would Look Like

Instead of pretending neutrality, useful theological AI would:

  • Identify the tradition: "From a Catholic perspective..." or "Reformed Christians would say..."
  • Acknowledge disagreement: "This is debated between traditions"
  • Know its limits: "I can describe what traditions teach, but I can't tell you which is correct"
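As a rough sketch, the three behaviors above could be written down as explicit instructions to a chat model. This is a hypothetical system prompt for illustration only, not any real product's configuration.

```python
# Hypothetical system-prompt sketch encoding the three behaviors above.
# Illustrative only; no real product is known to use this exact text.
SYSTEM_PROMPT = (
    "When answering theological questions:\n"
    "1. Identify the tradition: preface answers with phrases like "
    "'From a Catholic perspective...' or 'Reformed Christians would say...'.\n"
    "2. Acknowledge disagreement: note when a point is debated between traditions.\n"
    "3. Know your limits: describe what traditions teach, "
    "but do not declare which tradition is correct."
)

print(SYSTEM_PROMPT)
```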

What You Can Do

  1. Don't expect AI to settle theological questions. The disagreements are real and meaningful.

  2. Ask for tradition-specific answers. "What do Catholics teach about X?" is better than "What's true about X?"

  3. Verify everything. If AI cites a verse, book, or source, look it up.

  4. Use AI as a starting point, not an authority. It can help you find questions to explore, not answers to trust.

Theology resists benchmarks because it was never meant to be compressed into testable facts. It's a set of living traditions of interpretation—and AI shouldn't be allowed to flatten that into generic spirituality.


Want the full methodology discussion? Read the technical version.