Methodology · Beginner · February 1, 2026

How We Test the Best AI on Hard Theology Questions

By FaithBench Research

When every AI model aces the easy questions, you need harder tests. Here's how FaithBench designs questions that separate the best from the rest.


On easy theological questions, top AI models all score above 80%.

When every model aces the test, the test isn't useful anymore.

This is a known problem in AI evaluation, often called benchmark saturation. Tests that were hard in 2021 became easy by 2024. If you want to measure which AI is actually best at theology, you need harder questions.

What Makes a Question "Expert Level"?

Our expert questions follow specific design principles:

1. Require Multiple Steps

Easy question: "What is the Trinity?"

Expert question: "The word epiousios in the Lord's Prayer appears nowhere else in Greek literature. List the three main scholarly interpretations and the patristic evidence for each."

The expert question requires knowledge of Greek, familiarity with the Church Fathers, and the ability to compare scholarly positions, all within a single answer.

2. Can't Be Googled

A simple web search won't help. Expert questions require synthesizing information from multiple scholarly sources that don't appear together anywhere online.

3. Include Traps

We test common AI mistakes:

  • Misattributed quotes: "In essentials unity..." is often credited to Augustine. It is generally traced to Rupertus Meldenius, a 17th-century Lutheran theologian.
  • Confused positions: Questions that require distinguishing Reformed from Lutheran, not just Protestant from Catholic.
  • Modern assumptions: Questions where importing contemporary ideas produces wrong answers.

4. Test Within Traditions

Not just "What do Calvinists believe?" but "Did the Synod of Dort take a supralapsarian or infralapsarian position?"

The first question is general knowledge. The second tests whether AI actually understands the internal debates within Reformed theology.

5. Test When NOT to Answer

Some questions have "we don't know" or "scholars disagree" as the correct answer.

Example: "What was Origen's final position on universal restoration? Did he recant?"

The correct answer acknowledges that the evidence is fragmentary and that scholars continue to debate it. An AI that confidently gives a definitive answer is wrong.

6. Pre-Tested Against AI

Before including any expert question, we test it against the top AI models. If any model scores above 70%, we reject or revise the question.
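The pre-test rule above can be sketched in a few lines of Python. This is a minimal illustration, not FaithBench's actual tooling: the `CandidateQuestion` class, the `passes_pretest` function, the model names, and the 0-to-1 score scale are all assumptions made for the example. Only the 70% cutoff comes from the text.

```python
from dataclasses import dataclass

# Cutoff from the article: reject or revise if any model scores above 70%.
MAX_ALLOWED_SCORE = 0.70


@dataclass
class CandidateQuestion:
    """Hypothetical record for a question under review."""
    text: str
    model_scores: dict  # assumed shape: model name -> score between 0.0 and 1.0


def passes_pretest(question, threshold=MAX_ALLOWED_SCORE):
    """Return True only if no model scores above the threshold."""
    if not question.model_scores:
        # A question that was never tested cannot pass the pre-test.
        raise ValueError("question has no model scores yet")
    return all(score <= threshold for score in question.model_scores.values())


# Illustrative scores only, not real results.
hard_enough = CandidateQuestion("candidate A", {"model_a": 0.40, "model_b": 0.70})
too_easy = CandidateQuestion("candidate B", {"model_a": 0.40, "model_b": 0.85})

keep = [q.text for q in (hard_enough, too_easy) if passes_pretest(q)]
```

In this sketch a single high score is enough to send a question back for revision, which matches the "if any model" wording above. A score of exactly 70% is treated as passing, since the rule only excludes scores above the cutoff.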

What We're Testing

Our 100+ expert questions cover:

Area            What It Tests
Textual         Greek/Hebrew knowledge, manuscript variants
Hermeneutical   How to interpret Scripture
Doctrinal       Deep theological debates within traditions
Historical      Church history precision
Apologetics     Philosophical arguments for faith
Intertextual    How Scripture interprets Scripture

Why This Matters

Easy tests don't tell you much. When GPT-5, Claude, and Gemini all score 85-92%, you can't tell which is actually best for theological work.

Expert-level testing spreads the scores across 30-60%, revealing real differences between models.
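To see why a wider spread is more informative, here is a small sketch that compares the gap between the best and worst model on each tier. The model names and individual scores are invented for illustration and are not FaithBench results. They are only chosen to fall inside the 85-92% and 30-60% ranges mentioned above, and `score_range` is a hypothetical helper.

```python
def score_range(scores):
    """Gap in percentage points between the highest and lowest score."""
    return max(scores.values()) - min(scores.values())


# Made-up scores (in percent) that fit the ranges quoted in the article.
easy_tier = {"model_a": 85, "model_b": 89, "model_c": 92}
expert_tier = {"model_a": 30, "model_b": 47, "model_c": 60}

easy_gap = score_range(easy_tier)      # 7 percentage points
expert_gap = score_range(expert_tier)  # 30 percentage points
```

With only 7 points between first and last place, small amounts of noise can reorder the models. A 30-point gap makes the ranking much harder to attribute to chance.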

For anyone using AI for serious theological research—seminarians, pastors, scholars—knowing which models handle difficult questions matters.

Example Expert Questions

Textual: "What is the meaning of epiousios in the Lord's Prayer? This word appears nowhere else in Greek literature. List the three main scholarly interpretations and patristic evidence for each."

Doctrinal: "Distinguish supralapsarianism from infralapsarianism. Which position did the Synod of Dort take? Which did Turretin favor? Which does the Westminster Confession imply?"

Historical: "The phrase 'In essentials unity, in non-essentials liberty, in all things charity' is often attributed to Augustine. Who actually wrote it and in what context?"

What You Can Do

  1. Check the leaderboard for expert-tier scores when they're published

  2. Choose models accordingly for serious theological work

  3. Don't assume all AI models are equal just because they score similarly on easy questions

Expert questions separate the truly capable from the merely passable. For theological accuracy, that distinction matters.


Want the full methodology details? Read the technical version.