How We Test the Best AI on Hard Theology Questions
By FaithBench Research
When every AI model aces the easy questions, you need harder tests. Here's how FaithBench designs questions that separate the best from the rest.
On easy theological questions, top AI models all score above 80%.
When every model aces the test, the test isn't useful anymore.
This is a known problem in AI evaluation. Tests that were hard in 2021 became easy by 2024. If you want to measure which AI is actually best at theology, you need harder questions.
What Makes a Question "Expert Level"?
Our expert questions follow specific design principles:
1. Require Multiple Steps
Easy question: "What is the Trinity?"
Expert question: "The word epiousios in the Lord's Prayer appears nowhere else in Greek literature. List the three main scholarly interpretations and the patristic evidence for each."
The expert question requires Greek language knowledge, familiarity with Church Fathers, and the ability to compare scholarly positions—all integrated together.
2. Can't Be Googled
A simple web search won't help. Expert questions require synthesizing information from multiple scholarly sources that don't appear together anywhere online.
3. Include Traps
We test common AI mistakes:
- Misattributed quotes: "In essentials unity..." is often credited to Augustine. It is generally traced to the 17th-century Lutheran theologian Rupertus Meldenius.
- Confused positions: Questions that require distinguishing Reformed from Lutheran, not just Protestant from Catholic.
- Modern assumptions: Questions where importing contemporary ideas produces wrong answers.
4. Test Within Traditions
Not just "What do Calvinists believe?" but "Did the Synod of Dort take a supralapsarian or infralapsarian position?"
The first question is general knowledge. The second tests whether AI actually understands the internal debates within Reformed theology.
5. Test When NOT to Answer
Some questions have "we don't know" or "scholars disagree" as the correct answer.
Example: "What was Origen's final position on universal restoration? Did he recant?"
The correct answer acknowledges that the evidence is fragmentary and that scholars continue to debate it. An AI that confidently gives a definitive answer is wrong.
6. Pre-Tested Against AI
Before including any expert question, we test it against the top AI models. If any model scores above 70%, we reject or revise the question.
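The rejection rule above can be sketched in a few lines of Python. This is a minimal illustration, not FaithBench's actual pipeline: the function name, the score format (fractions from 0 to 1), and the model names and numbers are all assumptions made for the example. Only the 70% cutoff comes from the rule as stated.

```python
from typing import Dict

# From the rule above: a question is rejected or revised
# if any model scores above 70% on it.
REJECT_THRESHOLD = 0.70


def is_hard_enough(model_scores: Dict[str, float],
                   threshold: float = REJECT_THRESHOLD) -> bool:
    """Return True when no model scores above the threshold."""
    return all(score <= threshold for score in model_scores.values())


# Hypothetical pre-test scores; names and numbers are placeholders.
candidate = {"model_a": 0.55, "model_b": 0.40, "model_c": 0.75}

if is_hard_enough(candidate):
    print("keep question")
else:
    print("reject or revise question")
```

Here one placeholder model scores 0.75, so the question would go back for revision. A score of exactly 70% is treated as passing, since the rule only rejects scores above that mark.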
What We're Testing
Our 100+ expert questions cover:
| Area | What It Tests |
|---|---|
| Textual | Greek/Hebrew knowledge, manuscript variants |
| Hermeneutical | How to interpret Scripture |
| Doctrinal | Deep theological debates within traditions |
| Historical | Church history precision |
| Apologetics | Philosophical arguments for faith |
| Intertextual | How Scripture interprets Scripture |
Why This Matters
Easy tests don't tell you much. When GPT-5, Claude, and Gemini all score 85-92%, you can't tell which is actually best for theological work.
Expert-level testing spreads scores across a much wider range, roughly 30-60%, which reveals real differences between models.
For anyone using AI for serious theological research—seminarians, pastors, scholars—knowing which models handle difficult questions matters.
Example Expert Questions
Textual: "What is the meaning of epiousios in the Lord's Prayer? This word appears nowhere else in Greek literature. List the three main scholarly interpretations and patristic evidence for each."
Doctrinal: "Distinguish supralapsarianism from infralapsarianism. Which position did the Synod of Dort take? Which did Turretin favor? Which does the Westminster Confession imply?"
Historical: "The phrase 'In essentials unity, in non-essentials liberty, in all things charity' is often attributed to Augustine. Who actually wrote it and in what context?"
What You Can Do
- Check the leaderboard for expert-tier scores when they're published
- Choose models accordingly for serious theological work
- Don't assume all AI models are equal just because they seem similar on easy questions
Expert questions separate the truly capable from the merely passable. For theological accuracy, that distinction matters.
Want the full methodology details? Read the technical version.