Introducing Expert-Level Tests: Where Frontier Models Meet Their Match
By FaithBench Research
FaithBench adds 100+ expert-level questions designed using GPQA and Humanity's Last Exam principles to discriminate between the best AI models.
The Saturation Problem
Our analysis revealed a troubling pattern: on easy and medium difficulty questions, frontier models like GPT-5, Claude Opus, and Gemini Pro all score above 80%. When every model aces the test, the benchmark loses its power to discriminate.
This is a known problem in AI evaluation. MMLU went from challenging (50% accuracy in 2021) to saturated (90%+ in 2024) in just three years. The same fate awaits any benchmark that doesn't evolve.
Learning from the Best
We studied how leading benchmarks maintain discrimination:
GPQA (Graduate-Level Google-Proof Q&A)
- PhD-level questions that experts answer correctly but non-experts fail even with web access
- Reported: 65% expert accuracy, 34% non-expert accuracy
Humanity's Last Exam
- Questions pre-screened against frontier LLMs—rejected if models answer correctly
- Result: <10% accuracy for best models
MMLU-Pro
- Expanded from 4 to 10 answer choices
- Added multi-step reasoning requirements
- Frontier models dropped ~15 percentage points vs standard MMLU
Our Expert-Level Design Principles
FaithBench expert questions follow six principles:
1. Multi-Hop Reasoning
Questions require synthesizing 3+ distinct facts. Example:
"In Colossians 1:15, is the genitive 'pases ktiseos' partitive or comparative? Provide grammatical evidence, cite Athanasius's argument in Contra Arianos, and explain why Arians and Nicenes interpreted this differently."
This requires: (1) Greek grammar knowledge, (2) patristic familiarity, (3) historical theology awareness—all integrated.
2. Google-Proof Design
Simple web searches won't help. Questions require synthesis across multiple scholarly sources that don't appear together anywhere online.
3. Adversarial Distractors
We exploit common LLM failure modes:
- Misattributed quotes: "Augustine said 'In essentials unity...'" (Actually Rupertus Meldenius)
- Conflated positions: Questions that require distinguishing Reformed from Lutheran, not just Protestant from Catholic
- Anachronistic traps: Questions where importing modern concepts produces wrong answers
4. Intra-Tradition Precision
Not just "What do Calvinists believe?" but "Did the Synod of Dort take a supralapsarian or infralapsarian position?"
5. Abstention Testing
Some questions have "insufficient evidence" as the correct answer. We test whether models know when NOT to answer—a critical skill for theological applications.
Example: "What was Origen's final, definitive position on apokatastasis? Did he recant before death?"
The correct answer acknowledges that evidence is fragmentary and scholars continue to debate.
6. LLM Pre-Screening
Every expert question is tested against GPT-5, Claude Opus, and Gemini Pro before inclusion. If any model scores above 70% on a question, we reject or revise it.
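The pre-screening rule can be sketched in a few lines of Python. This is a minimal illustration, not FaithBench's actual pipeline: the `ScreeningResult` type, the `prescreen` function, the model labels, and all scores are hypothetical placeholders, and we assume each model's score on a question is normalized to the range 0 to 1.

```python
from dataclasses import dataclass

# Threshold taken from the post: a question is rejected or revised
# if any screened model scores above 70% on it.
REJECT_THRESHOLD = 0.70


@dataclass
class ScreeningResult:
    """Hypothetical record of one candidate question's pre-screening run."""
    question_id: str
    scores: dict[str, float]  # model label -> score in [0, 1] (assumed scale)


def needs_revision(result: ScreeningResult, threshold: float = REJECT_THRESHOLD) -> bool:
    """Return True if at least one model scored strictly above the threshold."""
    return any(score > threshold for score in result.scores.values())


def prescreen(results: list[ScreeningResult]) -> tuple[list[str], list[str]]:
    """Split candidate question IDs into (kept, flagged for rejection or revision)."""
    kept: list[str] = []
    flagged: list[str] = []
    for result in results:
        target = flagged if needs_revision(result) else kept
        target.append(result.question_id)
    return kept, flagged


# Placeholder data for illustration only; these are not real model results.
candidates = [
    ScreeningResult("q-001", {"model_a": 0.40, "model_b": 0.55, "model_c": 0.35}),
    ScreeningResult("q-002", {"model_a": 0.85, "model_b": 0.50, "model_c": 0.45}),
    ScreeningResult("q-003", {"model_a": 0.70, "model_b": 0.60, "model_c": 0.20}),
]

kept, flagged = prescreen(candidates)
print("kept:", kept)
print("flagged:", flagged)
```

Note that the comparison is strict, so a question where the best model scores exactly 70% is kept. Whether the boundary case should be kept or flagged is a design choice the post does not specify.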
What This Means for the Leaderboard
With expert-level tests, we expect:
- Greater spread: Top models currently cluster at 85-92%. On the expert tier, we expect their scores to spread across roughly 30-60%.
- True discrimination: The best models will separate from the pack.
- Reduced gaming: Held-out expert questions prevent training contamination.
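To make "greater spread" concrete, here is a small Python sketch that summarizes how far apart model scores sit within a tier. The `spread_summary` helper and the model labels are hypothetical, and the scores are placeholders chosen to match the ranges quoted above (85-92% and 30-60%). They are not measured results, and this is not necessarily the discrimination statistic FaithBench will publish.

```python
from statistics import mean, pstdev


def spread_summary(scores: dict[str, float]) -> dict[str, float]:
    """Summarize how widely model scores are dispersed within one tier."""
    values = list(scores.values())
    return {
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),  # gap between best and worst model
        "mean": mean(values),
        "stdev": pstdev(values),  # population standard deviation
    }


# Placeholder scores for illustration only (fractions of 1.0).
easy_tier = {"model_a": 0.85, "model_b": 0.89, "model_c": 0.92}
expert_tier = {"model_a": 0.30, "model_b": 0.45, "model_c": 0.60}

easy = spread_summary(easy_tier)
expert = spread_summary(expert_tier)

print(f"easy tier range:   {easy['range']:.2f}, stdev: {easy['stdev']:.3f}")
print(f"expert tier range: {expert['range']:.2f}, stdev: {expert['stdev']:.3f}")
```

A wider range and larger standard deviation on the expert tier would indicate that the questions separate models more clearly than the saturated tiers do.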
Categories Covered
Our 100+ expert questions span all six dimensions:
| Dimension | Expert Questions | Example Topics |
|---|---|---|
| Textual | 20 | Hapax legomena, manuscript variants with theological stakes |
| Hermeneutical | 20 | Genre disputes, sensus plenior, typology boundaries |
| Doctrinal | 25 | Intra-tradition debates (Thomist/Molinist, essence-energies) |
| Historical | 15 | Patristic attribution, council canons, Reformation specifics |
| Apologetics | 15 | Modal logic arguments, grounding objections |
| Intertextual | 15 | Second Temple interpretation, MT vs LXX patterns |
The table totals 110 questions, plus 5-10 abstention questions that test appropriate epistemic humility.
Sample Expert Questions
Textual
"What is the meaning of epiousios in the Lord's Prayer? This word appears nowhere else in Greek literature before the NT. List the three main scholarly interpretations and the patristic evidence for each."
Doctrinal
"Distinguish supralapsarianism from infralapsarianism. Which position did the Synod of Dort take? Which did Turretin favor? Which does the Westminster Confession imply?"
Historical
"The phrase 'In essentials unity, in non-essentials liberty, in all things charity' is often attributed to Augustine. Who actually wrote it and in what context?"
Apologetics
"Explain the 'grounding objection' to Molinism. How do Molinists like Thomas Flint respond to the claim that counterfactuals of creaturely freedom lack truth-makers?"
Coming Soon
Expert-level results will appear on the leaderboard once we've completed validation. We'll publish:
- Discrimination index by difficulty tier
- Model accuracy distributions
- Analysis of which categories challenge which models
Stay tuned for insights into where even the best models struggle with theological reasoning.
Note on preliminary data: FaithBench scores cited in this post are from v1.0, which uses a single LLM judge (Gemini 3 Flash) without human inter-rater reliability validation. Expert-level question design involves pre-screening against frontier LLMs, which introduces a selection circularity documented in our methodology. Scores should be treated as provisional automated assessments. See our methodology for full details on current limitations.
Questions about our methodology? See our full methodology documentation.