Research Preview: FaithBench is an active research project. All scores are preliminary, generated by a single AI judge, and pending human validation. See our Limitations section for details.

Tags: methodology, expert, benchmark-design · February 1, 2026

Introducing Expert-Level Tests: Where Frontier Models Meet Their Match

By FaithBench Research

FaithBench adds 100+ expert-level questions, designed on principles drawn from GPQA and Humanity's Last Exam, to discriminate among the strongest AI models.


The Saturation Problem

Our analysis revealed a troubling pattern: on easy and medium difficulty questions, frontier models like GPT-5, Claude Opus, and Gemini Pro all score above 80%. When every model aces the test, the benchmark loses its power to discriminate.

This is a known problem in AI evaluation. MMLU went from challenging (50% accuracy in 2021) to saturated (90%+ in 2024) in just three years. The same fate awaits any benchmark that doesn't evolve.

Learning from the Best

We studied how leading benchmarks maintain discrimination:

GPQA (Graduate-Level Google-Proof Q&A)

  • PhD-level questions that experts answer correctly but non-experts fail even with web access
  • Target: 65% expert accuracy, 34% non-expert accuracy

Humanity's Last Exam

  • Questions pre-screened against frontier LLMs—rejected if models answer correctly
  • Result: <10% accuracy for best models

MMLU-Pro

  • Expanded from 4 to 10 answer choices
  • Added multi-step reasoning requirements
  • Frontier models dropped ~15 percentage points vs standard MMLU

Our Expert-Level Design Principles

FaithBench expert questions follow six principles:

1. Multi-Hop Reasoning

Questions require synthesizing 3+ distinct facts. Example:

"In Colossians 1:15, is the genitive 'pases ktiseos' partitive or comparative? Provide grammatical evidence, cite Athanasius's argument in Contra Arianos, and explain why Arians and Nicenes interpreted this differently."

This requires: (1) Greek grammar knowledge, (2) patristic familiarity, (3) historical theology awareness—all integrated.

2. Google-Proof Design

Simple web searches won't help: these questions require synthesizing multiple scholarly sources that are rarely, if ever, discussed together in a single online resource.

3. Adversarial Distractors

We exploit common LLM failure modes:

  • Misattributed quotes: "Augustine said 'In essentials unity...'" (Actually Rupertus Meldenius)
  • Conflated positions: Questions that require distinguishing Reformed from Lutheran, not just Protestant from Catholic
  • Anachronistic traps: Questions where importing modern concepts produces wrong answers

4. Intra-Tradition Precision

Not just "What do Calvinists believe?" but "Did the Synod of Dort take a supralapsarian or infralapsarian position?"

5. Abstention Testing

Some questions have "insufficient evidence" as the correct answer. We test whether models know when NOT to answer—a critical skill for theological applications.

Example: "What was Origen's final, definitive position on apokatastasis? Did he recant before death?"

The correct answer acknowledges that evidence is fragmentary and scholars continue to debate.
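A minimal sketch of how abstention-aware scoring could work: items whose ground truth is "insufficient evidence" reward an explicit abstention and give zero credit to a fluent but overconfident guess. The names (`Item`, `score`) and the string-matching rule are illustrative assumptions, not FaithBench's actual grading implementation.

```python
# Hypothetical sketch of abstention-aware scoring. The ABSTAIN sentinel,
# Item dataclass, and substring check are illustrative, not FaithBench's
# published grader.

from dataclasses import dataclass

ABSTAIN = "insufficient evidence"

@dataclass
class Item:
    question: str
    answer: str  # ground truth; may be ABSTAIN

def score(item: Item, model_answer: str) -> float:
    """Return 1.0 for a correct answer, 0.0 otherwise.

    For abstention items, only an explicit abstention earns credit;
    a confident wrong answer scores zero.
    """
    normalized = model_answer.strip().lower()
    if item.answer == ABSTAIN:
        return 1.0 if ABSTAIN in normalized else 0.0
    return 1.0 if normalized == item.answer.strip().lower() else 0.0

origen = Item(
    question="What was Origen's final, definitive position on apokatastasis?",
    answer=ABSTAIN,
)
print(score(origen, "The evidence is fragmentary: insufficient evidence."))  # 1.0
print(score(origen, "He firmly recanted before his death."))                 # 0.0
```

A production grader would use an LLM judge or rubric rather than substring matching, but the scoring asymmetry is the point: on these items, knowing when not to answer is what earns the mark.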

6. LLM Pre-Screening

Every expert question is tested against GPT-5, Claude Opus, and Gemini Pro before inclusion. If any model scores >70%, we reject or revise the question.
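The rejection rule above can be sketched as a simple filter: a candidate question survives only if no frontier model exceeds the accuracy ceiling. The model names and the `prescreen` helper are placeholders for illustration, not a real API.

```python
# Illustrative sketch of the pre-screening rule: reject or revise a candidate
# question if ANY frontier model exceeds the 70% accuracy ceiling.
# FRONTIER_MODELS and prescreen() are stand-in names, not a real API.

from typing import Dict

FRONTIER_MODELS = ["gpt-5", "claude-opus", "gemini-pro"]  # placeholder identifiers
CEILING = 0.70  # reject if any model scores above 70%

def prescreen(accuracy_by_model: Dict[str, float]) -> str:
    """Classify a candidate question as 'accept' or 'reject/revise'."""
    for model in FRONTIER_MODELS:
        if accuracy_by_model.get(model, 0.0) > CEILING:
            return "reject/revise"
    return "accept"

# A question one model answers correctly 85% of the time is sent back:
print(prescreen({"gpt-5": 0.85, "claude-opus": 0.40, "gemini-pro": 0.35}))
# A question no model cracks stays in the pool:
print(prescreen({"gpt-5": 0.30, "claude-opus": 0.25, "gemini-pro": 0.20}))
```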

What This Means for the Leaderboard

With expert-level tests, we expect:

  • Greater spread: Top models currently cluster at 85-92%. Expert tier should spread them across 30-60%.
  • True discrimination: The best models will separate from the pack.
  • Reduced gaming: Held-out expert questions prevent training contamination.

Categories Covered

Our 100+ expert questions span all six dimensions:

Section       | Expert Questions | Example Category
Textual       | 20 | Hapax legomena, manuscript variants with theological stakes
Hermeneutical | 20 | Genre disputes, sensus plenior, typology boundaries
Doctrinal     | 25 | Intra-tradition debates (Thomist/Molinist, essence-energies)
Historical    | 15 | Patristic attribution, council canons, Reformation specifics
Apologetics   | 15 | Modal logic arguments, grounding objections
Intertextual  | 15 | Second Temple interpretation, MT vs. LXX patterns

Plus 5-10 abstention questions that test appropriate epistemic humility.

Sample Expert Questions

Textual

"What is the meaning of epiousios in the Lord's Prayer? This word appears nowhere else in Greek literature before the NT. List the three main scholarly interpretations and the patristic evidence for each."

Doctrinal

"Distinguish supralapsarianism from infralapsarianism. Which position did the Synod of Dort take? Which did Turretin favor? Which does the Westminster Confession imply?"

Historical

"The phrase 'In essentials unity, in non-essentials liberty, in all things charity' is often attributed to Augustine. Who actually wrote it and in what context?"

Apologetics

"Explain the 'grounding objection' to Molinism. How do Molinists like Thomas Flint respond to the claim that counterfactuals of creaturely freedom lack truth-makers?"

Coming Soon

Expert-level results will appear on the leaderboard once we've completed validation. We'll publish:

  • Discrimination index by difficulty tier
  • Model accuracy distributions
  • Analysis of which categories challenge which models
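One standard way to compute a per-item discrimination index is the classical upper-minus-lower group difference from psychometrics: how much more often high overall scorers get an item right than low overall scorers. This sketch shows that formula under stated assumptions; it is one plausible reading of "discrimination index," not FaithBench's published methodology.

```python
# Classical upper-minus-lower discrimination index (standard psychometrics,
# not FaithBench's published formula). Each result pairs a test-taker's
# overall score with whether they answered this particular item correctly.

from typing import List, Tuple

def discrimination_index(
    results: List[Tuple[float, bool]], fraction: float = 0.27
) -> float:
    """Return p(correct | top group) - p(correct | bottom group), in [-1, 1].

    results: (overall_score, item_correct) pairs; fraction sets the size
    of the top and bottom groups (0.27 is the conventional choice).
    """
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    p_top = sum(correct for _, correct in ranked[:k]) / k
    p_bottom = sum(correct for _, correct in ranked[-k:]) / k
    return p_top - p_bottom

# Toy data: high scorers get the item right, low scorers miss it,
# so the item discriminates strongly.
data = [(0.90, True), (0.85, True), (0.50, False),
        (0.45, True), (0.20, False), (0.15, False)]
print(discrimination_index(data, fraction=0.34))  # 1.0
```

An index near 0 flags an item that tells you nothing about model quality (everyone passes, or everyone fails), which is exactly the saturation failure the expert tier is built to avoid.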

Stay tuned for insights into where even the best models struggle with theological reasoning.



Questions about our methodology? See our full methodology documentation.