Research Preview: FaithBench is an active research project. All scores are preliminary, generated by a single AI judge, and pending human validation. See our Limitations section for details.

Tags: methodology, expert, benchmark-design · February 1, 2026

Introducing Expert-Level Tests: Where Frontier Models Meet Their Match

By FaithBench Research

FaithBench adds 100+ expert-level questions, designed on principles drawn from GPQA and Humanity's Last Exam, to discriminate among the strongest AI models.


The Saturation Problem

Our analysis revealed a troubling pattern: on easy and medium difficulty questions, frontier models like GPT-5, Claude Opus, and Gemini Pro all score above 80%. When every model aces the test, the benchmark loses its power to discriminate.

This is a known problem in AI evaluation. MMLU went from challenging (50% accuracy in 2021) to saturated (90%+ in 2024) in just three years. The same fate awaits any benchmark that doesn't evolve.

Learning from the Best

We studied how leading benchmarks maintain discrimination:

GPQA (Graduate-Level Google-Proof Q&A)

  • PhD-level questions that experts answer correctly but non-experts fail even with web access
  • Target: 65% expert accuracy, 34% non-expert accuracy

Humanity's Last Exam

  • Questions pre-screened against frontier LLMs—rejected if models answer correctly
  • Result: <10% accuracy for best models

MMLU-Pro

  • Expanded from 4 to 10 answer choices
  • Added multi-step reasoning requirements
  • Frontier models dropped ~15 percentage points vs standard MMLU

Our Expert-Level Design Principles

FaithBench expert questions follow six principles:

1. Multi-Hop Reasoning

Questions require synthesizing 3+ distinct facts. Example:

"In Colossians 1:15, is the genitive 'pases ktiseos' partitive or comparative? Provide grammatical evidence, cite Athanasius's argument in Contra Arianos, and explain why Arians and Nicenes interpreted this differently."

This requires: (1) Greek grammar knowledge, (2) patristic familiarity, (3) historical theology awareness—all integrated.

2. Google-Proof Design

Simple web searches won't help: these questions require synthesizing multiple scholarly sources that are rarely, if ever, discussed together in a single online resource.

3. Adversarial Distractors

We exploit common LLM failure modes:

  • Misattributed quotes: "Augustine said 'In essentials unity...'" (Actually Rupertus Meldenius)
  • Conflated positions: Questions that require distinguishing Reformed from Lutheran, not just Protestant from Catholic
  • Anachronistic traps: Questions where importing modern concepts produces wrong answers

4. Intra-Tradition Precision

Not just "What do Calvinists believe?" but "Did the Synod of Dort take a supralapsarian or infralapsarian position?"

5. Abstention Testing

Some questions have "insufficient evidence" as the correct answer. We test whether models know when NOT to answer—a critical skill for theological applications.

Example: "What was Origen's final, definitive position on apokatastasis? Did he recant before death?"

The correct answer acknowledges that evidence is fragmentary and scholars continue to debate.
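A minimal sketch of how abstention-aware scoring could work: items whose ground truth is "insufficient evidence" reward an explicit abstention and give zero credit to a fluent but overconfident guess. The names (`Item`, `score`) and the string-matching rule are illustrative assumptions, not FaithBench's actual grading implementation.

```python
# Hypothetical sketch of abstention-aware scoring. The ABSTAIN sentinel,
# Item dataclass, and substring check are illustrative, not FaithBench's
# published grader.

from dataclasses import dataclass

ABSTAIN = "insufficient evidence"

@dataclass
class Item:
    question: str
    answer: str  # ground truth; may be ABSTAIN

def score(item: Item, model_answer: str) -> float:
    """Return 1.0 for a correct answer, 0.0 otherwise.

    For abstention items, only an explicit abstention earns credit;
    a confident wrong answer scores zero.
    """
    normalized = model_answer.strip().lower()
    if item.answer == ABSTAIN:
        return 1.0 if ABSTAIN in normalized else 0.0
    return 1.0 if normalized == item.answer.strip().lower() else 0.0

origen = Item(
    question="What was Origen's final, definitive position on apokatastasis?",
    answer=ABSTAIN,
)
print(score(origen, "The evidence is fragmentary: insufficient evidence."))  # 1.0
print(score(origen, "He firmly recanted before his death."))                 # 0.0
```

A production grader would use an LLM judge or rubric rather than substring matching, but the scoring asymmetry is the point: on these items, knowing when not to answer is what earns the mark.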

6. LLM Pre-Screening

Every expert question is tested against GPT-5, Claude Opus, and Gemini Pro before inclusion. If any model scores >70%, we reject or revise the question.
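The rejection rule above can be sketched as a simple filter: a candidate question survives only if no frontier model exceeds the accuracy ceiling. The model names and the `prescreen` helper are placeholders for illustration, not a real API.

```python
# Illustrative sketch of the pre-screening rule: reject or revise a candidate
# question if ANY frontier model exceeds the 70% accuracy ceiling.
# FRONTIER_MODELS and prescreen() are stand-in names, not a real API.

from typing import Dict

FRONTIER_MODELS = ["gpt-5", "claude-opus", "gemini-pro"]  # placeholder identifiers
CEILING = 0.70  # reject if any model scores above 70%

def prescreen(accuracy_by_model: Dict[str, float]) -> str:
    """Classify a candidate question as 'accept' or 'reject/revise'."""
    for model in FRONTIER_MODELS:
        if accuracy_by_model.get(model, 0.0) > CEILING:
            return "reject/revise"
    return "accept"

# A question one model answers correctly 85% of the time is sent back:
print(prescreen({"gpt-5": 0.85, "claude-opus": 0.40, "gemini-pro": 0.35}))
# A question no model cracks stays in the pool:
print(prescreen({"gpt-5": 0.30, "claude-opus": 0.25, "gemini-pro": 0.20}))
```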

What This Means for the Leaderboard

With expert-level tests, we expect:

  • Greater spread: Top models currently cluster at 85-92%. Expert tier should spread them across 30-60%.
  • True discrimination: The best models will separate from the pack.
  • Reduced gaming: Held-out expert questions prevent training contamination.

Categories Covered

Our 100+ expert questions span all six dimensions:

Section       | Expert Questions | Example Category
Textual       | 20 | Hapax legomena, manuscript variants with theological stakes
Hermeneutical | 20 | Genre disputes, sensus plenior, typology boundaries
Doctrinal     | 25 | Intra-tradition debates (Thomist/Molinist, essence-energies)
Historical    | 15 | Patristic attribution, council canons, Reformation specifics
Apologetics   | 15 | Modal logic arguments, grounding objections
Intertextual  | 15 | Second Temple interpretation, MT vs. LXX patterns

Plus 5-10 abstention questions that test appropriate epistemic humility.

Sample Expert Questions

Textual

"What is the meaning of epiousios in the Lord's Prayer? This word appears nowhere else in Greek literature before the NT. List the three main scholarly interpretations and the patristic evidence for each."

Doctrinal

"Distinguish supralapsarianism from infralapsarianism. Which position did the Synod of Dort take? Which did Turretin favor? Which does the Westminster Confession imply?"

Historical

"The phrase 'In essentials unity, in non-essentials liberty, in all things charity' is often attributed to Augustine. Who actually wrote it and in what context?"

Apologetics

"Explain the 'grounding objection' to Molinism. How do Molinists like Thomas Flint respond to the claim that counterfactuals of creaturely freedom lack truth-makers?"

Coming Soon

Expert-level results will appear on the leaderboard once we've completed validation. We'll publish:

  • Discrimination index by difficulty tier
  • Model accuracy distributions
  • Analysis of which categories challenge which models
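One standard way to compute a per-item discrimination index is the classical upper-minus-lower group difference from psychometrics: how much more often high overall scorers get an item right than low overall scorers. This sketch shows that formula under stated assumptions; it is one plausible reading of "discrimination index," not FaithBench's published methodology.

```python
# Classical upper-minus-lower discrimination index (standard psychometrics,
# not FaithBench's published formula). Each result pairs a test-taker's
# overall score with whether they answered this particular item correctly.

from typing import List, Tuple

def discrimination_index(
    results: List[Tuple[float, bool]], fraction: float = 0.27
) -> float:
    """Return p(correct | top group) - p(correct | bottom group), in [-1, 1].

    results: (overall_score, item_correct) pairs; fraction sets the size
    of the top and bottom groups (0.27 is the conventional choice).
    """
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    p_top = sum(correct for _, correct in ranked[:k]) / k
    p_bottom = sum(correct for _, correct in ranked[-k:]) / k
    return p_top - p_bottom

# Toy data: high scorers get the item right, low scorers miss it,
# so the item discriminates strongly.
data = [(0.90, True), (0.85, True), (0.50, False),
        (0.45, True), (0.20, False), (0.15, False)]
print(discrimination_index(data, fraction=0.34))  # 1.0
```

An index near 0 flags an item that tells you nothing about model quality (everyone passes, or everyone fails), which is exactly the saturation failure the expert tier is built to avoid.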

Stay tuned for insights into where even the best models struggle with theological reasoning.



Questions about our methodology? See our full methodology documentation.