Why Orthodox Distinctives Trip Up LLMs
AI generates icons with gibberish text and missing fingers. It collapses theosis into New Age pantheism. The failures aren't random—they reveal a structural bias toward Western theological categories.
Most AI benchmarks treat Christianity as a monolith. FaithBench tests what matters: Reformed precision, Catholic nuance, Orthodox depth—and everything in between.
Full transparency: Sign in to see how models scored on the public half of our test cases—prompts, responses, and judge reasoning included.
Other AI Benchmarks
~7 generic "religion" questions
No tradition specificity. Buddhism, Christianity, and Islam treated as interchangeable test topics.
FaithBench
300+ theological test cases across 7 Christian traditions
Generic benchmarks lump all religions together. FaithBench asks the questions that separate traditions: can an AI distinguish Reformed soteriology from Catholic? Pentecostal pneumatology from Orthodox?
Anti-gaming design: 50% of test cases are public for transparency. 50% are held out to prevent benchmark overfitting.
Ranked by performance across theology, doctrine, and biblical interpretation.
Six dimensions that measure theological understanding beyond surface pattern matching.
Biblical text and languages: Evaluates understanding of the biblical text itself—original languages, manuscript traditions, and translation nuances. Tests whether models can accurately handle Greek and Hebrew terms, textual variants, and the relationship between source texts and modern translations.
Hermeneutics: Measures competence in biblical interpretation methodology. Can the model distinguish between literal, allegorical, typological, and redemptive-historical approaches? Does it recognize how different interpretive frameworks yield different theological conclusions from the same passage?
Apologetics: Tests logical rigor in theological argumentation. Evaluates whether models can construct valid arguments, engage charitably with objections, and distinguish between philosophical, evidential, and presuppositional approaches to defending Christian claims.
Systematic theology: Assesses precision in articulating doctrine across traditions. Can the model accurately represent Reformed, Catholic, Orthodox, and Evangelical positions on contested doctrines? Does it understand where traditions agree and where they diverge?
Intertextuality: Evaluates recognition of biblical cross-references, typology, and thematic connections. Tests whether models can identify how New Testament authors interpret Old Testament texts, recognize prophetic fulfillment patterns, and trace theological themes across the canon.
Historical theology: Measures knowledge of how Christian doctrine developed through councils, controversies, and confessions. Tests understanding of patristic sources, Reformation debates, and how historical context shaped theological formulation across different eras and traditions.
See exactly how models perform—not just scores, but the actual prompts, responses, and judge reasoning.
Free account required to view detailed test results
Deep dives into theological AI evaluation, benchmark methodology, and what the results reveal.
A Catholic AI trained on 23,000 Church documents still can't tell settled doctrine from open questions. Protestant bias isn't accidental. It's architectural.
ChatGPT placed Calvin in the 20th century. That's not the real problem. AI systematically distorts Reformed soteriology in one direction—toward human agency, away from divine sovereignty.
We're building the benchmark infrastructure Christian AI needs. Academic institutions, seminaries, and AI labs are invited to contribute test cases, validate methodology, or sponsor research.
Help validate AI responses as part of our human-in-the-loop judging program. We need professors, pastors, and researchers across all Christian traditions.
FaithBench is an open research initiative. Our methodology and the public half of the test set will be publicly available.