Research Preview: FaithBench is an active research project. All scores are preliminary, generated by a single AI judge, and pending human validation. See our Limitations section for details.



consumer-ai, methodology, model-comparison · February 2, 2026

Which AI Are You Actually Using for Bible Study?

By FaithBench Research

Most people use free AI for spiritual questions. Here's why that matters—and why the gap between models is bigger than you think.


As of early 2026, OpenAI reports approximately 900 million weekly ChatGPT users. Meta reports 630 million monthly Meta AI users across its platforms. Grok on X reportedly has around 64 million users.

But here's what most users may not realize: an estimated 81% of ChatGPT users are on the free tier (based on publicly reported subscriber figures vs. total users). They're not using GPT-5.2. They're using GPT-5 Mini, a smaller, cheaper model optimized for cost, not capability.
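That 81% estimate can be back-checked with simple arithmetic. A minimal sketch, using only the figures cited above; the implied paid-subscriber count is derived from the stated percentage, not an independently reported number:

```python
# Back-of-envelope check of the free-tier share.
# weekly_users is the figure cited in the article; the paid count is
# simply what an 81% free share implies, not a reported statistic.
weekly_users = 900_000_000   # reported weekly ChatGPT users (early 2026)
free_share = 0.81            # estimated share on the free tier

free_users = weekly_users * free_share        # ~729M on GPT-5 Mini
paid_users = weekly_users * (1 - free_share)  # ~171M on GPT-5.2

print(f"Implied free-tier users: {free_users / 1e6:.0f}M")
print(f"Implied paid subscribers: {paid_users / 1e6:.0f}M")
```

In other words, if the estimate holds, roughly 729 million weekly users are getting GPT-5 Mini's answers, not GPT-5.2's.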

And the gap matters more than you might expect.

The Model Behind the App

Every consumer AI app has a default model. Most users never change it.

Consumer App        | Default Model     | FaithBench Score | Expert Score
ChatGPT (paid)      | GPT-5.2           | 97%              | 94%
ChatGPT (free)      | GPT-5 Mini        | 73%              | 55%
Meta AI (WhatsApp)  | Llama 4 Scout     | 79%              | 72%
Grok (X/Twitter)    | Grok 4            | 98%              | 98%
Gemini app          | Gemini 3 Flash    | 99%              | 99%
Claude.ai           | Claude Sonnet 4.5 | 98%              | 98%

The pattern is clear: free tiers often run smaller, cheaper models—and the theological accuracy gap can be substantial.

The Gap

The gaps are substantial—but not universal.

The good news: Some free AI tools perform excellently. Grok 4 (free on X) scores 98% overall and 98% on expert questions. Gemini 3 Flash scores 99%. If you're using these, you're getting frontier-tier theological accuracy.

The bad news: The two most popular consumer AI tools have significant gaps.

ChatGPT (81% of users are on free tier):

  • GPT-5.2 (paid): 97% overall, 94% expert
  • GPT-5 Mini (free): 73% overall, 55% expert

On seminary-level theological questions, the free ChatGPT model scores below passing in our preliminary benchmarks.

Meta AI (630 million WhatsApp users):

  • Llama 4 Scout: 79% overall, 72% expert

Meta AI powers conversations in WhatsApp, Instagram, and Facebook, some of the world's most intimate messaging platforms. Hundreds of millions of people are asking it spiritual questions, and it scores roughly 26 points below frontier models on expert-level theological questions (72% versus 98%).

This matters because expert questions are precisely the ones where people need reliable answers: the nature of the Trinity, denominational distinctives, textual criticism, historical theology. The questions pastors, seminary students, and serious Bible readers actually ask.

Real Harm: The Father Justin Disaster

In April 2024, Catholic Answers—a respected apologetics organization—launched "Father Justin," an AI priest. They'd invested $10,000 and six months of development. The AI avatar wore clerical vestments and spoke with pastoral warmth.

Within hours:

  • Told a user that baptism with Gatorade was "perfectly all right"
  • Gave blessing to a user wanting to marry her brother
  • Offered absolution for sins (a sacrament only ordained priests can perform)
  • Claimed to be a real priest living in Assisi, Italy

Father Justin was defrocked within 24 hours of launch.

The Catholic Answers team didn't use a weak model. They didn't skip testing. They simply couldn't anticipate every edge case where an AI would sound theologically confident while being completely wrong.

Smaller, cheaper models make these errors more frequently. They lack the capacity to hold nuanced theological distinctions in context. They hallucinate with the same confident tone they use for correct answers.

This Isn't Isolated

AI hallucination in religious contexts is well-documented:

  • One study found 32.3% of ChatGPT's scholarly citations were fabricated (Alkaissi & McFarlane, 2023)
  • Another study found 30 hallucinated citations for psychology of religion topics versus only 3 for neuropsychology—suggesting religious subjects may be particularly prone to fabrication
  • ChatGPT has generated fake Bible verses when asked for Scripture on specific topics
  • AI has fabricated theological books by real authors—and insisted they existed when challenged

What You Can Do

  1. Know which model you're using. Check your app's settings. Most let you see (and sometimes choose) the underlying model.

  2. Verify everything. AI-generated Scripture references, quotes, and citations need confirmation from primary sources. If an AI cites a book or article, search for it independently.

  3. Consider the stakes. For casual questions, free tiers may suffice. For sermon prep, pastoral counseling, or serious study, use higher-capability models—and still verify.

  4. Check the benchmarks. Not all models perform equally on theological content. Some excel at apologetics but struggle with historical theology. Others handle Protestant distinctives well but confuse Orthodox positions.

Check the Leaderboard

We built FaithBench to measure exactly this: which AI models handle theological questions accurately, across traditions and difficulty levels.

Our leaderboard shows how each model performs on:

  • Six theological dimensions (textual, hermeneutical, doctrinal, historical, apologetics, intertextual)
  • Multiple difficulty tiers (easy through expert)
  • Cross-tradition accuracy (avoiding bias toward any single denomination)
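As a rough illustration of how per-dimension results might roll up into a single leaderboard figure, here is a minimal sketch. The dimension names come from the list above, but the scores and the equal weighting are hypothetical, not FaithBench's actual formula:

```python
# Hypothetical example: equal-weight average across the six dimensions.
# The scores below are invented for illustration; real per-model numbers
# live on the leaderboard, and the actual weighting may differ.
dimension_scores = {
    "textual": 0.97,
    "hermeneutical": 0.95,
    "doctrinal": 0.93,
    "historical": 0.92,
    "apologetics": 0.96,
    "intertextual": 0.94,
}

overall = sum(dimension_scores.values()) / len(dimension_scores)
print(f"Overall: {overall:.1%}")
```

A single averaged number hides exactly the variation the next paragraph describes: a model can post a strong overall score while lagging badly on one dimension, which is why the leaderboard breaks results out per dimension and per difficulty tier.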

We've now tested 16 models including all major consumer AI defaults. The full results may surprise you—some smaller models outperform larger ones on specific theological dimensions.

Before you ask your AI about Scripture, know what you're actually talking to.



View the full model leaderboard or learn about our methodology.