Research Preview: FaithBench is an active research project. All scores are preliminary, generated by a single AI judge, and pending human validation. See our Limitations section for details.



consumer-ai, methodology, model-comparison · February 2, 2026

Which AI Are You Actually Using for Bible Study?

By FaithBench Research

Most people use free AI for spiritual questions. Here's why that matters—and why the gap between models is bigger than you think.


As of early 2026, OpenAI reports approximately 900 million weekly ChatGPT users. Meta reports 630 million monthly Meta AI users across its platforms. Grok on X reportedly has around 64 million users.

But here's what most users may not realize: an estimated 81% of ChatGPT users are on the free tier (based on publicly reported subscriber figures vs. total users). They're not using GPT-5.2. They're using GPT-5 Mini, a smaller, cheaper model optimized for cost, not capability.
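That 81% estimate can be back-checked with simple arithmetic. A minimal sketch, using only the figures cited above; the implied paid-subscriber count is derived from the stated percentage, not an independently reported number:

```python
# Back-of-envelope check of the free-tier share.
# weekly_users is the figure cited in the article; the paid count is
# simply what an 81% free share implies, not a reported statistic.
weekly_users = 900_000_000   # reported weekly ChatGPT users (early 2026)
free_share = 0.81            # estimated share on the free tier

free_users = weekly_users * free_share        # ~729M on GPT-5 Mini
paid_users = weekly_users * (1 - free_share)  # ~171M on GPT-5.2

print(f"Implied free-tier users: {free_users / 1e6:.0f}M")
print(f"Implied paid subscribers: {paid_users / 1e6:.0f}M")
```

In other words, if the estimate holds, roughly 729 million weekly users are getting GPT-5 Mini's answers, not GPT-5.2's.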

And the gap matters more than you might expect.

The Model Behind the App

Every consumer AI app has a default model. Most users never change it.

Consumer App        | Default Model     | FaithBench Score | Expert Score
ChatGPT (paid)      | GPT-5.2           | 97%              | 94%
ChatGPT (free)      | GPT-5 Mini        | 73%              | 55%
Meta AI (WhatsApp)  | Llama 4 Scout     | 79%              | 72%
Grok (X/Twitter)    | Grok 4            | 98%              | 98%
Gemini app          | Gemini 3 Flash    | 99%              | 99%
Claude.ai           | Claude Sonnet 4.5 | 98%              | 98%

The pattern is clear: free tiers often run smaller, cheaper models—and the theological accuracy gap can be substantial.

The Gap

The gaps are substantial—but not universal.

The good news: Some free AI tools perform excellently. Grok 4 (free on X) scores 98% overall and 98% on expert questions. Gemini 3 Flash scores 99%. If you're using these, you're getting frontier-tier theological accuracy.

The bad news: The two most popular consumer AI tools have significant gaps.

ChatGPT (81% of users are on free tier):

  • GPT-5.2 (paid): 97% overall, 94% expert
  • GPT-5 Mini (free): 73% overall, 55% expert

On seminary-level theological questions, the free ChatGPT model scores below passing in our preliminary benchmarks.

Meta AI (630 million WhatsApp users):

  • Llama 4 Scout: 79% overall, 72% expert

Meta AI powers conversations in WhatsApp, Instagram, and Facebook, some of the world's most intimate messaging platforms. Hundreds of millions of people are asking it spiritual questions, and it scores roughly 26 points below frontier models on expert-level theological questions (72% versus 98%).

This matters because expert questions are precisely the ones where people need reliable answers: the nature of the Trinity, denominational distinctives, textual criticism, historical theology. The questions pastors, seminary students, and serious Bible readers actually ask.

Real Harm: The Father Justin Disaster

In April 2024, Catholic Answers—a respected apologetics organization—launched "Father Justin," an AI priest. They'd invested $10,000 and six months of development. The AI avatar wore clerical vestments and spoke with pastoral warmth.

Within hours:

  • Told a user that baptism with Gatorade was "perfectly all right"
  • Gave blessing to a user wanting to marry her brother
  • Offered absolution for sins (a sacrament only ordained priests can perform)
  • Claimed to be a real priest living in Assisi, Italy

Father Justin was defrocked within 24 hours of launch.

The Catholic Answers team didn't use a weak model. They didn't skip testing. They simply couldn't anticipate every edge case where an AI would sound theologically confident while being completely wrong.

Smaller, cheaper models make these errors more frequently. They lack the capacity to hold nuanced theological distinctions in context. They hallucinate with the same confident tone they use for correct answers.

This Isn't Isolated

AI hallucination in religious contexts is well-documented:

  • One study found 32.3% of ChatGPT's scholarly citations were fabricated (Alkaissi & McFarlane, 2023)
  • Another study found 30 hallucinated citations for psychology of religion topics versus only 3 for neuropsychology—suggesting religious subjects may be particularly prone to fabrication
  • ChatGPT has generated fake Bible verses when asked for Scripture on specific topics
  • AI has fabricated theological books by real authors—and insisted they existed when challenged

What You Can Do

  1. Know which model you're using. Check your app's settings. Most let you see (and sometimes choose) the underlying model.

  2. Verify everything. AI-generated Scripture references, quotes, and citations need confirmation from primary sources. If an AI cites a book or article, search for it independently.

  3. Consider the stakes. For casual questions, free tiers may suffice. For sermon prep, pastoral counseling, or serious study, use higher-capability models—and still verify.

  4. Check the benchmarks. Not all models perform equally on theological content. Some excel at apologetics but struggle with historical theology. Others handle Protestant distinctives well but confuse Orthodox positions.

Check the Leaderboard

We built FaithBench to measure exactly this: which AI models handle theological questions accurately, across traditions and difficulty levels.

Our leaderboard shows how each model performs on:

  • Six theological dimensions (textual, hermeneutical, doctrinal, historical, apologetics, intertextual)
  • Multiple difficulty tiers (easy through expert)
  • Cross-tradition accuracy (avoiding bias toward any single denomination)
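As a rough illustration of how per-dimension results might roll up into a single leaderboard figure, here is a minimal sketch. The dimension names come from the list above, but the scores and the equal weighting are hypothetical, not FaithBench's actual formula:

```python
# Hypothetical example: equal-weight average across the six dimensions.
# The scores below are invented for illustration; real per-model numbers
# live on the leaderboard, and the actual weighting may differ.
dimension_scores = {
    "textual": 0.97,
    "hermeneutical": 0.95,
    "doctrinal": 0.93,
    "historical": 0.92,
    "apologetics": 0.96,
    "intertextual": 0.94,
}

overall = sum(dimension_scores.values()) / len(dimension_scores)
print(f"Overall: {overall:.1%}")
```

A single averaged number hides exactly the variation the next paragraph describes: a model can post a strong overall score while lagging badly on one dimension, which is why the leaderboard breaks results out per dimension and per difficulty tier.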

We've now tested 16 models including all major consumer AI defaults. The full results may surprise you—some smaller models outperform larger ones on specific theological dimensions.

Before you ask your AI about Scripture, know what you're actually talking to.



View the full model leaderboard or learn about our methodology.