Which AI Are You Actually Using for Bible Study?
By FaithBench Research
Most people use free AI for spiritual questions. Here's why that matters—and why the gap between models is bigger than you think.
As of early 2026, OpenAI reports approximately 900 million weekly ChatGPT users. Meta reports 630 million monthly Meta AI users across its platforms. Grok on X reportedly has around 64 million users.
But here's what most people may not realize: an estimated 81% of ChatGPT users are on the free tier (based on publicly reported subscriber figures vs. total users). They're not using GPT-5.2. They're using GPT-5 Mini—a smaller, cheaper model optimized for cost, not capability.
And the gap matters more than you might expect.
The Model Behind the App
Every consumer AI app has a default model. Most users never change it.
| Consumer App | Default Model | FaithBench Score | Expert Score |
|---|---|---|---|
| ChatGPT (paid) | GPT-5.2 | 97% | 94% |
| ChatGPT (free) | GPT-5 Mini | 73% | 55% |
| Meta AI (WhatsApp) | Llama 4 Scout | 79% | 72% |
| Grok (X/Twitter) | Grok 4 | 98% | 98% |
| Gemini app | Gemini 3 Flash | 99% | 99% |
| Claude.ai | Claude Sonnet 4.5 | 98% | 98% |
The pattern is clear: free tiers often run smaller, cheaper models—and the theological accuracy gap can be substantial.
The Gap
The gaps are substantial—but not universal.
The good news: Some free AI tools perform excellently. Grok 4 (free on X) scores 98% overall and 98% on expert questions. Gemini 3 Flash scores 99%. If you're using these, you're getting frontier-tier theological accuracy.
The bad news: The two most popular consumer AI tools have significant gaps.
ChatGPT (81% of users are on free tier):
- GPT-5.2 (paid): 97% overall, 94% expert
- GPT-5 Mini (free): 73% overall, 55% expert
On seminary-level theological questions, the free ChatGPT model scores below passing in our preliminary benchmarks.
Meta AI (630 million WhatsApp users):
- Llama 4 Scout: 79% overall, 72% expert
Meta AI powers conversations in WhatsApp, Instagram, and Facebook—the world's most intimate messaging platforms. Hundreds of millions of people are asking it spiritual questions, and it scores more than 20 points below frontier models on hard theological questions.
This matters because expert questions are precisely the ones where people need reliable answers: the nature of the Trinity, denominational distinctives, textual criticism, historical theology. The questions pastors, seminary students, and serious Bible readers actually ask.
Real Harm: The Father Justin Disaster
In April 2024, Catholic Answers—a respected apologetics organization—launched "Father Justin," an AI priest. They'd invested $10,000 and six months of development. The AI avatar wore clerical vestments and spoke with pastoral warmth.
Within hours:
- Told a user that baptism with Gatorade was "perfectly all right"
- Gave blessing to a user wanting to marry her brother
- Offered absolution for sins (a sacrament only ordained priests can perform)
- Claimed to be a real priest living in Assisi, Italy
Father Justin was defrocked within 24 hours of launch.
The Catholic Answers team didn't use a weak model. They didn't skip testing. They simply couldn't anticipate every edge case where an AI would sound theologically confident while being completely wrong.
Smaller, cheaper models make these errors more frequently. They lack the capacity to hold nuanced theological distinctions in context. They hallucinate with the same confident tone they use for correct answers.
This Isn't Isolated
AI hallucination in religious contexts is well-documented:
- One study found 32.3% of ChatGPT's scholarly citations were fabricated (Alkaissi & McFarlane, 2023)
- Another study found 30 hallucinated citations for psychology of religion topics versus only 3 for neuropsychology—suggesting religious subjects may be particularly prone to fabrication
- ChatGPT has generated fake Bible verses when asked for Scripture on specific topics
- AI has fabricated theological books by real authors—and insisted they existed when challenged
Warning
AI models don't "know" they're wrong. They generate plausible-sounding text with the same confidence whether accurate or fabricated. The smaller the model, the more likely it is to confuse theological positions, misattribute quotes, or invent sources.
What You Can Do
- Know which model you're using. Check your app's settings. Most let you see (and sometimes choose) the underlying model.
- Verify everything. AI-generated Scripture references, quotes, and citations need confirmation from primary sources. If an AI cites a book or article, search for it independently.
- Consider the stakes. For casual questions, free tiers may suffice. For sermon prep, pastoral counseling, or serious study, use higher-capability models—and still verify.
- Check the benchmarks. Not all models perform equally on theological content. Some excel at apologetics but struggle with historical theology. Others handle Protestant distinctives well but confuse Orthodox positions.
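One piece of the "verify everything" step can even be automated. As a minimal sketch (not a FaithBench tool), the function below checks whether an AI-cited verse reference at least names a real book and a chapter that exists in it. The `CHAPTER_COUNTS` table here covers only a handful of books for illustration; a real checker would list the full canon and verse counts too.

```python
# Illustrative sketch: sanity-check AI-cited Bible references against
# known chapter counts. CHAPTER_COUNTS is a small sample, not the full
# canon -- extend it before relying on this for real verification.
import re

CHAPTER_COUNTS = {
    "Genesis": 50, "Psalms": 150, "Isaiah": 66,
    "Matthew": 28, "John": 21, "Romans": 16, "Revelation": 22,
}

# Matches references like "John 3:16" or "1 John 4:8".
REF_PATTERN = re.compile(r"^([1-3]?\s?[A-Za-z]+)\s+(\d+):(\d+)$")

def plausible_reference(ref: str) -> bool:
    """Return True if the reference names a known book and a chapter
    that exists in that book. Verse numbers are not checked here."""
    m = REF_PATTERN.match(ref.strip())
    if not m:
        return False
    book, chapter = m.group(1), int(m.group(2))
    return book in CHAPTER_COUNTS and 1 <= chapter <= CHAPTER_COUNTS[book]

print(plausible_reference("John 3:16"))     # True
print(plausible_reference("Isaiah 70:2"))   # False: Isaiah has 66 chapters
print(plausible_reference("Hezekiah 4:1"))  # False: no such book
```

A check like this catches only the crudest fabrications (invented books, impossible chapters); it cannot tell you whether a real verse actually says what the AI claims. For that, there is no substitute for opening the primary source.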
Check the Leaderboard
We built FaithBench to measure exactly this: which AI models handle theological questions accurately, across traditions and difficulty levels.
Our leaderboard shows how each model performs on:
- Six theological dimensions (textual, hermeneutical, doctrinal, historical, apologetics, intertextual)
- Multiple difficulty tiers (easy through expert)
- Cross-tradition accuracy (avoiding bias toward any single denomination)
We've now tested 16 models including all major consumer AI defaults. The full results may surprise you—some smaller models outperform larger ones on specific theological dimensions.
Before you ask your AI about Scripture, know what you're actually talking to.
Note
Note on preliminary data: FaithBench scores cited in this post are from v1.0, which uses a single LLM judge (Gemini 3 Flash) without human inter-rater reliability validation. All models tested at temperature 0 with reasoning disabled for reproducibility—real-world performance with default settings may differ. Scores should be treated as provisional automated assessments. See our methodology for full details on current limitations.
View the full model leaderboard or learn about our methodology.