AI Literacy Matters More Than Benchmarks
By FaithBench Research
The top 10 AI models are within 5.4% of each other. Meanwhile, 89% of orgs need AI skills and only 6% have started. We build benchmarks. Here's why literacy matters more.
The top 10 models on Chatbot Arena are within 5.4% of each other's Elo ratings (Stanford HAI, 2025). Meanwhile, the gap between a well-crafted prompt and a lazy one produces larger quality differences than swapping between any two frontier models.
89% of organizations say they need AI-skilled talent. 6% have started building that capacity (World Economic Forum, 2025).
90% of Americans have heard of AI. 30% can correctly identify what it does (Pew Research, 2025).
We build benchmarks. We believe in rigorous model evaluation. But building FaithBench has taught us something uncomfortable: the bottleneck isn't model quality. It's user literacy.
The Benchmark Ceiling
MMLU went from challenging to saturated in three years. When Microsoft released Phi-3-mini in 2024, the 3.8-billion-parameter model scored roughly 69% on MMLU, the same level PaLM reached with 540 billion parameters in 2022. That's a 142x parameter reduction for equivalent performance.
| Year | Model | Parameters | MMLU Score |
|---|---|---|---|
| 2022 | PaLM | 540B | ~69% |
| 2023 | GPT-4 | ~1.8T (est.) | ~86% |
| 2024 | Phi-3-mini | 3.8B | ~69% |
| 2024 | GPT-4o | Unknown | ~88% |
The cost story is just as dramatic. Inference pricing for GPT-4-equivalent capability dropped from roughly $20 per million tokens to $0.40 per million tokens between 2023 and 2025 (Epoch AI, 2025). That's a 50x cost reduction in two years.
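Both figures are simple ratios worth checking for yourself:

```python
# Quick check of the two reductions quoted above: parameter count
# (PaLM vs. Phi-3-mini) and per-token inference price (Epoch AI figures).
palm_params = 540e9   # PaLM, 2022
phi3_params = 3.8e9   # Phi-3-mini, 2024
print(f"parameter reduction: {palm_params / phi3_params:.0f}x")  # 142x

price_2023 = 20.00    # USD per million tokens, GPT-4-class, 2023
price_2025 = 0.40     # USD per million tokens, equivalent capability, 2025
print(f"cost reduction: {price_2023 / price_2025:.0f}x")         # 50x
```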
But benchmark scores have stopped meaning what they used to.
Balloccu et al. (2024) documented that GPT-3.5 and GPT-4 were exposed to approximately 4.7 million samples from 263 different benchmarks through data leakage. Their paper title says it plainly: "Leak, Cheat, Repeat." Singh et al. (2025) revealed that Meta privately tested 27 Llama 4 model variants on LMArena, publishing results only from the best-performing version. When Collinear AI (2025) analyzed leaderboard dynamics, they found textbook Goodhart's Law: once the metric became the target, it stopped measuring what it was supposed to measure.
Important
When a 3.8B-parameter model matches what required 540B two years ago, the question shifts from "which model is best?" to "can you use what you already have?"
The Literacy Gap
Here's what the adoption data actually shows.
78% of companies use AI in at least one business function. 1% describe their deployments as "mature" (McKinsey, 2025). More than 80% of companies report no tangible EBIT impact from generative AI (McKinsey, 2025).
| Metric | Figure | Source |
|---|---|---|
| Companies using AI | 78% | McKinsey 2025 |
| Mature AI deployments | 1% | McKinsey 2025 |
| No tangible EBIT impact | 80%+ | McKinsey 2025 |
| Teams using AI weekly | 82% | DataCamp 2025 |
| Leaders acknowledging literacy gap | ~60% | DataCamp 2025 |
| Orgs needing AI skills | 89% | WEF 2025 |
| Orgs that have started building | 6% | WEF 2025 |
| Workforce needing reskilling (3 years) | 40% | WEF 2025 |
| U.S. workers using AI on the job | 21% | Pew 2025 |
| K-12 CS teachers: AI should be taught | 81% | Stanford HAI 2025 |
| K-12 CS teachers who feel equipped | <50% | Stanford HAI 2025 |
The EU made AI literacy a legal requirement. Article 4 of the EU AI Act, effective February 2, 2025, mandates that organizations deploying AI systems ensure "a sufficient level of AI literacy of their staff and other persons dealing with the operation and use of AI systems" (EU AI Act, 2024).
78% adoption. 1% maturity. The bottleneck isn't the technology.
What FaithBench Taught Us
Theology turns out to be a useful stress test for the literacy argument.
FaithBench measures theological accuracy across traditions. The biggest failure mode isn't model quality. It's that users don't specify what tradition they're asking from. Ask any model "explain salvation" without context and you get Moralistic Therapeutic Deism (Smith & Denton, 2005) from every single one (see our generic spirituality problem analysis). The model produces a linguistic average that belongs to no actual faith tradition.
But watch what happens when the prompt changes.
Prompt: "Explain communion."
Typical output: "Communion is a sacred ritual practiced in many Christian traditions where believers share bread and wine (or grape juice) to remember Jesus's sacrifice. It symbolizes unity with Christ and with fellow believers, representing spiritual nourishment and community."
Generic. Correct-ish. Belongs to no tradition.

Prompt: "Explain communion as a confessional Lutheran pastor would teach it."

Now the output names the real presence, the sacramental union, Christ's body and blood received "in, with, and under" the bread and wine. Actual doctrine, from a tradition that actually holds it.

Same model. Same moment. The difference isn't compute or parameters.
This parallels what Dell'Acqua et al. (2023) found in their study of 758 BCG consultants at Harvard Business School. Consultants using AI performed 40% better on tasks within AI's capability boundary. But on tasks outside that boundary, they performed 19 percentage points worse than consultants without AI. Same tool, same people, same training. The difference was knowing where the boundary is.
Important
The difference between those prompts isn't model selection. It's user literacy. The person who knows what to ask gets theology. The person who doesn't gets self-help.
Smaller Models, Smarter Users
The assumption that bigger models produce better results is breaking down.
Predibase ran over 700 experiments for its "LoRA Land" study in 2024, fine-tuning 25 Mistral-7B models on specific tasks. Result: those 7-billion-parameter models outperformed GPT-4 by 4-15% on their target domains. JPMorgan Chase built a contract analysis model that achieved 70% cost savings over general-purpose LLM alternatives. IBM's Granite models run 3x-23x cheaper than frontier models while matching or exceeding performance on specific tasks (IBM, 2025).
The small language model market reflects this shift. Valued at $0.93 billion in 2025, it's projected to reach $5.45 billion by 2032.
| Task | Right Tool | Overkill |
|---|---|---|
| Summarize a meeting | GPT-4o mini / Claude Haiku | GPT-5 / Claude Opus |
| Draft an email | Any free-tier model | API with frontier model |
| Analyze a contract clause | Fine-tuned 7B model | General-purpose frontier |
| Generate sermon outline | Model with tradition-specific prompt | Expensive model with vague prompt |
| Research a theological question | Good prompt + mid-tier model | Bad prompt + expensive model |
Prompt engineering research points the same direction: in one 2025 study, well-structured prompts reduced API costs by 76% while maintaining output quality.
Enterprises have figured this out. They run model portfolios now, routing tasks to appropriate sizes rather than defaulting to the largest available option.
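The routing pattern is straightforward to express in code. Here's a minimal sketch of a model portfolio router; the tier names, prices, and task categories are illustrative assumptions, not any vendor's actual pricing or API:

```python
# A minimal sketch of task-based model routing: map each task category
# to the cheapest tier that handles it well, defaulting to the frontier
# tier only for unmapped work. All names and prices are illustrative.
from dataclasses import dataclass

@dataclass
class ModelTier:
    name: str
    cost_per_million_tokens: float  # USD, illustrative

TIERS = {
    "small": ModelTier("small-7b", 0.20),
    "mid": ModelTier("mid-tier", 0.60),
    "frontier": ModelTier("frontier", 15.00),
}

ROUTES = {
    "summarize_meeting": "small",
    "draft_email": "small",
    "contract_clause": "small",       # fine-tuned domain model
    "sermon_outline": "mid",          # the tradition-specific prompt matters more
    "theological_research": "mid",
}

def route(task: str) -> ModelTier:
    """Pick a model tier for a task; unknown tasks fall back to frontier."""
    return TIERS[ROUTES.get(task, "frontier")]

print(route("draft_email").name)   # small-7b
print(route("novel_task").name)    # frontier
```

The point isn't the code; it's the habit. Defaulting to the largest model is the illiterate default, and an explicit routing table makes the cost of that default visible.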
Safety Through Literacy
The literacy gap isn't just about efficiency. It causes direct harm.
The BCG-Harvard study again: consultants didn't just perform worse on out-of-boundary tasks. They performed worse with AI than without it. The AI gave them confident wrong answers they trusted. That's a measurable safety failure caused by literacy gaps, not model limitations.
A systematic review of 35 studies on automation bias found consistent over-reliance on AI recommendations across healthcare, law, and public administration (AI & Society, 2025). In one study, false-positive AI suggestions affected radiologists' diagnoses of cerebral aneurysms (ScienceDirect, 2024). The AI wasn't wrong because it was a bad model. The harm came from professionals who didn't know how to calibrate their trust.
Warning
In 2024, the Catholic apologetics organization Catholic Answers invested roughly $10,000 to deploy "Father Justin," an AI chatbot for the faithful. Within 24 hours it told a user it could perform baptisms, and the persona was promptly "defrocked" to plain Justin. The development team wasn't incompetent. They lacked literacy about edge cases, prompt injection, and the gap between language generation and theological authority.
62% of Americans interact with AI multiple times per week, but only 17% believe it will have a positive impact on their lives (Pew, 2025). That disconnect between use and trust suggests people know something is off but lack the framework to articulate what.
UNESCO's research on AI and faith communities found that AI systems are rarely designed with input from religious groups, leading to systematic misrepresentation. The EU didn't mandate AI literacy in Article 4 because it seemed like a nice idea. They classified it as a safety regulation.
The Global Dimension
The literacy gap mirrors and amplifies existing global divides.
High-income countries produce 87% of notable AI models, host 86% of AI startups, and receive 91% of venture capital, while representing 17% of the world's population (World Bank, 2025). The United States has 200 times more servers per capita than middle-income countries and 20,000 times more than low-income countries (World Bank, 2025). 2.6 billion people still lack internet access (UNESCO, 2025). Less than 5% of the population in low-income countries has basic digital skills, compared to 66% in high-income countries (UNCTAD, 2025).
But people are finding a way. More than 40% of ChatGPT traffic now comes from middle-income countries (World Bank, 2025). India's AI Samarth program is reaching approximately 5 million learners. UNESCO is supporting 58 countries in designing AI competency frameworks.
The OECD and European Commission released a joint AI Literacy Framework in May 2025 defining 22 competences across four domains. It's becoming the basis for the first PISA assessment of AI literacy, which would make AI literacy a globally measured educational outcome alongside reading and math.
Access is getting cheaper fast. The question is whether literacy keeps up with access, or whether billions of new users encounter AI with the same 48/100 faith score and no framework for evaluating what they receive.
What AI Literacy Actually Looks Like
This isn't "learn about AI" as a vague aspiration. Here's what it means in practice.
Know what model you're using. Check your settings. Free-tier ChatGPT and GPT-4o produce meaningfully different outputs. Most people don't know which one they're talking to. The model matters less than people think, but it still matters.
The prompt is the steering wheel. Specify tradition, context, level of detail, and intended audience. A good prompt to a mid-tier model beats a vague prompt to a frontier model. This is the single highest-leverage literacy skill.
Verify everything. AI-generated citations need primary source checking. Scripture references need looking up. Theological claims need cross-referencing against actual confessional documents. Assume hallucination until confirmed. This applies doubly to faith content, where the model confidently produces tradition-blended outputs that sound authoritative.
Know the limitations. AI averages traditions. It doesn't hold confessional commitments. When it produces an answer about salvation or communion or the nature of God, it's generating a linguistic midpoint, not representing a theological position. Understanding this changes how you read every output.
Match the tool to the task. Not every question needs a frontier model. Not every question can be trusted to a free tier. A meeting summary and a theological analysis have different requirements. Literate users route accordingly.
Understand that formation is happening. Every AI conversation about faith is catechesis. It's shaping understanding, vocabulary, and assumptions. Literate users approach it that way, treating AI outputs as a starting point that requires theological evaluation rather than an authority that settles questions.
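The second practice above, specifying tradition, context, detail, and audience, can be made mechanical. A minimal sketch (the function and field names are illustrative, not a FaithBench API):

```python
# A prompt builder that refuses a bare question: tradition and audience
# must be stated explicitly, which is exactly the literacy habit the
# practices above describe. Illustrative sketch, not a real API.

def build_prompt(question: str, tradition: str, audience: str,
                 detail: str = "moderate") -> str:
    """Compose a tradition-specific prompt from explicit context fields."""
    if not tradition:
        raise ValueError(
            "Specify a tradition; a bare question yields a generic blend."
        )
    return (
        f"Answer from the perspective of {tradition}. "
        f"Audience: {audience}. Level of detail: {detail}. "
        f"Cite confessional documents where relevant. "
        f"Question: {question}"
    )

print(build_prompt("Explain communion.",
                   tradition="confessional Lutheranism",
                   audience="adult catechism class"))
```

Forcing the fields into the function signature is the whole trick: the user who can't fill them in has discovered exactly what they don't yet know.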
The Reframe
Benchmarks measure models. Literacy measures users.
The urgent gap isn't between GPT-5 and Claude Opus. It's between the 78% using AI and the 1% doing it maturely. The difference between a 3.8B-parameter model and a 540B-parameter model shrinks every quarter. The difference between a literate user and an illiterate one compounds.
We'll keep building benchmarks. They matter for model developers, for researchers, and for tracking whether the theological flattening problem improves over time. But the next time you see a leaderboard, ask a different question.
Not "which model scored highest?" but "does the person using it know what they're looking at?"
Note
Note on sources and data: Statistics cited in this post reflect published figures from the referenced organizations as of early 2026. AI benchmarks and market data evolve rapidly. See our methodology for how FaithBench approaches evaluation rigor.
References
Balloccu, S., Schmidtke, P., Liepins, R., & De Freitas, J. (2024). Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source LLMs. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL). https://arxiv.org/abs/2402.03927
Collinear AI. (2025). Goodhart's Law in AI benchmarks: When the measure becomes the target. Collinear AI Research. https://collinear.ai
DataCamp. (2025). The State of Data & AI Literacy Report 2025. https://www.datacamp.com/report/data-ai-literacy
Dell'Acqua, F., McFowland, E., Mollick, E., Lifshitz-Assaf, H., Kellogg, K., Rajendran, S., Krayer, L., Candelon, F., & Lakhani, K. R. (2023). Navigating the jagged technological frontier: Field experimental evidence of the effects of AI on knowledge worker productivity and quality. Harvard Business School Working Paper, 24-013. https://www.hbs.edu/ris/Publication%20Files/24-013_d9b45b68-9e74-42d6-a1c6-c72fb70c7571.pdf
Epoch AI. (2025). Trends in AI inference pricing. https://epoch.ai
EU AI Act. (2024). Regulation (EU) 2024/1689 of the European Parliament and of the Council, Article 4: AI Literacy. Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/oj
Gloo. (2025, December 15). Gloo unveils the first benchmark exposing how AI misses Christian worldview and values [Press release]. https://gloo.com/press/releases/gloo-unveils-the-first-benchmark-exposing-how-ai-misses-christian-worldview-and-values
IBM. (2025). IBM Granite: Enterprise-grade small language models. https://www.ibm.com/granite
McKinsey & Company. (2025, March). The state of AI: How organizations are rewiring to capture value. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai
OECD & European Commission. (2025, May). An AI Literacy Framework: 22 competences for understanding, using, and engaging with AI. OECD Publishing.
Pew Research Center. (2025). Americans and AI: Usage, awareness, and attitudes. https://www.pewresearch.org
Predibase. (2024). LoRA Land: 310 fine-tuned LLMs that rival GPT-4. https://predibase.com/lora-land
ScienceDirect. (2024). Impact of false-positive AI suggestions on radiological diagnosis of cerebral aneurysms. Radiology: Artificial Intelligence.
Singh, A., et al. (2025). The Leaderboard Illusion: How gaming undermines AI evaluation. arXiv preprint. https://arxiv.org/abs/2504.20879
Smith, C., & Denton, M. L. (2005). Soul searching: The religious and spiritual lives of American teenagers. Oxford University Press.
AI & Society. (2025). Automation bias in AI-assisted decision making: A systematic review. Springer.
Stanford HAI. (2025). AI Index Report 2025. Stanford University Human-Centered Artificial Intelligence. https://aiindex.stanford.edu
UNCTAD. (2025). Digital Economy Report 2025: Bridging the digital divide. United Nations Conference on Trade and Development.
UNESCO. (2025). AI competency frameworks for teachers and students. https://www.unesco.org/en/digital-education/ai-competency-frameworks
World Bank. (2025). The global AI divide: Infrastructure, investment, and access. World Bank Group.
World Economic Forum. (2025). The Future of Jobs Report 2025. https://www.weforum.org/publications/the-future-of-jobs-report-2025