Why Large Language Models Tend to Hallucinate on Certain Questions
A deep-dive into the computational, probabilistic, and data-driven roots of AI hallucination — and what the evidence from GPT models tells us about building safer, more reliable systems.
1. Defining Hallucination in LLMs
Hallucination in large language models refers to outputs that are fluent and syntactically correct but factually inaccurate or unsupported by external evidence. In medical contexts, it has been defined as “any instance in which a model generates misleading medical content.”
What makes this particularly dangerous is that these outputs read as confident: the language carries no signal of uncertainty or controversy, so users often cannot distinguish accurate from fabricated information.
2. Four Root Causes
Hallucination arises from four interconnected sources that make it an intrinsic property of LLMs:
- Fundamental computational limits: no finite model can answer every factual query correctly, so some error rate is mathematically unavoidable.
- Probabilistic generation: next-token sampling optimizes for plausible continuations, not verified truth.
- Data-induced factors: gaps, errors, and outdated or conflicting statements in the training corpus propagate into outputs.
- Evaluation misalignment: benchmarks that reward confident answers over calibrated abstention encourage guessing.
3. Task-Specific Patterns
Hallucination rates vary enormously depending on what the model is asked to do. In medical settings, accuracy is highest in fact-dense domains, where answers map directly onto well-established facts.
Integrative domains requiring complex multi-step reasoning show the highest hallucination rates. Similarly, short clinical vignettes produce more hallucinations than longer, more detailed presentations — contextual richness helps anchor the model.
4. Overconfidence & Calibration
GPT models frequently exhibit a systematic overconfidence problem: verbalized confidence is consistently higher than actual accuracy. GPT-3.5, for example, significantly overestimated its true response accuracy at the upper end of its self-reported confidence range.
GPT-4 shows improved calibration — demonstrating a statistically significant confidence-performance correlation (r=0.212, p<0.001), with accuracy ranging from 67% in low-confidence predictions to 87% in high-confidence predictions. Interestingly, GPT-4 exhibited systematic underconfidence in medical contexts, while physicians showed the opposite: poor calibration and systematic overconfidence.
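To make the calibration claim concrete, here is a minimal sketch of how such a confidence-accuracy check can be run on logged model outputs. The arrays are illustrative toy data, not figures from the cited studies.

```python
# Minimal calibration check: given verbalized confidence scores (0-100) and
# graded correctness (0/1) for a set of model answers, compute the
# confidence-accuracy correlation and per-bin accuracy.
import numpy as np
from scipy.stats import pearsonr

confidence = np.array([95, 90, 85, 80, 75, 70, 65, 60, 55, 50, 95, 40])
correct    = np.array([ 1,  1,  0,  1,  1,  0,  1,  0,  0,  1,  1,  0])

# Correlation between stated confidence and actual correctness
r, p = pearsonr(confidence, correct)
print(f"confidence-accuracy correlation: r={r:.3f}, p={p:.3f}")

# Accuracy within coarse confidence bins; a well-calibrated model's accuracy
# should track the bin it claims to be in
for lo, hi in [(0, 60), (60, 80), (80, 101)]:
    mask = (confidence >= lo) & (confidence < hi)
    if mask.any():
        print(f"confidence {lo}-{hi}: accuracy {correct[mask].mean():.2f} "
              f"(n={mask.sum()})")
```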
5. Mitigation Strategies
No single strategy eliminates hallucination entirely. The current best practices operate as layered safeguards:
Prompt engineering reduced adversarial hallucinations from 66% to 44% across all tested models. Chain-of-thought (CoT) prompting encourages explicit reasoning, improving accuracy on complex tasks — but does not always improve verbalized calibration.
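As an illustration of the kind of prompt involved, the sketch below combines chain-of-thought with an instruction to flag false premises. The wording and the fictitious drug name are assumptions, not the prompts used in the cited evaluations.

```python
# Generic chain-of-thought prompt template with an abstention instruction.
# The phrasing is illustrative only.
def build_cot_prompt(question: str) -> str:
    return (
        "Answer the question below. Think step by step, citing only facts "
        "you are certain of. If any premise in the question appears false "
        "or unverifiable, say so instead of answering.\n\n"
        f"Question: {question}\n"
        "Reasoning:"
    )

# Example with a fabricated drug name, the kind of adversarial premise that
# often triggers hallucinated answers
print(build_cot_prompt("What dose of the drug Zylocarbin is recommended "
                       "for pediatric migraine?"))
```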
Retrieval-Augmented Generation (RAG) grounds responses in external knowledge sources, but performance depends heavily on retrieval quality. Retrieval fragility remains a fundamental limitation.
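A stripped-down sketch of the RAG pattern: retrieve the passages most similar to the query, then constrain the model to answer only from them. The bag-of-words retriever and toy corpus here are stand-ins; production systems use dense embeddings, a vector store, and an actual generation call.

```python
# Toy RAG pipeline: rank passages by cosine similarity over word counts,
# then build a grounded prompt from the top-k hits.
from collections import Counter
import math

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    return sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

corpus = [
    "Metformin is a first-line therapy for type 2 diabetes.",
    "Chain-of-thought prompting encourages step-by-step reasoning.",
    "Semantic entropy clusters sampled answers by meaning.",
]
query = "What is the first-line therapy for type 2 diabetes?"
passages = retrieve(query, corpus)
prompt = ("Answer using ONLY the passages below; if they are insufficient, "
          "say 'I don't know'.\n\n" + "\n".join(passages) +
          f"\n\nQuestion: {query}")
print(prompt)  # this grounded prompt would then be sent to the model
```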
Semantic entropy samples multiple answers, groups them by meaning, and flags high-disagreement responses as likely hallucinations — a model-agnostic approach showing genuine promise.
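The sketch below shows the core of the idea under simplifying assumptions: semantic equivalence is approximated by string normalization rather than the natural-language-inference entailment check used in practice, and the sampled answers are invented.

```python
# Semantic entropy sketch: cluster sampled answers that "mean the same
# thing", compute the entropy of the cluster distribution, and treat
# high-entropy questions as likely hallucinations.
import math
from collections import Counter

def normalize(answer: str) -> str:
    # Toy stand-in for a semantic equivalence check
    return " ".join(answer.lower().strip(" .!").split())

def semantic_entropy(samples: list[str]) -> float:
    clusters = Counter(normalize(s) for s in samples)
    n = len(samples)
    probs = [c / n for c in clusters.values()]
    return sum(-p * math.log2(p) for p in probs)

consistent = ["Paris.", "paris", "Paris"]
scattered  = ["1921", "1923.", "1919", "1921"]
print(semantic_entropy(consistent))  # 0.0 bits: samples agree
print(semantic_entropy(scattered))   # 1.5 bits: flag as possible hallucination
```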
Supervised fine-tuning on calibration datasets improves both alignment of stated confidence with accuracy and discrimination between correct and incorrect responses — though gains are task-specific.
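As a rough illustration, a calibration fine-tuning record might pair an answer with a target confidence derived from measured accuracy on similar items, so the model learns to verbalize confidence that matches empirical correctness. The schema below is an assumption, not a published dataset format.

```python
# Hypothetical construction of one calibration fine-tuning example.
def make_calibration_example(question: str, answer: str,
                             empirical_accuracy: float) -> dict:
    # Target confidence is set from measured accuracy, not model self-report
    pct = round(empirical_accuracy * 100)
    return {
        "prompt": question,
        "completion": f"{answer} (Confidence: {pct}%)",
    }

example = make_calibration_example(
    "Which organism most commonly causes community-acquired pneumonia?",
    "Streptococcus pneumoniae",
    empirical_accuracy=0.9,
)
print(example)
```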
6. Conclusion
Hallucination in LLMs is a multifaceted phenomenon arising from fundamental computational limits, probabilistic generation, data-induced factors, and evaluation misalignment. It is not a bug to be fixed in the next version — it is, in part, mathematically inevitable.
The goal should not be the elimination of hallucination, but rather risk reduction through layered safeguards that bring hallucination rates to clinically and practically acceptable levels. GPT-4 and newer models show improved calibration, but systematic overconfidence persists.
As LLMs become increasingly integrated into critical applications, the imperative is clear: robust detection mechanisms, appropriate abstention behaviors, and mandatory human oversight are not optional enhancements — they are ethical requirements.
Full References
30 peer-reviewed sources cited in this article
- Alansari, A. (2025). Large Language Models Hallucination: A Comprehensive Survey. https://doi.org/10.48550/arxiv.2510.06265
- Ali, S., Shahab, O., Shabeeb, R., Ladak, F., Yang, J., Nadkarni, G., … & Kurdi, B. (2023). General purpose large language models match human performance on gastroenterology board exam self-assessments. https://doi.org/10.1101/2023.09.21.23295918
- Anh-Hoang, D., Tran, V., & Nguyen, L. (2025). Survey and analysis of hallucinations in large language models: attribution to prompting strategies or model behavior. Frontiers in Artificial Intelligence, 8. https://doi.org/10.3389/frai.2025.1622292
- Artsi, Y., Sorin, V., Glicksberg, B., Korfiatis, P., Freeman, R., Nadkarni, G., … & Klang, E. (2025). Challenges of Implementing LLMs in Clinical Practice: Perspectives. Journal of Clinical Medicine, 14(17), 6169. https://doi.org/10.3390/jcm14176169
- Carìa, A. (2025). Towards Predictive Communication: The Fusion of Large Language Models and Brain–Computer Interface. Sensors, 25(13), 3987. https://doi.org/10.3390/s25133987
- Çelebi, Y. (2025). PARROT: Persuasion and Agreement Robustness Rating of Output Truth — A Sycophancy Robustness Benchmark for LLMs. https://doi.org/10.48550/arxiv.2511.17220
- Dhaimade, P. (2025). Multidimensional Evaluation of Large Language Models on the AAP In-Service Examination: Assessing Accuracy, Calibration, and Citation Reliability. https://doi.org/10.1101/2025.10.14.25338040
- Fernandez, C., Felipe, L., Shotande, M., Zitu, M., Delgado, E., Rasool, G., … & Valdés, G. (2025). Learning the Phenotype of Medical Hallucinations. https://doi.org/10.21203/rs.3.rs-7475667/v1
- Gao, Y., Myers, S., Chen, S., Dligach, D., Miller, T., Bitterman, D., … & Afshar, M. (2024). Position Paper On Diagnostic Uncertainty Estimation from Large Language Models: Next-Word Probability Is Not Pre-test Probability. https://doi.org/10.1101/2024.11.06.24316848
- Hasan, M. (2025). CLIN-LLM: A Safety-Constrained Hybrid Framework for Clinical Diagnosis and Treatment Generation. https://doi.org/10.48550/arxiv.2510.22609
- Herrera-Poyatos, D., Peláez-González, C., Zuheros, C., Herrera-Poyatos, A., Tejedor, V., Herrera, F., … & Montes, R. (2025). An overview of model uncertainty and variability in LLM-based sentiment analysis. Frontiers in Artificial Intelligence, 8. https://doi.org/10.3389/frai.2025.1609097
- Jung, K. (2025). Large Language Models in Medicine: Clinical Applications, Technical Challenges, and Ethical Considerations. Healthcare Informatics Research, 31(2), 114–124. https://doi.org/10.4258/hir.2025.31.2.114
- Kim, Y., Jeong, H., Chen, S., Li, S., Lu, M., Alhamoud, K., … & Breazeal, C. (2025). Medical Hallucination in Foundation Models and Their Impact on Healthcare. https://doi.org/10.1101/2025.02.28.25323115
- Madrid, J., Diehl, P., Selig, M., Rolauffs, B., Hans, F., Busch, H., … & Benning, L. (2024). Assessing the Performance of Plugin-Integrated ChatGPT-4 in the German Medical Board Examination. https://doi.org/10.2196/preprints.58375
- Mohsin, M. (2025). On the Fundamental Limits of LLMs at Scale. https://doi.org/10.48550/arxiv.2511.12869
- Moll, J. (2025). Evaluating Reasoning Faithfulness in Medical Vision-Language Models using Multimodal Perturbations. https://doi.org/10.48550/arxiv.2510.11196
- Neves, B. & Silva, M. (2025). From Clinical Judgment to Large Language Models: Benchmarking Predictive Approaches for Unplanned Hospital Admissions. https://doi.org/10.1101/2025.09.09.25335411
- Nguyen, V. (2025). Hallucinate Less by Thinking More: Aspect-Based Causal Abstention for Large Language Models. https://doi.org/10.48550/arxiv.2511.17170
- Oliveira, R., Garber, M., Gwinnutt, J., Rashidi, E., Hwang, J., Gilmour, W., … & Mack, C. (2025). A Study of Calibration as a Measurement of Trustworthiness of Large Language Models in Biomedical Research. https://doi.org/10.1101/2025.02.11.637373
- Omar, M., Sorin, V., Collins, J., Reich, D., Freeman, R., Gavin, N., … & Klang, E. (2025). Large Language Models Are Highly Vulnerable to Adversarial Hallucination Attacks in Clinical Decision Support. https://doi.org/10.1101/2025.03.18.25324184
- Presacan, O. (2025). When silence is safer: a review of LLM abstention in healthcare. https://doi.org/10.21203/rs.3.rs-8148261/v1
- Quelle, D. & Bovet, A. (2024). The perils and promises of fact-checking with large language models. Frontiers in Artificial Intelligence, 7. https://doi.org/10.3389/frai.2024.1341697
- Savage, T., Wang, J., Gallo, R., Boukil, A., Patel, V., Safavi-Naini, S., … & Chen, J. (2024). Large Language Model Uncertainty Measurement and Calibration for Medical Diagnosis and Treatment. https://doi.org/10.1101/2024.06.06.24308399
- Si, C., Gan, Z., Yang, Z., Wang, S., Wang, J., Boyd-Graber, J., … & Wang, L. (2022). Prompting GPT-3 To Be Reliable. https://doi.org/10.48550/arxiv.2210.09150
- Steyvers, M., Belém, C., & Smyth, P. (2025). Improving Metacognition and Uncertainty Communication in Language Models. https://doi.org/10.48550/arxiv.2510.05126
- Tian, K., Mitchell, E., Zhou, A., Sharma, A., Rafailov, R., Yao, H., … & Manning, C. (2023). Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback. EMNLP 2023, 5433–5442. https://doi.org/10.18653/v1/2023.emnlp-main.330
- Tinajero, C. (2025). The Pediatric Surgeon’s AI Toolbox: How Large Language Models Like ChatGPT Are Simplifying Practice and Expanding Global Access. European Journal of Pediatric Surgery. https://doi.org/10.1055/a-2722-3871
- Yao, J., Aggarwal, M., Lopez, R., & Namdari, S. (2024). Large Language Models in Orthopaedics. Journal of Bone and Joint Surgery, 106(15), 1411–1418. https://doi.org/10.2106/jbjs.23.01417