Why Large Language Models Tend to Hallucinate on Certain Questions

A deep-dive into the computational, probabilistic, and data-driven roots of AI hallucination — and what the evidence from GPT models tells us about building safer, more reliable systems.

Based on 30+ peer-reviewed studies · Dr. Ananjan Maiti · 12 min read · Healthcare AI · NLP · LLMs
Research Snapshot — Key Numbers at a Glance

82% · Peak hallucination rate under adversarial prompts
23% · GPT-4o rate after prompt mitigation (down from 53%)
94% · Epistemic collapse rate in smallest tested model
0.3% · Hallucination rate after hallucination-detection system

1. Defining Hallucination in LLMs

Hallucination in large language models refers to outputs that are fluent and syntactically correct but factually inaccurate or unsupported by external evidence. In medical contexts, it has been defined as “any instance in which a model generates misleading medical content.”

What makes this particularly dangerous is that the language used in these outputs is, by nature, confident — it does not reflect uncertainty or controversy. Users often cannot distinguish between accurate and fabricated information.

Key insight: Hallucination is not just an engineering flaw to be patched. Computational theory proves it is mathematically inevitable for any finite model trained on real-world data.

2. Four Root Causes

Hallucination arises from four interconnected sources that make it an intrinsic property of LLMs:

Cause 1 — Computability limits: Diagonalization guarantees inputs on which any given model must fail, and undecidable queries induce infinite failure sets for all computable predictors.
Cause 2 — Statistical constraints: Finite description length enforces compression error, and long-tail factual knowledge requires prohibitive sample complexity (see the sketch after this list).
Cause 3 — Data-induced failure: Incomplete coverage, noise, temporal decay, and conflicting information in training corpora all produce hallucinations. The teacher-forcing learning strategy compounds this.
Cause 4 — Evaluation misalignment: Benchmarks that reward confident fabrication over calibrated uncertainty create incentives to hallucinate more, not less.
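
To give the statistical constraint in Cause 2 a concrete shape, the display below is a simplified, Good-Turing style form of the kind of lower bound derived in the hallucination-inevitability literature. The symbol s, the constants, and the error terms are illustrative assumptions for exposition, not a result quoted from a specific study in the bibliography.

```latex
% Illustrative sketch only: if a fraction s of facts occurs exactly once in the
% training corpus (the long-tail "singleton" slice), a generator calibrated to
% that corpus has little choice but to guess on those facts, so its factual
% error rate is bounded below by roughly s minus model/estimation error terms.
\[
  \Pr[\text{hallucination}] \;\gtrsim\;
  \underbrace{s}_{\substack{\text{fraction of facts seen}\\ \text{exactly once in training}}}
  \;-\; \varepsilon_{\text{model}} \;-\; \varepsilon_{\text{estimation}}
\]
```

Read this way, scaling up data shrinks the singleton slice but never removes it, which is one intuition for why more training data helps without closing the gap.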

Hallucination Rates by Model & Scenario

GPT-4 (adversarial scenario): 80%
All models (baseline, adversarial): 66%
GPT-4o (before mitigation): 53%
All models (after prompt mitigation): 44%
GPT-4o (after prompt mitigation): 23%
Llama-3-70B (with detection system): 0.3%

3. Task-Specific Patterns

Hallucination rates vary enormously depending on what the model is asked to do. In medical settings, accuracy is highest in fact-dense domains:

Biochemistry ✓ · Physiology ✓ · Microbiology ✓ · Pharmacology ~ · Diagnosis ✗ · Therapy ✗ (✓ high accuracy · ~ intermediate · ✗ frequent hallucination)

Integrative domains requiring complex multi-step reasoning show the highest hallucination rates. Similarly, short clinical vignettes produce more hallucinations than longer, more detailed presentations — contextual richness helps anchor the model.

Prevalence effect: LLM performance deteriorates significantly when diagnosing lower-prevalence conditions. Conditions that are rare in the training data translate directly into higher hallucination risk at inference time.

4. Overconfidence & Calibration

GPT models frequently exhibit a systematic overconfidence problem: verbalized confidence is consistently higher than actual accuracy. GPT-3.5 significantly overestimated its true response accuracy at the upper end of its self-reported confidence range.

GPT-4 shows improved calibration — demonstrating a statistically significant confidence-performance correlation (r=0.212, p<0.001), with accuracy ranging from 67% in low-confidence predictions to 87% in high-confidence predictions. Interestingly, GPT-4 exhibited systematic underconfidence in medical contexts, while physicians showed the opposite: poor calibration and systematic overconfidence.

Calibration note: For RLHF-trained models (ChatGPT, GPT-4, Claude), verbalized confidence as output tokens is typically better calibrated than raw conditional probabilities — reducing expected calibration error by ~50%.
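
Since the note above quantifies the gap in terms of expected calibration error (ECE), here is a minimal sketch of how ECE is commonly computed: bin predictions by confidence, then take the sample-weighted average gap between mean confidence and accuracy in each bin. The ten-bin scheme and all numbers below are toy values for illustration, not data from the studies cited in this article.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |accuracy - mean confidence| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by fraction of samples in the bin
    return ece

# Toy comparison: verbalized confidences vs. raw sequence probabilities
verbalized = [0.9, 0.85, 0.4, 0.95, 0.3, 0.8]    # model's stated confidence
raw_probs  = [0.99, 0.97, 0.93, 0.99, 0.9, 0.96]  # raw conditional probabilities
correct    = [1, 1, 0, 1, 0, 1]                   # whether each answer was right

print(f"ECE (verbalized): {expected_calibration_error(verbalized, correct):.3f}")
print(f"ECE (raw probs):  {expected_calibration_error(raw_probs, correct):.3f}")
```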

5. Mitigation Strategies

No single strategy eliminates hallucination entirely. The current best practices operate as layered safeguards:

Prompt engineering reduced adversarial hallucinations from 66% to 44% across all tested models. Chain-of-thought (CoT) prompting encourages explicit reasoning, improving accuracy on complex tasks — but does not always improve verbalized calibration.
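
As a rough illustration of what prompt-level mitigation looks like in practice, the snippet below assembles a chat message list around a system prompt that asks the model to ground its answers, flag uncertainty, and refuse unverifiable premises. The wording and the helper function are illustrative assumptions, not the prompts used in the cited evaluations; the resulting list can be passed to any chat-completion style API.

```python
# Illustrative mitigation prompt (not the exact wording from any cited study).
MITIGATION_SYSTEM_PROMPT = """\
You are a careful clinical assistant.
- Answer only from information you can support; do not invent drugs, doses, or citations.
- If you are uncertain, say so explicitly and state what additional information you need.
- If the question contains a premise you cannot verify (e.g. a named study or device),
  say that you cannot confirm it rather than elaborating on it.
"""

def build_messages(user_question: str) -> list[dict]:
    """Assemble a chat-completion message list with the mitigation system prompt."""
    return [
        {"role": "system", "content": MITIGATION_SYSTEM_PROMPT},
        {"role": "user", "content": user_question},
    ]

messages = build_messages("What is the recommended dose of the fictional drug Xanovir?")
```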

Retrieval Augmented Generation (RAG) grounds responses in external knowledge sources, but performance depends heavily on retrieval quality. Retrieval fragility remains a fundamental limitation.
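
The sketch below shows the RAG pattern in miniature, with naive keyword overlap standing in for a real dense retriever and three placeholder passages standing in for a curated knowledge base. Production systems use embedding search, re-ranking, and source citation, but the grounding step looks essentially like this.

```python
# Minimal RAG sketch: keyword overlap stands in for a real dense retriever.
CORPUS = [
    "Metformin is a first-line oral therapy for type 2 diabetes.",
    "Warfarin dosing is adjusted to a target INR of 2.0 to 3.0 for most indications.",
    "Semantic entropy flags answers whose meaning varies across samples.",
]

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank passages by how many query words they share (toy relevance score)."""
    q_words = set(query.lower().split())
    scored = sorted(corpus, key=lambda doc: len(q_words & set(doc.lower().split())), reverse=True)
    return scored[:k]

def build_grounded_prompt(query: str) -> str:
    """Assemble a prompt that restricts the model to the retrieved context."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, CORPUS))
    return (
        "Answer using ONLY the context below. If the context is insufficient, say so.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

print(build_grounded_prompt("What INR target is used for warfarin?"))
```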

Semantic entropy samples multiple answers, groups them by meaning, and flags high-disagreement responses as likely hallucinations — a model-agnostic approach showing genuine promise.
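
A schematic version of that idea is sketched below. The hard step, deciding when two sampled answers mean the same thing, is reduced here to lowercased exact matching; published implementations cluster with a bidirectional entailment model instead, and the flagging threshold is a made-up value that would be tuned per task.

```python
from collections import Counter
from math import log

def semantic_entropy(sampled_answers: list[str]) -> float:
    """Entropy over meaning-clusters of sampled answers.

    'Same meaning' is approximated by lowercased exact match here; real systems
    cluster answers with a bidirectional entailment (NLI) model instead.
    """
    clusters = Counter(ans.strip().lower() for ans in sampled_answers)
    n = sum(clusters.values())
    return -sum((c / n) * log(c / n) for c in clusters.values())

# Consistent samples -> low entropy; contradictory samples -> high entropy.
print(semantic_entropy(["Paris", "paris", "Paris"]))     # 0.0
print(semantic_entropy(["Paris", "Lyon", "Marseille"]))  # ~1.10

FLAG_THRESHOLD = 0.7  # illustrative cutoff, tuned per task in practice
answers = ["10 mg daily", "20 mg daily", "10 mg daily", "40 mg weekly"]
if semantic_entropy(answers) > FLAG_THRESHOLD:
    print("High disagreement across samples: treat this answer as a likely hallucination.")
```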

Supervised fine-tuning on calibration datasets improves both alignment of stated confidence with accuracy and discrimination between correct and incorrect responses — though gains are task-specific.

Best demonstrated result: A hallucination detection system for oncology QA reduced Llama-3-70B hallucinations from a clinically untenable 31% to just 0.3% — demonstrating that targeted systems can bring rates to acceptable levels.

6. Conclusion

Hallucination in LLMs is a multifaceted phenomenon arising from fundamental computational limits, probabilistic generation, data-induced factors, and evaluation misalignment. It is not a bug to be fixed in the next version — it is, in part, mathematically inevitable.

The goal should not be the elimination of hallucination, but rather risk reduction through layered safeguards that bring hallucination rates to clinically and practically acceptable levels. GPT-4 and newer models show improved calibration, but systematic overconfidence persists.

As LLMs become increasingly integrated into critical applications, the imperative is clear: robust detection mechanisms, appropriate abstention behaviors, and mandatory human oversight are not optional enhancements — they are ethical requirements.

Patient safety note: 85% of clinicians in a multinational survey reported cross-referencing LLM outputs with external sources as their primary hallucination mitigation strategy. Human oversight remains the most widely used safeguard.
Bibliography

  1. Alansari, A. (2025). Large Language Models Hallucination: A Comprehensive Survey. https://doi.org/10.48550/arxiv.2510.06265
  2. Ali, S., Shahab, O., Shabeeb, R., Ladak, F., Yang, J., Nadkarni, G., … & Kurdi, B. (2023). General purpose large language models match human performance on gastroenterology board exam self-assessments. https://doi.org/10.1101/2023.09.21.23295918
  3. Anh-Hoang, D., Tran, V., & Nguyen, L. (2025). Survey and analysis of hallucinations in large language models: attribution to prompting strategies or model behavior. Frontiers in Artificial Intelligence, 8. https://doi.org/10.3389/frai.2025.1622292
  4. Artsi, Y., Sorin, V., Glicksberg, B., Korfiatis, P., Freeman, R., Nadkarni, G., … & Klang, E. (2025). Challenges of Implementing LLMs in Clinical Practice: Perspectives. Journal of Clinical Medicine, 14(17), 6169. https://doi.org/10.3390/jcm14176169
  5. Carìa, A. (2025). Towards Predictive Communication: The Fusion of Large Language Models and Brain–Computer Interface. Sensors, 25(13), 3987. https://doi.org/10.3390/s25133987
  6. Çelebi, Y. (2025). PARROT: Persuasion and Agreement Robustness Rating of Output Truth — A Sycophancy Robustness Benchmark for LLMs. https://doi.org/10.48550/arxiv.2511.17220
  7. Dhaimade, P. (2025). Multidimensional Evaluation of Large Language Models on the AAP In-Service Examination: Assessing Accuracy, Calibration, and Citation Reliability. https://doi.org/10.1101/2025.10.14.25338040
  8. Fernandez, C., Felipe, L., Shotande, M., Zitu, M., Delgado, E., Rasool, G., … & Valdés, G. (2025). Learning the Phenotype of Medical Hallucinations. https://doi.org/10.21203/rs.3.rs-7475667/v1
  9. Gao, Y., Myers, S., Chen, S., Dligach, D., Miller, T., Bitterman, D., … & Afshar, M. (2024). Position Paper On Diagnostic Uncertainty Estimation from Large Language Models: Next-Word Probability Is Not Pre-test Probability. https://doi.org/10.1101/2024.11.06.24316848
  10. Hasan, M. (2025). CLIN-LLM: A Safety-Constrained Hybrid Framework for Clinical Diagnosis and Treatment Generation. https://doi.org/10.48550/arxiv.2510.22609
  11. Herrera-Poyatos, D., Peláez-González, C., Zuheros, C., Herrera-Poyatos, A., Tejedor, V., Herrera, F., … & Montes, R. (2025). An overview of model uncertainty and variability in LLM-based sentiment analysis. Frontiers in Artificial Intelligence, 8. https://doi.org/10.3389/frai.2025.1609097
  12. Jung, K. (2025). Large Language Models in Medicine: Clinical Applications, Technical Challenges, and Ethical Considerations. Healthcare Informatics Research, 31(2), 114–124. https://doi.org/10.4258/hir.2025.31.2.114
  13. Kim, Y., Jeong, H., Chen, S., Li, S., Lu, M., Alhamoud, K., … & Breazeal, C. (2025). Medical Hallucination in Foundation Models and Their Impact on Healthcare. https://doi.org/10.1101/2025.02.28.25323115
  14. Madrid, J., Diehl, P., Selig, M., Rolauffs, B., Hans, F., Busch, H., … & Benning, L. (2024). Assessing the Performance of Plugin-Integrated ChatGPT-4 in the German Medical Board Examination. https://doi.org/10.2196/preprints.58375
  15. Mohsin, M. (2025). On the Fundamental Limits of LLMs at Scale. https://doi.org/10.48550/arxiv.2511.12869
  16. Moll, J. (2025). Evaluating Reasoning Faithfulness in Medical Vision-Language Models using Multimodal Perturbations. https://doi.org/10.48550/arxiv.2510.11196
  17. Neves, B. & Silva, M. (2025). From Clinical Judgment to Large Language Models: Benchmarking Predictive Approaches for Unplanned Hospital Admissions. https://doi.org/10.1101/2025.09.09.25335411
  18. Nguyen, V. (2025). Hallucinate Less by Thinking More: Aspect-Based Causal Abstention for Large Language Models. https://doi.org/10.48550/arxiv.2511.17170
  19. Oliveira, R., Garber, M., Gwinnutt, J., Rashidi, E., Hwang, J., Gilmour, W., … & Mack, C. (2025). A Study of Calibration as a Measurement of Trustworthiness of Large Language Models in Biomedical Research. https://doi.org/10.1101/2025.02.11.637373
  20. Omar, M., Sorin, V., Collins, J., Reich, D., Freeman, R., Gavin, N., … & Klang, E. (2025). Large Language Models Are Highly Vulnerable to Adversarial Hallucination Attacks in Clinical Decision Support. https://doi.org/10.1101/2025.03.18.25324184
  21. Presacan, O. (2025). When silence is safer: a review of LLM abstention in healthcare. https://doi.org/10.21203/rs.3.rs-8148261/v1
  22. Quelle, D. & Bovet, A. (2024). The perils and promises of fact-checking with large language models. Frontiers in Artificial Intelligence, 7. https://doi.org/10.3389/frai.2024.1341697
  23. Savage, T., Wang, J., Gallo, R., Boukil, A., Patel, V., Safavi-Naini, S., … & Chen, J. (2024). Large Language Model Uncertainty Measurement and Calibration for Medical Diagnosis and Treatment. https://doi.org/10.1101/2024.06.06.24308399
  24. Si, C., Gan, Z., Yang, Z., Wang, S., Wang, J., Boyd-Graber, J., … & Wang, L. (2022). Prompting GPT-3 To Be Reliable. https://doi.org/10.48550/arxiv.2210.09150
  25. Steyvers, M., Belém, C., & Smyth, P. (2025). Improving Metacognition and Uncertainty Communication in Language Models. https://doi.org/10.48550/arxiv.2510.05126
  26. Tian, K., Mitchell, E., Zhou, A., Sharma, A., Rafailov, R., Yao, H., … & Manning, C. (2023). Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback. EMNLP 2023, 5433–5442. https://doi.org/10.18653/v1/2023.emnlp-main.330
  27. Tinajero, C. (2025). The Pediatric Surgeon’s AI Toolbox: How Large Language Models Like ChatGPT Are Simplifying Practice and Expanding Global Access. European Journal of Pediatric Surgery. https://doi.org/10.1055/a-2722-3871
  28. Yao, J., Aggarwal, M., Lopez, R., & Namdari, S. (2024). Large Language Models in Orthopaedics. Journal of Bone and Joint Surgery, 106(15), 1411–1418. https://doi.org/10.2106/jbjs.23.01417
Article based on 30+ peer-reviewed studies, 2022–2025.
