Why Large Language Models Tend to Hallucinate on Certain Questions
A deep-dive into the computational, probabilistic, and data-driven roots of AI hallucination — and what the evidence from GPT models tells us about building safer, more reliable systems.
1. Defining Hallucination in LLMs
Hallucination in large language models refers to outputs that are fluent and syntactically correct but factually inaccurate or unsupported by external evidence. In medical contexts, it has been defined as “any instance in which a model generates misleading medical content.”
What makes this particularly dangerous is that these outputs read as confident: the language carries no signal of uncertainty or controversy, so users often cannot distinguish accurate from fabricated information.
2. Four Root Causes
Hallucination arises from four interconnected sources that make it an intrinsic property of LLMs:
- Fundamental computational limits: no finite model can answer every factual query correctly, so some error rate is mathematically unavoidable.
- Probabilistic generation: next-token sampling optimizes for plausible continuations, not verified truth.
- Data-induced factors: gaps, errors, and outdated or conflicting statements in the training corpus propagate into outputs.
- Evaluation misalignment: benchmarks that reward confident answers over calibrated abstention encourage guessing.
3. Task-Specific Patterns
Hallucination rates vary enormously depending on what the model is asked to do. In medical settings, accuracy is highest in fact-dense domains, where answers map directly onto well-established facts.
Integrative domains requiring complex multi-step reasoning show the highest hallucination rates. Similarly, short clinical vignettes produce more hallucinations than longer, more detailed presentations — contextual richness helps anchor the model.
4. Overconfidence & Calibration
GPT models frequently exhibit a systematic overconfidence problem: verbalized confidence is consistently higher than actual accuracy. GPT-3.5, for example, significantly overestimated its true response accuracy at the upper end of its self-reported confidence range.
GPT-4 shows improved calibration — demonstrating a statistically significant confidence-performance correlation (r=0.212, p<0.001), with accuracy ranging from 67% in low-confidence predictions to 87% in high-confidence predictions. Interestingly, GPT-4 exhibited systematic underconfidence in medical contexts, while physicians showed the opposite: poor calibration and systematic overconfidence.
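To make the calibration claim concrete, here is a minimal sketch of how such a confidence-accuracy check can be run on logged model outputs. The arrays are illustrative toy data, not figures from the cited studies.

```python
# Minimal calibration check: given verbalized confidence scores (0-100) and
# graded correctness (0/1) for a set of model answers, compute the
# confidence-accuracy correlation and per-bin accuracy.
import numpy as np
from scipy.stats import pearsonr

confidence = np.array([95, 90, 85, 80, 75, 70, 65, 60, 55, 50, 95, 40])
correct    = np.array([ 1,  1,  0,  1,  1,  0,  1,  0,  0,  1,  1,  0])

# Correlation between stated confidence and actual correctness
r, p = pearsonr(confidence, correct)
print(f"confidence-accuracy correlation: r={r:.3f}, p={p:.3f}")

# Accuracy within coarse confidence bins; a well-calibrated model's accuracy
# should track the bin it claims to be in
for lo, hi in [(0, 60), (60, 80), (80, 101)]:
    mask = (confidence >= lo) & (confidence < hi)
    if mask.any():
        print(f"confidence {lo}-{hi}: accuracy {correct[mask].mean():.2f} "
              f"(n={mask.sum()})")
```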
5. Mitigation Strategies
No single strategy eliminates hallucination entirely. The current best practices operate as layered safeguards:
Prompt engineering reduced adversarial hallucinations from 66% to 44% across all tested models. Chain-of-thought (CoT) prompting encourages explicit reasoning, improving accuracy on complex tasks — but does not always improve verbalized calibration.
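As an illustration of the kind of prompt involved, the sketch below combines chain-of-thought with an instruction to flag false premises. The wording and the fictitious drug name are assumptions, not the prompts used in the cited evaluations.

```python
# Generic chain-of-thought prompt template with an abstention instruction.
# The phrasing is illustrative only.
def build_cot_prompt(question: str) -> str:
    return (
        "Answer the question below. Think step by step, citing only facts "
        "you are certain of. If any premise in the question appears false "
        "or unverifiable, say so instead of answering.\n\n"
        f"Question: {question}\n"
        "Reasoning:"
    )

# Example with a fabricated drug name, the kind of adversarial premise that
# often triggers hallucinated answers
print(build_cot_prompt("What dose of the drug Zylocarbin is recommended "
                       "for pediatric migraine?"))
```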
Retrieval-Augmented Generation (RAG) grounds responses in external knowledge sources, but performance depends heavily on retrieval quality. Retrieval fragility remains a fundamental limitation.
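A stripped-down sketch of the RAG pattern: retrieve the passages most similar to the query, then constrain the model to answer only from them. The bag-of-words retriever and toy corpus here are stand-ins; production systems use dense embeddings, a vector store, and an actual generation call.

```python
# Toy RAG pipeline: rank passages by cosine similarity over word counts,
# then build a grounded prompt from the top-k hits.
from collections import Counter
import math

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    return sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

corpus = [
    "Metformin is a first-line therapy for type 2 diabetes.",
    "Chain-of-thought prompting encourages step-by-step reasoning.",
    "Semantic entropy clusters sampled answers by meaning.",
]
query = "What is the first-line therapy for type 2 diabetes?"
passages = retrieve(query, corpus)
prompt = ("Answer using ONLY the passages below; if they are insufficient, "
          "say 'I don't know'.\n\n" + "\n".join(passages) +
          f"\n\nQuestion: {query}")
print(prompt)  # this grounded prompt would then be sent to the model
```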
Semantic entropy samples multiple answers, groups them by meaning, and flags high-disagreement responses as likely hallucinations — a model-agnostic approach showing genuine promise.
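The sketch below shows the core of the idea under simplifying assumptions: semantic equivalence is approximated by string normalization rather than the natural-language-inference entailment check used in practice, and the sampled answers are invented.

```python
# Semantic entropy sketch: cluster sampled answers that "mean the same
# thing", compute the entropy of the cluster distribution, and treat
# high-entropy questions as likely hallucinations.
import math
from collections import Counter

def normalize(answer: str) -> str:
    # Toy stand-in for a semantic equivalence check
    return " ".join(answer.lower().strip(" .!").split())

def semantic_entropy(samples: list[str]) -> float:
    clusters = Counter(normalize(s) for s in samples)
    n = len(samples)
    probs = [c / n for c in clusters.values()]
    return sum(-p * math.log2(p) for p in probs)

consistent = ["Paris.", "paris", "Paris"]
scattered  = ["1921", "1923.", "1919", "1921"]
print(semantic_entropy(consistent))  # 0.0 bits: samples agree
print(semantic_entropy(scattered))   # 1.5 bits: flag as possible hallucination
```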
Supervised fine-tuning on calibration datasets improves both alignment of stated confidence with accuracy and discrimination between correct and incorrect responses — though gains are task-specific.
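As a rough illustration, a calibration fine-tuning record might pair an answer with a target confidence derived from measured accuracy on similar items, so the model learns to verbalize confidence that matches empirical correctness. The schema below is an assumption, not a published dataset format.

```python
# Hypothetical construction of one calibration fine-tuning example.
def make_calibration_example(question: str, answer: str,
                             empirical_accuracy: float) -> dict:
    # Target confidence is set from measured accuracy, not model self-report
    pct = round(empirical_accuracy * 100)
    return {
        "prompt": question,
        "completion": f"{answer} (Confidence: {pct}%)",
    }

example = make_calibration_example(
    "Which organism most commonly causes community-acquired pneumonia?",
    "Streptococcus pneumoniae",
    empirical_accuracy=0.9,
)
print(example)
```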
6. Conclusion
Hallucination in LLMs is a multifaceted phenomenon arising from fundamental computational limits, probabilistic generation, data-induced factors, and evaluation misalignment. It is not a bug to be fixed in the next version — it is, in part, mathematically inevitable.
The goal should not be the elimination of hallucination, but rather risk reduction through layered safeguards that bring hallucination rates to clinically and practically acceptable levels. GPT-4 and newer models show improved calibration, but systematic overconfidence persists.
As LLMs become increasingly integrated into critical applications, the imperative is clear: robust detection mechanisms, appropriate abstention behaviors, and mandatory human oversight are not optional enhancements — they are ethical requirements.
Full References
30 peer-reviewed sources cited in this article
- Alansari, A. (2025). Large Language Models Hallucination: A Comprehensive Survey. https://doi.org/10.48550/arxiv.2510.06265
- Ali, S., Shahab, O., Shabeeb, R., Ladak, F., Yang, J., Nadkarni, G., … & Kurdi, B. (2023). General purpose large language models match human performance on gastroenterology board exam self-assessments. https://doi.org/10.1101/2023.09.21.23295918
- Anh-Hoang, D., Tran, V., & Nguyen, L. (2025). Survey and analysis of hallucinations in large language models: attribution to prompting strategies or model behavior. Frontiers in Artificial Intelligence, 8. https://doi.org/10.3389/frai.2025.1622292
- Artsi, Y., Sorin, V., Glicksberg, B., Korfiatis, P., Freeman, R., Nadkarni, G., … & Klang, E. (2025). Challenges of Implementing LLMs in Clinical Practice: Perspectives. Journal of Clinical Medicine, 14(17), 6169. https://doi.org/10.3390/jcm14176169
- Carìa, A. (2025). Towards Predictive Communication: The Fusion of Large Language Models and Brain–Computer Interface. Sensors, 25(13), 3987. https://doi.org/10.3390/s25133987
- Çelebi, Y. (2025). PARROT: Persuasion and Agreement Robustness Rating of Output Truth — A Sycophancy Robustness Benchmark for LLMs. https://doi.org/10.48550/arxiv.2511.17220
- Dhaimade, P. (2025). Multidimensional Evaluation of Large Language Models on the AAP In-Service Examination: Assessing Accuracy, Calibration, and Citation Reliability. https://doi.org/10.1101/2025.10.14.25338040
- Fernandez, C., Felipe, L., Shotande, M., Zitu, M., Delgado, E., Rasool, G., … & Valdés, G. (2025). Learning the Phenotype of Medical Hallucinations. https://doi.org/10.21203/rs.3.rs-7475667/v1
- Gao, Y., Myers, S., Chen, S., Dligach, D., Miller, T., Bitterman, D., … & Afshar, M. (2024). Position Paper On Diagnostic Uncertainty Estimation from Large Language Models: Next-Word Probability Is Not Pre-test Probability. https://doi.org/10.1101/2024.11.06.24316848
- Hasan, M. (2025). CLIN-LLM: A Safety-Constrained Hybrid Framework for Clinical Diagnosis and Treatment Generation. https://doi.org/10.48550/arxiv.2510.22609
- Herrera-Poyatos, D., Peláez-González, C., Zuheros, C., Herrera-Poyatos, A., Tejedor, V., Herrera, F., … & Montes, R. (2025). An overview of model uncertainty and variability in LLM-based sentiment analysis. Frontiers in Artificial Intelligence, 8. https://doi.org/10.3389/frai.2025.1609097
- Jung, K. (2025). Large Language Models in Medicine: Clinical Applications, Technical Challenges, and Ethical Considerations. Healthcare Informatics Research, 31(2), 114–124. https://doi.org/10.4258/hir.2025.31.2.114
- Kim, Y., Jeong, H., Chen, S., Li, S., Lu, M., Alhamoud, K., … & Breazeal, C. (2025). Medical Hallucination in Foundation Models and Their Impact on Healthcare. https://doi.org/10.1101/2025.02.28.25323115
- Madrid, J., Diehl, P., Selig, M., Rolauffs, B., Hans, F., Busch, H., … & Benning, L. (2024). Assessing the Performance of Plugin-Integrated ChatGPT-4 in the German Medical Board Examination. https://doi.org/10.2196/preprints.58375
- Mohsin, M. (2025). On the Fundamental Limits of LLMs at Scale. https://doi.org/10.48550/arxiv.2511.12869
- Moll, J. (2025). Evaluating Reasoning Faithfulness in Medical Vision-Language Models using Multimodal Perturbations. https://doi.org/10.48550/arxiv.2510.11196
- Neves, B. & Silva, M. (2025). From Clinical Judgment to Large Language Models: Benchmarking Predictive Approaches for Unplanned Hospital Admissions. https://doi.org/10.1101/2025.09.09.25335411
- Nguyen, V. (2025). Hallucinate Less by Thinking More: Aspect-Based Causal Abstention for Large Language Models. https://doi.org/10.48550/arxiv.2511.17170
- Oliveira, R., Garber, M., Gwinnutt, J., Rashidi, E., Hwang, J., Gilmour, W., … & Mack, C. (2025). A Study of Calibration as a Measurement of Trustworthiness of Large Language Models in Biomedical Research. https://doi.org/10.1101/2025.02.11.637373
- Omar, M., Sorin, V., Collins, J., Reich, D., Freeman, R., Gavin, N., … & Klang, E. (2025). Large Language Models Are Highly Vulnerable to Adversarial Hallucination Attacks in Clinical Decision Support. https://doi.org/10.1101/2025.03.18.25324184
- Presacan, O. (2025). When silence is safer: a review of LLM abstention in healthcare. https://doi.org/10.21203/rs.3.rs-8148261/v1
- Quelle, D. & Bovet, A. (2024). The perils and promises of fact-checking with large language models. Frontiers in Artificial Intelligence, 7. https://doi.org/10.3389/frai.2024.1341697
- Savage, T., Wang, J., Gallo, R., Boukil, A., Patel, V., Safavi-Naini, S., … & Chen, J. (2024). Large Language Model Uncertainty Measurement and Calibration for Medical Diagnosis and Treatment. https://doi.org/10.1101/2024.06.06.24308399
- Si, C., Gan, Z., Yang, Z., Wang, S., Wang, J., Boyd-Graber, J., … & Wang, L. (2022). Prompting GPT-3 To Be Reliable. https://doi.org/10.48550/arxiv.2210.09150
- Steyvers, M., Belém, C., & Smyth, P. (2025). Improving Metacognition and Uncertainty Communication in Language Models. https://doi.org/10.48550/arxiv.2510.05126
- Tian, K., Mitchell, E., Zhou, A., Sharma, A., Rafailov, R., Yao, H., … & Manning, C. (2023). Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback. EMNLP 2023, 5433–5442. https://doi.org/10.18653/v1/2023.emnlp-main.330
- Tinajero, C. (2025). The Pediatric Surgeon’s AI Toolbox: How Large Language Models Like ChatGPT Are Simplifying Practice and Expanding Global Access. European Journal of Pediatric Surgery. https://doi.org/10.1055/a-2722-3871
- Yao, J., Aggarwal, M., Lopez, R., & Namdari, S. (2024). Large Language Models in Orthopaedics. Journal of Bone and Joint Surgery, 106(15), 1411–1418. https://doi.org/10.2106/jbjs.23.01417