Confidence Calibration in Large Language Models

11d ago · Global · primary source: export.arxiv.org

Large language models (LLMs) are consistently overconfident in their predictions, particularly on difficult tasks, according to a new study on confidence calibration [1]. The research, which introduces a new evaluation called LifeEval, found that a model's confidence typically exceeds its actual accuracy [1][2]. This tendency is moderated by a 'hard-easy effect': overconfidence is most pronounced on challenging tests, while easier tests can produce significant underconfidence [1][2].

Miscalibration is a critical barrier to deploying LLMs in high-stakes domains such as healthcare and law, where reliable uncertainty quantification is essential for trust and safety [6]. Accurate confidence calibration enables selective prediction, in which low-confidence outputs are flagged or deferred to mitigate risk [6].

Researchers are actively developing methods to address the issue. One approach uses a cascade system in which a smaller, cheaper model handles the questions it is confident about and defers harder ones to a more capable but more expensive LLM, reducing costs by over 16% with minimal accuracy loss [3]. Another trains models with natural language critiques of their own confidence, a method reported to outperform even advanced teacher models such as GPT-4o on complex reasoning tasks [4]. In specialized applications such as entity matching, models based on architectures like RoBERTa also exhibit overconfidence, which can be reduced by up to 23.83% using established calibration techniques such as Temperature Scaling [5].

The broader challenge stems from sources of uncertainty specific to LLMs, including input ambiguity and divergent reasoning paths, which extend beyond classical statistical uncertainty [6]. As these models become more foundational to technology, improving the reliability of their self-assessed confidence remains a key research frontier [7].
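To make the ideas in this summary concrete, here is a minimal Python sketch of four of them: the gap between confidence and accuracy, a binned calibration error, Temperature Scaling, and confidence-based deferral in a cascade. It is an illustration under stated assumptions, not the method of any cited paper. All function names are hypothetical, the grid search is a simplification of how a temperature is usually fitted, and the fixed threshold is an assumed deferral rule, since the summary does not say how the cascade in [3] decides when to defer.

```python
import math


def calibration_gap(confidences, correct):
    """Mean confidence minus accuracy; a positive value indicates overconfidence."""
    mean_conf = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_conf - accuracy


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned calibration error: size-weighted mean of |accuracy - confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, is_correct))
    total = len(confidences)
    error = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        error += (len(members) / total) * abs(accuracy - avg_conf)
    return error


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature; T > 1 flattens the output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def fit_temperature(logit_rows, labels, candidates=None):
    """Pick the temperature with the lowest negative log-likelihood on held-out data.

    Assumption: a coarse grid search stands in for the gradient-based
    optimisation that is normally used to fit the temperature.
    """
    candidates = candidates or [0.5 + 0.1 * i for i in range(46)]  # about 0.5 to 5.0

    def nll(temperature):
        return -sum(
            math.log(softmax(row, temperature)[label])
            for row, label in zip(logit_rows, labels)
        )

    return min(candidates, key=nll)


def cascade_answer(question, small_model, large_model, threshold=0.8):
    """Use the small model when it is confident enough, otherwise defer.

    Assumption: each model is a callable returning (answer, confidence).
    """
    answer, confidence = small_model(question)
    if confidence >= threshold:
        return answer, "small"
    return large_model(question)[0], "large"
```

A model that reports 0.9 confidence on every item but is right only half the time has a calibration gap of about 0.4. Fitting a temperature above 1 on such data softens its probabilities without changing which answer it picks.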

research-paper

Context we found (8)

arxiv.org — https://arxiv.org/abs/2605.23909v1
We investigate the calibration of large language models' (LLMs') confidence across diverse tasks. The results of our preregistered study show that the current crop of LLMs are, like people, too sure they are right: confidence exceeds accuracy, on average. Importantly, however, th…
arxiv.org — https://arxiv.org/abs/2603.03752v1
Large language models (LLMs) demonstrate superior reasoning capabilities compared to small language models (SLMs), but incur substantially higher costs. We propose COllaborative REAsoner (COREA), a system that cascades an SLM with an LLM to achieve a balance between accuracy and …
arxiv.org — https://arxiv.org/abs/2510.24505v1
Accurate confidence calibration in Large Language Models (LLMs) is critical for safe use in high-stakes domains, where clear verbalized confidence enhances user trust. Traditional methods that mimic reference confidence expressions often fail to capture the reasoning needed for a…
arxiv.org — https://arxiv.org/abs/2509.19557v2
This research aims to explore the intersection of Large Language Models and confidence calibration in Entity Matching. To this end, we perform an empirical study to compare baseline RoBERTa confidences for an Entity Matching task against confidences that are calibrated using Temp…
arxiv.org — https://arxiv.org/abs/2503.15850v2
Large Language Models (LLMs) excel in text generation, reasoning, and decision-making, enabling their adoption in high-stakes domains such as healthcare, law, and transportation. However, their reliability is a major concern, as they often produce plausible but incorrect response…
en.wikipedia.org — https://en.wikipedia.org/wiki/Large_language_model
A large language model (LLM) is a neural network trained on a vast amount of text for natural language processing tasks, especially language generation. LLMs can generate, summarize, translate and parse text in many contexts, and are a foundational technology behind modern chatbo…
en.wikipedia.org — https://en.wikipedia.org/wiki/Calibration
In measurement technology and metrology, calibration is the comparison of measurement values delivered by a device under test with those of a calibration standard of known accuracy. Such a standard could be another measurement device of known accuracy, a device generating the qua…
en.wikipedia.org — https://en.wikipedia.org/wiki/Standard_wind_tunnel_models
Standard wind tunnel models, also known as reference models, calibration models (French: maquettes d'étalonnage) or test check-standards are objects of relatively simple and precisely defined shapes, having known aerodynamic characteristics, that are tested in wind tunnels. Stan…

Sources

export.arxiv.org — Confidence Calibration in Large Language Models
