Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization

15d ago · Global · primary source: export.arxiv.org

Multi-source synthesis by The Embedding Report from 2 sources. Every numeric and quoted claim traces to a cited source body (see methodology).

Researchers have proposed Reasoning-Conditioned Direct Preference Optimization (RC-DPO), a training method for mitigating hallucinations, meaning unsupported or incorrect statements, in multimodal large reasoning models.

Multimodal Large Reasoning Models still suffer from severe hallucinations, according to a study posted on arXiv[1]. Hallucinations are a significant issue in language models, particularly in healthcare applications, because they limit the reliability of these models[2].

Existing training-based methods typically mitigate hallucinations through response-level direct preference optimization (DPO), which treats the Chain-of-Thought (CoT) and the final answer as a single output. RC-DPO instead models the CoT as a condition for answer generation, which the authors say mitigates hallucinations and improves the reliability of the multimodal reasoning process[1]. By explicitly formulating a CoT-oriented preference term, the method pushes the model toward reasoning chains that support the answer. To build preference pairs, the researchers used Monte Carlo Tree Search to find visually grounded, logically consistent CoTs as positive samples, and attention-guided CoT token pruning to construct negative ones. The authors report that experiments across multiple models and benchmarks show RC-DPO is effective at mitigating hallucinations.
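To make the contrast between response-level DPO and a reasoning-conditioned objective concrete, here is a minimal Python sketch. Only `dpo_loss` follows a known formula, the standard DPO loss for one preference pair. The function `reasoning_conditioned_loss`, its `cot_weight` parameter, the dictionary layout, and all numbers are assumptions made for illustration; the summary above does not give the exact RC-DPO objective, so this should not be read as the authors' formulation.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(pol_w, ref_w, pol_l, ref_l, beta=0.1):
    """Standard DPO loss for one pair, given summed sequence log-probs.

    pol_* are policy log-probs, ref_* are frozen reference log-probs;
    w = preferred sample, l = dispreferred sample.
    """
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    return -log_sigmoid(beta * margin)


def response_level_loss(chosen, rejected, beta=0.1):
    """Response-level DPO: CoT and answer are scored as one output."""
    def total(sample, idx):
        return sample["cot"][idx] + sample["ans"][idx]
    return dpo_loss(total(chosen, 0), total(chosen, 1),
                    total(rejected, 0), total(rejected, 1), beta)


def reasoning_conditioned_loss(chosen, rejected, beta=0.1, cot_weight=1.0):
    """Hypothetical split objective (an assumption, not the paper's loss).

    The answer term uses log p(answer | input, CoT); a separate
    CoT-oriented term uses log p(CoT | input).
    """
    ans_term = dpo_loss(chosen["ans"][0], chosen["ans"][1],
                        rejected["ans"][0], rejected["ans"][1], beta)
    cot_term = dpo_loss(chosen["cot"][0], chosen["cot"][1],
                        rejected["cot"][0], rejected["cot"][1], beta)
    return ans_term + cot_weight * cot_term


# Toy (policy_logp, reference_logp) pairs, invented for illustration.
chosen = {"cot": (-12.0, -14.0), "ans": (-2.0, -2.5)}
rejected = {"cot": (-15.0, -13.0), "ans": (-3.0, -2.5)}

print("response-level:", response_level_loss(chosen, rejected))
print("reasoning-conditioned:", reasoning_conditioned_loss(chosen, rejected))
```

In the response-level variant, a long CoT can dominate the summed log-probability, so the answer contributes little to the preference signal. Splitting the terms, as sketched here, is one way to give the reasoning chain and the answer conditioned on it their own gradients.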

research-paper · safety-research

Background sources we checked (2)
  • arxiv.org ↗ Multimodal Large Reasoning Models introduce the reasoning paradigm, demonstrating strong capabilities on complex vision-language tasks. However, they still suffer from severe hallucinations. Existing training-based methods typically mitigate hallucinations through response-level …
  • en.wikipedia.org ↗ A large language model (LLM) is a neural network trained on a vast amount of text for natural language processing tasks, especially language generation. LLMs can generate, summarize, translate and parse text in many contexts, and are a foundational technology behind modern chatbo…

Sources cited (2)

  1. arxiv.org ↗ E
  2. arxiv.org ↗ E