IRDS: Interpretable RLVR Data Selection via Verifier-Coupled Sparse Autoencoder Coverage

14d ago · Global · primary source: export.arxiv.org

Multi-source synthesis by The Embedding Report from 2 sources. Every numeric and quoted claim traces to a cited source body (see methodology).

Two arXiv papers propose data-selection methods for reinforcement learning with verifiable rewards (RLVR): SHIFT, which is described as requiring no training and no access to labels or rewards, and IRDS, which couples selection to verifier outcomes.

SHIFT is a one-shot, training-free selector that uses inference-time hidden-state dynamics to identify useful instances. It computes a reasoning-induced representation shift (RIRS) for each candidate instance and enforces coverage with a quality-weighted farthest-first CoreSet procedure^[1]. IRDS instead selects RLVR training instances on a sparse autoencoder (SAE) cluster basis and uses a verifier-coupled coverage objective to choose instances that the model fails on but can still learn from^[2]. Both methods aim to make RLVR training more efficient and more accurate.

According to the paper, SHIFT consistently outperforms training-free diversity and difficulty/uncertainty baselines across mathematical reasoning and medical QA benchmarks under ultra-low budgets. IRDS achieves the highest overall accuracy in experiments on three instruction-tuned models and six math reasoning benchmarks, improving accuracy by 3.9 percentage points^[2] on the Qwen models and 0.5 percentage points on Llama-3.1-8B^[2]. Both papers were submitted on 27 May 2026^[1]^[2].
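The quality-weighted farthest-first selection attributed to SHIFT can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not say how RIRS vectors or quality scores are computed, so the sketch takes them as given inputs. The function name `quality_weighted_farthest_first`, the Euclidean distance, the multiplicative weighting (quality times distance to the nearest selected point), and seeding with the highest-quality candidate are all assumptions made for illustration.

```python
import math


def quality_weighted_farthest_first(shifts, quality, budget):
    """Greedy farthest-first selection with an assumed quality weighting.

    shifts:  list of per-candidate vectors (stand-ins for RIRS; how the
             paper computes them is not specified in the summary).
    quality: list of non-negative per-candidate scores (hypothetical).
    budget:  number of candidates to select.
    Returns the selected indices in the order they were picked.
    """
    n = len(shifts)
    budget = min(budget, n)
    if budget <= 0:
        return []

    # Assumption: seed with the highest-quality candidate.
    first = max(range(n), key=lambda i: quality[i])
    selected = [first]

    # Distance from every candidate to its nearest selected point.
    min_dist = [math.dist(s, shifts[first]) for s in shifts]

    while len(selected) < budget:
        chosen = set(selected)
        candidates = [i for i in range(n) if i not in chosen]
        # Assumed score: quality multiplied by distance to the selected set,
        # so far-away, high-quality candidates are preferred.
        nxt = max(candidates, key=lambda i: quality[i] * min_dist[i])
        selected.append(nxt)
        for i in range(n):
            min_dist[i] = min(min_dist[i], math.dist(shifts[i], shifts[nxt]))

    return selected


# Toy data: three loose groups of 2-D points with made-up quality scores.
shifts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.1), (10.0, 0.0)]
quality = [0.9, 0.8, 0.7, 0.95, 0.6]
print(quality_weighted_farthest_first(shifts, quality, 3))  # [3, 0, 4]
```

On this toy input the selector picks one point from each group, which is the coverage behaviour a farthest-first procedure is meant to provide. A different weighting or distance would change which points are chosen.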


Sources cited (2)

[1] arxiv.org
[2] arxiv.org
