DeepConf: Deep Think with Confidence

Published on 08/09/2025

Executive Summary

  • DeepConf performs test-time filtering of low‑quality reasoning traces using model‑internal confidence, improving accuracy and cost [1].
  • Combines token‑level and group (sliding‑window) confidence to estimate local reasoning reliability [1].
  • Supports offline and online modes; enables confidence‑weighted majority voting and early‑stop filtering [1].
  • On AIME 2025, DeepConf@512 attains up to 99.9% accuracy and reduces generated tokens by up to 84.7% versus standard parallel thinking at the same budget [1].

Glossary

Token confidence: token probability (from logprobs) used as a reliability proxy [1].

Group confidence: aggregated confidence over a sliding window of adjacent tokens [1].

Tail/lowest‑group confidence: statistics over the final (tail) tokens of a trace, or the minimum group confidence across the trace [1].

Top‑x% filter: keep only traces within the desired confidence quantile [1].

What DeepConf is and why it matters

DeepConf is a test‑time method that scores reasoning quality via internal confidence signals, discarding weak paths early and focusing budget on promising ones [1]. In multi‑trace settings (e.g., self‑consistency), this yields stronger decisions with fewer tokens [1].

How it works

Token and group confidence

Confidence is computed per token from model logprobs and aggregated over sliding windows to obtain more stable, local group confidence [1]. Statistics such as bottom‑10% groups, tail confidence, and lowest‑group confidence capture bottlenecks in the trace [1].
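As an illustration, the Python sketch below computes a per‑token confidence from the top‑k candidate logprobs and aggregates it over a sliding window; the exact formula and the window length are assumptions for this sketch, since the text above only summarizes the definitions.

```python
import numpy as np


def token_confidence(topk_logprobs: list[list[float]]) -> np.ndarray:
    """Per-token confidence from the top-k candidate logprobs at each position.
    Assumed form: negative mean log-probability of the top-k candidates, so a
    sharply peaked next-token distribution yields a high confidence value."""
    return np.array([-np.mean(lps) for lps in topk_logprobs])


def group_confidence(token_conf: np.ndarray, window: int = 2048) -> np.ndarray:
    """Sliding-window mean of token confidence (one value per window position)."""
    if len(token_conf) <= window:
        return np.array([token_conf.mean()])
    kernel = np.full(window, 1.0 / window)
    return np.convolve(token_conf, kernel, mode="valid")


def trace_score(token_conf: np.ndarray, window: int = 2048) -> float:
    """Trace-level statistic: the lowest group confidence, i.e. the weakest
    local stretch of the reasoning trace."""
    return float(group_confidence(token_conf, window).min())
```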

Offline vs online

Offline: generate multiple full traces, score them by confidence, and apply confidence‑weighted majority voting [1]. Online: during generation, apply sliding‑window confidence filtering and early‑stop weak traces to save tokens [1].
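A minimal offline‑mode sketch, assuming each trace carries a final answer and a trace‑level confidence (e.g. the lowest group confidence from the sketch above): keep the top‑x% of traces by confidence, then vote with confidence as the weight.

```python
from collections import defaultdict


def deepconf_offline_vote(traces: list[dict], keep_top: float = 0.10) -> str:
    """traces: list of dicts like {"answer": str, "confidence": float} (assumed format).
    Keeps roughly the top `keep_top` fraction of traces by confidence, then
    returns the answer with the largest confidence-weighted vote."""
    ranked = sorted(traces, key=lambda t: t["confidence"], reverse=True)
    n_keep = max(1, int(round(keep_top * len(ranked))))
    kept = ranked[:n_keep]

    votes: dict[str, float] = defaultdict(float)
    for t in kept:
        votes[t["answer"]] += t["confidence"]
    return max(votes, key=votes.get)
```

With keep_top=0.10 this corresponds to the "low" (top‑10%) setting discussed below; keep_top=0.90 corresponds to the "high" (top‑90%) setting.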

Operational choices

  • Weighted voting: candidate answers are voted on with each trace's estimated confidence as its weight [1].
  • Filtering: progressively drop traces below adaptive thresholds (e.g., quantiles) [1].
  • Consensus τ: stop when consensus across traces exceeds τ to avoid further generation [1].
Figure 1: Early‑stop via group confidence and consensus τ (N parallel traces → sliding‑window group confidence → top‑x% filter → stop when consensus ≥ τ) [1].
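As a compact sketch of the loop summarized in Figure 1 (helper names and the exact stopping rule are assumptions, not the paper's reference implementation): each trace is aborted once its latest group confidence falls below a calibrated threshold, and no further traces are launched once consensus reaches τ.

```python
def should_abort(group_confs: list[float], threshold: float) -> bool:
    """Online filter: abort a trace as soon as its most recent sliding-window
    (group) confidence falls below the calibrated threshold."""
    return bool(group_confs) and group_confs[-1] < threshold


def consensus_reached(answers: list[str], tau: float = 0.95) -> bool:
    """Stop launching new traces once the most frequent answer accounts for
    at least a fraction tau of the completed traces."""
    if not answers:
        return False
    top_count = max(answers.count(a) for a in set(answers))
    return top_count / len(answers) >= tau
```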

Key results

On AIME 2025, DeepConf@512 achieves up to 99.9% accuracy and up to −84.7% generated tokens relative to standard parallel thinking at equal budget [1]. Other evaluated tasks show similar trends of large token savings with controlled accuracy trade‑offs when increasing filter strength [1].

Comparison

Method | Budget (traces) | Tokens (×10^8) | Accuracy (%) | Notes
DeepConf‑low (top‑10%) | 512 | — | 99.9 | AIME 2025; ↓84.7% tokens vs standard [1]
DeepConf‑high (top‑90%) | 512 | — | ~99–100 | Higher coverage; smaller token savings [1]
Majority voting | 512 | — | ≤99.9 | No filtering; higher cost [1]

Minimal vLLM enablement

  • Logprobs: enable logprobs to derive per‑token confidence [1].
  • Sliding window: compute cumulative group confidence over window length L [1].
  • Early‑stop: threshold on quantile/minimum group value + consensus τ [1].
  • OpenAI‑compatible: extra args for window, quantile, and enable_logprobs (see the request sketch after this list) [1].
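For example, against a vLLM OpenAI‑compatible server the per‑token logprobs needed above can be requested through the standard logprobs fields; the window/quantile knobs passed via extra_body are illustrative names only, assuming a DeepConf‑enabled server rather than stock vLLM.

```python
from openai import OpenAI

# Point the OpenAI client at a local vLLM OpenAI-compatible endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="your-reasoning-model",   # placeholder model name
    messages=[{"role": "user", "content": "Solve the problem step by step: ..."}],
    logprobs=True,                  # standard field: return per-token logprobs
    top_logprobs=5,                 # top-k candidates used for token confidence
    extra_body={                    # illustrative DeepConf-style knobs (not built-in vLLM args)
        "window": 2048,
        "confidence_quantile": 0.9,
        "enable_logprobs": True,
    },
)

# Per-token logprob entries (each has .logprob and .top_logprobs)
token_info = resp.choices[0].logprobs.content
```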

Practical implications

  • “Low” (top‑10%) filter: maximizes token savings; ensure adequate consensus to avoid confidently wrong traces [1].
  • “High” (top‑90%) filter: keeps more traces; prefer when accuracy is paramount and budget is looser [1].
  • Risks: confidently wrong traces; use initial calibration and threshold warm‑up (see the sketch after this list) [1].
  • Consensus τ: set τ by number of traces and task variability [1].
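The initial calibration and threshold warm‑up mentioned above can be as simple as running a small batch of full traces first and deriving the online early‑stop threshold from a quantile of their confidences; this is a sketch under that assumption, not the paper's prescribed procedure.

```python
import numpy as np


def warmup_threshold(warmup_confidences: list[float], keep_top: float = 0.10) -> float:
    """Set the online early-stop threshold from a warm-up batch of full traces:
    the (1 - keep_top) quantile of their trace confidences, so that roughly the
    top `keep_top` fraction of comparable traces would survive filtering."""
    return float(np.quantile(warmup_confidences, 1.0 - keep_top))
```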

Limitations and future work

Logprob‑based confidence can be miscalibrated for some models/domains; future work includes calibration strategies and studying how optimal windowing and tail statistics generalize across tasks [1].

References

[1] Deep Think with Confidence (DeepConf), arXiv:2508.15260 (v1), 21 Aug 2025.
