Preprint

Why and When Visual Token Pruning Fails?

A Study on Relevant Visual Information Shift in MLLMs Decoding

Jiwan Kim1 Kibum Kim1 Wonjoong Kim1 Byung-Kwan Lee2 Chanyoung Park1*
1Korea Advanced Institute of Science and Technology (KAIST)    2NVIDIA
* Corresponding author
TL;DR: We identify Relevant Visual Information Shift (RVIS) as the primary reason visual token pruning fails on reasoning tasks, and propose DSTP, a training-free add-on that adaptively swaps visual tokens during decoding to restore reasoning performance with minimal overhead.

Abstract

DSTP: Adaptive Token Pruning for Visual Reasoning

Figure 8

Figure. DSTP at 33.3% token retention outperforms vanilla FastV at 66.6% with significantly less computational cost, demonstrating that providing the right tokens at the right time matters more than simply retaining more tokens.

Visual token pruning has recently been studied as a way to handle the vast number of visual tokens in Multimodal Large Language Models (MLLMs). However, we observe that while existing pruning methods perform reliably on simple visual understanding tasks, they struggle to generalize to complex visual reasoning, a critical gap underexplored in previous studies.

Through a systematic analysis, we identify Relevant Visual Information Shift (RVIS) during decoding as the primary failure driver. To address this, we propose Decoding-stage Shift-aware Token Pruning (DSTP), a training-free add-on framework that enables existing pruning methods to align visual tokens with shifting reasoning requirements during the decoding stage.

Extensive experiments demonstrate that DSTP significantly mitigates performance degradation of pruning methods in complex reasoning tasks, while consistently yielding performance gains even across visual understanding benchmarks.

Motivation: Why Does Pruning Fail at Visual Reasoning?

Pruning works for VQA, but fails for Visual Math Reasoning

Figure 1a

Figure 1(a). Performance retention rates of pruning methods on VQA vs. VMR.

We evaluate prominent pruning methods (FastV, DivPrune) on state-of-the-art MLLMs (Qwen3-VL, InternVL3.5) across VQA and Visual Math Reasoning (VMR) benchmarks. While pruning methods effectively preserve performance on VQA, they exhibit precipitous performance drops on VMR.

This consistent decline on VMR suggests that existing pruning methods fundamentally struggle to generalize to complex visual reasoning, despite their strong performance on simple visual understanding tasks.

Visual Focus Shift during Reasoning

Figure 1b

Figure 1(b)–(f). Attention heatmaps for a MathVerse sample on Qwen3-VL. The model's visual focus shifts drastically across decoding steps, reflecting the changing reasoning context.


By visualizing attention heatmaps at different decoding steps, we discover that the model's visual focus does not remain fixed on regions identified during the prefill stage. Instead, it dynamically transitions to entirely different visual areas to align with each successive reasoning step — a phenomenon we term Relevant Visual Information Shift (RVIS).

Diagnosing RVIS: A Hallmark of Visual Reasoning

We conduct a systematic analysis to characterize RVIS and establish it as the primary failure driver of existing pruning methods.

Finding 1 — RVIS Exists in Reasoning

Attention Stability Diverges between VQA and VMR

We track the cosine similarity between the visual attention at the prefill stage and each decoding step. VQA maintains high similarity throughout, while VMR exhibits sharp declines, revealing that the model frequently re-focuses on different visual regions during reasoning.

Figure 2

Figure 2. (a) Cosine similarity of visual attention between prefill and each decoding step. (b) Proportion of samples maintaining attention similarity above thresholds.
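This tracking can be sketched as a cosine-similarity trace against the prefill-stage attention, assuming per-token visual attention weights have been extracted from the model; the function name and toy values below are illustrative, not the paper's implementation:

```python
import numpy as np

def track_attention_stability(prefill_attn, step_attns):
    """Cosine similarity of each decoding step's visual attention to the
    prefill-stage attention; a sharp drop signals an RVIS event."""
    p = np.asarray(prefill_attn, dtype=float)
    p = p / np.linalg.norm(p)
    sims = []
    for s in step_attns:
        s = np.asarray(s, dtype=float)
        sims.append(float(p @ s / np.linalg.norm(s)))
    return sims

# Toy trace: focus stays put for two steps, then jumps to another region,
# mimicking the stable-VQA vs. shifting-VMR behavior in Figure 2(a).
trace = track_attention_stability(
    [0.9, 0.1, 0.0],
    [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.0, 0.1, 0.9]],
)
```

A VQA-like sample would keep this trace near 1.0 throughout, while the final step here drops close to 0, the signature of an RVIS event.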

Finding 2 — Reasoning-Intrinsic Nature

RVIS Is Not a Side Effect of Longer Generation

Even the shortest VMR samples (1–512 tokens) exhibit higher RVIS frequency than the longest VQA samples (2048–4096 tokens). This confirms RVIS as an inherent hallmark of the reasoning process, not a mere side effect of generation length.

Figure 3

Figure 3. Average RVIS occurrences across various answer lengths.

Observation

RVIS Frequency Distribution

VQA samples are largely static, with most exhibiting zero RVIS, while VMR samples frequently exhibit two or more shifts during decoding.

Figure 4

Figure 4. Distribution of RVIS occurrences for VQA and VMR.

Impact

Pruning Success Drops with RVIS

The success rate of pruning methods drops precipitously with increasing RVIS frequency, providing direct evidence of the causal link.

Figure 5

Figure 5. Success rate of FastV across different RVIS frequencies.

DSTP: Decoding-stage Shift-aware Token Pruning

Based on our analysis, we propose DSTP, a simple, training-free add-on framework that enables existing pruning methods to adaptively update their retained visual tokens during decoding.

Module 1

RISD: Relevant Visual Information Shift Detection

Monitors the visual attention similarity at each decoding step. When the similarity drops below a threshold τ, an RVIS event is detected and CPTS is triggered.
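RISD's check reduces to a thresholded cosine-similarity test against the prefill-stage attention. A minimal sketch, where the threshold value is illustrative rather than the paper's tuned τ:

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def risd(prefill_attn, step_attn, tau=0.5):
    """RISD sketch: flag an RVIS event (triggering CPTS) when the current
    decoding step's visual attention drifts below threshold tau relative
    to the prefill-stage attention. tau=0.5 is an illustrative value."""
    return cosine(prefill_attn, step_attn) < tau

print(risd([0.8, 0.1, 0.1], [0.8, 0.1, 0.1]))  # False: focus unchanged
print(risd([0.9, 0.1, 0.0], [0.0, 0.1, 0.9]))  # True: focus has shifted
```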

Module 2

CPTS: Context-Preserving Visual Token Swap

Re-evaluates all visual tokens (including discarded ones) and forms a union set that preserves original context while incorporating newly relevant tokens.
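A minimal sketch of the union-set construction, assuming a per-token relevance score derived from the current decoding step's attention; the scoring source and the value of k are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def cpts(kept_idx, step_scores, k):
    """CPTS sketch: re-score ALL visual tokens (including previously pruned
    ones) with the current step's relevance, take the top-k, and union them
    with the originally kept set so the prefill context is preserved."""
    fresh = np.argsort(step_scores)[-k:]   # newly relevant tokens this step
    return np.union1d(kept_idx, fresh)     # original context + fresh tokens

kept = np.array([0, 1, 2])                 # tokens retained at prefill
scores = np.array([0.05, 0.02, 0.01, 0.60, 0.30, 0.02])
print(cpts(kept, scores, k=2).tolist())    # [0, 1, 2, 3, 4]
```

Note that tokens 3 and 4 were pruned at prefill but dominate the current step's relevance, so the swap recovers them without evicting the original context.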

DSTP Framework Overview

Figure 6

Figure 6. Overall framework of DSTP. (a) Prefill-stage protocol. (b) RISD monitors attention similarity, invoking CPTS when RVIS is detected. (c) Overall flow throughout the decoding process.

Main Results

Visual Reasoning Benchmarks

DSTP is evaluated as a plug-and-play add-on to three pruning methods (FastV, DivPrune, VisionZip) on Qwen3-VL-4B and InternVL3.5-8B.

Qwen3-VL-4B — Retain 33.3% Tokens

| Method | MathVerse | WeMath | DynaMath | LogicVista | MMMU-Pro | Acc. (%) |
|---|---|---|---|---|---|---|
| Vanilla (Full Tokens) | 61.29 | 48.29 | 66.48 | 49.22 | 37.63 | 100% |
| FastV | 32.23 | 25.90 | 38.76 | 30.64 | 19.41 | 55.7% |
| w/ DSTP | 52.54 | 42.19 | 51.00 | 41.16 | 30.52 | 82.9% |
| DivPrune | 33.90 | 29.32 | 41.94 | 30.64 | 13.08 | 55.2% |
| w/ DSTP | 50.25 | 42.95 | 57.70 | 42.70 | 28.03 | 83.8% |
| VisionZip | 36.80 | 35.57 | 43.16 | 29.53 | 16.36 | 60.4% |
| w/ DSTP | 50.76 | 43.62 | 52.95 | 39.22 | 27.39 | 81.0% |

Visual Understanding Benchmarks

DSTP also consistently yields performance gains on VQA tasks.

| Method | SQA | VQA-T | GQA | Acc. (%) |
|---|---|---|---|---|
| Vanilla (Full Tokens) | 93.42 | 81.57 | 61.82 | 100% |
| FastV | 87.98 | 74.82 | 60.04 | 94.3% |
| w/ DSTP | 91.84 | 77.95 | 60.93 | 97.5% |
| DivPrune | 86.12 | 71.36 | 58.07 | 91.2% |
| w/ DSTP | 91.36 | 74.11 | 60.42 | 95.5% |
| VisionZip | 90.41 | 77.10 | 60.68 | 96.5% |
| w/ DSTP | 92.08 | 77.20 | 61.27 | 97.4% |

Robustness to RVIS: DSTP vs. FastV

Figure 7

Figure 7. DSTP yields consistent gains across all RVIS frequencies, with the gap widening as RVIS becomes more frequent.

In-Depth Analysis

Effect of Components: RISD & CPTS

| Row | Detect | Swap Strategy | Ratio | MathVerse | MMMU-Pro |
|---|---|---|---|---|---|
| (a) | – | Full visual tokens | 100% | 61.29 | 37.63 |
| (b) | – | FastV (33.3%) | 33.3% | 32.23 | 19.41 |
| (c) | Random | CPTS | 38.1% | 37.69 | 22.36 |
| (d) | Avg | CPTS | 38.1% | 51.51 | 27.68 |
| (e) | RISD | Full | 100% | 53.04 | 29.55 |
| (f) | RISD | Hard | 33.3% | 45.05 | 27.26 |
| (g) | RISD | Merge | 33.3% | 47.58 | 28.38 |
| (h) | RISD | CPTS | 38.1% | 52.54 | 30.52 |

Computational Efficiency

Despite DSTP's dynamic detect-and-swap mechanism, the additional TFLOPs overhead is negligible. Remarkably, DSTP at 33.3% retention surpasses vanilla FastV at 66.6% while requiring significantly less computation.
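As a rough back-of-envelope illustration (not the paper's measurement), the standard per-layer transformer FLOPs estimate shows why retained token count dominates compute; the hidden size, FFN width, and visual/text token counts below are hypothetical:

```python
def layer_flops(n, d=2560, m=13824):
    """Approximate per-layer transformer FLOPs for n tokens:
    attention (4*n*d^2 + 2*n^2*d) plus FFN (2*n*d*m).
    d (hidden size) and m (FFN width) are illustrative values."""
    return 4 * n * d * d + 2 * n * n * d + 2 * n * d * m

v, t = 1024, 128                             # hypothetical visual / text tokens
full   = layer_flops(v + t)                  # no pruning
keep33 = layer_flops(int(v * 0.333) + t)     # DSTP-style 33.3% retention
keep66 = layer_flops(int(v * 0.666) + t)     # static 66.6% retention

print(f"33.3% retention: {keep33 / full:.1%} of full-layer FLOPs")
print(f"66.6% retention: {keep66 / full:.1%} of full-layer FLOPs")
```

Since the cost is dominated by terms linear and quadratic in n, retaining 33.3% of visual tokens (plus a small detect-and-swap overhead) remains well below the 66.6% static baseline it outperforms.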

Qualitative Comparison

Visual Token Selection Visualization

Figure 9

Figure 9. DSTP successfully retrieves essential visual context that remains pruned even by the 66.6% static baseline (red tokens).

Qualitative Examples

We compare DSTP with FastV across diverse visual reasoning scenarios, illustrating how adaptive token swapping during decoding helps the model capture the precise visual features needed at each reasoning step.

Example 1. Robustness in numeric information extraction. DSTP maintains high fidelity in reading numerical figures where FastV exhibits semantic drift.

Citation

If you find this work helpful, please consider citing:

@misc{kim2026visualtokenpruningfails,
  title={Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding},
  author={Jiwan Kim and Kibum Kim and Wonjoong Kim and Byung-Kwan Lee and Chanyoung Park},
  year={2026},
  eprint={2604.12358},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.12358}
}