Recently, efficient Multimodal Large Language Models (MLLMs) have gained significant attention as a way to reduce the high computational cost of MLLMs, making them more practical for real-world applications. In this regard, knowledge distillation (KD) has emerged as a promising approach: it transfers the rich visual and linguistic knowledge of a larger model (teacher) to a smaller model (student).
However, we observe that existing KD methods struggle to effectively distill the teacher MLLM's rich visual perception abilities to the student, a challenge that has been largely overlooked in previous studies. Through a systematic analysis, we identify visual attention misalignment between student and teacher as the main cause of this issue.
Based on this insight, we propose CompoDistill, a novel KD framework that explicitly aligns the student's visual attention with that of the teacher to enhance the student's visual perception abilities. Our extensive experiments show that CompoDistill significantly improves performance on compositional reasoning tasks while maintaining strong performance on visual question answering tasks.
While existing KD methods for MLLMs have proven effective on visual question answering (VQA) tasks that primarily require visual recognition, they struggle with compositional reasoning (CR) tasks that demand fine-grained visual perception, such as attribute binding and spatial relationship understanding.
Compositional Reasoning Performance Gap
Figure 1. Existing KD methods fail to transfer visual perception abilities, leading to degraded compositional reasoning performance.
Through systematic analysis, we identify visual attention misalignment as the root cause. The student model fails to attend to the same visual regions as the teacher, resulting in degraded visual perception despite successful knowledge transfer for simpler recognition tasks.
Visual Attention Misalignment
Figure 2. Attention maps reveal misalignment between student and teacher, highlighting the root cause of degraded visual perception.
We conduct a systematic three-step analysis to identify the root cause of the failure in distilling visual perception abilities from the teacher to the student MLLM.
We examine the layer-wise attention similarity between the student and the teacher over both visual and text tokens. On VQA tasks, where KD succeeds, the student's visual attention is highly aligned with the teacher's in the visual understanding layers. On CR tasks, where KD fails, however, no such alignment exists, revealing visual attention misalignment as the key factor.
Layer-wise Attention Similarity
Figure 3. (a) Attention of the answer token over visual and text tokens. (b–e) Layer-wise teacher-student and teacher-SFT attention similarities over visual tokens (b, c) and text tokens (d, e). Visual attention similarity in the visual understanding layers is the key differentiator.
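The layer-wise similarity analysis above can be sketched as follows. This is a minimal illustration, assuming heads are already averaged and that each layer is summarized by the answer token's attention distribution over visual tokens; the array shapes and function name are ours, not the paper's.

```python
import numpy as np

def layerwise_attention_similarity(student_attn, teacher_attn, eps=1e-8):
    """Per-layer cosine similarity between student and teacher attention.

    Both arguments are (num_layers, num_visual_tokens) arrays holding the
    answer token's attention over visual tokens at each layer, with heads
    already averaged. Shapes and names are illustrative assumptions.
    """
    s = np.asarray(student_attn, dtype=float)
    t = np.asarray(teacher_attn, dtype=float)
    dots = (s * t).sum(axis=-1)
    norms = np.linalg.norm(s, axis=-1) * np.linalg.norm(t, axis=-1)
    return dots / (norms + eps)

# Identical attention maps score ~1 per layer; disjoint maps score 0.
aligned = layerwise_attention_similarity([[0.7, 0.3]], [[0.7, 0.3]])
misaligned = layerwise_attention_similarity([[1.0, 0.0]], [[0.0, 1.0]])
```

Plotting this similarity per layer, separately for VQA and CR instances, reproduces the comparison in Figure 3.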
We then quantify the relationship between visual attention similarity and downstream performance. Grouping instances by their student-teacher attention similarity reveals a clear positive correlation: higher attention similarity corresponds to a higher answer-token probability, suggesting that attention alignment drives performance.
Figure 4. Higher student-teacher attention similarity correlates with higher answer token probability, providing direct evidence that attention alignment drives performance.
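The grouping step can be sketched as a simple binning analysis. Equal-width bins and the function name are our assumptions; the paper's exact binning scheme may differ.

```python
import numpy as np

def answer_prob_by_similarity_bin(similarities, answer_probs, num_bins=2):
    """Bucket instances by student-teacher attention similarity and report
    the mean answer-token probability per bucket (analysis sketch)."""
    sims = np.asarray(similarities, dtype=float)
    probs = np.asarray(answer_probs, dtype=float)
    edges = np.linspace(sims.min(), sims.max(), num_bins + 1)
    # digitize against interior edges so bin ids fall in [0, num_bins - 1]
    ids = np.digitize(sims, edges[1:-1])
    return [float(probs[ids == b].mean())
            for b in range(num_bins) if (ids == b).any()]

# toy data: instances with higher similarity carry higher answer probability
per_bin = answer_prob_by_similarity_bin(
    [0.1, 0.2, 0.45, 0.55, 0.8, 0.9],
    [0.30, 0.35, 0.50, 0.55, 0.70, 0.80],
)
```

A rising per-bin mean, as in the toy data, is the positive correlation shown in Figure 4.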
To validate our hypothesis, we perform a direct intervention: substituting the student's visual attention with the average of the teacher's and student's attention during inference. This yields consistent performance gains on compositional reasoning sub-tasks (Swap, Replace, Add), confirming that the student benefits from incorporating the teacher's attention patterns.
Figure 5. Blending the teacher's visual attention into the student's at inference time yields consistent performance gains on compositional reasoning sub-tasks, validating our hypothesis.
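The intervention amounts to mixing two attention distributions at inference time. A minimal sketch, assuming row-normalized attention over visual tokens (the renormalization guard and function name are ours):

```python
import numpy as np

def intervene_attention(student_attn, teacher_attn, alpha=0.5):
    """Inference-time intervention: blend the teacher's visual-attention
    distribution into the student's (alpha=0.5 gives the simple average
    used in the analysis) and renormalize to keep a valid distribution."""
    mixed = (alpha * np.asarray(teacher_attn, dtype=float)
             + (1.0 - alpha) * np.asarray(student_attn, dtype=float))
    return mixed / mixed.sum(axis=-1, keepdims=True)

student = np.array([0.8, 0.1, 0.1])   # student over-attends to token 0
teacher = np.array([0.2, 0.7, 0.1])   # teacher focuses on token 1
blended = intervene_attention(student, teacher)
```

Running the model with `blended` in place of the student's own attention is what produces the gains reported in Figure 5.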
CompoDistill Framework Overview
Figure 6. Our framework explicitly aligns the student's visual attention with the teacher's through an attention distillation loss, enhancing the student's visual perception abilities for compositional reasoning.
CompoDistill introduces an attention distillation mechanism that explicitly aligns the student MLLM's visual attention patterns with those of the teacher. By supervising how the student attends to visual tokens, our framework ensures that fine-grained visual perception abilities — critical for compositional reasoning — are effectively transferred during distillation.
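The core training signal can be sketched as a cosine-similarity loss between student and teacher attention, the variant that performs best in the ablation below. Exact shapes, head handling, and layer selection are our assumptions, not the paper's full recipe.

```python
import numpy as np

def attention_distill_loss(student_attn, teacher_attn, eps=1e-8):
    """Cosine-similarity attention-distillation loss: 1 - cos(student,
    teacher) per layer, averaged over the distilled (intermediate) layers.

    Inputs are (num_layers, num_visual_tokens) attention maps; shapes and
    layer selection are illustrative assumptions.
    """
    s = np.asarray(student_attn, dtype=float)
    t = np.asarray(teacher_attn, dtype=float)
    cos = (s * t).sum(axis=-1) / (
        np.linalg.norm(s, axis=-1) * np.linalg.norm(t, axis=-1) + eps)
    return float((1.0 - cos).mean())

# perfectly aligned attention incurs ~zero loss; disjoint attention is penalized
zero_loss = attention_distill_loss([[0.6, 0.4]], [[0.6, 0.4]])
high_loss = attention_distill_loss([[1.0, 0.0]], [[0.0, 1.0]])
```

In training, this term would be added to the usual KD objectives (e.g. logit distillation), pushing the student to attend where the teacher attends.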
CompoDistill is evaluated on both visual question answering (VQA) and compositional reasoning (CR) benchmarks. The student (2B, with a Qwen1.5-1.8B LLM) and the teacher (4B, with a Qwen1.5-4B LLM) both use the SigLIP vision encoder.
| Method | LLM | Size | # Samples | VQA Avg | CR Avg |
|---|---|---|---|---|---|
| LLaVA-4B (Teacher) | Qwen1.5-4B | 4B | 1.2M | 62.6 | 70.3 |
| LLaVA-2B (SFT) | Qwen1.5-1.8B | 2B | 1.2M | 54.9 | 60.7 |
| LLaVA-KD-2B | Qwen1.5-1.8B | 2B | 1.2M | 61.6 | 61.5 |
| LLaVADI-2B | Qwen1.5-1.8B | 2B | 1.2M | 56.6 | 60.2 |
| LLaVA-MoD-2B | Qwen1.5-1.8B | 2B | 5.0M | 58.9 | 62.6 |
| CompoDistill-2B (Ours) | Qwen1.5-1.8B | 2B | 1.2M | 61.9 | 66.7 |
Beyond compositional reasoning, CompoDistill also mitigates relational hallucinations by accurately understanding object relationships — achieving performance nearly on par with the teacher on R-Bench and Reefknot benchmarks.
| Model | R-Bench (F1) ↑ | Reefknot (F1) ↑ |
|---|---|---|
| Teacher (LLaVA-4B) | 79.1 | 67.9 |
| Student (LLaVA-2B) | 74.3 | 61.3 |
| LLaVA-KD-2B | 76.5 | 60.3 |
| LLaVA-MoD-2B | 76.2 | 63.4 |
| CompoDistill-2B (Ours) | 78.6 | 66.7 |
Qualitative Comparison
Figure 7. Qualitative examples demonstrating improved visual attention alignment and compositional reasoning performance. CompoDistill's student attends to the correct regions, closely matching the teacher's attention patterns.
| VAT | TAF | VQA Avg | CR Avg |
|---|---|---|---|
|  |  | 56.8 | 62.9 |
| ✓ |  | 57.9 | 65.0 |
|  | ✓ | 61.3 | 63.8 |
| ✓ | ✓ | 62.9 | 66.7 |
(a) Attention Loss Type
| Loss | VQA Avg | CR Avg |
|---|---|---|
| None | 61.3 | 63.8 |
| MSE | 60.3 | 65.2 |
| KL Div. | 60.7 | 65.5 |
| Cos. Sim. | 62.9 | 66.7 |
(b) Target Layers
| Layers | VQA Avg | CR Avg |
|---|---|---|
| Early (~30%) | 61.2 | 63.7 |
| Later (70%~) | 61.7 | 64.6 |
| All | 62.4 | 66.6 |
| Intermediate | 62.9 | 66.7 |
(c) Layer Matching
| Strategy | VQA Avg | CR Avg |
|---|---|---|
| Simple | 61.5 | 65.6 |
| Adaptive | 62.0 | 65.7 |
| Group | 62.9 | 66.7 |
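Since the teacher has more layers than the student, distilling attention requires a layer-matching rule; table (c) shows group-wise matching works best. A sketch of one plausible group matching, assuming contiguous groups and proportional spacing within each group (the exact rule is an assumption, not the paper's recipe):

```python
def group_layer_matching(num_student_layers, num_teacher_layers, num_groups=4):
    """Split both models' layers into the same number of contiguous groups,
    then pair each student layer with a teacher layer at the same relative
    position inside its group. Illustrative sketch only."""
    pairs = []
    for g in range(num_groups):
        s_lo = g * num_student_layers // num_groups
        s_hi = (g + 1) * num_student_layers // num_groups
        t_lo = g * num_teacher_layers // num_groups
        t_hi = (g + 1) * num_teacher_layers // num_groups
        for k, s_layer in enumerate(range(s_lo, s_hi)):
            # spread the group's student layers across the teacher group
            frac = k / max(s_hi - s_lo - 1, 1)
            t_layer = t_lo + round(frac * (t_hi - t_lo - 1))
            pairs.append((s_layer, t_layer))
    return pairs

# e.g. a 24-layer student distilled from a 40-layer teacher
pairs = group_layer_matching(24, 40)
```

Matching within groups keeps early student layers supervised by early teacher layers and late by late, unlike a single global proportional map.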
If you find this work helpful, please consider citing:
@inproceedings{kim2026compodistill,
title={CompoDistill: Attention Distillation for Compositional Reasoning in Multimodal LLMs},
author={Jiwan Kim and Kibum Kim and Sangwoo Seo and Chanyoung Park},
booktitle={International Conference on Learning Representations (ICLR)},
year={2026},
url={https://arxiv.org/abs/2510.12184}
}