Recently, efficient Multimodal Large Language Models (MLLMs) have gained significant attention as a way to reduce the high computational cost of MLLMs, making them more practical for real-world applications. In this regard, knowledge distillation (KD) has emerged as a promising approach: it transfers the rich visual and linguistic knowledge of a larger model (teacher) to a smaller model (student).
However, we observe that existing KD methods struggle to effectively distill the teacher MLLM's rich visual perception abilities to the student, a challenge that has been largely overlooked in previous studies. Through a systematic analysis, we identify visual attention misalignment between student and teacher as the main cause of this issue.
Based on this insight, we propose CompoDistill, a novel KD framework that explicitly aligns the student's visual attention with that of the teacher to enhance the student's visual perception abilities. Our extensive experiments show that CompoDistill significantly improves performance on compositional reasoning tasks while maintaining strong performance on visual question answering tasks.
While existing KD methods for MLLMs have proven effective on visual question answering (VQA) tasks that primarily require visual recognition, they struggle with compositional reasoning (CR) tasks that demand fine-grained visual perception, such as attribute binding and spatial relationship understanding.
Compositional Reasoning Performance Gap
Figure 1. Existing KD methods fail to transfer visual perception abilities, leading to degraded compositional reasoning performance.
Through systematic analysis, we identify visual attention misalignment as the root cause. The student model fails to attend to the same visual regions as the teacher, resulting in degraded visual perception despite successful knowledge transfer for simpler recognition tasks.
Visual Attention Misalignment
Figure 2. Attention maps reveal misalignment between student and teacher, highlighting the root cause of degraded visual perception.
We conduct a systematic three-step analysis to identify the root cause of the failure in distilling visual perception abilities from the teacher to the student MLLM.
We examine the layer-wise attention similarity between the student and the teacher over both visual and text tokens. On VQA tasks, where KD succeeds, the student's visual attention is highly aligned with the teacher's in the visual understanding layers. On CR tasks, where KD fails, however, no such alignment exists, revealing visual attention misalignment as the key factor.
Layer-wise Attention Similarity
Figure 3. (a) Attention of the answer token over visual and text tokens. (b–e) Layer-wise teacher-student and teacher-SFT attention similarities over visual tokens (b, c) and text tokens (d, e). Visual attention similarity in the visual understanding layers is the key differentiator.
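The layer-wise similarity analysis above can be sketched as follows. This is a minimal illustration, assuming heads are already averaged and that each layer is summarized by the answer token's attention distribution over visual tokens; the array shapes and function name are ours, not the paper's.

```python
import numpy as np

def layerwise_attention_similarity(student_attn, teacher_attn, eps=1e-8):
    """Per-layer cosine similarity between student and teacher attention.

    Both arguments are (num_layers, num_visual_tokens) arrays holding the
    answer token's attention over visual tokens at each layer, with heads
    already averaged. Shapes and names are illustrative assumptions.
    """
    s = np.asarray(student_attn, dtype=float)
    t = np.asarray(teacher_attn, dtype=float)
    dots = (s * t).sum(axis=-1)
    norms = np.linalg.norm(s, axis=-1) * np.linalg.norm(t, axis=-1)
    return dots / (norms + eps)

# Identical attention maps score ~1 per layer; disjoint maps score 0.
aligned = layerwise_attention_similarity([[0.7, 0.3]], [[0.7, 0.3]])
misaligned = layerwise_attention_similarity([[1.0, 0.0]], [[0.0, 1.0]])
```

Plotting this similarity per layer, separately for VQA and CR instances, reproduces the comparison in Figure 3.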
We then quantify the relationship between visual attention similarity and downstream performance. Grouping instances by their student-teacher attention similarity reveals a clear positive correlation: higher attention similarity corresponds to a higher answer-token probability, suggesting that attention alignment drives performance.
Figure 4. Higher student-teacher attention similarity correlates with higher answer token probability, providing direct evidence that attention alignment drives performance.
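The grouping step can be sketched as a simple binning analysis. Equal-width bins and the function name are our assumptions; the paper's exact binning scheme may differ.

```python
import numpy as np

def answer_prob_by_similarity_bin(similarities, answer_probs, num_bins=2):
    """Bucket instances by student-teacher attention similarity and report
    the mean answer-token probability per bucket (analysis sketch)."""
    sims = np.asarray(similarities, dtype=float)
    probs = np.asarray(answer_probs, dtype=float)
    edges = np.linspace(sims.min(), sims.max(), num_bins + 1)
    # digitize against interior edges so bin ids fall in [0, num_bins - 1]
    ids = np.digitize(sims, edges[1:-1])
    return [float(probs[ids == b].mean())
            for b in range(num_bins) if (ids == b).any()]

# toy data: instances with higher similarity carry higher answer probability
per_bin = answer_prob_by_similarity_bin(
    [0.1, 0.2, 0.45, 0.55, 0.8, 0.9],
    [0.30, 0.35, 0.50, 0.55, 0.70, 0.80],
)
```

A rising per-bin mean, as in the toy data, is the positive correlation shown in Figure 4.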
To validate our hypothesis, we perform a direct intervention: substituting the student's visual attention with the average of the teacher's and student's attention during inference. This yields consistent performance gains on compositional reasoning sub-tasks (Swap, Replace, Add), confirming that the student benefits from incorporating the teacher's attention patterns.
Figure 5. Blending the teacher's visual attention into the student's at inference time yields consistent performance gains on compositional reasoning sub-tasks, validating our hypothesis.
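The intervention amounts to mixing two attention distributions at inference time. A minimal sketch, assuming row-normalized attention over visual tokens (the renormalization guard and function name are ours):

```python
import numpy as np

def intervene_attention(student_attn, teacher_attn, alpha=0.5):
    """Inference-time intervention: blend the teacher's visual-attention
    distribution into the student's (alpha=0.5 gives the simple average
    used in the analysis) and renormalize to keep a valid distribution."""
    mixed = (alpha * np.asarray(teacher_attn, dtype=float)
             + (1.0 - alpha) * np.asarray(student_attn, dtype=float))
    return mixed / mixed.sum(axis=-1, keepdims=True)

student = np.array([0.8, 0.1, 0.1])   # student over-attends to token 0
teacher = np.array([0.2, 0.7, 0.1])   # teacher focuses on token 1
blended = intervene_attention(student, teacher)
```

Running the model with `blended` in place of the student's own attention is what produces the gains reported in Figure 5.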
CompoDistill Framework Overview
Figure 6. Our framework explicitly aligns the student's visual attention with the teacher's through an attention distillation loss, enhancing the student's visual perception abilities for compositional reasoning.
CompoDistill introduces an attention distillation mechanism that explicitly aligns the student MLLM's visual attention patterns with those of the teacher. By supervising how the student attends to visual tokens, our framework ensures that fine-grained visual perception abilities — critical for compositional reasoning — are effectively transferred during distillation.
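The core training signal can be sketched as a cosine-similarity loss between student and teacher attention, the variant that performs best in the ablation below. Exact shapes, head handling, and layer selection are our assumptions, not the paper's full recipe.

```python
import numpy as np

def attention_distill_loss(student_attn, teacher_attn, eps=1e-8):
    """Cosine-similarity attention-distillation loss: 1 - cos(student,
    teacher) per layer, averaged over the distilled (intermediate) layers.

    Inputs are (num_layers, num_visual_tokens) attention maps; shapes and
    layer selection are illustrative assumptions.
    """
    s = np.asarray(student_attn, dtype=float)
    t = np.asarray(teacher_attn, dtype=float)
    cos = (s * t).sum(axis=-1) / (
        np.linalg.norm(s, axis=-1) * np.linalg.norm(t, axis=-1) + eps)
    return float((1.0 - cos).mean())

# perfectly aligned attention incurs ~zero loss; disjoint attention is penalized
zero_loss = attention_distill_loss([[0.6, 0.4]], [[0.6, 0.4]])
high_loss = attention_distill_loss([[1.0, 0.0]], [[0.0, 1.0]])
```

In training, this term would be added to the usual KD objectives (e.g. logit distillation), pushing the student to attend where the teacher attends.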
CompoDistill is evaluated on both visual question answering (VQA) and compositional reasoning (CR) benchmarks. The student (2B, with a Qwen1.5-1.8B LLM) and the teacher (4B, with a Qwen1.5-4B LLM) both use the SigLIP vision encoder.
| Method | LLM | Size | # Samples | VQA Avg | CR Avg |
|---|---|---|---|---|---|
| LLaVA-4B (Teacher) | Qwen1.5-4B | 4B | 1.2M | 62.6 | 70.3 |
| LLaVA-2B (SFT) | Qwen1.5-1.8B | 2B | 1.2M | 54.9 | 60.7 |
| LLaVA-KD-2B | Qwen1.5-1.8B | 2B | 1.2M | 61.6 | 61.5 |
| LLaVADI-2B | Qwen1.5-1.8B | 2B | 1.2M | 56.6 | 60.2 |
| LLaVA-MoD-2B | Qwen1.5-1.8B | 2B | 5.0M | 58.9 | 62.6 |
| CompoDistill-2B (Ours) | Qwen1.5-1.8B | 2B | 1.2M | 61.9 | 66.7 |
Beyond compositional reasoning, CompoDistill also mitigates relational hallucinations by accurately understanding object relationships — achieving performance nearly on par with the teacher on R-Bench and Reefknot benchmarks.
| Model | R-Bench (F1) ↑ | Reefknot (F1) ↑ |
|---|---|---|
| Teacher (LLaVA-4B) | 79.1 | 67.9 |
| Student (LLaVA-2B) | 74.3 | 61.3 |
| LLaVA-KD-2B | 76.5 | 60.3 |
| LLaVA-MoD-2B | 76.2 | 63.4 |
| CompoDistill-2B (Ours) | 78.6 | 66.7 |
Qualitative Comparison
Figure 7. Qualitative examples demonstrating improved visual attention alignment and compositional reasoning performance. CompoDistill's student attends to the correct regions, closely matching the teacher's attention patterns.
| VAT | TAF | VQA Avg | CR Avg |
|---|---|---|---|
|  |  | 56.8 | 62.9 |
| ✓ |  | 57.9 | 65.0 |
|  | ✓ | 61.3 | 63.8 |
| ✓ | ✓ | 62.9 | 66.7 |
(a) Attention Loss Type
| Loss | VQA Avg | CR Avg |
|---|---|---|
| None | 61.3 | 63.8 |
| MSE | 60.3 | 65.2 |
| KL Div. | 60.7 | 65.5 |
| Cos. Sim. | 62.9 | 66.7 |
(b) Target Layers
| Layers | VQA Avg | CR Avg |
|---|---|---|
| Early (~30%) | 61.2 | 63.7 |
| Later (70%~) | 61.7 | 64.6 |
| All | 62.4 | 66.6 |
| Intermediate | 62.9 | 66.7 |
(c) Layer Matching
| Strategy | VQA Avg | CR Avg |
|---|---|---|
| Simple | 61.5 | 65.6 |
| Adaptive | 62.0 | 65.7 |
| Group | 62.9 | 66.7 |
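Since the teacher has more layers than the student, distilling attention requires a layer-matching rule; table (c) shows group-wise matching works best. A sketch of one plausible group matching, assuming contiguous groups and proportional spacing within each group (the exact rule is an assumption, not the paper's recipe):

```python
def group_layer_matching(num_student_layers, num_teacher_layers, num_groups=4):
    """Split both models' layers into the same number of contiguous groups,
    then pair each student layer with a teacher layer at the same relative
    position inside its group. Illustrative sketch only."""
    pairs = []
    for g in range(num_groups):
        s_lo = g * num_student_layers // num_groups
        s_hi = (g + 1) * num_student_layers // num_groups
        t_lo = g * num_teacher_layers // num_groups
        t_hi = (g + 1) * num_teacher_layers // num_groups
        for k, s_layer in enumerate(range(s_lo, s_hi)):
            # spread the group's student layers across the teacher group
            frac = k / max(s_hi - s_lo - 1, 1)
            t_layer = t_lo + round(frac * (t_hi - t_lo - 1))
            pairs.append((s_layer, t_layer))
    return pairs

# e.g. a 24-layer student distilled from a 40-layer teacher
pairs = group_layer_matching(24, 40)
```

Matching within groups keeps early student layers supervised by early teacher layers and late by late, unlike a single global proportional map.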
If you find this work helpful, please consider citing:
@inproceedings{kim2026compodistill,
title={CompoDistill: Attention Distillation for Compositional Reasoning in Multimodal LLMs},
author={Jiwan Kim and Kibum Kim and Sangwoo Seo and Chanyoung Park},
booktitle={International Conference on Learning Representations (ICLR)},
year={2026},
url={https://arxiv.org/abs/2510.12184}
}