Key Contributions
- We formulate a unified vision-language framework that supports five key malware analysis tasks using a single model architecture.
- We design and generate a high-quality, multi-task, explanation-enhanced dataset aligned with malware-specific visual patterns and reasoning needs.
- We demonstrate that VIMAR matches or surpasses task-specific baselines across all five tasks, while generalizing to unseen families and producing interpretable explanations.
Methodology
VIMAR is built on SmolVLM, a compact 2.2B-parameter vision-language backbone. It takes grayscale byteplot images of malware binaries as input and adapts to a range of task settings via prompt-based conditioning:
- Classification (CLS): Assign a malware image to a predefined family
- Similarity Classification (SC): Determine if two malware images belong to the same family
- Similarity Preference (SP): Decide which of two references is more similar to a query
- Zero-shot Classification (ZSC): Classify into unseen families using textual descriptions
- Few-shot Classification (FSC): Classify with minimal labeled examples
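The byteplot input mentioned above can be sketched as follows: raw bytes of a binary are reshaped row by row into a 2D grid of grayscale intensities. The fixed-width layout and zero-padding here are illustrative assumptions; the paper may use a different width heuristic or image size.

```python
import math

def byteplot(data: bytes, width: int = 256) -> list[list[int]]:
    """Reshape raw binary bytes into a 2D grid of pixel intensities (0-255).

    Each byte value becomes one grayscale pixel; the final row is
    zero-padded so the grid is rectangular. (Width choice is an
    assumption, not taken from the paper.)
    """
    height = math.ceil(len(data) / width)
    padded = data + b"\x00" * (height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]

# Example: a toy "binary" rendered as a 64-column byteplot.
img = byteplot(b"MZ\x90\x00" * 128, width=64)
```

In practice the resulting grid would be saved as a grayscale image (e.g. via Pillow) before being fed to the vision encoder.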
A two-stage training strategy is employed: Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO) to enhance both task performance and output quality.
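The core idea behind the GRPO stage can be sketched in a few lines: for each prompt, a group of candidate responses is sampled, and each response's reward is normalized against the group's mean and standard deviation, so no learned value critic is needed. The reward values below are illustrative, not drawn from the paper.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each sampled response's reward against its group.

    This is the group-relative baseline at the heart of GRPO:
    advantage_i = (r_i - mean(group)) / (std(group) + eps).
    """
    mu, sigma = mean(rewards), stdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Four sampled responses to one prompt, scored by a reward model.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 1.0])
```

Responses scoring above the group mean get positive advantages and are reinforced; below-mean responses are pushed down, which is how GRPO improves output quality on top of the SFT model.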
Results
| Task | Metric | Score |
|---|---|---|
| Family Classification (CLS) | Accuracy | 94.2% |
| Zero-shot Classification (ZSC) | Accuracy | 85.2% |
| Few-shot Classification (FSC) | Accuracy | 88.0% |
Citation
```bibtex
@article{xu2026vimar,
  title={VIMAR: vision-language informed malware analysis and reasoning model},
  author={Xu, Shiting},
  journal={Cybersecurity},
  volume={9},
  pages={49},
  year={2026},
  publisher={Springer}
}
```