Key Contributions

  • We formulate a unified vision-language framework that supports five key malware analysis tasks using a single model architecture.
  • We design and generate a high-quality, multi-task, explanation-enhanced dataset aligned with malware-specific visual patterns and reasoning needs.
  • We demonstrate that VIMAR achieves competitive performance across tasks, with strong generalization and interpretability, matching or surpassing task-specific baselines.

Methodology

VIMAR is built on the SmolVLM architecture, a compact 2.2B-parameter vision-language backbone. It takes grayscale byteplot images of malware binaries as input and adapts to a range of task settings via prompt-based conditioning:

  • Classification (CLS): Assign a malware image to a predefined family
  • Similarity Classification (SC): Determine if two malware images belong to the same family
  • Similarity Preference (SP): Decide which of two references is more similar to a query
  • Zero-shot Classification (ZSC): Classify into unseen families using textual descriptions
  • Few-shot Classification (FSC): Classify with minimal labeled examples
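As a sketch of this input and conditioning pipeline: the snippet below renders raw bytes as a square grayscale byteplot array and selects a per-task prompt template. The image width and the prompt wordings are illustrative assumptions, not the paper's exact configuration.

```python
import math

def byteplot(data: bytes, width: int = 64) -> list[list[int]]:
    """Render a binary as a grayscale byteplot: each byte (0-255) becomes
    one pixel intensity, laid out row by row at a fixed width."""
    rows = math.ceil(len(data) / width)
    padded = data.ljust(rows * width, b"\x00")  # zero-pad the final row
    return [list(padded[r * width:(r + 1) * width]) for r in range(rows)]

# Illustrative prompt templates for the five task settings (assumed wording;
# the paper's actual prompts are not reproduced here).
PROMPTS = {
    "CLS": "Which malware family does this image belong to? Options: {families}",
    "SC":  "Do these two malware images belong to the same family? Answer yes or no.",
    "SP":  "Which reference image, A or B, is more similar to the query image?",
    "ZSC": "Using these family descriptions, classify the image: {descriptions}",
    "FSC": "Given the labeled example images, classify the query image.",
}

def build_prompt(task: str, **fields) -> str:
    """Fill the selected task template with its fields."""
    return PROMPTS[task].format(**fields)
```

Because every task reduces to an image (or images) plus a textual prompt, a single backbone can serve all five settings without task-specific heads.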

A two-stage training strategy is employed: Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO) to enhance both task performance and output quality.
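GRPO dispenses with a learned value model: for each prompt it samples a group of completions, scores them with a reward function, and normalizes each reward against the group's mean and standard deviation to obtain an advantage. A minimal sketch of that normalization step (the reward function and the policy update itself are omitted):

```python
import statistics

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Group-relative advantages: center each sampled completion's reward
    on the group mean and scale by the group standard deviation."""
    mu = statistics.mean(group_rewards)
    sigma = statistics.pstdev(group_rewards)
    if sigma == 0.0:  # all completions scored identically: no signal
        return [0.0 for _ in group_rewards]
    return [(r - mu) / sigma for r in group_rewards]
```

Completions scoring above the group average receive positive advantages and are reinforced, which is what lets the second stage sharpen both task accuracy and output quality on top of the SFT checkpoint.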

Results

Task                            Metric     Score
Family Classification (CLS)     Accuracy   94.2%
Zero-shot Classification (ZSC)  Accuracy   85.2%
Few-shot Classification (FSC)   Accuracy   88.0%

Citation

@article{xu2026vimar,
  title={VIMAR: vision-language informed malware analysis and reasoning model},
  author={Xu, Shiting},
  journal={Cybersecurity},
  volume={9},
  pages={49},
  year={2026},
  publisher={Springer}
}