Dragonfly: A large vision-language model with multi-resolution zoom
We are excited to announce the launch of Dragonfly, an instruction-tuned vision-language architecture that enhances fine-grained visual understanding and reasoning about image regions.
We are releasing the Dragonfly architecture, which uses multi-resolution zoom-and-select to enhance multimodal reasoning while remaining context-efficient. We are also launching two new open-source models: Llama-3-8b-Dragonfly-v1, a general-domain model trained on 5.5 million image-instruction pairs, and Llama-3-8b-Dragonfly-Med-v1, finetuned on an additional 1.4 million biomedical image-instruction pairs. Dragonfly demonstrates promising performance on vision-language benchmarks such as commonsense visual QA and image captioning. Dragonfly-Med outperforms prior models, including Med-Gemini, on multiple medical imaging tasks, showcasing its capabilities for high-resolution medical data.
Figure 1: Generated example from Llama-3-8B-Dragonfly
Figure 2: Generated examples from Llama-3-8B-Dragonfly-Med
Dragonfly Model Architecture
arXiv paper with technical details and codebase for the architecture implementation
Dragonfly is special because:
- Dragonfly employs two key strategies: multi-resolution visual encoding and zoom-in patch selection, which enable the model to focus on fine-grained details in image regions and provide better commonsense reasoning. Although the model is optimized for capturing fine-grained image details, it also achieves good zero-shot performance on standard image-understanding benchmarks such as visual question answering and natural-image captioning.
- We demonstrate the model's understanding and reasoning capabilities on biomedical tasks, which often require a fine-grained understanding of high-resolution image regions. By finetuning the general-domain model on a biomedical instruction-tuning dataset of 1.4 million biomedical image-text pairs, we obtain Dragonfly-Med, our biomedical version of the model, which achieves state-of-the-art or competitive performance on multiple biomedical benchmarks, including visual question answering, image captioning, and radiology report generation.
Figure 3: Overview of the Dragonfly architecture
Multi-resolution Visual Encoding: Dragonfly processes images at multiple resolutions—low, medium, and high. Each image, depending on its resolution, is divided into sub-images that are encoded into visual tokens. These tokens are then projected into a language space, forming a concatenated sequence that feeds into the LLM. This method allows the model to handle larger images efficiently and improves the granularity of visual data processing.
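To make the encoding pipeline concrete, here is a minimal PyTorch sketch of multi-resolution visual encoding. It assumes a CLIP-style `vision_encoder` that maps each sub-image to a sequence of visual tokens and a `projector` that maps those tokens into the LLM embedding space; the function names, the 336-pixel sub-image size, and the resolution schedule are illustrative assumptions rather than the released implementation.

```python
# Minimal sketch of multi-resolution visual encoding (illustrative only).
# `vision_encoder` and `projector` are stand-ins for the CLIP visual backbone
# and the vision-to-language projection layer; they are assumptions, not the
# released Dragonfly code.
import torch
import torch.nn.functional as F

PATCH = 336  # sub-image side length fed to the vision encoder (assumed)

def split_into_subimages(image: torch.Tensor, size: int = PATCH) -> torch.Tensor:
    """Split a (C, H, W) image into non-overlapping (C, size, size) sub-images."""
    c, h, w = image.shape
    # Pad so H and W are multiples of the sub-image size.
    pad_h, pad_w = (-h) % size, (-w) % size
    image = F.pad(image, (0, pad_w, 0, pad_h))
    patches = image.unfold(1, size, size).unfold(2, size, size)  # (C, nH, nW, size, size)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c, size, size)

def encode_multi_resolution(image, vision_encoder, projector,
                            resolutions=(336, 672, 1344)):
    """Encode the image at low/medium/high resolutions and concatenate projected tokens."""
    token_groups = []
    for res in resolutions:
        # Resize to the target resolution, then cut into encoder-sized sub-images.
        resized = F.interpolate(image.unsqueeze(0), size=(res, res), mode="bilinear")[0]
        sub_images = split_into_subimages(resized)      # (N, C, PATCH, PATCH)
        visual_tokens = vision_encoder(sub_images)      # (N, T, D_vision), assumed shape
        token_groups.append(projector(visual_tokens))   # (N, T, D_llm)
    # Flatten all sub-image tokens into one sequence that precedes the text tokens.
    return torch.cat([g.reshape(-1, g.shape[-1]) for g in token_groups], dim=0)
```

In this sketch, the low-resolution pass yields a single sub-image giving global context, while the medium- and high-resolution passes contribute progressively more sub-images covering finer detail.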
Zoom-in Patch Selection: To further refine the model's focus on important visual details, Dragonfly employs a selective approach for high-resolution images. A novel zoom-in patch selection strategy identifies and retains only those high-resolution sub-images that provide the most significant visual information. This is achieved by comparing summary embeddings of medium- and high-resolution sub-images to select the most semantically relevant patches. The targeted selection reduces redundancy and focuses on crucial content areas, thereby improving overall model efficiency and fine-grained understanding of those areas.
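The selection step can be sketched in the same spirit. The snippet below mean-pools each sub-image's visual tokens into a summary embedding and keeps the top-k high-resolution sub-images whose summaries are most cosine-similar to the medium-resolution summary; the exact pooling and scoring rule used by Dragonfly may differ, so treat this as an illustrative approximation.

```python
# Minimal sketch of zoom-in patch selection (illustrative only). The precise
# scoring used by Dragonfly may differ; here we compare mean-pooled "summary"
# embeddings with cosine similarity and keep the top-k high-resolution patches.
import torch
import torch.nn.functional as F

def select_high_res_patches(medium_tokens: torch.Tensor,
                            high_tokens: torch.Tensor,
                            top_k: int) -> torch.Tensor:
    """
    medium_tokens: (N_med, T, D) visual tokens of medium-resolution sub-images
    high_tokens:   (N_high, T, D) visual tokens of high-resolution sub-images
    Returns the (top_k, T, D) high-resolution token groups most similar to the
    medium-resolution summary.
    """
    # Summary embedding per sub-image: mean over its visual tokens.
    med_summary = medium_tokens.mean(dim=1)                                   # (N_med, D)
    high_summary = F.normalize(high_tokens.mean(dim=1), dim=-1)               # (N_high, D)

    # Global medium-resolution summary describing the whole image.
    med_global = F.normalize(med_summary.mean(dim=0, keepdim=True), dim=-1)   # (1, D)

    # Cosine similarity of each high-res patch to the medium-resolution summary.
    scores = (high_summary @ med_global.T).squeeze(-1)                        # (N_high,)
    keep = scores.topk(min(top_k, scores.numel())).indices
    return high_tokens[keep]
```

In practice, the retained high-resolution tokens would then be concatenated with the low- and medium-resolution tokens before being fed to the LLM, so only the most informative zoomed-in patches consume context.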
Together, these strategies enable Dragonfly to achieve detailed and efficient visual understanding, making it particularly adept at processing complex image data in specific domains.
Figures 4 and 5: Generated examples from Llama-3-8B-Dragonfly
Dragonfly Model Evaluation
We evaluate Dragonfly, trained on Llama-3-8B, on five popular vision-language benchmarks that require strong commonsense reasoning and detailed image understanding: AI2D, ScienceQA, MMMU, MMVet, and POPE. AI2D and ScienceQA evaluate visual commonsense reasoning in the science domain. MMMU and MMVet provide a comprehensive evaluation of general vision-language capabilities. POPE evaluates object-level hallucination on regional details. Dragonfly achieves competitive performance compared with other popular vision-language models, demonstrating its effectiveness at commonsense reasoning and fine-grained understanding of image regions. The results are presented in the following table:
Dragonfly-Med
We partnered with the Zou group at Stanford Medicine to finetune Dragonfly on an additional 1.4 million biomedical image-instruction pairs, producing the biomedical version, Dragonfly-Med.
Figures 6 and 7: Generated examples from Llama-3-8B-Dragonfly-Med
Dragonfly-Med is evaluated on visual question answering, medical image captioning, and clinical report generation. Our model outperforms Med-Gemini, which was finetuned from a customized version of Gemini 1.5 with about 5x more image-instruction data, on all three question-answering datasets. The following two tables show the results on these benchmarks: the first shows performance on question-answering benchmarks, and the second presents performance on biomedical image captioning and report generation.
Dragonfly-Med outperforms existing state-of-the-art models across multiple benchmarks. On the Path-VQA dataset, it leads in both accuracy (closed) and token F1. It attains an accuracy of 90.4% on the SLAKE dataset, close to the current state of the art of 91.6%. Notably, Dragonfly-Med outperforms Med-Gemini, a model finetuned from Gemini 1.5 on 7 million biomedical data samples, on all VQA tasks we evaluated. On image captioning, Dragonfly-Med achieves state-of-the-art or competitive results on several metrics across these datasets; on the Peir Gross and ROCO datasets, it outperforms existing methods on all three metrics: ROUGE-L, METEOR, and CIDEr. Some of the baseline models are much larger than our current implementation. The zoom-and-select architecture of Dragonfly is especially powerful for medical image understanding, as medical images are often very large and the salient regions are confined to small patches.
Conclusion and Future Work
The Dragonfly architecture points to a promising research direction: zooming in on image regions to focus on selected fine-grained visual information. We trained two checkpoints based on LLaMA3-8B-Instruct and achieved promising results on both general-domain and biomedical-domain tasks. We hope this work helps the research community explore open-source multimodal research and apply AI to real-world problems.
We will keep improving the comprehensive capabilities of open-source multimodal models. In the future, we will explore new architectures, better visual encoding strategies, more comprehensive studies of data mixtures, and additional scientific domains, to provide benefits for broader fields.
Acknowledgements
We especially appreciate the contributions and collaborations of our partner, Stanford Medicine. We would also like to acknowledge the following resources that were instrumental in the development of Dragonfly:
- Meta LLaMA3: We use LLaMA3 as the language model backbone for our current two checkpoints.
- CLIP: Our current visual backbone is the CLIP model from OpenAI.
- Our codebase is built upon the codebases of Otter and LLaVA-UHD.