Microsoft Introduces Florence-2: A Versatile Model for Diverse Vision Tasks

Today, Microsoft’s Azure AI team unveiled a new vision foundation model called Florence-2 on Hugging Face. Licensed under the permissive MIT license, the model can handle a variety of vision and vision-language tasks through a unified, prompt-based representation. Available in two sizes, at 232 million and 771 million parameters, it already performs exceptionally well on tasks like captioning, object detection, visual grounding, and segmentation, often outpacing much larger vision models.
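For developers who want to experiment, the model can be loaded through the Hugging Face Transformers library. The snippet below is a minimal sketch: the model IDs and the <CAPTION> task token follow Florence-2’s Hugging Face model cards rather than anything stated in this article, and the image URL is a placeholder.

```python
# Minimal sketch of prompting Florence-2 via Hugging Face Transformers.
# Model IDs and the <CAPTION> task token are taken from the model cards
# on Hugging Face; treat them as assumptions, not details from this piece.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"  # 232M; "microsoft/Florence-2-large" is 771M
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Any image works; this URL is a placeholder.
image = Image.open(requests.get("https://example.com/image.jpg", stream=True).raw)

# The task is selected purely by the text prompt.
inputs = processor(text="<CAPTION>", images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=256,
)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(caption)
```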

Though real-world performance tests are still pending, this model could offer enterprises a single, streamlined approach for handling various vision applications. This means businesses might not have to invest in separate, task-specific vision models that often need extensive fine-tuning to perform tasks outside their primary function.

What makes Florence-2 stand out?

Large language models (LLMs) have become integral to enterprise operations, performing tasks like summarizing content, writing marketing copy, and managing customer service with remarkable adaptability. This success has led researchers to question whether vision models, which have traditionally been task-specific, could achieve similar versatility.

Vision tasks are inherently more complex than text-based natural language processing (NLP) tasks because they require comprehensive perceptual ability. To achieve a universal representation across diverse vision tasks, a model must handle spatial information at multiple scales, from broad image-level concepts such as the location of objects down to fine pixel-level details, along with semantic granularity that ranges from high-level descriptions to detailed, region-specific captions.

Microsoft encountered two significant challenges in this endeavor: the lack of comprehensively annotated visual datasets and the absence of a unified pretraining framework capable of understanding both spatial hierarchy and semantic granularity.

To overcome these obstacles, Microsoft used specialized models to create a visual dataset called FLD-5B, comprising 5.4 billion annotations for 126 million images, covering everything from high-level descriptions to specific regions and objects. Using this data, they trained Florence-2 with a sequence-to-sequence architecture that integrates an image encoder and a multi-modality encoder-decoder. This architecture allows the model to handle various vision tasks without needing task-specific modifications.
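To make that design concrete, here is a toy sketch, in no way Microsoft’s actual code, of the layout the paper describes: an image encoder turns pixels into visual tokens, which are concatenated with embedded prompt tokens and passed through a standard transformer encoder-decoder that generates text. Every dimension and name below is an illustrative assumption.

```python
# Toy sketch (not Microsoft's implementation) of a Florence-2-style
# seq2seq layout: image encoder -> visual tokens, concatenated with
# prompt tokens, fed to an encoder-decoder that predicts text tokens.
import torch
import torch.nn as nn

class ToyFlorenceStyleModel(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256):
        super().__init__()
        # Stand-in image encoder: 16x16 patch embeddings from raw pixels.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        # Multi-modality encoder-decoder over the mixed token sequence.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, pixels, prompt_ids, target_ids):
        # (B, 3, H, W) -> (B, num_patches, d_model) visual tokens.
        vis = self.patch_embed(pixels).flatten(2).transpose(1, 2)
        txt = self.token_embed(prompt_ids)       # embedded prompt tokens
        src = torch.cat([vis, txt], dim=1)       # one mixed input sequence
        tgt = self.token_embed(target_ids)
        hidden = self.transformer(src, tgt)
        return self.lm_head(hidden)              # next-token logits

model = ToyFlorenceStyleModel()
logits = model(
    torch.randn(1, 3, 128, 128),        # image
    torch.randint(0, 1000, (1, 8)),     # prompt tokens
    torch.randint(0, 1000, (1, 16)),    # target text tokens
)
print(logits.shape)  # torch.Size([1, 16, 1000])
```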

All annotations in FLD-5B are standardized into textual outputs, enabling a unified multi-task learning approach using a consistent optimization objective. The result is a versatile vision foundation model capable of performing multiple tasks within a single model governed by a uniform set of parameters, activated through textual prompts—similar to how large language models operate.
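In practice, that means swapping the task token in the prompt is enough to steer the same weights toward a different job. Continuing the earlier loading snippet, the sketch below assumes the task tokens (such as <OD> for object detection) and the post_process_generation helper documented on the Hugging Face model card, which turns raw generated text, including location tokens, into structured output like boxes and labels.

```python
# Continuing the earlier snippet: same model and processor, different
# task tokens. Tokens and the post_process_generation helper follow the
# Hugging Face model card and are assumptions relative to this article.
for task in ["<CAPTION>", "<OD>", "<DENSE_REGION_CAPTION>"]:
    inputs = processor(text=task, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=512,
    )
    # Keep special tokens; the post-processor parses location tokens.
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed = processor.post_process_generation(
        raw, task=task, image_size=(image.width, image.height)
    )
    print(task, parsed)
```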

Florence-2 outperforms many larger models

When given image and text inputs, Florence-2 can handle tasks like object detection, captioning, visual grounding, and visual question answering, often with performance comparable to or better than many larger models. For example, in a zero-shot captioning test on the COCO dataset, both the 232-million- and 771-million-parameter versions of Florence-2 scored higher than DeepMind’s 80-billion-parameter Flamingo visual language model, posting scores of 133 and 135.6, respectively. They also outperformed Kosmos-2, Microsoft’s own visual-grounding-specific model.

Even when fine-tuned with publicly available human-annotated data, Florence-2, despite its smaller size, competes closely with much larger specialist models on tasks like visual question answering. Its pre-trained backbone also enhances performance on downstream tasks such as COCO object detection and ADE20K semantic segmentation, surpassing both supervised and self-supervised models. Compared with models pre-trained on ImageNet, it improves training efficiency fourfold and delivers gains of 6.9, 5.5, and 5.9 points across the COCO and ADE20K benchmarks.

Currently, both pre-trained and fine-tuned versions of Florence-2, in its 232-million- and 771-million-parameter configurations, are available on Hugging Face under an MIT license, allowing unrestricted distribution and modification for commercial or private use.

Developers are expected to find innovative ways to use the model, potentially reducing the need to maintain separate vision models for different tasks. Replacing several task-specific models with a single small, task-agnostic one could significantly lower compute costs and simplify the development process.