Mistral AI has officially launched Pixtral 12B, the company's first multimodal model, designed to handle both text and image data. The model is released under the Apache 2.0 license, according to Mistral AI.
Key Features of Pixtral 12B
Pixtral 12B is natively multimodal, trained on interleaved image and text data. The model incorporates a new 400M-parameter vision encoder and a 12B-parameter multimodal decoder based on Mistral Nemo. This architecture allows it to support variable image sizes and aspect ratios and to process multiple images within its 128K-token context window.
Performance-wise, Pixtral 12B excels in multimodal tasks and maintains state-of-the-art performance on text-only benchmarks. It has achieved a 52.5% score on the MMMU reasoning benchmark, surpassing several larger models.
Performance and Evaluation
Pixtral 12B was designed as a drop-in replacement for Mistral Nemo 12B, delivering best-in-class multimodal reasoning without compromising on text capabilities such as instruction following, coding, and math. The model was evaluated with a consistent evaluation harness across various datasets, and it outperforms both open and closed models such as Claude 3 Haiku. Notably, Pixtral even matches or exceeds the performance of larger models like LLaVA-OneVision 72B on multimodal benchmarks.
Pixtral particularly excels at instruction following, showing a 20% relative improvement over the nearest open-source model on text IF-Eval and MT-Bench. It also performs strongly on multimodal instruction-following benchmarks, outperforming models such as Qwen2-VL 7B and Phi-3.5 Vision.
Architecture and Capabilities
The architecture of Pixtral 12B is designed to optimize for both speed and performance. The vision encoder tokenizes images at their native resolution and aspect ratio, converting them into image tokens for each 16×16 patch in the image. These tokens are then flattened to create a sequence, with [IMG BREAK] and [IMG END] tokens added between rows and at the end of the image. This allows the model to accurately understand complex diagrams and documents while providing fast inference speeds for smaller images.
Pixtral’s final architecture comprises two components: the Vision Encoder and the Multimodal Transformer Decoder. The model is trained to predict the next text token on interleaved image and text data, allowing it to process any number of images with arbitrary sizes in its large context window of 128K tokens.
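As a rough illustration of the tokenization scheme described above, the sketch below estimates the token layout for an image of a given size. The 16×16 patch size and the [IMG BREAK]/[IMG END] tokens come from the description above; rounding partial patches up and placing [IMG END] in place of the final row break are assumptions made for illustration, not confirmed implementation details.

```python
import math

def pixtral_image_token_layout(width: int, height: int, patch: int = 16):
    """Illustrative sketch: one token per 16x16 patch, an [IMG BREAK]
    token between rows of patches, and a single [IMG END] token at the
    end of the image, as described in the architecture section."""
    cols = math.ceil(width / patch)   # patches per row (assumed ceiling rounding)
    rows = math.ceil(height / patch)  # rows of patches
    # one token per patch, plus a break token between rows, plus the end token
    total_tokens = rows * cols + (rows - 1) + 1
    return {"patch_rows": rows, "patch_cols": cols, "total_tokens": total_tokens}

# Example: a 1024x768 chart image
print(pixtral_image_token_layout(1024, 768))
# {'patch_rows': 48, 'patch_cols': 64, 'total_tokens': 3120}
```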
Practical Applications
Pixtral 12B has shown exceptional performance in various practical applications, including reasoning over complex figures, chart understanding, and multi-image instruction following. For example, it can combine information from multiple tables into a single markdown table or generate HTML code to create a website based on an image prompt.
How to Access Pixtral
Users can easily try Pixtral via Le Chat, Mistral AI’s conversational chat interface, or through La Plateforme, which allows integration via API calls. Detailed documentation is available for those interested in leveraging Pixtral’s capabilities in their applications.
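As a minimal sketch of an API call through La Plateforme, the example below uses Mistral's Python client to send a text prompt together with an image URL. The client version, the model identifier pixtral-12b-2409, the message format, and the example URL are assumptions based on Mistral's published API conventions; consult the official documentation for the current interface.

```python
import os
from mistralai import Mistral  # assumes the mistralai Python SDK (v1.x)

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Multimodal chat request: text plus an image URL in a single user message.
# The model name "pixtral-12b-2409" is taken from Mistral's docs and may change.
response = client.chat.complete(
    model="pixtral-12b-2409",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the trend shown in this chart."},
                {"type": "image_url", "image_url": "https://example.com/chart.png"},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```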
For those who prefer running Pixtral locally, the model can be accessed through the mistral-inference library or the vLLM library, which offers higher serving throughput. Detailed instructions for setup and usage are provided in the documentation.
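For local serving, the sketch below uses vLLM's offline chat interface. It assumes a recent vLLM release with Pixtral support and the mistralai/Pixtral-12B-2409 weights from Hugging Face; option names and defaults may differ across versions, so treat this as a starting point rather than the definitive setup.

```python
from vllm import LLM
from vllm.sampling_params import SamplingParams

# Load Pixtral weights with the Mistral-format tokenizer; requires a vLLM
# build with multimodal (image) support and enough GPU memory for a 12B model.
llm = LLM(model="mistralai/Pixtral-12B-2409", tokenizer_mode="mistral")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/diagram.png"}},
        ],
    }
]

# llm.chat applies the chat template and handles the image input.
outputs = llm.chat(messages, sampling_params=SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```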