Where Knowledge Meets Innovation

Pixtral 12B 24.09

Name: Pixtral 12B 24.09 Introduction Video
Uploaded: 2025-04-27T06:19:03Z
Duration: 1 min 33 s
Description: Multimodal AI for image-text tasks with variable image support and 128K context

Free plan available

Multimodal AI for image-text tasks with variable image support and 128K context

Multimodal processing

Variable image support

128k context window

Text-image integration

Long-form analysis

Claim Offer

Try AI Agent

About Pixtral 12B 24.09

Launched Jan 22, 2025

Introduction Video

Description

Multimodal AI for image-text tasks with variable image support and 128K context

Pixtral-12B-2409 is a 12-billion-parameter multimodal model by Mistral AI, combining a 12B-parameter text decoder with a 400M-parameter vision encoder. It processes interleaved text and images natively, supporting variable image sizes and a 128K-token context window for long-form document analysis or multi-image workflows. The model excels in tasks like chart understanding, OCR, and multilingual reasoning, outperforming similar-sized open models (e.g., Qwen2-VL 7B, LLaVA-OV 7B) and even larger models like Llama-3.2 90B in benchmarks like MMMU (52.5%) and MathVista (58.0%)

Pixtral 12B 24.09 Key Features

128K Context Window: Handles long documents or multi-image inputs.
Variable Image Support: Processes images at native resolution and aspect ratio via a vision encoder.
Multilingual & Code Capabilities: Supports 80+ coding languages and nuanced multilingual understanding.
Open Source: Apache 2.0 license for free modification and deployment.
High Accuracy: Outperforms Claude 3 Haiku and Gemini-1.5 Flash 8B in multimodal benchmarks.
Vision-to-Code: Generates HTML/CSS from sketches or diagrams

Pixtral 12B 24.09 Use Cases

Image Captioning & OCR: Generate descriptions or extract text from images/documents.
Data Analysis: Convert charts to Markdown tables or interactive dashboards.
Document QA: Answer questions from technical manuals or financial reports.
Academic Research: Summarize papers or analyze scientific diagrams.
Automation: Integrate with workflows for invoice processing or customer support

Pros

Supports interleaved text and images, suitable for complex tasks.
High 128K-token context window for long-form documents and multi-image workflows.
Outperforms similar-sized and larger models in benchmarks like MMMU and MathVista.
Effective at chart understanding, OCR, and multilingual reasoning.
Variable image support enhances versatility in image processing tasks.

Cons

May require significant computational resources due to its 12-billion parameters.
Potentially complex integration into existing workflows due to its multimodal nature.
Limited information on real-world application performance from user reviews.