French AI startup Mistral has dropped its first multimodal model, Pixtral 12B, capable of processing both images and text.
The 12-billion-parameter model, built on Mistral’s existing text-based model Nemo 12B, is designed for tasks like captioning images, identifying objects, and answering image-related queries.
Weighing in at 24GB, the model is available for free under the Apache 2.0 license, which lets anyone use, modify, or commercialize it, subject only to the license’s attribution and notice requirements. Developers can download it from GitHub and Hugging Face, but functional web demos aren’t live yet.
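As a rough sanity check, the 24GB figure is what one would expect if each of the 12 billion parameters is stored as a 16-bit (2-byte) value. The storage precision is an assumption here, not something the announcement states, and the helper name below is purely illustrative:

```python
def checkpoint_size_gb(num_params: int, bytes_per_param: int) -> float:
    """Estimate raw weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: 12 billion parameters stored at 16-bit precision (2 bytes each).
size = checkpoint_size_gb(12_000_000_000, 2)
print(f"Estimated size: {size:.0f} GB")
```

The estimate ignores file metadata and the tokenizer, so the actual download may differ slightly.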
According to Mistral’s head of developer relations, Pixtral 12B will soon be integrated into the company’s chatbot, Le Chat, and its API platform, La Plateforme.
Multimodal models like Pixtral 12B could be the next frontier for generative AI, following in the footsteps of tools like OpenAI’s GPT-4 and Anthropic’s Claude. However, questions loom over the data sources used to train these models. As noted by TechCrunch, Mistral, like many AI firms, likely trained Pixtral 12B using vast quantities of publicly available web data, a practice that has sparked lawsuits from copyright holders challenging the “fair use” argument often made by tech companies.
The release follows Mistral’s $645 million funding round, which pushed its valuation to $6 billion. With Microsoft among its backers, Mistral is positioning itself as Europe’s answer to OpenAI.