119B parameters, 6B active, Apache 2.0 licensed. Reasoning + Vision + Code in a single deployment.
Mistral has taken a contrarian approach with Small 4. While competitors scale up to hundreds of billions of dense parameters, Mistral built a 119B mixture-of-experts model where only 6B parameters activate per forward pass. The result: a model that's fast, cheap to run, and handles reasoning, vision, and code in a single architecture. Is the unified approach better than specialized models?
Mistral Small 4's headline innovation is consolidation. Instead of maintaining three separate model deployments for different tasks, everything runs through one architecture. The 128-expert MoE design routes each token through the 4 most relevant experts, keeping the active parameter count at just 6B.
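To make the routing concrete, here is a minimal sketch of top-k mixture-of-experts routing in plain Python. This is illustrative only, not Mistral's implementation: the function names, shapes, and the use of simple lists are all assumptions for readability. The key property it demonstrates is that only the selected experts run, so per-token compute scales with `top_k` (4 in Small 4), not with the total expert count (128).

```python
import math

def moe_route(token_hidden, router_weights, experts, top_k=4):
    # Score every expert with a linear router, then keep only the top_k.
    logits = [sum(w * h for w, h in zip(row, token_hidden))
              for row in router_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i])[-top_k:]

    # Softmax over the selected experts' scores gives the gating weights.
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    total = sum(exps)
    gates = [e / total for e in exps]

    # Only the chosen experts execute; the rest of the 128 stay idle.
    out = [0.0] * len(token_hidden)
    for g, i in zip(gates, top):
        for j, v in enumerate(experts[i](token_hidden)):
            out[j] += g * v
    return out

# Toy usage: 8 stand-in identity "experts" and a deterministic router.
experts = [lambda x: x for _ in range(8)]
router = [[float(i)] * 4 for i in range(8)]
hidden = [1.0, 1.0, 1.0, 1.0]
routed = moe_route(hidden, router, experts, top_k=2)
```

Because the gates sum to 1 and the toy experts are identities, the output here equals the input; in a real model each expert is a distinct feed-forward network.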
This has real operational benefits. One model to deploy, one set of weights to store, one inference pipeline to maintain. For teams running self-hosted AI, this cuts infrastructure complexity significantly.
The 256K context window is generous for an open-weight model, and the configurable reasoning intensity lets you trade speed for quality on a per-request basis — high intensity for complex problems, low intensity for quick tasks.
Small 4's performance is competitive across the board rather than leading in any single category. On the Intelligence Index, it sits in the mid-tier — strong enough for production use, but below the proprietary frontier models.
Coding performance benefits from the Devstral heritage, producing clean, well-structured code across multiple languages. Vision capabilities from Pixtral allow it to process images, diagrams, and screenshots — useful for applications that need multimodal input without a separate vision model.
Reasoning from the Magistral lineage handles complex analytical tasks well, and the configurable intensity lets you decide how much compute each request spends on thinking.
The practical impact of 6B active parameters is dramatic. Inference throughput is 3x higher than Mistral's previous generation, and end-to-end completion times are 40% faster. On self-hosted hardware, that means roughly three times the concurrent users or batch jobs on the same GPUs.
For teams evaluating the cost of self-hosting vs API access, this throughput advantage changes the economics. A single high-end GPU serving Small 4 can handle workloads that would cost hundreds per month through API providers.
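A quick back-of-envelope helper makes the economics tangible. All the dollar figures below are placeholder assumptions, not measured prices; plug in your own GPU cost, hosting overhead, and API bill.

```python
def breakeven_months(gpu_cost, monthly_hosting, api_monthly):
    # Months until self-hosting beats API access: upfront hardware cost
    # divided by the monthly saving. Returns infinity if the API is
    # actually cheaper per month.
    saving = api_monthly - monthly_hosting
    return gpu_cost / saving if saving > 0 else float("inf")

# Assumed figures: $8,000 GPU, $150/month power and hosting, and a
# workload that would cost $600/month through an API provider.
months = breakeven_months(8000, 150, 600)  # ≈ 17.8 months to break even
```

The 3x throughput improvement shifts this calculation directly: tripling the workload a single GPU can serve triples the effective `api_monthly` you're displacing, which cuts the breakeven time by the same factor.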
Small 4 is ideal when you need: a single model deployment that handles multiple task types, efficient self-hosting on your own hardware, open-weight licensing for compliance or privacy, and good-enough quality across the board rather than best-in-class at one thing.
It's not the right choice when you need: maximum intelligence (use Qwen3.5 397B or a proprietary model), maximum coding performance (use Nemotron 3 Super or GPT-5.3 Codex), or maximum vision quality (use a dedicated vision model).
The sweet spot is mid-size teams that need AI across multiple use cases and want to consolidate their model infrastructure.
Tested on Mistral's recommended inference configuration. Performance compared against specialized models from the same generation. Throughput measurements on standardized hardware.
Mistral Small 4 is the best 'one model to rule them all' option for self-hosting. It won't beat specialized models at their strengths, but it handles reasoning, vision, and code competently in a single efficient deployment. The Apache 2.0 license and 3x throughput improvement make it a practical choice for teams that value simplicity and efficiency over peak performance.
Published May 11, 2026. Data updated daily from independent benchmarks and API providers.