Llama, Qwen, Mistral, and NVIDIA Nemotron lead the open-weight race. Here's what you can run on your own hardware.
Not everyone wants to send data to a third-party API. For companies with privacy requirements, researchers who need full control, and developers who want to run models locally, open-weight models are the answer. The quality gap between open and closed models has shrunk dramatically — the best open models now compete with last year's frontier. Here's what's worth self-hosting.
Open-weight models have made remarkable progress. The top open model (Qwen3.5 397B at 45.0 intelligence) would have been competitive with the best closed models from mid-2025. The gap to today's frontier (57.2) is about 12 points — significant but shrinking every quarter.
More importantly, for many practical tasks, the difference between 45 and 57 on the intelligence scale doesn't matter. Document processing, basic coding assistance, summarization, classification, and customer support all work well with open models. It's only on the hardest reasoning tasks that the gap becomes apparent.
The licensing landscape has also improved. Apache 2.0 is now standard for most major open releases, meaning you can use these models commercially without restrictions.
Alibaba's Qwen3.5 397B A17B is the intelligence leader among open-weight models at 45.0. It uses a mixture-of-experts architecture with only 17B active parameters during inference, making it more efficient than its total parameter count suggests.
The model supports reasoning mode (configurable intensity), 200+ languages, and a solid coding score of 41.3. Under Apache 2.0, it's fully open for commercial use.
The catch: at 397B total parameters, you need serious hardware to run it. Expect to need 4-8 high-end GPUs for efficient inference. For smaller deployments, Qwen3 32B offers a strong alternative that runs on a single GPU.
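As a back-of-envelope check on those hardware numbers, weight memory scales with parameter count times bits per weight. A minimal sketch, assuming a flat ~20% overhead for KV cache and runtime buffers (the overhead factor is an assumption, not a measured figure):

```python
def weight_memory_gb(params_billions: float, bits_per_param: float,
                     overhead: float = 1.2) -> float:
    """Rough VRAM needed for model weights, with ~20% headroom
    for KV cache, activations, and framework buffers (assumed)."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total * overhead / 1e9

qwen_large = weight_memory_gb(397, 4)  # ~238 GB at 4-bit
qwen_small = weight_memory_gb(32, 4)   # ~19 GB at 4-bit
```

At roughly 238 GB of 4-bit weights, the 397B model needs at least three 80 GB GPUs before accounting for long contexts, consistent with the multi-GPU estimate above, while the 32B alternative fits under a single 24 GB card.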
Nemotron 3 Super uses a Mamba-Transformer hybrid MoE architecture — 120B total parameters but only 12B active. This gives it up to 7.5x higher inference throughput than comparable dense models.
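The compute side of that throughput claim can be sketched with the standard approximation that a decoder forward pass costs about 2 FLOPs per active parameter per token (a rule of thumb, not an exact figure):

```python
def flops_per_token(active_params_billions: float) -> float:
    """Rough forward-pass compute per token: ~2 FLOPs per active
    parameter, the standard decoder-only approximation."""
    return 2 * active_params_billions * 1e9

dense_120b = flops_per_token(120)      # a dense 120B uses all weights per token
moe_12b = flops_per_token(12)          # the MoE activates only 12B per token
speedup_bound = dense_120b / moe_12b   # 10x upper bound on compute savings
```

The 10x figure is an upper bound on compute savings; realized throughput (the 7.5x above) is lower because all 120B weights must still be resident in memory and expert routing adds overhead.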
It leads open-weight models on SWE-bench Verified at 60.47% and has a 1M token context window. The Mamba architecture enables particularly efficient long-context inference, making it ideal for processing large codebases or document collections.
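To see why long-context efficiency matters at 1M tokens, here is a rough sketch of what a pure-attention KV cache would cost. The layer and head counts below are hypothetical, not Nemotron's actual configuration:

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_val: int = 2) -> float:
    """Attention KV cache size: two tensors (K and V) stored
    per layer per token, at fp16 (2 bytes per value)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_val * tokens / 1e9

# Hypothetical 40-layer transformer with grouped-query attention
# (8 KV heads x 128 dims) at the full 1M-token window:
full_attention_cache = kv_cache_gb(1_000_000, 40, 8, 128)  # ~164 GB
```

Mamba layers replace that per-token cache with a fixed-size recurrent state, which is the main reason the hybrid stays efficient at this window size.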
Released under a permissive license at GTC 2026, Nemotron 3 Super is NVIDIA's answer to the question of how to make frontier-quality models run efficiently on their hardware.
Mistral Small 4 takes a different approach: instead of maximizing any single capability, it unifies reasoning (Magistral), vision (Pixtral), and code (Devstral) into a single 119B MoE model with just 6B active parameters.
The result is a model you can deploy once that handles text, images, and code. No need to maintain three separate model deployments. With a 256K context window and Apache 2.0 licensing, it's the most versatile open-weight model for teams that want a single self-hosted solution.
Performance is strong across the board rather than exceptional in any one area. If you need peak coding performance, Nemotron is better. If you need peak intelligence, Qwen3.5 is better. If you need one model that does everything, Mistral Small 4 is the choice.
Not everyone has a GPU cluster. Here are the best options that run on a single consumer GPU (24GB VRAM):
Qwen3 32B: Strong reasoning and coding capability. Runs comfortably on an RTX 4090 with 4-bit quantization.
Mistral Small 3.1 24B: Good all-around performance with vision capability. Fits on a single 24GB GPU.
Llama 3 8B variants: The smallest capable models. Run on an RTX 3090 or even Apple M-series Macs with 32GB unified memory.
For the smallest hardware (16GB), Qwen3 8B and NVIDIA Nemotron Nano 9B offer surprisingly capable performance for their size.
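Applying the same kind of rough sizing to the consumer picks above, assuming 4-bit quantization with ~20% runtime overhead (both assumptions, not measurements; real usage varies with context length and inference stack):

```python
# Approximate 4-bit weight footprints: params * 0.5 bytes, plus
# ~20% overhead (assumed) for KV cache and runtime buffers.
MODELS_B = {
    "Qwen3 32B": 32,
    "Mistral Small 3.1 24B": 24,
    "Llama 3 8B": 8,
    "Nemotron Nano 9B": 9,
}

sizes_gb = {name: round(params * 0.5 * 1.2, 1)
            for name, params in MODELS_B.items()}
```

Qwen3 32B lands around 19 GB, inside a 24 GB card's budget, while the 8-9B models stay well under 16 GB.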
All models tested through their official weights and recommended inference configurations. Benchmark scores from Artificial Analysis. Hardware requirements assume half-precision inference, or common quantization formats where noted.
Qwen3.5 397B for maximum open-weight intelligence. Nemotron 3 Super for efficient inference at scale. Mistral Small 4 for a versatile all-in-one deployment. Qwen3 32B for consumer hardware. The open-weight ecosystem is now strong enough for production use in most applications.
Published May 9, 2026. Data updated daily from independent benchmarks and API providers.