Xiaomi: MiMo-V2-Omni

ID: xiaomi/mimo-v2-omni

MiMo-V2-Omni is a frontier omni-modal model that natively processes image, video, and audio inputs within a unified architecture. It combines strong multimodal perception with agentic capabilities (visual grounding, multi-step planning, tool use, and code execution), which makes it well suited to complex real-world tasks that span modalities. The context window is 256K tokens, which the Specifications table lists as 262K (256 × 1,024 = 262,144).

Pricing per 1M Tokens

Input (Prompt)	$0.40
Output (Completion)	$2.00
Cache Read	$0.08
Cache Write	Free
Image	N/A
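
As a rough illustration of how the per-1M-token rates above turn into a per-request cost, here is a minimal Python sketch. The function name `estimate_cost` and its parameters are hypothetical, and it assumes that tokens served from cache are billed at the Cache Read rate instead of the Input rate; actual billing rules may differ.

```python
# Prices in USD per 1M tokens, copied from the pricing table above.
PRICE_PER_MILLION = {
    "input": 0.40,
    "output": 2.00,
    "cache_read": 0.08,
    "cache_write": 0.00,  # listed as "Free"
}


def estimate_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate the USD cost of one request.

    Assumption (not stated on the page): cached_tokens is the part of
    input_tokens served from cache, billed at the cache-read rate.
    """
    if min(input_tokens, output_tokens, cached_tokens) < 0:
        raise ValueError("token counts must be non-negative")
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")

    uncached_tokens = input_tokens - cached_tokens
    total_per_million = (
        uncached_tokens * PRICE_PER_MILLION["input"]
        + cached_tokens * PRICE_PER_MILLION["cache_read"]
        + output_tokens * PRICE_PER_MILLION["output"]
    )
    return total_per_million / 1_000_000


if __name__ == "__main__":
    # 100,000 prompt tokens (20,000 of them cached) and 5,000 completion tokens.
    cost = estimate_cost(100_000, 5_000, cached_tokens=20_000)
    print(f"${cost:.4f}")
```

Under these assumptions, output tokens dominate the bill for generation-heavy workloads, since the output rate is five times the input rate.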

Specifications

Context Length	262K
Max Output Tokens	66K
Input Modalities	Text + Audio + Image + Video
Output Modalities	Text
Tokenizer	Other
Instruct Type	N/A
Top Provider Context	262K
Top Provider Max Output	66K
Moderated	No
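
The context and output limits above can be used to budget a request before sending it. The following Python sketch is only an illustration: the helper `max_completion_budget` is hypothetical, the exact values 262,144 and 65,536 are assumed to be what the rounded "262K" and "66K" figures stand for, and it assumes the prompt and the completion share one context window.

```python
# Assumed exact values behind the rounded figures in the table above.
CONTEXT_LENGTH = 262_144      # "262K" context length (assumption: 256 * 1024)
MAX_OUTPUT_TOKENS = 65_536    # "66K" max output tokens (assumption: 64 * 1024)


def max_completion_budget(prompt_tokens: int) -> int:
    """Return how many completion tokens can still be requested.

    Assumption: prompt and completion tokens share the same context
    window, so the budget is capped by both the remaining context and
    the max-output limit.
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    remaining_context = CONTEXT_LENGTH - prompt_tokens
    return max(0, min(remaining_context, MAX_OUTPUT_TOKENS))


if __name__ == "__main__":
    for prompt_tokens in (1_000, 250_000, 300_000):
        print(prompt_tokens, max_completion_budget(prompt_tokens))
```

A short prompt is limited by the output cap, a long prompt by the remaining context, and a prompt larger than the window leaves no budget at all.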

More from Xiaomi

Model	Input (per 1M tokens)	Context
Xiaomi: MiMo-V2-Pro	$1.00	1.0M
Xiaomi: MiMo-V2-Flash	$0.09	262K

Last updated: March 23, 2026

First tracked: March 23, 2026