Xiaomi: MiMo-V2-Omni

ID: xiaomi/mimo-v2-omni

MiMo-V2-Omni is a frontier omni-modal model that natively processes image, video, and audio inputs, alongside text, within a unified architecture. It combines strong multimodal perception with agentic capabilities (visual grounding, multi-step planning, tool use, and code execution), making it well suited to complex real-world tasks that span modalities. It has a 256K-token context window, listed as 262K under Specifications.

Pricing per 1M Tokens

Input (Prompt): $0.40
Output (Completion): $2.00
Cache Read: $0.08
Cache Write: Free
Image: N/A
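
To make the per-million pricing concrete, here is a minimal sketch of a cost estimate using the rates listed above. The function name `estimate_cost` and its parameters are hypothetical, and it assumes that cached prompt tokens are billed at the Cache Read rate instead of the Input rate, which the listing does not spell out.

```python
from decimal import Decimal

# Listed prices in USD per 1M tokens, copied from the pricing list.
PRICE_PER_MILLION = {
    "input": Decimal("0.40"),
    "output": Decimal("2.00"),
    "cache_read": Decimal("0.08"),
    "cache_write": Decimal("0"),  # listed as "Free"
}


def estimate_cost(input_tokens=0, output_tokens=0,
                  cache_read_tokens=0, cache_write_tokens=0):
    """Estimate the USD cost of one request (hypothetical helper).

    Assumption: input_tokens counts only uncached prompt tokens, and
    cached prompt tokens are billed at the cache-read rate instead.
    """
    counts = {
        "input": input_tokens,
        "output": output_tokens,
        "cache_read": cache_read_tokens,
        "cache_write": cache_write_tokens,
    }
    if any(n < 0 for n in counts.values()):
        raise ValueError("token counts must be non-negative")
    # Price is per 1M tokens, so scale the weighted sum down by 1,000,000.
    total = sum(PRICE_PER_MILLION[k] * n for k, n in counts.items())
    return total / Decimal(1_000_000)
```

For example, `estimate_cost(input_tokens=100_000, output_tokens=2_000)` is 100,000 × $0.40/1M plus 2,000 × $2.00/1M, which comes to $0.044. `Decimal` is used here only to avoid floating-point rounding in currency arithmetic.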

Specifications

Context Length: 262K
Max Output Tokens: 66K
Input Modalities: Text + Audio + Image + Video
Output Modalities: Text
Tokenizer: Other
Instruct Type: N/A
Top Provider Context: 262K
Top Provider Max Output: 66K
Moderated: No
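
As a rough illustration of how the two limits interact, the sketch below computes how many completion tokens a request can still ask for. It rests on two assumptions not stated in the listing: that prompt and completion tokens share the same context window, and that the rounded figures 262K and 66K can stand in for the exact limits. The function name `max_completion_budget` is hypothetical.

```python
# Rounded figures from the Specifications list; the exact token
# limits are not given there, so treat these as approximations.
CONTEXT_LENGTH = 262_000
MAX_OUTPUT_TOKENS = 66_000


def max_completion_budget(prompt_tokens: int) -> int:
    """Largest completion size that still fits (hypothetical helper).

    Assumption: prompt and completion share one context window.
    Returns 0 when the prompt alone fills or exceeds the window.
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    remaining = CONTEXT_LENGTH - prompt_tokens
    # The completion is capped by both the output limit and the space left.
    return max(0, min(MAX_OUTPUT_TOKENS, remaining))
```

Under these assumptions, a 100,000-token prompt leaves room for the full 66,000-token output, while a 250,000-token prompt leaves only 12,000 tokens.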


Last updated: March 23, 2026

First tracked: March 23, 2026