ByteDance: UI-TARS 7B

ID: bytedance/ui-tars-1.5-7b

UI-TARS-1.5 is a multimodal vision-language agent from ByteDance, optimized for GUI-based environments: desktop interfaces, web browsers, mobile systems, and games. It extends the UI-TARS framework with reasoning trained through reinforcement learning, which lets it plan and execute actions robustly across virtual interfaces. The model achieves state-of-the-art results on a range of interactive and grounding benchmarks, including OSWorld, WebVoyager, AndroidWorld, and ScreenSpot. It also demonstrates perfect task completion across diverse Poki games and outperforms prior models on Minecraft agent tasks. UI-TARS-1.5 supports thought decomposition during inference and scales well across variants, with the 1.5 version notably exceeding the performance of the earlier 72B and 7B checkpoints.

Pricing per 1M Tokens

Input (Prompt)	$0.10
Output (Completion)	$0.20
Cache Read	Free
Cache Write	Free
Image	N/A
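
As a rough illustration of how the per-1M-token rates above translate into a per-request cost, here is a minimal sketch in Python. The function name `estimate_cost_usd` and the example token counts are hypothetical. The sketch assumes cache reads and writes cost nothing (as listed) and ignores any separate image charge, since image pricing is listed as N/A.

```python
from decimal import Decimal

# Rates copied from the pricing table above, in USD per 1M tokens.
INPUT_USD_PER_M = Decimal("0.10")
OUTPUT_USD_PER_M = Decimal("0.20")
TOKENS_PER_M = Decimal(1_000_000)


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the cost of one request from its token counts (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (
        Decimal(input_tokens) * INPUT_USD_PER_M
        + Decimal(output_tokens) * OUTPUT_USD_PER_M
    )
    return total / TOKENS_PER_M


# Example: a 10,000-token prompt with a 500-token completion.
print(estimate_cost_usd(10_000, 500))
```

`Decimal` is used instead of `float` so that very small per-request amounts do not pick up binary rounding noise when summed over many requests.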

Specifications

Context Length	128K
Max Output Tokens	2K
Input Modalities	Image + Text
Output Modalities	Text
Tokenizer	Other
Instruct Type	N/A
Top Provider Context	128K
Top Provider Max Output	2K
Moderated	No
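
The limits above can be turned into a simple pre-flight check before sending a request. The sketch below rests on assumptions not stated in the listing: that "128K" and "2K" mean 128,000 and 2,000 tokens (a conservative reading), and that the prompt and the completion share the same context window. The helper name `max_prompt_tokens` is hypothetical.

```python
# Limits from the specifications table, read conservatively (assumption):
# "128K" as 128,000 tokens and "2K" as 2,000 tokens.
CONTEXT_LENGTH = 128_000
MAX_OUTPUT_TOKENS = 2_000


def max_prompt_tokens(requested_output_tokens: int) -> int:
    """Return the largest prompt size that still leaves room for the completion.

    Assumes prompt and completion tokens count against the same context window.
    """
    if not 0 < requested_output_tokens <= MAX_OUTPUT_TOKENS:
        raise ValueError(
            f"requested output must be between 1 and {MAX_OUTPUT_TOKENS} tokens"
        )
    return CONTEXT_LENGTH - requested_output_tokens


# Example: reserve the full output allowance and see what remains for the prompt.
print(max_prompt_tokens(MAX_OUTPUT_TOKENS))
```

Because screenshots are part of the input, the prompt budget has to cover image tokens as well as text; how many tokens an image consumes is not given in the listing.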

Last updated: March 23, 2026

First tracked: March 23, 2026