Baidu: ERNIE 4.5 VL 424B A47B
ERNIE-4.5-VL-424B-A47B is a multimodal Mixture-of-Experts (MoE) model from Baidu’s ERNIE 4.5 series, with 424B total parameters, of which 47B are active per token. It is trained jointly on text and image data, using a heterogeneous MoE architecture with modality-isolated routing, to support cross-modal reasoning, image understanding, and long-context generation (up to 131k tokens natively; this listing's context length is 123K). It is post-trained with supervised fine-tuning (SFT), Direct Preference Optimization (DPO), Unified Preference Optimization (UPO), and reinforcement learning with verifiable rewards (RLVR), and it supports both “thinking” and non-thinking inference modes. Designed for vision-language tasks in English and Chinese, it is built to scale efficiently and can run under 4-bit or 8-bit quantization.
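To put the parameter counts and quantization options in perspective, the following is a rough back-of-the-envelope sketch. It estimates weight storage only, assuming every parameter is stored at the stated bit width; real deployments also need memory for activations, the KV cache, and quantization metadata, and may keep some layers at higher precision. The function name `weight_memory_gb` is hypothetical and used only for illustration.

```python
# Parameter counts taken from the model description.
TOTAL_PARAMS = 424e9   # all experts combined
ACTIVE_PARAMS = 47e9   # parameters used per token


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes).

    Assumes a uniform bit width for every parameter, which is a simplification.
    """
    return num_params * bits_per_param / 8 / 1e9


# Fraction of the model that is active for any single token (roughly 11%).
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

if __name__ == "__main__":
    print(f"Active fraction per token: {active_fraction:.1%}")
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: ~{weight_memory_gb(TOTAL_PARAMS, bits):.0f} GB")
```

Under these assumptions, all 424B parameters must still be held in memory even though only 47B are used per token, which is why the lower bit widths matter for deployment.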
Pricing per 1M Tokens
| Item | Price (USD) |
|---|---|
| Input (Prompt) | $0.42 |
| Output (Completion) | $1.25 |
| Cache Read | Free |
| Cache Write | Free |
| Image | N/A |
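As a minimal sketch of how the listed per-1M-token prices translate into a per-request cost, the snippet below multiplies token counts by the input and output rates. It assumes billing is linear in token count, ignores image inputs (priced as N/A here), and treats cache reads and writes as free, as listed. The function name `estimate_cost_usd` is hypothetical.

```python
# Listed prices in USD per 1,000,000 tokens.
INPUT_PRICE_PER_M = 0.42
OUTPUT_PRICE_PER_M = 1.25


def estimate_cost_usd(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the cost of one request from its token counts.

    Assumes linear per-token billing with no image or cache charges.
    """
    if prompt_tokens < 0 or completion_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = prompt_tokens * INPUT_PRICE_PER_M + completion_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


if __name__ == "__main__":
    # Example: a 10,000-token prompt with a 2,000-token completion.
    print(f"${estimate_cost_usd(10_000, 2_000):.4f}")
```

For the example request, the input side contributes about $0.0042 and the output side about $0.0025, so output tokens dominate quickly for long completions because they cost roughly three times as much per token.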
Specifications
| Specification | Value |
|---|---|
| Context Length | 123K |
| Max Output Tokens | 16K |
| Input Modalities | Image + Text |
| Output Modalities | Text |
| Tokenizer | Other |
| Instruct Type | N/A |
| Top Provider Context | 123K |
| Top Provider Max Output | 16K |
| Moderated | No |
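The context and output limits above can be checked before sending a request. The sketch below is illustrative only and rests on several assumptions not stated in the listing: that "123K" and "16K" mean 123,000 and 16,000 tokens (they could instead be multiples of 1,024), that prompt and completion tokens share the same context window, and that image inputs have already been converted to a token count. The names `CONTEXT_LIMIT`, `MAX_OUTPUT_TOKENS`, and `max_completion_budget` are hypothetical.

```python
# Assumed interpretation of the listed limits (see the caveats above).
CONTEXT_LIMIT = 123_000
MAX_OUTPUT_TOKENS = 16_000


def max_completion_budget(prompt_tokens: int) -> int:
    """Return how many completion tokens can be requested for a given prompt.

    The budget is capped by both the output limit and the space left in the
    context window; it is 0 when the prompt alone fills or exceeds the window.
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    remaining = CONTEXT_LIMIT - prompt_tokens
    return max(0, min(MAX_OUTPUT_TOKENS, remaining))


if __name__ == "__main__":
    for prompt in (5_000, 115_000, 130_000):
        print(prompt, "->", max_completion_budget(prompt))
```

With these assumed values, short prompts are limited by the 16K output cap, while prompts longer than about 107,000 tokens are limited by the remaining context instead.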
Last updated: March 23, 2026
First tracked: March 23, 2026