Meta: Llama 3.2 11B Vision Instruct

Model ID: meta-llama/llama-3.2-11b-vision-instruct

Llama 3.2 11B Vision is an 11-billion-parameter multimodal model designed for tasks that combine visual and textual data, such as image captioning and visual question answering, bridging language generation and visual reasoning. Pre-trained on a large dataset of image-text pairs, it performs well on complex image-analysis tasks that demand high accuracy. Its integration of visual understanding with language processing suits applications such as content creation, AI-driven customer service, and research. See the [original model card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md) for details. Usage of this model is subject to [Meta's Acceptable Use Policy](https://www.llama.com/llama3/use-policy/).
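
For illustration, here is a minimal sketch of a visual question answering request against an OpenAI-compatible chat completions endpoint that routes this model ID. The base URL, API key, and image URL are placeholders, not part of this listing.

```python
# Minimal sketch: visual question answering with Llama 3.2 11B Vision Instruct
# through an OpenAI-compatible chat completions API. The base_url, api_key,
# and image URL below are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://example-gateway/api/v1",  # hypothetical gateway URL
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="meta-llama/llama-3.2-11b-vision-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
    max_tokens=512,
)

print(response.choices[0].message.content)
```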

Pricing per 1M Tokens

| Token Type | Price |
| --- | --- |
| Input (prompt) | $0.05 |
| Output (completion) | $0.05 |
| Cache read | Free |
| Cache write | Free |
| Image | N/A |
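
As a quick worked example of the rates above, a short sketch that estimates the cost of one request. The prices are taken from the table; the token counts are hypothetical.

```python
# Estimate request cost from the per-1M-token rates listed above.
INPUT_PRICE_PER_M = 0.05   # USD per 1M prompt tokens
OUTPUT_PRICE_PER_M = 0.05  # USD per 1M completion tokens

def request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Return the estimated USD cost of a single request."""
    return (prompt_tokens * INPUT_PRICE_PER_M
            + completion_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Hypothetical request: 4,000 prompt tokens in, 500 completion tokens out.
print(f"${request_cost(4_000, 500):.6f}")  # -> $0.000225
```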

Specifications

| Specification | Value |
| --- | --- |
| Context length | 131K |
| Max output tokens | 16K |
| Input modalities | Text + Image |
| Output modalities | Text |
| Tokenizer | Llama3 |
| Instruct type | llama3 |
| Top provider context | 131K |
| Top provider max output | 16K |
| Moderated | No |
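
As a sketch of what the context and output limits mean in practice, here is a small check that a request fits the window. It assumes the usual power-of-two readings of 131K = 131,072 and 16K = 16,384 tokens; the token counts in the usage lines are hypothetical.

```python
# Validate a request against the listed limits, assuming the usual
# power-of-two readings of the 131K and 16K figures above.
CONTEXT_LENGTH = 131_072    # "Context length: 131K"
MAX_OUTPUT_TOKENS = 16_384  # "Max output tokens: 16K"

def fits_limits(prompt_tokens: int, requested_output: int) -> bool:
    """True if the completion stays under the output cap and the prompt
    plus the requested completion fits inside the context window."""
    return (requested_output <= MAX_OUTPUT_TOKENS
            and prompt_tokens + requested_output <= CONTEXT_LENGTH)

print(fits_limits(120_000, 8_000))   # True: 128,000 <= 131,072
print(fits_limits(120_000, 20_000))  # False: exceeds the 16K output cap
```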

Last updated: March 23, 2026

First tracked: March 23, 2026