How to Use Multimodal LLMs: Working with Images, Video, and Audio
Multimodal LLMs can process images, video, audio, and documents alongside text, unlocking applications that text-only models cannot handle. From analyzing charts and diagrams to understanding video content to processing scanned documents, these capabilities are transforming how we interact with AI. This tutorial covers practical techniques for working with multimodal inputs across the leading AI models.
Step-by-Step Guide
Understand multimodal capabilities by model
Not all models handle all modalities equally well. GPT-5.2 excels at image understanding, document analysis, and chart interpretation. Gemini 3 Ultra leads in video and audio processing with native support for long video inputs. Claude Opus 4.6 provides detailed image analysis with strong faithfulness — it rarely hallucinates details that are not present. For image-only tasks, all three frontier models perform well. For video understanding, Gemini is significantly ahead. For document processing (invoices, forms, reports), GPT-5.2 and Claude are strongest. Audio processing is best handled by Whisper (transcription) or Gemini (native audio understanding). Choose your model based on the primary modality you need to process.
Prepare images for optimal AI analysis
Image quality directly affects analysis quality. For photos: ensure adequate lighting, focus, and resolution. Crop to the relevant area — sending a full room photo when you only need the whiteboard wastes tokens and can confuse the model. For documents: scan at 300 DPI minimum, ensure text is legible, and correct rotation. For charts and diagrams: ensure labels are readable and colors are distinguishable. Resize images appropriately: most models process images at internal resolutions of 512-2048 pixels. Sending a 4000x4000 image provides no benefit over 2048x2048 but may cost more tokens. For API usage, images can be passed as base64-encoded data or URLs. Base64 is more reliable for programmatic use, while URLs are simpler for one-off requests. Some providers support multiple images in a single request — useful for comparison tasks.
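The resizing and encoding steps above can be sketched in a few lines. This is a minimal example, not tied to any particular provider's API; the 2048-pixel cap follows the guidance in this section and should be adjusted to your model's documented limits.

```python
import base64

# Assumed cap: most models process images at internal resolutions
# of roughly 512-2048 pixels on the longest side.
MAX_SIDE = 2048

def target_size(width, height, max_side=MAX_SIDE):
    """Compute downscaled dimensions that preserve aspect ratio."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return round(width * scale), round(height * scale)

def to_base64(image_bytes):
    """Encode raw image bytes for embedding in an API request body."""
    return base64.b64encode(image_bytes).decode("ascii")
```

For example, `target_size(4000, 2000)` returns `(2048, 1024)`: the oversized image is scaled down without distorting its aspect ratio, saving tokens with no loss of usable detail.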
Write effective prompts for visual analysis
Generic prompts like 'What is in this image?' produce generic descriptions. Specific prompts get specific results. For chart analysis: 'Analyze this bar chart. Identify the top 3 categories by value, note any trends, and calculate the percentage difference between the highest and lowest values.' For document processing: 'Extract all line items from this invoice, including product name, quantity, unit price, and total. Format the results as a JSON array.' For photo analysis: 'Describe the architectural style of this building, estimate its era, and identify the construction materials visible.' Always specify the output format you need. Tell the model exactly what information to extract and how to present it. For complex images, guide the model's attention: 'Focus on the data in the bottom-right quadrant of this dashboard.'
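Specific extraction prompts like the invoice example above can be assembled programmatically so every document in a batch gets the same instructions. The function below is an illustrative sketch; the field names and wording are assumptions you would tailor to your documents.

```python
def build_extraction_prompt(fields, output_format="JSON array"):
    """Assemble a specific visual-analysis prompt from required fields.

    `fields` is a list of field names to extract; the phrasing mirrors
    the invoice example in this tutorial and is not tied to any API.
    """
    field_list = ", ".join(fields)
    return (
        f"Extract all line items from this invoice, including {field_list}. "
        f"Format the results as a {output_format}. "
        "If a field is missing or illegible, use null rather than guessing."
    )
```

The final instruction ("use null rather than guessing") is a common guard against hallucinated values in structured extraction.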
Process documents at scale
For batch document processing (invoices, forms, reports), build a pipeline that handles file intake, image preprocessing, API calls, and result storage. Convert multi-page PDFs to individual page images. Process each page with a targeted prompt for the information you need. For structured extraction, use JSON mode to get consistent output format across documents. Implement validation checks: verify extracted numbers sum correctly, dates are valid, and required fields are present. For high-volume processing, use async requests and batch APIs to maximize throughput. Monitor extraction accuracy on a sample of documents and iterate on your prompt when accuracy drops below your threshold. Consider fine-tuning for very specific document types that appear repeatedly with the same format.
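The validation checks described above (numbers sum correctly, dates are valid, required fields are present) might look like this for a hypothetical extracted-invoice dictionary. The field names are assumptions for illustration, not a fixed schema.

```python
from datetime import date

def validate_invoice(extraction):
    """Run sanity checks on an extracted-invoice dict.

    Returns a list of problems; an empty list means the extraction passed.
    """
    problems = []
    for field in ("invoice_date", "line_items", "total"):
        if field not in extraction:
            problems.append(f"missing field: {field}")
    # Verify that line items sum to the stated total (within rounding).
    items = extraction.get("line_items", [])
    computed = sum(i["quantity"] * i["unit_price"] for i in items)
    if "total" in extraction and abs(computed - extraction["total"]) > 0.01:
        problems.append(f"total mismatch: {computed} vs {extraction['total']}")
    # Verify the date parses as ISO 8601.
    try:
        date.fromisoformat(extraction.get("invoice_date", ""))
    except ValueError:
        problems.append("invalid invoice_date")
    return problems
```

Documents that fail any check can be re-processed with a refined prompt or routed to human review rather than written to storage as-is.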
Work with video and audio inputs
For video analysis with Gemini, you can upload video files directly and ask questions about their content. For other models, extract key frames at regular intervals (1 frame per second for short videos, 1 per 10 seconds for long ones) and process them as a sequence of images. Include timestamps with each frame for temporal reference. For audio, transcribe using Whisper or Gemini's audio capabilities, then process the transcript with any LLM. For tasks that require understanding tone, emotion, or non-verbal sounds, use Gemini's native audio processing which captures nuances beyond the words. Combine modalities when needed: process a presentation by extracting both slides (images) and speaker audio (transcript) and sending them together for comprehensive analysis.
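The frame-sampling rule above (1 frame per second for short videos, 1 per 10 seconds for long ones) can be sketched as follows; the 120-second cutoff between "short" and "long" is an assumption you should tune to your content and budget.

```python
def frame_timestamps(duration_s, short_threshold_s=120):
    """Choose frame-sampling timestamps (in seconds) for models
    without native video input.

    Samples 1 fps for videos up to `short_threshold_s`, and one frame
    per 10 seconds for longer ones; the threshold is an assumption.
    """
    interval = 1 if duration_s <= short_threshold_s else 10
    return list(range(0, int(duration_s), interval))
```

Each returned timestamp can be passed to a frame extractor (e.g. ffmpeg) and included in the prompt alongside its image, giving the model a temporal reference.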
Handle common challenges and edge cases
Multimodal models have specific failure modes to watch for. OCR errors: models sometimes misread characters, especially in handwritten text or unusual fonts. Always validate critical extracted text against the source image. Hallucinated text: models occasionally describe text that is not actually present in an image — cross-reference extracted content when accuracy matters. Resolution limitations: fine details in large images may be lost at the model's internal processing resolution. Color interpretation: models can struggle with subtle color differences, especially in charts. Multi-image confusion: when processing multiple images, clearly label which image each question refers to. For production applications, implement confidence scoring where you use the model to rate its own certainty about extractions, routing low-confidence items for human review.
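The confidence-routing idea above can be sketched as a simple two-queue split. Here each extraction is assumed to arrive as a `(data, confidence)` pair, with confidence being the model's self-rated certainty between 0 and 1; the 0.8 threshold is an assumption to calibrate against your own accuracy measurements.

```python
def route_extractions(extractions, threshold=0.8):
    """Split extractions into auto-accepted and human-review queues.

    `extractions` is an iterable of (data, confidence) pairs; items at
    or above `threshold` are accepted, the rest go to human review.
    """
    accepted, review = [], []
    for data, confidence in extractions:
        (accepted if confidence >= threshold else review).append(data)
    return accepted, review
```

Lowering the threshold reduces review workload at the cost of letting more uncertain extractions through; track the error rate of accepted items to find the right balance.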
Recommended AI Tools
Gemini
The strongest multimodal model with native video and audio understanding capabilities.
ChatGPT
GPT-5.2 offers excellent image analysis and document processing with wide API compatibility.
Claude
Claude's image analysis is detailed and highly faithful to the actual visual content.
Perplexity
Useful for researching best practices and comparing multimodal model capabilities.
Try This on Vincony.com
Upload an image or document to Vincony and compare how GPT-5.2, Gemini 3 Ultra, and Claude analyze it — side by side in a single interface. Find which model provides the most accurate extraction, the best chart interpretation, or the most detailed scene description for your specific content type. With 400+ models, you will find the optimal multimodal model for every task.
Free tier: 100 credits/month. Pro: $24.99/month with 400+ AI models.
Frequently Asked Questions
Which AI model is best for image analysis?
GPT-5.2 and Gemini 3 Ultra lead on most image understanding benchmarks. For document processing, Claude Opus provides highly detailed and faithful analysis. For video, Gemini is the clear leader. Test with your specific image types — performance varies significantly by content type.
How much do multimodal API calls cost?
Image inputs are priced based on resolution. A standard image costs roughly equivalent to 1,000-2,000 text tokens ($0.01-0.03 at frontier pricing). High-resolution images cost more. Video processing is proportionally more expensive. Optimize by resizing images and sampling video frames appropriately.
Can multimodal AI read handwritten text?
Yes, modern multimodal models handle handwritten text reasonably well for clearly written content. Accuracy decreases with poor handwriting, unusual scripts, or low image quality. For critical applications, treat AI handwriting recognition as a first pass and have humans verify the results.
More AI Tutorials
How to Write a Blog Post with AI in 2026
Learn how to write high-quality blog posts with AI step by step. Use ChatGPT, Claude, and Vincony to outline, draft, edit, and publish SEO-optimized articles faster.
How to Create AI Images from Text Prompts in 2026
Step-by-step guide to creating stunning AI images from text prompts. Master prompt engineering for Midjourney, DALL-E, FLUX, and other AI image generators.
How to Use AI for SEO Keyword Research in 2026
Master AI-powered SEO keyword research with this step-by-step guide. Learn to find high-value keywords, analyze search intent, and optimize content using AI tools.
How to Make Music with AI in 2026
Learn how to create music with AI from scratch. Step-by-step guide to generating songs, beats, and melodies using Suno, Udio, and other AI music generators.