Multi-Modal AI
AI models that can understand and generate multiple types of content: text, images, audio, video, and code.
A multi-modal AI model can process and produce more than just text. Modern multi-modal models accept images (screenshots, diagrams, photos), audio, video, and documents as input alongside text, and can generate images, code, or structured data in return.
For developers, multi-modal capabilities unlock workflows such as pasting a screenshot of a UI design and asking the AI to generate the matching code, sending an error screenshot instead of transcribing the message by hand, or uploading a database diagram and asking for the corresponding SQL schema (see the sketch below).
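As an illustration, here is a minimal sketch of the error-screenshot workflow using Anthropic's Python SDK. The file name, prompt, and model version are assumptions chosen for the example; other providers' vision APIs follow a similar pattern of mixing image and text blocks in a single message.

```python
import base64
import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()

# Hypothetical screenshot file; image inputs are sent as base64-encoded data.
with open("error_screenshot.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # assumed model version; use a current vision-capable model
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            # A single user turn can combine an image block and a text block.
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": "What is causing this error, and how do I fix it?",
                },
            ],
        }
    ],
)

print(message.content[0].text)
```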
Claude, GPT-4, and Gemini all support vision (image input). Some models also handle audio and video. As these capabilities mature, the gap between "describing what you want" and "showing what you want" continues to shrink.