Generative AI can take many kinds of inputs and produce many kinds of outputs—most commonly text, images, audio, video, and code. Some models work with a single modality (e.g., text in → text out), while newer multimodal models can combine inputs (like text + image) and generate one or more outputs (like an edited image plus an explanation).
A simple way to describe generative models is by their input → output format:
- Text → Text (chat, summarization, translation)
- Text → Image
- Text → Video
- Text → Code
- Text → Speech (text-to-speech)
- Speech → Text (speech recognition)
- Image → Text (captioning, “describe this image”)
- Image → Image (editing, inpainting, style transfer)
- Image + Text → Image (instruction-based image editing)
- Video → Text (video summarization)
- Text + Image → Video (video generation with a reference image)
This “input → output” framing is useful because it quickly communicates what a model can do and what kind of data you need to provide to use it.
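The framing above can be made concrete as a tiny data model: each model is described by the set of modalities it consumes and the set it produces, and a request is valid only if it supplies all required inputs. A minimal sketch (the `Modality`, `ModelSpec`, and example model names are hypothetical, not a real API):

```python
from dataclasses import dataclass
from enum import Enum, auto

class Modality(Enum):
    TEXT = auto()
    IMAGE = auto()
    AUDIO = auto()
    VIDEO = auto()
    CODE = auto()

@dataclass(frozen=True)
class ModelSpec:
    """Describes a model by its input -> output modalities."""
    name: str
    inputs: frozenset
    outputs: frozenset

    def accepts(self, provided):
        # A request is valid if it supplies every required input modality.
        return self.inputs <= frozenset(provided)

# Illustrative specs mirroring the list above (not real models):
captioner = ModelSpec("captioner",
                      inputs=frozenset({Modality.IMAGE}),
                      outputs=frozenset({Modality.TEXT}))

editor = ModelSpec("instruction-editor",
                   inputs=frozenset({Modality.IMAGE, Modality.TEXT}),
                   outputs=frozenset({Modality.IMAGE}))

captioner.accepts([Modality.IMAGE])                  # True
editor.accepts([Modality.IMAGE])                     # False: also needs TEXT
editor.accepts([Modality.IMAGE, Modality.TEXT])      # True
```

Modeling modalities as sets also captures multimodal cases naturally: “Image + Text → Image” is just a spec with two input modalities, and no special casing is needed.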
