Modality

Three Things AI Creates

Different outputs (text, images, or both) require different technologies.

Text is processed as word pieces (tokens); images are processed as grids of dots (pixels).

Text, image, and multimodal models each create content in their own way.
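To make the contrast between word pieces and pixels concrete, here is a minimal sketch. The function `toy_tokenize` and its fixed 4-character chunking are hypothetical stand-ins: real tokenizers use learned subword vocabularies, and real images have far more pixels.

```python
# Toy illustration only: real tokenizers use learned subword vocabularies,
# not fixed 4-character chunks.
def toy_tokenize(text: str) -> list[str]:
    """Split text into lowercase 'word pieces' of at most 4 characters."""
    pieces = []
    for word in text.lower().split():
        pieces.extend(word[i:i + 4] for i in range(0, len(word), 4))
    return pieces


# A 2x3 grayscale image: each number is one pixel's brightness (0-255).
toy_image = [
    [0, 128, 255],
    [255, 128, 0],
]

if __name__ == "__main__":
    print(toy_tokenize("Diffusion creates images"))
    print(len(toy_image), "rows x", len(toy_image[0]), "columns of pixels")
```

Text becomes a one-dimensional sequence of pieces, while an image is a two-dimensional grid of numbers, which is one reason the two are handled by different kinds of models.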

[Diagram: AI, with three output types: Text, Image, Multimodal]

Language Model

LLM = Next-token Prediction

An LLM is a model that predicts the probability of each candidate for the next token.

Generating a sentence means repeatedly choosing which token comes next.

The bar chart shows each candidate token's probability; one of them is then selected.

[Bar chart: Candidate Tokens, with one probability labeled 0.7]
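The predict-then-pick loop can be sketched in a few lines. Everything here is an assumption made for illustration: `toy_next_token_scores` is a hypothetical lookup table standing in for a trained model, and the tiny vocabulary and scores are invented.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores.values())
    exps = {tok: math.exp(s - top) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def toy_next_token_scores(context):
    """Hypothetical stand-in for a model: scores depend only on the last token."""
    table = {
        "the": {"cat": 2.0, "dog": 1.5, "sat": -1.0},
        "cat": {"sat": 2.5, "ran": 1.0, "the": -1.0},
        "sat": {"down": 2.0, "the": 0.5, "<end>": 1.0},
    }
    return table.get(context[-1], {"<end>": 0.0})


def generate(prompt, max_tokens=5, seed=0):
    """Repeatedly sample one next token according to its probability."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_tokens):
        probs = softmax(toy_next_token_scores(tokens))
        candidates = list(probs)
        weights = [probs[t] for t in candidates]
        next_token = rng.choices(candidates, weights=weights, k=1)[0]
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens


if __name__ == "__main__":
    print(softmax(toy_next_token_scores(["the"])))
    print(generate(["the"]))
```

Sampling by probability, rather than always taking the top bar, is one common choice; always picking the most likely token is another.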

Retrieval + Generation

RAG = Retrieval + Generation

RAG is a system that combines 'search' and 'generation'.

The model fills gaps in its knowledge by searching external documents.

Documents in the library are converted to vectors, and the passages most relevant to the question are retrieved.

Example retrieved passage: "Employees receive 15 days of paid vacation annually." (HR Policy Handbook [1])
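The retrieval step can be sketched as follows, under loud assumptions: `embed` uses plain word counts instead of a learned embedding model, the two extra library entries are invented for illustration, and `build_prompt` only shows how a retrieved passage might be handed to a generator.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'vector': word counts. Real systems use learned embeddings."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# The first entry comes from the example above; the other two are invented.
LIBRARY = [
    ("HR Policy Handbook", "Employees receive 15 days of paid vacation annually."),
    ("IT Guide (invented)", "Reset your password from the login page."),
    ("Office Guide (invented)", "The cafeteria opens at 8 am on weekdays."),
]


def retrieve(question, library=LIBRARY):
    """Return the (source, passage) pair most similar to the question."""
    q = embed(question)
    return max(library, key=lambda doc: cosine(q, embed(doc[1])))


def build_prompt(question):
    """Combine the retrieved passage with the question for the generator."""
    source, passage = retrieve(question)
    return f"Answer using this source.\n[1] {source}: {passage}\nQuestion: {question}"


if __name__ == "__main__":
    print(build_prompt("How many days of paid vacation do employees receive?"))
```

The generation half is left out on purpose: the prompt built here would be passed to a language model, which answers using the cited passage.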
Image Generation

Diffusion = Denoising Process

Diffusion gradually turns random noise into a clear image.

Instead of producing the image in one pass, it refines it over multiple steps.

The timeline shows how noise is progressively cleaned up step by step.

[Timeline: denoising steps, labeled 20]
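The step-by-step cleanup can be illustrated with a toy loop. This is not a real diffusion model: a real one uses a trained network to estimate the noise, whereas the hypothetical `toy_denoise` below cheats by already knowing the clean image. The default of 20 steps and the 4-pixel "image" are assumed values chosen for the example.

```python
import random


def toy_denoise(clean, steps=20, seed=0):
    """Start from pure noise and move toward `clean` a little at each step.

    Returns the image after every step, starting with the initial noise.
    """
    rng = random.Random(seed)
    image = [rng.gauss(0.0, 1.0) for _ in clean]  # pure random noise
    history = [list(image)]
    for step in range(steps):
        remaining = steps - step
        # Close 1/remaining of the gap, so the final step lands on the target.
        image = [x + (c - x) / remaining for x, c in zip(image, clean)]
        history.append(list(image))
    return history


def error(image, clean):
    """Mean absolute difference between an image and the clean target."""
    return sum(abs(x - c) for x, c in zip(image, clean)) / len(clean)


if __name__ == "__main__":
    clean = [0.0, 0.5, 1.0, 0.5]  # a tiny 4-pixel "image"
    history = toy_denoise(clean)
    for step in (0, 5, 10, 15, 20):
        print(f"step {step:2d}: error = {error(history[step], clean):.3f}")
```

Printing the error at a few steps shows the same pattern as the timeline: the result gets cleaner gradually rather than appearing all at once.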