Tokenization: Word Splitting
Text is converted into token IDs that the model can understand.
Human language must be converted to numeric IDs for computers to process.
Input text is split into token cards, each with a unique ID.
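The splitting-and-numbering step above can be sketched with a toy word-level tokenizer (real models use learned subword vocabularies, but the ID-assignment idea is the same; the `tokenize` function and vocabulary here are illustrative, not from any real library):

```python
# Toy tokenizer: split on whitespace and give each new word the next free ID.
def tokenize(text, vocab):
    ids = []
    for word in text.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)  # first time seen: mint a new ID
        ids.append(vocab[word])
    return ids

vocab = {}
ids = tokenize("The cat sat on the mat", vocab)
print(ids)  # → [0, 1, 2, 3, 0, 4] -- "the" maps to the same ID both times
```

Note how the repeated word "the" gets the same ID both times: the model sees numbers, not letters.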
Embedding: Word Coordinates
Each word piece is converted into a number array (vector) that holds meaning.
A single number can't express meaning, so we bundle hundreds together.
Like looking up a word in a dictionary, each token fetches its number array.
Typical models use vectors with 768 to 4096 dimensions
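The dictionary-lookup idea can be sketched as a plain table indexed by token ID (a toy with random vectors and a tiny dimension; real embedding tables are learned during training):

```python
import random

random.seed(0)  # deterministic toy weights

DIM = 8          # toy size; production models use 768-4096 dimensions
VOCAB_SIZE = 100

# One vector per token ID: the "word coordinates" table.
embedding_table = [[random.gauss(0.0, 1.0) for _ in range(DIM)]
                   for _ in range(VOCAB_SIZE)]

def embed(token_ids):
    # Embedding is just a lookup: token ID -> its vector.
    return [embedding_table[t] for t in token_ids]

vectors = embed([3, 7, 3])  # same ID -> same vector, like a dictionary entry
```

The same token ID always fetches the same vector, which is exactly the dictionary analogy above.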
Positional Encoding: Seat Numbers
Transformers don't know word order, so position info is added.
The same word can mean different things depending on where it appears.
Shuffled words receive position tags and get ordered correctly.
A math formula gives each position a unique tag
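One such formula is the sinusoidal encoding from the original Transformer: even indices use sine, odd indices use cosine, at different wavelengths, so every position gets a distinct tag (a minimal sketch, not an optimized implementation):

```python
import math

def positional_encoding(pos, dim):
    # Even indices: sin, odd indices: cos, each pair at a longer wavelength.
    tag = []
    for i in range(dim):
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        tag.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return tag

p0 = positional_encoding(0, 8)  # → [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
p1 = positional_encoding(1, 8)  # a different tag for the next seat
```

The tag is simply added to the token's embedding vector, so "word meaning" and "seat number" travel together.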
Self-Attention: Reference Lines
Each word decides how much to reference other words.
To understand context, information must be gathered from surrounding words.
Click a word to see its attention weights:
Line thickness represents the attention weight magnitude
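The weights behind those lines come from scaled dot-product attention: score each pair of words, softmax the scores into weights, then average the value vectors (a toy single-query sketch without the learned Q/K/V projections):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    # Score = dot(query, key) / sqrt(d): how much to "reference" each word.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)  # these weights are the line thicknesses
    # Output = weighted average of the value vectors.
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]
    return out, weights

out, weights = attention([1.0, 0.0],
                         [[1.0, 0.0], [0.0, 1.0]],
                         [[1.0, 2.0], [3.0, 4.0]])
```

Because the query matches the first key, the first weight is larger: that word gets the thicker line.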
Multi-Head: Multiple Perspectives
Multiple 'heads' view word relationships from different angles.
A single perspective isn't enough to understand complex language.
Four heads each focus on different word relationships.
Focuses on syntactic structure
Captures semantic similarity
Tracks positional relationships
Identifies long-range connections
All head outputs are combined into a rich representation
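Mechanically, each head runs the same attention over its own slice of the vector, and the outputs are concatenated back together (a sketch: real models give each head a learned projection rather than a plain slice):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    w = softmax(scores)
    return [sum(wi * v[i] for wi, v in zip(w, values)) for i in range(len(values[0]))]

def multi_head_attention(query, keys, values, n_heads):
    # Each head attends over its own slice of the dimensions,
    # giving it an independent "perspective" on the same words.
    d = len(query)
    head_dim = d // n_heads
    combined = []
    for h in range(n_heads):
        s = slice(h * head_dim, (h + 1) * head_dim)
        head_out = attention(query[s], [k[s] for k in keys], [v[s] for v in values])
        combined.extend(head_out)  # concatenate all head outputs
    return combined

out = multi_head_attention([1.0, 0.0, 0.0, 1.0],
                           [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]],
                           [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]],
                           n_heads=2)
```

Each head can weight the same words differently, and concatenation is what "combines all head outputs into a rich representation".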
Shortcut + Normalization: Information Highway
Original information is preserved while values are kept from exploding.
Prevents information from vanishing or exploding in deep networks.
Shortcuts preserve original info; normalization keeps values stable.
Original input is added to output to prevent information loss
Scales values to a stable range for training
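Both ideas fit in a few lines: the shortcut adds the input back to the sublayer's output, and layer normalization rescales the result to mean 0 and variance 1 (a minimal sketch without the learned scale/shift parameters real LayerNorm adds):

```python
import math

def layer_norm(x, eps=1e-5):
    # Rescale to mean 0, variance 1 so values stay in a stable range.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    # Shortcut: add the original input back before normalizing,
    # so the original information is never fully lost.
    out = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, out)])

# Even if the sublayer multiplies values by 10, the output stays tame.
y = residual_block([1.0, 2.0, 3.0, 4.0], lambda x: [v * 10.0 for v in x])
```

However large the sublayer's output grows, normalization pulls the result back to a stable range, which is what keeps deep stacks trainable.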
Feed-Forward Network: Thinking Time
Expand → bend → compress to refine information.
A 'thinking' stage that deeply processes information gathered by attention.
Information spreads out, gets bent by a nonlinearity, then shrinks back down.
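The expand-bend-compress pipeline is two matrix multiplications with a ReLU between them (toy sizes and random weights here; the 4x expansion ratio is the typical choice in real models):

```python
import random

random.seed(0)

def feed_forward(x, w_up, w_down):
    # Expand: project to a wider space (typically 4x the model dimension).
    hidden = [sum(xi * w for xi, w in zip(x, row)) for row in w_up]
    # Bend: ReLU nonlinearity. Without it, expand + compress would
    # collapse back into a single linear map and add nothing.
    hidden = [max(0.0, h) for h in hidden]
    # Compress: project back down to the original size.
    return [sum(hi * w for hi, w in zip(hidden, row)) for row in w_down]

DIM, HIDDEN = 4, 16  # toy sizes with the typical 4x expansion
w_up = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(HIDDEN)]
w_down = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(DIM)]
y = feed_forward([1.0, -0.5, 0.3, 0.9], w_up, w_down)
```

The "bend" (ReLU) is the whole point: it is what lets the network compute things a purely linear layer cannot.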
Output: Creativity Dial
Temperature and Top-k control the 'creativity' of next token selection.
How you pick from the candidates changes the result.
Move the sliders to see how probability distribution and filtering change.
Balanced selection (natural output)
Moderate candidate pool (balanced)
Temperature sharpens or flattens the distribution; Top-k limits the candidate count
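Both dials can be sketched in one sampling function: temperature divides the logits before softmax (low T sharpens, high T flattens), and top-k discards all but the k most likely candidates (a toy sketch; function and variable names are illustrative):

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None):
    # Temperature: divide logits before softmax.
    # T < 1 sharpens the distribution (safer); T > 1 flattens it (more creative).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    probs = [math.exp(s - m) for s in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    # Top-k: keep only the k most likely candidates.
    ranked = sorted(enumerate(probs), key=lambda ip: -ip[1])
    if top_k is not None:
        ranked = ranked[:top_k]
    # Sample from the surviving candidates (renormalized).
    r = random.random() * sum(p for _, p in ranked)
    for i, p in ranked:
        r -= p
        if r <= 0:
            return i
    return ranked[-1][0]

logits = [2.0, 1.0, 0.1]
token = sample_next(logits, temperature=0.8, top_k=2)  # never picks index 2
```

With `top_k=1` the function becomes greedy decoding (always the argmax); raising the temperature spreads probability onto the weaker candidates.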
Memory Window & Cache: AI's Memory
The model remembers only a limited number of tokens and uses a cache for speed.
Infinite memory is impossible, so the context window sets the limit.
Memory slots fill up; toggle cache to see the speed difference.
Maximum number of tokens the AI can remember at once
Saves previous work to speed things up
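Both ideas can be sketched together: a fixed-size sliding window that drops the oldest tokens, plus a cache that reuses per-token work instead of redoing it each step (a toy stand-in for a real KV cache; the class and its `process` method are illustrative):

```python
from collections import deque

class ToyContextWindow:
    """A sliding window of recent tokens plus a cache of per-token work."""

    def __init__(self, max_tokens):
        self.window = deque(maxlen=max_tokens)  # oldest tokens fall off
        self.cache = {}        # token position -> previously computed result
        self.computations = 0  # count of "expensive" recomputations

    def add(self, position, token):
        self.window.append((position, token))

    def process(self, position, token):
        # With a cache, work done for earlier tokens is reused on every
        # new step instead of being recomputed (like a KV cache).
        if position in self.cache:
            return self.cache[position]
        self.computations += 1
        result = len(token)  # stand-in for an expensive key/value projection
        self.cache[position] = result
        return result

ctx = ToyContextWindow(max_tokens=4)
for pos, tok in enumerate(["the", "cat", "sat", "on", "the", "mat"]):
    ctx.add(pos, tok)
    ctx.process(pos, tok)
ctx.process(5, "mat")  # cache hit: no extra computation
```

After six tokens, the window holds only the last four (the limit), and the repeated `process` call costs nothing, which is the speed difference the cache toggle demonstrates.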