Tokenization

Tokenization: Word Splitting

Text is converted into token IDs that the model can understand.

Human language must be converted to numeric IDs for computers to process.

Input text is split into token cards, each with a unique ID.

Input Text: "Hello, world!"

Tokens → IDs:
Hello → 15496
, → 11
world → 995
! → 0
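The splitting step can be sketched in code. The tiny vocabulary below is hard-coded to the four pieces in the example above; a real tokenizer learns tens of thousands of sub-word pieces.

```python
# Hypothetical mini-vocabulary matching the example above
# (these happen to be GPT-2's IDs for these pieces).
VOCAB = {"Hello": 15496, ",": 11, " world": 995, "!": 0}

def tokenize(text: str) -> list[int]:
    """Greedy longest-match tokenization against a tiny vocabulary."""
    ids = []
    while text:
        for piece in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(piece):
                ids.append(VOCAB[piece])
                text = text[len(piece):]
                break
        else:
            raise ValueError(f"no token matches: {text!r}")
    return ids

print(tokenize("Hello, world!"))  # [15496, 11, 995, 0]
```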

Embedding: Word Coordinates

Each word piece is converted into a number array (vector) that holds meaning.

A single number can't express meaning, so we bundle hundreds together.

Like looking up a word in a dictionary, each token fetches its number array.

Hello → Lookup → [0.00, -0.30, ...]
, → Lookup → [0.50, -0.11, ...]
world → Lookup → [0.07, 0.22, ...]
! → Lookup → [-0.49, 0.27, ...]

Typical models use vectors with 768 to 4096 dimensions
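The "dictionary lookup" can be sketched as follows. The vector values here are random stand-ins and the dimension is shrunk to 8 for readability; in a real model the table is learned during training.

```python
import random

DIM = 8  # real models use 768-4096 dimensions

def embed(token_id: int) -> list[float]:
    """Fetch the vector for a token ID, like looking a word up in a dictionary."""
    rng = random.Random(token_id)  # deterministic: same ID -> same vector
    return [round(rng.uniform(-0.5, 0.5), 2) for _ in range(DIM)]

print(embed(15496))  # the 8-dim stand-in vector for "Hello"
```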

Positional Encoding: Seat Numbers

Transformers don't know word order, so position info is added.

The same word can mean different things depending on where it appears.

Shuffled words receive position tags and get ordered correctly.

Without Order (Chaos): sat, cat, down, The ... ?

+ Positional Encoding ↓

With Order (Organized): The (Pos 0) → cat (Pos 1) → sat (Pos 2) → down (Pos 3)

A math formula gives each position a unique tag
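One such formula is the sinusoidal encoding from the original Transformer paper; a minimal sketch (the dimension count is illustrative):

```python
import math

def positional_encoding(pos: int, dim: int) -> list[float]:
    """Sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i/dim))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/dim))
    """
    pe = []
    for i in range(0, dim, 2):
        angle = pos / (10000 ** (i / dim))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:dim]

# Every position gets a unique tag; position 0 is all sin(0)=0 / cos(0)=1.
print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```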

Self-Attention: Reference Lines

Each word decides how much to reference other words.

To understand context, information must be gathered from surrounding words.

Click a word to see attention lines whose thickness represents the weights.

Query (Q): "What am I looking for?"
Key (K): "What do I contain?"
Value (V): "What info can I give?"


Line thickness represents the attention weight magnitude
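The Q/K/V mechanic above is scaled dot-product attention; a minimal sketch with toy 2-dimensional vectors:

```python
import math

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q·K^T / sqrt(d)) · V."""
    d = len(Q[0])
    rows = []
    for q in Q:
        # How much does this word reference each other word?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Gather information: weighted sum of the value vectors.
        rows.append([sum(w * v[j] for w, v in zip(weights, V))
                     for j in range(len(V[0]))])
    return rows

# Toy 2-token, 2-dim example (values illustrative).
Q = K = V = [[1.0, 0.0], [0.0, 1.0]]
print(attention(Q, K, V))
```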

Multi-Head: Multiple Perspectives

Multiple 'heads' view word relationships from different angles.

A single perspective isn't enough to understand complex language.

4 heads each focus on different word pairs.

Example sentence: The cat sat on the mat

Head 1: Focuses on syntactic structure
Head 2: Captures semantic similarity
Head 3: Tracks positional relationships
Head 4: Identifies long-range connections

All head outputs are combined into a rich representation
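As a rough sketch, multi-head attention slices each vector into per-head pieces, attends within each slice, and concatenates the results (real models also apply learned projection matrices, omitted here):

```python
import math

def head_attention(xs: list[list[float]]) -> list[list[float]]:
    """One head: scaled dot-product self-attention over its own slice."""
    d = len(xs[0])
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, xs))
                    for j in range(d)])
    return out

def multi_head(x: list[list[float]], num_heads: int) -> list[list[float]]:
    """Split each vector across heads, attend per head, concatenate."""
    head_dim = len(x[0]) // num_heads
    head_outs = []
    for h in range(num_heads):
        # Each head sees only its own slice of every token's vector.
        slice_h = [t[h * head_dim:(h + 1) * head_dim] for t in x]
        head_outs.append(head_attention(slice_h))
    # Concatenate all head outputs into one rich representation.
    return [[v for ho in head_outs for v in ho[i]] for i in range(len(x))]

x = [[0.1] * 8, [0.2] * 8, [0.3] * 8]  # 3 tokens, 8 dims, 4 heads of 2 dims
y = multi_head(x, num_heads=4)
print(len(y), len(y[0]))  # 3 8
```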

Shortcut + Normalization: Information Highway

Original information is preserved while preventing value explosion.

Prevents information from vanishing or exploding in deep networks.

Shortcuts preserve original info; normalization keeps values stable.

Residual Connection (Skip): input → Sub-layer → (+)

Original input is added to output to prevent information loss

Layer Normalization: unstable → stable

Scales values to a stable range for training
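A minimal sketch of a post-norm residual block, norm(x + sublayer(x)); the sub-layer here is a stand-in, and note that many modern models normalize *before* the sub-layer instead:

```python
import math

def layer_norm(x: list[float], eps: float = 1e-5) -> list[float]:
    """Scale a vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x: list[float], sublayer) -> list[float]:
    """Skip connection then normalization: norm(x + sublayer(x)).
    Adding x back preserves the original information."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

# Stand-in sub-layer that just scales its input.
out = residual_block([1.0, 2.0, 3.0], lambda v: [0.1 * e for e in v])
print(out)
```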

Feed-Forward Network: Thinking Time

Expand → bend → compress to refine information.

A 'thinking' stage that deeply processes information gathered by attention.

Information spreads out, gets bent, then shrinks back down.

Input → Expand → GELU → Compress

Expand: 4x dimension expansion
GELU: non-linear activation applied
Compress: return to original dimension
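The expand → GELU → compress pipeline can be sketched with toy hand-picked weights (real models learn these during training):

```python
import math

def gelu(x: float) -> float:
    """GELU activation (tanh approximation): the 'bend' in the middle."""
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi)
                                    * (x + 0.044715 * x ** 3)))

def feed_forward(x, W1, W2):
    """Expand (dim -> 4*dim), apply GELU, compress (4*dim -> dim)."""
    hidden = [gelu(sum(xi * w for xi, w in zip(x, col))) for col in W1]
    return [sum(hi * w for hi, w in zip(hidden, col)) for col in W2]

dim = 2
W1 = [[0.5, 0.5]] * (4 * dim)    # toy weights: expand 2 -> 8 dims
W2 = [[0.25] * (4 * dim)] * dim  # toy weights: compress 8 -> 2 dims
print(feed_forward([1.0, 2.0], W1, W2))
```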

Output: Creativity Dial

Temperature and Top-k control the 'creativity' of next token selection.

How you pick from the candidates changes the result.

Move the sliders to see how probability distribution and filtering change.

Temperature = 1.0: balanced selection (natural output)

Top-k = 10: moderate candidate pool (balanced)

Candidate probabilities:
the: 62%
a: 17%
my: 11%
your: 6%
an: 3%
their: 2%

Temperature sharpens/flattens distribution; Top-k limits candidate count
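A minimal sketch of temperature and top-k sampling; the logits below are made up to roughly match the percentages shown above:

```python
import math
import random

def sample_next(logits: dict[str, float],
                temperature: float = 1.0, top_k: int = 10) -> str:
    """Temperature scales the logits; top-k keeps only the k best candidates."""
    # Top-k filtering: keep only the k highest-scoring candidates.
    candidates = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Temperature: dividing by T < 1 sharpens, T > 1 flattens the distribution.
    scaled = [score / temperature for _, score in candidates]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices([tok for tok, _ in candidates], weights=probs)[0]

# Hypothetical logits for the example distribution.
logits = {"the": 3.0, "a": 1.7, "my": 1.3, "your": 0.7, "an": 0.0, "their": -0.4}
print(sample_next(logits, temperature=1.0, top_k=3))  # one of: the, a, my
```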

Memory Window & Cache: AI's Memory

The model remembers only a limited number of tokens, and uses a cache for speed.

Infinite memory is impossible, so the context window sets the limit.

Memory slots fill up; toggle cache to see the speed difference.

Memory Window

Maximum number of tokens the model can remember at once

0 / 16 tokens
Calculation Cache (K/V)

Saves previous work to speed things up

Performance Tradeoff: the cache increases memory usage in exchange for higher generation speed.
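A rough sketch of a sliding context window with a KV cache, using the demo's illustrative 16-token window; the class and counters are hypothetical, not a real model API:

```python
from collections import deque

WINDOW = 16  # max tokens the model "remembers" at once

class KVCache:
    """Stores per-token key/value pairs so old tokens aren't recomputed."""
    def __init__(self, window: int = WINDOW):
        self.keys: deque = deque(maxlen=window)    # oldest entries fall out
        self.values: deque = deque(maxlen=window)
        self.computations = 0

    def add_token(self, token_id: int) -> None:
        # Without a cache, every step recomputes K/V for ALL previous tokens;
        # with the cache, each step computes K/V for one new token only.
        self.computations += 1
        self.keys.append(("K", token_id))
        self.values.append(("V", token_id))

cache = KVCache()
for t in range(20):            # feed 20 tokens through a 16-token window
    cache.add_token(t)
print(len(cache.keys), cache.computations)  # 16 20
```

Once the window fills, the oldest tokens are forgotten, while the cache ensures each new token costs only one K/V computation.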