Tokenization

Tokenization: Word Splitting

Text is converted into token IDs that the model can understand.

Human language must be converted to numeric IDs for computers to process.

Input text is split into token cards, each with a unique ID.

Input Text: "Hello, world!"

Tokens → IDs:
Hello → 15496
, → 11
world → 995
! → 0
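The splitting step can be sketched in code. The tiny vocabulary below is hard-coded to the four pieces in the example above; a real tokenizer learns tens of thousands of sub-word pieces.

```python
# Hypothetical mini-vocabulary matching the example above
# (these happen to be GPT-2's IDs for these pieces).
VOCAB = {"Hello": 15496, ",": 11, " world": 995, "!": 0}

def tokenize(text: str) -> list[int]:
    """Greedy longest-match tokenization against a tiny vocabulary."""
    ids = []
    while text:
        for piece in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(piece):
                ids.append(VOCAB[piece])
                text = text[len(piece):]
                break
        else:
            raise ValueError(f"no token matches: {text!r}")
    return ids

print(tokenize("Hello, world!"))  # [15496, 11, 995, 0]
```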

Embedding: Word Coordinates

Each word piece is converted into a number array (vector) that holds meaning.

A single number can't express meaning, so we bundle hundreds together.

Like looking up a word in a dictionary, each token fetches its number array.

Hello → Lookup → [0.00, -0.30, ...]
, → Lookup → [0.50, -0.11, ...]
world → Lookup → [0.07, 0.22, ...]
! → Lookup → [-0.49, 0.27, ...]

Typical models use vectors with 768 to 4096 dimensions
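The "dictionary lookup" can be sketched as follows. The vector values here are random stand-ins and the dimension is shrunk to 8 for readability; in a real model the table is learned during training.

```python
import random

DIM = 8  # real models use 768-4096 dimensions

def embed(token_id: int) -> list[float]:
    """Fetch the vector for a token ID, like looking a word up in a dictionary."""
    rng = random.Random(token_id)  # deterministic: same ID -> same vector
    return [round(rng.uniform(-0.5, 0.5), 2) for _ in range(DIM)]

print(embed(15496))  # the 8-dim stand-in vector for "Hello"
```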

Positional Encoding: Seat Numbers

Transformers don't know word order, so position info is added.

The same word can mean different things depending on where it appears.

Shuffled words receive position tags and get ordered correctly.

Without Order (Chaos): sat, cat, down, The ... ?

+ Positional Encoding ↓

With Order (Organized): The (Pos 0) → cat (Pos 1) → sat (Pos 2) → down (Pos 3)

A math formula gives each position a unique tag
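One such formula is the sinusoidal encoding from the original Transformer paper; a minimal sketch (the dimension count is illustrative):

```python
import math

def positional_encoding(pos: int, dim: int) -> list[float]:
    """Sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i/dim))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/dim))
    """
    pe = []
    for i in range(0, dim, 2):
        angle = pos / (10000 ** (i / dim))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:dim]

# Every position gets a unique tag; position 0 is all sin(0)=0 / cos(0)=1.
print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```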

Self-Attention: Reference Lines

Each word decides how much to reference other words.

To understand context, information must be gathered from surrounding words.

Click a word to see attention lines whose thickness represents the weights.

Query (Q): "What am I looking for?"
Key (K): "What do I contain?"
Value (V): "What info can I give?"


Line thickness represents the attention weight magnitude
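The Q/K/V mechanic above is scaled dot-product attention; a minimal sketch with toy 2-dimensional vectors:

```python
import math

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q·K^T / sqrt(d)) · V."""
    d = len(Q[0])
    rows = []
    for q in Q:
        # How much does this word reference each other word?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Gather information: weighted sum of the value vectors.
        rows.append([sum(w * v[j] for w, v in zip(weights, V))
                     for j in range(len(V[0]))])
    return rows

# Toy 2-token, 2-dim example (values illustrative).
Q = K = V = [[1.0, 0.0], [0.0, 1.0]]
print(attention(Q, K, V))
```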

Multi-Head: Multiple Perspectives

Multiple 'heads' view word relationships from different angles.

A single perspective isn't enough to understand complex language.

4 heads each focus on different word pairs.

Example sentence: The cat sat on the mat

Head 1: Focuses on syntactic structure
Head 2: Captures semantic similarity
Head 3: Tracks positional relationships
Head 4: Identifies long-range connections

All head outputs are combined into a rich representation
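As a rough sketch, multi-head attention slices each vector into per-head pieces, attends within each slice, and concatenates the results (real models also apply learned projection matrices, omitted here):

```python
import math

def head_attention(xs: list[list[float]]) -> list[list[float]]:
    """One head: scaled dot-product self-attention over its own slice."""
    d = len(xs[0])
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, xs))
                    for j in range(d)])
    return out

def multi_head(x: list[list[float]], num_heads: int) -> list[list[float]]:
    """Split each vector across heads, attend per head, concatenate."""
    head_dim = len(x[0]) // num_heads
    head_outs = []
    for h in range(num_heads):
        # Each head sees only its own slice of every token's vector.
        slice_h = [t[h * head_dim:(h + 1) * head_dim] for t in x]
        head_outs.append(head_attention(slice_h))
    # Concatenate all head outputs into one rich representation.
    return [[v for ho in head_outs for v in ho[i]] for i in range(len(x))]

x = [[0.1] * 8, [0.2] * 8, [0.3] * 8]  # 3 tokens, 8 dims, 4 heads of 2 dims
y = multi_head(x, num_heads=4)
print(len(y), len(y[0]))  # 3 8
```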

Shortcut + Normalization: Information Highway

Original information is preserved while preventing value explosion.

Prevents information from vanishing or exploding in deep networks.

Shortcuts preserve original info; normalization keeps values stable.

Residual Connection (Skip): input → Sub-layer → (+)

Original input is added to output to prevent information loss

Layer Normalization: unstable → stable

Scales values to a stable range for training
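A minimal sketch of a post-norm residual block, norm(x + sublayer(x)); the sub-layer here is a stand-in, and note that many modern models normalize *before* the sub-layer instead:

```python
import math

def layer_norm(x: list[float], eps: float = 1e-5) -> list[float]:
    """Scale a vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x: list[float], sublayer) -> list[float]:
    """Skip connection then normalization: norm(x + sublayer(x)).
    Adding x back preserves the original information."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

# Stand-in sub-layer that just scales its input.
out = residual_block([1.0, 2.0, 3.0], lambda v: [0.1 * e for e in v])
print(out)
```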

Feed-Forward Network: Thinking Time

Expand → bend → compress to refine information.

A 'thinking' stage that deeply processes information gathered by attention.

Information spreads out, gets bent, then shrinks back down.

Input → Expand → GELU → Compress

Expand: 4x dimension expansion
GELU: non-linear activation applied
Compress: return to original dimension
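The expand → GELU → compress pipeline can be sketched with toy hand-picked weights (real models learn these during training):

```python
import math

def gelu(x: float) -> float:
    """GELU activation (tanh approximation): the 'bend' in the middle."""
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi)
                                    * (x + 0.044715 * x ** 3)))

def feed_forward(x, W1, W2):
    """Expand (dim -> 4*dim), apply GELU, compress (4*dim -> dim)."""
    hidden = [gelu(sum(xi * w for xi, w in zip(x, col))) for col in W1]
    return [sum(hi * w for hi, w in zip(hidden, col)) for col in W2]

dim = 2
W1 = [[0.5, 0.5]] * (4 * dim)    # toy weights: expand 2 -> 8 dims
W2 = [[0.25] * (4 * dim)] * dim  # toy weights: compress 8 -> 2 dims
print(feed_forward([1.0, 2.0], W1, W2))
```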

Output: Creativity Dial

Temperature and Top-k control the 'creativity' of next token selection.

How you pick from the candidates changes the result.

Move the sliders to see how probability distribution and filtering change.

Temperature = 1.0: balanced selection (natural output)

Top-k = 10: moderate candidate pool (balanced)

Candidate probabilities:
the: 62%
a: 17%
my: 11%
your: 6%
an: 3%
their: 2%

Temperature sharpens/flattens distribution; Top-k limits candidate count
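A minimal sketch of temperature and top-k sampling; the logits below are made up to roughly match the percentages shown above:

```python
import math
import random

def sample_next(logits: dict[str, float],
                temperature: float = 1.0, top_k: int = 10) -> str:
    """Temperature scales the logits; top-k keeps only the k best candidates."""
    # Top-k filtering: keep only the k highest-scoring candidates.
    candidates = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Temperature: dividing by T < 1 sharpens, T > 1 flattens the distribution.
    scaled = [score / temperature for _, score in candidates]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices([tok for tok, _ in candidates], weights=probs)[0]

# Hypothetical logits for the example distribution.
logits = {"the": 3.0, "a": 1.7, "my": 1.3, "your": 0.7, "an": 0.0, "their": -0.4}
print(sample_next(logits, temperature=1.0, top_k=3))  # one of: the, a, my
```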

Memory Window & Cache: AI's Memory

The model remembers only a limited number of tokens, and uses a cache for speed.

Infinite memory is impossible, so the context window sets the limit.

Memory slots fill up; toggle cache to see the speed difference.

Memory Window

Maximum number of tokens the model can remember at once

0 / 16 tokens
Calculation Cache (K/V)

Saves previous work to speed things up

Performance Tradeoff: the cache increases memory usage in exchange for higher generation speed.
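A rough sketch of a sliding context window with a KV cache, using the demo's illustrative 16-token window; the class and counters are hypothetical, not a real model API:

```python
from collections import deque

WINDOW = 16  # max tokens the model "remembers" at once

class KVCache:
    """Stores per-token key/value pairs so old tokens aren't recomputed."""
    def __init__(self, window: int = WINDOW):
        self.keys: deque = deque(maxlen=window)    # oldest entries fall out
        self.values: deque = deque(maxlen=window)
        self.computations = 0

    def add_token(self, token_id: int) -> None:
        # Without a cache, every step recomputes K/V for ALL previous tokens;
        # with the cache, each step computes K/V for one new token only.
        self.computations += 1
        self.keys.append(("K", token_id))
        self.values.append(("V", token_id))

cache = KVCache()
for t in range(20):            # feed 20 tokens through a 16-token window
    cache.add_token(t)
print(len(cache.keys), cache.computations)  # 16 20
```

Once the window fills, the oldest tokens are forgotten, while the cache ensures each new token costs only one K/V computation.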