Most developers use LLMs every day. We send prompts to ChatGPT, Claude, or Copilot and get back useful responses. But what actually happens in those few seconds between hitting enter and seeing text appear?
Understanding how LLMs generate text will make you a better developer. You will write better prompts. You will debug strange model behaviors. You will know when to use different temperature settings. And you will understand why these models sometimes confidently say things that are completely wrong.
This guide breaks down the entire process. No deep learning background required. Just the practical knowledge you need.
TL;DR: LLMs generate text one token at a time by repeatedly predicting the most likely next token. Your text gets split into tokens, converted to numbers, processed through attention layers that understand context, and turned into probabilities for every possible next token. The model picks one, appends it to the input, and repeats. Temperature and sampling strategies control how it picks from the probability distribution.
The Core Loop: Next Token Prediction
Here is the fundamental thing to understand: LLMs do not generate entire responses at once. They predict one token at a time in a loop.
When you ask “What is the capital of France?”, the model does not think “Paris” and output it. Instead, it:
- Processes your entire prompt
- Predicts the next token (maybe “The”)
- Appends “The” to the prompt
- Processes the whole thing again
- Predicts the next token (“capital”)
- Repeats until done
This is called autoregressive generation. Each token prediction uses the full context of everything that came before it, including the tokens the model just generated.
```mermaid
flowchart LR
subgraph Loop["Autoregressive Generation Loop"]
A["Input: What is the capital of France?"] --> B["Process through model"]
B --> C["Predict next token: The"]
C --> D["Append to input"]
D --> E["Input: What is the capital of France? The"]
E --> F["Process through model"]
F --> G["Predict next token: capital"]
G --> H["Continue until EOS token"]
end
style A fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
style H fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px
```
This explains why longer responses take longer to generate. The model runs a forward pass for every single token it outputs, so a 100-token response means 100 passes through the model. (In practice, caching lets it reuse work from earlier tokens, but it is still one pass per new token.)
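Here is a minimal Python sketch of that loop, using greedy picking for simplicity. The `model` and `tokenizer` objects are hypothetical stand-ins, not a real library API:

```python
def generate(model, tokenizer, prompt, max_new_tokens=100):
    """Autoregressive generation sketch. `model` is assumed to return a list of
    next-token probabilities; `tokenizer` is assumed to encode/decode text."""
    token_ids = tokenizer.encode(prompt)                 # text -> token IDs
    for _ in range(max_new_tokens):
        probs = model(token_ids)                         # one pass per new token
        next_id = max(range(len(probs)), key=lambda i: probs[i])  # greedy pick
        if next_id == tokenizer.eos_token_id:            # stop at end-of-sequence
            break
        token_ids.append(next_id)                        # append and go around again
    return tokenizer.decode(token_ids)
```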
The Text Generation Pipeline
Let me walk you through every step, from your text input to the generated response.
```mermaid
flowchart LR
subgraph Input["Input"]
A["Prompt"] --> B["Tokenizer"] --> C["Embeddings"]
end
subgraph Transformer["Transformer x N"]
D["Attention"] --> E["FFN"]
end
subgraph Output["Output"]
F["Logits"] --> G["Softmax"] --> H["Sample"]
end
C --> D
E --> F
H -.->|"Next Token"| A
style A fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
style D fill:#fff3e0,stroke:#e65100,stroke-width:2px
style H fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px
```
Let us break down each step.
Step 1: Tokenization
LLMs do not read text the way humans do. They cannot see individual letters or words directly. First, your text must be converted into tokens.
What Are Tokens?
A token is the atomic unit an LLM processes. It can be:
- A whole word: “apple” = 1 token
- Part of a word: “unhappiness” = “un” + “happiness” = 2 tokens
- A single character: unusual symbols might be split into one token per character
- Multiple words: common phrases might be combined
Most modern LLMs use about 50,000 to 100,000 unique tokens in their vocabulary.
Why Subwords?
Using whole words would require millions of tokens to handle every possible word in every language. Using characters would make sequences too long to process. Subwords are the sweet spot.
Algorithms like Byte Pair Encoding (BPE) learn common subword patterns from training data. Frequent words stay whole. Rare words get split into recognizable pieces.
```python
# Example tokenization (simplified)
text = "The transformer architecture is powerful"

# One possible tokenization:
tokens = ["The", " transform", "er", " architecture", " is", " powerful"]
token_ids = [464, 6121, 263, 10959, 318, 3665]
```
Why This Matters for Developers
Many strange LLM behaviors come from tokenization:
| Behavior | Tokenization Explanation |
|---|---|
| Bad at spelling | “strawberry” is one token. The model cannot see individual letters inside it. |
| Struggles with math | “1234” might become “12” + “34”. Number boundaries get lost. |
| Worse at some languages | Languages not well represented in training have inefficient tokenization. |
| Counts wrong | Cannot count characters in a word if the word is a single token. |
| Code style issues | Inconsistent whitespace tokenization affects formatting. |
As Andrej Karpathy puts it in his tokenizer deep dive, if you see an LLM doing something strange, it is worth checking how that input gets tokenized.
Token Counting in Practice
Since you pay for tokens with most APIs, understanding token counts matters:
| Text | Approximate Tokens |
|---|---|
| 1 word | 1-2 tokens |
| 1 sentence | 15-25 tokens |
| 1 paragraph | 50-100 tokens |
| 1 page | 500-700 tokens |
| 1 average code file | 500-1500 tokens |
The rule of thumb: 1 token is roughly 4 characters or 0.75 words in English.
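If you want exact counts rather than rules of thumb, OpenAI's tiktoken library (`pip install tiktoken`) exposes the actual encodings. Here with the `cl100k_base` encoding used by GPT-4-era models:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "The transformer architecture is powerful"

token_ids = enc.encode(text)
print(len(token_ids))                          # how many tokens you would be billed for
print([enc.decode([t]) for t in token_ids])    # how the text was actually split
```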
Step 2: Embeddings
Once text is tokenized into IDs, those IDs get converted into embeddings. An embedding is a list of numbers (a vector) that represents the “meaning” of a token.
```mermaid
flowchart LR
subgraph Tokens["Token IDs"]
T1["464"]
T2["6121"]
T3["263"]
end
subgraph Embeddings["Embedding Vectors"]
E1["[0.12, -0.45, 0.78, ...]"]
E2["[0.33, 0.21, -0.56, ...]"]
E3["[-0.19, 0.67, 0.42, ...]"]
end
T1 --> E1
T2 --> E2
T3 --> E3
style Tokens fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
style Embeddings fill:#fff3e0,stroke:#e65100,stroke-width:2px
```
The Magic of Vector Space
In this high-dimensional space (thousands of dimensions; GPT-3 used 12,288), tokens with similar meanings are mathematically close to each other.
- “Python” and “Java” are close (both programming languages)
- “Python” and “snake” are somewhat close (the same word also names a snake)
- “Python” and “happiness” are far apart (unrelated concepts)
This learned representation is what allows the model to understand that “The cat sat on the mat” and “The feline rested on the rug” mean similar things, even though they share no tokens.
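“Close” here means a small angle between the vectors. A toy numpy sketch with made-up 4-dimensional embeddings (real models use thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 = pointing the same way (similar meaning), ~0 = unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up embeddings for illustration only
python_lang = np.array([0.9, 0.8, 0.1, 0.0])
java_lang   = np.array([0.8, 0.9, 0.2, 0.1])
happiness   = np.array([0.0, 0.1, 0.9, 0.8])

print(cosine_similarity(python_lang, java_lang))   # high: both programming languages
print(cosine_similarity(python_lang, happiness))   # low: unrelated concepts
```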
Positional Encoding
There is a problem: embeddings alone do not capture word order. “Dog bites man” and “Man bites dog” would have the same embeddings, just in different order.
To fix this, the model adds positional encodings to embeddings. These are patterns that encode where each token appears in the sequence. Now the model knows that “Dog” is first and “man” is last.
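The original transformer paper used fixed sine and cosine waves for this; many newer models use learned or rotary variants instead, but the sinusoidal version is easy to sketch:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from 'Attention Is All You Need'."""
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]               # even dimension indices
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                            # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)                            # odd dimensions get cosine
    return pe

# One row per position; each row gets added to that token's embedding
print(positional_encoding(seq_len=6, d_model=8).shape)      # (6, 8)
```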
Step 3: The Transformer and Attention
The embeddings enter the transformer, which is where the actual “thinking” happens. A transformer is a stack of identical layers, each containing two main components:
- Self-Attention: Figures out how tokens relate to each other
- Feed-Forward Network: Processes each token independently
The number of layers varies by model: GPT-3 has 96, and GPT-4 is reported to have well over 100. Exact figures for the newest frontier models are not public. Each layer refines the representation.
Self-Attention Explained
Self-attention is the key innovation that made modern LLMs possible. It answers the question: for each token, which other tokens should I pay attention to?
Consider the sentence: “The bank by the river was muddy.”
When processing “bank”, the attention mechanism looks at all other words and learns that “river” and “muddy” are relevant. This tells the model that “bank” means the land beside water, not a financial institution.
```mermaid
flowchart TB
subgraph Attention["Self-Attention: Processing 'bank'"]
direction LR
W1["The"] --> B["bank"]
W2["by"] --> B
W3["the"] --> B
W4["river"] ==>|"High attention"| B
W5["was"] --> B
W6["muddy"] ==>|"High attention"| B
end
B --> Out["Meaning: river bank, not financial bank"]
style W4 fill:#e8f5e9,stroke:#2e7d32,stroke-width:3px
style W6 fill:#e8f5e9,stroke:#2e7d32,stroke-width:3px
style B fill:#fff3e0,stroke:#e65100,stroke-width:2px
```
The attention mechanism computes attention scores for every pair of tokens. High scores mean strong relationships. Low scores mean weak relationships.
This is also why LLMs have a context window limit. Attention computation scales quadratically with sequence length. Doubling the context length quadruples the computation. A 128K context window is expensive.
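Here is a minimal numpy sketch of single-head self-attention. Real implementations add learned projection matrices, causal masking, and multiple heads, but the core math is just this:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Scaled dot-product self-attention: softmax(XX^T / sqrt(d)) X."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)              # one score per pair of tokens -> the quadratic cost
    weights = softmax(scores, axis=-1)          # each row sums to 1: where this token "looks"
    return weights @ X                           # weighted mix of the other tokens' vectors

# 7 tokens ("The bank by the river was muddy"), toy dimension of 8
X = np.random.default_rng(0).normal(size=(7, 8))
print(self_attention(X).shape)                   # (7, 8): one context-aware vector per token
```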
Multi-Head Attention
In practice, transformers use multiple attention “heads” running in parallel. Each head can learn different types of relationships:
- One head might track subject-verb agreement
- Another might track pronoun references
- Another might track syntactic structure
The outputs of all heads get combined for a richer understanding.
Feed-Forward Networks
After attention, each token passes through a feed-forward network independently. This is where the model applies learned “rules” to transform the representation.
Think of attention as “gathering information from context” and feed-forward as “processing that information.”
Step 4: Generating Output Probabilities
After passing through all transformer layers, the model has a rich representation of the entire input sequence. Now it needs to predict the next token.
The final hidden state goes through a projection layer that maps it to the vocabulary size. If the vocabulary has 50,000 tokens, the output is 50,000 numbers, one for each possible next token.
Softmax: Converting to Probabilities
These raw numbers (called logits) are converted to probabilities using the softmax function. After softmax:
- All values are between 0 and 1
- All values sum to 1
- Higher logits become higher probabilities
| Token | Raw Logit | After Softmax |
|---|---|---|
| “the” | 8.2 | 0.42 |
| “a” | 6.1 | 0.18 |
| “Paris” | 5.9 | 0.15 |
| “is” | 4.2 | 0.08 |
| “apple” | -2.1 | 0.0001 |
| … | … | … |
| Total | - | 1.00 |
The model does not just output the top token. It gives you a full probability distribution. What you do with that distribution is where sampling strategies come in.
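A minimal sketch of this step over a toy 5-token vocabulary (the table above assumes the full 50,000-token vocabulary, so its numbers differ):

```python
import numpy as np

tokens = ["the", "a", "Paris", "is", "apple"]
logits = np.array([8.2, 6.1, 5.9, 4.2, -2.1])    # raw scores from the projection layer

probs = np.exp(logits - logits.max())             # subtract max for numerical stability
probs = probs / probs.sum()                        # softmax: now everything sums to 1

for token, p in zip(tokens, probs):
    print(f"{token}: {p:.4f}")
```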
Step 5: Sampling Strategies
The model has produced probabilities for 50,000+ possible next tokens. How do we pick one?
Greedy Decoding
The simplest approach: always pick the token with the highest probability.
Pros: Deterministic, fast, consistent
Cons: Boring, repetitive, can get stuck in loops
Greedy decoding often produces text like: “The best way to learn programming is to practice. Practice is the best way to learn. Learning is the best way to practice…”
Temperature Sampling
Temperature is a parameter that controls randomness. It scales the logits before softmax.
```python
adjusted_logits = logits / temperature
```
| Temperature | Effect |
|---|---|
| 0 | Same as greedy (deterministic) |
| 0.1 - 0.3 | Very focused, good for code and factual answers |
| 0.5 - 0.7 | Balanced, good for general use |
| 0.8 - 1.0 | Creative, good for brainstorming and writing |
| > 1.0 | Chaotic, often incoherent |
Low temperature sharpens the distribution. The top tokens get even higher probabilities.
High temperature flattens the distribution. Lower-probability tokens get a better chance.
```mermaid
flowchart TB
subgraph Temp["Temperature Effect on Distribution"]
direction TB
T0["Temperature = 0.1<br/>Sharp: Top choice almost certain"]
T5["Temperature = 0.7<br/>Balanced: Some variety"]
T1["Temperature = 1.2<br/>Flat: Many choices possible"]
end
style T0 fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
style T5 fill:#fff3e0,stroke:#e65100,stroke-width:2px
style T1 fill:#fce4ec,stroke:#c2185b,stroke-width:2px
```
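A small sketch of that effect, scaling the same toy logits with three different temperatures:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([8.2, 6.1, 5.9, 4.2])

for temperature in (0.1, 0.7, 1.5):
    probs = softmax(logits / temperature)      # temperature rescales logits before softmax
    print(temperature, np.round(probs, 3))
# Low temperature: the top token takes nearly all the mass.
# High temperature: probability spreads across more tokens.
```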
Top-K Sampling
Top-k limits the model to only consider the k most likely tokens. If k=50, the model ignores everything except the top 50 options, then samples from those.
Problem: k is fixed. Sometimes only 2 tokens make sense, sometimes 500 do. A fixed cutoff does not adapt.
Top-P (Nucleus) Sampling
Top-p is smarter. Instead of a fixed count, it takes the smallest set of tokens whose probabilities sum to p (e.g., 0.9).
If the model is confident and the top token has 95% probability, only that token is considered. If the model is uncertain and probabilities are spread out, many tokens are included.
```mermaid
flowchart LR
subgraph TopP["Top-P = 0.9 Example"]
direction TB
C1["Confident Case:<br/>'Paris': 92%<br/>Only 1 token included"]
C2["Uncertain Case:<br/>'might': 25%<br/>'could': 22%<br/>'would': 20%<br/>'may': 18%<br/>4 tokens to reach 85%"]
end
style C1 fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px
style C2 fill:#fff3e0,stroke:#e65100,stroke-width:2px
```
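A minimal numpy sketch of nucleus sampling over a toy distribution. Real inference stacks apply this after temperature scaling, and exact implementation details vary:

```python
import numpy as np

def top_p_sample(probs, p=0.9, seed=None):
    """Sample from the smallest set of tokens whose probability mass reaches p."""
    rng = np.random.default_rng(seed)
    order = np.argsort(probs)[::-1]                    # tokens from most to least likely
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1    # tokens needed to reach mass p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()   # renormalise inside the nucleus
    return int(rng.choice(nucleus, p=nucleus_probs))

probs = np.array([0.35, 0.25, 0.20, 0.12, 0.05, 0.03])      # toy next-token distribution
print(top_p_sample(probs, p=0.9))   # only the first four tokens (92% of the mass) can win
```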
Typical settings:
- Code generation: temperature 0.2, top_p 0.95
- Chat/conversation: temperature 0.7, top_p 0.9
- Creative writing: temperature 0.9, top_p 0.95
Combining Parameters
Most APIs let you set multiple parameters:
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a poem about coding"}],
    temperature=0.8,  # Creative
    top_p=0.9,        # Nucleus sampling
    max_tokens=200,   # Limit response length
)
print(response.choices[0].message.content)
```
The model applies temperature first, then top_p filtering, then samples from what remains.
Why LLMs Hallucinate
Now you understand why hallucinations happen. The model is not trying to be truthful. It is trying to predict the most plausible next token.
If your prompt is “The first person to walk on Mars was”, the model looks for tokens that:
- Are grammatically correct
- Sound confident
- Match patterns from training data
It might output “Neil Armstrong” because that pattern (first person + space + was + famous astronaut name) appears frequently in training data. The model has no way to verify facts. It just predicts likely tokens.
```mermaid
flowchart TD
subgraph Hallucination["Why Hallucinations Happen"]
P["Prompt: 'The first person on Mars was'"]
P --> M["Model predicts most likely continuation"]
M --> O1["'Neil' (confident astronaut pattern)"]
M --> O2["'Elon' (Mars + tech association)"]
M --> O3["'an' (grammatically safe)"]
O1 --> H["Picks 'Neil' - plausible but wrong"]
end
N["Model optimizes for plausibility, not truth"]
style H fill:#ffebee,stroke:#c62828,stroke-width:2px
style N fill:#fff3e0,stroke:#e65100,stroke-width:2px
```
Reducing Hallucinations
As a developer, you can reduce hallucinations by:
- Providing context: Use RAG to give the model actual facts
- Lowering temperature: More deterministic outputs are often more accurate
- Asking for sources: If the model cites specific sources, you can verify them
- Breaking down questions: Smaller, focused questions hallucinate less than broad ones
- Using newer models: Larger models trained with more RLHF tend to say “I don’t know” more often
How LLMs Learn: Training Pipeline
Understanding training helps you understand model behavior. There are three main stages.
Stage 1: Pre-training
The model reads a massive amount of text (hundreds of billions of tokens from the internet, books, code, etc.) and learns to predict the next token.
This stage teaches:
- Grammar and syntax
- Facts and knowledge
- Reasoning patterns
- Code structure
- Multiple languages
After pre-training, the model can continue any text you give it. But it is not yet good at following instructions.
Stage 2: Supervised Fine-Tuning (SFT)
The model is trained on curated prompt-response pairs. Human contractors write ideal responses to various prompts.
```text
Prompt: "Write a Python function to reverse a string"

Response: "Here's a Python function that reverses a string:

def reverse_string(s):
    return s[::-1]

This uses Python's slice notation..."
```
This teaches the model to:
- Follow instructions
- Format responses appropriately
- Explain its reasoning
- Stay on topic
Stage 3: RLHF (Reinforcement Learning from Human Feedback)
Humans rank multiple model outputs for the same prompt. A separate “reward model” learns these preferences. Then the LLM is trained to maximize the reward model’s score.
```mermaid
flowchart LR
subgraph RLHF["RLHF Process"]
P["Prompt"] --> M["Generate multiple responses"]
M --> R1["Response A"]
M --> R2["Response B"]
M --> R3["Response C"]
R1 --> H["Human ranks: B > A > C"]
R2 --> H
R3 --> H
H --> RM["Train reward model"]
RM --> LLM["Update LLM to match preferences"]
end
style H fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px
style LLM fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
```
RLHF teaches the model to:
- Be helpful
- Avoid harmful content
- Admit uncertainty
- Match human communication style
This is why ChatGPT and Claude “feel” helpful. They have been optimized for human preferences, not just next-token accuracy.
The Context Window
Every LLM has a maximum context window, measured in tokens. This is the limit on how much text it can process at once.
| Model | Context Window (approximate) |
|---|---|
| GPT-4.1 | ~1M tokens |
| GPT-5 | ~400K tokens |
| Claude Sonnet 4 | 200K tokens |
| Gemini 3 Pro | 1M+ tokens |
| Llama 4 | up to 10M tokens (Scout variant) |
These figures change with each release, so check each provider's documentation for current limits.
The “Lost in the Middle” Problem
Bigger is not always better. Research shows that LLMs pay more attention to information at the beginning and end of the context. Details in the middle can be “lost.”
For optimal results:
- Put critical instructions at the start
- Put important reference information at the end
- Keep less critical content in the middle
This is part of context engineering, which is becoming as important as prompt engineering.
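As a rough illustration (not a prescription), a prompt builder that respects this ordering might look like:

```python
def build_prompt(instructions, background_docs, key_facts, question):
    """Illustrative layout only: critical instructions first, bulk material in the
    middle where attention is weakest, must-use facts and the question at the end."""
    return "\n\n".join([
        instructions,                    # start of context: critical rules
        "\n\n".join(background_docs),    # middle: less critical background
        key_facts,                       # end of context: important reference material
        question,
    ])
```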
Context and Memory
LLMs have no persistent memory. Each request starts fresh. When you chat with ChatGPT, it is not “remembering” your previous messages. The application is sending the entire conversation history with each request.
This is why very long conversations start to degrade. Once you hit the context limit, old messages get truncated or summarized.
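A hedged sketch of the kind of trimming a chat application might do, counting tokens with tiktoken (the policy here is illustrative; real products also summarize and handle system prompts separately):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def trim_history(messages, max_tokens=8000):
    """Keep the most recent messages that fit the token budget.
    Counting is approximate: real chat APIs add per-message overhead."""
    kept, total = [], 0
    for message in reversed(messages):               # newest first
        cost = len(enc.encode(message["content"]))
        if total + cost > max_tokens:
            break
        kept.append(message)
        total += cost
    return list(reversed(kept))                       # back to chronological order
```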
Practical Implications for Developers
Understanding this pipeline has real implications for building with LLMs.
For Code Generation
| Setting | Why |
|---|---|
| Low temperature (0.1-0.3) | Code needs to be correct, not creative |
| Higher top_p (0.95) | Still want some variety in solutions |
| Clear specifications | Reduce ambiguity that leads to hallucination |
| Include examples | In-context learning through examples works well |
For Chat Applications
| Setting | Why |
|---|---|
| Medium temperature (0.7) | Balance between coherence and engagement |
| Moderate top_p (0.9) | Natural conversational variety |
| System prompt | Define persona and constraints upfront |
| Manage history | Summarize or truncate to stay in context limits |
For Content Generation
| Setting | Why |
|---|---|
| Higher temperature (0.8-0.9) | Creative output benefits from variety |
| Top_p around 0.9 | Prevents completely random outputs |
| Iterate | Generate multiple versions and select best |
| Seed for consistency | Some APIs let you set a seed for reproducible output |
Token Cost Optimization
Since you pay per token:
- Be concise in prompts: Every word costs money
- Use shorter system prompts: These get sent with every request
- Limit max_tokens: Set reasonable limits to avoid runaway responses
- Cache common prompts: Some APIs offer prompt caching discounts
- Use smaller models when possible: A 7B model is much cheaper than GPT-5
Quick Reference: Key Concepts
| Concept | What It Is | Why It Matters |
|---|---|---|
| Tokenization | Breaking text into subword units | Explains many strange behaviors |
| Embeddings | Numeric representations of meaning | How models understand similarity |
| Attention | Mechanism to relate tokens | How models understand context |
| Softmax | Converts scores to probabilities | Creates the distribution we sample from |
| Temperature | Controls randomness | Tune for your use case |
| Top-p | Dynamic probability cutoff | Better than fixed top-k |
| Autoregressive | One token at a time | Why longer outputs are slower |
| Context window | Maximum input length | Plan for this limit |
| RLHF | Training from human preferences | Why models feel “helpful” |
Key Takeaways
- LLMs are next-token predictors. Everything else emerges from this simple objective. They do not “understand” or “think” in the human sense. They predict plausible continuations.
- Tokenization explains a lot. When an LLM does something strange with spelling, math, or specific phrases, check how it tokenizes that input. Many puzzling behaviors are tokenization artifacts.
- Attention is the key innovation. The ability to relate tokens across the entire input is what makes transformers so powerful. It is also why context windows exist and are expensive.
- Sampling is where you have control. Temperature, top-p, and top-k let you tune the tradeoff between consistency and creativity. Use them deliberately.
- Hallucinations are a feature, not a bug. From the model’s perspective, it is doing exactly what it was trained to do: predict likely text. It has no concept of truth. Design your systems accordingly.
- Context is precious. You have limited tokens. Use them wisely. Put important things first and last. Consider what really needs to be in context.
- Understanding internals helps debugging. When your AI application behaves unexpectedly, knowing the pipeline helps you diagnose whether it is a tokenization issue, a context issue, a sampling issue, or something else.
Further Reading:
- Attention Is All You Need - The original transformer paper
- Let’s Build the GPT Tokenizer - Andrej Karpathy’s deep dive
- Running LLMs Locally - Run models on your own hardware
- Building AI Agents - Use LLMs as the brain for autonomous agents
- Context Engineering - Master what you put in the context window
- LLM Inference Speed Comparison - Benchmark data for local models