When you run a SQL query, a Postgres backend process parses the text into a tree, the rewriter expands views and rules, the planner/optimizer picks the cheapest plan from many candidates using table statistics, and the executor pulls rows through the plan one at a time. Reads come from the shared buffer pool (or disk on a miss), writes go to the Write-Ahead Log first for durability, MVCC keeps old row versions visible until vacuum reclaims them, and a small army of background processes (checkpointer, bgwriter, autovacuum, WAL writer) keeps the whole thing healthy. Run EXPLAIN ANALYZE on any slow query to see exactly which step is costing you.
Key Takeaways
- A query travels through four stages inside Postgres: parser, rewriter, planner, executor. Each one transforms the tree before passing it on.
- The planner is a cost based optimizer. It estimates rows from `pg_statistic` and picks the lowest cost plan. Bad statistics, not a bad query, are the most common cause of bad plans.
- Postgres uses one OS process per connection. That is why connection pooling with PgBouncer is non optional in production.
- Reads hit the shared buffer pool first. A buffer pool miss means a real disk read, which is roughly 10000 times slower.
- Every write goes to the Write-Ahead Log before it touches a data page. The WAL is the single source of truth for durability and replication.
- MVCC means UPDATE never updates in place. It writes a new row version and marks the old one dead. Vacuum is what cleans up the corpses, which is why autovacuum tuning matters.
- EXPLAIN ANALYZE is the single most useful tool a Postgres developer has. Learn to read it and you will fix most slow query problems on your own.
You write a SELECT. You hit enter. A few milliseconds later, rows come back. Most of the time it just works, and you move on. Then one day a query that used to return in ten milliseconds takes ten seconds. Or a JOIN you wrote a year ago suddenly chooses a sequential scan over the index you carefully built. Or a transaction blocks for a minute and you have no idea why.
The difference between a developer who shrugs at this and one who fixes it in an hour is one thing: knowing what actually happens inside PostgreSQL between SELECT and the rows hitting the wire.
This post is a tour of PostgreSQL internals from the developer’s seat. Not a contributor’s guide. Not the source code. Just the mental model you need so that the next time EXPLAIN ANALYZE shows a 9 second Bitmap Heap Scan, you know what it means and what to do about it. We will follow a single query from the network socket all the way to the disk and back, with a stop at every component that matters.
If you want a refresher on storage first, the How Databases Store Data Internally post covers pages, slotted pages, and B-trees. If you want a hands on companion, the PostgreSQL cheat sheet has every command we use here.
The 30 Second Picture
Before we zoom in, here is the whole pipeline on one slide.
flowchart LR
Client["fa:fa-laptop-code <b>Client</b><br/>psql, app, JDBC"]
subgraph Backend["fa:fa-server Backend Process (one per connection)"]
direction TB
Parse["fa:fa-code <b>Parser</b><br/>SQL text to query tree"]
Rewrite["fa:fa-pen-to-square <b>Rewriter</b><br/>views, rules"]
Plan["fa:fa-lightbulb <b>Planner / Optimizer</b><br/>cheapest plan wins"]
Exec["fa:fa-cogs <b>Executor</b><br/>pull rows through tree"]
Parse --> Rewrite --> Plan --> Exec
end
subgraph Mem["fa:fa-memory Shared Memory"]
SB["<b>Shared Buffers</b><br/>8 KB pages"]
WB["<b>WAL Buffers</b>"]
end
subgraph Disk["fa:fa-hard-drive Disk"]
Heap["Heap files<br/>+ indexes"]
WAL["WAL segments"]
end
Client -->|SQL over TCP| Parse
Exec -->|page request| SB
SB -->|miss| Heap
Exec -->|change record| WB
WB -->|fsync on COMMIT| WAL
Exec -->|rows| Client
classDef boxBlue fill:#dbeafe,stroke:#1d4ed8,stroke-width:2px,color:#0f172a
classDef boxGreen fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#0f172a
classDef boxOrange fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#0f172a
class Client boxBlue
class SB,WB boxGreen
class Heap,WAL boxOrange
Five things to lock in:
- There is one backend process per connection. No threads.
- Every query goes through four stages inside that backend.
- Reads always go through the shared buffer pool.
- Every write goes to the WAL before it touches a data page.
- The result rows stream back as the executor produces them.
Now let us walk through it slowly.
Process Architecture: One Backend Per Connection
When you start postgres, the first process to come up is the postmaster. It listens on the configured port (5432 by default), reads the configuration files, and supervises everything else. It does not handle queries itself.
When a client connects, the postmaster authenticates it and then fork()s a new OS process called a backend. That backend owns the connection for its entire life. Every parse, every plan, every page read for that session happens in that one process. When the client disconnects, the backend exits.
Around the postmaster sits a small fleet of auxiliary processes that handle work no individual backend should do.
flowchart TB
PM["fa:fa-crown <b>Postmaster</b><br/>listens on port 5432, reads config,<br/>forks every other process"]
subgraph Backends["fa:fa-users Backend Processes (one per client connection)"]
direction LR
B1["<b>Backend #1</b><br/>session A"]
B2["<b>Backend #2</b><br/>session B"]
B3["<b>Backend #N</b><br/>session N"]
end
subgraph WritePath["fa:fa-pen Write Path Helpers"]
direction LR
BG["<b>Background Writer</b><br/>trickles dirty buffers<br/>to data files"]
CH["<b>Checkpointer</b><br/>flushes all dirty buffers,<br/>recycles old WAL"]
WW["<b>WAL Writer</b><br/>flushes WAL buffer<br/>to disk every 200 ms"]
end
subgraph Maint["fa:fa-tools Maintenance and Observability"]
direction LR
AV["<b>Autovacuum Launcher</b><br/>spawns workers to vacuum<br/>and refresh statistics"]
ST["<b>Stats Collector</b><br/>powers pg_stat_*<br/>views"]
IO["<b>IO Workers</b><br/>async page reads<br/>(Postgres 17 and 18)"]
end
PM --> Backends
PM --> WritePath
PM --> Maint
classDef pm fill:#dbeafe,stroke:#1d4ed8,stroke-width:2px,color:#0f172a
classDef be fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#0f172a
classDef ax fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#0f172a
class PM pm
class B1,B2,B3 be
class BG,CH,WW,AV,ST,IO ax
A few practical implications fall out of this design.
- Each backend uses around 10 MB of resident memory by itself, plus whatever it allocates for sorts and hashes. A thousand idle connections is not free.
- A crash in one backend cannot take down the others, because they live in separate address spaces. Postgres survives application bugs that would kill a threaded database.
- Connection setup is expensive. Forking a process and authenticating takes milliseconds. Web apps with tens of thousands of short requests will swamp Postgres unless you put PgBouncer or pgcat in front. This is exactly the lesson covered in the How OpenAI Scales PostgreSQL write up.
For a deeper look at the process model, the EnterpriseDB write up Postgres Internals Deep Dive: Process Architecture goes into each auxiliary process in detail.
Stage 1: The Parser
Your SELECT arrives as a string of bytes over a TCP socket. The first thing the backend does is hand it to the parser.
The parser does two jobs.
- The lexer (built on the classic Unix tool
flex) chops the text into tokens: keywords, identifiers, numbers, operators. - The grammar (built with
bison) checks that the tokens form valid SQL according to the formal Postgres grammar, and produces a tree of nodes called the raw parse tree.
At this point Postgres has not opened a single table. The parser does not know whether users is a table, a view, or a typo. That happens next.
After the raw parse, the analyzer runs. It looks up every name in the system catalogs (pg_class, pg_attribute, pg_proc, and friends), checks types, resolves operator overloads, and produces a fully decorated tree called a Query. From the official Path of a Query docs, this Query node has three fields you should know about:
jointree: theFROMandWHEREclausesrtable: the range table, listing every relation referencedtargetList: the columns you actually want back, with expressions
flowchart LR
SQL["<b>SQL text</b><br/>SELECT name<br/>FROM users<br/>WHERE id = 42"]
Lex["fa:fa-puzzle-piece <b>Lexer</b><br/>tokens"]
Gram["fa:fa-sitemap <b>Grammar</b><br/>raw parse tree"]
Ana["fa:fa-search <b>Analyzer</b><br/>look up catalogs"]
Q["<b>Query node</b><br/>jointree, rtable,<br/>targetList"]
SQL --> Lex --> Gram --> Ana --> Q
classDef io fill:#dbeafe,stroke:#1d4ed8,stroke-width:2px,color:#0f172a
classDef step fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#0f172a
classDef out fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#0f172a
class SQL io
class Lex,Gram,Ana step
class Q out
The parser is the cheapest part of executing a query. It runs in microseconds. If you want to skip it on repeated queries, that is what prepared statements are for. PREPARE parses, analyzes, and plans the statement once, then EXECUTE skips straight to the executor with new parameters bound in.
Stage 2: The Rewriter
The query tree from the parser then goes through the rewriter, also called the rule system. This stage is small but important.
Its main job is view expansion. If your query selects from a view, the rewriter inlines the view definition into the tree, replacing the view name with the underlying SELECT. After rewriting, the executor never sees views; it only sees base tables.
The rewriter is also what powers Postgres RULE definitions and the INSTEAD OF triggers on updatable views. Most apps will never write a RULE, but every app uses views and most use materialized views. The rewriter is where they get unwrapped.
The full reference is the Rule System chapter of the docs. For day to day work you can treat the rewriter as a transparent step that turns SELECT * FROM my_view into the longer query you would have written by hand.
Stage 3: The Planner / Optimizer
This is the most interesting stage and the one that decides whether your query takes 3 milliseconds or 30 seconds.
The planner takes the rewritten Query tree and asks a hard question: out of all the equivalent ways to compute this result, which is cheapest?
For a single table SELECT with a WHERE clause, the choices include:
- Sequential Scan: read every page of the table.
- Index Scan: walk a B-tree index, then fetch matching heap pages.
- Index Only Scan: walk the index and never touch the heap if all needed columns are in the index.
- Bitmap Index Scan + Bitmap Heap Scan: collect a bitmap of matching pages from one or more indexes, then read those pages in physical order.
For a JOIN, the choices multiply. There are three join algorithms (Nested Loop, Hash Join, Merge Join), and for N tables there are factorially many join orders. For up to 12 tables Postgres uses a dynamic programming algorithm that explores plans bottom up. Above 12 tables it switches to the Genetic Query Optimizer (GEQO), which uses a genetic algorithm to find a good (not optimal) plan in reasonable time.
flowchart TD
Q["<b>Rewritten Query Tree</b>"]
Paths["fa:fa-route <b>Generate paths</b><br/>seq scan, index scan,<br/>bitmap, joins"]
Costs["fa:fa-calculator <b>Cost each path</b><br/>uses pg_statistic,<br/>seq_page_cost,<br/>random_page_cost,<br/>cpu_tuple_cost"]
Pick["fa:fa-trophy <b>Pick cheapest</b><br/>per join level"]
Plan["<b>Final Plan Tree</b><br/>passed to executor"]
Q --> Paths --> Costs --> Pick --> Plan
classDef boxBlue fill:#dbeafe,stroke:#1d4ed8,stroke-width:2px,color:#0f172a
classDef boxGreen fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#0f172a
classDef boxOrange fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#0f172a
class Q,Plan boxBlue
class Paths,Costs boxOrange
class Pick boxGreen
How Cost Is Computed
The planner uses a cost model measured in arbitrary units. The defaults assume one sequential page read costs 1.0. The other costs are calibrated against that baseline.
| Setting | Default | Meaning |
|---|---|---|
seq_page_cost |
1.0 | Sequential page read |
random_page_cost |
4.0 | Random page read (lower for SSDs, often 1.1) |
cpu_tuple_cost |
0.01 | Per row CPU work |
cpu_index_tuple_cost |
0.005 | Per index entry CPU work |
cpu_operator_cost |
0.0025 | Per operator/function call |
parallel_tuple_cost |
0.1 | Cost of moving a row across worker boundary |
effective_cache_size |
4 GB | Planner’s belief about OS cache size |
The rule of thumb on modern SSDs is to lower random_page_cost to 1.1 and to raise effective_cache_size to about 70 percent of system RAM. Both are documented in the resource consumption reference and in the Timescale parameter tuning guide.
Statistics Are Everything
The cost formula needs to estimate how many rows each step produces. That estimate comes from pg_statistic, populated by ANALYZE and refreshed automatically by autovacuum. For each column Postgres stores:
- The fraction of
NULLs. - The number of distinct values (
n_distinct). - A most common values list with their frequencies.
- A histogram of the remaining values.
- The physical correlation of the column with table order.
When you write WHERE age = 30, the planner looks up 30 in the most common values list. If it is there, the planner gets an exact frequency. Otherwise it estimates from the histogram. The default sample size is 30000 rows times the per column default_statistics_target (100 by default).
Stale statistics are the single most common cause of “the planner suddenly chose the wrong plan”. When in doubt, run ANALYZE. When the table is large and skewed, raise default_statistics_target to 500 or 1000 for the column that matters and run ANALYZE again.
If you want to dig deeper into the planner’s math, the canonical reference is the Planner / Optimizer chapter and the excellent Internals of PostgreSQL book by Hironobu Suzuki.
Stage 4: The Executor
The final tree from the planner is passed to the executor, which actually runs it. The executor uses the Volcano model, also called the iterator model, introduced by Goetz Graefe in 1994. Every node in the plan tree exposes a single next() method that returns one tuple or NULL. The root node calls next() on its children, who call next() on theirs, and rows percolate up the tree one at a time.
This is what the official Executor docs mean by demand pull pipeline.
flowchart TD
Out["<b>Result</b><br/>back to client"]
Sort["fa:fa-sort <b>Sort</b><br/>by name"]
HJ["fa:fa-random <b>Hash Join</b><br/>users.id = orders.user_id"]
SS1["fa:fa-table <b>Seq Scan</b><br/>orders"]
IS1["fa:fa-search <b>Index Scan</b><br/>users_pkey"]
Out --> Sort --> HJ
HJ --> IS1
HJ --> SS1
classDef boxBlue fill:#dbeafe,stroke:#1d4ed8,stroke-width:2px,color:#0f172a
classDef boxGreen fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#0f172a
classDef boxOrange fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#0f172a
class Out boxBlue
class Sort,HJ boxOrange
class IS1,SS1 boxGreen
The executor calls Sort.next(). Sort calls HashJoin.next() repeatedly, builds the join, then sorts the result. HashJoin builds a hash table from users (the smaller side) by calling IndexScan.next() until done, then probes it for each row from SeqScan on orders. Each leaf node fetches pages from the buffer pool. No materialization of intermediate result sets, no temporary tables, just rows flowing through a pipeline.
The most common executor node types every developer should recognize:
| Node | When you see it | What it means |
|---|---|---|
Seq Scan |
small tables, no useful index, or selective filter on most rows | Read every page of the table |
Index Scan |
selective filter, useful index | Walk the index, then fetch matching heap pages |
Index Only Scan |
covering index | All needed columns are in the index, heap untouched |
Bitmap Heap Scan |
medium selectivity, multi index OR/AND |
Build a bitmap of pages from one or more indexes, then read in physical order |
Nested Loop |
very selective inner side, small outer side | For each outer row, look up matches in inner side |
Hash Join |
both sides medium to large, equality join | Build a hash table on the smaller side, probe with the larger |
Merge Join |
both sides already sorted on the join key | Walk both sorted streams in parallel |
Sort |
ORDER BY without a usable index, MERGE JOIN prep |
Sort tuples, in memory if work_mem permits, otherwise spill to disk |
Hash Aggregate |
GROUP BY, DISTINCT |
Build a hash table keyed by the group columns |
Gather / Gather Merge |
parallel query | Collect rows from worker processes |
The full list is in Using EXPLAIN. We will read a real plan in a moment.
Memory: Where Pages Live While You Work
The executor never reads disk directly. It always asks the shared buffer pool for a page, identified by a (relation, block number) pair. If the page is in the pool, you get it for the cost of a hash lookup. If it is not, the buffer manager reads it from disk into a free buffer slot, optionally evicting another page.
flowchart TD
Req["fa:fa-question <b>Page request</b><br/>(rel, blocknum)"]
Hit{"In shared<br/>buffers?"}
Yes["fa:fa-check <b>Hit</b><br/>~ 100 ns"]
PickVictim["fa:fa-sync <b>Pick victim</b><br/>clock-sweep"]
DirtyCheck{"Victim<br/>dirty?"}
Flush["fa:fa-arrow-down <b>Write back</b><br/>to data file"]
Read["fa:fa-hard-drive <b>Read page</b><br/>from disk"]
Done["<b>Page returned<br/>to executor</b>"]
Req --> Hit
Hit -->|yes| Yes --> Done
Hit -->|no| PickVictim --> DirtyCheck
DirtyCheck -->|yes| Flush --> Read
DirtyCheck -->|no| Read --> Done
classDef boxBlue fill:#dbeafe,stroke:#1d4ed8,stroke-width:2px,color:#0f172a
classDef boxGreen fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#0f172a
classDef boxOrange fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#0f172a
classDef boxGray fill:#f5f5f5,stroke:#64748b,stroke-width:2px,color:#0f172a
class Req boxBlue
class Yes,Done boxGreen
class PickVictim,Flush,Read boxOrange
class Hit,DirtyCheck boxGray
The pool is a contiguous array of fixed size buffers in shared memory. Each buffer holds exactly one 8 KB page. The number of slots equals shared_buffers / 8 KB. With shared_buffers = 8 GB you get about one million buffer slots.
When the pool is full, Postgres uses a clock-sweep eviction algorithm. Each buffer has a usage_count between 0 and 5. The sweeper rotates through buffers, decrementing the count on each pass. When it finds a buffer with count zero and no pin, it evicts it. Hot pages keep getting their counts bumped on use and survive.
A few rules of thumb every Postgres operator learns the hard way:
- Set
shared_buffersto about 25 percent of system RAM as a starting point. Going higher rarely helps because Postgres relies on the OS file cache for the rest. - Set
effective_cache_sizeto about 50-75 percent of system RAM. This is a hint to the planner, not a real allocation. work_memis per sort or hash node, per query, per backend. A query with three sorts running on 200 connections can use3 * 200 * work_membytes. Be careful when raising this.- If queries spill to disk because
work_memis too small, you will seeSort Method: external merge Disk: 64MBinEXPLAIN ANALYZE. That is your signal.
The full memory tuning reference is in the Resource Consumption chapter, with practical advice in the Fastware shared_buffers and work_mem guide.
On Disk: Pages, Tuples, and the Heap
Every Postgres table is stored as a heap file of fixed size pages, exactly as covered in the How Databases Store Data Internally post. Each page is 8 KB and uses a slotted page layout.
flowchart TB
subgraph Page["fa:fa-file One Page (8 KB)"]
direction TB
H["<b>Page Header</b><br/>LSN, checksum,<br/>free space pointers"]
SA["<b>Slot Array</b><br/>line pointers (item ids)"]
Free["<b>Free Space</b>"]
Tup["<b>Tuples</b><br/>grow upward<br/>from end of page"]
Special["<b>Special Space</b><br/>(index pages only)"]
H --> SA --> Free --> Tup --> Special
end
classDef hdr fill:#dbeafe,stroke:#1d4ed8,stroke-width:2px,color:#0f172a
classDef arr fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#0f172a
classDef free fill:#f5f5f5,stroke:#64748b,stroke-width:2px,color:#0f172a
classDef tup fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#0f172a
class H hdr
class SA arr
class Free free
class Tup,Special tup
Each row Postgres stores is called a tuple. Every tuple has a fixed sized header followed by the column data. The header carries the transaction visibility information that powers MVCC:
xmin: the transaction id that inserted this tuple.xmax: the transaction id that deleted or updated this tuple, or 0 if still live.cmin/cmax: command ids inside a transaction.ctid: a(page, offset)pair, used as the physical address.- A handful of flag bits including the hint bits described in the Postgres wiki.
The full layout reference is the Database Page Layout chapter. You can see a tuple’s physical address yourself with SELECT ctid, * FROM users LIMIT 1.
If a row is too big for a single page, Postgres uses TOAST (The Oversized Attribute Storage Technique) to push large columns out to a side table with an internal pointer. The TOAST chapter is here.
Indexes are separate files of pages with the same 8 KB layout, but with a tree structure (B-tree by default, or GIN, GiST, BRIN, Hash, SP-GiST). The data structures B-tree post walks through how the tree balances itself.
MVCC: Why UPDATE Never Updates In Place
Here is the part that surprises most developers when they first read it. PostgreSQL never updates a row in place. Every UPDATE is, internally, a new INSERT plus a DELETE mark on the old row.
That trick is what enables Multi-Version Concurrency Control (MVCC). Every transaction takes a snapshot at start and sees only the row versions visible to that snapshot. Readers do not block writers. Writers do not block readers. Different transactions can see different versions of the same row at the same time.
sequenceDiagram
participant T1 as Tx 100
participant T2 as Tx 101
participant Heap as Heap Page
Note over Heap: row v1<br/>xmin=50, xmax=0<br/>balance=100
T1->>Heap: SELECT (snapshot at 100)
Heap-->>T1: balance=100
T2->>Heap: UPDATE balance=80
Note over Heap: row v1<br/>xmin=50, xmax=101 (DEAD)
Note over Heap: row v2<br/>xmin=101, xmax=0<br/>balance=80
T1->>Heap: SELECT again
Note right of T1: snapshot is 100<br/>still sees v1
Heap-->>T1: balance=100
T2->>T2: COMMIT
Note over Heap: After T1 commits or rolls back,<br/>v1 becomes invisible to all live txns,<br/>vacuum can reclaim it later.
The visibility rule (simplified) is: a tuple is visible to my transaction if xmin is committed and visible in my snapshot, and xmax is either zero, or my own transaction, or aborted, or not visible in my snapshot.
This design has three big consequences every developer should internalise.
- Tables grow even with constant row counts. Updates create dead tuples until vacuum reclaims them. A heavily updated table can grow several times its real size if vacuum cannot keep up. This is called bloat.
- Indexes also bloat. Each new tuple version needs new index entries unless HOT (Heap Only Tuple) kicks in (when no indexed column changed and the page has free space).
- Long running transactions are dangerous. A
BEGINfrom yesterday holds a snapshot that prevents vacuum from reclaiming any tuple newer than that snapshot. The fix is short transactions and theidle_in_transaction_session_timeoutGUC.
The classic write up is Bruce Momjian’s MVCC Unmasked talk. For a friendly walkthrough, the MVCC introduction in the official docs is short and clear. If you want to compare with how other databases handle concurrency, the Database Locks post covers locking and isolation in general.
The Write-Ahead Log
Now the durability story. When the executor decides to write, it does not flush the changed page to disk. That would be slow and would lose data on crash. Instead it does this:
- Modify the page in the buffer pool.
- Write a record describing the change to the WAL buffer.
- Mark the buffer dirty.
- On
COMMIT, force the WAL buffer down to disk withfsync()(or itsO_DSYNCequivalent). - Tell the client “OK”.
The dirty data page itself is written later by the background writer or the checkpointer. If we crash between steps 4 and 5, recovery replays the WAL from the last checkpoint and brings the data files back to a consistent state. If we crash before step 4, the COMMIT was never confirmed and the change is gone, which is correct.
flowchart LR
Exec["fa:fa-cogs <b>Executor</b><br/>change page"]
SB["<b>Shared Buffer</b><br/>marked dirty"]
WB["<b>WAL Buffer</b>"]
WAL["fa:fa-scroll <b>WAL Segment</b><br/>16 MB file"]
Data["fa:fa-folder <b>Data File</b>"]
Exec --> SB
Exec --> WB
WB -->|fsync on COMMIT| WAL
SB -.->|written later by<br/>bgwriter / checkpointer| Data
classDef e fill:#dbeafe,stroke:#1d4ed8,stroke-width:2px,color:#0f172a
classDef m fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#0f172a
classDef d fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#0f172a
class Exec e
class SB,WB m
class WAL,Data d
Every WAL record carries a Log Sequence Number (LSN) that is exactly a Lamport-style monotonic counter measured in bytes from the start of the log. Replicas pull WAL by LSN, point in time recovery rewinds to a specific LSN, and physical replication slots track LSNs to know how much WAL the primary can recycle.
For more on the general pattern of writing to a log first, the Write-Ahead Log distributed systems pattern post covers it from a different angle. The Postgres specific reference is the WAL Internals chapter.
A useful trio of WAL related GUCs:
| Setting | What it controls |
|---|---|
wal_level |
minimal / replica / logical. Determines how much info is logged for replication |
wal_buffers |
size of the in-memory WAL buffer (defaults to 1/32 of shared_buffers) |
synchronous_commit |
on waits for fsync, off returns early (faster but loses last few ms on crash) |
checkpoint_timeout |
how often the checkpointer runs (default 5 minutes) |
max_wal_size |
soft cap on WAL between checkpoints |
Tuning is documented in WAL Configuration.
Putting It All Together: A SELECT, End to End
Let us walk through one query and tag every component we have discussed.
1
2
3
4
5
6
7
SELECT u.id, u.name, COUNT(o.id) AS orders
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE u.created_at > '2026-01-01'
GROUP BY u.id, u.name
ORDER BY orders DESC
LIMIT 10;
- Postmaster has long since forked your backend when the connection opened.
- The SQL bytes arrive on the socket. The parser lexes, parses, and analyses them, producing a Query node with two range table entries and a join tree.
- The rewriter runs. No views referenced, so the tree is unchanged.
- The planner generates paths.
- For
users: index scan oncreated_at(if such index exists) vs sequential scan. With statistics saying 5 percent ofusersrows are after Jan 1 2026, an index scan looks cheaper. - For
orders: estimate it has many rows per user. Hash join withusersas build side wins. - Aggregation: hash aggregate by
(u.id, u.name). - Final sort + limit on
orders DESC.
- For
- The executor starts pulling rows from the root.
LimitcallsSort.SortcallsHashAggregate.HashAggregatecallsHashJoin.HashJoincallsIndexScanonusersto build the hash table, thenSeqScanonordersto probe.- Each leaf node asks the shared buffer pool for pages. Misses go to disk.
- If
Sortoverflowswork_mem, it spills to a temporary file underpg_tmp.
- Rows stream out of
Limit10 at a time, are formatted by the wire protocol, and sent back to the client.
If we had run an UPDATE instead, every changed page would also have produced a WAL record, the buffer would be dirty, and the COMMIT would have flushed the WAL to disk before responding.
Reading EXPLAIN ANALYZE Like a Pro
EXPLAIN ANALYZE is the X-ray that shows everything we just described. It actually executes the query and reports each node with estimated and actual numbers.
Here is a real plan.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
Limit (cost=18420.32..18420.34 rows=10 width=68)
(actual time=124.221..124.228 rows=10 loops=1)
-> Sort (cost=18420.32..18488.10 rows=27110 width=68)
(actual time=124.219..124.221 rows=10 loops=1)
Sort Key: (count(o.id)) DESC
Sort Method: top-N heapsort Memory: 26kB
-> HashAggregate (cost=17567.41..17838.51 rows=27110 width=68)
(actual time=120.510..122.844 rows=27033 loops=1)
Group Key: u.id, u.name
Batches: 1 Memory Usage: 4113kB
-> Hash Join (cost=4823.10..15871.00 rows=339282 width=44)
(actual time=12.910..78.413 rows=341557 loops=1)
Hash Cond: (o.user_id = u.id)
-> Seq Scan on orders o
(cost=0.00..7912.00 rows=400000 width=12)
(actual time=0.014..18.205 rows=400000 loops=1)
-> Hash (cost=4485.10..4485.10 rows=27040 width=36)
(actual time=12.799..12.800 rows=27033 loops=1)
Buckets: 32768 Batches: 1 Memory Usage: 2095kB
-> Index Scan using users_created_at_idx on users u
(cost=0.42..4485.10 rows=27040 width=36)
(actual time=0.041..7.330 rows=27033 loops=1)
Index Cond: (created_at > '2026-01-01')
Planning Time: 0.612 ms
Execution Time: 124.366 ms
Three things to learn to spot.
1. Estimated vs actual rows. Seq Scan on orders estimates 400000 and gets 400000. Excellent. Hash Join estimates 339282 and gets 341557. Within a few percent. HashAggregate estimates 27110 and gets 27033. Close. The plan is well calibrated. If you ever see a 100x gap (rows=10 estimated, rows=1000000 actual) you have a statistics problem and ANALYZE is the first fix to try.
2. Time per node. The total Execution Time is 124.366 ms. Most of it is spent in HashJoin (about 78 ms cumulative). The seq scan on orders is 18 ms. The index scan is 7 ms. If you wanted to make this faster, the join is where to focus, not the index.
3. Loops and per-loop time. Inside a Nested Loop, the inner side runs loops=N times. The displayed actual time per node is per loop. Multiply by loops to get the real cost.
A few non obvious tips:
- Always run with
BUFFERS:EXPLAIN (ANALYZE, BUFFERS) SELECT ...shows shared hit, read, dirtied, and written. A query that does heavyBuffers: shared read=is doing real disk IO. A pureshared hit=is hot in cache. - Run
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)to see every output column and worker stats. - Tools like explain.depesz.com and pgMustard render plans visually and call out problems. For complex plans they save hours.
- Postgres 18 added
EXPLAIN (ANALYZE, BUFFERS, SETTINGS, MEMORY)so you can see plan time memory usage too.
The single best deep dive is the official Using EXPLAIN chapter. The thoughtbot post on reading plans and Markus Winand’s Use The Index, Luke are also excellent.
Auxiliary Processes That Keep Things Running
We covered them in the diagram. Here is what each one actually does, in priority order for what to monitor.
Autovacuum
UPDATE and DELETE leave dead tuples behind. Without cleanup, tables and indexes bloat and queries get slower. Autovacuum runs in the background, scanning tables that have crossed configurable thresholds (autovacuum_vacuum_scale_factor defaults to 0.2, meaning vacuum when 20 percent of rows are dead).
It does two things per pass:
- VACUUM: marks dead tuple slots reusable, freezes very old tuples to prevent transaction id wraparound, and updates the visibility map.
- ANALYZE: refreshes
pg_statisticso the planner has accurate row estimates.
The single most common production tuning is to make autovacuum more aggressive on large hot tables, by setting per table autovacuum_vacuum_scale_factor = 0.05 and autovacuum_vacuum_cost_limit = 1000. Reference: Routine Vacuuming and autovacuum settings.
Background Writer
The bgwriter trickles dirty pages from the buffer pool to data files between checkpoints. Its goal is to keep clean buffers available so that backends do not have to do their own writes during eviction. If you see backends doing heavy buffers_backend writes in pg_stat_bgwriter, raise bgwriter_lru_maxpages.
Checkpointer
Once every checkpoint_timeout (or once max_wal_size is reached), the checkpointer flushes all dirty buffers to disk and writes a checkpoint record to the WAL. After a checkpoint, all WAL before that point is no longer needed for crash recovery and can be recycled.
Checkpoints are bursty by nature. To smooth them out, raise checkpoint_completion_target to 0.9 (it is already the default in modern Postgres) so the checkpointer spreads the writes over 90 percent of the interval. This is one of the simplest tunings that makes write heavy workloads feel calmer.
WAL Writer
The WAL buffer is in shared memory. The WAL writer flushes it to the WAL files on a regular interval, even between commits, so that an explicit COMMIT does not have to flush a large backlog. Default cadence is 200 ms.
Stats Collector / Cumulative Statistics
In Postgres 15 and later this became the cumulative statistics system living in shared memory. It is what populates pg_stat_user_tables, pg_stat_user_indexes, pg_stat_statements, and friends. Always install pg_stat_statements in production. It is the single most useful diagnostic extension Postgres ships with.
IO Workers (Postgres 17 and 18)
The new asynchronous IO subsystem in Postgres 18 introduces dedicated IO workers that issue page reads concurrently instead of one at a time. The result is up to 3x faster sequential scans on modern SSDs. If you are running Postgres 16 or older, this is one more reason to upgrade.
Practical Lessons for Developers
You can use a database for years without thinking about any of this. But once you understand the pipeline, a lot of practical decisions become obvious.
Always Look at the Plan Before You Add an Index
Adding an index that the planner does not use is a silent tax on every write. Before you CREATE INDEX, run EXPLAIN (ANALYZE, BUFFERS) on the slow query, look at which scan it picks, and ask whether an index would actually help. The Database Indexing Explained post walks through B-tree, hash, GIN, and BRIN indexes and when each one shines.
Run ANALYZE After Big Loads
After a COPY, a bulk insert, or a schema migration, the row count statistics are stale until autovacuum catches up (which can be tens of minutes). Run ANALYZE explicitly. The cost is small. The benefit is a planner that picks the right plan from the next query onwards.
Keep Transactions Short
Long transactions hold snapshots that block vacuum and inflate bloat. They also block schema changes, because most DDL takes an AccessExclusiveLock. Set idle_in_transaction_session_timeout = '60s' (or whatever fits your app) and watch pg_stat_activity for old xact_start timestamps.
Use Connection Pooling
Even a small Rails or Django app can open more connections than Postgres can comfortably handle. Put PgBouncer in front, configure transaction mode pooling, and set max_connections on Postgres lower than you think (usually 100-200). The Django PostgreSQL setup guide walks through CONN_MAX_AGE and pooling for Django specifically.
Tune random_page_cost for SSDs
The default random_page_cost = 4.0 was set when spinning disks were normal. On NVMe SSDs the real ratio is closer to 1.1. Lowering it makes index scans look comparatively cheaper and unblocks the planner from picking sequential scans where indexes would actually be faster.
Watch the Buffer Pool Hit Ratio
1
2
3
SELECT
sum(blks_hit) * 100.0 / sum(blks_hit + blks_read) AS hit_ratio
FROM pg_stat_database;
If this is below 99 percent on an OLTP workload, your shared_buffers is too small or your working set has grown beyond memory. Both are fixable.
Cache Aggressively at the App Layer
The fastest query is the one you do not run. Combine read replicas with an application cache layer. The Caching Strategies post walks through cache aside, write through, and write back patterns.
How PostgreSQL Compares to Other Databases
Once you know how Postgres works, the patterns elsewhere become familiar. The differences are mostly variations on the same theme.
| Aspect | PostgreSQL | MySQL InnoDB | MongoDB | DynamoDB |
|---|---|---|---|---|
| Concurrency | MVCC, no in place updates | MVCC, undo logs allow in place | MVCC, WiredTiger | Optimistic, conditional writes |
| Process model | Process per connection | Thread per connection | Thread per connection | Managed service |
| Storage unit | 8 KB page | 16 KB page | 32 KB page | Internal partitions |
| WAL equivalent | WAL | InnoDB redo log | Journal | Internal log |
| Default index | B-tree | Clustered B-tree | B-tree | Hash + sort key |
| Vacuum needed | Yes | No (in place) | Compact rare | Managed |
For a longer head-to-head, see PostgreSQL vs MongoDB vs DynamoDB. For lock semantics across engines, the How Database Locks Work post covers shared, exclusive, advisory, and SELECT FOR UPDATE in detail.
Further Reading
The PostgreSQL community has produced an unusual amount of high quality, freely available material on internals.
- The official Overview of PostgreSQL Internals is short, dense, and the source of truth.
- The Internals of PostgreSQL by Hironobu Suzuki is a free book that walks through every component.
- Postgres pro Habr article: Queries in PostgreSQL is the best long form post on query stages.
- The PostgreSQL Backend Flowchart gives you a one page overview of the C functions involved.
- A Tour of PostgreSQL Internals is Tom Lane’s classic talk in slide form.
- For a practical SaaS oriented summary, How PostgreSQL Works: Internal Architecture Explained on AlgoMaster is friendly and well diagrammed.
- For replication and distributed Postgres, Crunchy Data’s Overview of Distributed PostgreSQL Architectures is a great starting point.
If you want a hands on summary of every command you would actually use in your terminal, our own PostgreSQL cheat sheet pairs well with this post. For a broader map of database topics on this blog, the Database Engineering hub links them together.
Wrapping Up
PostgreSQL feels magical when it works and infuriating when it does not. The truth is that it is none of those things. It is a careful, well documented machine made of components you can name: postmaster, backend, parser, rewriter, planner, executor, buffer pool, WAL, MVCC, autovacuum, checkpointer.
Once you can place a slow query inside that machine, debugging is no longer a dark art. You read EXPLAIN ANALYZE, you spot the bad estimate or the cold buffer, you fix the index or the statistics, you move on. And when scale arrives and you start thinking about pooling, replication, sharding, and caching, the same mental model carries over because every Postgres scaling story is a story about moving load off one of the components we just walked through.
Spend an afternoon with EXPLAIN ANALYZE on your real production queries. Spend another with pg_stat_statements. Spend a third reading the WAL configuration page. By the end of the week you will have a clearer picture of your database than 90 percent of the developers who use it.
For more practical Postgres reading on this blog, see the PostgreSQL cheat sheet, PostgreSQL 18 features guide, How OpenAI Scales PostgreSQL, Django PostgreSQL setup, How Database Locks Work, Database Indexing Explained, How Databases Store Data Internally, and the broader Database Engineering hub.
Further reading: the official Path of a Query chapter, the Executor reference, Using EXPLAIN, WAL Internals, MVCC Introduction, Hironobu Suzuki’s free book The Internals of PostgreSQL, and Markus Winand’s Use The Index, Luke for query tuning.