How Zerodha Scaled to Millions with Just PostgreSQL

๐Ÿ˜ Deep Dive · Database Engineering

A no-nonsense technical breakdown of how India's largest stockbroker stretched one open-source database to its absolute limits — and won.

📅 2025 ✍️ play-with-data.blogspot.com ⏱ ~18 min read 🎯 Data Engineers · Backend Engineers · DBAs
§ 00 · Opening

The Counterintuitive Choice

When you hear that a platform processes tens of millions of trades per day, serves ~15 million registered users, manages hundreds of billions of rows spanning close to 20 terabytes of financial data — and does all of that with a roughly 30-person engineering team — the first database that comes to mind is probably not PostgreSQL.

You'd expect Cassandra, maybe TiDB, maybe a hybrid of ClickHouse and Kafka with a sprinkling of DynamoDB. You'd expect horizontal auto-scaling, managed cloud services, and a dedicated database reliability team. Zerodha chose a different path. They chose PostgreSQL — and built something extraordinary around it.

This article is a complete, engineering-level breakdown of how they did it, why it works, and what every data engineer or software engineer can actually steal from their playbook.

~15M Registered Users · 20 TB+ PostgreSQL Storage · 100s of Billions of Rows in Production · 7M+ PG Tables/Day (cache) · 4 Sharded Nodes · 30-Person Tech Team
§ 01 · Context

What Problem Is Zerodha Actually Solving?

Zerodha is a stock broker. Users trade equities, F&O (Futures & Options), currencies, and commodities on Indian exchanges. Every night, an end-of-day (EOD) reconciliation process runs that computes P&L per user per instrument, ledger entries, holdings, tax reports, and contract notes. The platform serving these reports is called Console — Zerodha's back-office web application.

Every time a user opens Console and asks for their trade history for the last 3 years, or filters P&L by segment — they are querying a database holding hundreds of billions of rows.

The core tension: Tens of thousands of users concurrently querying time-series financial data across billions of rows, in near-real-time, while a nightly batch process simultaneously inserts tens of millions of new rows. That is the problem Zerodha had to solve — with PostgreSQL.

Their EOD batch imports grew from roughly 100 MB/day in the early days (PostgreSQL 8!) to hundreds of GBs/day by 2023. The scale of growth demanded architectural evolution — not a change of database.

§ 02 · Schema Design

Schema Design: Denormalization as a First-Class Citizen

The most impactful architectural decision Zerodha made early: complex JOINs across tables holding billions of rows do not scale. This decision must be made at schema design time — retrofitting denormalization into a live multi-terabyte schema is extremely painful.

The Denormalization Strategy

In a classically normalized database, you might have a trades table pointing via foreign key to users and instruments. Joining these three at multi-billion-row scale drowns the query planner. Zerodha's approach: duplicate data from parent tables directly into child tables as additional columns. Yes, this increases storage. The query speed trade-off is completely worthwhile.

-- ❌ Normalized (doesn't scale at billions of rows)
SELECT t.trade_date, t.quantity, t.price, u.name, i.symbol
FROM trades t
JOIN users u  ON t.user_id      = u.id
JOIN instruments i ON t.instrument_id = i.id
WHERE t.user_id = 12345;

-- ✅ Denormalized — Zerodha style (scales fine)
SELECT trade_date, quantity, price, user_name, instrument_symbol
FROM trades
WHERE user_id = 12345;

Zerodha also dropped foreign keys entirely on large tables during bulk EOD imports. When data integrity is guaranteed at the application layer and verified against exchange records, skipping FK constraint checks during multi-million-row inserts is a dramatic performance win.
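A hedged sketch of what such an import window could look like — the constraint, table, and file names here are illustrative, not Zerodha's actual schema:

```sql
-- Before the EOD import: drop the FK so inserts skip the per-row lookup
ALTER TABLE trades DROP CONSTRAINT IF EXISTS trades_user_id_fkey;

-- Bulk load with COPY (far faster than row-by-row INSERTs)
COPY trades FROM '/data/eod/trades.csv' WITH (FORMAT csv, HEADER true);

-- Optionally re-add the constraint as NOT VALID so existing rows are
-- not re-checked; integrity was already verified at the application layer
ALTER TABLE trades
  ADD CONSTRAINT trades_user_id_fkey
  FOREIGN KEY (user_id) REFERENCES users (id) NOT VALID;
```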

Materialized Views for Unavoidable JOINs

For the few relationships that genuinely can't be avoided, Zerodha uses Materialized Views — pre-computed query results stored as physical tables, refreshed during the EOD batch window.

-- Pre-compute expensive P&L during the nightly EOD window
CREATE MATERIALIZED VIEW mv_user_pnl_summary AS
SELECT
    user_id,
    instrument_symbol,
    trading_date,
    SUM(realised_pnl)  AS total_pnl,
    SUM(turnover)      AS total_turnover
FROM trades_denorm
GROUP BY user_id, instrument_symbol, trading_date;

-- Refresh concurrently (no read lock)
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_user_pnl_summary;
💡 Engineering Takeaway

At scale, denormalization is not a code smell — it's a deliberate performance contract. Model your data the way you query it, not the way a textbook says it should be stored. Materialized views bridge the gap for the rare complex calculations that remain.

§ 03 · Partitioning

Logical Partitioning: Slicing Time-Series Data by Month

All of Zerodha's transactional data carries meaningful timestamps. They exploited this systematically using PostgreSQL declarative table partitioning, creating child partition tables grouped by calendar month.

-- Range-partitioned trades table
CREATE TABLE trades (
    id         BIGSERIAL,
    user_id    BIGINT       NOT NULL,
    trade_date DATE         NOT NULL,
    symbol     TEXT         NOT NULL,
    quantity   INT          NOT NULL,
    price      NUMERIC(12,4),
    segment    TEXT
) PARTITION BY RANGE (trade_date);

-- Each month = its own physical segment
CREATE TABLE trades_2024_01
  PARTITION OF trades
  FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE trades_2024_02
  PARTITION OF trades
  FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

A user asking "show me my January 2024 trades" causes PostgreSQL to scan just one partition via partition pruning — automatic, transparent, and orders of magnitude faster. Detaching an old partition and re-attaching it as a foreign table (via FDW) on a cheaper storage node is near-instantaneous, because the detach/attach steps manipulate only catalog metadata; the bulk data itself is moved separately.
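Pruning is easy to verify with a plain EXPLAIN: if the plan mentions only the partition covering the requested range, it worked (sketch against the schema above):

```sql
EXPLAIN (COSTS OFF)
SELECT *
FROM   trades
WHERE  trade_date BETWEEN '2024-01-05' AND '2024-01-20';
-- The plan should scan only trades_2024_01;
-- sibling partitions are pruned before execution.
```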

💡 Engineering Takeaway

If your data is time-series and queries almost always filter on a date column, range partitioning by month is one of the highest-leverage optimizations in PostgreSQL. It costs very little at setup and pays dividends indefinitely.

§ 04 · Sharding

Sharding with postgres_fdw: Boring and Brilliant

For the first several years, Zerodha ran everything on a single master-replica setup: two 32-core, 64 GB RAM EC2 instances. That single setup stored billions of rows — a testament to how far a well-tuned PostgreSQL instance can go. When exponential growth finally demanded horizontal scaling, they used what PostgreSQL already provides.

Why They Rejected Citus

Citus adds a new reference ID column to every sharded table. With hundreds of SQL queries already written against their existing schema, rewriting all of them — plus migrating multi-terabyte tables with unknown runtime — was unacceptable.

The postgres_fdw Solution

postgres_fdw exposes tables in an external PostgreSQL database as if they were local tables. It requires zero schema changes and zero query rewrites.

-- Step 1: enable extension on primary node
CREATE EXTENSION postgres_fdw;

-- Step 2: define the remote shard
CREATE SERVER shard_2021
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'shard-2021.internal', port '5432', dbname 'console');

-- Step 3: map credentials
CREATE USER MAPPING FOR app_user
  SERVER shard_2021
  OPTIONS (user 'app_user', password '...');

-- Step 4: replace old local partition with foreign table
CREATE FOREIGN TABLE trades_2021 (
    id         BIGINT,
    user_id    BIGINT,
    trade_date DATE,
    symbol     TEXT,
    quantity   INT,
    price      NUMERIC(12,4)
) SERVER shard_2021
  OPTIONS (table_name 'trades_2021');

-- Application code stays COMPLETELY UNCHANGED ✓

The primary holds the current financial year's data. Older financial year data lives on smaller, cheaper EC2 instances, linked via FDW. Their current production setup: four sharded nodes. The migration from a single master to this setup was completed in under two hours using EC2 snapshots and DETACH PARTITION / ATTACH PARTITION.
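A minimal sketch of that metadata-only handover, assuming the 2021 data has already been snapshot-copied to its shard node (partition names illustrative):

```sql
-- On the primary: detach the local 2021 partition (fast; metadata-only)
ALTER TABLE trades DETACH PARTITION trades_2021_local;

-- Attach the FDW foreign table in its place; recent PostgreSQL
-- versions allow a foreign table to act as a partition
ALTER TABLE trades ATTACH PARTITION trades_2021
  FOR VALUES FROM ('2021-01-01') TO ('2022-01-01');
```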

⚠️ Acknowledged Tradeoffs

FDW has two known limits Zerodha openly admits: (1) aggregations spanning large date ranges suffer slight cross-network latency, and (2) cross-node query parallelization is not possible. Both are acceptable because old-year queries are rare and the cache layer absorbs most heavy read traffic.

§ 05 · The Cache Hack

The "Outrageous Abuse": PostgreSQL as a Hot Cache

The Problem with Concurrent Reads

Even with partitioning and denormalization, tens of thousands of concurrent users querying a multi-billion-row database is brutal. Zerodha considered Redis. The problem: implementing server-side filtering and sorting across dozens of report types with different schemas, sort keys, and filter predicates in Redis would require complex application-layer logic and ever-growing RAM. Ruled out.

The Insight: Use PostgreSQL to Cache PostgreSQL

If you execute a heavy query against the primary DB and get back 2,000 rows — those rows are just data. Write them into a brand-new table in a separate, lightweight PostgreSQL instance. The user sorts, filters, and paginates against that tiny table, not the multi-billion-row primary. The tiny table responds in milliseconds. The primary is never touched again for that user's session.
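Conceptually, the cache write is nothing more exotic than creating a fresh table per report job on the lightweight cache instance — names here are illustrative; in production DungBeetle generates the table for each job:

```sql
-- On the cache instance: one brand-new table per report job,
-- populated by the worker with the rows it fetched from the primary
CREATE TABLE results_pnl_u123 (
    trade_date        DATE,
    instrument_symbol TEXT,
    realised_pnl      NUMERIC(14,2),
    turnover          NUMERIC(16,2)
);

-- Every subsequent user interaction hits only this tiny table
SELECT *
FROM   results_pnl_u123
WHERE  instrument_symbol = 'INFY'
ORDER  BY trade_date DESC
LIMIT  50;
```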

๐Ÿ— Zerodha Console — Async Reporting Architecture
๐Ÿ‘ค User (Console)
──▶
๐ŸŒ API Layer
──▶
๐Ÿ“ฌ DungBeetle Job Queue
↓ Worker picks up job, queries primary DB
⚙️ DungBeetle Worker
──▶
๐Ÿ˜ Primary DB (20TB, 4 shards)
↓ Results written to a NEW table in cache DB
๐Ÿ—„ Cache DB (ephemeral PG)
◀──
๐Ÿ‘ค User polls for result
Filter / Sort / Paginate → served from tiny cache table (~1K–5K rows) · Primary never touched again · Cache dropped & recreated every night

DungBeetle: The Open-Source Middleware

Zerodha open-sourced the middleware powering this: DungBeetle (originally called sql-jobber), a lightweight, single-binary distributed job server written in Go. It accepts SQL query jobs over HTTP, queues them with priority lanes, executes them asynchronously, writes each result as a new table in the cache DB, and rate-limits how many queries hit the primary concurrently.

# Launch a high-priority worker (30 concurrent queries max)
dungbeetle \
  --config /etc/dungbeetle/config.toml \
  --sql-directory /etc/dungbeetle/sql/ \
  --queue "high_priority" \
  --worker-concurrency 30

# Submit a P&L report job via HTTP API
curl localhost:6060/tasks/get_pnl_by_date/jobs \
  -H "Content-Type: application/json" \
  -d '{"job_id":"pnl_u123","queue":"high_priority",
       "args":["USER123","2024-01-01","2024-03-31"]}'

# Poll for job completion
curl localhost:6060/jobs/pnl_u123

Cache DB Tuning: Purpose-Built

# Cache DB: tuned for millions of tiny, ephemeral tables
max_connections    = 2000      # many concurrent user polls
work_mem           = '4MB'     # no JOINs, tiny result sets
shared_buffers     = '8GB'     # keep hot tables in RAM
synchronous_commit = off       # disposable data, skip fsync
wal_level          = minimal   # no WAL overhead needed
max_wal_senders    = 0         # required when wal_level = minimal
💡 Engineering Takeaway

PostgreSQL's strength as a cache layer is full SQL semantics on cached results for free. Every filter, sort, GROUP BY, and window function works without extra code. In Redis you'd deserialize, sort in-memory, and re-serialize for every filter variant. The "abuse" pays for itself.
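For example, a running cumulative P&L over a cached result set is a one-liner in SQL, with no application code at all (table name illustrative):

```sql
SELECT trade_date,
       realised_pnl,
       SUM(realised_pnl) OVER (ORDER BY trade_date) AS cumulative_pnl
FROM   results_pnl_u123   -- a tiny, per-job cache table
ORDER  BY trade_date;
```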

§ 06 · Indexing

Indexing Strategy: Index Smart, Not Everything

Zerodha's hard-won lesson from their PGConf India 2023 talk: "Index but don't overdo it." Every index you create must be maintained on every INSERT, UPDATE, and DELETE. For EOD batch jobs inserting tens of millions of rows nightly, 10 indexes on a table means 10x the maintenance cost per row.

Their strategy: keep only indexes verified as used in production via pg_stat_user_indexes. Unused indexes are dropped ruthlessly.

-- Identify unused indexes (candidates for removal)
SELECT
    tablename,
    indexname,
    idx_scan,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;

-- Composite B-tree: workhorse for user-facing queries
CREATE INDEX idx_trades_user_date
  ON trades_2024_01 (user_id, trade_date);

-- BRIN: tiny footprint for time-ordered append-only data
CREATE INDEX idx_trades_brin
  ON trades_2024_01 USING BRIN (trade_date)
  WITH (pages_per_range = 128);

-- Partial index: only index the segment you actually query
CREATE INDEX idx_trades_eq
  ON trades (user_id, trade_date)
  WHERE segment = 'EQ';
§ 07 · DB Tuning

PostgreSQL Parameter Tuning for Production

# Primary node: 32 cores, 64 GB RAM, NVMe SSD
shared_buffers                  = '16GB'   # 25% of RAM
effective_cache_size            = '48GB'   # planner hint: OS page cache
work_mem                        = '8MB'    # per sort/hash op; keep conservative
maintenance_work_mem            = '1GB'    # VACUUM, CREATE INDEX
max_wal_size                    = '4GB'    # fewer checkpoint interruptions
checkpoint_completion_target    = 0.9
max_parallel_workers_per_gather = 8
max_parallel_workers            = 16
log_min_duration_statement      = 500      # log queries > 500 ms
log_autovacuum_min_duration     = 250

VACUUM: The Invisible Janitor

PostgreSQL's MVCC keeps "dead tuples" until VACUUM cleans them up. With millions of rows deleted nightly during financial reconciliations, Zerodha triggers manual VACUUM ANALYZE immediately after every bulk import, plus aggressive autovacuum tuning on large tables.

-- Run immediately after bulk EOD imports
VACUUM ANALYZE trades_2024_01;

-- Aggressive autovacuum for high-churn tables
ALTER TABLE trades SET (
    autovacuum_vacuum_scale_factor  = 0.01,  -- vacuum at 1% dead rows
    autovacuum_analyze_scale_factor = 0.005, -- analyze at 0.5% changed
    autovacuum_vacuum_cost_delay    = 2       -- faster vacuuming (ms)
);

-- Monitor bloat
SELECT relname, n_dead_tup, n_live_tup,
  round(n_dead_tup::numeric / nullif(n_live_tup,0), 4) AS dead_ratio,
  last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000000
ORDER BY n_dead_tup DESC;

The Hard Query Timeout Rule

Zerodha sets a hard limit on how long any query can run. When a query violates the limit, it's a signal: the schema needs tuning, or the index needs adding. The limit enforces engineering discipline rather than letting slow queries silently degrade the system.

-- Set a hard 30-second limit for the application user
ALTER ROLE app_user SET statement_timeout = '30s';
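Because statement_timeout is scoped per role or per session, long-running batch work (the EOD importer) can be exempted while interactive traffic stays capped (role name illustrative):

```sql
-- EOD batch role: disable the cap entirely (0 = no timeout)
ALTER ROLE eod_batch_user SET statement_timeout = 0;

-- Or lift it for a single maintenance session only
SET statement_timeout = 0;
```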
§ 08 · High Availability

High Availability: Simple and Pragmatic

Each PostgreSQL node has a streaming replica for failover. Backups use AWS EBS snapshots — point-in-time, consistent, restorable in minutes. A key contextual advantage: Indian stock markets are closed on weekends and public holidays, providing maintenance windows that 24/7 services don't have. Major migrations, shard splits, index rebuilds — done off-hours with zero user impact.

HA Topology per Shard

[ EC2 Primary ] ── streaming replication ──▶ [ EC2 Replica ]
      ↓ EBS snapshot (nightly, crash-consistent)
      ↓ S3 file-level backup (additional safety net)

Failover: promote replica → update connection string → resume
RTO: minutes · RPO: seconds (async replica lag)

Connection pooling via PgBouncer in transaction mode sits between the application and PostgreSQL. A pool of 50–100 persistent DB connections serves thousands of concurrent application requests — because most hold a connection for only milliseconds.
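A minimal pgbouncer.ini sketch of that setup — the host names and pool sizes here are illustrative, not Zerodha's actual values:

```ini
[databases]
; one logical entry per shard (names illustrative)
console = host=primary.internal port=5432 dbname=console

[pgbouncer]
listen_port = 6432
; transaction mode: the server connection is returned to the pool
; at every COMMIT/ROLLBACK, not when the client disconnects
pool_mode = transaction
; 50-100 real PostgreSQL connections per node...
default_pool_size = 75
; ...serving thousands of application-side clients
max_client_conn = 5000
```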

§ 09 · Query Optimization

EXPLAIN ANALYZE: Reading the Query Planner

-- Profile a P&L query end-to-end
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT
    trade_date,
    instrument_symbol,
    SUM(quantity)     AS total_qty,
    SUM(realised_pnl) AS total_pnl
FROM trades
WHERE  user_id    = 99999
  AND  trade_date BETWEEN '2023-04-01' AND '2024-03-31'
GROUP BY trade_date, instrument_symbol
ORDER BY trade_date;

Key things Zerodha engineers look for in the output:

  • Seq Scan vs Index Scan — Seq Scan on a large table is almost always a red flag. Add an index or review the predicate.
  • Rows Estimate vs Actual Rows — large differences mean stale statistics. Run ANALYZE.
  • Buffers: hit vs read — "hit" = data was in RAM. "read" = disk I/O. High "read" ratio means your working set doesn't fit in shared_buffers.
  • Partition pruning confirmation — ensure only the expected partitions appear in the plan.
  • Hash Join vs Nested Loop — nested loops on large sets are slow. Hash joins preferred for large dataset joins.
§ 10 · Engineering Philosophy

The Deeper Lesson: Boring Tech, Thoughtful Engineers

Zerodha's CTO Kailash Nadh has been remarkably candid about their technical philosophy. A recurring theme in their tech blog and conference talks: "Simple scales. Scale the application before the infrastructure." They explicitly caution against copying how Google or Amazon scale — because those solutions are specific to problems 99.9% of companies don't have.

At PGConf India 2023, their database engineer Satyajit Sarangi summarized the journey plainly: they started with Postgres 8 handling 100 MB of daily imports and failed at even that scale — due to poor understanding of the data and the database. Today they handle hundreds of GBs of daily imports across 20 TB of data. The database is the same: PostgreSQL. What changed was the engineers' depth of understanding.

Kailash Nadh, Zerodha CTO: "Don't be afraid to build simple things and scale them in simple ways. Reduce networked dependencies. Pick technologies based on concrete rational reasons and not a vague idea of scale in the distant future."
§ 11 · Summary

The Full Playbook: 10 Things You Can Steal Today

📋 Zerodha's PostgreSQL Scaling Playbook

  1. Denormalize aggressively. At billions of rows, multi-table JOINs are your enemy. Copy parent data into child tables and accept the storage cost.
  2. Use Materialized Views for the few unavoidable complex computations. Refresh during your batch window, not on-demand.
  3. Partition time-series tables by month. Partition pruning makes date-range queries orders of magnitude faster and simplifies archiving.
  4. Shard with postgres_fdw. FDW-based sharding requires zero query rewrites and can be implemented in hours, not weeks.
  5. Use a dedicated ephemeral PostgreSQL cache for user-facing read traffic. Each query result gets its own table. Drop and recreate nightly.
  6. Build or deploy an async job queue (DungBeetle) to decouple heavy reporting from the primary DB and rate-limit concurrent queries.
  7. Index selectively. Verify usage via pg_stat_user_indexes. Drop unused indexes before they degrade write performance.
  8. Vacuum aggressively. Run VACUUM ANALYZE manually after bulk imports. Tune autovacuum scale factors for high-churn tables.
  9. Set statement timeouts. A hard query limit forces schema and query discipline and prevents cascading slowdowns.
  10. Drop FK constraints on bulk-import tables. When integrity is application-guaranteed, FK checks during multi-million-row inserts are pure waste.

This article was researched and written for play-with-data.blogspot.com. The goal is to make real-world, production-grade database engineering accessible to data engineers, software engineers, and anyone curious about how systems truly scale at the intersection of fintech and open-source technology.
