Case 9: Optimizing Queries on Skewed Data with a Partial Index
The Story
- Table:
orders, a massive table with 100,000,000 order history records. - Columns:
id,user_id,status('completed', 'shipped', 'pending', 'cancelled'),order_value,created_at. - Data Skew (The Key Point): The data is not evenly distributed.
- 99% of all orders have
status = 'completed'. - A tiny fraction, maybe 0.01% (about 10,000 orders), have
status = 'pending'.
- 99% of all orders have
- The Problem: The back-office system has a dashboard to show all
pendingorders so staff can process them. This is a very frequent query that needs to be extremely fast.
The Problematic SQL Query
sql
EXPLAIN ANALYZE
SELECT id, user_id, order_value, created_at
FROM orders
WHERE status = 'pending'
ORDER BY created_at ASC;The Query Plan (BEFORE a fix)
Let's assume we already have a standard B-Tree index on the status column to speed things up. CREATE INDEX idx_orders_status ON orders(status);
The planner will use this index, but the performance won't be the best it could be.
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on orders (cost=2450.75..185000.50 rows=10000 width=32) (actual time=55.45..150.80 rows=10000 loops=1)
Filter: (status = 'pending'::text)
Heap Blocks: exact=9850
-> Bitmap Index Scan on idx_orders_status (cost=0.00..2448.25 rows=10000 width=0) (actual time=45.12..45.12 rows=10000 loops=1)
Index Cond: (status = 'pending'::text)
Planning Time: 0.18 ms
Execution Time: 151.35 msThe Bottleneck
- Bloated Index: The
idx_orders_statusindex has to keep track of all 100 million rows in theorderstable. 99% of the entries in this index are for the value 'completed', which our query never, ever needs. - Sub-optimal Performance: Even though the index is used, the database still has to navigate a huge index structure that is "polluted" by tens of millions of 'completed' and 'shipped' entries, just to find the few 'pending' ones.
The Solution: Use a Partial Index
The Logic: Instead of creating a giant index for all statuses, we will create a specialized, tiny index that only contains the rows we actually care about.
The Optimized Command:
sql-- Create a partial index just for 'pending' orders, -- and include `created_at` to optimize the ORDER BY. CREATE INDEX idx_orders_pending_created_at ON orders(created_at ASC) WHERE status = 'pending';How it works: This index completely ignores any row that doesn't have
status = 'pending'. Its size will be incredibly small compared to the full index. When the planner sees a query withWHERE status = 'pending', it will recognize that there is a much more efficient, specialized index it can use.
The Query Plan (AFTER the fix)
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_orders_pending_created_at on orders (cost=0.42..355.12 rows=10000 width=32) (actual time=0.025..3.45 rows=10000 loops=1)
Planning Time: 0.25 ms
Execution Time: 3.85 msAnalyzing the Result
- The Big Change: The plan switched to a super-efficient
Index Scanon the new partial index. NoBitmap Heap Scanis needed because the index already includescreated_atfor sorting. - Smaller Index: The partial index takes up a tiny fraction of the disk space compared to the full index.
- Faster Reads: Because the index is small and only contains relevant values, the lookup is almost instant.
- Faster Writes (Bonus!): When you update an order from 'shipped' to 'completed', this partial index doesn't need to be touched at all, which speeds up writes for irrelevant rows.
- Performance Comparison:
- Execution Time: Dropped from 151.35 ms to just 3.85 ms. That's nearly 40 times faster.
Conclusion
- A
Partial Indexis an extremely powerful tool for optimizing queries on datasets with uneven distribution (skewed data). - It's ideal for cases where you frequently query a small subset of your data that is defined by a mostly constant condition (e.g.,
status = 'active',is_deleted = false,type = 'admin').