Case 9: Optimizing Queries on Skewed Data with a `Partial Index`

The Story

Table: orders, a massive table with 100,000,000 order history records.
Columns: id, user_id, status ('completed', 'shipped', 'pending', 'cancelled'), order_value, created_at.
Data Skew (The Key Point): The data is not evenly distributed.
- 99% of all orders have status = 'completed'.
- A tiny fraction, maybe 0.01% (about 10,000 orders), have status = 'pending'.
The Problem: The back-office system has a dashboard to show all pending orders so staff can process them. This is a very frequent query that needs to be extremely fast.

The Problematic SQL Query

sql

EXPLAIN ANALYZE
SELECT id, user_id, order_value, created_at
FROM orders
WHERE status = 'pending'
ORDER BY created_at ASC;

The Query Plan (BEFORE a fix)

Let's assume we already have a standard B-Tree index on the status column to speed things up. CREATE INDEX idx_orders_status ON orders(status);

The planner will use this index, but the performance won't be the best it could be.

                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on orders  (cost=2450.75..185000.50 rows=10000 width=32) (actual time=55.45..150.80 rows=10000 loops=1)
   Filter: (status = 'pending'::text)
   Heap Blocks: exact=9850
   ->  Bitmap Index Scan on idx_orders_status  (cost=0.00..2448.25 rows=10000 width=0) (actual time=45.12..45.12 rows=10000 loops=1)
         Index Cond: (status = 'pending'::text)
 Planning Time: 0.18 ms
 Execution Time: 151.35 ms

The Bottleneck

Bloated Index: The idx_orders_status index has to keep track of all 100 million rows in the orders table. 99% of the entries in this index are for the value 'completed', which our query never, ever needs.
Sub-optimal Performance: Even though the index is used, the database still has to navigate a huge index structure that is "polluted" by tens of millions of 'completed' and 'shipped' entries, just to find the few 'pending' ones.

The Solution: Use a `Partial Index`

The Logic: Instead of creating a giant index for all statuses, we will create a specialized, tiny index that only contains the rows we actually care about.

The Optimized Command:

sql

-- Create a partial index just for 'pending' orders,
-- and include `created_at` to optimize the ORDER BY.
CREATE INDEX idx_orders_pending_created_at
ON orders(created_at ASC)
WHERE status = 'pending';

How it works: This index completely ignores any row that doesn't have status = 'pending'. Its size will be incredibly small compared to the full index. When the planner sees a query with WHERE status = 'pending', it will recognize that there is a much more efficient, specialized index it can use.

The Query Plan (AFTER the fix)

                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_orders_pending_created_at on orders  (cost=0.42..355.12 rows=10000 width=32) (actual time=0.025..3.45 rows=10000 loops=1)
 Planning Time: 0.25 ms
 Execution Time: 3.85 ms

Analyzing the Result

The Big Change: The plan switched to a super-efficient Index Scan on the new partial index. No Bitmap Heap Scan is needed because the index already includes created_at for sorting.
Smaller Index: The partial index takes up a tiny fraction of the disk space compared to the full index.
Faster Reads: Because the index is small and only contains relevant values, the lookup is almost instant.
Faster Writes (Bonus!): When you update an order from 'shipped' to 'completed', this partial index doesn't need to be touched at all, which speeds up writes for irrelevant rows.
Performance Comparison:
- Execution Time: Dropped from 151.35 ms to just 3.85 ms. That's nearly 40 times faster.

Conclusion

A Partial Index is an extremely powerful tool for optimizing queries on datasets with uneven distribution (skewed data).
It's ideal for cases where you frequently query a small subset of your data that is defined by a mostly constant condition (e.g., status = 'active', is_deleted = false, type = 'admin').

Case 9: Optimizing Queries on Skewed Data with a Partial Index ​

The Story ​

The Problematic SQL Query ​

The Query Plan (BEFORE a fix) ​

The Bottleneck ​

The Solution: Use a Partial Index ​

The Query Plan (AFTER the fix) ​

Analyzing the Result ​

Conclusion ​