Skip to content

Case 9: Optimizing Queries on Skewed Data with a Partial Index

The Story

  • Table: orders, a massive table with 100,000,000 order history records.
  • Columns: id, user_id, status ('completed', 'shipped', 'pending', 'cancelled'), order_value, created_at.
  • Data Skew (The Key Point): The data is not evenly distributed.
    • 99% of all orders have status = 'completed'.
    • A tiny fraction, maybe 0.01% (about 10,000 orders), have status = 'pending'.
  • The Problem: The back-office system has a dashboard to show all pending orders so staff can process them. This is a very frequent query that needs to be extremely fast.

The Problematic SQL Query

sql
EXPLAIN ANALYZE
SELECT id, user_id, order_value, created_at
FROM orders
WHERE status = 'pending'
ORDER BY created_at ASC;

The Query Plan (BEFORE a fix)

Let's assume we already have a standard B-Tree index on the status column to speed things up. CREATE INDEX idx_orders_status ON orders(status);

The planner will use this index, but the performance won't be the best it could be.

                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on orders  (cost=2450.75..185000.50 rows=10000 width=32) (actual time=55.45..150.80 rows=10000 loops=1)
   Filter: (status = 'pending'::text)
   Heap Blocks: exact=9850
   ->  Bitmap Index Scan on idx_orders_status  (cost=0.00..2448.25 rows=10000 width=0) (actual time=45.12..45.12 rows=10000 loops=1)
         Index Cond: (status = 'pending'::text)
 Planning Time: 0.18 ms
 Execution Time: 151.35 ms

The Bottleneck

  1. Bloated Index: The idx_orders_status index has to keep track of all 100 million rows in the orders table. 99% of the entries in this index are for the value 'completed', which our query never, ever needs.
  2. Sub-optimal Performance: Even though the index is used, the database still has to navigate a huge index structure that is "polluted" by tens of millions of 'completed' and 'shipped' entries, just to find the few 'pending' ones.

The Solution: Use a Partial Index

  • The Logic: Instead of creating a giant index for all statuses, we will create a specialized, tiny index that only contains the rows we actually care about.

  • The Optimized Command:

    sql
    -- Create a partial index just for 'pending' orders,
    -- and include `created_at` to optimize the ORDER BY.
    CREATE INDEX idx_orders_pending_created_at
    ON orders(created_at ASC)
    WHERE status = 'pending';
  • How it works: This index completely ignores any row that doesn't have status = 'pending'. Its size will be incredibly small compared to the full index. When the planner sees a query with WHERE status = 'pending', it will recognize that there is a much more efficient, specialized index it can use.

The Query Plan (AFTER the fix)

                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_orders_pending_created_at on orders  (cost=0.42..355.12 rows=10000 width=32) (actual time=0.025..3.45 rows=10000 loops=1)
 Planning Time: 0.25 ms
 Execution Time: 3.85 ms

Analyzing the Result

  1. The Big Change: The plan switched to a super-efficient Index Scan on the new partial index. No Bitmap Heap Scan is needed because the index already includes created_at for sorting.
  2. Smaller Index: The partial index takes up a tiny fraction of the disk space compared to the full index.
  3. Faster Reads: Because the index is small and only contains relevant values, the lookup is almost instant.
  4. Faster Writes (Bonus!): When you update an order from 'shipped' to 'completed', this partial index doesn't need to be touched at all, which speeds up writes for irrelevant rows.
  5. Performance Comparison:
    • Execution Time: Dropped from 151.35 ms to just 3.85 ms. That's nearly 40 times faster.

Conclusion

  • A Partial Index is an extremely powerful tool for optimizing queries on datasets with uneven distribution (skewed data).
  • It's ideal for cases where you frequently query a small subset of your data that is defined by a mostly constant condition (e.g., status = 'active', is_deleted = false, type = 'admin').