BanTool Documentation

Case 27: The "Skip Scan" Problem: Optimizing `SELECT DISTINCT` on an Index

Of course. We'll continue the series with an expert-level technique used to solve a very specific optimization problem that not many people know about: simulating a "Loose Index Scan" (or "Skip Scan") to speed up SELECT DISTINCT on a large index.

The Scenario 📝

System: A website analytics platform.
Table: page_views, a massive table that records every page view, with 500,000,000 rows.
Index: There is a composite index to serve many different types of queries: CREATE INDEX idx_page_views_site_time ON page_views(site_id, view_timestamp);
The Problem: We need to get a list of all unique site_ids that have had activity.

The Standard Approach and Its Bottleneck

The simplest way to write this is to use DISTINCT.

sql

EXPLAIN ANALYZE
SELECT DISTINCT site_id FROM page_views;

Analyzing the Execution Plan:
- PostgreSQL will most likely choose to scan the entire idx_page_views_site_time index from beginning to end. The plan might be an Index Only Scan followed by a HashAggregate to find the DISTINCT values.
The Bottleneck:
- The index is sorted by (site_id, view_timestamp). Imagine it looks like this:
```
(site_A, time_1)
(site_A, time_2)
... (1 million rows for site_A)
(site_B, time_3)
(site_B, time_4)
... (2 million rows for site_B)
...
```
- To find all unique site_ids, PostgreSQL has to read all 500 million entries in the index. However, it only needs to read the first row for site_A, then skip the next 999,999 rows for site_A, and jump straight to the first row for site_B. This "jumping" action is called a "Loose/Skip Scan," a feature that the PostgreSQL optimizer does not automatically perform for DISTINCT.

The Solution: Simulate a "Skip Scan" with a `Recursive CTE` ✅

The Logic: We will use a Recursive CTE to "jump" from the first DISTINCT value in the index to the next DISTINCT value.
1. Anchor: Find the smallest (first) site_id in the index.
2. Recursive Step: With the site_id we just found, find the next smallest site_id that is greater than it.
3. Repeat until the end.

The Optimized Query:

sql

EXPLAIN ANALYZE
WITH RECURSIVE distinct_sites AS (
    -- 1. Anchor: Find the first site_id.
    (SELECT site_id FROM page_views ORDER BY site_id LIMIT 1)

    UNION ALL

    -- 2. Recursive: With the previous site_id, find the next one.
    SELECT (
        SELECT next_pv.site_id
        FROM page_views next_pv
        WHERE next_pv.site_id > prev_site.site_id -- The condition to "jump".
        ORDER BY next_pv.site_id
        LIMIT 1
    )
    FROM distinct_sites prev_site
    WHERE prev_site.site_id IS NOT NULL
)
SELECT site_id FROM distinct_sites WHERE site_id IS NOT NULL;

Analyzing the Results ✨

A Fundamental Change in Logic:
- Instead of one massive scan of 500 million entries, this query performs thousands (equal to the number of unique sites) of tiny queries.
- Each small query inside the loop (SELECT ... WHERE site_id > ... ORDER BY ... LIMIT 1) is an extremely efficient Index Scan to find the very next entry.
Superior Performance:
- The amount of data that needs to be read and processed is drastically reduced. Instead of reading 500 million (site_id, view_timestamp) pairs, it only reads a few thousand site_ids.
- The execution time can drop from several minutes to just a few seconds.

Conclusion:

SELECT DISTINCT on the leading column of a large composite index might not be as efficient as you think.
The Recursive CTE pattern to simulate a "Skip Scan" is an advanced technique that allows you to "teach" the database how to efficiently jump from one unique value to the next.
This is not a technique you'll need every day, but when you encounter the right problem, it can make a huge difference in performance. It's a great example of the power of modern SQL.

Case 27: The "Skip Scan" Problem: Optimizing SELECT DISTINCT on an Index ​

The Scenario 📝 ​

The Standard Approach and Its Bottleneck ​

The Solution: Simulate a "Skip Scan" with a Recursive CTE ✅ ​

Analyzing the Results ✨ ​

Case 27: The "Skip Scan" Problem: Optimizing `SELECT DISTINCT` on an Index

The Scenario 📝

The Standard Approach and Its Bottleneck

The Solution: Simulate a "Skip Scan" with a `Recursive CTE` ✅

Analyzing the Results ✨