Skip to content

Case 27: The "Skip Scan" Problem: Optimizing SELECT DISTINCT on an Index

Of course. We'll continue the series with an expert-level technique used to solve a very specific optimization problem that not many people know about: simulating a "Loose Index Scan" (or "Skip Scan") to speed up SELECT DISTINCT on a large index.


The Scenario 📝

  • System: A website analytics platform.
  • Table: page_views, a massive table that records every page view, with 500,000,000 rows.
  • Index: There is a composite index to serve many different types of queries: CREATE INDEX idx_page_views_site_time ON page_views(site_id, view_timestamp);
  • The Problem: We need to get a list of all unique site_ids that have had activity.

The Standard Approach and Its Bottleneck

The simplest way to write this is to use DISTINCT.

sql
EXPLAIN ANALYZE
SELECT DISTINCT site_id FROM page_views;
  • Analyzing the Execution Plan:
    • PostgreSQL will most likely choose to scan the entire idx_page_views_site_time index from beginning to end. The plan might be an Index Only Scan followed by a HashAggregate to find the DISTINCT values.
  • The Bottleneck:
    • The index is sorted by (site_id, view_timestamp). Imagine it looks like this:
      (site_A, time_1)
      (site_A, time_2)
      ... (1 million rows for site_A)
      (site_B, time_3)
      (site_B, time_4)
      ... (2 million rows for site_B)
      ...
    • To find all unique site_ids, PostgreSQL has to read all 500 million entries in the index. However, it only needs to read the first row for site_A, then skip the next 999,999 rows for site_A, and jump straight to the first row for site_B. This "jumping" action is called a "Loose/Skip Scan," a feature that the PostgreSQL optimizer does not automatically perform for DISTINCT.

The Solution: Simulate a "Skip Scan" with a Recursive CTE

  • The Logic: We will use a Recursive CTE to "jump" from the first DISTINCT value in the index to the next DISTINCT value.

    1. Anchor: Find the smallest (first) site_id in the index.
    2. Recursive Step: With the site_id we just found, find the next smallest site_id that is greater than it.
    3. Repeat until the end.
  • The Optimized Query:

    sql
    EXPLAIN ANALYZE
    WITH RECURSIVE distinct_sites AS (
        -- 1. Anchor: Find the first site_id.
        (SELECT site_id FROM page_views ORDER BY site_id LIMIT 1)
    
        UNION ALL
    
        -- 2. Recursive: With the previous site_id, find the next one.
        SELECT (
            SELECT next_pv.site_id
            FROM page_views next_pv
            WHERE next_pv.site_id > prev_site.site_id -- The condition to "jump".
            ORDER BY next_pv.site_id
            LIMIT 1
        )
        FROM distinct_sites prev_site
        WHERE prev_site.site_id IS NOT NULL
    )
    SELECT site_id FROM distinct_sites WHERE site_id IS NOT NULL;

Analyzing the Results

  1. A Fundamental Change in Logic:
    • Instead of one massive scan of 500 million entries, this query performs thousands (equal to the number of unique sites) of tiny queries.
    • Each small query inside the loop (SELECT ... WHERE site_id > ... ORDER BY ... LIMIT 1) is an extremely efficient Index Scan to find the very next entry.
  2. Superior Performance:
    • The amount of data that needs to be read and processed is drastically reduced. Instead of reading 500 million (site_id, view_timestamp) pairs, it only reads a few thousand site_ids.
    • The execution time can drop from several minutes to just a few seconds.

Conclusion:

  • SELECT DISTINCT on the leading column of a large composite index might not be as efficient as you think.
  • The Recursive CTE pattern to simulate a "Skip Scan" is an advanced technique that allows you to "teach" the database how to efficiently jump from one unique value to the next.
  • This is not a technique you'll need every day, but when you encounter the right problem, it can make a huge difference in performance. It's a great example of the power of modern SQL.