Case 27: The "Skip Scan" Problem: Optimizing SELECT DISTINCT on an Index
Of course. We'll continue the series with an expert-level technique used to solve a very specific optimization problem that not many people know about: simulating a "Loose Index Scan" (or "Skip Scan") to speed up SELECT DISTINCT on a large index.
The Scenario 📝
- System: A website analytics platform.
- Table:
page_views, a massive table that records every page view, with 500,000,000 rows. - Index: There is a
composite indexto serve many different types of queries:CREATE INDEX idx_page_views_site_time ON page_views(site_id, view_timestamp); - The Problem: We need to get a list of all unique
site_ids that have had activity.
The Standard Approach and Its Bottleneck
The simplest way to write this is to use DISTINCT.
sql
EXPLAIN ANALYZE
SELECT DISTINCT site_id FROM page_views;- Analyzing the
Execution Plan:- PostgreSQL will most likely choose to scan the entire
idx_page_views_site_timeindex from beginning to end. The plan might be anIndex Only Scanfollowed by aHashAggregateto find theDISTINCTvalues.
- PostgreSQL will most likely choose to scan the entire
- The Bottleneck:
- The index is sorted by
(site_id, view_timestamp). Imagine it looks like this:(site_A, time_1) (site_A, time_2) ... (1 million rows for site_A) (site_B, time_3) (site_B, time_4) ... (2 million rows for site_B) ... - To find all unique
site_ids, PostgreSQL has to read all 500 million entries in the index. However, it only needs to read the first row forsite_A, then skip the next 999,999 rows forsite_A, and jump straight to the first row forsite_B. This "jumping" action is called a "Loose/Skip Scan," a feature that the PostgreSQL optimizer does not automatically perform forDISTINCT.
- The index is sorted by
The Solution: Simulate a "Skip Scan" with a Recursive CTE ✅
The Logic: We will use a
Recursive CTEto "jump" from the firstDISTINCTvalue in the index to the nextDISTINCTvalue.- Anchor: Find the smallest (first)
site_idin the index. - Recursive Step: With the
site_idwe just found, find the next smallestsite_idthat is greater than it. - Repeat until the end.
- Anchor: Find the smallest (first)
The Optimized Query:
sqlEXPLAIN ANALYZE WITH RECURSIVE distinct_sites AS ( -- 1. Anchor: Find the first site_id. (SELECT site_id FROM page_views ORDER BY site_id LIMIT 1) UNION ALL -- 2. Recursive: With the previous site_id, find the next one. SELECT ( SELECT next_pv.site_id FROM page_views next_pv WHERE next_pv.site_id > prev_site.site_id -- The condition to "jump". ORDER BY next_pv.site_id LIMIT 1 ) FROM distinct_sites prev_site WHERE prev_site.site_id IS NOT NULL ) SELECT site_id FROM distinct_sites WHERE site_id IS NOT NULL;
Analyzing the Results ✨
- A Fundamental Change in Logic:
- Instead of one massive scan of 500 million entries, this query performs thousands (equal to the number of unique sites) of tiny queries.
- Each small query inside the loop (
SELECT ... WHERE site_id > ... ORDER BY ... LIMIT 1) is an extremely efficientIndex Scanto find the very next entry.
- Superior Performance:
- The amount of data that needs to be read and processed is drastically reduced. Instead of reading 500 million
(site_id, view_timestamp)pairs, it only reads a few thousandsite_ids. - The execution time can drop from several minutes to just a few seconds.
- The amount of data that needs to be read and processed is drastically reduced. Instead of reading 500 million
Conclusion:
SELECT DISTINCTon the leading column of a large composite index might not be as efficient as you think.- The
Recursive CTEpattern to simulate a "Skip Scan" is an advanced technique that allows you to "teach" the database how to efficiently jump from one unique value to the next. - This is not a technique you'll need every day, but when you encounter the right problem, it can make a huge difference in performance. It's a great example of the power of modern SQL.