Skip to content

Case 3: Joining on Large Datasets Before Filtering

The Story

  • Table 1: event_logs (records every user event), very large with 20,000,000 rows. Columns: user_id, event_type, event_timestamp, details.
  • Table 2: users (user information), with 500,000 rows. Columns: id, registration_date, country.
  • Current State: Indexes exist on event_logs(user_id), event_logs(event_timestamp), and users(id).
  • The Problem: The marketing team wants to analyze 'purchase' events for users who registered in July 2025 and are from 'Vietnam'.

The Problematic SQL Query

A natural but potentially inefficient way to write the query:

sql
EXPLAIN ANALYZE
SELECT count(*)
FROM event_logs el
JOIN users u ON el.user_id = u.id
WHERE el.event_type = 'purchase'
  AND u.country = 'Vietnam'
  AND u.registration_date >= '2025-07-01'
  AND u.registration_date < '2025-08-01';

The Query Plan (BEFORE a fix)

In some cases, the query planner might choose a non-optimal plan that joins the large dataset first.

                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=450123.45..450123.46 rows=1 width=8) (actual time=5123.456..5123.457 rows=1 loops=1)
   ->  Hash Join  (cost=15821.50..450098.21 rows=10100 width=0) (actual time=350.123..5110.987 rows=15234 loops=1)
         Hash Cond: (el.user_id = u.id)
         ->  Seq Scan on event_logs el  (cost=0.00..425432.00 rows=250000 width=4) (actual time=0.025..4567.890 rows=250000 loops=1)
               Filter: (event_type = 'purchase'::text)
         ->  Hash  (cost=15800.25..15800.25 rows=1050 width=4) (actual time=3.456..3.456 rows=1050 loops=1)
               Buckets: 2048  Batches: 1  Memory Usage: 48kB
               ->  Seq Scan on users u  (cost=0.00..15800.25 rows=1050 width=4) (actual time=0.015..2.987 rows=1050 loops=1)
                     Filter: ((country = 'Vietnam'::text) AND (registration_date >= '2025-07-01'::date) AND (registration_date < '2025-08-01'::date))
 Planning Time: 0.450 ms
 Execution Time: 5123.678 ms

Why is this a problem?

  1. Look at the steps from the inside out:

    • Seq Scan on users u: First, it filters the users table. Let's say it finds 1,050 users from 'Vietnam' who registered in July. This step is quite fast.
    • Seq Scan on event_logs el: It filters the event_logs table to get all 'purchase' events. Let's say there are 250,000 such events. This step has to read a large part of the table and is quite slow (actual time=...4567.890).
    • Hash Join: This is the main bottleneck. It performs a join between 250,000 rows from event_logs and 1,050 rows from users. Even though 1,050 is a small number, having to process 250,000 rows in the join is very expensive.
  2. The Root Cause:

    • The order of operations is not optimal. The plan filtered both tables but still left a very large dataset (250,000 'purchase' events) to be used in the JOIN.
    • A better approach would be to find the small list of users (1,050 people) first, and then find the 'purchase' events for only those specific users in the event_logs table.

The Solution

  • Action: Rewrite the query to force the database to filter the users table first, and then use that result (the list of user IDs) to efficiently query the event_logs table using its index. Using a CTE (Common Table Expression) or a Subquery works well.
  • Command with CTE:
    sql
    EXPLAIN ANALYZE
    WITH filtered_users AS (
        SELECT id
        FROM users
        WHERE country = 'Vietnam'
          AND registration_date >= '2025-07-01'
          AND registration_date < '2025-08-01'
    )
    SELECT count(*)
    FROM event_logs el
    JOIN filtered_users fu ON el.user_id = fu.id
    WHERE el.event_type = 'purchase';

The Query Plan (AFTER the fix)

The new plan will be much smarter.

                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=18975.34..18975.35 rows=1 width=8) (actual time=245.812..245.813 rows=1 loops=1)
   ->  Nested Loop  (cost=0.57..18950.10 rows=10100 width=0) (actual time=0.067..240.123 rows=15234 loops=1)
         ->  Seq Scan on users u  (cost=0.00..15800.25 rows=1050 width=4) (actual time=0.021..2.899 rows=1050 loops=1)
               Filter: ((country = 'Vietnam'::text) AND (registration_date >= '2025-07-01'::date) AND (registration_date < '2025-08-01'::date))
         ->  Index Scan using idx_event_logs_user_id on event_logs el  (cost=0.57..2.98 rows=10 width=4) (actual time=0.025..0.225 rows=15 loops=1050)
               Index Cond: (user_id = u.id)
               Filter: (event_type = 'purchase'::text)
 Planning Time: 0.512 ms
 Execution Time: 246.015 ms

Analyzing the Result

  1. The Big Change: The plan switched to a Nested Loop.

    • Outer Loop: It scans the users table to find the 1,050 matching users. This step is the same and is still fast.
    • Inner Loop: This is the magic. Instead of scanning the event_logs table, it performs 1,050 Index Scans on the idx_event_logs_user_id index. Each Index Scan to find events for a single user_id is an extremely fast operation.
  2. Why is it efficient?: Instead of joining two large datasets, we take one small set (1,050 users) and perform 1,050 small, high-speed lookups on the large table. The total cost of these 1,050 small operations is much cheaper than one giant Hash Join.

  3. Performance Comparison:

    • Execution Time: Dropped from 5123.678 ms (~5.1 seconds) down to 246.015 ms (~0.25 seconds). That's about 20 times faster!

Conclusion

When joining a large table and a small table, try to filter the small table as much as possible first. Then, use that result to query the large table via an index. Rewriting the query with a CTE or subquery can "guide" the planner to this optimal path when it can't figure it out on its own.