Skip to content

Case 12: EXISTS vs. IN vs. JOIN for Existence Checks

The Story

  • Table 1: categories, containing 1,000 product categories.
  • Table 2: products, a very large table with 30,000,000 products, with an index on the category_id column.
  • The Problem: We need to find all categories that have at least one product. We only need the list of categories, not any information about the products themselves.

The Different Query Methods and Their Analysis

Method 1: Using JOIN with DISTINCT (Often the Slowest)

This is the first approach many people think of.

sql
-- Get categories that have products
EXPLAIN ANALYZE
SELECT DISTINCT c.id, c.name
FROM categories c
JOIN products p ON c.id = p.category_id;
  • The Problem:
    1. Intermediate Data Explosion: The JOIN creates a new row for every single product. If a category has 100,000 products, the JOIN will generate 100,000 intermediate rows for that one category.
    2. Expensive DISTINCT: After creating a massive intermediate result set (potentially 30 million rows), the database must perform a huge Sort or Hash Aggregate operation to remove the duplicate categories. This is the bottleneck.

Method 2: Using IN with a Subquery

This approach looks a bit cleaner.

sql
EXPLAIN ANALYZE
SELECT id, name
FROM categories
WHERE id IN (SELECT DISTINCT category_id FROM products);
  • The Problem:
    1. Subquery Runs First: The database will usually execute the subquery (SELECT DISTINCT category_id FROM products) first. It has to scan the entire products table (or its category_id index) to build a unique list of all category IDs that have products.
    2. Large IN List: This list of IDs could still be quite large (e.g., 900 category IDs). While better than the JOIN, it's still not the most efficient way.

Method 3: Using EXISTS (Usually the Fastest) ✅

EXISTS was designed for exactly this kind of problem.

  • The Logic: For each category in the categories table, check if there exists at least one product that belongs to it.
sql
EXPLAIN ANALYZE
SELECT id, name
FROM categories c
WHERE EXISTS (
    SELECT 1 -- We only need to select a constant value
    FROM products p
    WHERE p.category_id = c.id
);

Why is EXISTS the most efficient?

The core strength of EXISTS is its ability to short-circuit.

  • When checking a category (e.g., 'Electronics'), the subquery inside EXISTS goes to the products table and uses the index to look for a match.
  • As soon as it finds the very first product in the 'Electronics' category, it immediately returns true and stops searching for that category, moving on to the next one.
  • It doesn't need to count or find all the products. It just needs a "yes" or "no" answer.

Execution Plan Comparison (A Simulation)

  • DISTINCT JOIN Plan: Will have a large Hash Join node, followed by an extremely expensive HashAggregate node (to perform the DISTINCT).
  • IN Plan: Will have a Subquery Scan node to create the list of IDs, which is then joined to the categories table.
  • EXISTS Plan: Will have a Nested Loop with a (semi-join) type. The inner loop (scanning the products table) will use an index and stop as soon as it finds the first match, making the cost of each loop extremely low.
MethodLogicPerformanceWhen to Use
DISTINCT JOINJoin everything, then remove duplicatesSlowest 🐢When you need data from both tables
INFind all valid IDs, then filterDecentEasy to read, good for small, static ID lists
EXISTSFor each row, check for existenceFastest 🚀When you only need to check for existence

Conclusion:

When you only need to answer a "yes or no" question (e.g., "find customers who have placed an order," "find products that are in stock"), EXISTS is the most performant choice because of its short-circuiting mechanism.