Case 12: `EXISTS` vs. `IN` vs. `JOIN` for Existence Checks

The Story

Table 1: categories, containing 1,000 product categories.
Table 2: products, a very large table with 30,000,000 products, with an index on the category_id column.
The Problem: We need to find all categories that have at least one product. We only need the list of categories, not any information about the products themselves.

The Different Query Methods and Their Analysis

Method 1: Using `JOIN` with `DISTINCT` (Often the Slowest)

This is the first approach many people think of.

sql

-- Get categories that have products
EXPLAIN ANALYZE
SELECT DISTINCT c.id, c.name
FROM categories c
JOIN products p ON c.id = p.category_id;

The Problem:
1. Intermediate Data Explosion: The JOIN creates a new row for every single product. If a category has 100,000 products, the JOIN will generate 100,000 intermediate rows for that one category.
2. Expensive DISTINCT: After creating a massive intermediate result set (potentially 30 million rows), the database must perform a huge Sort or Hash Aggregate operation to remove the duplicate categories. This is the bottleneck.

Method 2: Using `IN` with a Subquery

This approach looks a bit cleaner.

sql

EXPLAIN ANALYZE
SELECT id, name
FROM categories
WHERE id IN (SELECT DISTINCT category_id FROM products);

The Problem:
1. Subquery Runs First: The database will usually execute the subquery (SELECT DISTINCT category_id FROM products) first. It has to scan the entire products table (or its category_id index) to build a unique list of all category IDs that have products.
2. Large IN List: This list of IDs could still be quite large (e.g., 900 category IDs). While better than the JOIN, it's still not the most efficient way.

Method 3: Using `EXISTS` (Usually the Fastest) ✅

EXISTS was designed for exactly this kind of problem.

The Logic: For each category in the categories table, check if there exists at least one product that belongs to it.

sql

EXPLAIN ANALYZE
SELECT id, name
FROM categories c
WHERE EXISTS (
    SELECT 1 -- We only need to select a constant value
    FROM products p
    WHERE p.category_id = c.id
);

Why is `EXISTS` the most efficient?

The core strength of EXISTS is its ability to short-circuit.

When checking a category (e.g., 'Electronics'), the subquery inside EXISTS goes to the products table and uses the index to look for a match.
As soon as it finds the very first product in the 'Electronics' category, it immediately returns true and stops searching for that category, moving on to the next one.
It doesn't need to count or find all the products. It just needs a "yes" or "no" answer.

Execution Plan Comparison (A Simulation)

DISTINCT JOIN Plan: Will have a large Hash Join node, followed by an extremely expensive HashAggregate node (to perform the DISTINCT).
IN Plan: Will have a Subquery Scan node to create the list of IDs, which is then joined to the categories table.
EXISTS Plan: Will have a Nested Loop with a (semi-join) type. The inner loop (scanning the products table) will use an index and stop as soon as it finds the first match, making the cost of each loop extremely low.

Method	Logic	Performance	When to Use
`DISTINCT JOIN`	Join everything, then remove duplicates	Slowest 🐢	When you need data from both tables
`IN`	Find all valid IDs, then filter	Decent	Easy to read, good for small, static ID lists
`EXISTS`	For each row, check for existence	Fastest 🚀	When you only need to check for existence

Conclusion:

When you only need to answer a "yes or no" question (e.g., "find customers who have placed an order," "find products that are in stock"), EXISTS is the most performant choice because of its short-circuiting mechanism.

Case 12: EXISTS vs. IN vs. JOIN for Existence Checks ​

The Story ​

The Different Query Methods and Their Analysis ​

Method 1: Using JOIN with DISTINCT (Often the Slowest) ​

Method 2: Using IN with a Subquery ​

Method 3: Using EXISTS (Usually the Fastest) ✅ ​

Why is EXISTS the most efficient? ​

Execution Plan Comparison (A Simulation) ​