Case 12: EXISTS vs. IN vs. JOIN for Existence Checks
The Story
- Table 1:
categories, containing 1,000 product categories. - Table 2:
products, a very large table with 30,000,000 products, with an index on thecategory_idcolumn. - The Problem: We need to find all categories that have at least one product. We only need the list of categories, not any information about the products themselves.
The Different Query Methods and Their Analysis
Method 1: Using JOIN with DISTINCT (Often the Slowest)
This is the first approach many people think of.
sql
-- Get categories that have products
EXPLAIN ANALYZE
SELECT DISTINCT c.id, c.name
FROM categories c
JOIN products p ON c.id = p.category_id;- The Problem:
- Intermediate Data Explosion: The
JOINcreates a new row for every single product. If a category has 100,000 products, theJOINwill generate 100,000 intermediate rows for that one category. - Expensive
DISTINCT: After creating a massive intermediate result set (potentially 30 million rows), the database must perform a hugeSortorHash Aggregateoperation to remove the duplicate categories. This is the bottleneck.
- Intermediate Data Explosion: The
Method 2: Using IN with a Subquery
This approach looks a bit cleaner.
sql
EXPLAIN ANALYZE
SELECT id, name
FROM categories
WHERE id IN (SELECT DISTINCT category_id FROM products);- The Problem:
- Subquery Runs First: The database will usually execute the subquery
(SELECT DISTINCT category_id FROM products)first. It has to scan the entireproductstable (or itscategory_idindex) to build a unique list of all category IDs that have products. - Large
INList: This list of IDs could still be quite large (e.g., 900 category IDs). While better than theJOIN, it's still not the most efficient way.
- Subquery Runs First: The database will usually execute the subquery
Method 3: Using EXISTS (Usually the Fastest) ✅
EXISTS was designed for exactly this kind of problem.
- The Logic: For each category in the
categoriestable, check if there exists at least one product that belongs to it.
sql
EXPLAIN ANALYZE
SELECT id, name
FROM categories c
WHERE EXISTS (
SELECT 1 -- We only need to select a constant value
FROM products p
WHERE p.category_id = c.id
);Why is EXISTS the most efficient?
The core strength of EXISTS is its ability to short-circuit.
- When checking a category (e.g., 'Electronics'), the subquery inside
EXISTSgoes to theproductstable and uses the index to look for a match. - As soon as it finds the very first product in the 'Electronics' category, it immediately returns
trueand stops searching for that category, moving on to the next one. - It doesn't need to count or find all the products. It just needs a "yes" or "no" answer.
Execution Plan Comparison (A Simulation)
DISTINCT JOINPlan: Will have a largeHash Joinnode, followed by an extremely expensiveHashAggregatenode (to perform theDISTINCT).INPlan: Will have aSubquery Scannode to create the list of IDs, which is then joined to thecategoriestable.EXISTSPlan: Will have aNested Loopwith a(semi-join)type. The inner loop (scanning theproductstable) will use an index and stop as soon as it finds the first match, making the cost of each loop extremely low.
| Method | Logic | Performance | When to Use |
|---|---|---|---|
DISTINCT JOIN | Join everything, then remove duplicates | Slowest 🐢 | When you need data from both tables |
IN | Find all valid IDs, then filter | Decent | Easy to read, good for small, static ID lists |
EXISTS | For each row, check for existence | Fastest 🚀 | When you only need to check for existence |
Conclusion:
When you only need to answer a "yes or no" question (e.g., "find customers who have placed an order," "find products that are in stock"), EXISTS is the most performant choice because of its short-circuiting mechanism.