BanTool Documentation

Of course, let's proceed with Case 22. This time, we'll look at a topic that seems basic but hides many performance traps: the different variations of the COUNT function.

Case 22: The Subtle Costs of `COUNT`: `*` vs. `column` vs. `DISTINCT`

The Story

Table: page_views, a very large table recording every page view, with 200,000,000 rows.
Columns: id, url, user_id (can be NULL for guests), country_code.
The Problem: We need to answer three different analytical questions:
1. What is the total number of page views?
2. How many page views came from logged-in users?
3. How many unique users have viewed a page?

Analyzing Each Method

1. `COUNT(*)`: The Fastest Way to Count All Rows

Query: SELECT COUNT(*) FROM page_views;
How it works: COUNT(*) has a single purpose: to count the total number of rows in the table. It doesn't need to look at the values in any column.
Performance: PostgreSQL is highly optimized for this task. It will often find the smallest available index on the table and perform a super-fast Index-Only Scan to count the entries, which is much more efficient than scanning the entire table.
Conclusion: Always use COUNT(*) when you want the total row count. It's fast and clearly expresses your intent.

2. `COUNT(column)`: Counting Non-`NULL` Values

Query: SELECT COUNT(user_id) FROM page_views;
How it works: Unlike COUNT(*), COUNT(user_id) has a more specific job: to count the number of rows where the user_id column is NOT NULL. It has to check the value of the user_id column in each row.
Performance: It's usually slightly slower than COUNT(*). It still needs to scan an index or the table, but with the added cost of checking for NULL. If the column is not indexed, it will have to perform a Full Table Scan.
Conclusion: Only use COUNT(column) when you explicitly intend to exclude NULL values.

3. `COUNT(DISTINCT column)`: The Potential Performance Killer 🐢

Query: SELECT COUNT(DISTINCT user_id) FROM page_views;
How it works: This is an extremely heavy operation. To execute it, the database must:
1. Scan through all 200 million rows.
2. Extract the user_id value (ignoring NULLs).
3. Maintain a list of all unique user_ids it has already seen.
4. To do this, it usually has to perform a massive Sort operation or build a very large Hash Table in memory to keep track of the values.
The Bottleneck: This Sort or Hash operation is incredibly expensive. If the list of unique user_ids doesn't fit into memory (work_mem), the database will have to write temporary data to disk (spill to disk), making the query extremely slow.
Conclusion: This is the most expensive form of COUNT. Be extremely cautious when using it on large tables.

Summary Table

Method	Purpose	Performance
*`COUNT()`**	Counts all rows	Fastest 🚀
`COUNT(column)`	Counts rows where `column` is not `NULL`	Fast
`COUNT(DISTINCT column)`	Counts the number of unique values	Slow (Potentially Extremely Slow)

Pro Tip: If you need to display a COUNT(DISTINCT) metric on a dashboard and the data doesn't need to be 100% real-time, consider using a Materialized View (as in Case 16) to pre-calculate this value on a schedule.

Case 22: The Subtle Costs of COUNT: * vs. column vs. DISTINCT ​

The Story ​

Analyzing Each Method ​

1. COUNT(*): The Fastest Way to Count All Rows ​

2. COUNT(column): Counting Non-NULL Values ​

3. COUNT(DISTINCT column): The Potential Performance Killer 🐢 ​