Of course, let's proceed with Case 22. This time, we'll look at a topic that seems basic but hides many performance traps: the different variations of the COUNT function.
Case 22: The Subtle Costs of COUNT: * vs. column vs. DISTINCT
The Story
- Table:
page_views, a very large table recording every page view, with 200,000,000 rows. - Columns:
id,url,user_id(can beNULLfor guests),country_code. - The Problem: We need to answer three different analytical questions:
- What is the total number of page views?
- How many page views came from logged-in users?
- How many unique users have viewed a page?
Analyzing Each Method
1. COUNT(*): The Fastest Way to Count All Rows
- Query:
SELECT COUNT(*) FROM page_views; - How it works:
COUNT(*)has a single purpose: to count the total number of rows in the table. It doesn't need to look at the values in any column. - Performance: PostgreSQL is highly optimized for this task. It will often find the smallest available index on the table and perform a super-fast Index-Only Scan to count the entries, which is much more efficient than scanning the entire table.
- Conclusion: Always use
COUNT(*)when you want the total row count. It's fast and clearly expresses your intent.
2. COUNT(column): Counting Non-NULL Values
- Query:
SELECT COUNT(user_id) FROM page_views; - How it works: Unlike
COUNT(*),COUNT(user_id)has a more specific job: to count the number of rows where theuser_idcolumn is NOTNULL. It has to check the value of theuser_idcolumn in each row. - Performance: It's usually slightly slower than
COUNT(*). It still needs to scan an index or the table, but with the added cost of checking forNULL. If the column is not indexed, it will have to perform a Full Table Scan. - Conclusion: Only use
COUNT(column)when you explicitly intend to excludeNULLvalues.
3. COUNT(DISTINCT column): The Potential Performance Killer 🐢
- Query:
SELECT COUNT(DISTINCT user_id) FROM page_views; - How it works: This is an extremely heavy operation. To execute it, the database must:
- Scan through all 200 million rows.
- Extract the
user_idvalue (ignoringNULLs). - Maintain a list of all unique
user_ids it has already seen. - To do this, it usually has to perform a massive
Sortoperation or build a very largeHash Tablein memory to keep track of the values.
- The Bottleneck: This
SortorHashoperation is incredibly expensive. If the list of uniqueuser_ids doesn't fit into memory (work_mem), the database will have to write temporary data to disk (spill to disk), making the query extremely slow. - Conclusion: This is the most expensive form of
COUNT. Be extremely cautious when using it on large tables.
Summary Table
| Method | Purpose | Performance |
|---|---|---|
COUNT(*) | Counts all rows | Fastest 🚀 |
COUNT(column) | Counts rows where column is not NULL | Fast |
COUNT(DISTINCT column) | Counts the number of unique values | Slow (Potentially Extremely Slow) |
Pro Tip: If you need to display a COUNT(DISTINCT) metric on a dashboard and the data doesn't need to be 100% real-time, consider using a Materialized View (as in Case 16) to pre-calculate this value on a schedule.