Skip to content

Of course, let's proceed with Case 22. This time, we'll look at a topic that seems basic but hides many performance traps: the different variations of the COUNT function.


Case 22: The Subtle Costs of COUNT: * vs. column vs. DISTINCT

The Story

  • Table: page_views, a very large table recording every page view, with 200,000,000 rows.
  • Columns: id, url, user_id (can be NULL for guests), country_code.
  • The Problem: We need to answer three different analytical questions:
    1. What is the total number of page views?
    2. How many page views came from logged-in users?
    3. How many unique users have viewed a page?

Analyzing Each Method

1. COUNT(*): The Fastest Way to Count All Rows

  • Query: SELECT COUNT(*) FROM page_views;
  • How it works: COUNT(*) has a single purpose: to count the total number of rows in the table. It doesn't need to look at the values in any column.
  • Performance: PostgreSQL is highly optimized for this task. It will often find the smallest available index on the table and perform a super-fast Index-Only Scan to count the entries, which is much more efficient than scanning the entire table.
  • Conclusion: Always use COUNT(*) when you want the total row count. It's fast and clearly expresses your intent.

2. COUNT(column): Counting Non-NULL Values

  • Query: SELECT COUNT(user_id) FROM page_views;
  • How it works: Unlike COUNT(*), COUNT(user_id) has a more specific job: to count the number of rows where the user_id column is NOT NULL. It has to check the value of the user_id column in each row.
  • Performance: It's usually slightly slower than COUNT(*). It still needs to scan an index or the table, but with the added cost of checking for NULL. If the column is not indexed, it will have to perform a Full Table Scan.
  • Conclusion: Only use COUNT(column) when you explicitly intend to exclude NULL values.

3. COUNT(DISTINCT column): The Potential Performance Killer 🐢

  • Query: SELECT COUNT(DISTINCT user_id) FROM page_views;
  • How it works: This is an extremely heavy operation. To execute it, the database must:
    1. Scan through all 200 million rows.
    2. Extract the user_id value (ignoring NULLs).
    3. Maintain a list of all unique user_ids it has already seen.
    4. To do this, it usually has to perform a massive Sort operation or build a very large Hash Table in memory to keep track of the values.
  • The Bottleneck: This Sort or Hash operation is incredibly expensive. If the list of unique user_ids doesn't fit into memory (work_mem), the database will have to write temporary data to disk (spill to disk), making the query extremely slow.
  • Conclusion: This is the most expensive form of COUNT. Be extremely cautious when using it on large tables.

Summary Table

MethodPurposePerformance
COUNT(*)Counts all rowsFastest 🚀
COUNT(column)Counts rows where column is not NULLFast
COUNT(DISTINCT column)Counts the number of unique valuesSlow (Potentially Extremely Slow)

Pro Tip: If you need to display a COUNT(DISTINCT) metric on a dashboard and the data doesn't need to be 100% real-time, consider using a Materialized View (as in Case 16) to pre-calculate this value on a schedule.