Skip to content

I'm glad you're finding the series helpful! We'll continue with a more architectural topic, where we intentionally "break" standard design rules to achieve extreme performance: Denormalization.


Case 15: Denormalization for Extreme Read Performance

The Story

  • System: A social network with massive traffic, similar to Twitter or Facebook.
  • Normalized Schema:
    • users (100 million users): id, username, avatar_url, ...
    • posts (billions of posts): id, user_id (foreign key to users.id), content, created_at.
  • The Problem: The "news feed" is the most frequently accessed page. To display 50 posts on the feed, the system has to run a query that JOINs the posts and users tables to get the author's username and avatar_url.

The SQL Query in a Normalized World

sql
SELECT
    p.content,
    p.created_at,
    u.username,
    u.avatar_url
FROM posts p
JOIN users u ON p.user_id = u.id
ORDER BY p.created_at DESC
LIMIT 50;

The Bottleneck

  1. JOIN at Massive Scale: Although a single JOIN is very fast, when it's executed millions of times per minute by all online users, it creates a huge load on the database. The users table becomes a "hotspot," constantly being accessed.
  2. Latency: Every JOIN has a certain cost. At a large scale, this small cost adds up to noticeable latency.

The Solution: Denormalization ✅

  • The Logic: Instead of JOINing to get the data every time we read, we will duplicate the less frequently changed data and store it directly where it's needed.

  • The New Schema Design: Modify the posts table to include the author's information.

    • posts: id, user_id, content, created_at, author_username, author_avatar_url.
  • How it Works:

    • On Write: When a user creates a new post, the application fetches their username and avatar_url and saves them directly into the new columns in the posts table.
    • On Read: The query to get the news feed now becomes incredibly simple and fast.
    sql
    -- NO JOIN NEEDED!
    SELECT
        content,
        created_at,
        author_username,
        author_avatar_url
    FROM posts
    ORDER BY created_at DESC
    LIMIT 50;

    This query only needs to scan the created_at index of a single table, completely eliminating the burden of the JOIN.


The Trade-off

This is a significant architectural decision with a clear trade-off.

  • The Gain:

    • Extremely Fast Reads: The system's most critical query (reading the feed) becomes as fast as possible.
    • Reduced Database Load: Significantly reduces the number of JOINs, helping the database handle more read requests.
  • The Cost:

    • More Complex Writes:
      • The logic for INSERTing a new post now has to do the extra work of copying data.
      • The Biggest Problem: When a user changes their username or avatar, you must find all of their old posts to update the copied information. This can be a very heavy and complex background job.
    • Redundant and Potentially Inconsistent Data:
      • You are storing the same data in multiple places, which uses more disk space.
      • There is a risk of data becoming inconsistent if the background update job fails (e.g., old posts showing the old username).

Conclusion

  • Denormalization is a high-level optimization technique used when read performance is the absolute priority and a specific JOIN has been identified as the main system bottleneck under high load.
  • It is a conscious trade-off: making writes more complex to make reads much, much simpler and faster.
  • This is not something you should do by default, but rather a solution for performance problems at a massive scale.