Skip to content

Case 24: The Hidden Cost of Choosing the Wrong Data Type

Of course. Let's continue the series with a fundamental topic that is often overlooked but can cause quiet performance problems throughout an application's life: Choosing the right data types.


The Scenario 📝

  • System: A developer is designing a new table for a critical feature: notifications. This table is expected to grow very large.
  • The Problem: The developer needs to decide on the data types for a few important columns. Their choices now will affect performance and storage costs for years to come.

Analyzing the Choices and Their Consequences

1. The status Column: TEXT vs. ENUM vs. smallint

This column needs to store the status of a notification, such as 'unread', 'read', or 'archived'.

  • A Poor Choice (TEXT):

    • The Logic: "Let's use TEXT for flexibility. We can add any new status we want later."
    • The Consequences (The Bottleneck):
      1. Wasted Space: Each row must store the entire text string, like "unread" (6 bytes) or "archived" (8 bytes). With 100 million notifications, this column alone could waste hundreds of megabytes.
      2. Slow Comparisons: Comparing text strings is always slower than comparing numbers.
      3. Large, Inefficient Indexes: An index on a TEXT column is much larger than an index on a number column. This makes searching and maintaining the index slower.
  • A Good Choice (ENUM or smallint):

    • The Logic: Represent the statuses with a more compact data type.
    • Method 1 (ENUM - a PostgreSQL feature):
      sql
      CREATE TYPE notification_status AS ENUM ('unread', 'read', 'archived');
      CREATE TABLE notifications (
          status notification_status
      );
      An ENUM stores the data internally as a small number but displays it as text. This ensures data integrity and high performance.
    • Method 2 (smallint - the traditional way): Use a small integer (which only takes 2 bytes) to represent the status (0 for 'unread', 1 for 'read', 2 for 'archived'). The meaning of these numbers is handled in the application code.
    • The Result: Both methods provide the smallest storage size, the fastest comparison speed, and the most efficient indexes.

2. The Primary Key id Column: BIGINT (auto-increment) vs. UUID

  • Choosing BIGINT (with a SEQUENCE):

    • Advantages: It only takes up 8 bytes. Most importantly, the values are generated sequentially. When you INSERT a new row, the database just adds it to the end of the index, which is a very fast and cache-friendly operation.
    • Disadvantages: You need the database to create the ID, and it's not globally unique.
  • Choosing UUID (specifically, a random UUIDv4):

    • Advantages: It's globally unique, and the application can generate IDs without asking the database.
    • Disadvantages (The Bottleneck):
      1. Large Size: It's 16 bytes, twice the size of a BIGINT.
      2. Index Fragmentation: Because the IDs are generated completely randomly, every new INSERT might have to be written to a random place in the index tree. This causes severe index fragmentation, which slows down both writes and, over time, reads.
  • A Hybrid Solution: Use UUIDv7, as we've discussed before. It combines the advantages of both: it's globally unique and can be sorted sequentially.


Conclusion

  • The Golden Rule: "Always choose the smallest and simplest data type that can accurately represent your data."
  • Decisions made when designing a schema have a deep and long-lasting impact on performance.
  • Don't use strings for columns with a small number of distinct values (like status, type, or gender). Use ENUM, boolean, or small integers instead.
  • Think carefully about your primary key strategy. For tables with a high number of writes, a sequential BIGINT or UUIDv7 is often a much better choice than a random UUIDv4.

Optimizing at the design level can prevent entire classes of performance problems before they even start.