How Indexes Really Work

Alright, we know indexes make databases faster. But why? It's not just magic; it's about efficiently navigating the way data is physically stored and reducing the amount of work the database has to do, specifically when reading from disk. Think of your database table, maybe a Users table, as a massive collection of records (rows in SQL, documents in NoSQL). These records aren't just floating in memory; they have to be serialized and stored persistently on disk (HDD or SSD). The Slow Lane: Life Without Indexes (Full Table Scans) Let's visualize how data hits the disk. Imagine our Users table: CREATE TABLE Users ( ID INT PRIMARY KEY, -- Let's say 4 bytes Name VARCHAR(60), -- 60 bytes Age INT, -- 4 bytes Bio TEXT, -- 128 bytes (can be large!) TotalBlogs INT -- 4 bytes ); Based on these sizes, each user record takes up 4 + 60 + 4 + 128 + 4 = 200 bytes when stored. Now, disks don't read data byte-by-byte. They read in blocks (or pages). A common block size is 4KB, but for simplicity, let's use the transcript's example of 600 bytes per block. If each record is 200 bytes and a block is 600 bytes, we can fit 600 / 200 = 3 user records into a single disk block. Let's say our Users table has 100 rows. To store these, we'd need: 100 rows / 3 rows/block = 33.33 blocks Since we can't have a fraction of a block, we need 34 blocks to store the entire table sequentially. Now, the critical part: Disk I/O is slow! Reading a block from disk is orders of magnitude slower than reading from memory. Consider this query: SELECT * FROM Users WHERE Age = 23; Without an index on Age, how does the database find these users? It has no choice but to perform a full table scan: Read Block 1 (contains rows 1-3) into memory. Check the Age for each of those 3 rows. Keep matches. Discard the block (conceptually). Read Block 2 (contains rows 4-6) into memory. Check the Age for rows 4-6. Keep matches. Discard the block. ... Repeat for all 34 blocks. Even if only two users have Age = 23, the database still had to read and process all 34 blocks just to find them. If reading one block takes a hypothetical 1 second (as per the transcript's illustration – it's much faster in reality, but the principle holds), this query takes 34 seconds. That's painful. The Fast Lane: Introducing Indexes An index is essentially a separate, smaller, and specially structured table that acts like a quick lookup guide for your main table data. It stores the indexed column's value and a pointer (like the row ID or primary key) to the actual data row. Let's create an index on our Age column: CREATE INDEX idx_users_age ON Users(Age); This idx_users_age index might look something like this conceptually (it's often stored in more complex structures like B+ Trees, but let's simplify): Age Row ID (Pointer) 21 2 21 7 22 3 22 5 23 1 23 4 24 6 ... ... (Sorted by Age) Key characteristics: Smaller: Each index entry only contains the indexed value (Age, 4 bytes) and the pointer (ID, 4 bytes) = 8 bytes. Much smaller than the full 200-byte record! Sorted: The index is kept sorted by the indexed value (Age). This is crucial for fast lookups. Now, let's calculate the size of this index for our 100-row table: 100 entries * 8 bytes/entry = 800 bytes. Using our 600-byte blocks, how many blocks does the index need? 800 bytes / 600 bytes/block = 1.33 blocks -> 2 blocks. How the Query Runs With the Index Let's run that same query again: SELECT * FROM Users WHERE Age = 23; This time, the database query planner sees the index on Age and uses it: Scan the Index: Instead of scanning the main table, it scans the tiny index. Because the index is sorted, it can quickly find all entries where Age = 23. In the worst case (using the transcript's simple scan model), it reads the entire index. How many blocks is that? Just 2 blocks. (Optimization Note: Real databases using B+ Trees are even smarter and often don't need to read all index blocks, but let's stick to the simple model for comparison.) Identify Relevant Row IDs: From scanning the index, the database finds the pointers for rows where Age = 23. In our example, it gets Row ID 1 and Row ID 4. Fetch Actual Rows: Now, armed with the specific Row IDs, the database goes directly to the main table blocks containing those rows. Row ID 1 is in Block 1. Read Block 1 (1 block read). Row ID 4 is in Block 2. Read Block 2 (1 block read). Return Results: Collect the full data for rows 1 and 4 and return them. Total disk blocks read in this indexed query: 2 blocks (for index scan) + 1 block (for row 1) + 1 block (for row 4) = **4 blocks**. The Payoff: Comparing Performance Without Index: 34 block reads (34 hypothetical seconds) With Index: 4 block reads (4 hypothetical seconds) That's an 8x performance improvement in this simplified example, just by add

Apr 2, 2025 - 04:24
 0
How Indexes Really Work

Alright, we know indexes make databases faster. But why? It's not just magic; it's about efficiently navigating the way data is physically stored and reducing the amount of work the database has to do, specifically when reading from disk.

Think of your database table, maybe a Users table, as a massive collection of records (rows in SQL, documents in NoSQL). These records aren't just floating in memory; they have to be serialized and stored persistently on disk (HDD or SSD).

The Slow Lane: Life Without Indexes (Full Table Scans)

Let's visualize how data hits the disk. Imagine our Users table:

CREATE TABLE Users (
    ID INT PRIMARY KEY, -- Let's say 4 bytes
    Name VARCHAR(60),   -- 60 bytes
    Age INT,            -- 4 bytes
    Bio TEXT,           -- 128 bytes (can be large!)
    TotalBlogs INT      -- 4 bytes
);

Based on these sizes, each user record takes up 4 + 60 + 4 + 128 + 4 = 200 bytes when stored.

Now, disks don't read data byte-by-byte. They read in blocks (or pages). A common block size is 4KB, but for simplicity, let's use the transcript's example of 600 bytes per block.

Block Allocation

If each record is 200 bytes and a block is 600 bytes, we can fit 600 / 200 = 3 user records into a single disk block.

Index Allocation

Let's say our Users table has 100 rows. To store these, we'd need:
100 rows / 3 rows/block = 33.33 blocks
Since we can't have a fraction of a block, we need 34 blocks to store the entire table sequentially.

Now, the critical part: Disk I/O is slow! Reading a block from disk is orders of magnitude slower than reading from memory.

Consider this query:

SELECT * FROM Users WHERE Age = 23;

Without an index on Age, how does the database find these users? It has no choice but to perform a full table scan:

  1. Read Block 1 (contains rows 1-3) into memory.
  2. Check the Age for each of those 3 rows. Keep matches. Discard the block (conceptually).
  3. Read Block 2 (contains rows 4-6) into memory.
  4. Check the Age for rows 4-6. Keep matches. Discard the block.
  5. ... Repeat for all 34 blocks.

Searching in whole table

Even if only two users have Age = 23, the database still had to read and process all 34 blocks just to find them. If reading one block takes a hypothetical 1 second (as per the transcript's illustration – it's much faster in reality, but the principle holds), this query takes 34 seconds. That's painful.

The Fast Lane: Introducing Indexes

An index is essentially a separate, smaller, and specially structured table that acts like a quick lookup guide for your main table data. It stores the indexed column's value and a pointer (like the row ID or primary key) to the actual data row.

Let's create an index on our Age column:

CREATE INDEX idx_users_age ON Users(Age);

This idx_users_age index might look something like this conceptually (it's often stored in more complex structures like B+ Trees, but let's simplify):

Age Row ID (Pointer)
21 2
21 7
22 3
22 5
23 1
23 4
24 6
... ...
(Sorted by Age)

Key characteristics:

  1. Smaller: Each index entry only contains the indexed value (Age, 4 bytes) and the pointer (ID, 4 bytes) = 8 bytes. Much smaller than the full 200-byte record!
  2. Sorted: The index is kept sorted by the indexed value (Age). This is crucial for fast lookups.

Now, let's calculate the size of this index for our 100-row table:
100 entries * 8 bytes/entry = 800 bytes.

Using our 600-byte blocks, how many blocks does the index need?
800 bytes / 600 bytes/block = 1.33 blocks -> 2 blocks.

How the Query Runs With the Index

Let's run that same query again:

SELECT * FROM Users WHERE Age = 23;

This time, the database query planner sees the index on Age and uses it:

  1. Scan the Index: Instead of scanning the main table, it scans the tiny index. Because the index is sorted, it can quickly find all entries where Age = 23. In the worst case (using the transcript's simple scan model), it reads the entire index. How many blocks is that? Just 2 blocks.
    • (Optimization Note: Real databases using B+ Trees are even smarter and often don't need to read all index blocks, but let's stick to the simple model for comparison.)
  2. Identify Relevant Row IDs: From scanning the index, the database finds the pointers for rows where Age = 23. In our example, it gets Row ID 1 and Row ID 4.
  3. Fetch Actual Rows: Now, armed with the specific Row IDs, the database goes directly to the main table blocks containing those rows.
    • Row ID 1 is in Block 1. Read Block 1 (1 block read).
    • Row ID 4 is in Block 2. Read Block 2 (1 block read).
  4. Return Results: Collect the full data for rows 1 and 4 and return them.

Finding Index

Total disk blocks read in this indexed query:
2 blocks (for index scan) + 1 block (for row 1) + 1 block (for row 4) = **4 blocks**.

The Payoff: Comparing Performance

  • Without Index: 34 block reads (34 hypothetical seconds)
  • With Index: 4 block reads (4 hypothetical seconds)

That's an 8x performance improvement in this simplified example, just by adding an index! In real-world scenarios with larger tables and more complex queries, the difference can be even more dramatic (100x, 1000x, or more!).

Why Does This Matter For You?

  1. Reduced Disk I/O: This is the core win. Indexes drastically cut down the number of blocks read from slow disk storage.
  2. Targeted Reads: Instead of blindly scanning everything, the database uses the index to pinpoint exactly which data blocks it needs.
  3. Query Planning: Understand that your database's query planner tries to use indexes for columns in your WHERE, JOIN ON, ORDER BY, and sometimes GROUP BY clauses.
  4. Index Selectivity: Indexes work best on columns where the values are diverse (high selectivity). Indexing a gender column with only 'M', 'F', 'Other' might be less effective than indexing an email column.
  5. The Trade-off: Indexes aren't free! They consume disk space and slightly slow down INSERT, UPDATE, and DELETE operations because the index needs to be updated too. Choose your indexes wisely based on your read patterns.

Conclusion

Indexes are not a silver bullet, but they are arguably the single most impactful tool for optimizing database read performance. By creating these smaller, sorted lookup structures, we drastically reduce the need for costly full table scans, minimizing disk I/O and making our queries fly.

Next time you face a slow query, don't just throw more hardware at it. Analyze the query plan (EXPLAIN ANALYZE), check your WHERE and JOIN clauses, and ensure you have appropriate indexes on the right columns. Your users (and your servers) will thank you.
Credit : Arpit Bhayani