How to Optimize Large Tables in SQL
Preface
Large table optimization is a long-standing topic, but as businesses scale, someone always ends up getting “burned” by it.
Many databases perform well at the beginning—queries are smooth—but once the data volume in a table reaches tens of millions of records, performance issues start to emerge: slow queries, laggy writes, sluggish pagination, and even occasional system crashes.
At that point, you might start wondering: is the database not good enough? Do we need to upgrade to better hardware?
In reality, the root cause is often poor optimization.
Today, we’ll start from the core of the problem, analyzing the common performance bottlenecks of large tables step by step, and exploring how to optimize them one step at a time. Hopefully, this will be helpful to you.
1. Why Are Large Tables Slow?
Before we jump into optimization, let’s first understand the root causes of performance issues in large tables. Why does the database slow down when the data volume grows?
1.1 Disk I/O Bottleneck
Large table data is stored on disk, and database queries usually involve reading data blocks from disk.
When the dataset is very large, a single query might need to read a large amount of data from multiple disk blocks. The read/write speed of the disk directly limits query performance.
Example:
Suppose you have an order table named orders with 50 million records, and you want to query a user’s latest 10 orders:
SELECT * FROM orders WHERE user_id = 123 ORDER BY order_time DESC LIMIT 10;
If there is no index, the database will scan the entire table and then sort the results—performance will undoubtedly suffer.
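To verify whether a query falls back to a full table scan, inspect its execution plan. A minimal sketch using MySQL’s EXPLAIN (output columns vary by version):
-- Show how MySQL plans to execute the query
EXPLAIN SELECT * FROM orders
WHERE user_id = 123
ORDER BY order_time DESC
LIMIT 10;
-- type = ALL in the output indicates a full table scan;
-- after adding a suitable index (see Section 4.1), expect type = ref
-- and the index name in the "key" column.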
1.2 Index Missing or Ineffective
If a query does not hit an index, the database will perform a full table scan—reading all rows one by one.
This operation is extremely resource-intensive with tens of millions of records, and performance drops sharply.
Example:
Let’s say you run a query like this:
SELECT * FROM orders WHERE DATE(order_time) = '2025-01-01';
Here, the DATE() function is applied to order_time, which forces the database to compute DATE(order_time) for every row, so any index on order_time becomes ineffective.
1.3 Pagination Performance Degradation
Pagination is common with large tables, but deep pagination (e.g., after page 100) leads to performance issues.
Even if you only need 10 records, the database still needs to scan all the previous ones first.
Example:
Query 10 records from page 1000:
SELECT * FROM orders ORDER BY order_time DESC LIMIT 9990, 10;
This SQL tells the database to scan the first 10,000 matching rows, discard the first 9,990, and return the remaining 10.
As the page number increases, query performance continues to deteriorate.
1.4 Lock Contention
In high-concurrency scenarios, multiple threads simultaneously performing insert, update, delete, or select operations on the same table can lead to row lock or table lock contention, thereby affecting performance.
2. Overall Approach to Performance Optimization
The essence of performance optimization is to reduce unnecessary I/O, computation, and lock contention. The goal is to minimize “wasted effort” by the database.
The general optimization approach can be summarized as follows:
- Design reasonable table structures: Avoid unnecessary fields; split data where possible.
- Use efficient indexing: Design appropriate index structures and avoid index invalidation.
- Optimize SQL queries: Make conditions precise and avoid full table scans whenever possible.
- Apply table sharding: Use horizontal or vertical sharding to reduce the data volume per table.
- Leverage caching and asynchronization: Reduce direct pressure on the database.
Next, we’ll dive into each part.
3. Table Structure Optimization
The table structure forms the foundation of database performance optimization. Poorly designed table structures can lead to serious issues in both querying and storage performance.
3.1 Streamline Field Types
Field types determine storage size and query performance.
- Use INT instead of BIGINT when possible.
- Use VARCHAR(100) instead of TEXT if appropriate.
- For time fields, it’s better to use TIMESTAMP or DATETIME rather than CHAR or VARCHAR.
Example:
-- Not recommended
CREATE TABLE orders (
    id BIGINT,
    user_id BIGINT,
    order_status VARCHAR(255),
    remarks TEXT
);

-- Optimized
CREATE TABLE orders (
    id BIGINT,
    user_id INT UNSIGNED,
    order_status TINYINT,  -- store status as a small integer code
    remarks VARCHAR(500)   -- set a maximum length
);
This helps save storage space and improves query performance.
3.2 Table Sharding: Vertical and Horizontal Sharding
Vertical Sharding
When a table has too many fields and some of them are not frequently queried, it can be split into multiple smaller tables based on business logic.
Example: Split the orders table into two tables: orders_basic and orders_details.
-- Basic information table
CREATE TABLE orders_basic (
    id BIGINT PRIMARY KEY,
    user_id INT UNSIGNED,
    order_time TIMESTAMP
);

-- Details table
CREATE TABLE orders_details (
    id BIGINT PRIMARY KEY,
    remarks VARCHAR(500),
    shipping_address VARCHAR(255)
);
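When the detail fields are needed, the two tables can be joined back together on the shared primary key. A minimal sketch:
-- Reassemble an order from both tables
SELECT b.id, b.order_time, d.shipping_address
FROM orders_basic b
JOIN orders_details d ON d.id = b.id
WHERE b.user_id = 123;
Most queries only touch orders_basic, so the hot table stays narrow and its rows pack more densely into each data page.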
Horizontal Sharding
When a single table has too many records, you can split it into multiple tables based on certain rules.
Example: Split the orders table by user ID:
orders_0 -- Stores orders where user_id % 2 = 0
orders_1 -- Stores orders where user_id % 2 = 1
After sharding, the number of records per table is greatly reduced, which significantly improves query performance.
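The application layer then has to route each read and write to the correct shard. A minimal sketch in Java (the two-shard modulo scheme mirrors the example above; the helper class is hypothetical):
// Hypothetical router for the orders_0 / orders_1 layout above
public class OrderShardRouter {
    private static final int SHARD_COUNT = 2;

    // Route by user_id so all of one user's orders land in the same table
    public static String tableFor(long userId) {
        return "orders_" + (userId % SHARD_COUNT);
    }

    public static void main(String[] args) {
        System.out.println(tableFor(123)); // orders_1
        System.out.println(tableFor(124)); // orders_0
    }
}
Routing by user_id keeps single-user queries on one shard; queries that span all users must be fanned out to every shard and merged.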
4. Index Optimization
Indexes are the “primary weapon” for database performance optimization. However, many developers are not familiar with how to use indexes effectively, which can lead to degraded rather than improved performance.
4.1 Create Appropriate Indexes
Create indexes for high-frequency query fields, such as primary keys, foreign keys, and fields used in query conditions.
Example:
CREATE INDEX idx_user_id_order_time ON orders (user_id, order_time DESC);
The above composite index can accelerate queries that filter by user_id and sort by order_time at the same time. (Note that descending index keys are only honored from MySQL 8.0 onward; earlier versions parse the DESC keyword but ignore it.)
4.2 Avoid Index Invalidation
Don’t use functions or operations on indexed fields
Incorrect:
SELECT * FROM orders WHERE DATE(order_time) = '2023-01-01';
Optimized:
SELECT * FROM orders WHERE order_time >= '2023-01-01 00:00:00'
AND order_time < '2023-01-02 00:00:00';
Be careful with implicit type conversion
Incorrect:
SELECT * FROM orders WHERE user_id = '123';
Optimized:
SELECT * FROM orders WHERE user_id = 123;
If the data type of the parameter doesn’t match the field type, the database may perform an implicit conversion, which can prevent the index from being used.
5. SQL Optimization
5.1 Reduce Queried Fields
Only query the fields you actually need. Avoid using SELECT *.
-- Incorrect
SELECT * FROM orders WHERE user_id = 123;
-- Optimized
SELECT id, order_time FROM orders WHERE user_id = 123;
Querying unnecessary fields increases I/O and memory usage, especially when the table contains large fields like TEXT or BLOB.
5.2 Pagination Optimization
For deep pagination, use the “seek method” (also called keyset pagination or cursor-based pagination) to avoid scanning excessive data.
-- Deep pagination (poor performance)
SELECT * FROM orders ORDER BY order_time DESC LIMIT 9990, 10;
-- Optimized: use a cursor
SELECT * FROM orders
WHERE order_time < '2023-01-01 12:00:00'
ORDER BY order_time DESC
LIMIT 10;
By using a reference point (order_time in this case), the database can skip unnecessary rows and directly locate the next batch, which improves performance significantly for large datasets.
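One caveat: if order_time is not unique, rows that share a timestamp can be skipped or duplicated across pages. A common remedy is to add a unique tie-breaker such as the primary key to both the cursor condition and the ORDER BY (a sketch; the id value 98765 stands for the last row of the previous page):
-- Keyset pagination with an (order_time, id) tie-breaker
SELECT * FROM orders
WHERE (order_time, id) < ('2023-01-01 12:00:00', 98765)
ORDER BY order_time DESC, id DESC
LIMIT 10;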
6. Database and Table Partitioning
6.1 Horizontal Database and Table Partitioning
If splitting a single table is still not enough to meet performance demands, you can distribute the data across multiple databases and tables.
Common partitioning strategies include:
- Modulo by user ID
- Partition by time
By distributing data across different storage locations, read and write loads can be spread out, reducing contention and improving scalability.
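As a concrete illustration of the time-based strategy, MySQL’s native range partitioning can split one logical table into per-year segments (a minimal sketch; the partition boundaries are illustrative, and the partitioning column must be part of every unique key, including the primary key):
-- Partition the orders table by year
CREATE TABLE orders (
    id BIGINT NOT NULL,
    user_id INT UNSIGNED,
    order_time DATETIME NOT NULL,
    PRIMARY KEY (id, order_time)
)
PARTITION BY RANGE (TO_DAYS(order_time)) (
    PARTITION p2023 VALUES LESS THAN (TO_DAYS('2024-01-01')),
    PARTITION p2024 VALUES LESS THAN (TO_DAYS('2025-01-01')),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);
Queries that filter on order_time then only touch the relevant partitions (partition pruning), and old data can be dropped by partition instead of by slow DELETEs.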
7. Caching and Asynchronization
7.1 Use Redis to Cache Hot Data
For high-frequency queries, store the data in Redis to reduce direct access to the database.
Example:
// Pseudocode sketch: redis and database stand in for your client objects
// Read data from cache
String result = redis.get("orders:user:123");
if (result == null) {
    // Cache miss: load from the database...
    result = database.query("SELECT * FROM orders WHERE user_id = 123");
    // ...then cache the result with a 1-hour TTL
    redis.set("orders:user:123", result, 3600);
}
This approach is especially effective for data that doesn’t change frequently, such as user order history within the last 30 days.
7.2 Use Message Queues to Handle Writes Asynchronously
In high-concurrency write scenarios, write operations can be pushed into a message queue (such as Kafka), and then written to the database in batches asynchronously. This greatly reduces the pressure on the database.
For example, placing order creation requests into a queue and letting background workers persist them to the database can significantly improve system throughput and responsiveness.
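A minimal producer-side sketch using Kafka’s Java client (the topic name order-events and the JSON payload are assumptions; a separate consumer process would read the topic and batch-insert into the database):
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by user_id so one user's orders keep their order within a partition
            String payload = "{\"user_id\":123,\"amount\":59.90}"; // illustrative
            producer.send(new ProducerRecord<>("order-events", "123", payload));
        }
    }
}
The trade-off is eventual consistency: the order exists in the queue before it exists in the database, so reads issued immediately after the write may not see it.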
8. Real-World Case Study
Problem:
An e-commerce system has an orders table with 50 million records. When users query order details, the page takes more than 10 seconds to load.
Solution:
- Vertical sharding of the orders table: Move detailed fields to a separate table.
- Create composite indexes: Add a composite index on user_id and order_time.
- Use Redis caching: Cache the user's last 30 days of order data in Redis.
- Pagination optimization: Use search_after (or similar cursor-based techniques) instead of LIMIT for deep pagination.
These optimizations collectively reduced query response time from over 10 seconds to under 500 milliseconds.
Summary
Optimizing large table performance is a systematic task that requires holistic consideration—from table structure and indexes to SQL queries and architectural design.
A dataset with tens of millions of records may seem massive, but with appropriate sharding, index design, and caching strategies, databases can handle such workloads with ease.
The most important takeaway is to choose the optimization strategy that best fits your specific business scenario—don’t blindly pursue “fancy” or overengineered solutions.
We hope these lessons and techniques are helpful to you!
We are Leapcell, your top choice for hosting backend projects.
Leapcell is the Next-Gen Serverless Platform for Web Hosting, Async Tasks, and Redis:
Multi-Language Support
- Develop with Node.js, Python, Go, or Rust.
Deploy Unlimited Projects for Free
- Pay only for usage — no requests, no charges.
Unbeatable Cost Efficiency
- Pay-as-you-go with no idle charges.
- Example: $25 supports 6.94M requests at a 60ms average response time.
Streamlined Developer Experience
- Intuitive UI for effortless setup.
- Fully automated CI/CD pipelines and GitOps integration.
- Real-time metrics and logging for actionable insights.
Effortless Scalability and High Performance
- Auto-scaling to handle high concurrency with ease.
- Zero operational overhead — just focus on building.
Explore more in the Documentation!
Follow us on X: @LeapcellHQ