7 Retry Patterns You Should Know

Apr 17, 2025 - 05:23

Preface

In business systems, failure is the norm. Whether the cause is network fluctuation, service overload, or an unstable third-party interface, a system must have "self-healing" capabilities to handle occasional anomalies.

Retry mechanisms are one of the core components of a system’s self-recovery ability.

However, retries are a double-edged sword: when designed well, they improve success rates and enhance user experience; when designed poorly, they can lead to request storms, cascading failures, and even amplify problems into incidents.

This article discusses 7 commonly used retry strategies.

1. Brute-force Looping

Problem Scenario

A user registration SMS-sending interface repeatedly calls a third-party SMS API in a while loop.

Code example:

public void sendSms(String phone) throws InterruptedException {
    int retry = 0;
    while (retry < 5) { // brute-force loop: retries on every exception type
        try {
            smsClient.send(phone);
            break;
        } catch (Exception e) {
            retry++;
            Thread.sleep(1000); // fixed 1-second delay, no backoff or jitter
        }
    }
}

Incident

The SMS server experienced overload, causing every request to be delayed by 3 seconds.

This brute-force code triggered tens of thousands of retries within 0.5 seconds, overwhelming the SMS platform and tripping a circuit breaker that then rejected even normal requests.

Lessons:

No delay interval adjustment: fixed delay caused retry bursts
Ignored exception types: retried even on non-transient errors (e.g., invalid parameters)
Fix: introduce random delays and filter out non-retriable exceptions
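
As a minimal sketch of the fix above (not the code from the incident), the following plain-Java helper retries with exponential backoff plus random jitter and gives up immediately on exceptions the caller marks as non-retriable. The class name BackoffRetry, its method names, and its parameters are assumptions made for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Predicate;

class BackoffRetry {

    /**
     * Delay before retry number `attempt` (1-based): the base delay doubles on
     * each attempt, is capped at maxDelayMs, and is then randomized ("full jitter").
     */
    static long jitteredDelayMs(int attempt, long baseDelayMs, long maxDelayMs) {
        long exponential = baseDelayMs * (1L << Math.min(attempt - 1, 20));
        long capped = Math.min(exponential, maxDelayMs);
        return ThreadLocalRandom.current().nextLong(capped + 1); // uniform in [0, capped]
    }

    /** Runs `call`, retrying only exceptions accepted by `retriable`, up to maxAttempts calls in total. */
    static <T> T callWithRetry(Callable<T> call, int maxAttempts, long baseDelayMs,
                               long maxDelayMs, Predicate<Exception> retriable) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                // Stop on the last attempt or on a non-transient error (e.g., invalid parameters)
                if (attempt >= maxAttempts || !retriable.test(e)) {
                    throw e;
                }
                Thread.sleep(jitteredDelayMs(attempt, baseDelayMs, maxDelayMs));
            }
        }
    }
}
```

For the SMS case, a call might look like `BackoffRetry.callWithRetry(() -> { smsClient.send(phone); return null; }, 5, 1000, 10_000, e -> e instanceof TimeoutException)`, assuming timeouts are the only transient error worth retrying. The jitter spreads concurrent callers apart so they do not all retry at the same instant.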

2. Spring Retry

Use Case

Spring Retry suits small to medium-sized projects that need basic retries and circuit breaking (e.g., order status query APIs) with minimal setup, since both are configured through annotations.

Annotating a method with @Retryable adds retry logic to it.

Configuration Example

@Retryable(
    value = {TimeoutException.class}, // retry only on timeout
    maxAttempts = 3,                  // total attempts: 1 initial call + 2 retries
    backoff = @Backoff(delay = 1000, multiplier = 2) // waits 1s, then 2s
)
public boolean queryOrderStatus(String orderId) throws TimeoutException {
    return httpClient.get("/order/" + orderId);
}

@Recover // fallback once all attempts have failed; return type must match
public boolean fallback(TimeoutException e, String orderId) {
    return false;
}

Advantages

Declarative annotations: clean code, decoupled from business logic
Exponential backoff: automatically increases retry intervals
Circuit breaker integration: combined with @CircuitBreaker to block failure traffic quickly

3. Resilience4j

Advanced Scenario

For more complex systems requiring custom backoff algorithms, circuit breaker strategies, and multi-layer protection (e.g., core payment APIs), Resilience4j is recommended.

Core code:

// 1. Retry config: exponential backoff + jitter
RetryConfig retryConfig = RetryConfig.custom()
    .maxAttempts(3)
    .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(
        1000L, // initial 1s delay
        2.0,   // exponential multiplier
        0.3    // jitter factor
    ))
    .retryOnException(e -> e instanceof TimeoutException)
    .build();

// 2. Circuit breaker config: open when the failure rate reaches 50% over the last 10 calls
CircuitBreakerConfig cbConfig = CircuitBreakerConfig.custom()
    .slidingWindow(10, 10, CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
    .failureRateThreshold(50)
    .build();

// Combined use
Retry retry = Retry.of("payment", retryConfig);
CircuitBreaker cb = CircuitBreaker.of("payment", cbConfig);

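// Execute business logic. Each decorator wraps the ones added before it, so here
// the circuit breaker is outermost and records one outcome per call, after retries.
Supplier<Boolean> supplier = () -> paymentService.pay();
Supplier<Boolean> decorated = Decorators.ofSupplier(supplier)
    .withRetry(retry)
    .withCircuitBreaker(cb)
    .decorate();
Boolean paid = decorated.get();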
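
To make the count-based sliding window concrete, here is a simplified plain-Java sketch of how a failure rate over the last N calls can trip a breaker. It is not Resilience4j's implementation: the class SlidingWindowBreaker is hypothetical, and it omits the half-open state and the wait duration that a real circuit breaker uses to recover.

```java
class SlidingWindowBreaker {
    private final boolean[] window;            // ring buffer, true = failed call
    private final float failureRateThreshold;  // percentage, e.g. 50
    private int recorded = 0;                  // calls recorded so far, capped at window size
    private int next = 0;                      // index of the slot to overwrite next
    private int failures = 0;                  // failed calls currently in the window
    private boolean open = false;

    SlidingWindowBreaker(int windowSize, float failureRateThreshold) {
        this.window = new boolean[windowSize];
        this.failureRateThreshold = failureRateThreshold;
    }

    /** Records one call outcome and opens the breaker once a full window reaches the threshold. */
    void record(boolean failed) {
        if (recorded == window.length) {
            if (window[next]) failures--;      // evict the oldest outcome
        } else {
            recorded++;
        }
        window[next] = failed;
        if (failed) failures++;
        next = (next + 1) % window.length;
        // Evaluate only on a full window, similar to setting the minimum
        // number of calls equal to the window size
        if (recorded == window.length && failureRatePercent() >= failureRateThreshold) {
            open = true;
        }
    }

    float failureRatePercent() {
        return recorded == 0 ? 0f : 100f * failures / recorded;
    }

    boolean isOpen() {
        return open;
    }
}
```

With a window of 10 and a threshold of 50, five failures among the last ten calls open the breaker, while four do not.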

Result

After deploying this solution, one company's payment API timeout rate dropped by 60%, and the frequency of circuit breaker triggers fell by nearly 90%.

4. Message Queue

Use Case

Queue-based retries suit high-concurrency, delay-tolerant asynchronous scenarios (e.g., logistics status sync).

Implementation Principle

If the initial request fails, the message is sent to a delay queue.
The queue retries message consumption after a preset delay (e.g., 5s, 30s, 1min).
If max retries are reached, the message is moved to a dead letter queue for manual handling.

RocketMQ code snippet:

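// Producer sends a delayed message (rocketmq-spring's RocketMQTemplate)
Message<String> message = MessageBuilder.withPayload("Order data").build();
// Arguments: destination topic, message, send timeout in ms, delay level
// (level 3 = 10s delay in RocketMQ's default delay-level table)
rocketMQTemplate.syncSend("DELAY_TOPIC", message, 3000, 3);

// Consumer: a failed attempt is retried by the broker
// ("logistics-sync-group" is a placeholder consumer group name)
@Component
@RocketMQMessageListener(topic = "DELAY_TOPIC", consumerGroup = "logistics-sync-group")
public class DelayConsumer implements RocketMQListener<String> {
    @Override
    public void onMessage(String message) {
        try {
            syncLogistics(message);
        } catch (Exception e) {
            // Rethrow so consumption is reported as failed; the broker then
            // redelivers the message after a delay instead of dropping it
            throw new IllegalStateException("Logistics sync failed", e);
        }
    }
}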

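RocketMQ automatically redelivers messages whose consumption fails, with an increasing delay between attempts, and moves them to a dead letter queue once the maximum number of retries is reached.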
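
The delay-then-dead-letter flow described above can be sketched in memory without a broker. This is only an illustration using a virtual clock: the class DelayRetrySimulation and its delay table (taken from the 5s, 30s, 1min example) are assumptions, not RocketMQ behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class DelayRetrySimulation {
    // Delay before each redelivery, in seconds
    static final long[] RETRY_DELAYS_SECONDS = {5, 30, 60};

    // Messages that exhausted their retries and await manual handling
    final List<String> deadLetterQueue = new ArrayList<>();

    /**
     * Delivers `body` to `consumer`, redelivering after each configured delay on failure.
     * Returns the virtual seconds spent waiting, or -1 if the message was dead-lettered.
     */
    long deliver(String body, Predicate<String> consumer) {
        long waitedSeconds = 0;
        for (int retryCount = 0; ; retryCount++) {
            if (consumer.test(body)) {
                return waitedSeconds;             // consumed successfully
            }
            if (retryCount == RETRY_DELAYS_SECONDS.length) {
                deadLetterQueue.add(body);        // max retries reached
                return -1;
            }
            waitedSeconds += RETRY_DELAYS_SECONDS[retryCount]; // a real broker would wait here
        }
    }
}
```

A consumer that fails twice and then succeeds is delivered after 35 virtual seconds (5s + 30s); one that always fails is attempted four times and then lands in the dead letter queue.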

5. Scheduled Task

Use Case

For tasks that do not require real-time response and allow batch processing (e.g., file import), scheduled jobs can be used.

Example using Spring's @Scheduled annotation:

@Scheduled(cron = "0 0/5 * * * ?") // run every 5 minutes
public void retryFailedTasks() {
    List<FailedTask> list = failedTaskDao.listUnprocessed(5); // query failed tasks
    list.forEach(task -> {
        try {
            retryTask(task);
            task.markSuccess();
        } catch (Exception e) {
            task.incrRetryCount();
        }
        failedTaskDao.update(task);
    });
}
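
One detail the job above leaves implicit is a retry budget, so that a task that can never succeed is not retried on every run forever. A minimal in-memory sketch of one scheduled run with such a cap follows; the FailedTaskSweeper class, its fields, and the maxRetries parameter are hypothetical stand-ins for the DAO and task types used above.

```java
import java.util.List;
import java.util.function.Predicate;

class FailedTaskSweeper {
    static class FailedTask {
        final String id;
        int retryCount = 0;
        boolean succeeded = false;

        FailedTask(String id) {
            this.id = id;
        }
    }

    /** One scheduled run: retries each eligible task once and returns how many recovered. */
    static int sweep(List<FailedTask> tasks, int maxRetries, Predicate<FailedTask> retryAction) {
        int recovered = 0;
        for (FailedTask task : tasks) {
            if (task.succeeded || task.retryCount >= maxRetries) {
                continue; // already done, or out of budget: leave for manual handling
            }
            boolean ok;
            try {
                ok = retryAction.test(task);
            } catch (RuntimeException e) {
                ok = false; // one failing task must not abort the rest of the batch
            }
            if (ok) {
                task.succeeded = true;
                recovered++;
            } else {
                task.retryCount++;
            }
        }
        return recovered;
    }
}
```

Calling `sweep` from the scheduled method on every run gives each task at most `maxRetries` further attempts before it is skipped.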

6. Two-Phase Commit

Use Case

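For scenarios requiring strict data consistency (e.g., fund transfers), a two-phase approach can be used: first record the intent locally, then call the remote service and confirm the result. Note that this is a record-then-confirm pattern backed by compensation, not the XA two-phase commit protocol.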

Key Implementation

Phase One: Record the transaction in the database (status set as “pending”).
Phase Two: Call the remote interface and update the transaction status based on the result.
Compensation Task: Periodically scan and retry “pending” transactions that have timed out.

Sample code:

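public void transfer(TransferRequest req) {
    // 1. Record the transaction. This insert must commit on its own, so the
    //    method is not wrapped in one transaction spanning the remote call:
    //    otherwise an exception in step 2 would roll the PENDING record back.
    transferRecordDao.create(req, PENDING);

    boolean success;
    try {
        // 2. Call bank API
        success = bankClient.transfer(req);
    } catch (Exception e) {
        // Outcome unknown (e.g., timeout): leave the record PENDING
        // for the compensation task to resolve
        return;
    }

    // 3. Update transaction status
    transferRecordDao.updateStatus(req.getId(), success ? SUCCESS : FAILED);

    // 4. Retry asynchronously if failed
    if (!success) {
        mqTemplate.send("TRANSFER_RETRY_QUEUE", req);
    }
}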
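
The compensation task listed under Key Implementation is not shown in the sample, so here is a minimal in-memory sketch of it. The PendingTransferScanner class, its record type, and the queryRemoteStatus callback (standing in for a status query against the bank) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class PendingTransferScanner {
    enum Status { PENDING, SUCCESS, FAILED }

    static class TransferRecord {
        final String id;
        final long createdAtMillis;
        Status status = Status.PENDING;

        TransferRecord(String id, long createdAtMillis) {
            this.id = id;
            this.createdAtMillis = createdAtMillis;
        }
    }

    /** Returns the PENDING records that are at least timeoutMillis old. */
    static List<TransferRecord> findTimedOut(List<TransferRecord> records,
                                             long nowMillis, long timeoutMillis) {
        List<TransferRecord> timedOut = new ArrayList<>();
        for (TransferRecord r : records) {
            if (r.status == Status.PENDING && nowMillis - r.createdAtMillis >= timeoutMillis) {
                timedOut.add(r);
            }
        }
        return timedOut;
    }

    /** One compensation run: resolve each timed-out record from the remote side's answer. */
    static void compensate(List<TransferRecord> records, long nowMillis, long timeoutMillis,
                           Function<String, Status> queryRemoteStatus) {
        for (TransferRecord r : findTimedOut(records, nowMillis, timeoutMillis)) {
            Status remote = queryRemoteStatus.apply(r.id);
            if (remote != Status.PENDING) {
                r.status = remote; // outcome known: settle the local record
            }
            // Still PENDING remotely: leave it for the next scan
        }
    }
}
```

Querying the remote status before re-sending matters for fund transfers: a timed-out call may have succeeded on the bank's side, so any retry should reuse the same transfer ID so that the remote side can deduplicate it.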

7. Distributed Lock

Use Case

In scenarios involving multiple service instances or multi-threaded environments where idempotency is crucial (e.g., flash sales), a distributed lock can be used.

Example using Redis + Lua for distributed locking:

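// Lua script: delete the lock only if it still holds this caller's token
private static final String UNLOCK_SCRIPT =
    "if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end";

public boolean retryWithLock(String key, int maxRetry) throws InterruptedException {
    String lockKey = "api_retry_lock:" + key;
    String token = UUID.randomUUID().toString(); // identifies this caller as the lock owner
    for (int i = 0; i < maxRetry; i++) {
        // Attempt to acquire the distributed lock (set-if-absent with a 30s expiry)
        if (redis.setnx(lockKey, token, 30, TimeUnit.SECONDS)) {
            try {
                return callApi();
            } finally {
                // `redis.eval` is a hypothetical wrapper method that runs the script.
                // A plain delete could remove a lock that had already expired and
                // been re-acquired by another instance.
                redis.eval(UNLOCK_SCRIPT, lockKey, token);
            }
        }
        Thread.sleep(1000L * (i + 1)); // linear backoff before the next attempt
    }
    return false;
}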
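
Releasing a lock safely depends on knowing who owns it. The following in-memory sketch shows the compare-and-delete idea that a Lua unlock script implements in Redis; the OwnedLockStore class is hypothetical, and the expire method merely simulates a lock's time-to-live running out.

```java
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class OwnedLockStore {
    private final ConcurrentMap<String, String> locks = new ConcurrentHashMap<>();

    /** Acquire: set only if absent. Returns the owner token, or null if the lock is held. */
    String tryLock(String key) {
        String token = UUID.randomUUID().toString();
        return locks.putIfAbsent(key, token) == null ? token : null;
    }

    /** Release: delete only if the stored value still equals the caller's token. */
    boolean unlock(String key, String token) {
        return locks.remove(key, token);
    }

    /** Simulates the lock's expiry removing the key while the owner is still working. */
    void expire(String key) {
        locks.remove(key);
    }
}
```

If instance A's lock expires and instance B acquires it, A's later `unlock` call returns false and leaves B's lock in place, which an unconditional delete would not.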

Summary

Retry mechanisms are like fire extinguishers in a data center: you hope you never have to use them, but when disaster strikes, they might be your last line of defense.

Which solution should we choose at work?

Don’t just follow the latest tech trends. Choose based on the balance of offense and defense that your business requires.

The key to system stability lies in always treating retries with respect.