How to get better performance: the case for timeouts

As a developer, you want your applications and services to be fast, responsive, and reliable. But in an increasingly distributed and interconnected world, that's easier said than done. One of the simplest yet most effective techniques for building high-performance systems is to use timeouts everywhere.

In this post, we'll take a deep dive into timeouts and make the case for why they are essential for performance and reliability. We'll examine common issues that arise in systems without proper timeouts, and show how adding timeouts can dramatically improve things. To demonstrate, we'll use Vegeta, a popular open-source HTTP load testing tool written in Go, to compare a service with and without timeouts under load.

By the end, you'll see why timeouts are indispensable for any system that depends on external resources or services, and you'll walk away with practical tips for choosing the right timeout values. So let's jump in!

What are timeouts and why do they matter?

In essence, a timeout is a way to limit the amount of time an operation is allowed to take before it is considered to have failed. If the operation doesn't complete within the designated timeout period, it is aborted.

Timeouts can be applied to all kinds of operations and resources – HTTP requests, database queries, filesystem reads, locks, etc. They act as a safety valve to prevent unbounded waits or resource consumption.

Without timeouts, a slow or unresponsive component can cause a cascading failure that brings down the entire system. An HTTP request to an overloaded service that takes too long to respond can tie up the requesting thread/process indefinitely. A database query that never returns can exhaust the connection pool. And so on.

By enforcing an upper bound, timeouts help contain the impact of problems. Yes, the operation may fail if it exceeds the timeout. But that's better than taking down the whole system. Timeouts allow the system to fail fast, maintain responsiveness for other requests, and conserve resources for new operations.

A real-world analogy

To relate this to the physical world, imagine you're at a retail store waiting to check out. If the cashier takes an excessive amount of time to ring up the customer ahead of you, it holds up the entire line.

The store could keep everyone waiting indefinitely for that one customer. But the more sensible approach is to have a policy like: "If a checkout takes longer than 5 minutes, suspend that order and move on to the next customer. The problem order can be dealt with separately by a manager."

That's effectively a timeout. And it's essential for keeping the line moving and ensuring a smooth experience for everyone else. The same principle applies to our software systems.

A simple example service

To make things concrete, let's consider a simple REST API for retrieving product information. The /products/{id} endpoint accepts a product ID and returns the corresponding details by querying a backend database.

Here's what a Go-based implementation might look like:

func getProduct(w http.ResponseWriter, r *http.Request) {
    vars := mux.Vars(r)
    id := vars["id"]

    // Query the database without a timeout
    row := db.QueryRow("SELECT name, price FROM products WHERE id = ?", id)

    var name string
    var price float64
    err := row.Scan(&name, &price)
    if err != nil {
        http.Error(w, err.Error(), 500)
        return
    }

    json.NewEncoder(w).Encode(map[string]interface{}{
        "id":    id,
        "name":  name,
        "price": price,
    })
}

This works fine if the database is fast and responsive. But what happens if the database becomes overloaded or unresponsive? The QueryRow call will block indefinitely, tying up a goroutine and leaking resources. Even worse, the calling HTTP handler won't be able to return a timely response to the client.

If this happens often enough, the server can quickly become bogged down with stalled requests and eventually exhaust its capacity for serving any requests at all. All due to an issue in a single dependency.

Adding a timeout

Now let's add a timeout around the database query:

func getProductWithTimeout(w http.ResponseWriter, r *http.Request) {
    vars := mux.Vars(r)
    id := vars["id"]

    // Query the database with a 3-second timeout
    ctx, cancel := context.WithTimeout(r.Context(), 3*time.Second)
    defer cancel()

    row := db.QueryRowContext(ctx, "SELECT name, price FROM products WHERE id = ?", id)

    var name string
    var price float64
    err := row.Scan(&name, &price)
    if err != nil {
        if errors.Is(err, context.DeadlineExceeded) {
            http.Error(w, "Database operation timed out", 503)
        } else {
            http.Error(w, err.Error(), 500)
        }
        return
    }

    json.NewEncoder(w).Encode(map[string]interface{}{
        "id":    id,
        "name":  name,
        "price": price,
    })
}

We create a Context with a 3-second timeout and pass it to the QueryRowContext method. This tells the database driver to abort the query if it takes longer than 3 seconds.

Inside the handler, we check whether the error is context.DeadlineExceeded and return a 503 Service Unavailable to the client if so. This way, we fail fast and don't leave the client hanging or consume server resources unnecessarily.

Load testing with Vegeta

To see the impact of timeouts, let's load test both versions of the endpoint using Vegeta. Vegeta is a versatile HTTP load testing tool written in Go that allows us to specify a rate of requests to make and measure various metrics.

First, let's test the non-timeout version:

echo "GET http://localhost:8080/products/123" | vegeta attack -duration=30s -rate=100 | tee results.bin | vegeta report

This command sends 100 requests per second (RPS) for 30 seconds to the /products/123 endpoint. It prints a report with the results:

Requests      [total, rate]            3000, 100.02
Duration      [total, attack, wait]    29.993202271s, 29.989998955s, 3.203316ms
Latencies     [mean, 50, 95, 99, max]  327.411582ms, 310.156288ms, 416.672286ms, 1.200179839s, 1.217408203s  
Bytes In      [total, mean]            54000, 18.00
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:3000  
Error Set:

The mean latency is 327ms, which is pretty good. All 3000 requests succeeded with a 200 OK status.

Now let's simulate a problem with the database by introducing an artificial 5-second delay for 20% of queries. We can use a small helper package for this (test_db below is a placeholder, not a published library):

import "github.com/your/test_db"

func getProduct(w http.ResponseWriter, r *http.Request) {
    ...
    // Simulate a 5s delay for 20% of queries
    test_db.RandomDelay("SELECT ...", 5*time.Second, 0.2)

    row := db.QueryRow("SELECT name, price FROM products WHERE id = ?", id)
    ...
}

If we re-run the Vegeta test now, we get very different results:

Requests      [total, rate]            3000, 100.02
Duration      [total, attack, wait]    2m6.259044352s, 29.989998955s, 1m36.269045397s
Latencies     [mean, 50, 95, 99, max]  9.524731s, 193.236271ms, 21.384221454s, 21.558077011s, 21.574610717s
Bytes In      [total, mean]            37454, 12.48
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  55.23%
Status Codes  [code:count]             0:1343  200:1657  
Error Set:
Get http://localhost:8080/products/123: net/http: request canceled (Client.Timeout exceeded while reading body)
Get http://localhost:8080/products/123: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Yikes! The average latency shot up to 9.5 seconds. And only 55% of requests succeeded within the 30-second timeout that Vegeta imposes by default. The rest failed with timeouts or were canceled. Although we only added a delay to a fraction of queries, it had a hugely negative effect on the overall performance and reliability.

Now, let's switch to the timeout-enabled version of the endpoint and re-run the test:

Requests      [total, rate]            3000, 100.02
Duration      [total, attack, wait]    29.992434413s, 29.989998955s, 2.435458ms
Latencies     [mean, 50, 95, 99, max]  310.891658ms, 3.120668605s, 3.145484089s, 3.151914451s, 3.15457015s
Bytes In      [total, mean]            67200, 22.40
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  85.00%
Status Codes  [code:count]             503:450  200:2550  
Error Set:
503 Service Unavailable

Much better! The average latency is back to ~310ms. We do have 15% of requests failing with a 503 status because the injected delay exceeds our 3-second query timeout. But that's to be expected and an acceptable tradeoff. Overall, the system remains stable and responsive for the majority of requests.

Latency Histogram

This graph depicts a histogram of the latency distribution for the two scenarios. As you can see, without a timeout, a significant number of requests take 20 seconds or more. But with a timeout, the latency remains tightly grouped around the median.
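A distribution like this can be reproduced from the recorded results: Vegeta's report command supports a histogram reporter with user-chosen buckets (the bucket boundaries below are arbitrary examples):

```shell
# Render a latency histogram from the results recorded by `vegeta attack`
vegeta report -type='hist[0,100ms,300ms,1s,3s,10s,30s]' results.bin
```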

The many faces of timeouts

So far we've looked at an example of a query timeout. But timeouts are applicable to many other areas:

  • Connection timeouts: Cap the amount of time spent waiting to establish a connection to a service. Especially important when dealing with flaky networks or overloaded services.

  • Read/write timeouts: Bound the time spent transmitting data over a connection to safeguard against slow networks or unresponsive peers. Separate read and write timeouts allow for finer control.

  • Request/Response timeouts: Place an upper limit on the end-to-end latency for a complete request/response cycle. Useful for enforcing SLAs and predictable user experience.

  • Lock timeouts: Avoid deadlocks by timing out if unable to acquire a lock within a reasonable time. Common in concurrent systems with shared resources.

  • Session timeouts: Automatically expire idle sessions after a period of inactivity to reclaim resources and avoid stale data.

The key is to apply timeouts at every interaction point with external systems and resources. By failing fast, you prevent the effects of a slow or failed dependency from spreading and taking down the rest of your service.

Choosing the right timeout values

Of course, the most challenging part of using timeouts effectively is selecting the appropriate values. Too short, and you'll prematurely fail operations that would have succeeded. Too long, and you risk tying up resources unnecessarily.

Here are some guidelines to help choose optimal timeouts:

  1. Understand your latency budget: Know what a reasonable response time is for your service by analyzing historical data. Your timeouts should fit within that budget.

  2. Measure dependencies' response times: Assess the typical and worst-case latencies for the external services/resources you depend on. Your timeouts should encompass a majority but exclude extreme outliers.

  3. Be more aggressive at the edges: Timeouts should be tighter for user-facing requests than for backend services. Failed requests are better than slow ones for the user experience.

  4. Leave headroom for retries: For calls that can be safely retried, the timeout should be short enough to allow for multiple attempts within the overall request time budget.

  5. Adjust dynamically based on load: During periods of high load, you can temporarily lower timeouts to shed load and improve overall success rates. But be careful not to set them so low that even healthy services get cut off.

  6. Test and iterate: Choosing timeouts is not a set-and-forget exercise. Continuously measure, experiment, and fine-tune values as conditions change. Automated load testing tools like Vegeta can help you optimize.

Ultimately, finding the sweet spot requires careful tuning and a deep understanding of your service's characteristics. But it's effort well spent for the significant performance and resilience benefits timeouts provide.

Conclusion

We've covered a lot of ground in this post, so let's recap the key takeaways:

  • Timeouts are a powerful mechanism for building high-performance and resilient distributed systems. They prevent slow or failed components from causing cascading failures.

  • By bounding the time spent waiting for external operations to complete, timeouts keep your service responsive and allow it to fail fast and recover quickly.

  • Timeouts are broadly applicable to many areas including HTTP requests, database queries, locks, sessions, etc. Use them at every integration point with external dependencies.

  • Choosing the right timeout values is both an art and a science. Set them based on your latency budget, your dependencies' response times, and user experience goals.

  • Continuously measure, test, and adjust timeout values as needed using load testing tools like Vegeta.

I hope this post has convinced you of the criticality of timeouts and equipped you with the knowledge to start applying them effectively in your own services. Remember, embracing failure is key to building truly robust systems. Timeouts are your friend on that journey.

Now go forth and build those lightning-fast, rock-solid services. And if you get stuck, just remember: when in doubt, time it out!
