Tracking Down and Fixing a Memory Leak in Go with pprof

Memory leaks are an insidious class of bugs that can slowly eat away at your application's resources until things come crashing down. A memory leak occurs when a program allocates memory but never releases it, leading to ever-increasing memory consumption as the program runs. Because something still holds a reference to it, the leaked memory can't be reclaimed by the garbage collector.

In garbage-collected languages like Go, memory leaks are less common than in languages with manual memory management, since the runtime takes care of most of the cleanup work for you. However, leaks can still rear their ugly heads if you're not careful. Lingering references, goroutines that never exit, and even subtle things like substrings that pin a large backing array can all lead to leaked memory.

I recently had to track down a suspected memory leak in a large Go service I work on. Left unchecked, the leak caused our service to gradually consume more and more memory over time until it would eventually be OOM killed and restarted. Thankfully, Go provides some excellent tooling in the form of the pprof package that makes it possible to analyze your program's memory usage and track down the source of leaks.

In this post, I'll walk through my process for investigating this memory leak, the various pprof tools and techniques I employed, and how I ultimately tracked down and fixed the problem. Hopefully this will be helpful the next time you need to go hunting for leaks in your own Go programs.

Obtaining heap profiles with pprof

The first step in any memory leak investigation is getting visibility into your program's memory usage. This is where pprof comes in. pprof is a profiling tool that ships with the Go toolchain. It allows you to easily obtain snapshots of various profile data from a running Go program, including CPU profiles, execution traces, and, in our case, heap memory profiles.

There are a few different ways to enable pprof in your program:

  1. Import the net/http/pprof package for its side effects and serve HTTP using the default mux. This exposes the pprof endpoints under /debug/pprof on that server.

  2. Call the runtime/pprof.WriteHeapProfile function to write a heap profile to a file.

  3. Use the runtime/debug.WriteHeapDump function to write a full heap dump to a file (available since Go 1.3).

For my investigation, I used the net/http/pprof approach since our service already ran an HTTP server. I simply added the following blank import to our main package:

import _ "net/http/pprof"

Then, while the service was running, I could obtain a heap profile by making an HTTP request to the /debug/pprof/heap endpoint:

curl -s http://localhost:8080/debug/pprof/heap > heap.out

This downloaded the current heap profile data and saved it to the heap.out file for later analysis with the pprof tool.
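
As an aside, go tool pprof can also fetch the profile straight from the endpoint (saving a local copy as it does so), which skips the curl step:

go tool pprof http://localhost:8080/debug/pprof/heap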

Analyzing heap profiles

With a heap profile in hand, it was time to start analyzing. The first thing I did was launch the pprof interactive shell to get an overview of the profile data:

go tool pprof heap.out

This brought me into the pprof shell, which told me some basic info about the profile, like the profile type (heap), total memory in use, and number of allocated objects.

The first command I ran was top, which showed me a list of the functions responsible for the most memory allocations, sorted by the amount of in-use memory they had allocated.

The top entries were all in our own code, which was a good sign that the leak was likely in our application and not in some third-party dependency. The cumulative memory consumption of the top functions also accounted for a large portion of the total memory in use, further pointing to a probable leak.

Next, I used the list command to see the actual lines of code where the top allocations were happening. This required a bit of guesswork to figure out the right regex to match the function names, but eventually I was able to narrow it down to a specific area of our codebase.
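
A session looked something like the following; loadEntities is an invented name standing in for the real function, and list accepts any regular expression matching function names:

(pprof) top
(pprof) list loadEntities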

It appeared that a disproportionate number of allocations were happening in our entity cache, which maintained an in-memory representation of data stored in our database. This wasn't too surprising, as the cache was central to our application and used by many different components.

To get a better visual sense of things, I relaunched pprof with the -http flag (go tool pprof -http=localhost:8081 heap.out) to bring up its browser-based UI. Among other views, this provides an excellent flame graph visualization that makes it really easy to see the overall allocation picture and spot hot spots.

The flame graph made the outsized allocations in our cache code even more obvious. It was definitely the place to focus my initial investigation.

Comparing profiles over time

A single snapshot can show you what's being allocated, but to confirm a true leak, you need to look at how the memory usage changes over time. To do this, I took several more heap profile snapshots at 5-10 minute intervals.

By diffing pairs of snapshots with pprof's -base flag, I could see where memory growth was happening between the two points in time:

go tool pprof -base heap1.out heap2.out

The diffs showed that a couple of internal map-based data structures in our cache were steadily growing over time. The amount of memory they consumed would shoot up after certain operations and never come back down, even after invoking GC.
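
One practical note on taking these snapshots: heap profiles report live memory as of the most recently completed garbage collection, so forcing a collection first helps rule out garbage that simply hasn't been collected yet. Requesting /debug/pprof/heap?gc=1 does this for you; doing the same thing in code looks roughly like the sketch below (the helper name is my own):

import (
    "os"
    "runtime"
    "runtime/pprof"
)

// snapshotHeap forces a GC and then writes a heap profile to path.
// (Illustrative helper, not something from the standard library.)
func snapshotHeap(path string) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()

    runtime.GC() // get up-to-date statistics, as the runtime/pprof docs suggest
    return pprof.WriteHeapProfile(f)
}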

I found I could reproduce the pattern in a much shorter time window by repeatedly hitting a couple of specific API endpoints that accessed the problematic cache structures. Taking snapshots before and after a few iterations of these requests made the memory growth even more apparent.

Tracking down the leak

At this point, I had a pretty good idea of where the leak was happening. But to actually fix it, I needed to figure out why the maps were growing and why the memory wasn't being reclaimed.

I started by looking more closely at the map keys and values, trying to understand what references might be keeping them alive. One thing I noticed was that the map values included a nested struct that contained a pointer to a much larger object. This smelled like it could be the source of the leak.

Tracing through the code paths that accessed these nested objects, I realized that there was a missing nil check on one of the pointer fields. In some cases, the pointer was left pointing at an object that had already been deleted from the database. Because that object was still reachable in our cache via the internal map, the Go garbage collector wasn't able to reclaim its memory.
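
To make the shape of the problem concrete, it looked roughly like this. All names are invented for this post; the real cache is considerably more involved:

// Record is the large, database-backed object. Each small cache entry
// held a pointer to one of these.
type Record struct {
    ID      string
    Payload []byte // the big part
}

// cacheEntry is the lightweight value stored in the cache's internal map.
type cacheEntry struct {
    Summary string
    Parent  *Record // a stale pointer here kept the whole Record alive
}

// entries is the internal map that kept growing: entries for deleted
// Records were never removed, so the GC could never reclaim them.
var entries = map[string]cacheEntry{}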

I instrumented the suspicious code paths with counters to track how often this was happening. Watching the counters, I could see the number of stale pointers in the cache steadily climbing over time, never coming back down, and correlating with the growing memory usage.
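
The instrumentation itself was nothing fancy; a counter along these lines, using the standard expvar package (the variable name is invented), was enough to watch the numbers climb:

import "expvar"

// staleParents counts cache entries found pointing at records that have
// already been deleted from the database. Because expvar registers its
// handler on the default mux, the value shows up at /debug/vars.
var staleParents = expvar.NewInt("cache_stale_parent_pointers")

// noteStaleParent is called from the suspicious code paths whenever a
// stale pointer is encountered.
func noteStaleParent() {
    staleParents.Add(1)
}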

I had my culprit! A subtle missing nil check was causing us to inadvertently hang on to memory for objects that should have been eligible for cleanup. The effect was only noticeable over longer periods of time, as more and more objects were leaked, which explained why it had previously escaped detection.

Applying the fix

The actual fix for the bug was quite simple: just a couple lines of code to add the missing nil check and avoid leaving the stale pointer in the cache. I also added some defensive cleanup code to remove any leaked cache entries if we did encounter the unexpected nil case again in the future.
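
In spirit, the fix looked something like this, continuing with the invented names from the earlier sketch; the real change also carried the defensive cleanup just mentioned:

// refresh updates the cache after a database change. The rec == nil case,
// meaning the record no longer exists, is the path the buggy version missed.
func refresh(id string, rec *Record) {
    if rec == nil {
        // The missing nil check: drop the stale entry so the GC can
        // reclaim the deleted Record it still points to.
        delete(entries, id)
        return
    }
    entries[id] = cacheEntry{Summary: rec.ID, Parent: rec}
}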

With the fix deployed, I monitored the memory usage of the service over the next several hours. Where previously I'd seen the telltale sawtooth pattern of memory slowly growing until the process finally crashed, now the memory usage stayed stable and flat. Heap profile snapshots confirmed that the leaked memory was now being correctly garbage collected and that the maps were no longer growing out of bounds. Problem solved!

Lessons learned

This memory leak debugging adventure reinforced a few key lessons about working with Go and pprof:

  1. pprof is an incredibly powerful tool for understanding your program's memory usage. Take the time to learn its features and profile early and often.

  2. Leaks often only become apparent over time as leaked memory accumulates. Taking multiple snapshots at regular intervals is crucial to seeing memory growth trends.

  3. Comparing profiles with the base and diff commands is a quick way to see where memory usage is changing between two points in time.

  4. The web-based flame graph UI is fantastic for getting a high level visual overview of your allocations.

  5. Using -nodefraction=0 is sometimes necessary to see every allocation, since by default pprof hides nodes that account for only a tiny fraction of the total in order to cut down on noise.

  6. list is your friend for drilling down into the specific code paths and lines responsible for allocations.

  7. Leaks often happen in subtle ways, like lingering references to nested objects. Careful code inspection is needed to sleuth them out.

Limitations and future work

While pprof is great, there are still some limitations and gaps in Go's current tooling for memory analysis. One big one is the lack of full core dump support. While you can write a heap dump with runtime/debug.WriteHeapDump, actually analyzing that dump to find the specific leaking objects and references is still difficult, and the tooling that does exist is largely Linux-only.

There are a few experimental tools in this space, such as viewcore from the golang.org/x/debug repository, but they don't yet provide a full solution, especially for large real-world applications. Better core dump introspection support is an area where the Go tooling still has room for improvement.

Another limitation is that the heap dump file format has changed a few times across Go versions, making backwards compatibility a challenge. You have to be careful to analyze a dump with tooling that matches the version of Go that generated it. Stabilizing the format would help with tooling consistency.

Conclusion

Despite these limitations, I'm continually impressed with Go's pprof tooling and how much it lets you do, even without full core dump support. It made tracking down a real memory leak in a large, complex codebase possible, and dare I say, even kind of fun!

If you take only a few things away from my experience, let it be these: 1) Leaks happen, even in garbage collected languages like Go. 2) pprof is an indispensable tool for tracking them down. 3) With a bit of patience and practice, it's possible to analyze and fix even subtle, long-lived leaks.

Hopefully this post has inspired you to give pprof a try the next time you suspect a memory leak in your own Go programs. Happy leak hunting!
