I still remember one morning from my early days on a new team at my full-time job.
The team owned a service that had to stay highly available, so production issues were treated seriously. Around 08:00, an alert came in because the service had been returning 503s for a while, and Grafana showed that one of our pods had died.
Since I was still in my first days on the team, I did not have a solid mental model of the service yet. I spent the next hour checking logs, traffic, and the usual dashboards.
Around 07:50, the 503 count started spiking, so I narrowed my log search to that time range. At first everything looked normal enough to be annoying. Traffic was there, requests were coming in, and nothing looked obviously wrong.
Then I found one panic log in that window. That was the part that explained almost everything.
Finding the culprit
The issue came from a process that did its work concurrently but modified a shared map from multiple goroutines without any synchronization.
“Do not communicate by sharing memory; instead, share memory by communicating.”
It looked roughly like this.
shared := map[string]int{}
for _, item := range items {
    go func(v string) {
        shared[v]++ // unsynchronized write to the shared map
    }(item)
}

The traffic pattern had looked normal because the failure was not caused by a slow dependency or a sudden traffic spike. The process itself had a data race, and the Go runtime treats concurrent map writes as a fatal error, so when the race hit at the wrong moment the pod simply crashed.
That was also why it was hard to see from the start. Most of the logs looked fine until the one line that did not.
Reproducing it
Once I had that clue, I stopped staring at production and tried to make the problem happen locally.
The first thing I did was replicate the request shape and the data as closely as I could to what we had in production. Then I ran the same process manually on my machine until I could reproduce the failure in a controlled way.
After that, the pattern became clearer. The code was doing work concurrently, but the map was still being written without protection.
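Stripped of the service-specific parts, the reproduction amounted to hammering the racy path in a loop. The sketch below is illustrative rather than the real code: processItems and the generated items stand in for the actual production process and data.

package main

import (
    "fmt"
    "sync"
)

// processItems mimics the racy path: every item is counted in a shared
// map from its own goroutine, with nothing guarding the writes.
func processItems(items []string) map[string]int {
    shared := map[string]int{}
    var wg sync.WaitGroup
    for _, item := range items {
        wg.Add(1)
        go func(v string) {
            defer wg.Done()
            shared[v]++ // data race: concurrent map write
        }(item)
    }
    wg.Wait()
    return shared
}

func main() {
    items := make([]string, 1000)
    for i := range items {
        items[i] = fmt.Sprintf("item-%d", i%10)
    }
    // Run the racy path repeatedly until the runtime notices.
    for i := 0; i < 100; i++ {
        processItems(items)
    }
}

With that many concurrent writers, the process usually dies with the runtime's concurrent map writes fatal error within a run or two.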
I used the race detector to make sure I was not guessing.
go run -race .

That made the issue much easier to defend. It was no longer a suspicion from one panic log. I could reproduce it, and the race detector pointed to the same area.
Fixing it
The fix itself was not complicated. We kept the concurrency, but we stopped letting multiple goroutines write to the same map without coordination.
var mu sync.Mutex
shared := map[string]int{}
for _, item := range items {
    go func(v string) {
        mu.Lock()
        shared[v]++ // writes are now serialized by the mutex
        mu.Unlock()
    }(item)
}

We used a mutex around the shared write, then verified the flow again with the same reproduced case and with -race.
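To keep that -race verification repeatable, the reproduction can also live in a test. This is only a sketch, assuming the fixed code sits behind the illustrative processItems function from earlier:

package main

import "testing"

// Hammers the concurrent path so `go test -race ./...` will flag any
// future unsynchronized access to the shared map.
func TestProcessItemsUnderRace(t *testing.T) {
    items := []string{"a", "b", "a", "c", "b"}
    for i := 0; i < 50; i++ {
        if got := processItems(items); got["a"] != 2 {
            t.Fatalf("unexpected count for %q: %d", "a", got["a"])
        }
    }
}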
Once it looked good locally, we pushed it to staging and asked QA to test the same flow there. After that we deployed it to production as soon as we could, retested it, and the service stayed stable.
Looking back
The mutex part was straightforward once the issue was clear.
The harder part was getting from a vague production symptom to something I could reproduce and prove.
It made me more careful with shared state in concurrent code. When several goroutines touch the same data, a mutex is one of the first things worth considering.
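A mutex was the right call for a simple counter, but the proverb quoted earlier points at the other option: give the map a single owner and have everyone else send it messages. Here is a rough sketch of that shape, with countItems as an illustrative name rather than anything we shipped.

package main

import (
    "fmt"
    "sync"
)

// countItems keeps the per-item goroutines but gives the map one owner:
// workers only send keys over a channel, and a single loop does all the
// writes, so no write can race with another.
func countItems(items []string) map[string]int {
    updates := make(chan string)

    var wg sync.WaitGroup
    for _, item := range items {
        wg.Add(1)
        go func(v string) {
            defer wg.Done()
            updates <- v // share memory by communicating
        }(item)
    }

    // Close the channel once every worker has sent its key.
    go func() {
        wg.Wait()
        close(updates)
    }()

    // The only goroutine that ever touches the map.
    shared := map[string]int{}
    for v := range updates {
        shared[v]++
    }
    return shared
}

func main() {
    fmt.Println(countItems([]string{"a", "b", "a"}))
}

For a counter this small the mutex is less code, but the channel shape starts to pay off once the owning goroutine has more to do than an increment.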