Debugging an evil Go runtime bug
I’m a big fan of Prometheus and Grafana. As a former SRE at Google I’ve learned to appreciate good monitoring, and this combination has been a winner for me over the past year. I’m using them for monitoring my personal servers (both black-box and white-box monitoring), for the Euskal Encounter...
Pretty intense debugging story. I like it because it exposed me to a bunch of concepts while being a good reminder not to take anything for granted and to keep debugging until you really know what’s going on.
The conclusion was also quite satisfying.
“Unsurprisingly, upstream’s first guess was that it was a hardware issue. This isn’t unreasonable: after all, I’m only hitting the problem on one specific machine. All my other machines are happily running node_exporter. While I had no other evidence of hardware-linked instability on this host, I also had no other explanation as to what was so particular about this machine that would make node_exporter crash.”

Posted on 2020-12-27T06:05:24+0000