The AWS Heisenbug

13 Apr 2012 in Tech

In my post about the Tweetdig tech stack, I mentioned a Heisenbug that I encountered whilst trying to use NodeJS on a micro EC2 instance.

Just in case you didn't read the last post, here's a quick refresher:

We receive a realtime stream of data from Twitter, which is buffered into a (separate) Redis server. A different node process then picks up each tweet from Redis and does everything we need to be able to start serving it on the site.
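The shape of that pipeline is roughly the sketch below. It's illustrative rather than our actual code: the `tweets:queue` key name and the node_redis client usage are assumptions, and in production the two halves run as separate node processes.

```js
// Rough sketch of the pipeline described above (key name is illustrative).
// In production these are two separate node processes, shown together for brevity.
var redis = require('redis');

// Ingest side: the Twitter stream handler just buffers each tweet into Redis.
var ingest = redis.createClient();
function onTweet(tweet) {
  // RPUSH appends the raw JSON to the tail of the queue.
  ingest.rpush('tweets:queue', JSON.stringify(tweet));
}

// Worker side: pop tweets off the queue and do the heavy lifting.
var worker = redis.createClient();
function processNext() {
  // BLPOP blocks until something is available; reply is [key, value].
  worker.blpop('tweets:queue', 0, function (err, reply) {
    if (err) { return setTimeout(processNext, 1000); } // back off briefly on error
    var tweet = JSON.parse(reply[1]);
    // ...parse entities, write to the database, update timelines, etc...
    processNext();
  });
}
processNext();
```

The Redis list is what decouples the two: the stream handler never waits on the worker, which is exactly why a slow worker shows up as a growing backlog rather than dropped tweets.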

We initially ran both of these processes on a single micro instance and everything was running fine, until one day a backlog suddenly appeared. Tweets were going into Redis fine, but were taking 5, then 10, then 20, then 45 minutes to appear on the site. Something odd was going on: we'd only added 50 or so new users and no code had changed, so what could it be? I ssh'd into the box, and a quick look at the logs revealed nothing.

The server load wasn't too bad, so we chalked it up to a one-off occurrence, fired up an 8-core machine (did I mention how much I love AWS?) and cleared our backlog of 25k+ tweets in less than a minute for less than $1. I kept an eye on the server for a while afterwards, but tweets seemed to be coming through in < 10 seconds and there were no processes with spiking CPU, so I logged out and carried on working on something else.

Then it happened again, this time just as I was about to leave work for the day. I logged in and went through the same motions with the same results: a normal volume of tweets, no new signups, no processes consuming all of the CPU and leaving none for node. So, up came the 8-core behemoth again and the tweets ended up in our database fairly quickly.

I had a think about it on the train home, but couldn't come up with anything that might be causing it. Infuriatingly, it never happened when I was logged into the server watching.

Once we had the large server running, tweets were appearing almost instantly, which ruled out any database bottlenecks. The code we're using also ran fine on the large server, so I ruled that out too. That left the micro instance itself. It would have been easy to just bump up the size of the instance and be done with it, but I wanted to know why it was behaving the way it was. AWS can get very expensive very quickly, and we couldn't just make the server bigger every time something bad happened. The code and resources should easily have been able to handle the volume of tweets we were processing; they had been up until that point.

After what felt like an eternity reading AWS support threads and blog posts whose resolution was "move to Rackspace", I finally stumbled across a post by Greg Wilson that solved all my problems.

Suddenly everything made sense. A micro instance lets you burst CPU for short periods, but is then heavily throttled, which shows up as CPU steal time. Because I never noticed that tweets were taking too long to come through until it was already too late (we've since added monitoring to the queue), by the time I logged in to take a look the instance was out of the steal period and behaving normally. A real Heisenbug!
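The monitoring we added is nothing clever; something along the lines of the sketch below (the key name, threshold and alerting are made up for illustration), which just checks the length of the Redis queue every minute and shouts if the backlog starts to grow.

```js
// Minimal sketch of queue monitoring (key name, threshold and alert are illustrative).
// The goal is to know about a backlog long before tweets are 45 minutes late.
var redis = require('redis');
var client = redis.createClient();

var QUEUE_KEY = 'tweets:queue'; // hypothetical key name
var MAX_BACKLOG = 500;          // hypothetical threshold

setInterval(function () {
  // LLEN returns the current length of the list.
  client.llen(QUEUE_KEY, function (err, length) {
    if (err) { return console.error('monitor error:', err); }
    if (length > MAX_BACKLOG) {
      // In reality this would page us; console output is a stand-in.
      console.warn('Tweet backlog is ' + length + ' - check CPU steal on the instance!');
    }
  });
}, 60 * 1000);
```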

We've since upgraded the server to a small instance and it's all running perfectly. As the stream requires constant throughput, a small instance is the best fit for the job for now.

To think, all of this could have been avoided if I'd just read the AWS documentation a bit more carefully. The relevant docs can be found here, and they explain the burst/throttle capability perfectly.