Unplanned Downtime Post Mortem: Ruby!

July 27, 2008

Maybe I could learn to be a truck driver. Mav, do you still have the number for that truck driving school we saw on TV? TruckMaster I think it is. I might need that.

Late Friday night (it’s always a Friday night with these things, isn’t it?) I decided to cap off a great week by upgrading the server and system software that supports Jetrecord. A number of security and performance updates have been released over the last few months and I wanted to install these before AirVenture and before people really started using the embedded maps.

Once traffic had died down, around 11:30pm MST, I pulled the trigger and ran the update scripts. Everything installed just fine. I was very happy. Until I rebooted the machine to pick up the updated software.

Boom! Down goes Jetrecord. Hmm, that sounds familiar.

Oh, boy. So was it Jetrecord again like last time or was it something else? I pulled up Jetrecord on my local machine and did all my usual tests. No problems. No errors. Nothing in the logs. Everything seemed fine. I checked my email to see if Jetrecord had sent any notices. (I have a script that does this when a major error occurs.) Nope, nothing.

I went back to the server and started testing everything manually. The updates I installed touched almost all of the major software for running the server and Jetrecord. So which one was it?

I didn’t want to drag this out by testing everything line by line if it wasn’t necessary. This isn’t a nuclear reactor. Most likely I had just made a stupid mistake. See the previous example.

What was I seeing? Apache, PHP, and MySQL were working fine, and I could actually load up this blog, which runs on WordPress. I already knew the application code itself was fine because it had been running before I updated the software and it was still running fine locally.

That could mean two things. It was either Ruby or it was PostgreSQL. Jetrecord runs on Ruby on Rails and uses Postgres for the data store. I hadn’t updated Ruby on Rails and I hadn’t deployed any new releases so it most likely wasn’t the framework or the code that sits on top of it.

I tried connecting to Postgres via the command line. No problems. So it’s Ruby, then.
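That process of elimination could be scripted for the next upgrade. What follows is only a minimal sketch in Python, not the checks that were actually run that weekend. The function names, the EXAMPLE_CHECKS list, and the specific commands are illustrative assumptions, and the commands would need connection options and paths that fit the real server.

```python
import subprocess


def command_ok(argv, timeout=10):
    """Return True if the command starts, finishes in time, and exits with status 0."""
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Missing binary, permission problem, or a hung process all count as a failure.
        return False
    return result.returncode == 0


def failing_layers(checks):
    """Run each (name, check) pair in order and return the names whose check failed."""
    failed = []
    for name, check in checks:
        try:
            ok = bool(check())
        except Exception:
            # A check that blows up is treated the same as a check that returns False.
            ok = False
        if not ok:
            failed.append(name)
    return failed


# Hypothetical checks. A version check only proves the interpreter starts;
# the crash described in this post showed up while running a Rails app,
# so a real Ruby check would have to load the application itself.
EXAMPLE_CHECKS = [
    ("postgres", lambda: command_ok(["psql", "-c", "SELECT 1"])),
    ("ruby", lambda: command_ok(["ruby", "-v"])),
]
```

Calling failing_layers(EXAMPLE_CHECKS) right after a reboot would return the names of whichever layers did not respond, which narrows the search before any manual digging starts.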

Ruby was recently updated to address some security issues and it was one of the reasons I was upgrading in the first place. I did some searching and discovered that the latest releases of Ruby were segfaulting on almost every OS while running a Rails app. This was Saturday afternoon. Jetrecord had been down for 12 hours at this point. I slept for 2 of those hours.

I started doing some research (hard, research-scientist, Google research) and found people saying that an earlier version of Ruby was working fine, even with patched security updates. I didn’t care at this point. Just get my site up, somebody!

I went back to my server and started working to uninstall the recent release of Ruby that killed Jetrecord. Unfortunately, I’m a novice at Solaris administration. Needless to say, it was a good learning experience, not only in debugging a Ruby app on Solaris but also in how Solaris works with pkgsrc to do its thing and how all the dependencies fit together.

I worked as hard as I could to understand the problem but I got tired around 9pm. I’m not in my twenties anymore. I went to bed at 10. Jetrecord was still down. I was exhausted but my mind probably raced into the night for another hour or so. I dreamt of electric sheep and losing all of Jetrecord’s users and their data, followed by all of my hair.

Sunday. A day of rest for some. My family and I got up and went to church. That was probably the best thing I could have done. When we got home I felt strangely at peace. We ate lunch and then I went back to work. Around 1:30pm I figured out how to get Ruby 1.8.6.230 off the server and put 1.8.6.111 back on with the security patches.

I went back to my manual tests on the server and everything appeared fine. I restarted the application in its production configuration and voilà, Jetrecord is back up as of 2pm MST.

Some Lessons Learned

That is all I can think of right now. I truly believe Jetrecord is stable again and actually performing better with the updates, so log away.

Cheers, Harry

Jetrecord is an online logbook for pilots like you.

2 Responses to “Unplanned Downtime Post Mortem: Ruby!”

  1. Chris Johnson said:

    Wow–that doesn’t sound like fun at all. I do the Unix system administration on my union’s servers, and nothing makes me more nervous than doing major software upgrades, especially on those bits of software with which I’m not intimately familiar.

    It’s great to see that you got it all put back together again, and I hope you can catch up on your sleep now!

  2. Harry Love said:

    Thanks, Chris. I’m feeling better today and I’m taking today and tomorrow off to regroup and write up some plans for going forward from here.
