This is a guest post by Rachel Kibler.
At 1-800 Contacts, we practice DevOps, and one way we do this is by having everyone involved in the coding process take an on-call shift to handle production issues. For our website, this means we rotate through all our developers and testers in one-week increments, and each person is on call roughly twice a year.
Being on call means being available (and in cell service) after hours and on the weekend, monitoring slowness on the site as well as orders placed, calling “severities” (issues affecting production that require immediate attention), and seeing those severities through the process of a hotfix or a rollback, whichever is appropriate.
One man at Contacts is notorious for having bad on-call weeks. He has severities called all the time, sometimes multiple ones at once. I had heard stories about his weeks, and as my first week approached, I was nervous.
My first day was fairly smooth. We had some bumps with slowness, but they resolved quickly. I became more familiar with looking at our logs and digging through them. “The slowness happened for about x minutes and has resolved itself. I will continue to monitor” became my regular response.
On my second day, we had a severity. My ever-supportive team hopped on the call to see how they could help. I found the appropriate team that owned the code, they joined too, and we resolved it within an hour. The next day there was another severity, with the same process.
Friday morning at 6 a.m., I got a call from the help desk, which always monitors things. Something had gone awry, and though they usually waited for the second slowness email to be sent, they thought I should look at it right away. I dragged myself out of bed, and after I had spent about five minutes watching things and trying to figure out what went wrong, the problem resolved itself. A server had been fussy during a routine upgrade (yes, that’s my technical term for it) and just needed some time to right itself.
My team had been working on a complicated piece of code for weeks. I had done a lot of testing on it, but I was nervous about it. They wanted to release it the week I was on call. I pushed back. I kept finding bugs, and they kept fiddling with it. They wanted to release on Tuesday, then Thursday, and, finally, they wore me down and I agreed to release on a Friday afternoon. Yes, a Friday afternoon. I was hesitant, but we released. The APIs had a hiccup when restarting, but we ended up not having to roll back or call a severity. I breathed more easily.
My weekend was easy. There were no calls and no problems. I had canceled my camping plans as soon as I realized I was on call and what that meant. A few bugs came in from our call center, but those aren’t usually worth dealing with outside of normal hours.
To be honest, I’m looking forward to my next time on call (well, maybe just a little bit). I’m getting more comfortable with reading logs, looking at user sessions, and troubleshooting issues. It’s not nearly as scary as I initially thought it would be, and having my team support me helps a lot. I’m sure I’ll change my tune if I get a call at 2 a.m. or if I have the luck of the cursed man, but my company makes sure people aren’t alone and that they get what they need to be successful.
I’m grateful for my on-call experience, and I recommend trying it out, even if it means shadowing someone. I learned a lot, and it’s made me a better tester.