redbluemagenta

a blog curated/written/whatever by christian 'ian' paredes

And Now for Something Completely Different

So, I’ve been doing a lot of things other than work lately. This blog doesn’t seem to reflect that all too well, but I figure this is a great place to talk about it.

Recently, I’ve concentrated a lot more on my life outside of work than I did before, starting with recovering my social life and getting myself all situated with living alone for the first time in a while. I’ve started hosting game nights with a lot of my friends, I’ve been socializing and getting to know all sorts of folks from different backgrounds, and I got myself a pretty awesome place near my work.

I think it’s quite important to get these things under control - I’m of the opinion that you work to live life, versus living to work. After getting these things squared away, I feel like I’m in a much better place - I’m quite happy where I’m at so far, and I feel it can only get better.

I’ve also travelled a bit recently - I went over to Berlin and San Francisco, and will be going back to Berlin in a few months for about a week (too short of a time, IMO!). I also finally checked out Olympic National Park last summer, even though it’s right there at my doorstep (well, not quite. But still, a three hour drive to see the coast ain’t that bad.)

Next trip will be over to Westport, WA this weekend - I’ll be going there for one night, while checking out the beach over there along with some hiking trails and fish and chips action. I’m super excited - been meaning to go out to some of the trails around here in Washington (even though I’ve lived here all of my life), so I’m looking forward to our trip over there.

More updates soon!

(and what little work things I can talk about might resume again, but it might be more personal stuff for a while.)

Review of Airbnb’s SRE Interview Process

TL;DR: they have a strong design culture, but not a great engineering culture from what I can tell with the interview process.

It all started with an email from the recruiter - I was immediately interested, so I replied, waited a day or two, then was issued the SRE Challenge, where I was given a box to play around with, and had to solve a typical systems engineering problem with whatever tools I wanted. I opted for configuring the system and solving the problem with a combination of Chef, a few Ruby scripts, and a shell script to get everything all bootstrapped on the box.

After that, I had a phone interview with another SRE, who asked me various sysadmin trivia. After answering his questions somewhat sufficiently, I was called in to fly over to San Francisco for two days, with one day dedicated to technical and soft skills, and the other day dedicated to interviews with the founders.

I didn’t get to go on to the second day; I was eliminated after the first round of in-person interviews - a strenuous 4-5 hours’ worth - and I did not feel like it was completely representative of my abilities (I’m not saying I’m awesome at what I do, just that the questions seemed unfair in judging my abilities.)

Note: I’m going to compare a lot of this to my experience with Amazon’s interview process, as I think Amazon’s interview process is similar in many ways, yet I feel that Amazon’s interview process is way more realistic than Airbnb’s.

Many of the questions were even more trivia and vocabulary. There was an awesome algorithms question that was meant to see how I broke apart the problem, but I felt like there was too much hand holding when I was trying to discover a solution to the problem. Amazon’s interviews usually have someone write a ton of code or diagrams on the board before anyone asks any questions - it’s to discover the person’s natural thinking process, and ask probing questions after the solution is naturally discovered.

A few of the questions were downright hostile. I was told by one of the engineers interviewing me, in no uncertain terms, that “Chef and Puppet sucked,” and was asked what I thought about that. I know what he was getting at, which was to see if I had any strong opinions on the matter, and to see how I defended my stance. But this is analogous to going up to a liberal and saying, “yeah, so, fuck health care and lazy poor people. What do you think about that shit?” (Aside: I’m a liberal. I’m also a huge supporter of configuration management systems.) I did ask why he thought they sucked, and was told that both Chef and Puppet tended to wreck perfectly good systems. But so do shell scripts and Ruby programs, given enough recklessness…

I also noticed throughout the entire interview that there was a lot of dogma surrounding the kinds of tools that Airbnb uses. I was asked how I could coordinate actions between many machines, and I told them that I could probably use Chef to do a lot of that, which was met with a huge red “no” response. I then had to play bingo and landed on the “correct answer” of “Zookeeper.” Great, I chose the right tool I guess, though I’ve always had a different philosophy about tool choice: use what works, and it’s almost always about the methodology and approach, rather than the tools used.

At the tail end of the interview loop, I was asked how I would improve Airbnb’s site. I gave some suggestions, though I wasn’t sure if this was supposed to touch upon the “inventiveness” quality that Airbnb is looking for. I’m guessing it was.

They were quick in letting me know whether I would go on to the founder interviews, which I didn’t. Least I got some Airbnb credit at the end. :)

At the end, I wasn’t terribly impressed with the engineering organization in Airbnb - I came away not really wanting to work with Airbnb as a systems engineer. There was too much emphasis on tool choice/religion, Linux trivia, and getting the ‘right answer’ versus solving engineering problems, where multiple tools might have to be glued together to come up with one of many solutions.

I realize this blog post might end up getting me blackballed by Airbnb. My only response is this: I really do hope the interview process is improved, and for that to happen, I think the engineering culture probably has to change, too. In some ways, Amazon’s culture can be stultifying, as there can be a lot of roadblocks before your code can hit production (depending on the team, of course.) But I think, at least for me, I can appreciate the emphasis on operations and engineering quality - it’s not just the user interface and smoothness of the site that matters to the customer, but also how fast the results are delivered to the client (latency, how much load we can sustain), how often that data is available to the client (high availability), and sometimes, correctness of that data (system architecture, emphasizing different parts of the architecture for consistency depending on the problem.)

Acknowledgements

I just wanted to write a new blog post about all the folks who have helped me, mentored me, and otherwise gave a ton of support during the past few years of my career. I’m incredibly thankful for everyone’s support over the years, and I couldn’t have made it to this point without you guys. Thanks.

If I forgot you, let me know.

  • John Keithly: for your technical support class at Ballard High School. You taught all of us a ton about how computers work and how to troubleshoot these infernal machines, and you let us break and fix Linux and Windows machines.
  • Richard Green: for taking the chance on having me administer a few of your UNIX boxes at the UW. It was my first real job where I had to do that, and I learned a ton from you about UNIX, scripting, and making sure shit doesn’t go up in flames.
  • Hoang Ngo: for your sage advice on surviving in the computing industry, for your knowledge on networking, and for being an overall awesome, humble guy to work with. Your dedication to the craft is pretty damn amazing, and I’m incredibly humbled by your huge swath of knowledge.
  • Michael Fox: for being one of the most awesome bosses ever, for shielding our team from politics, keeping us sane during those intense days of setting up old machines for the school, and for helping us climb out of the hole we were in.
  • Stephen Balukoff: though we squared off quite a bit during my tenure, you were probably the most knowledgeable person in UNIX that I’ve ever known, and you understood lots of the things I was going through at the time. I’m also incredibly humbled by the knowledge you had of databases, UNIX, and all things systems administration.
  • Benjamin Krueger: you are and always have been an awesome mentor - you’ve helped me a lot in my career, taught me a lot about systems administration, and showed me the awesomeness of configuration management.
  • Lee Damon: for also being an awesome mentor - you’ve seen a lot of what the tech industry has to offer, and you’ve definitely given me a lot of advice on my career as I’ve advanced from a junior systems administrator to more of a mid-level systems administrator. Thanks so much.
  • Mike Sawicki: though we haven’t really worked together much, your insight and advice have helped me a lot. You’ve also provided a ton of awesome career advice and have shown me a lot of awesome shit in the industry, too.
  • David Lemphers: for being an awesome mentor. We differed a lot in how we approached problems, but it was damn healthy, and those were some of the better discussions I’ve had while approaching computing problems. Further, you’ve shown me a lot of awesome ways to create distributed systems.

Again, let me know if I’ve missed anyone. (If you’re not listed, it doesn’t mean I don’t appreciate you - I certainly do! I’ve been blessed with knowing so many awesome and knowledgeable folks in my short career - everyone’s been awesome, and I can’t thank everyone enough for everything up to this point.)

Amazon

On the 27th of February, I start at Amazon as a systems engineer on the AWS CloudFront team.

I can’t even say how excited I am - I’ve been wanting to be at Amazon for the longest time now, and I finally did it. I’m at fucking Amazon, working with a large installation that has way more machines than I could even fathom.

I’ll continue to blog here, but when it comes to work related things, I’ll have to keep a tight lip. Sorry folks. I’m not sure what this means exactly in terms of the kinds of posts I’ve made so far (mostly reviews and quick guides of various software), but I’d imagine those types of posts will likely be a lot less frequent. What I would like to do instead, is talk about other things that I might be doing on my off time - this means board games, other OSS software that I’m working on, music, and movies.

Anyway, this doesn’t mean the death of this blog - I hope this actually spurs more discussion on a wider variety of things.

Cheers, gl, hf, etc. etc.

RunDeck

There are times when we need ad hoc control over our machines. This blog post by Alex Honor does a way better job than I could of explaining why we still need to worry about ad hoc command execution, but a quick summary would probably be the following:

  1. Sometimes, we just need to execute commands ASAP - unplanned downtime and ad hoc information gathering both spur this activity.
  2. We need to orchestrate any set of commands across many machines - getting your DB servers up and prepped before your web servers come online is one example of this.

RunDeck can handle these two scenarios perfectly. It’s a central job server that can execute any given set of commands or scripts across many machines. You can define machines via XML, or use web servers that respond with a set of XML machine definitions (see chef-rundeck for a good example of this.) You can code in whatever language you want for your job scripts - RunDeck will run them on the local machine if the execute bit is set (or, if you’re running jobs across many machines, it will SCP them onto the machine and execute them via SSH.)
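To make the XML machine definitions a bit more concrete, here’s a minimal Python sketch that builds a node list. The `<project>`/`<node>` layout and the attribute names follow RunDeck’s resource XML format as I understand it, so check them against the docs for your version; the hostnames, usernames, and tags are all made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical hosts; in practice these would come from your inventory.
NODES = [
    {"name": "db1", "hostname": "db1.example.com", "username": "deploy", "tags": "db"},
    {"name": "web1", "hostname": "web1.example.com", "username": "deploy", "tags": "web"},
    {"name": "web2", "hostname": "web2.example.com", "username": "deploy", "tags": "web"},
]


def build_resource_xml(nodes):
    """Return a resource XML document (as a string) describing the given nodes."""
    project = ET.Element("project")
    for node in nodes:
        # Each machine becomes one <node .../> element with its fields as attributes.
        ET.SubElement(project, "node", attrib=node)
    return ET.tostring(project, encoding="unicode")


resource_xml = build_resource_xml(NODES)
print(resource_xml)
```

The same string could be written to a file or served by a small web app, which is roughly the role something like chef-rundeck plays.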

You can define multiple job steps within a job (including using other jobs as a job step) - thus, you can, for example, have a job step that pulls down your web application code from git across all web application servers, another job step that pulls down dependencies, then one more to start the web application across all of those machines, all of which is coordinated by the RunDeck server. As another example, you can also coordinate scheduled jobs that provision EC2 servers at a certain time of the day, run some kind of processing job across all of those machines, then shut all of them down as soon as the job is finished.
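The deploy example can be modelled in a few lines of Python. This is only a sketch of the idea (ordered steps, each run across a set of hosts), not how RunDeck implements it; the step names, hosts, and commands are hypothetical, and `fake_execute` stands in for real SSH execution. RunDeck’s own failure handling may well differ from the stop-on-first-failure rule used here.

```python
from typing import Callable, List, Tuple

# A step is a (name, target hosts, command) triple; all values here are hypothetical.
Step = Tuple[str, List[str], str]

DEPLOY_STEPS: List[Step] = [
    ("fetch code", ["web1", "web2"], "git -C /srv/app pull"),
    ("install dependencies", ["web1", "web2"], "bundle install"),
    ("start app", ["web1", "web2"], "service app start"),
]


def run_job(steps: List[Step], execute: Callable[[str, str], int]) -> List[Tuple[str, str, int]]:
    """Run each step on all of its hosts before moving on to the next step.

    `execute(host, command)` returns an exit code. The job stops after the
    first step where any host fails, so later steps never run on a
    half-deployed fleet.
    """
    log = []
    for name, hosts, command in steps:
        codes = [execute(host, command) for host in hosts]
        log.extend((name, host, code) for host, code in zip(hosts, codes))
        if any(code != 0 for code in codes):
            break
    return log


def fake_execute(host: str, command: str) -> int:
    """Stand-in for SSH: print the command and pretend it succeeded."""
    print(f"[{host}] {command}")
    return 0


results = run_job(DEPLOY_STEPS, fake_execute)
```

Swapping `fake_execute` for a function that actually shells out over SSH is the only change needed to try this against real machines.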

Manual button pushing and cron schedules aren’t the only ways to start these jobs: they can also be started by hitting a URI via the provided API with an auth token. To expand on the previous example, if you don’t have a reliable way to check whether the processing is finished from RunDeck’s point of view, you can have the workers fire off a request that hits the API and tells RunDeck to run a ‘reap workers’ job.
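Here’s roughly what that API call might look like from a worker, sketched in Python. The `/api/<version>/job/<id>/run` path and the `X-Rundeck-Auth-Token` header are what I’d expect from RunDeck’s API docs, but the HTTP method and API version are assumptions (older API versions may want a GET), and the server URL, job ID, and token are placeholders.

```python
import urllib.request

# Hypothetical values: replace with your server, job ID, and API token.
RUNDECK_URL = "https://rundeck.example.com"
REAP_JOB_ID = "reap-workers-job-id"
API_TOKEN = "changeme"


def build_run_request(base_url, job_id, token, api_version=1):
    """Build (but don't send) an HTTP request asking RunDeck to run a job.

    Assumes the /api/<version>/job/<id>/run endpoint and the
    X-Rundeck-Auth-Token header; confirm both for your RunDeck version.
    """
    url = f"{base_url}/api/{api_version}/job/{job_id}/run"
    return urllib.request.Request(
        url,
        headers={"X-Rundeck-Auth-Token": token},
        method="POST",
    )


request = build_run_request(RUNDECK_URL, REAP_JOB_ID, API_TOKEN)
print(request.get_method(), request.full_url)
# A worker would send it once its processing is done, e.g.:
# urllib.request.urlopen(request, timeout=10)
```

Building the request separately from sending it keeps the sketch runnable without a RunDeck server around.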

mcollective vs. RunDeck?

A few people have asked me how this compares with mcollective.

I would probably say that they’re actually complementary - mcollective is incredibly nice, in that it has a bit of a better paradigm for executing commands across a fleet of hosts (pub/sub vs. SSH in a for loop.) You can actually use mcollective as an execution provider in RunDeck, by simply specifying ‘mco’ as the executor in the config file (here’s a great example of this.) RunDeck is nice, in that you can use arbitrary scripts with RunDeck and it’ll happily use them for execution - plus, it has a GUI, which makes it nice if you need to provide a ‘one button actiony thing’ for anyone else to use. RunDeck can also run things ‘out of band’ (relative to mcollective) - for example, provisioning EC2 machines is an ‘out of band’ activity (though you can certainly implement this with mcollective as well, mcollective’s execution distribution model doesn’t seem to ‘fit’ well with actions that are meant to be run on a central machine.)

But again, you can tie mcollective with RunDeck, and they’ll both be happy together. I can see right away that the default SSH executor will likely be a pain in the ass once you get to about 100+ machines or so.

Conclusion

RunDeck is pretty damn nice. We’ve been putting it through its paces in our staging environment, and it’s held up very nicely with a wide variety of tasks that we’ve thrown at it. The only worries I have are the following:

  1. Default SSH command executor will likely start sucking after getting to about 100+ machines.
  2. Since it can execute whatever you throw at it, it’s possible that you might end up with yet another garbled mess of shell and Perl scripts. This is probably better solved with team discipline, however.
  3. There doesn’t seem to be a clear way to automate configuration changes for RunDeck - thus, be sure to keep good backups for now (though, this is probably because of my own ignorance - the API is pretty full featured, and I believe you can add new jobs via the API, but it would’ve been awesome to be able to just edit job definitions in a predictable location on the disk.)

Despite these worries, I highly recommend RunDeck - it’s helped us quite a bit so far, it’s quick to get set up, and quick to start coding against for fleet-wide ad hoc control.