Seven signs of a bad sysadmin

The title for this post was originally going to be “things that will get you fired from Anchor”, but we realised they might be more generally applicable and actually more cerebral than they first appear.

These aren’t just any old sysadmin WTFs. To be sure, some are serious “what ever made you think that could possibly be a good idea!?” material, but most are a bit more subtle and deep seated.

Some of them are massively widespread and we’d be remiss to not acknowledge that they’re hard to fix – being a sysadmin is suffering, and it can be an act of heroism to dig yourself out of someone else’s hole to get on top of things.

  1. Rebooting as an instinctive reaction to problems
    Sometimes it’s just so tempting to reboot a server when you’ve seen it fix a problem before. As much fun as it is to poke fun at Windows, that joke is getting kind of old these days, and properly-managed systems really don’t just crash for no reason. That means there’s a reason, which means you can fix it, and the problem will stop happening. If you’re being paid to maintain servers, it’s your job to get those problems fixed!
  2. Not documenting or logging their work
    This one is especially common, and the reply is always “I don’t have time for that”. It’s something you must make time for, because you’ll never get it done “later”.

    If you tell your boss that it’ll take three days to deploy this new server, your estimate had better include time taken for documentation. You might do a perfect job and it might hum along for years without problems, but eventually someone will need to fix it, possibly you. Without documentation, a lot of time will be wasted rediscovering all the basic details of the system. Even basic notes with an outline of how things work are a massive boon if you’re working on a system you’ve never seen before.

    Having documentation also means you can hand the reins over to someone else, whether you’re changing jobs or just want a holiday. Being the single person who knows how a critical system works is poor for your bus factor, and a misguided idea of job security.

  3. chown -R tomcat /
    There’s a reason we don’t give customers root access on fully-managed servers, but sometimes a would-be sysadmin gets it and just doesn’t know what they’re doing…

    Apparently the reason for this was “to make their tomcat app work”. I’m guessing it worked, but it broke the rest of the system in the process and a fresh system install was the fastest solution. Seriously, wtf?

  4. Wanting to rebuild from scratch rather than fixing existing systems
    Sometimes you’ll come across a legacy system that’s truly horrid. We’ll put our hands up, we’ve had a few. Replacing that old Tcl codebase sounds great on paper, and we’ll even be able to give it a fresh coat of paint.

    We’ll let you in on a secret: sure you could go and code up a replacement with a weekend’s work, but you’ve just wasted your weekend. There are many reasons why legacy systems stick around, and unless you understand how it all fits together you’ll never understand the real cost of upgrading. That’s not to say it’s not worth doing, but it’s dangerous to think you can knock up a replacement and expect it to work seamlessly.

  5. Only ever fixing symptoms instead of the real problem
    This is really a more general case of rebooting servers to fix a problem. A good sysadmin is methodical and rigorous when it comes to solving problems. An excellent sysadmin is also tenacious in hunting down the root cause and pragmatic in resolving it.

    A fuller understanding of the problem equips you with better information on which to base decisions, and experience for dealing with problems in future. By solving the real problem, you can be confident that it won’t crop up again.

    You don’t always have the perfect solution – that’s one for pragmatism. Sometimes you can’t fix the customer’s poorly-coded website, or the service that leaks memory all the time. You do what you can to patch things up and move on to the next task.

  6. Not automating things
    Another classic case of “no time for that”. The thing is, any time you can spend on automation will pay off, and it keeps paying off the more you use it. Anything that will be done more than a couple of times should be automated.

    Some systems are easier to automate than others, but every sysadmin needs to know how to script for their chosen platform. Aside from making things faster, properly-done automation makes them more reliable by removing the human element. Even if an automated process isn’t quick, the choice between “15min spent waiting to click a button a few times” and “15min reading a book while waiting for the script to finish” is a no-brainer.

  7. 22GB of swap space
    We saw this horror at the datacentre one evening on a customer’s colocated server. To be fair, there are times when this can make sense, but you have to ask yourself what you’re doing when it’s for a mail server with 3GB of physical RAM (with the tower chassis lying on the floor with the case off). Seriously, wtf??
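
As a rough illustration of points 3 and 6, here is a minimal Python sketch of how an ownership change could be scripted so that it refuses to run outside the application’s own directories. This is not how the customer’s server was fixed (that needed a reinstall). The directory names, the `is_safe_target` and `chown_tree` helpers and the `tomcat` user are all assumptions made up for the example.

```python
import os
import shutil
from pathlib import Path

# Hypothetical application directories; adjust to wherever the app really lives.
ALLOWED_ROOTS = [Path("/opt/tomcat"), Path("/var/lib/tomcat")]


def is_safe_target(target, allowed_roots=ALLOWED_ROOTS):
    """Return True only if target is an allowed root or lies beneath one."""
    # resolve() collapses ".." components and symlinks, so a path such as
    # /opt/tomcat/../.. is judged by where it really points.
    resolved = Path(target).resolve()
    for root in allowed_roots:
        root = Path(root).resolve()
        if resolved == root or root in resolved.parents:
            return True
    return False


def chown_tree(target, user):
    """Recursively change ownership, but only inside an allowed root."""
    if not is_safe_target(target):
        raise ValueError(f"refusing to chown outside the app directories: {target}")
    shutil.chown(target, user=user)
    # os.walk does not descend into symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(target):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                # Skip symlinks so a link pointing elsewhere cannot
                # drag files outside the tree into the change.
                continue
            shutil.chown(path, user=user)
```

With a guard like this, `chown_tree("/", "tomcat")` raises a `ValueError` instead of touching the whole filesystem, while `chown_tree("/opt/tomcat/webapps", "tomcat")` proceeds, given sufficient privileges and an existing `tomcat` user. It also gives you one documented, repeatable place where the permissions fix lives.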

A word we kept coming back to while thinking about this was professionalism – unfortunately there’s not a lot of it in the industry. IT is fairly well formalised and mature these days, but really doesn’t have standards of its own for professional ethics and accountability. Licensing addresses that in theory, but it’d be a generational change, not something that happens overnight.

We hope this provides some good intellectual fodder. If you’ve got any thoughts or comments, we’d love to hear them.

3 Comments

  • oliver says:

    Admittedly I’ve been banging on the testing drum a lot for almost two years now, but I’d be remiss if I didn’t mention it. You can take a step back from points 4 and 6 and say “how do I even know a rewrite/rebuild or the result of automation is doing the right thing?”. I’m convinced this doesn’t just apply to code but to infrastructure as well.

    The certainty you get from adding tests means you know that what you have automated has done the right thing (or at least matches your specification/acceptance criteria). Regardless of whether you rewrite/rebuild from scratch or just refactor, you still need to know that what you’ve changed still does the right thing. Again – testing is the only way to get that certainty. Often I find that simply adding tests makes it clearer what you need to change in order for things to be a reasonable improvement but far more understandable and/or elegant.

    Some of the other points can suffer from being time poor. Sadly over the last couple of years I’ve found bumping up swap space, fixing symptoms and rebooting is just easier when there are larger fires to put out but they should never be the “hammer of thor” you first reach for. On the other hand, I accept no excuses for lack of documentation – regardless of how much or little time there was!

    • Barney Desmond says:

      Hi Oliver,

      Agreed, testing is definitely important to avoid doing it wrong. For the sake of the listing we limited ourselves to seven, as it alliterates nicely with the rest of the title. 🙂

      The importance of testing your work probably can’t be overstated, but we find that clueful sysadmins learn that lesson fairly quickly after a “trivial change” to a live system blows up in their face. We’re lucky in our environment to have more experienced sysadmins that can impart their wisdom on new employees.

  • cody says:

    I don’t disagree with anything you said, though I wish the bus factor was applicable in IT and I don’t think it is.

    Aside from hoarding of knowledge and lack of documentation being so commonplace that it’s an expected stereotype everywhere I’ve seen, I’ve also witnessed the death spirals of a couple software houses that refuse to die no matter how many people management run over 😉

    It seems that as long as source code exists, businesses will find ways to get people to compile it and maintain it even if they don’t understand it. In the right niche and a just large enough customer base (on recurring subscriptions/licenses), these businesses can continue indefinitely by doing nothing.

    It’s probably a bit different in a high-churn market like web hosting, of course.