The title for this post was originally going to be “things that will get you fired from Anchor”, but we realised they might be more generally applicable, and actually more cerebral than they first appear.
These aren’t just any old sysadmin WTFs. To be sure, some are serious “what ever made you think that could possibly be a good idea!?” material, but most are a bit more subtle and deep seated.
Some of them are massively widespread, and we’d be remiss not to acknowledge that they’re hard to fix: being a sysadmin is suffering, and it can be an act of heroism to dig yourself out of someone else’s hole to get on top of things.
- Rebooting as an instinctive reaction to problems
Sometimes it’s just so tempting to reboot a server when you’ve seen it fix a problem before. As much fun as it is to poke fun at Windows, that joke is getting kind of old these days, and properly-managed systems really don’t just crash for no reason. If there’s a reason, you can find it and fix it, and the problem will stop happening. If you’re being paid to maintain servers, it’s your job to get those problems fixed!
- Not documenting or logging their work
This one is especially common, and the reply is always “I don’t have time for that”. It’s something you must make time for, because you’ll never get it done “later”.
If you tell your boss that it’ll take three days to deploy this new server, your estimate had better include time taken for documentation. You might do a perfect job and it hums along for years without problems, but eventually someone will need to fix it, possibly you. Without documentation, a lot of time will be wasted rediscovering all the basic details of the system. Even basic notes with an outline of how things work are a massive boon if you’re working on a system you’ve never seen before.
Having documentation also means you can hand the reins over to someone else, whether you’re changing jobs or just want a holiday. Being the single person who knows how a critical system works is poor for your bus factor, and a misguided idea of job security.
- chown -R tomcat /
There’s a reason we don’t give customers root access on fully-managed servers, but sometimes a would-be sysadmin gets it and just doesn’t know what they’re doing…
Apparently the reason for this was “to make their tomcat app work”. I’m guessing it worked, but it broke the rest of the system in the process and a fresh system install was the fastest solution. Seriously, wtf?
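For what it’s worth, getting an app like this working rarely needs more than ownership of the app’s own directories. Here’s a minimal Python sketch of a recursive chown that refuses to run outside an allow-list. The paths in `ALLOWED_ROOTS` and the function names are our own illustrative assumptions, not anything from the customer’s setup.

```python
import os
import pwd
from pathlib import Path

# Hypothetical allow-list: adjust to wherever your app actually lives.
ALLOWED_ROOTS = ["/opt/tomcat/webapps", "/var/log/tomcat"]


def is_safe_target(path, allowed_roots=ALLOWED_ROOTS):
    """Return True only if path is an allowed root or sits below one."""
    resolved = Path(path).resolve()
    for root in allowed_roots:
        root = Path(root).resolve()
        # Compare whole path components, so /opt/tomcat/webapps-old doesn't match.
        if resolved == root or root in resolved.parents:
            return True
    return False


def chown_tree(path, user, allowed_roots=ALLOWED_ROOTS):
    """Recursively give `user` ownership of path, refusing unsafe targets."""
    if not is_safe_target(path, allowed_roots):
        raise ValueError(f"refusing to chown {path}: outside the allowed roots")
    target = Path(path).resolve()
    uid = pwd.getpwnam(user).pw_uid
    # A group of -1 leaves the group unchanged; follow_symlinks=False changes
    # a link itself rather than whatever it points at.
    os.chown(target, uid, -1, follow_symlinks=False)
    for dirpath, dirnames, filenames in os.walk(target):
        for name in dirnames + filenames:
            os.chown(os.path.join(dirpath, name), uid, -1, follow_symlinks=False)
```

With this in place, `chown_tree("/", "tomcat")` raises a `ValueError` instead of quietly wrecking the box. It’s only a sketch, but the principle holds: scope the change to what the app needs.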
- Wanting to rebuild from scratch rather than fixing existing systems
Sometimes you’ll come across a legacy system that’s truly horrid. We’ll put our hands up, we’ve had a few. Replacing that old Tcl codebase sounds great on paper, and we’ll even be able to give it a fresh coat of paint.
We’ll let you in on a secret: sure, you could go and code up a replacement with a weekend’s work, but you’ve just wasted your weekend. There are many reasons why legacy systems stick around, and unless you understand how it all fits together you’ll never understand the real cost of upgrading. That’s not to say it’s not worth doing, but it’s dangerous to think you can knock up a replacement and expect it to work seamlessly.
- Only ever fixing symptoms instead of the real problem
This is really a more general case of rebooting servers to fix a problem. A good sysadmin is methodical and rigorous when it comes to solving problems. An excellent sysadmin is also tenacious in hunting down the root cause and pragmatic in resolving it.
A fuller understanding of the problem equips you with better information on which to base decisions, and experience for dealing with problems in future. By solving the real problem, you can be confident that it won’t crop up again.
You don’t always have the perfect solution, and that’s where pragmatism comes in. Sometimes you can’t fix the customer’s poorly-coded website, or the service that leaks memory all the time. You do what you can to patch things up and move on to the next task.
- Not automating things
Another classic case of “no time for that”. The thing is, any time you can spend on automation will pay off, and it keeps paying off the more you use it. Anything that will be done more than a couple of times should be automated.
Some systems are easier to automate than others, but every sysadmin needs to know how to script for their chosen platform. Aside from making things faster, properly-done automation makes them more reliable by removing the human element. Even if an automated process isn’t quick, the choice between “15min spent waiting to click a button a few times” and “15min reading a book while waiting for the script to finish” is a no-brainer.
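To make that concrete, here’s a minimal Python sketch of the sort of wrapper we mean: it runs a fixed list of steps in order and stops at the first failure, so nobody has to remember the sequence or notice an error scrolling past. The `run_steps` name and the placeholder steps are hypothetical; swap in whatever commands your task actually needs.

```python
import subprocess
import sys


def run_steps(steps):
    """Run (description, argv) steps in order, stopping at the first failure.

    Returns the descriptions of the steps that completed.
    """
    done = []
    for description, argv in steps:
        print(f"==> {description}")
        result = subprocess.run(argv, capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError(
                f"step failed: {description}: {result.stderr.strip()}"
            )
        done.append(description)
    return done


if __name__ == "__main__":
    # Placeholder steps that only invoke the Python interpreter;
    # real ones would be your deploy, backup, or cleanup commands.
    run_steps([
        ("check interpreter", [sys.executable, "--version"]),
        ("no-op step", [sys.executable, "-c", "pass"]),
    ])
```

Stopping at the first failure is a deliberate choice here: a script that ploughs on after an error is just a faster way of making the same mistakes.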
- 22GB of swap space
We saw this horror at the datacentre one evening on a customer’s colocated server. To be fair, there are times when this can make sense, but you have to ask yourself what you’re doing when it’s for a mail server with 3GB of physical RAM (and the tower chassis lying on the floor with the case off). Seriously, wtf??
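The arithmetic is easy to script as a sanity check. Here’s a rough Python sketch that reads `/proc/meminfo`-style text and reports swap as a multiple of physical RAM. The sample text is made up to match the figures above (it is not output from that server), and the function names are ours.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: value in kB} dict."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_to_ram_ratio(text):
    """Return total swap divided by total physical RAM."""
    info = parse_meminfo(text)
    return info["SwapTotal"] / info["MemTotal"]


# Illustrative sample only: 3GB of RAM and 22GB of swap, expressed in kB.
SAMPLE = """MemTotal:        3145728 kB
SwapTotal:      23068672 kB
"""

ratio = swap_to_ram_ratio(SAMPLE)
print(f"swap is {ratio:.1f}x physical RAM")  # prints: swap is 7.3x physical RAM
```

On a real Linux box you’d pass in the contents of `/proc/meminfo` instead of the sample. What ratio counts as too much depends on the workload, so we’ve left out any hard threshold.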
A word we kept coming back to while thinking about this was professionalism – unfortunately there’s not a lot of it in the industry. IT is fairly well formalised and mature these days, but really doesn’t have standards of its own for professional ethics and accountability. Licensing addresses that in theory, but it’d be a generational change, not something that happens overnight.
We hope this provides some good intellectual fodder. If you’ve got any thoughts or comments, we’d love to hear them.