What's the point of all this maintenance?
September 15, 2011
As you can guess from our schedule of activities for this Saturday, the point of maintenance boils down to keeping stuff up to date and fixing stuff that's broken. So why do we need to keep things up to date and why can't we fix the broken stuff some other time?
Why bother with updates?
For a while, not updating systems was the norm, and this was for all the reasons that are likely to be obvious to you - downtime is bad, change is bad, and nothing bad ever happens anyway. And this was all fine until bad stuff started to happen with systems that were never updated. Really bad stuff. The Oxy campus got hit pretty hard by the Blaster worm (as did just about everyone else on the Internet). Back then, we didn't update our employee desktops, we didn't require students to register their computers, and we didn't regularly patch our servers. Instead of doing those things, we focused solely on what is known as network perimeter security. It was expensive and difficult to manage but it ultimately worked. Until it didn't. So we'd shore up our network perimeter again. Lather. Rinse. Repeat.
In the meantime, we had servers getting infected, which meant we had to take them down, rebuild them and restore them from backup, leaving departments without their data for days. We had students bringing down infected machines to add to our queue, which on bad weeks numbered in the hundreds, and those students patiently waited up to a week to get their machines back. And we had our employee desktops getting infected, requiring ITS to come out, wipe them clean, and restore them to their default configuration.
It's worth pointing out that nearly all of the massive worm infestations I linked to above were entirely preventable - the necessary security updates were available, they just weren't being installed. So we learned our lesson. We install updates now. We know it inconveniences everyone but hopefully everyone can see that the alternative was pretty lousy.
You said something about fixing broken stuff?
Oh yes, that's the other bit. On any given day, lots of things break. Most of it is easily fixable. If a fix requires a reboot or downtime but affects a relatively simple system, or if the fix is limited in scope and impact, we'll usually do it late at night or early in the morning. But not all fixes are simple. And when the thing that's broken is an enterprise database application, a critical piece of networking equipment, or a complex bundle of disks, fixing the problem can be a most decidedly non-trivial task. The main concern here is the potential impact if something goes wrong while fixing it. The technology involved is pretty complex, with lots of moving parts and lots of dependencies. Oftentimes, this means lots of different folks need to be involved, and if things go wrong, lots of different vendors might need to be contacted and additional folks from ITS may need to be brought in as well.
So why Saturday morning? As trivial as this might sound, one of the reasons is that the aforementioned people tend to be awake. There are practical benefits to this. Not having to wait while the second-tier support engineer wakes up, showers, gets dressed and commutes into the office is definitely a plus when your entire network is down. But the bigger concern is that if you're doing things late at night or early in the morning, people tend to be tired, and being tired is bad. Depending on the circumstances, it can be really, really bad. So by picking a time when everyone is awake, we reduce the chance of mistakes being made when working with our most complex systems. Networking equipment can be tricky enough to work with, even when you're awake and alert. A group of network engineers found this out last week, as you may recall. Saturday morning isn't perfect, and we know the work of the College is not confined to business hours.
With few exceptions, just about any time we pick for maintenance inconveniences someone (the Jackson-Buffet principle) and the fact that everyone has been so understanding of this is a big part of why I like working here.
But what about...
Any other questions? Comments? Criticisms? Are my arguments invalid? Please feel free to leave a comment below or send me an email.