So what is central storage and why are you doing so much maintenance because of it?
March 15, 2012
my team and I have to shut everything down in order to work on our "central storage equipment" or some other technobabble. Centralized storage amounts to one or more highly specialized servers that have loads of hard drives and can share all of that space with other servers. In Oxy's case, our centralized storage comes in the form of a Sun (now Oracle) 7410, which you can see to the left. The 7410 is crammed in the middle, sandwiched between pairs of big boxes with lots of disks. Those boxes are called J4400s and we used to have only two of them. On January 7th, we added the other two. Or at least, we tried.

We ran out of space back in October of 2011, which is never a good thing. We needed more disks, but that turned out to be easier said than done. Still, towards the end of 2011, we secured two used J4400s, and it seemed like a pretty straightforward process of hooking them up and enjoying the additional free space. There was a catch, though. Because we were installing second-hand parts, we needed Oracle techs to come out and "recertify" everything so that we would be eligible for support and maintenance. In addition to making sure everything was installed properly, they wanted us to install the latest software updates. So on January 7th, we began the upgrade process.

It has not gone smoothly. For the most part, we were hampered by the need to replace or add parts. First, we needed to buy 4 extra disks: the used J4400s we got came with 22 disks each and they needed 24. Then we needed to replace the controller cards on the 7410 with new ones that Oracle shipped out to us.

Remember how I mentioned earlier that we needed to upgrade the software? Well, we couldn't, because 4 out of the now 102 disks attached to our 7410 were not functioning properly. At first, Oracle thought it was a problem with the cables. We ordered new ones and swapped them out. No dice. Then Oracle figured we could shut everything down, unplug everything, then plug everything back in. We did this. No luck. Then we did it again.

Remember that software update we couldn't install? Oracle told us we could force it to ignore the problem with the 4 disks and install anyway, but that it was a bad idea. Until, apparently, it wasn't, and we ended up forcing the install anyway. So now we have the latest software, which made us eligible to receive all of that technical support that we had been availing ourselves of anyway.

Finally, it came down to something called a SIM board. What's a SIM board? I don't know, but there are 8 of them, 2 in each J4400, and 1 of them is faulty. Oracle can narrow it down to 2 possible culprits. They ship us a new one and wish us luck. Finally, on March 14th (happy pi day, by the way), we power down our 7410, replace one of the SIMs, then turn everything back on. And it's finally fixed! Oh wait, I replaced the wrong one. Shut everything down again, replace the other SIM, then power everything back on. And it's finally fixed!

I'm glossing over the details here, but each time we shut down the 7410, we also have to shut down nearly 50 servers that rely on it for storage, along with the entire Oxy Virtual Computer. We've had lots of practice at this, obviously, but it still takes about 30 minutes. The 7410 itself takes about 15 minutes to start up, and then it takes another 30 minutes or so to start up those 50 servers and the Oxy Virtual Computer again. That adds up to well over an hour each time.

So there you have it. That all happened, it required a lot of downtime, and a lot of people were inconvenienced by all of the maintenance we had to schedule to try to fix this. I'm sorry about this. I always say that your patience and understanding are appreciated - yes, I say that all the time so I'm sure it's wearing thin, but seriously, your patience and understanding are really, truly, sincerely appreciated.