Friday, October 13, 2006

Status Update

We're still backing up the rest of the user data on our disk array to some spare space we have on our servers. Since we have upwards of 400 GB of data, and we're transferring most of it over NFS (over regular Ethernet rather than SCSI or Fibre Channel), it's taking a long time.

Some users have asked about data loss during this recovery. Most mail daemons should be smart enough to retry delivery once service to the OCF is restored. If our downtime becomes prolonged, we will try to figure out a way to queue mail so it doesn't get bounced.

With regard to user data (i.e., anything other than mail), we're pulling the data off the disk array as quickly as possible. So far, it seems like most user data is intact; we're only getting about 1-2% corruption. That's not to say that this 1-2% of data is lost; we're just pulling the good data from the disk array at the moment. We haven't even begun to run the Unix equivalent of Scandisk, so it seems like there's a good chance we'll have 100% data recovery. Keep your fingers crossed, though.

Beyond the fact that we're working with such a massive amount of data, one of the holdups on our recovery is acquiring an LSI Logic PCI-X SAS/SATA host controller that supports Sun Solaris SPARC so we can set up a staging area to back up our disk array. If you don't understand what all those acronyms mean, let's just say you can't walk into any CompUSA or Best Buy and find that card. The only place that seems to carry the card is Newegg, but it's $300, and, even with overnight shipping, the earliest we're getting it is Monday.

Yury and I (the current site managers) have been taking long shifts in the OCF to get user data back, and most of the other staff have been around to provide assistance (thanks, sluo, for saving us when we don't know Solaris 10!), so we're working on it!