Wednesday, September 21, 2011

Unexpected downtime, RAID failure

As a precaution after a RAID failure on our main disk array, we have taken down all services that depend on NFS, including web hosting, mail service, and home directory file access.

We are working overnight to restore service as soon as possible.

We apologize for the inconvenience.

Update Thurs Sep 22 05:45am

First, we want to apologize (again!) for the delay. We were hoping to restore service yesterday morning. Things didn't exactly work out...

While we don't want to make excuses for the delay, we do want to be straightforward about what is going on.

Last Sunday, our backup server began to fail, and by Tuesday, the largest volume, which contained backups of home and web directories, appeared to be unrecoverable. Data is stored in a RAID volume (meaning it is resilient to a certain number of hard drive failures), so the simultaneous hard drive failures/corruption that would have been required suggest hardware problems with the server itself and not (or not only) the hard drives. To be honest, hard drive or RAID controller failure is not completely unexpected for an aging machine with aging hard drives (the server easily predates all of the current staff, so we don't know exactly how old it is). With 20/20 hindsight, we could have acquired and set up new hardware, but for a machine that hosts backup copies of data and is not directly accessible, the extra expense in time and money did not appear to be worthwhile.

On Tuesday at 10:30pm, the main disk array, which exports most data stored in NFS (home and web directories, and mail folders but not mail inboxes), suffered a RAID failure. Because RAID adds redundancy, no data was lost, but the redundancy itself was lost, meaning any further failure could result in data loss. This is why we then took down all services that can access or edit this data and modified others (like printing) to not depend on it. This will, however, prevent you from seeing your print quota; you will need to ask a staff member.
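To illustrate why one failure costs redundancy but not data, here is a minimal sketch of XOR parity, the idea behind single-parity RAID levels. We have not said above which RAID level the disk array uses, so treat the scheme, the block contents, and the `xor_blocks` helper as assumptions made purely for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data blocks (one per drive) plus one parity block.
data = [b"home", b"web!", b"mail"]
parity = xor_blocks(data)

# The drive holding data[1] fails. Its block can still be rebuilt
# from the surviving data blocks and the parity block.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])
```

With one block missing, the survivors are enough to reconstruct it, but there is nothing left over to absorb a second loss until a replacement drive has been rebuilt.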

The disk array is in its third year of service, so hardware problems with the server itself are not really expected but not improbable either. However, even the most reliable hard drives, accessed constantly 24/7, can fail. We did not have any spare hard drives of 1 TB or larger, so we could not immediately begin rebuilding the RAID volume (again, in hindsight, this was probably a mistake on our part), let alone any of the (more expensive) "enterprise-class" 1 TB hard drives that are proper for, and currently used in, the disk array.

On Wednesday morning, we bought a temporary "desktop-class" hard drive. When we mounted the new hard drive in the disk array, it was detected but unrecognized and marked as "bad" on reboot. We tried unsuccessfully to work around the problem. Other hard drives (of smaller capacity; they cannot be used to rebuild the RAID volume) were recognized and usable for other purposes without errors. It seems highly unlikely that a brand new hard drive would be bad, and we could not find any sign of errors when testing and running diagnostics on the hard drive in other machines. That makes the disk array suspect, but since it may work with other hard drives, it is not clearly at fault either. (There appear to be firmware restrictions on intermixing "desktop-class" and "enterprise-class" drives.)

On Wednesday evening, as another precautionary measure, we planned out a procedure to replicate the data on other machines so that if another hard drive or the disk array were to fail, we would not have data loss or corruption. To prioritize, we are copying data in alphabetical order from enabled (i.e., not disabled) accounts to another hard drive on the disk array. We will remove this hard drive when done for safekeeping, and also copy the same data over our internal network to another server with a RAID 1 (mirror) setup.
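The copy order described above can be sketched roughly as follows. This is not the script we are actually running; the account names, the `accounts` mapping, and the `copy_order` and `replicate` helpers are hypothetical and only show the idea of skipping disabled accounts and copying the rest alphabetically.

```python
import shutil
from pathlib import Path

def copy_order(accounts):
    """Return the names of enabled accounts in alphabetical order.

    `accounts` maps an account name to True if the account is disabled.
    """
    return sorted(name for name, disabled in accounts.items() if not disabled)

def replicate(accounts, src_root, dst_root):
    """Copy each enabled account's directory from src_root to dst_root."""
    copied = []
    for name in copy_order(accounts):
        src = Path(src_root) / name
        if src.is_dir():  # skip accounts that have no directory
            shutil.copytree(src, Path(dst_root) / name, dirs_exist_ok=True)
            copied.append(name)
    return copied

# Hypothetical accounts: "mallory" is disabled and is left out of the plan.
print(copy_order({"mallory": True, "bob": False, "alice": False}))
```

Working through a sorted list makes it easy to tell how far the copy has progressed and to resume from a known point if it is interrupted.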

We will try our best to restore service as soon as possible. We don't want to mislead you by suggesting a time earlier than what might end up happening, especially since we first need to make sure that existing data is safely backed up. Service downtime through the weekend is not acceptable, but it is possible, and depending on any obstacles encountered, the downtime could be longer or shorter.

Our Board of Directors (composed of interested OCF members, volunteer staff and "users" alike) currently meets weekly on Thursday at 6:45pm in the OCF lab. Our next meeting is today, and if you have any advice or comments, related or not to the downtime, we encourage you to attend.

Update Sat Sep 24 06:30pm

The ASUC Auxiliary is closed on weekends, and as a result we won't be able to obtain the package of hard drive replacements that we ordered until Monday, at which point we will be able to rebuild the array and bring services back online. We may be able to mount the disk array read-only before then if the local copy is complete.

Update Sun Sep 25 12:00am

The local copying of non-disabled accounts that was started on Thursday morning is about 90% finished. We're expecting it to be finished by the morning.

Update Sun Sep 25 10:30am

The login and mail servers are now mounting home directories read-only. SSH/SFTP will give you read-only access to your files, and IMAP/POP/mutt/webmail will give you read-only access to your mail.

Unfortunately, the ASUC Auxiliary is closed on weekends, and as a result we won't be able to obtain the package of hard drive replacements that we ordered until Monday, at which point we will be able to repair the array and bring all services back up.

Update Sun Sep 25 11:30am

The web server is now serving web pages read-only. This may break some sites that require writing to the home or web directory. For the time being, you can optionally use our error message to return an HTTP 503 (Service Temporarily Unavailable) error with an explanation.
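If your site runs its own application code, the same idea looks something like the sketch below. It is a generic WSGI example and not the error message mentioned above; the `maintenance_app` name, the wording of the message, and the one-hour `Retry-After` value are all assumptions.

```python
def maintenance_app(environ, start_response):
    """Answer every request with HTTP 503 and a short explanation."""
    body = (
        b"This site is temporarily unavailable: file storage is "
        b"mounted read-only during maintenance.\n"
    )
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
        ("Retry-After", "3600"),  # assumed value: suggest retrying in an hour
    ]
    start_response("503 Service Unavailable", headers)
    return [body]

# To try it locally with the standard library:
#   from wsgiref.simple_server import make_server
#   make_server("localhost", 8000, maintenance_app).serve_forever()
```

Returning 503 rather than a normal page tells browsers and search engines that the outage is temporary.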

Update Mon Sep 26 02:30pm

We obtained the new hard drives and added them to the array at 9am this morning. If there are no errors, we expect the resync to be complete by 5pm. We will then mount home directories with full read and write access, in their original state.

Update Mon Sep 26 05:15pm

Finally, all services are operational as before. This will hopefully be the last update...