Sunday, October 15, 2006

Status Update

The fsck didn't go so well. We're restoring our secondary backup of the disk array and going for another attempt at fixing the file system. This will be our final attempt at repairing the file system; we don't want to prolong our downtime since the process of restoring the backup to the disk array takes upwards of 10 hours. If we are unable to restore the file system, we'll wipe the disk array clean, create a new UFS file system, and rebuild user data from the tar archives we created on Friday and Saturday.

That is, our first attempt at repairing the file system failed. We're going to try again, but we're trying to balance our recovery efforts against minimizing downtime. If we can't repair the file system, we're just going to wipe the slate clean and pull data from an archive we made, which may be missing a very small fraction of user data (basically the data that was damaged during the initial hardware failure). Our worst-case estimate is around 1% data loss; most users won't be affected, and for the users who are, most of the files we were unable to recover seem to be unimportant (browser cache files, temporary lock files, etc.).

So, just to be clear: we're trying our best to get 100% data recovery, but doing so while minimizing downtime is difficult. Our worst-case scenario is bringing the OCF back with about 99% of the data intact and working with users to recover any important data from the 1% that may be lost.