Open Computing Facility Status: Random Outages

Over the past day and a half, we have been seeing random errors in our primary authentication system. These errors have made it difficult for some users to log in and have caused other problems in the physical lab (printer queue jams, frozen terminals, etc.). We're not yet sure what's causing the problems, but we strongly suspect it's related to our use of NIS+ (an old standard developed by Sun Microsystems that has since been deprecated). Thanks to sluo, we were able to recover from these errors, so everything should be up and working now.

As for our other services:

The mail queue is still being processed, but a large backlog of mail remains in it.

MySQL databases should now be restored as completely as we are able to restore them. Users whose data we have identified as problematic to recover will be contacted individually by email tomorrow evening (I'm consolidating a list of the errors we received so I can send everything in one pass).

PostgreSQL is still being investigated and debugged.

We've found a way to get our disk array serviced, but it means sending back critical parts of the array. Since downtime is unacceptable, we're going to build a temporary disk array out of commodity parts and swap it in for the current one in a 'hot-swap' manner. I'm still waiting for the parts to arrive in the mail, though...

Sorry for not updating this blog over the past couple of days; I've been rather busy, and I only got to leave the OCF around 4 AM yesterday morning.

Thursday, October 26, 2006