Why it was down
For background, all of the OCF's production infrastructure is supposed to live on two physical servers: jaws and pandemic. There's a third legacy physical server named hal which hosts some testing machines and our backups.
A problem removing a backup logical volume had led to a deadlock on hal, leaving many processes in uninterruptible sleep since Tuesday (believed to be a kernel bug). To try to fix the issue, a staff member gave 15 minutes' warning before restarting hal today (Thursday). Since hal isn't supposed to hold important services, this should have been totally safe, and 15 minutes is considered an acceptable warning period, since normally the only people who will even notice are other staff.
Unfortunately, two servers, pollution (the print server) and maelstrom (the MySQL server), were on hal due to some temporary migrations. They should have been moved back to jaws about a week ago, but weren't.
When hal went down, it took down these production services, killing MySQL (which also took down many websites, the OCF's website, Request Tracker, ...) and printing in the lab. The staff member realized this as soon as monitoring triggered and, when hal didn't come back up, phoned another staffer who was in the lab at the time.
Due to a misconfiguration, hal entered maintenance mode, and the other staffer had to enter the root password and fix the filesystem configuration before hal would boot. As soon as hal booted, MySQL and printing were started and service was restored.
- 6:35pm "15 minutes until hal restart" email goes out from staffer at home
- 6:50pm hal is restarted remotely
- 6:52pm staffer realizes hal had production VMs and isn't coming back online; phones another staffer in the lab
- 7:04pm staffer in lab fixes boot config, hal is restarted; remote staffer leaves home for the OCF
- 7:09pm hal is back on and services are available
- 7:15pm original staffer arrives in lab to find everything already fixed