Saturday, June 20, 2015

Ongoing downtime due to server crash

At 1:07pm today, hal, our primary production server, froze. We are on-site and working to restore it. We are moving the important servers to another machine while we investigate, as hal continues to experience issues.

Update 2:07pm: Service has been restored, but we are continuing to move servers to a different machine. There will be some downtime as the migration continues, but each outage will affect only a single service.

Remaining to migrate: (updated 5:04pm)

  • firestorm (ldap)
  • death (www)
  • pestilence (dns, dhcp)
  • supernova (admin)
  • maelstrom (mysql)
  • tsunami (ssh)
  • anthrax (smtp)
  • sandstorm (group smtp)
  • biohazard (apphost)
  • lightning (puppet)
  • earthquake (accounts)
  • typhoon (rt)
  • blight (wiki)
  • flood (irc)
  • reaper (jenkins)
  • dev-earthquake (dev-accounts)
  • pollution (cups)

Update 5:20pm: All VMs have been migrated to jaws, and all services should be restored. We'll be debugging and rebuilding hal in the near future, and will schedule downtime sometime in the next few weeks to move the VMs back. We'll post a follow-up here when we have a date in mind.