Saturday, June 20, 2015

Ongoing downtime due to server crash

At 1:07pm today, hal, our primary production server, froze. We are on-site and working to restore it. We are moving the important servers to another machine while we investigate, as hal continues to experience issues.

Update 2:07pm: Service has been restored, but we are continuing to move servers to a different machine. There will be some downtime as we continue the migration, but it will affect single services only.

Remaining to migrate: (updated 5:04pm)

  • firestorm (ldap)
  • death (www)
  • pestilence (dns, dhcp)
  • supernova (admin)
  • maelstrom (mysql)
  • tsunami (ssh)
  • anthrax (smtp)
  • sandstorm (group smtp)
  • biohazard (apphost)
  • lightning (puppet)
  • earthquake (accounts)
  • typhoon (rt)
  • blight (wiki)
  • flood (irc)
  • reaper (jenkins)
  • dev-earthquake (dev-accounts)
  • pollution (cups)

Update 5:20pm: All VMs are migrated to jaws, and all services should be restored. We'll be debugging and rebuilding hal in the near future, and will be scheduling downtime some time in the next few weeks to move VMs back. We'll post a followup here when we have a date in mind.

Sunday, June 14, 2015

Directory listings re-enabled by default June 19th

On June 19th, we will re-enable Apache directory listings by default for both virtual hosts and userdir web hosting. You can disable these by creating a file named .htaccess in your web root with the line "Options -Indexes".
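
For anyone who would rather script this than edit the file by hand, here is a minimal Python sketch. The function name set_directory_listings is made up for illustration, and the web root path is whatever directory your site is served from (the sketch does not assume a particular location). It only touches lines that are exactly "Options +Indexes" or "Options -Indexes" and leaves everything else in .htaccess alone.

```python
from pathlib import Path

# The two settings this sketch knows how to manage.
INDEX_LINES = ("Options +Indexes", "Options -Indexes")


def set_directory_listings(web_root, enabled):
    """Make web_root/.htaccess end with a single Indexes setting.

    Hypothetical helper: web_root is the directory your site is served from.
    """
    htaccess = Path(web_root) / ".htaccess"
    wanted = "Options +Indexes" if enabled else "Options -Indexes"

    # Start from the existing file, if there is one.
    lines = htaccess.read_text().splitlines() if htaccess.exists() else []

    # Drop any earlier Indexes-only line so the two settings cannot conflict.
    kept = [line for line in lines if line.strip() not in INDEX_LINES]
    kept.append(wanted)

    htaccess.write_text("\n".join(kept) + "\n")
    return htaccess
```

Calling set_directory_listings(web_root, False) turns listings off for that directory; passing True turns them back on. If your .htaccess combines several options on one Options line, edit it by hand instead, since this sketch will not recognize that form.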

Server maintenance 6/19

We will be updating our physical servers on Friday, June 19th around 9pm PDT. All OCF services will be affected, though we expect downtime to be less than 15 minutes.

Thursday, May 21, 2015

Mail server maintenance 5/22

Our internal mail server (anthrax) will be unavailable for about an hour tomorrow. This only affects individual accounts with email forwarding, and mail sent from the OCF. Virtual hosts with email forwarding will still work.

No mail will be lost; instead, it will be queued until the server is available again. At worst, your mail may be delayed by an hour or two.

Staff alumni with Google Apps accounts won't be affected at all.

Online account tools maintenance tonight

The online account tools (used for requesting accounts, changing passwords, etc.) will be offline for about an hour tonight as we rebuild the server. No other services will be affected; you can still change your password via SSH if desired.

Wednesday, May 20, 2015

Directory listings disabled by default May 31st

On May 31st, we will disable Apache directory listings by default for both virtual hosts and userdir web hosting. You can re-enable these by creating a file named .htaccess in your web root with the line "Options +Indexes".

Tuesday, May 19, 2015

Server maintenance 5/24

We will be updating our physical servers on Sunday, May 24th around 9pm PDT. All OCF services will be affected, though we expect downtime to be less than 15 minutes.

jaws maintenance 5/20

We will be performing updates on jaws on Wednesday, May 20th. jaws is a testing machine which hosts no public services, though staff VMs (and any services they provide) will be unavailable.

Saturday, May 09, 2015

Another WordPress XSS vulnerability; please update!

Another vulnerability was recently discovered in WordPress which affects a large number of OCF web hosting users. The vulnerability can potentially allow a malicious person to hijack your session and compromise your website.

All users should update immediately to the latest version of WordPress. Version 4.2.2 (i.e. the latest version) is the only version we consider safe.

Updating WordPress is extremely easy; it's just a single click after logging in to the admin panel.

Recent versions of WordPress come with automatic updates enabled for minor releases, which can help to protect you from future vulnerabilities. We strongly recommend not disabling this feature!

If we've contacted you and you need help updating your site, please don't hesitate to get in touch so that we can help!

We will be emailing affected users in the near future and offering to upgrade WordPress on their behalf. If you'd like us to not do this, please confirm that either (a) you have updated it yourself, (b) you've removed WordPress entirely, or (c) you'd like to close your OCF account.

Thanks for your help!

Tuesday, May 05, 2015

Network downtime (resolved)

Around 1:15am Tuesday morning, we started experiencing high latency on our internal network. The high latency resulted in NFS reads/writes blocking for periods of several seconds, causing a backlog of processes on the web server and other servers. This resulted in timeouts when trying to access web pages, and eventually complete downtime when we took the servers offline.

We had four different volunteer staff in the lab troubleshooting the issue around 1:30am. It was difficult to pin down because the actual cause was intermittent, so downtime was slightly more than 30 minutes. (We tried various steps such as searching for network loops, removing different servers from the network, disconnecting from campus, etc.)

The ultimate cause was a broken backup script run by one of the student groups we host. From what we can tell, a daily backup script they had scheduled exceeded their disk quota, then continued thrashing the network trying to write blocks (which failed after exceeding the disk quota).
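
If you run your own backup script on OCF hosting, one way to avoid this failure mode is to stop at the first quota or disk-full error rather than carrying on with writes that cannot succeed. The Python sketch below is an illustration only, not the script involved in this incident. The name run_backup and the (source, destination) pair format are made up for the example.

```python
import errno
import shutil

# Errors that retrying will not fix: disk full, or over quota.
# EDQUOT is not defined on every platform, so fall back to ENOSPC.
FATAL_ERRNOS = {errno.ENOSPC, getattr(errno, "EDQUOT", errno.ENOSPC)}


def run_backup(pairs, copy=shutil.copyfile):
    """Copy each (src, dst) pair, stopping at the first quota/space error.

    Returns (copied, aborted): how many files were copied, and whether
    the run stopped early. Other errors are re-raised unchanged.
    """
    copied = 0
    for src, dst in pairs:
        try:
            copy(src, dst)
        except OSError as exc:
            if exc.errno in FATAL_ERRNOS:
                # Give up here instead of attempting further writes
                # that are certain to fail.
                return copied, True
            raise
        copied += 1
    return copied, False
```

A script built this way can report the early stop (for example, by emailing its owner) and exit, leaving the rest of the backup for a later run once space has been freed.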

We're monitoring the network now to ensure everything continues to operate normally, and will work on methods for limiting individual accounts' ability to cripple the network. We'll also improve our ability to monitor the network (our existing tools weren't granular enough for us to see the problem without directly witnessing it in iotop).