Tuesday, May 05, 2015

Network downtime (resolved)

Around 1:15am Tuesday morning, we started experiencing high latency on our internal network. The high latency caused NFS reads and writes to block for several seconds at a time, which created a backlog of processes on the web server and other servers. This resulted in timeouts when trying to access web pages, and eventually complete downtime when we took the servers offline.

We had four different volunteer staff in the lab troubleshooting the issue around 1:30am. It was difficult to pin down because the actual cause was intermittent, so downtime was slightly more than 30 minutes. (We tried various steps such as searching for network loops, removing different servers from the network, disconnecting from campus, etc.)

The ultimate cause was a broken backup script run by one of the student groups we host. From what we can tell, a daily backup script they had scheduled exceeded their disk quota, then continued thrashing the network trying to write blocks (which failed after exceeding the disk quota).
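As an illustration only (this is not the group's actual script, which isn't shown here), a backup loop can avoid this failure mode by treating "quota exceeded" and "disk full" as fatal instead of retrying them. The names `write_blocks`, `write_block`, and `BackupAborted` below are hypothetical, and the retry limits are arbitrary assumptions.

```python
import errno
import time

# Errors that retrying cannot fix: disk full or quota exceeded.
# EDQUOT is looked up defensively in case a platform lacks it.
FATAL_ERRNOS = {errno.ENOSPC, getattr(errno, "EDQUOT", errno.ENOSPC)}


class BackupAborted(Exception):
    """Raised when the backup stops early instead of retrying forever."""


def write_blocks(blocks, write_block, max_retries=3, retry_delay=1.0):
    """Write each block, giving up immediately on quota/space errors.

    write_block is any callable that takes bytes and raises OSError on
    failure (a hypothetical stand-in for the real write to the file server).
    Returns the number of blocks written.
    """
    written = 0
    for block in blocks:
        for attempt in range(max_retries + 1):
            try:
                write_block(block)
                written += 1
                break
            except OSError as exc:
                if exc.errno in FATAL_ERRNOS:
                    # Retrying a write that is over quota only adds load.
                    raise BackupAborted(
                        f"stopping after {written} blocks: {exc}"
                    ) from exc
                if attempt == max_retries:
                    raise BackupAborted(
                        f"gave up after {max_retries} retries: {exc}"
                    ) from exc
                # Other errors may be transient; wait, then retry.
                time.sleep(retry_delay)
    return written
```

The point of the sketch is the early exit: once a write fails because of the quota, the loop stops instead of sending further writes that are bound to fail.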

We're monitoring the network now to ensure everything continues to operate normally, and will work on methods for limiting individual accounts' ability to cripple the network. We'll also improve our ability to monitor the network (our existing tools weren't granular enough for us to see the problem without directly witnessing it in iotop).

Wednesday, February 11, 2015

Downtime Sunday for maintenance

On Sunday night (February 15), we will be performing one-time maintenance on the OCF file server. Total downtime should be no more than two hours (and probably much less).

Thursday, January 29, 2015

Firewall maintenance Feb. 17

IST will be performing maintenance on OCF's firewall on Tuesday 2/17 from 5:30am to 7am. OCF services may be unavailable during this window.

Update: IST rescheduled the maintenance to 2/17.

Tuesday, January 27, 2015

Downtime Tuesday for security updates

All servers will be restarted the night of Tuesday, Jan 27 to apply security updates. Sorry for the inconvenience.

Tuesday, January 13, 2015

Downtime Monday for kernel updates

The login server (SSH) will be restarted Monday (Jan 19) night to apply security and performance updates. Total downtime should be less than 10 minutes.

Edit: Originally, the downtime was only intended for the login (SSH) server. We're expanding it to cover all servers so that we can apply recent security updates. Total downtime should be less than 20 minutes.

Friday, December 26, 2014

Update on scheduled downtime Dec 27-28 and Jan 3-4

Update Jan. 04: The outage is over; all services have been restored.

Update Jan. 02: We've just migrated most services to the offsite server, and taken the others offline for the second (and last) scheduled outage. We expect to be back online for good Sunday evening.

Update Dec. 28: Power was restored at 7pm PST as expected, and all services are now back online. Everything we had planned (powering on the servers remotely via IPMI, copying files/db from the offsite host, etc.) worked great during both the transition to and away from the offsite server. We will do the same thing next weekend. If you still notice any problems, please contact us.

As we found out earlier this month, there will be a power outage in Hearst Gym during the weekend of December 27-28 and January 3-4.

Normally, this would result in all services being completely unavailable. However, we've put in a lot of effort to reduce the impact by transferring as much content to an off-site server as possible. Here's a summary of what to expect:

  • Web hosting will keep working for most accounts. All student group websites have been copied, as have almost all individual accounts.

    We copied all individual accounts which have had web traffic in the past month.

    We copied all student group websites, but student groups with email virtual hosting will not be able to use the offsite server, and will be down during the weekends. Unfortunately, we aren't able to switch A records for these virtual hosts, so there's no way for us to keep these sites available during the downtime while complying with university policy on off-site hosting. Only about 7% of student groups are using email virtual hosting; the rest will be able to use the off-site server.
  • Email hosting and forwarding will be unavailable. There's not much we can do about this, unfortunately. Messages will be delayed by the sending server automatically, and you'll receive them shortly after the outage ends.
  • MySQL will be available on the off-site server. If your website requires MySQL, it will continue to work.

The main OCF website will be available, but the wiki will not. Other services (like SSH, F/OSS mirrors, etc.) will be unavailable.

We've spent a lot of time trying to minimize the impact of the power outage, but there are some things we can't do (we're extremely limited by the university's policies on off-site hosting, and our own lack of resources).

If you have any questions, you can email us at help@ocf.berkeley.edu; we'll be able to view and respond to mail during the outage.

Wednesday, December 10, 2014

Scheduled downtime: Dec 18, Dec 27-28, Jan 3-4

We found out yesterday that, due to construction, Hearst Gym will have no power on Dec 27-28 and Jan 3-4. Unfortunately, all OCF services will be affected by the power outage.

We're looking into ways to reduce the impact, but for now you should expect the following:

  • Web hosting: All web hosting, including student group hosting, will be unavailable. We're working on providing a descriptive error page, rather than simply having requests time out.
  • Email hosting: Email sent to students or to groups with virtually-hosted mail will be delayed until the outage ends. Senders might receive a notice that delivery has been delayed, but you will still receive the messages shortly after the power returns.

These services will be completely unavailable:
  • Database (MySQL) access
  • Shell (SSH/SFTP to tsunami)
  • F/OSS Mirrors (mirrors.ocf.berkeley.edu)

We're working now to minimize the impact of the outage, and will post updates here. Please email us if you have any questions.

Update 12/10: We are scheduling downtime during the evening of Thursday, December 18th to test our ability to start all servers and services remotely. Total downtime should be less than 30 minutes.

Update 12/18: Maintenance for tonight is completed. Total downtime was about 45 minutes (instead of the expected 30) due to a problem with a switch after we restored power. The good news is that we caught it now rather than in a week when nobody will be around to fix it. Everything else worked as expected.