Thursday, February 11, 2016

MySQL downtime Feb 11 to fsck disks

MySQL will be offline for about 10 minutes tonight for emergency maintenance.

This is in response to unscheduled downtime about an hour ago, when a kernel deadlock took down all MySQL services. We don't think filesystem corruption is the likely cause, but given the recent corruption at the OCF that affected nearly all servers (caused by Debian #788062), we think it's worth checking.

Friday, February 05, 2016

Upgrading user-facing servers to jessie

In the past year we've upgraded our entire infrastructure to Debian jessie, with the exception of user-facing machines.

The time to upgrade them is now. We've prepared upgraded versions of each of these servers and will swap them out early morning on Wednesday, Feb. 10th.

The servers that will be upgraded are:

  • tsunami, the public login server
  • biohazard, the app-hosting server
  • death, the web server

Most users won't notice the update, except that most software will be a newer version. The one exception is users who have dynamically linked binaries somewhere in their home directories.

Because many libraries will be upgraded, most of these programs will fail to run after the upgrade. The best solution is to recompile the binaries (or find newer, pre-compiled versions).
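If you aren't sure whether you have any compiled binaries, a rough way to check is to look for ELF files under your home directory. The following is a minimal sketch, not an official OCF tool; the function name find_elf_files is ours. It only checks the four-byte ELF magic number, so it will also list statically linked binaries and shared libraries, not just dynamically linked programs.

```python
import os

# Every ELF file (executables and shared libraries) starts with these 4 bytes.
ELF_MAGIC = b"\x7fELF"


def find_elf_files(root):
    """Yield paths under root whose first four bytes are the ELF magic number."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file
            # (for example sockets or named pipes).
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as f:
                    if f.read(4) == ELF_MAGIC:
                        yield path
            except OSError:
                # Unreadable file; ignore it and keep going.
                continue


# Example use:
#     for path in find_elf_files(os.path.expanduser("~")):
#         print(path)
```

Anything this prints is a candidate for recompiling after the upgrade.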

One specific case is environment managers like Python's virtualenv, Ruby's rbenv or rvm, and Node's nodeenv or nvm. These often put fully compiled versions of the interpreter in your home directory, and in most cases those copies will stop working. After the upgrade, you'll need to rebuild them.

For application hosting, you can find instructions on our website:
https://www.ocf.berkeley.edu/docs/services/webapps/

During the server swap, you should expect a small amount of downtime (about 5 minutes).

If you have any questions or need assistance, feel free to reach out to help@ocf.berkeley.edu.

Update Feb 07: We're going to push this back until early morning Wednesday (originally it was Monday) to give us a little more time to ensure a smooth upgrade.

Update Feb 09: For biohazard (app hosting), we'll be reaching out to individual groups using the server to coordinate a smooth upgrade. biohazard will continue to be available (and unupgraded); we'll be moving groups one-by-one to the new server (named werewolves).

Unexpected downtime Feb. 5

There was around half an hour of unexpected downtime for all OCF web services from 12:15-12:45am on February 5, as a server had to be recovered from an accidental misconfiguration. If you notice any issues in the next few days, please report them to help@ocf.berkeley.edu.

Thanks,
OCF Staff

Sunday, January 31, 2016

Campus internet access degraded (resolved)

The UC Berkeley campus is currently experiencing highly degraded internet (50+ ms latency and 30+% packet loss).

This is affecting access to all OCF servers from outside of campus. It is also affecting AirBears, ResComp, EECS machines, and other campus resources.

Update 4:12pm: This appears to have resolved itself.

Thursday, January 21, 2016

Degraded internet access (resolved)

The OCF is currently experiencing degraded internet. We are investigating.

Update: This has been resolved since 4:34pm.

Wednesday, January 20, 2016

Scheduled downtime Thursday night

We plan to perform maintenance on Thursday, January 21st around 9pm. All services will be unavailable during this time. Total downtime should be less than 20 minutes.

Tuesday, January 19, 2016

Username length limit raised to 16 characters

The maximum username length has been raised from 8 to 16 characters. Enjoy your longer usernames!

Note that some commands, such as ps and w, may truncate usernames longer than 8 characters or behave in other unexpected ways. If you have any scripts which use such commands or otherwise assume usernames are 8 characters or fewer, be aware that they may need to be revised.
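One way to make a script robust to this is to work with numeric user IDs and look the name up in the password database, rather than parsing a username column that may have been truncated. Here is a minimal sketch using Python's standard pwd module; the helper name username_for_uid is ours, and how you obtain the UID in the first place depends on your script.

```python
import os
import pwd


def username_for_uid(uid):
    """Return the full username for a numeric UID, or None if it is unknown."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        # No password database entry for this UID.
        return None


# Example use: the full, untruncated name of the user running the script.
#     print(username_for_uid(os.getuid()))
```

Because the lookup returns the name exactly as stored, it is not affected by column widths in tools like ps or w.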

Wednesday, December 16, 2015

Scheduled downtime Thursday night to fsck disks

Due to a flurry of I/O errors on one of our physical servers during a routine update tonight, we're scheduling downtime tomorrow (Thursday) night after 9pm to fsck all virtual servers.

The total downtime should be no more than an hour, in the form of rolling outages. Key services like web hosting should only be unavailable for about 15 minutes.