Tuesday, November 06, 2018

Upgrading user-facing servers to Debian stretch

In the past year and a half we've upgraded our entire infrastructure to Debian stretch, with the notable exception of any user-facing machines.

Edit (2018-11-25): We've now upgraded all of the servers listed below! Please contact help@ocf.berkeley.edu if you have any questions about the upgrade or notice anything broken.

The time to upgrade them is now! We've prepared upgraded versions of each of these servers and will swap them out on the evening of Sunday, November 25th, 2018. This has been postponed from Sunday, November 18th, since some apphosting groups did not have their applications ready on the new server in time for the migration.

The servers that will be upgraded are:
  • tsunami, the public login server
  • werewolves, the apphosting server (we've reached out to groups using this server and will be replacing it with a new server named vampires)
  • death, the web server

Most users won't notice the update, except that most software will be available in newer versions. The main exception is users who have dynamically linked binaries somewhere in their home directories.

Because many system libraries will be upgraded, most of these binaries will fail to run after the upgrade. The best solution is to recompile the binaries (or find newer, pre-compiled versions).
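If you aren't sure whether you have any compiled binaries in your home directory, a short script can list candidates. The following is only a minimal sketch: the function names are made up for illustration, and it simply looks for files that start with the ELF magic bytes, so it does not tell statically linked binaries apart from dynamically linked ones. It avoids newer syntax so that it should also run on the older Python 3 versions listed in this post.

```python
import os

# Every ELF executable or shared library starts with these four bytes.
ELF_MAGIC = b"\x7fELF"


def is_elf(path):
    """Return True if the file at `path` starts with the ELF magic bytes."""
    try:
        with open(path, "rb") as f:
            return f.read(4) == ELF_MAGIC
    except OSError:
        # Unreadable files are skipped rather than treated as errors.
        return False


def find_elf_files(root):
    """Return a sorted list of regular (non-symlink) ELF files under `root`."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            if is_elf(path):
                found.append(path)
    return sorted(found)
```

Calling `find_elf_files(os.path.expanduser("~"))` returns the paths found under your home directory. After the upgrade, running `ldd` on one of those paths should show whether any shared libraries it needs are reported as "not found".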

One specific case is with environment managers like Python's virtualenv, Ruby's rbenv/rvm, and Node's nodeenv/nvm. These often put fully-compiled versions of the interpreter in your home directory, and in most cases, these will fail to work once the server is upgraded. After the upgrade, you'll need to rebuild these to get them to work again.

Here are some major versions of programs that will be upgrading:

- Ruby 2.1.5 -> Ruby 2.3.3
- Python 3.4.2 -> Python 3.5.3
- Python 2.7.9 -> Python 2.7.13
- NodeJS 0.10.29 -> NodeJS 4.8.2
- PHP 5.6.36 -> PHP 7.0.30
- Perl 5.20.2 -> Perl 5.24.1
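
As an illustration of what rebuilding an environment can look like for Python, here is a rough sketch. It assumes the environment was created with Python 3's built-in `venv` module (the virtualenv tool works similarly but is a separate program) and that you keep a list of your packages in a requirements file. The function name and its parameters are hypothetical.

```python
import os
import subprocess
import venv


def rebuild_venv(env_dir, requirements=None, with_pip=True):
    """Recreate the environment at `env_dir` using the current interpreter.

    clear=True deletes whatever is already in `env_dir`, including the old
    interpreter copy that no longer runs after the upgrade.
    """
    venv.create(env_dir, clear=True, with_pip=with_pip)
    python = os.path.join(env_dir, "bin", "python")
    if requirements and with_pip:
        # Reinstall packages into the fresh environment.
        subprocess.check_call(
            [python, "-m", "pip", "install", "-r", requirements]
        )
    return python
```

For example, `rebuild_venv("venv", "requirements.txt")` would wipe and recreate a directory named `venv` and reinstall the packages listed in `requirements.txt`. Since the old directory is deleted, make sure the requirements file is up to date first.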

For upgrading any apps using our application hosting, you can find more detailed instructions on our website: https://www.ocf.berkeley.edu/docs/services/webapps/

During the server swap, you should expect a small amount of downtime (about 5-10 minutes) as the new servers are swapped in for the old ones.

If you have any questions or need assistance feel free to reach out to help@ocf.berkeley.edu.

Thanks for flying OCF!

Friday, October 12, 2018

Account Creation and Password Resets Temporarily Down, October 12

Due to ongoing maintenance, account creation and password resets are down today.

At roughly 6:15PM, there will be a brief period of NFS downtime as we attempt to fix the issue.

Thanks for flying OCF!

Scheduled maintenance on night of 2018-10-12

The OCF is anticipating a short period of intermittent service unavailability in order to perform some additional maintenance on our hypervisors as a followup to last week's maintenance event. Specifically, we intend to migrate NFS to our new fileserver, reinstall our hypervisors onto new disks, and possibly migrate our mirrors server to new hardware. We are scheduling this event for low-utilization hours at night to minimize any disruption to our users.

Thanks for flying OCF and send us an email if you have any questions!

Monday, October 08, 2018

IPv6 Connectivity Issues on October 8

Starting last night, the OCF has been experiencing some connectivity issues to our public SSH server over IPv6. If you are having trouble logging in to ssh.ocf.berkeley.edu, please try using IPv4 to connect. To do this, you can add -4 to your SSH connection command, like so:

ssh -4 <OCF username>@ssh.ocf.berkeley.edu

If your SSH client does not support the -4 flag, you can also connect directly to our server's IPv4 address. To do this, just connect to `169.229.226.25` instead of `ssh.ocf.berkeley.edu`.
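
If you'd rather look up the IPv4 address yourself than copy it from this post, the lookup can be sketched in a few lines of Python. This is only an illustration: the function name is made up, and it relies on the standard library's `socket.getaddrinfo` restricted to the IPv4 address family.

```python
import socket


def ipv4_addresses(hostname, port=22):
    """Return the distinct IPv4 addresses that `hostname` resolves to."""
    # AF_INET restricts the lookup to IPv4, much like ssh's -4 flag.
    infos = socket.getaddrinfo(
        hostname, port, socket.AF_INET, socket.SOCK_STREAM
    )
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        address = sockaddr[0]
        if address not in addresses:
            addresses.append(address)
    return addresses
```

Calling `ipv4_addresses("ssh.ocf.berkeley.edu")` should return the address given above, as long as the DNS record has not changed since this post was written.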

Some other services such as MySQL may also experience issues over IPv6. If neither IPv6 nor IPv4 is working for you, please let us know.

Thank you for being patient while we restore full connectivity.

UPDATE 2018-10-09 1:45AM: IPv6 connectivity should be restored to all our user-facing services.

Wednesday, October 03, 2018

Downtime on October 6

The OCF will experience downtime for scheduled maintenance on October 6th from 9PM to 12AM. Hosted websites will be briefly unavailable as we reboot the servers to apply critical security updates. Once our servers are rebooted, users on our public login server and our apphosting server will not be able to write files while we move NFS to a new host. This read-only period will last no longer than 15 minutes, and everything should be back to normal by midnight at the latest.

Thanks for flying with the OCF!

8:53 Update: We've powered off the servers to do networking hardware and kernel updates.
11:03 Update: We've decided not to do the NFS migration tonight, but networking updates have been performed and services should be back up in the next hour.
11:42 Update: Most of our public servers should be back now, apart from our public mirrors and our own website (vhosts should be fine).
12:17 Update: Our website is back, but our software mirrors are not back yet.
12:51 Update: Everything should be back and working now except our HPC control server, which we are still debugging.
1:35 Update: This is the all clear, everything should be working now! Feel free to let us know by emailing help@ocf.berkeley.edu if you notice anything wrong.

Friday, June 01, 2018

Small amount of downtime today (6/1)

We had a short period of downtime affecting a few of our services between 10:10 PM and 11:19 PM today (6/1/2018), caused by networking issues on one of our hypervisors:

  • OCF website: Down from 10:24 PM to 10:47 PM, intermittently up and down until 11:19 PM.
  • phpMyAdmin: Down from 10:10 PM to 10:48 PM, intermittently up and down until 11:19 PM.
  • OCF mirrors (rsync/ftp/http/https): Down from 10:28 PM to 11:09 PM.

We have fully recovered all affected services and put a couple of protections in place to keep this kind of issue from happening again.

Monday, March 19, 2018

Short downtime tonight (3/19) for security updates

There will be a brief period of downtime tonight (3/19/2018), around 8:15 PM. It will likely last about 30-45 minutes as we restart our servers to apply kernel updates to them.

Edit (8:20 PM): We have started. See you on the flip side!
Edit (8:40 PM): We have rebooted a majority of our public services, and they should be coming up soon. Public mirrors are still to come.
Edit (8:50 PM): Our MySQL database did not start correctly, so we have started that now. We are also in the process of rebooting our public software mirrors host.
Edit (9:00 PM): Restarting our servers is fully complete, thanks for your patience!

Friday, February 09, 2018

NFS downtime on 2018-02-09

NFS (containing home directories, cron jobs, and public_html web directories) was down for a short period of time on 2018-02-09 from 4:41 PM to around 5:06 PM. In this time, files stored in user home directories and all hosted websites (including https://www.ocf.berkeley.edu) were unavailable. Logins were also affected, both in the OCF lab and from outside the lab to our SSH server. Thank you for your patience while we brought everything back online.

Please contact us if you have any questions or concerns!