Thursday, May 26, 2016

Power outage, services down

At 9:35am, a power outage in MLK caused our servers to go down. We are working to bring most services back up by using power from a different circuit. We are in contact with facilities to restore power.

Update 1:03pm: All services should be restored.

Sunday, May 15, 2016

Home directories read-only Sunday for a few minutes

We are in the process of upgrading our NFS storage. To do this, we'll need to make home directories read-only for a few minutes on Sunday afternoon, then once again late Sunday evening.

We expect read-only mode to last only about 15 minutes, but there may be an additional 5-10 minutes of full downtime while we transition into it.

Most websites should continue to work during read-only mode. SSH will continue to work, but with limited capabilities.

Friday, April 15, 2016

Downtime tonight for hardware upgrades

There will be up to 45 minutes of downtime tonight for hardware upgrades. We don't expect to use the full window, but are reserving it in case of problems. It will begin sometime around 10pm.

Sorry for the inconvenience -- this should make future operations much more stable.

Tuesday, March 29, 2016

Misleading Daily Cal article incorrectly suggests OCF connection to printer abuse

We were disappointed to find that the Daily Cal has published a misleading article about recent printer "hacking" at UC Berkeley.

The online edition of the article included (and still includes) a photo of the OCF lab taken yesterday. The print version is even more misleading:


The OCF was not involved in this attack in any way. Our printers are not exposed directly to the internet, and our volunteer staff take security very seriously. We were quite shocked to see ourselves featured in the Daily Cal's article.

It's really unfortunate that a lot of people will see this image and think that the OCF was compromised. Our volunteer staff have put a lot of work into the OCF, and we don't like to see it tarnished this way.

We appreciate that the article itself did not reference the OCF. And we'd greatly appreciate help from the Daily Cal in replacing the misleading photo and issuing a clarification.

Update March 30: The Daily Cal has updated the caption of the photo, but refuses to correct their mistake by replacing the photo, citing "a pretty strict policy on retraction".

While we're glad they've recognized the mistake and added the clarifying caption, we still believe they should act with integrity and fully correct it. It is not responsible journalism to place the photo of a widely recognized but completely unrelated organization above an article about poor security practices.

Many people don't read Daily Cal articles in full, but instead see snippets posted on Facebook or elsewhere. In these snippets, a picture of the OCF is still prominently featured, and the work of our volunteer staff is still being devalued.

This could have been prevented if the staff of the Daily Cal had taken a few minutes to shoot us a quick email prior to publishing that article. A lot of damage has already been done. The Daily Cal should take responsibility for their error and correct it.

Update April 1: The Daily Cal has now replaced the photo with one of a printer which was actually affected by the attack. We thank them for correcting the mistake.

Short webserver downtime 3/29 for maintenance

We'll be performing some short maintenance on our webserver tonight (3/29) to install instrumentation that will let us diagnose the kernel panics we've recently been experiencing on it.

Saturday, March 19, 2016

MySQL read-only Saturday 3/19

As part of our work to transition from Percona to MariaDB for our MySQL server, we'll be migrating user data tonight around 9pm.

To do this, we'll put the existing Percona server into read-only mode, then make a final import to the new MariaDB host. We believe this will take about an hour and don't anticipate any issues (we've already tested imports from our regular backups without problems).

Read-only mode is necessary during the import to ensure we get a consistent backup, and so that writes made during the transition are not lost.
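
For the curious, here's a rough sketch of what this flow looks like mechanically, wrapping the standard mysql/mysqldump clients from Python. The hostnames and exact dump flags are illustrative assumptions, not our actual migration scripts; credentials are assumed to come from a ~/.my.cnf.

```python
#!/usr/bin/env python3
"""Rough sketch of the read-only-then-import flow described above.
Hostnames and dump flags are illustrative, not the OCF's actual tooling."""
import subprocess

OLD_HOST = "percona.example.com"   # hypothetical: existing Percona server
NEW_HOST = "mariadb.example.com"   # hypothetical: replacement MariaDB server


def sql(host, statement):
    """Run one SQL statement on a host using the stock mysql client."""
    subprocess.run(["mysql", "-h", host, "-e", statement], check=True)


# 1. Stop accepting writes on the old server, so the dump is consistent and
#    nothing written mid-transition is lost. (Accounts with SUPER privilege
#    can still write unless handled separately.)
sql(OLD_HOST, "SET GLOBAL read_only = ON;")

# 2. Dump everything and pipe it straight into the new server.
dump = subprocess.Popen(
    ["mysqldump", "-h", OLD_HOST, "--all-databases",
     "--single-transaction", "--routines", "--triggers"],
    stdout=subprocess.PIPE,
)
subprocess.run(["mysql", "-h", NEW_HOST], stdin=dump.stdout, check=True)
dump.stdout.close()
if dump.wait() != 0:
    raise RuntimeError("mysqldump failed")

# 3. After clients are pointed at the new host, allow writes there.
sql(NEW_HOST, "SET GLOBAL read_only = OFF;")
```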

Some sites may experience downtime while the server is in read-only mode (if they require writing to the database to show pages). Most sites will experience some level of degradation (e.g. can't log in to admin or edit posts).

Update 10:01pm: We have entered read-only mode.

Update 10:10pm: The backup is complete and is being imported into MariaDB now.

Update 10:14pm: ETA 35 minutes.

Update 10:35pm: The import was interrupted when the new server ran out of memory. We're increasing the server's memory and reducing mysqld's memory usage, then starting the import again. Still in read-only mode.

Update 10:45pm: ETA 38 minutes.

Update 11:11pm: The import has finished; we're now swapping out Percona for MariaDB (which will involve about 2 minutes of downtime).

Update 11:16pm: We've noticed some issues with the import (views were not correctly copied) so we'll need to re-do the import. Still in read-only mode, expect another hour or two in this state. Sorry for the trouble!

Update 11:44pm: The view problem is fixed, so we're proceeding to move MariaDB into production. Expect about 2-3 minutes of downtime now.

Update 11:55pm: All work is completed and we are now on MariaDB. Total downtime was about 3 minutes, with read-only mode lasting about two hours.

Thursday, March 10, 2016

MySQL and printing unavailable for 34 minutes (resolved)

MySQL and printing were unavailable today for about 34 minutes due to an unscheduled outage.

Why it was down

For background, all of the OCF's production infrastructure is supposed to live on two physical servers: jaws and pandemic. There's a third legacy physical server named hal which hosts some testing machines and our backups.

On Tuesday, a problem removing a backup logical volume (believed to be a kernel bug) led to a deadlock, leaving many processes stuck in uninterruptible sleep. To try to fix the issue, a staff member gave 15 minutes of warning today (Thursday) before restarting hal. Since hal isn't supposed to host important services, a restart should be totally safe, and such a short warning period is considered acceptable because normally the only people who even notice are other staff.
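
(For context, a process in uninterruptible sleep is stuck inside the kernel, usually waiting on I/O, and can't be killed; often the only way out is a reboot.) Here's a tiny Python scan for such processes, purely to illustrate the symptom; the names and approach are just for this example:

```python
#!/usr/bin/env python3
"""Illustrative only: list processes stuck in uninterruptible sleep
(state 'D'), the symptom described above."""
import os

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/status") as f:
            fields = dict(
                line.split(":\t", 1) for line in f if ":\t" in line
            )
    except FileNotFoundError:
        continue  # the process exited while we were scanning
    # A state of 'D (disk sleep)' means the process is blocked inside the
    # kernel (e.g. on a stuck LVM/I/O operation) and cannot be killed.
    if fields.get("State", "").startswith("D"):
        print(pid, fields.get("Name", "").strip())
```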

Unfortunately, two virtual machines, pollution (the print server) and maelstrom (the MySQL server), were running on hal due to some temporary migrations. They should have been moved back to jaws about a week ago, but weren't.

When hal went down, it took these production services with it, killing MySQL (and with it many user websites, the OCF's own website, Request Tracker, ...) and printing in the lab. The problem was noticed as soon as monitoring alerted, and when hal didn't come back up, the staff member phoned another staffer who was in the lab.

Due to a misconfiguration, hal entered maintenance mode, and the other staffer had to enter the root password and fix the filesystem configuration before hal would boot. As soon as hal booted, MySQL and printing were started and service was restored.

Timeline

  • 6:35pm "15 minutes until hal restart" email goes out from staffer at home
  • 6:50pm hal is restarted remotely
  • 6:52pm staffer realizes hal had production VMs and isn't coming back online; phones another staffer in the lab
  • 7:04pm staffer in lab fixes boot config, hal is restarted; remote staffer leaves home for the OCF
  • 7:09pm hal is back on and services are available
  • 7:15pm original staffer arrives in lab to find everything already fixed

Downtime Friday 3/11 for security updates

There will be about 20 minutes of downtime Friday night to apply security updates.

Sunday, February 28, 2016

mirrors.ocf.berkeley.edu read-only for about 20 hours

We're moving mirrors.ocf.berkeley.edu, our free-and-open-source software mirror, from its current hardware (a recycled desktop with some extra hard drives) onto a new server (with server-grade hard drives, RAID, etc.).

To do this with minimal downtime, we're going to copy the disk from our current mirror to the new server. To ensure consistency, we first need to make it read-only. We expect the copy to take about 8 hours, after which we'll make the replacement server the main mirror. At that point, the mirrored content will be about 8 hours stale, but it will quickly catch back up once the sync cronjobs start running again.
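
For the curious, the plan is essentially "freeze, bulk-copy, then let the sync jobs catch up". Here's a rough sketch; the mount point and destination hostname are made up for illustration, not our exact commands:

```python
#!/usr/bin/env python3
"""Sketch of the freeze / bulk-copy / catch-up plan described above.
The mount point and destination host are assumptions for illustration."""
import subprocess

MIRROR_PATH = "/opt/mirrors"        # hypothetical data mount
NEW_HOST = "mirrors-new.example"    # hypothetical replacement server


def run(*cmd):
    subprocess.run(cmd, check=True)


# 1. Freeze the data so the long copy sees a consistent tree.
run("mount", "-o", "remount,ro", MIRROR_PATH)

# 2. Bulk-copy to the new server (preserving hard links); this is the
#    ~8 hour step.
run("rsync", "-aH", "--delete",
    MIRROR_PATH + "/", f"{NEW_HOST}:{MIRROR_PATH}/")

# 3. Once traffic points at the new server, its regular sync cronjobs
#    re-run against the upstream mirrors and close the ~8 hour gap.
```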

Update 8:30pm 2/28: This is starting now.

Update 5:33pm 2/29: Maintenance is complete.