Saturday, March 10, 2007

State of the OCF Address

Brief History, Current State, and Future Plans

Executive Summary: Downtime is coming in the near future to migrate the disk array. Check back for exact dates and times.

Some users have voiced concerns about being kept out of the loop regarding the state and future plans of the OCF. This post should bring everybody up to speed on what has been happening and what we have planned.

Many of our problems seem to be caused by the experimental hardware in the server holding our backup disk array. Now that the primary array is ready to be brought back into service, we expect that moving back to it will fix many of these problems (mainly web server uptime and printing). After much consideration, we determined it would be best to connect the disk array directly to the machine that runs the web server (famine), in part because serving files locally would eliminate a large amount of network traffic.

Before migrating data to the primary disk array, we decided it was best to make sure famine was completely up to date. It had been running an older version of Solaris 10, and the latest update contained a lot of bug fixes and security fixes. famine (through the use of virtual servers) runs our web server, print server, and database server, so the update itself required some downtime. By doing the upgrade before the migration, users could still access our other servers and their data during that downtime.

While it is possible to update the version that was already installed, we found that it would be better to do a clean install of the latest version. According to our research, the normal upgrade path often results in unexpected errors with the virtual servers (called Containers or Zones). The general recommendation is to back up the Containers and perform a clean install. The process boiled down to five major steps (a rough sketch of the first step follows the list):
1. back up zone data and configuration files (for the 8 installed zones)
2. install Solaris 10 Update 3 and set up RAID devices
3. install and configure new zones
4. manually merge data from zone backups into new zones
5. restore services
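
For those curious about the mechanics, here is a rough sketch of what step 1 can look like on Solaris 10. The zone name ("www") and the backup paths are made up for illustration; the actual zones and paths on famine differ.

    # List the configured zones so we know what to back up
    zoneadm list -cv

    # For each zone (shown here for a hypothetical zone named "www"):
    # export its configuration so the zone can be recreated later...
    zonecfg -z www export > /backup/zones/www.cfg

    # ...then halt the zone and archive its filesystem
    zoneadm -z www halt
    cd /zones/www && tar cf /backup/zones/www.tar .

With the configuration files and archives saved somewhere safe, the clean install can wipe the disks without losing anything.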

This process takes a long time to complete, so we scheduled the upgrade for a weekend (3/2-3/4). Unfortunately, we did not give enough advance warning about this downtime. I apologize for that and for any inconvenience it may have caused. We will work on getting downtime warnings out earlier.

While we had originally planned to migrate the disk array data at the end of this upgrade process, we ran into some problems. During the upgrade, we started getting hardware fault errors concerning one of the CPUs and memory modules. At that point, we decided it would be best to hold off on the disk array migration until after the hardware issues had been resolved, so we finished the upgrade and restored services.

After many days of troubleshooting and several on-site Sun service calls, we determined the problems were caused by a faulty memory controller on the mainboard. The mainboard has been replaced, and it looks like we are ready to move ahead with the disk array migration. Since most of the setup has already been done, this is not too difficult a task, but it will require another extended downtime. The process boils down to:
1. make a recent sync of all data
2. kick all users off and shut down the web server
3. make final update of any changed data
4. reconfigure servers
5. bring everything back up

The final sync of user data must be performed when nobody can change the data, so we don't end up with inconsistencies between the copies. This will require all users to be logged off and the web server shut down (i.e. downtime) sometime in the near future. The whole process will probably take around 10 hours, depending on how much data changes between rsyncs (see the sketch below). After the data is synced and the servers are reconfigured, we can bring everything back up, and everybody should be happy.
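
To give a rough idea of how the sync works, here is a sketch using rsync; the array mount points are hypothetical.

    # First pass: copy the bulk of the data while services are still up.
    # -a preserves permissions, ownership, and timestamps; -H preserves hard links.
    rsync -aH /backup-array/ /primary-array/

    # ...downtime begins: log all users off and shut down the web server...

    # Final pass: pick up anything that changed since the first pass.
    # --delete removes files from the destination that no longer exist
    # on the source, so the two copies end up identical.
    rsync -aH --delete /backup-array/ /primary-array/

Because the first pass does most of the copying while everything stays online, the downtime itself only has to cover the final catch-up pass.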

Please post comments or email staff@ocf.berkeley.edu if you have any questions or concerns.