[DONE] Major upgrade of Colossus 2024-08-05 07:00 to 2024-08-12

The operating system (OS) on Colossus will be upgraded starting Monday August 5th. This is a major upgrade, and will take one week. During the upgrade, Colossus will be unavailable.

Files in home directories and project areas will be accessible during the upgrade.

The main reason for the upgrade is that the current OS is very old and will soon reach end of life.

During the downtime, we will reorganize the cluster, upgrade the networking and upgrade the OS from CentOS7 to Rocky9. Rocky, like CentOS, is a RedHat clone.

The software stack (available via "module load") will be reinstalled. We will install toolchains 2021a and newer. This means that some older software versions will no longer be available after the upgrade. If you use software older than this, please start using newer versions, if possible. If something is missing, please submit the software request form.

Before or during the downtime, new RedHat9 submit hosts will be made available to replace the existing RedHat7 submit hosts, so you can start testing and installing software. The new submit hosts are named pXX-hpc-01 (-02 etc, if your project has multiple submit hosts). Any software you have installed yourself on the existing RedHat7 submit hosts (e.g. R packages) will most likely have to be recompiled on the new RedHat9 submit hosts. After the upgrade the RedHat7 submit hosts will be removed.
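
Since old and new submit hosts exist side by side for a while, it can help to confirm which OS generation you are logged in to before rebuilding anything. The following is only a minimal sketch and not an official tool: it assumes the host provides the standard /etc/os-release file, and the function names parse_os_release and major_version are made up for this illustration.

```python
def parse_os_release(text):
    """Parse os-release style KEY=value lines into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"').strip("'")
    return info


def major_version(info):
    """Return the major OS version as an int, or None if it is unknown."""
    head = info.get("VERSION_ID", "").split(".")[0]
    return int(head) if head.isdigit() else None


if __name__ == "__main__":
    # Assumption: /etc/os-release exists on the host you are logged in to.
    try:
        with open("/etc/os-release") as fh:
            info = parse_os_release(fh.read())
    except OSError:
        info = {}
    major = major_version(info)
    print(info.get("NAME", "unknown"), info.get("VERSION_ID", "?"))
    if major is not None and major >= 9:
        print("Version 9 or newer: software built here should match the upgraded cluster.")
    else:
        print("Older or unknown version: rebuild your software on a new submit host.")
```

Running the script on a submit host prints the OS name and version, followed by a hint about whether builds made there are likely to match the upgraded cluster.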

When the upgrade is done, Colossus will still run the same "flavour" of Linux (RedHat) as now, so you should be able to work mostly as you do today. However, a few commands may have changed or work differently, so you will probably have to make some changes.

During the downtime, please follow this operational log for updates.

Update 2024-05-10: Downtime postponed from 2024-05-27 to 2024-06-10.

Update 2024-06-04: Downtime postponed from 2024-06-10 to 2024-08-05.

Update 2024-08-09: The downtime will be extended over the weekend, and we plan to open up again on Monday, hopefully early in the day.

Update 2024-08-12: The first 30+ compute nodes and dragen-3 are back in production. The gpu nodes, hugemem nodes, dragen-3 and remaining compute nodes will be added to Slurm as they become available. To start submitting jobs or reinstalling software packages, use the new RHEL9 submit hosts, called pXX-hpc-01 (-02..-0N, if your project has multiple submit hosts).

Update 2024-08-13: gpu-1 (lcbc reservation) and gpu-[2-3] (tsd reservation) have been added.

Update 2024-08-21: gpu-[4-6], all hugemem nodes and another 20 compute nodes have been put in production. The old RHEL7 submit hosts will be powered off Wednesday 2024-08-28.

Published Apr. 30, 2024 4:14 PM - Last modified Sep. 24, 2024 2:25 PM