TSD Operational Log - Page 14
EDIT: Something was wrong with the new packages (RPMs), so we rolled back the upgrade, and are now back in production with the old version. We will fix the RPMs and try again later.
We will upgrade the queue system on Colossus today, at around 13:00. This will make the queue system commands (squeue, sbatch, etc.) unavailable for a while. We estimate 10-15 minutes. In the meantime, running jobs will continue as normal.
We do not expect any user-visible changes after the upgrade.
TSD added more capacity to its VM cluster on Monday. A misconfiguration during that process resulted in some ThinLinc VMs mounting the filesystem as read-only. This affects login, among other things. To fix this we will have to restart the affected VMs. Affected projects are not able to log in to these VMs at the moment, so we will go ahead with the reboots.
Some of our TSD users are having problems logging on via ThinLinc. The engineering team is actively working to correct the issue.
TSD@USIT
Yesterday at around 15:15, about half of the Colossus nodes were reinstalled. Unfortunately, one Slurm plugin was out of sync, which made jobs fail to start properly on the nodes. This resulted in about 40 jobs exiting with an empty slurm-NNN.out file before the nodes were automatically taken out of production. The problem was discovered and fixed within an hour, but the failed jobs must be resubmitted.
We apologize for the inconvenience.
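To see which of your jobs may have been hit, you can look for empty slurm-NNN.out files in the directory you submitted from. The following is a minimal sketch, not an official TSD tool: the function name `empty_slurm_outputs` is hypothetical, and it assumes your jobs used the default output file name (jobs with a custom output name, or array jobs, will not be matched). Note that an empty output file can also belong to a job that simply printed nothing, so please check before resubmitting.

```python
import re
from pathlib import Path

# Default Slurm output name for a non-array job: slurm-<jobid>.out
SLURM_OUT = re.compile(r"^slurm-(\d+)\.out$")


def empty_slurm_outputs(directory):
    """Return the sorted job IDs whose slurm-NNN.out file in `directory` is empty."""
    job_ids = []
    for path in Path(directory).iterdir():
        match = SLURM_OUT.match(path.name)
        # Only regular files with zero size count as candidates.
        if match and path.is_file() and path.stat().st_size == 0:
            job_ids.append(int(match.group(1)))
    return sorted(job_ids)


if __name__ == "__main__":
    # List candidate job IDs found in the current directory.
    for job_id in empty_slurm_outputs("."):
        print(job_id)
```

The script only lists candidates and does not resubmit anything.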
We experienced a file system issue with the mounting of /cluster/project on Friday evening, and as a result Linux VMs had problems accessing this area. Jobs submitted to Colossus may have been affected as well.
08-10 - The original issue is solved, but if you still cannot access /cluster/project, please log out (all users) and send us a mail requesting a reboot of the VM.
We are ready to deploy the new automatic failover mechanism, which allows the entire TSD infrastructure to shift to the second gateway/router if the primary is down, without users noticing it. This new feature will increase the stability and significantly improve the user experience of TSD. The final testing of the production setting will be done on 25/09 between 12:00 and 15:00. Most likely you will not experience any disruptions during the maintenance window. If you do notice any malfunction, please mail us (tsd-drift@usit.uio.no).
Francesca@TSD
The File Lock service is down and our engineering team is working to resolve this issue.
TSD is inaccessible for the moment. The engineering team is actively working to correct the issue.
TSD@USIT
Our engineering team is performing minor infrastructure upgrades. We anticipate that these upgrades will not disrupt our services.
Due to a jumphost failure the TSD services are disrupted. The engineering team is actively working to correct the issue.
Maintenance stop of the Colossus and ThinLinc infrastructures on 6/09-2017 from 12:00 CEST to 14:30 CEST. During the downtime, logon to the Linux VMs will not be possible.
14:41 - We are experiencing an issue with submitting jobs to Colossus and are working on a fix.
15:44: FIXED: The issue was caused by a system configuration problem that led to incomplete clean-up in the Slurm config.
All failed jobs must be resubmitted manually. If you have any doubt about the status of a job, please contact us with the jobid.
Sorry for the inconvenience.
We are currently experiencing a mounting issue with /cluster/project, and are working to solve the problem as soon as possible.
We apologize for the inconvenience.
TSD is inaccessible via ThinLinc. This should not affect login through VMware Horizon.
We are doing our best to resolve this issue as soon as possible.
The TSD infrastructure is not accessible at the moment. We are working to understand the cause and solve the problem as soon as possible.
We apologize for the inconvenience.
Francesca@TSD
Dear TSD-users,
The File Lock service is down and we are working to resolve this issue.
We apologize for any inconvenience this may cause you.
Best regards,
TSD-Team
Dear TSD-Linux users,
There will be a downtime of the TSD-Linux infrastructure, during which we will reboot our ThinLinc servers to upgrade their kernel.
We apologize for any inconvenience this may cause you.
Best regards,
TSD-Team
We are facing an issue with the mount and are working on a fix.
The upgrade and reboot are taking longer than expected. We are now rebooting the last machines, and expect to be finished by this evening (around 18:00).
Please check the operational log later today.
-------------------
Due to a security vulnerability discovered in the Red Hat Linux kernel, the Linux machines will be rebooted on Thursday 29/06 at 14:00 CET (one hour). All processes running on the machines will die, so we strongly recommend stopping all programs/processes running locally on the machines before the maintenance.
We apologise for the inconvenience this might cause you.
We have finished the maintenance of the Colossus cluster. The outcome of this outage is that the HugeMem node will be much cheaper! Please read the post in the "News".
We apologize for the inconvenience.
Francesca@TSD
UPDATE: Databases back up, and upgrade postponed. Our apologies for the inconvenience.
The new downtime windows are as follows.
Tuesday, 20th of June, 08:00 - 14:00
p11, p22, p23, p33, p38 and p40
Wednesday, 21st of June, 08:00 - 14:00
p47, p58, p76, p96, p158, p175 and p189
Thursday, 22nd of June, 08:00 - 09:30
p32, p225 and p244
We will update this message to keep you posted on the status of the upgrades.
Best regards,
TSD-team
We are doing some maintenance work on Colossus on Monday 12/06 from 08:00 CET for the entire day. Jobs that are scheduled to run during the downtime or shortly before it will be queued and will start after the maintenance.
We apologize for the inconvenience this might cause you.
Regards,
Francesca@TSD
Dear TSD users,
Regretfully, the planned downtime is prolonged until further notice. Our experts are working on the case, and the TSD-services will be available fairly soon.
We apologize for any inconvenience this may cause.
Best Regards,
TSD-Team
Dear TSD users,
We are currently in maintenance mode and the availability of the services is disrupted.
We apologize for any inconvenience this may cause.
Best Regards,
TSD-Team
Dear TSD users,
We are getting into the critical part of the configuration/deployment of the new gateway machines. We need to put them in production and therefore we need a downtime of TSD. The downtime will be on 08/06 at 12:00 CET and will last for 3 hours. During the downtime, login to TSD will not be possible. Jobs on Colossus will keep running, but the Linux VMs will most likely need a reboot at the end of the maintenance.
We apologise for this outage at such short notice, but we are doing it for the good of the entire infrastructure.