TSD Operational Log - Page 14

[SOLVED] Login problems for Linux-hosts in TSD.

Published Feb. 12, 2018 1:02 PM

Dear TSD users,

Unfortunately, we had a network outage last night, and this is causing some issues with login and mounting of file systems in TSD at the moment.

We are hard at work resolving the issue as quickly as possible.
Our apologies for the inconvenience.

--
Best regards,
TSD

[SOLVED] File system crash on Colossus.

Published Feb. 7, 2018 1:25 PM

The file system on Colossus crashed at around 13:10 today. We are currently working on solving the problem as quickly as possible.

UPDATE:

The file system went down because a fuse blew. We have investigated and rearranged the cabling to make sure this incident does not repeat. Our apologies for the inconvenience.

[SOLVED] Problem with jobs on Colossus

Published Feb. 2, 2018 12:14 PM

Currently, jobs on Colossus have problems starting. They seem to hang in the "CONFIGURING" (CF) state, but eventually start after some minutes. We are investigating the problem and will come back with more information when we know more.

We are sorry for the inconvenience.

Update 14:28: The problem is fixed, and the jobs are starting as normal. Thank you for your patience.

[SOLVED] Dragen down due to maintenance, Monday, January 29

Published Jan. 29, 2018 9:17 AM

Dragen is down due to security patching.

[Completed] TSD Maintenance window: 2018-01-05 09:00 - 2018-01-08 17:00

Published Jan. 5, 2018 9:40 AM

Dear TSD user

Due to a serious security vulnerability in modern processors, which in practice affects all operating systems, and therefore almost all IT services, we need to perform maintenance on the entire TSD infrastructure. This means that all TSD services will be affected. Please do not start any critical work during this period.

We will try our best to complete the work tomorrow between 09:00 and 17:00 but due to the scope of work and the short planning horizon it is possible that parts of TSD will remain under maintenance until Monday 8th of January 17:00.

If you want more information please consult UiO security [1]. This has also been widely covered in Norwegian press [2,3]. Those who want more detail about the vulnerability can refer to more technical explanations [4,5].

Regards

Leon du Toit

[1]...

TSD: Unplanned reboot of some machines

Published Nov. 19, 2017 6:00 PM

On Saturday morning (18/11-2017, around 11:00), one of the virtualization clusters in TSD crashed. The failover mechanism automatically moved all the machines to the other clusters, but the machines were rebooted in the process.

Together with the vendor, we will investigate the causes of this failure.

In the meantime we apologize for the inconvenience.

Francesca@TSD

[SOLVED] /cluster filesystem crash

Published Nov. 15, 2017 4:40 PM

One of the /cluster filesystem IO nodes went down at 15:00, leaving parts of the /cluster filesystem unavailable. It was restarted at 16:00, and we are currently checking the Linux VMs for NFS hangs.

EDIT: Our tests didn't indicate any NFS-related hangs on the Linux VMs.

[SOLVED] Slurm upgrade on Colossus

Published Oct. 30, 2017 12:46 PM

We will retry the upgrade of the queue system on Colossus today, starting at around 12:45. The queue system commands (squeue, sbatch, etc.) will be unavailable for a while; we estimate about 15 minutes. In the meantime, running jobs will continue as normal.

We do not expect any user-visible changes after the upgrade.

Update: The upgrade has been done now, and seems to have gone well.

[SOLVED] Upgrade of queue system on Colossus

Published Oct. 26, 2017 12:43 PM

EDIT: Something was wrong with the new packages (RPMs), so we rolled back the upgrade, and are now back in production with the old version. We will fix the RPMs and try again later.

We will upgrade the queue system on Colossus today, at around 13:00. The queue system commands (squeue, sbatch, etc.) will be unavailable for a while; we estimate 10-15 minutes. In the meantime, running jobs will continue as normal.

We do not expect any user-visible changes after the upgrade.

[SOLVED] Thinlinc VM maintenance: reboot required

Published Oct. 18, 2017 12:43 PM

TSD added more capacity to its VM cluster on Monday. A misconfiguration during that process resulted in some thinlinc VMs mounting the filesystem as read-only. This affects login, among other things. In order to fix this we will have to restart the affected VMs. Projects that are affected will not be able to log in to these VMs at the moment, so we will go ahead with the reboots.

[SOLVED] Problems with ThinLinc

Published Oct. 16, 2017 9:01 AM

Some of our TSD users are having problems logging on via ThinLinc. The engineering team is actively working to correct the issue.

TSD@USIT

[SOLVED] Job error on Colossus nodes

Published Oct. 10, 2017 12:46 PM

Yesterday at around 15:15, about half of the Colossus nodes were reinstalled. Unfortunately, one slurm plugin was out of sync, which made jobs fail to start properly on the nodes. This resulted in about 40 jobs exiting with an empty slurm-NNN.out file before the nodes were automatically taken out of production. The problem was discovered and fixed within an hour, but the failed jobs must be resubmitted.

We apologize for the inconvenience.
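
For users who need to identify which of their jobs were affected, the following is a minimal sketch, not an official TSD tool. It assumes the jobs wrote to Slurm's default output file name (slurm-<jobid>.out) in a single directory; the function name empty_slurm_outputs is hypothetical. Jobs with a custom output file name, and array jobs (which use a different default name), would not be found this way.

```python
import re
from pathlib import Path

# Slurm's default output file name for a regular job is slurm-<jobid>.out
SLURM_OUT = re.compile(r"^slurm-(\d+)\.out$")


def empty_slurm_outputs(directory):
    """Return the job IDs (as strings) whose slurm-<jobid>.out file is empty.

    Hypothetical helper: it only inspects files directly inside `directory`.
    """
    job_ids = []
    for path in sorted(Path(directory).iterdir()):
        match = SLURM_OUT.match(path.name)
        # An empty output file suggests the job exited before producing output
        if match and path.is_file() and path.stat().st_size == 0:
            job_ids.append(match.group(1))
    return job_ids


if __name__ == "__main__":
    # Print one candidate job ID per line for the current directory
    for job_id in empty_slurm_outputs("."):
        print(job_id)
```

An empty output file can also come from a job that simply printed nothing, so treat the list as candidates and check each job before resubmitting.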

[SOLVED] Issue with /cluster/project -- affects Linux VMs

Published Oct. 8, 2017 11:32 AM

We experienced file system issues with the mounting of /cluster/project on Friday evening, and as a result Linux VMs had problems accessing this area. Jobs submitted to Colossus may have been affected as well.

08-10 - The original issue is solved, but if you still cannot access /cluster/project, please log out (all users) and send us a mail requesting a reboot of the VM.

[SOLVED] TSD Maintenance: Tues. 26/09 between 12:00 and 15:00 CEST

Published Sep. 26, 2017 10:31 AM

We are ready to deploy the new automatic failover mechanism, which allows the entire TSD infrastructure to shift to the second gateway/router if the primary is down, without users noticing. This new feature will increase stability and significantly improve the user experience of TSD. The final testing of the production setting will be done on 26/09 between 12:00 and 15:00. Most likely you will not experience any disruptions during the maintenance window, but if you notice any malfunction, please mail us (tsd-drift@usit.uio.no).

Francesca@TSD

[SOLVED] The File Lock service is down

Published Sep. 22, 2017 10:06 AM

The File Lock service is down and our engineering team is working to resolve this issue.

[SOLVED] Problems with login to TSD

Published Sep. 21, 2017 2:13 PM

TSD is inaccessible for the moment. The engineering team is actively working to correct the issue.

TSD@USIT

[SOLVED] TSD Maintenance work during week 37

Published Sep. 12, 2017 11:16 AM

Our engineering team is performing minor infrastructure upgrades. We anticipate that these upgrades will not disrupt our services.

[Solved] TSD not accessible

Published Sep. 5, 2017 9:02 AM

Due to a jumphost failure the TSD services are disrupted. The engineering team is actively working to correct the issue.

[SOLVED] TSD: Downtime of Colossus and Thinlinc on the 6/09-2017 extended to 14:30 CEST

Published Aug. 25, 2017 3:05 PM

There will be a maintenance stop of the Colossus and Thinlinc infrastructures on 6/09-2017 from 12:00 CEST to 14:30 CEST. During the downtime, login to the Linux VMs will not be possible.

[Solved] Problem running jobs

Published Aug. 15, 2017 2:41 PM

14:41 - We are experiencing an issue with submitting jobs to Colossus and are working on a fix.

15:44 - FIXED: The issue was caused by a system configuration problem that led to incomplete clean-up in the slurm config.

All failed jobs must be re-submitted manually. If you have any doubt about the status of a job, please contact us with the jobid.

Sorry for the inconvenience.
