TSD Operational Log - Page 4
Starting at 09:30 on 2023.07.10 we will be upgrading the databases for our core services.
As a result, our services will at times be partially or fully unavailable during the upgrade.
We will update this message as we go along, and notify you when it's done.
--
On behalf of TSD
We are currently experiencing instability in access to storage for multiple projects, affecting all services.
We are still investigating the problems.
-----
Update 09:05:
We have remounted the storage for the affected machines, and they seem to work now.
The instability affected around 30 projects from around 4am this morning.
-----
Update 10:00:
There are still reports of instability, and we will investigate further.
-----
Update 2023-07-07:
The reason for the instability was found and addressed yesterday. All systems should have worked normally since about 11am yesterday.
Maintenance is being performed on our storage systems. We expect minimal issues. Some Linux hosts may need to be rebooted.
Any paths under /cluster (e.g. software and projects) are unavailable. This affects software modules and project areas on Linux submit hosts (and other hosts with a /cluster directory). The cluster directory can still be reached via /tsd/pxx/cluster instead.
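For scripts with hard-coded /cluster paths, the workaround above can be sketched as a small path rewrite. This is only an illustration: the helper name cluster_fallback and the project number p11 are hypothetical, and it assumes the directory layout below /tsd/pxx/cluster mirrors the layout below /cluster.

```python
from pathlib import PurePosixPath


def cluster_fallback(path: str, project: str) -> str:
    """Rewrite a /cluster path to the /tsd/<project>/cluster alternative.

    Assumption: the subtree under /tsd/<project>/cluster has the same
    layout as /cluster. Paths outside /cluster are returned unchanged.
    """
    p = PurePosixPath(path)
    try:
        rel = p.relative_to("/cluster")
    except ValueError:
        # Not under /cluster, so there is nothing to rewrite.
        return path
    return str(PurePosixPath("/tsd") / project / "cluster" / rel)


# Hypothetical project number, used for illustration only.
print(cluster_fallback("/cluster/projects/p11/data", "p11"))
```

Replace p11 with your own project number; paths that do not start with /cluster are left as they are.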
The SCCM group will be upgrading the internal SCCM site database in TSD on Thursday 2023-06-08. Software Center on all Windows VMs in TSD will be unavailable between 12:00 and 16:00.
We are currently experiencing instability in access to storage for multiple projects, affecting all services.
We are still investigating the problems.
-----
Update: The reason for the instability has been identified and resolved.
We are working to fix the issue.
The TSD Identity Management System will undergo brief maintenance today, 24.05.2023, from 17:00 to 18:00.
We're experiencing login issues through VMware on Linux and slow storage over NFS.
The Consent System will be temporarily unavailable for upgrade.
https://data.tsd.usit.no was not accessible between approximately 12:51 and 14:45 today due to an expired security key. This affected both login and data import/export. The issue has since been resolved.
Since around 09:45 we have been experiencing NFS storage issues.
We're working to fix it and hosts will be rebooted in the process.
From 10:00 the NFS servers in TSD will be upgraded.
We expect this to take approximately 1 hour, and it may cause network/NFS interruptions.
We advise you to keep an eye on this operational log for any updates.
The Consent System will be temporarily unavailable for upgrade. The upgrade will last until the end of the day. After the upgrade, only the consents from the last 2 months will be available initially. The older data will appear within the next day (no data will be lost).
The availability of nodes (compute, bigmem, accel, dragen) will be reduced on Colossus over the next several days for maintenance.
A handful of nodes will be taken down for maintenance at a time. We apologize for any inconvenience this may cause.
Like yesterday, we are experiencing storage problems due to our storage provider, IBM.
We are currently experiencing technical difficulties with core services like file import/export, and are working to restore operations. Sorry for the inconvenience caused by this.
On the 11th of April at 15:00, TSD will migrate the shared directory to the new storage system. The shared directory is a read-only export available to all TSD projects ("/tsd/shared" on Linux or "\\tsd-evs\shared" on Windows).
Please note that the migration will not impact regular shared project directories such as pXsharedpY and its variants.
We need to conduct a routine upgrade of TSD's storage system between 10:00 and 11:00 on Wednesday, April 19, 2023.
During the downtime, it will still be possible to log in to VMs, but durable storage (M:\ and N:\ in Windows VMs) will be unavailable. Therefore, we recommend closing all open files and potentially logging out to avoid data loss. Colossus is unaffected and will operate normally throughout the upgrade.
ESS storage is currently unavailable due to a technical problem. There was a firmware bug in our IBM setup; the bug caused a metadata server to crash and a global disk failure. As there was a minor risk of data loss or data corruption, we had to spend about 9 hours yesterday working together with IBM's international crisis team. The system was back up at about 19:00 yesterday (23/3-23), and we believe there was no data loss or corruption.
We sincerely apologize for this unplanned downtime, but third-party firmware bugs are unforeseeable events that we could not easily have prevented.
During the downtime we also fixed the MTU settings, which IBM had misconfigured and which may have been the root cause of previous trouble. We also attached more storage, which will soon be put into production.
Due to a previous misconfiguration by IBM, we will reconfigure our storage on Thursday, March 23, from 12:00. This may cause network storage downtime on Windows and Linux clients. The Colossus compute nodes might be affected, but we consider that very unlikely.
We're experiencing NFS hangs due to third-party maintenance on our storage system.
We'll be rebooting hosts to resolve the issue.
We're experiencing NFS hangs.
We're working to fix it and will have to reboot hosts in the process.
Affected services:
- login to Nettskjema with TSD
- login to TSD virtual machines
- self-service portal
- data portal (and command-line imports)
- publication portal
- consent portal
We are working to fix this.