TSD Operational Log - Page 15
Due to a failure of the main gateway, the TSD is not reachable at the moment.
We are working overtime to solve the problem as soon as possible.
We apologize for the inconvenience.
03/06, 21:46 - Login issue fixed, but there are still issues with some Linux VMs; "module load" and job submission to Colossus may face problems.
04/06, 12:01 - TSD is back to normal and Linux VMs can use "module load".
Affects - Linux VMs, job submission, and listing or any other operation on /cluster.
We are experiencing an issue mounting /cluster. This will lead to problems when submitting jobs to Colossus. The cost command may also not work.
We are working on this (07:40, 02/06); sorry for the inconvenience.
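While we work on the fix, a quick way for users to check from their VM whether /cluster is actually mounted is sketched below (a minimal illustration; the helper name is ours, not a TSD tool):

```python
import os

def is_cluster_mounted(path="/cluster"):
    """Return True if `path` exists and is a mount point.

    On an affected VM the NFS mount is missing, so this reports False
    until the mount is restored.
    """
    return os.path.isdir(path) and os.path.ismount(path)

print(is_cluster_mounted())
```

If this reports False after we announce the fix, please contact us.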
NB: We found a Red Hat rpcbind bug, and our usage of it is what caused the crash. We have downgraded the package, and this problem should now be fully fixed.
10:05, 02/06: Most VMs are fixed; still testing.
Some of our Linux VMs are currently having issues mounting a network drive. As a result of this, logging in to the affected machines will not work.
We're hard at work to solve this as quickly as possible. Our apologies for the inconvenience.
UPDATE: Looks like it is possible to log on to some of the VMs, but these are still having trouble reaching the /cluster area.
--
Best regards,
TSD-team
Due to a jump host failure earlier this morning, some of the Linux VMs inside TSD require a reboot; we are progressively fixing the machines now.
Our apologies for the inconvenience.
Users may currently face problems importing/exporting files to/from TSD. We are doing our best to get this issue resolved ASAP.
TSD@USIT
Due to a jump host failure, services in TSD are not accessible at the moment.
We are investigating the causes in order to solve the problem as soon as possible.
We apologize for the inconvenience.
TSD@USIT
Solved: 11:00.
Logging on to the Linux machines in the TSD infrastructure is not possible at the moment. We are working to solve the situation as soon as possible.
We apologize for the inconvenience.
Due to a jump host failure, TSD was not accessible 09:15 - 14:31 today. The failure also caused NFS hangs on some Linux VMs, due to which we had to reboot those VMs. All ThinLinc VMs should now be accessible. Some other Linux VMs may still have problems. We are working on fixing those ASAP.
TSD@USIT
Due to NFS hangs, some Linux hosts are now inaccessible. The issue may result in TSD being inaccessible via ThinLinc. This should not affect logging in through VMware Horizon.
We are doing our best to get this issue resolved ASAP.
TSD@USIT
---------------------------------------------------
The issue was resolved after the affected hosts were rebooted.
There will be a maintenance stop on 02/05 at 13:00 CET for one hour. During the downtime, the Linux and Windows VMs will not be accessible. The Colossus cluster will be under maintenance.
We are having an issue with the main gateway, and logging in to TSD is not possible at the moment. We are working to solve the problem as soon as possible.
We apologize for the inconvenience.
----
Update 14/04, 12:15 -- TSD partially operational. Still not possible to submit jobs to Colossus
----
Update 18/04, 09:40 -- There might still be issues when submitting jobs to Colossus.
The problem was solved on Friday around 22:00. It might be that some of the Linux VMs are not yet reachable; in that case, please mail us and we will restart the machine.
----
TSD is unreachable at this moment. We are trying to solve the problem as soon as possible. We apologize for the inconvenience.
The TSD gateway was not reachable between 15:43 and 16:15 today; the issue has been resolved and users should be able to log in again.
Some Linux machines might still be unavailable; if you are unable to log in through ThinLinc, please contact us.
The problem was solved the same day around 18:10. It might be that some of the Linux VMs are not yet reachable; in that case, please mail us and we will restart the machine.
----------------
TSD's main gateway is not reachable at the moment and the TSD infrastructure is not accessible. We are trying to solve the situation as soon as possible. More information will appear on the operational log on Monday 27/02 in the morning.
We apologize for the inconvenience.
The problem was solved on Friday around 17:15. It might be that some of the Linux VMs are not yet reachable; in that case, please mail us and we will restart the machine.
------------------
TSD is unreachable at this moment. We are trying to solve the problem as soon as possible. We apologize for the inconvenience.
We have had a failure of the primary jump host, but the failover mechanism took over and moved the system to the secondary one. The infrastructure should be back to normal very soon. We are investigating what caused the failure.
We apologize for the inconvenience.
We are having a network problem at the moment and the services in TSD are not reachable. We are investigating the causes in order to solve the problem as soon as possible.
We apologize for the inconvenience.
Regards
Nihal@TSD
Dear TSD-users!
There is an issue accessing the services on colossus at the moment.
This may result in:
- /cluster/projects/pXX being inaccessible
- Software modules being inaccessible
- Issues submitting to Slurm
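Once we announce a fix, users can roughly verify the Slurm symptom above from a login node with a sketch like the following (the helper name is ours, not a TSD tool):

```python
import shutil
import subprocess

def slurm_reachable(timeout=10):
    """Rough health check: the Slurm client is on PATH and runs cleanly.

    Returns False if squeue is missing or errors out (e.g. while the
    controller or the /cluster storage is still unavailable).
    """
    if shutil.which("squeue") is None:
        return False
    try:
        subprocess.run(["squeue", "--version"],
                       check=True, capture_output=True, timeout=timeout)
        return True
    except (subprocess.SubprocessError, OSError):
        return False

print(slurm_reachable())
```

Checking /cluster/projects/pXX and the software modules directly (e.g. listing the directory, running "module avail") covers the other two symptoms.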
We are working on this issue to get it resolved ASAP.
We apologize for the inconvenience.
Regards,
Nihal @TSD
Dear TSD-users!
There is an issue accessing the colossus storage at the moment. This may result in /cluster/projects/pXX being inaccessible.
We are working on this issue to get it resolved ASAP.
We apologize for the inconvenience.
Regards,
Nihal @TSD
Dear TSD-users!
Due to a disk failure on tsd-fx01.tsd.usit.no, which happened 2017-01-20, 22:59, the file import in TSD was unavailable until 2017-01-23, 10:56.
The issue has been resolved, and the system is back in production.
We apologize for the inconvenience.
Regards,
Benjamin
There is an issue accessing the colossus storage at the moment. This may result in:
- /cluster/projects/pXX being inaccessible
- Software modules being inaccessible
- Issues submitting to Slurm
We are working on this issue to get it resolved ASAP.
We apologize for the inconvenience.
Regards,
Abdulrahman
Due to the network problem we had yesterday morning, the Colossus disk was not properly mounted and exported to the project machines. This results in problems accessing the /cluster/projects partition, mounting modulefiles, and running Slurm. Some of the jobs that were running when the network issue occurred might have been affected by the problem.
We are working on the issue now, hoping to solve it as soon as possible during the day.
We apologize for the inconvenience.
Francesca
Update: The problem is fixed. You can now log in to TSD.
---------------------------------------------------------------------------------------------------
We are having a network problem at the moment and TSD is not reachable. We are investigating the causes in order to solve the problem as soon as possible.
We apologize for the inconvenience.
Regards,
Erik
Due to a failure in the heating system, the Colossus front-end and one of the racks went down this morning (05/12-2016), and the cluster is not available at the moment. We are working to reboot the system.
Jobs that were running on the rack that went down unavoidably died, while those running on the other rack are most likely still running even though the front-end is not available.
We apologize for the inconvenience.
Dear TSD users,
There has been a network problem, which has been safely bypassed by our failover mechanism. However, some of the Linux VMs are still mounting the filesystem and therefore are not accessible at the moment. The process will take 2 hours. If you still experience problems with your Linux VM after 11:00 today, please let us know (tsd-drift@usit.uio.no).
We are investigating the cause of the network problem.
We apologize for the inconvenience.
Regards,
Francesca