Available hardware resources
Name | Status | CPUs / RAM (GiB) | GPU | Shared home area | OS and software | Comments |
---|---|---|---|---|---|---|
ml1.hpc.uio.no, ml2.hpc.uio.no, ml3.hpc.uio.no | Production | 28 cores (Intel Xeon)/128 | 4 × RTX 2080 Ti | Yes | RHEL 8.7 with module system | On ML1 only 3 GPUs functional |
ml4.hpc.uio.no | Production | 32 cores (AMD)/128 | 2 × AMD Vega 10 XL/XT | Yes | RHEL 8.7 with module system | |
ml6.hpc.uio.no | Reserved for a course | 32 cores (AMD)/256 | 8 × RTX 2080 Ti | Yes | RHEL 8.7 with module system | |
ml7.hpc.uio.no | Production | 32 cores (AMD)/256 | 8 × RTX 2080 Ti | Yes | RHEL 8.7 with module system | Will be moved to Fox |
ml9.hpc.uio.no | Production | 2× 48 cores (AMD)/1024 | 4 × NVIDIA GeForce RTX 3090 | Yes | RHEL 8.7 with module system | Reserved for INF5310 course |
How to get access
Apply for access at the following nettskjema.
How to login
The ML nodes are behind a jump host as a security measure. This means that you must be logged in to a UiO computer before you SSH to an ML node. You can achieve this in two ways.
- Log in to a computer inside the UiO network (login.uio.no)
- Log in to the ML nodes from that computer
UIO-USER-NAME is your user name at the University of Oslo:
[MYUSER@laptop ~]$ ssh UIO-USER-NAME@login.uio.no
[UIO-USER-NAME@gothmog ~]$ ssh ml1.hpc.uio.no
You can combine the two steps above into a single command:
ssh -J UIO-USER-NAME@login.uio.no UIO-USER-NAME@ml1.hpc.uio.no
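If you log in often, you can make the jump automatic by adding a ProxyJump entry to your SSH client configuration. This is a sketch of a ~/.ssh/config fragment; the Host aliases are illustrative, and UIO-USER-NAME must be replaced with your own username:

```
# ~/.ssh/config — example fragment (aliases are illustrative)
Host uio-login
    HostName login.uio.no
    User UIO-USER-NAME

Host ml1
    HostName ml1.hpc.uio.no
    User UIO-USER-NAME
    ProxyJump uio-login
```

With this in place, `ssh ml1` performs both hops in one step.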
Login problems
If you cannot log in to the ML nodes, this can have many causes, so a support mail saying only "I can not login" is difficult to act on. Please go through the list below and gather the relevant information first.
- "The authenticity of host '....uio.no (129.240...)' can't be established." This can happen when we change the server key or when you are logging in for the first time. The solution is to get/update the key; for this, refer to the section "Key changed when trying to log in" below. After you verify that you are connecting to the correct machine, type "yes" to accept the new key.
- Wrong username or password. For the ML nodes you should use your UiO username and password. If you get the username-password combination wrong more than three times, your account will be blocked on that machine for one hour.
- Your password is case sensitive.
- Jump host. Make sure that you follow the jump host instructions above.
- Did you type the correct hostname? Please check the names in the table above (Available hardware resources).
- When sending support requests, please include the details below.
  - The exact command you used to log in, including the username and hostname (the ML machine you are trying to log in to). Never include your password.
  - Where you are logging in from: your office, or your laptop at home? Please include the IP address of the machine if you know how to find it (if you do not know what that is, do not worry).
  - If you are logging in from a terminal, please send the full debug output, e.g. from ssh -vvv MY_USERNAME@ml1.hpc.uio.no
Please note that you need to use the jump host when uploading/downloading files as well.
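For example, files can be copied through the jump host like this (UIO-USER-NAME and the file names are placeholders):

```shell
# Copy a local file to your home area on ml1 via the jump host
scp -o ProxyJump=UIO-USER-NAME@login.uio.no mydata.tar.gz UIO-USER-NAME@ml1.hpc.uio.no:~/

# rsync works the same way and can resume interrupted transfers
rsync -av -e "ssh -J UIO-USER-NAME@login.uio.no" results/ UIO-USER-NAME@ml1.hpc.uio.no:~/results/
```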
How to load software
Module system
We use the Lmod module system on all AI hub machines. Please refer to the modules document for details.
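As a quick sketch of everyday Lmod usage (the module names installed differ per machine, so run module avail first; "Python" below is an illustrative name only):

```shell
module avail            # list the software available on this machine
module load Python      # load a module (illustrative name; check avail first)
module list             # show which modules are currently loaded
module purge            # unload all loaded modules
```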
How to use Jupyter
Please see here for using Jupyter with GPU support.
How to install additional python packages
See the document: install-additional-python-packages
How many resources can I use?
Please note that the ML nodes are a shared resource in high demand. You should be considerate of the available resources on each machine (the machines are not all the same). The following commands are useful for finding the limits.
To find the number of processor cores (you should not use more than 1/4 of the value shown):
[root@ml8 ~]# nproc
192
(So on this machine you should use fewer than 48 cores or threads.)
To find the amount of memory:
[root@ml8 ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:          1,0Ti       103Gi       837Gi       6,6Gi        66Gi       891Gi
Here you should not try to use more than the free amount.
If you violate these limits, the machine may crash and you and your fellow users will lose all ongoing work. If we can save other users' jobs, we may consider killing the jobs that violate these conditions.
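As a sketch, the 1/4-of-cores guideline above can be checked directly from an ordinary (non-root) shell:

```shell
# Print the machine's core count and the maximum share you should use (1/4).
total=$(nproc)
echo "Total cores: ${total}"
echo "Your maximum share: $(( total / 4 ))"

# Check currently available memory before starting a large job;
# stay below the "free" column reported here.
free -h
```

On a 192-core machine like ml8 this reports a maximum share of 48.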
Home area
The HOME area is shared between ml1, ml2, ml3, ml4, ml6, ml7 and ml8, i.e. you will see the same content when you log in to any of these machines.
The home area is backed up each night. To recover files, you can access /itf-fi-ml/home/.snapshots/<time stamp>/<username>, where your home folder has been backed up.
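For example, to restore an accidentally deleted file from a snapshot (the timestamp directory names vary, so list the .snapshots directory first; lost-file.txt is a placeholder):

```shell
# List the available snapshot timestamps
ls /itf-fi-ml/home/.snapshots/
# Copy a lost file back from a chosen snapshot (replace <time stamp>)
cp "/itf-fi-ml/home/.snapshots/<time stamp>/$USER/lost-file.txt" ~/
```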
Storage quota
ML nodes are not a place to store your data. You may copy input data to them, keep the outputs until the processing is done, and then copy the results back. The following table describes the amount of space you are allowed to use.
Location | Maximum Limit |
---|---|
$HOME (your home directory) | 20GB |
/itf-fi-ml/shared/users/$USER | 100GB (500GB for up to 14 days) |
/scratch/users/$USER | No limit; used for staging, i.e. only while the processing is ongoing. The available space is limited by how much is already used by others. Data not accessed for 14 days will be automatically deleted to make space for others. Users need to ask to be added manually to scratch storage - contact hpc-drift@usit.uio.no |
What will happen if you use more than the above limits?
- Backup will stop (there is no way to get your data back if you lose it)
- You will not be able to copy files or create new ones.
- Data in the /scratch area that has not been accessed for more than 14 days will be automatically deleted.
- If there is a reboot for unforeseen reasons, or a crash, all data in /scratch may be lost.
Using /scratch for large datasets
Since the home area of the ML machines is shared, the performance might not be the fastest when working with large datasets. To accommodate such workflows, each ML node has its own private scratch folder where users can store data temporarily when working on it. The scratch folder is local to each machine so when logging in to different ML machines users will see different content.
To start using the scratch folder, simply upload data to /scratch/users/<username> and access it from there. The scratch area is useful if you need to read and/or write a lot of data to files.
There are currently no usage limits on the scratch folders, but we retain the right to remove data that is not in active use when the scratch area of a machine is nearing full. If your workflow writes a lot of data to files, we recommend that you read and write in the scratch area and then move the results to your home area when the experiment is done.
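The staging pattern described above can be sketched like this (the experiment and file names are placeholders; paths follow the scratch layout described in the table):

```shell
# Stage input data into this machine's scratch area
mkdir -p /scratch/users/$USER/myexperiment
cp ~/dataset.tar.gz /scratch/users/$USER/myexperiment/

# ... run your job, reading and writing under /scratch/users/$USER/myexperiment ...

# When the experiment is done, move the results back to your backed-up home area
mv /scratch/users/$USER/myexperiment/results ~/myexperiment-results
```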
Upload/Download files
Software requests
If you need additional software, or want us to upgrade an existing software package, we are happy to do this for you (or to help you install it yourself if you prefer). So that we get all the relevant information and can take care of the installation as quickly as possible, we have created a software request form. After you fill in the form, a ticket will be created in RT and we will get back to you with the installation progress.
https://nettskjema.no/a/usit-sw-request
Key changed when trying to log in
UiO has updated the SSH host key policy, which decides the appropriate hashing function to use during SSH key exchange. For some of you this might mean that your previous setup now tells you that the key has changed and that you might be a victim of a man-in-the-middle attack. When encountering such messages, please check trusted sources to ensure that you are not being attacked, and then proceed from there.
In the current case it simply means that you need to refresh the host key of the ML node in question. To do this, use the following SSH command:
ssh-keygen -R ml1.hpc.uio.no
(replace ml1.hpc.uio.no with the applicable ML node). Then connect again as usual through SSH and verify that the fingerprint shown matches the corresponding key in the table below.
ML node | RSA key | ED25519 key |
---|---|---|
ML1 | SHA256:pAw0j5DjOvXrgKO3DlGvTvF3EAzaxw2/tEPGaygayGw | SHA256:rMc5mseHIDPcwPZCWlE3fAEK155ad8sJ7kQUSgVPWVY |
ML2 | SHA256:yogcKQBA8uZDap7bIqS8xtwhzXxM3JI7UyEHCItzLJU | SHA256:/QaY71pRnimBkUWb+H/NGv4b+EGf91sQdk1h8Z3/kKU |
ML3 | SHA256:9ETM32UFHBJC6BQfmqnE0R0ECQts/RYQGDNN/lqUmYs | SHA256:PXTnLgrMueFcPGuKgb8TyP2s+eBmeXJzSvEEb7rq19A |
ML4 | SHA256:dv5VKLHZ/IIAmj5aCUqQ5IAmVgnq/EXcyQcZjoRBAjk | SHA256:zHr4djVT4zu2fGlI6pdjAH9yOjG1a1ifwOwxe8GA1A8 |
ML6 | SHA256:0zRe9JqlhDZwDgJwdXBNF6KIfs7Y81GaiEMx7cdL0iw | SHA256:2o+eqB6cltnXuMXTSv+87xSijdtBSisRts840hAs9iQ |
ML7 | SHA256:I1FeqkoKGsUEJ8B7jNQZsMVQjXsct7oCRTvKDvXqIJk | SHA256:QTpQ3sY5rF84gQDMend8KhXP6Y7aWEhJ/Rgl5wQcRC4 |
Bluemaster01 | SHA256:Sn7I6tHz9OeL9PkBLorS24LrILUMbH4l5fydaTlzl+g | SHA256:biNo079CAkTDvPhQzNL9yVWaGRkfcff9eMSBFz1DLpQ |
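To check a fingerprint yourself from inside the UiO network, you can fetch the host key and print its SHA256 fingerprint, then compare it to the table above (ml1 and the ED25519 key type are shown as an example):

```shell
ssh-keyscan -t ed25519 ml1.hpc.uio.no 2>/dev/null | ssh-keygen -lf -
```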
Citations and acknowledgements
Please use the following format when acknowledging the ML nodes if you use them in your research.
Machine learning infrastructure (ML Nodes), University Centre for Information Technology, University Of Oslo, Norway.
Contact
- If you need help with installing software on the ML nodes please fill in this software request form
- If you need other types of support please fill in this support request form
- hpc-drift@usit.uio.no