Installing Python packages on Fox
Using pip inside a virtual environment is the easiest way to install Python packages as a user. We advise using virtual environments because they are a straightforward way to isolate different installations from each other. This makes it possible to have multiple versions of the same package installed in your $HOME without conflicting dependencies.
pip is the main package installer for Python and is included in every Python installation. It is easy to use and can be combined with venv to manage independent environments. We recommend that you have at least one virtual environment for each separate experiment.
Key takeaways:
- Always install packages inside a virtual environment.
- Do not install packages with "pip install --user". They will end up in $HOME/.local, and from there they will leak into both containers and environments, breaking compatibility for other installations.
- Leverage the existing software stack for dependencies using the option "--system-site-packages".
- When loading dependencies from the central software stack, always use the same toolchain (more on this further down).
How to create a Python virtual environment and install a Python package inside of it
In this example we use venv, which comes with the Python standard library. Other guides and documentation often use virtualenv instead. venv provides a subset of what virtualenv offers and has all the functionality we need.
First load a Python module (use 'module avail Python' to see all available versions):
$ module load Python/3.8.6-GCCcore-10.2.0
Create the virtual environment in your $HOME folder with an appropriate name:
$ python3 -m venv $HOME/pandas-env --system-site-packages
Activate the environment:
$ source $HOME/pandas-env/bin/activate
Install packages with pip. Here we install pandas.
(pandas-env) $ python3 -m pip install pandas
You are now ready to use the new environment. When you are done and want to get out of the environment, you simply type:
$ deactivate
Remember that you will always have to load the same module(s) before you activate your environment next time.
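A quick way to confirm that an environment is active is to ask the interpreter itself. The following is a minimal sketch and not part of the official procedure; the function name in_virtual_environment is made up for illustration. It relies on the fact that inside a venv sys.prefix points at the environment, while sys.base_prefix still points at the Python installation the environment was created from.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Outside a venv both prefixes are identical; inside one they differ.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside a virtual environment:", in_virtual_environment())
```

Run with python3 after activating the environment, this should report True; after deactivate it should report False.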
For more information, have a look at the official pip and venv (https://virtualenv.pypa.io/en/latest/) documentation.
Note |
---|
When running software from your Python environment in a batch script, we highly recommend activating the environment only inside the script (see below) and keeping the login environment clean when you submit the job. Otherwise the environments can interfere with each other, even if they are the same. |
Choosing a Python version
If you need a specific version of Python for your installation, then you can search and see if that version is available on our system. This command will give you a list of all Python modules installed:
$ module avail python
Load the module you need and then create a virtual environment before you start installing packages inside of it.
Searching for dependencies and choosing a toolchain
The dependencies needed to install a certain Python package are usually listed in the requirements.txt file. This file is found in the source files of the package you are interested in. The dependencies can sometimes also be found under the variable install_requires in the file setup.py (also found in the source files).
Since we already have hundreds of Python packages installed (in different versions) on our system, you can utilize those when installing the package you need using this small procedure:
- Search to see if any of your dependencies are available using module spider
- Make sure the modules (which contain your dependencies) are built with the same toolchain (more on this further down)
- Load all the modules you need (if you do not get an error message, they are compatible)
- Create your virtual environment and install the Python package you need
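The first step of this procedure can be sketched in Python. This is only an illustration under stated assumptions: the function requirement_names and the example file content are hypothetical, and the parser only understands plain "name plus optional version specifier" lines. It prints one module spider command per dependency, which you can then run by hand.

```python
import re
from typing import List


def requirement_names(text: str) -> List[str]:
    """Extract bare package names from the text of a simple requirements.txt."""
    names = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):
            continue  # skip blank lines and pip options such as -r or -e
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


# Hypothetical requirements.txt content, used only for illustration.
example = """\
# dependencies of some package
numpy>=1.26
pandas==2.2.2
scipy
"""

for name in requirement_names(example):
    print("module spider", name)
```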
If you already know some of the Python packages you want to use, you can search for them directly with the module spider command. Let us say you want to use the Python package numpy:
$ module spider numpy
-----------------------------------------------------------------------------------------------------------------------------------------
numpy:
-----------------------------------------------------------------------------------------------------------------------------------------
Versions:
numpy/1.25.1 (E)
numpy/1.26.2 (E)
numpy/1.26.4 (E)
This will give you a list of all versions of numpy installed. In order to see which module contains the version of numpy you need, run module spider again with the version number:
$ module spider numpy/1.26.4
-----------------------------------------------------------------------------------------------------------------------------------------
numpy: numpy/1.26.4 (E)
-----------------------------------------------------------------------------------------------------------------------------------------
This extension is provided by the following modules. To access the extension you must load one of the following modules. Note that any module names in parentheses show the module location in the software hierarchy.
SciPy-bundle/2024.05-gfbf-2024a
In order to see what other Python packages you will get access to when loading this SciPy-bundle module, run:
$ module spider SciPy-bundle/2024.05-gfbf-2024a
Included extensions
===================
beniget-0.4.1, Bottleneck-1.3.8, deap-1.4.1, gast-0.5.4, mpmath-1.3.0,
numexpr-2.10.0, numpy-1.26.4, pandas-2.2.2, ply-3.11, pythran-0.16.1,
scipy-1.13.1, tzdata-2024.1, versioneer-0.29
If you then load the module you can check which version of Python it comes with:
[ec-parosen@login-3 ~]$ module load SciPy-bundle/2024.05-gfbf-2024a
[ec-parosen@login-3 ~]$ which python3
/cluster/software/EL9/easybuild/software/Python/3.12.3-GCCcore-13.3.0/bin/python3
We see here that SciPy-bundle/2024.05-gfbf-2024a is built on top of the Python/3.12.3-GCCcore-13.3.0 module. You only need to load the former, since the latter is a dependency and will be loaded automatically.
Note |
---|
If you want to combine several modules that contain the Python packages you need, they must all come from the same toolchain, for example foss/2023a or foss/2022b. Note that GCCcore-12.3.0 is a subtoolchain of foss/2023a, so modules with either of these suffixes are compatible. Here is a list of all installed foss toolchains and the GCC versions included in them: |
foss/2021a -> 10.3.0
foss/2021b -> 11.2.0
foss/2022a -> 11.3.0
foss/2022b -> 12.2.0
foss/2023a -> 12.3.0
foss/2023b -> 13.2.0
foss/2024a -> 13.3.0
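The list above can be turned into a small lookup for checking module names before you load them. This is a sketch under assumptions, not a tool provided on the system: the names FOSS_TO_GCC, gcc_version and same_toolchain are hypothetical, and only the foss and GCCcore suffixes are recognised. Any other toolchain suffix (gfbf, for instance) is reported as unknown rather than guessed.

```python
import re
from typing import Iterable, Optional

# GCC version behind each installed foss toolchain, copied from the list above.
FOSS_TO_GCC = {
    "2021a": "10.3.0",
    "2021b": "11.2.0",
    "2022a": "11.3.0",
    "2022b": "12.2.0",
    "2023a": "12.3.0",
    "2023b": "13.2.0",
    "2024a": "13.3.0",
}


def gcc_version(module: str) -> Optional[str]:
    """Return the GCC version implied by a module's toolchain suffix, if known."""
    match = re.search(r"-GCCcore-(\d+\.\d+\.\d+)", module)
    if match:
        return match.group(1)
    match = re.search(r"-foss-(\d{4}[ab])", module)
    if match:
        return FOSS_TO_GCC.get(match.group(1))
    return None  # suffix not covered by this sketch


def same_toolchain(modules: Iterable[str]) -> bool:
    """Return True when all modules map to the same GCC version."""
    versions = [gcc_version(m) for m in modules]
    if None in versions:
        raise ValueError("unrecognised toolchain suffix in module list")
    return len(set(versions)) == 1


# Both module names below appear earlier in this guide.
print(same_toolchain(["Python/3.8.6-GCCcore-10.2.0", "Python/3.12.3-GCCcore-13.3.0"]))
```

A passing check only suggests that the modules are compatible; loading them together without an error message remains the real test.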
Using the virtual environment in a batch script
In a batch script you activate the virtual environment in the same way as above. You just need to load the Python module first:
# Set up job environment
set -o errexit # exit on any error
set -o nounset # treat unset variables as error
# Load modules
module load Python/3.12.3-GCCcore-13.3.0
# Set the ${PS1} (needed in the source of the virtual environment for some Python versions)
export PS1=\$
# activate the virtual environment
source $HOME/my_new_pythonenv/bin/activate
# execute example script
python pdexample.py
Sharing package configuration
To allow other researchers to replicate your virtual environment setup, it can be a good idea to "freeze" your packages. Freezing records the exact version of every installed package, so that pip does not silently pick newer versions when the environment is recreated. It also gives a good way to share the exact same packages between researchers.
To freeze the packages into a list to share with others run:
$ python -m pip freeze --local > requirements.txt
The file requirements.txt will now contain the list of packages installed in your virtual environment with their exact versions. When publishing your experiments, it can be a good idea to share this file, which others can install in their own virtual environments like so:
$ python -m pip install -r requirements.txt
Your virtual environment and the new one installed from the same requirements.txt should now be identical and thus should replicate the experiment setup as closely as possible.
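To check that an environment really matches a shared requirements.txt, the pinned versions can be compared with what is installed. The following is a minimal sketch using importlib.metadata from the standard library (Python 3.8 and later); the function name check_pins is hypothetical, and only "name==version" lines, as written by pip freeze, are examined.

```python
from importlib import metadata
from pathlib import Path


def check_pins(requirements_text):
    """Return (name, pinned, installed) for each pin that is not satisfied."""
    mismatches = []
    for raw in requirements_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue  # only exact pins are checked in this sketch
        name, pinned = (part.strip() for part in line.split("==", 1))
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is missing from this environment
        if installed != pinned:
            mismatches.append((name, pinned, installed))
    return mismatches


if __name__ == "__main__":
    path = Path("requirements.txt")
    if path.exists():
        for name, pinned, installed in check_pins(path.read_text()):
            print(f"{name}: expected {pinned}, found {installed}")
```

Run inside the activated environment, an empty result means every pinned package is present in the expected version.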