== General information ==
To access the HPC, start a terminal session and <code>ssh</code> into:
* <code>cody.scem.westernsydney.edu.au</code> (for CPU only)
* <code>wolfe.scem.westernsydney.edu.au</code> (for GPU)

The current setup has the following partitions:

<nowiki>
PARTITION  AVAIL  TIMELIMIT   NODES  STATE  NODELIST
k6000      up     infinite    4      idle   bd-client-[01-04]
cpu*       up     infinite    8      idle   compute-[000-007]
a100-dev   up     7-00:00:00  1      idle   a100-dev
a100       up     7-00:00:00  6      idle   a100-[000-005]
</nowiki>

Each <code>k6000</code> node has 8 CPUs, each <code>cpu</code> node has 16 CPUs, <code>a100-dev</code> (i.e. Wolfe) has 8 CPUs, and the <code>a100</code> nodes have 32 CPUs each.

Both <code>k6000</code> and <code>a100</code> have GPUs attached. The <code>a100</code> nodes are the most recent addition; each has the full GPU capability of an A100 chip, that is, 6912 CUDA cores and 40 GB of memory. The <code>k6000</code> nodes have the GTX6000 chips, each with 4600 CUDA cores and 24 GB of memory.

So for the A100 nodes:
* a100-dev: 8 CPUs, 32 GB RAM, 10 GB GPU RAM
* a100-00[0,1] nodes: 32 CPUs, 128 GB RAM, 40 GB GPU RAM

Note: we also have the node <code>a100-100</code>, which has 80 GB of GPU RAM. Request this node if you need a lot of memory for your job.


=== Copying files to and from the HPC ===

Use <code>scp</code> to copy files to and from the HPC.

To copy a local file to the cluster (do this from your local machine):

<nowiki>
scp ./<filename> <user>@wolfe.scem.westernsydney.edu.au:<path>
</nowiki>

If you omit <code><path></code>, the file will be placed in your home directory.

To copy a file from the cluster:

<nowiki>
scp <user>@wolfe.scem.westernsydney.edu.au:<path>/<filename> .
</nowiki>

This copies the file at <code><path>/<filename></code> on the cluster to your current directory.
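
As a concrete, hedged example (the username <code>jsmith</code>, the file names and the <code>project</code> directory are made up for illustration):

<nowiki>
# copy a local file into a (hypothetical) project directory on the cluster
scp ./data.csv jsmith@wolfe.scem.westernsydney.edu.au:project/

# copy a result file from the cluster back to the current local directory
scp jsmith@wolfe.scem.westernsydney.edu.au:project/results.txt .

# copy a whole directory in either direction with -r
scp -r ./dataset jsmith@wolfe.scem.westernsydney.edu.au:project/
</nowiki>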


=== Note about the HDD ===

There are no backups. If a file goes missing, it is gone for good. The HDD on the HPC is not meant for storing valuable data; it is a place to hold data for use on the HPC. Anything important should be stored elsewhere.

''other information to be filled later''
== Resources ==

=== GPU ===

To request a GPU for your jobs, you need to include the line:
<nowiki>
#SBATCH -p a100
</nowiki>
in the bash script you use to submit the job, to specify the a100 partition (or the k6000 partition). This is in addition to the GPU resource request.
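
For example, a minimal sketch of the relevant request lines, assuming the GPU type string for this partition is <code>a100</code> (as used in the Interactive Jobs section below):

<nowiki>
#SBATCH --partition=a100     # run on the a100 partition
#SBATCH --gres=gpu:a100:1    # the GPU resource request: one A100 GPU
</nowiki>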

Note that it is not possible to use a GPU on the head node (i.e. the node you logged in on). To test code that requires a GPU, you'll need to request a temporary session with <code>sinteractive</code>; see the relevant information below under SLURM.

=== Scratch disk ===
There is a small local SSD attached to the HPC: <code>/bigdata-local</code>. The disk is only 1 TB in size, so it will be cleaned regularly without prior warning. If your code is running too slowly when reading data from your home directory, putting your data on the local drive will help greatly. However, make sure that you clean up any data you generate once your job has finished, so that other users can use the drive.

Note that if your job ran on the <code>a100-000</code> node, then the scratch data will be in the <code>/bigdata-local</code> directory on that node (and likewise for the other nodes). To clean up manually, you'll need to ssh into the node, or you can simply add an <code>rm -rf</code> line to your SLURM script to clean it up.
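
A hedged sketch of how a job script might stage data on the scratch disk and clean up afterwards (the directory, file and program names are placeholders):

<nowiki>
# stage input data on the node-local scratch disk
SCRATCH=/bigdata-local/$USER
mkdir -p "$SCRATCH"
cp ~/data/input.dat "$SCRATCH"/

# run the job against the local copy
python3 process.py "$SCRATCH"/input.dat

# clean up the scratch space so other users can use the drive
rm -rf "$SCRATCH"
</nowiki>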
== Software ==
To check the available modules on the HPC, use the command <code>module avail</code>; to load a module, use, for example, <code>module load Python/Python3.7.0</code>.
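
For instance (the module name below is taken from the example above; check <code>module avail</code> for the exact names on the system):

<nowiki>
module avail                     # list the modules installed on the HPC
module load Python/Python3.7.0   # load a specific Python module
module list                      # show what is currently loaded
</nowiki>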


=== Python related ===
To use <code>PyTorch</code>, you'll need one of the following combinations:
* Python 3.7 + Torch 1.9 + CUDA 11
* Python 3.9 + Torch 1.9 + CUDA 11
* Python 3.10 + Torch 12.1 + CUDA 11.3

To view the available modules, use <code>module avail</code>. To load a module, use <code>module load modulename</code>. For example, to load Python 3.10 with PyTorch, use <code>module load PyTorch/Python3.10</code>.

=== Conda environment ===

If you have a specific set of packages you want to use for your project, then it's best to set up a conda environment for it (or use a singularity container, see below). To do that, you first need to load the anaconda module, then create a new environment:

<nowiki>conda create -p ~/.conda/envs/<env_name> python==<python version>
</nowiki>

<b>Note: as of 3 Oct 2024, the <code>-p</code> flag actually breaks conda, and <code>conda create</code> creates the environment directory in the user's home directory.</b>

Note that this is not how conda environments are normally created. The way conda is set up on wolfe, it wants to create the environment in the base installation directory, where users don't have write permission, so you create the environment under your own directory instead. You can give the environment any directory name you want; it doesn't have to be <code>.conda/envs</code> under your home directory.

Once that's set up, you first need to initialise the shell before you can activate your environment:

<nowiki>conda init <shell name>
</nowiki>
The <code>shell name</code> can be <code>bash</code>, <code>zsh</code>, etc. I would go with <code>bash</code> if you don't know what any of the shell names mean. You'll then need to log off and log back on to wolfe to make sure the shell now works.

You can now activate your environment. The setup should tell you how to activate it, but in general it will be:

<nowiki>conda activate /rusers/<your group>/<your home dir>/<env path>/<env_name>
</nowiki>

You can then install the various packages for your project.
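
Putting the steps together, a hedged end-to-end sketch might look like the following; the anaconda module name, the group and home directory, and the environment name are placeholders, not the actual values on wolfe:

<nowiki>
# one-off setup, run interactively on the login node
module load <anaconda module>                  # exact name as listed by module avail
conda create -p ~/conda-envs/myproj python==3.9
conda init bash                                # then log off and back on
conda activate /rusers/<your group>/<your home dir>/conda-envs/myproj
pip3 install <the packages your project needs>
</nowiki>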

<b>Note for PyTorch</b>: The PyTorch setup is quite specific. So far only Python 3.9 has been tested, and you need to install PyTorch with the following command:

<nowiki>pip3 install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html</nowiki>
This is because you'll need CUDA 11+ for the A100; earlier builds don't include support for that chip.

<b>To submit a job</b>: instead of <code>conda activate</code>, you'll need <code>source activate</code> in your job script (see the SLURM section below); there is no need to load anaconda. Everything else is the same.
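
For example, the relevant part of a job script might look like this (the environment path and program name are placeholders, and the partition line is only an example):

<nowiki>
#SBATCH --partition=a100

# activate the conda environment without loading the anaconda module
source activate /rusers/<your group>/<your home dir>/conda-envs/myproj

python3 train.py
</nowiki>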

=== Singularity/Container ===

<code>apptainer</code> is installed, but not as a module. Also, according to David Minored: "However, it has an issue with the current Kernel. Moving the OS up the the next version fixes that. However, it breaks at least one thing I have found. Once I have a fix for that, I'll do the upgrade."


=== 7-zip ===

To unzip files using 7-zip, use the command <code>7za</code> instead of <code>7z</code>.
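
For example, to inspect and extract a (hypothetical) archive named <code>data.7z</code> into the current directory:

<nowiki>
7za l data.7z    # list the contents of the archive
7za x data.7z    # extract the archive, preserving directory structure
</nowiki>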

== Using SLURM ==

The HPC uses SLURM for its job scheduler. Commonly used SLURM commands are listed below, with example invocations after the list:

* <code>squeue</code> shows the currently queued and running jobs.
* <code>sinfo</code> provides information about the partitions and nodes.
* <code>sbatch</code> is used in conjunction with a SLURM script to submit jobs to the queue.
* <code>scancel</code> stops a currently queued or running job.
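
Typical invocations (the job script name and job ID are placeholders):

<nowiki>
squeue -u $USER      # show only your own queued and running jobs
sinfo                # show partition and node status
sbatch my_job.sh     # submit a job script to the queue
scancel <job id>     # cancel a job using the ID reported by squeue
</nowiki>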

=== SLURM Script ===
Submitting jobs to SLURM requires a job script. Below is a sample script to get started with.

<nowiki>
#! /usr/bin/env bash
#
#SBATCH --job-name=simple
#SBATCH --output=simple.txt
#
#SBATCH --ntasks=1
#SBATCH --time=05:00        # this sets the maximum time the job is allowed before it is killed

#SBATCH --partition=a100
##SBATCH --partition=cpu    # the double hash means that SLURM won't read this line

# load the python module
module load Python/Python3.10   # make sure to load the modules needed

python3.10 simple.py            # the program that is run
</nowiki>

David has said not to set <code>mem-per-cpu</code>, <code>cputime</code> or related options at the moment; there is an error with the system. It's on the to-do list.

Note: to request the 80 GB node for your job, include the line <code>#SBATCH --nodelist=a100-100</code> in your script.

=== Submitting a Job ===

Once the SLURM script is ready, the job can be submitted using <code>sbatch script.sh</code>, where <code>script.sh</code> is the name of the SLURM script. The progress of the job can be viewed using <code>squeue</code> or by examining the job output file (set to <code>simple.txt</code> in the above sample SLURM script).
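
For example, after submitting the sample script above (the output file name matches that script; <code>tail -f</code> simply follows the file as it is written):

<nowiki>
sbatch script.sh     # submit; SLURM prints the assigned job ID
squeue -u $USER      # check whether the job is queued or running
tail -f simple.txt   # follow the job output as it is produced
</nowiki>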


=== Interactive Jobs ===

If you need to test code, it is highly recommended that you request time on the nodes via <code>sinteractive</code>:

<nowiki>
sinteractive -p ampere24 --gres=gpu:a30:1
</nowiki>
This will open a shell on an ampere24 node (if available) for 10 minutes. You can override the default 10-minute limit by adding, e.g., <code>--time 0:15:0</code>, which is 15 minutes.


Note the <code>--gres=gpu:a30:1</code> flag. This should be applied to ALL GPU jobs, because it tells SLURM not to share a GPU node.

Note 2: for the ampere80 queue it is <code>--gres=gpu:a100:1</code>, and when the ampere40 queue comes back online, also use <code>--gres=gpu:a100:1</code>.
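
For instance, to request a 15-minute interactive session on the ampere80 queue (assuming it accepts the same flags as the example above, with the <code>a100</code> GPU type):

<nowiki>
sinteractive -p ampere80 --gres=gpu:a100:1 --time 0:15:0
</nowiki>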