HPC documentation
General information
To access the HPC, start a terminal session and ssh into one of the following hosts:
cody.scem.westernsydney.edu.au (for CPU only)
wolfe.scem.westernsydney.edu.au (for GPU)
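For example, assuming a hypothetical username jsmith, logging in to the GPU host looks like:
ssh jsmith@wolfe.scem.westernsydney.edu.au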
The current setup has the following partitions:
PARTITION AVAIL TIMELIMIT  NODES STATE NODELIST
k6000     up    infinite   4     idle  bd-client-[01-04]
cpu*      up    infinite   8     idle  compute-[000-007]
a100-dev  up    7-00:00:00 1     idle  a100-dev
a100      up    7-00:00:00 2     idle  a100-[000-001]
Each k6000 node has 8 CPUs, each cpu node has 16 CPUs, and the a100 nodes have 32 CPUs each. Both the k6000 and a100 nodes have GPUs attached. The a100 nodes are the most recent addition; each has the full GPU capability of an A100 chip, that is 6912 CUDA cores and 40 GB of GPU memory. The k6000 nodes have the GTX6000 chips, each with 4600 CUDA cores and 24 GB of GPU memory.
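To check the current state of a particular partition before submitting (for example the a100 partition), you can pass the partition name to sinfo, which is described further in the SLURM section below:
sinfo -p a100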
Copying files to and from the HPC
Use scp to copy files to and from the HPC.
To copy a local file to the cluster:
scp ./<filename> wolfe.scem.westernsydney.edu.au:path
If you don't specify the path, the file will appear under your home directory.
To copy a file from the cluster:
scp wolfe.scem.westernsydney.edu.au:path/file .
This will copy the file under path to your current directory.
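To copy a whole directory rather than a single file, scp also accepts the -r (recursive) flag. A small sketch, where results and backup are hypothetical directory names:
scp -r ./results wolfe.scem.westernsydney.edu.au:backup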
Other information to be filled in later.
Resources
GPU
To request a GPU for your jobs, you need to include the line:
#SBATCH --partition=a100
in the bash script you use to submit the job, to specify the a100 nodes (or use --partition=k6000 for the k6000 nodes). This is in addition to the GPU resource request.
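The GPU resource request itself is typically made with a --gres line. A minimal sketch of the relevant directives, assuming the cluster uses the standard gpu GRES name (check with the administrators if unsure):
#SBATCH --partition=a100   # run on the a100 nodes
#SBATCH --gres=gpu:1       # request one GPU (GRES name assumed)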
Software
To use PyTorch, you'll need one of the following combinations:
- Python 3.7 + Torch 1.9 + CUDA 11
- Python 3.9 + Torch 1.9 + CUDA 11
- Python 3.10 + Torch 1.12.1 + CUDA 11.3
To view the available modules, use module avail. To load a module, use module load modulename. E.g. to load Python 3.10 with PyTorch, use module load PyTorch/Python3.10.
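As a quick sanity check that the module has loaded and PyTorch can see a GPU, you can run something like the following sketch (note that torch.cuda.is_available() will only report True on a GPU node, e.g. inside a job on the a100 partition):
module load PyTorch/Python3.10
python3.10 -c "import torch; print(torch.__version__, torch.cuda.is_available())"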
7-zip
To unzip files using 7-zip, use the command 7za instead of 7z.
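For example, to extract a hypothetical archive named data.7z into the current directory:
7za x data.7z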
Using SLURM
The HPC uses SLURM for its job scheduling. Commonly used SLURM commands are listed below, with a few example invocations after the list:
- squeue shows the currently queued and running jobs.
- sinfo provides information about the nodes.
- sbatch is used in conjunction with a SLURM script to submit jobs to the queue.
- scancel stops a currently queued or running job.
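A few example invocations (the job ID below is a placeholder):
squeue -u $USER     # show only your own jobs
sinfo               # list partitions and node states
sbatch script.sh    # submit the job described in script.sh
scancel 12345       # cancel the job with ID 12345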
SLURM Script
Submitting jobs to SLURM requires a job script. Below is a sample script to get started with.
#! /usr/bin/env bash
#
#SBATCH --job-name=simple
#SBATCH --output=simple.txt
#
#SBATCH --ntasks=1
#SBATCH --time=05:00            # this sets the maximum time the job is allowed before killed
#SBATCH --partition=a100
##SBATCH --partition=cpu        # the double hash means that SLURM won't read this line.

# load the python module
module load Python/Python3.10   # make sure to load the modules needed

python3.10 simple.py            # the program that is run
Submitting a Job
Once the SLURM script is ready, the job can be submitted using sbatch script.sh, where script.sh is the name of the SLURM script. The progress of the job can be viewed using squeue or by examining the job output file (set to simple.txt in the above sample SLURM script).
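Putting these together, a typical submit-and-monitor session might look like the following sketch (the job ID shown is a placeholder):
sbatch script.sh     # prints something like: Submitted batch job 12345
squeue -u $USER      # check whether the job is pending or running
tail -f simple.txt   # follow the job's output file as it is written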