On this page, you will learn how to properly use the CPU and GPU clusters of the team, and especially the OAR submission system.
In particular, you are highly encouraged to read the section about the “rules” and “good practices” when submitting jobs to ensure a fair use of resources among all users from the team.
The CPU cluster of the team is now part of the “shared cluster” of the center, where you can use computing resources from different teams. You have priority access to the computing nodes from THOTH.
The CPU cluster frontal node is called access2-cp. You can ssh to it, but you should not use it to do computations (it is just a bridge). Instead, you have to submit jobs to the computation/worker nodes. A job is a task (generally a script in bash, python, etc.) that you cannot run on your workstation (because of a lack of computing power/memory, etc.).
The GPU cluster is not shared; it is dedicated to our team. The frontal node is called edgar. Likewise, it is not a machine to do computations on.
In both cases you cannot directly SSH to the computation nodes; you can only use them by submitting jobs. Job management is handled by a program called OAR. OAR is a resource manager: it allows you to reserve some computing resources (xx CPU cores, a GPU) for a certain amount of time to do a task (run your script). When you submit a job, OAR allocates the resources that you requested and your job is run on these resources.
Important: You have to estimate the resources and time that your job will require when you submit it. If you reserve 20 CPU cores for 6 hours, your job should finish before the end of its allocated time; otherwise, your computations will be lost. To avoid this, you can use checkpointing and intermediate saving.
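For instance, a minimal sketch of intermediate saving in a (hypothetical) bash job script, where run_stage.sh and the results/ layout are placeholders for your own experiment:
#! /bin/bash
# Hypothetical sketch: split a long experiment into stages that each record
# their completion, so a job killed at the walltime only loses the current
# stage and a resubmission picks up where it left off.
mkdir -p results
for stage in 1 2 3 4; do
    if [ ! -f "results/stage_${stage}.done" ]; then
        ./run_stage.sh "$stage" && touch "results/stage_${stage}.done"
    fi
done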
The following commands will be useful to use OAR, especially to submit and monitor jobs:
oarstat
You can pipe the result to a grep on your username to print your jobs, e.g. oarstat | grep username, or you can use the following option: oarstat -u username.
Note: any submitted job is assigned a job id or job number (denoted <job_ID> in the following) that can be used to monitor it.
To monitor a running or finished job, you can use (replace <job_ID>
by your job number):
oarstat -fj <job_ID>
To submit a job, use the oarsub command. The simplest reservation looks like this:
oarsub -I # Interactive
If the cluster isn't full, your job will be executed almost immediately. The -I option stands for “interactive”; it opens a shell on a computation node where you can run your experiment. This interactive mode is useful to run some small tests (the session is limited in time, at most 2 hours) before using the standard submission mode.
oarsub my_script.sh
The command gives the prompt back immediately and will execute the job at a later time; this is called “batch mode”. The standard output and error are redirected to OAR.<job_ID>.stdout and OAR.<job_ID>.stderr in the directory from which you submitted the job.
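To follow the output of a running batch job, you can for instance tail these files (replace <job_ID> by your job number):
tail -f OAR.<job_ID>.stdout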
Attention: Your script my_script.sh should have execution rights (more on Linux file permissions can be found here). You can modify access rights with the following command: chmod u+x my_script.sh.
To cancel a job, use oardel <job_ID> (you can get the <job_ID> from oarstat).
A visual overview of the nodes is given by the Monika software (cpu cluster and gpu cluster).
Note: Some tools (python scripts) were designed in the team to parse oarstat results in a more human-readable format; check out the thoth_utils Gitlab webpage. These scripts are also available in the following directory: /home/thoth/gdurif/shared/thoth_utils (which you can for instance add to the PATH in your .bashrc or equivalent). The most interesting scripts in there are gpustat, which summarizes the current use of the GPU cluster, and oarstat++.py, which can be used on both clusters (oarstat++.py -s edgar or oarstat++.py -s access2-cp). Other options are documented (see oarstat++.py -h).
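For example, the PATH setup mentioned above could look like this in your .bashrc (a minimal sketch, assuming you use bash and the shared directory given above):
# Make the thoth_utils scripts available in your shell
export PATH="/home/thoth/gdurif/shared/thoth_utils:$PATH"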
To connect to a running job (i.e. open a shell on the node where your computations are running), use oarsub -C<job_id>.
You can pass options to the oarsub command to specify the resources that you want to reserve, in particular with the -l option. These options can be specific to the CPU or GPU cluster. In both cases, to specify the duration of your job, use the following option:
oarsub -l "walltime=16:0:0" <your_command>
You can also request nodes with specific characteristics (properties) using the -p option. These properties can be combined, e.g. oarsub -p "property1='XX' AND property2>'YY'" (note the importance of all the single quotes ' and double quotes "). It is also possible to use the OR keyword. More details can be found in the corresponding section below, depending on the CPU or GPU cluster. The available properties can be found on the dedicated monika web page for each cluster (cpu cluster and gpu cluster).
Unless specified otherwise, all jobs that you submit are default jobs. Once started, they will run until they finish/crash/reach the walltime, without interruption. To ensure a fair share of the resources between users, the resources allocated to the default jobs of a single user are limited (cf. below).
To submit a job in best effort mode, you should add the following options -t besteffort -t idempotent to your submission:
oarsub -l xxx -t besteffort -t idempotent <your_command>
Best-effort jobs will be killed when resources are needed (for waiting default jobs) and restarted automatically (if you do not forget the option -t idempotent). However, you have to handle checkpointing yourself.
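Since a best-effort job can be killed and restarted at any time, your script should be able to resume from a previously saved state. Here is a minimal sketch, assuming a hypothetical train.py with --resume/--save-checkpoint options and a checkpoint stored on your scratch (adapt names and paths to your own code):
#! /bin/bash
# Hypothetical resume logic for an idempotent best-effort job: when OAR
# restarts the job, continue from the last checkpoint instead of starting over.
CKPT=/scratch/$USER/myexp/checkpoint.pth   # assumed location, adapt to your setup
if [ -f "$CKPT" ]; then
    python train.py --resume "$CKPT"
else
    python train.py --save-checkpoint "$CKPT"
fi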
The -l option allows you to specify the resources that you want to reserve. To get a machine with 16 cores, use:
oarsub -l "nodes=1/core=16" <your_command>
To get 8 cores on 1 machine for at most 16 hours, use:
oarsub -l "nodes=1/core=8,walltime=16:0:0" <your_command>
You are now sharing the RAM with other jobs. Make sure you specify “nodes=1”, such that all your cores and RAM are reserved on the same machine.
Attention: Your main concern should be the use of memory. Before submitting a job, you should estimate the memory it will require and request a computation node with sufficient memory (see below). Since OAR does not monitor the quantity of memory requested by submitted jobs, multiple memory-heavy jobs can be assigned to the same computation node, creating issues and potentially crashing the node. Thus, it is recommended that your experiments do not waste memory (for safety, you can use the ulimit bash command in your script).
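For instance, a sketch of such a safety limit (the 32 GB value is only illustrative; pick a limit consistent with the node you requested):
# Limit the virtual memory of this script and its children to ~32 GB
# (ulimit -v takes a value in kilobytes): processes exceeding it will fail
# instead of stalling the whole node.
ulimit -v 33554432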
The -p option allows you to specify the characteristics of the computation node that you want to reserve. It can be mixed with the -l option. For instance, if you want 8 cores on a single node with more than 64 GB of memory, you can use:
oarsub -l "nodes=1/core=8,walltime=16:0:0" -p "mem>64" <your_command>
Note: More information about the shared cluster can be found in the dedicated page, in particular regarding the use of the resources from our team (THOTH) or from other teams.
The CPU computation nodes are Xeons with 8 to 48 cores. They have at least 32 GB of RAM (you can find out how much RAM each machine has with the command oarnodes, property 'mem'). All nodes run Ubuntu 16.04 x86_64 with a set of commonly used packages installed by default. You can install packages on the cluster nodes; however, if you feel you need them everywhere, you should seek the help of a system administrator. On all nodes you can use /tmp and /dev/shm for (local) temporary files. Global data has to be on the scratches.
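A quick (assumed) way to check this from the frontal node is to filter the oarnodes output for the memory property, assuming it is printed as mem=... in the node properties:
# List node descriptions and keep only the lines mentioning the 'mem' property
oarnodes | grep "mem="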
Important: On the GPU cluster, many nodes have more than 1 GPU. You are responsible for using the GPU that was allocated to you by OAR.
To do so, you have to source the file gpu_setVisibleDevices.sh
(available on all nodes) at the beginning of the script that you submit to OAR. This file sets the “CUDA_VISIBLE_DEVICES” environment variable. Using this, you no longer have to specify a GPU ID when running your single-GPU experiments. For example, if you are assigned GPU number 5 on gpuhost21, it will show up as “gpuid=0” in your job. Thus, your (bash/zsh) script should be something like:
source gpu_setVisibleDevices.sh
...
GPUID=0 # ALWAYS
RunCudaCode(... , GPUID , ...)
Note 1: the previous procedure based on the script gpu_getIDs.sh
is now deprecated.
Note 2: If you don't use a bash/zsh script and directly submit a command to OAR, you have to modify your submission to source gpu_setVisibleDevices.sh before running your command, for instance:
oarsub -p ... -l ... "source gpu_setVisibleDevices.sh; python myscript.py"
Attention: You have to check whether the program/code that you use is compatible with setting the environment variable CUDA_VISIBLE_DEVICES. Standard libraries such as TensorFlow, Caffe, PyTorch (and anything based on CUDA) are compatible. If in doubt, you can contact your system administrators before submitting your job on the cluster.
To check GPU memory consumption and use, you can connect to the node where your computations are done (with oarsub -C<job_id>
) and run:
nvidia-smi
If you are doing everything right, you will see a decent GPU-Util percentage for the GPU you are using.
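To keep an eye on it while your job runs, you can for instance refresh nvidia-smi periodically:
# Refresh the GPU usage report every 2 seconds (Ctrl-C to quit)
watch -n 2 nvidia-smi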
To reserve 2 GPUs for a single job (if you use multi-GPU computations for instance), you can use the following option -l "host=1/gpuid=2"
for oarsub
, which can be combined with a walltime option, for instance:
oarsub -l "host=1/gpuid=2,walltime=48:0:0" <your_command>
In this case (multi-GPU code), sourcing gpu_setVisibleDevices.sh
will set up the visible GPU id accordingly, and your (bash/zsh) script should be something like:
source gpu_setVisibleDevices.sh
...
GPUID=0,1 # ALWAYS
RunCudaCode(... , GPUID , ...)
The next section gives you other examples of oarsub commands.
See the man pages or the online OAR documentation for more info.
The GPU worker nodes are called gpuhost1
,.., gpuhost22
. Some of them are desktop machines, some of them are racked. They run the same base configuration as desktop computers, and you also have rights to install packages. They generally have a scratch attached, which means you can get high volume local storage if that is one of your computation requirements.
You can run the command oarnodes on edgar to see the available resources (which GPU model) on the cluster, or check the files listing the machines of the team (txt or spreadsheet).
The latest version (or a very recent one) of the NVIDIA drivers is installed on each node. However, you have to install your own version of CUDA somewhere in your scratchdir (or use a version installed by one of your colleagues) and correctly set the PATH and LD_LIBRARY_PATH in your scripts to use it.
oarsub -I # reserve 1 GPU interactively for 2 hours
oarsub -l "walltime=48:0:0" "/path/to/my/script.sh" # reserve 1 GPU for 48 hours
oarsub -p "gpumodel='titan_xp'" -l "walltime=48:0:0" "/path/to/my/script.sh" # reserve 1 Titan Xp for 48 hours
oarsub -p "host='gpuhost5'" -l "walltime=48:0:0" "/path/to/my/script.sh" # reserve 1 GPU on gpuhost5 for 48 hours
oarsub -p "host='gpuhost3' and gpuid='1'" -l "walltime=48:0:0" "/path/to/my/script.sh" # reserve GPUID 1 on gpuhost3 for 48 hours
oarsub -p "host='gpuhost1'" -l "gpuid=4,walltime=48:0:0" "/path/to/my/script.sh" # reserve 4 GPUs on gpuhost1
oarsub -p "gpumodel='titan_black'" -l "host=1/gpuid=2" -l "walltime=48:0:0" "/path/to/my/script.sh" # reserve 2 Titan Blacks on a single host
oarsub -r "2015-05-24 15:20:00" -l "walltime=48:0:0" "/path/to/my/script.sh" # reserve a GPU in advance
oarsub -p "host='gpuhost3'" -l "host=1" -t besteffort -t idempotent "/path/to/my/script.sh" # reserve a whole node in besteffort (acceptable for CPU jobs)
oarsub -p "gpuid='0'" -l "walltime=48:0:0" "/path/to/my/script.sh" # reserve exclusively GPUID 0, thus no need to modify your code (default GPU is 0)
Here are some rules to use the cluster efficiently and show courtesy towards your colleagues. :)
Do not run any jobs on the submission nodes (edgar, access{1,2}-cp), which are just gateways to the different clusters, nor on clear (the former frontend of the CPU cluster), which is a vital point in the THOTH network, notably as DHCP and NFS server (you don't want it to crash!).
The use of persistent terminal multiplexers (screen, tmux) is forbidden on the nodes themselves. Run your screen session from the OAR server or from your desktop machine.
Some of these rules are not enforced automatically, but if some jobs do not follow them, the administrators may suspend or kill them. Of course, exceptions can be made, especially during paper submission periods, but always by first discussing with your colleagues, and in a good spirit.
Submitting a job on n cores means that your job must be able to fully use n cores (cf. parallelization, at code or thread/process level). An exception is made for jobs that require high amounts of RAM. In those cases, inflate the number of cores allocated according to the RAM you use. If you use too much memory, the machine might stall completely (what we call swap thrashing), requiring a manual reboot from a sysadmin.
Avoid overloading the OAR server by launching too many jobs in a row. There should not be many more jobs in the waiting state than cores.
In the same spirit, you shouldn't occupy more than 100 cores on the CPU cluster with default jobs. If you need more, switch to besteffort jobs.
It is forbidden to reserve a node via the command sleep
.
You are limited to 3 default jobs on the GPU cluster, to prevent a single user from reserving all the resources. If you need more, you can use besteffort jobs.
You should control the memory (as in RAM) and CPU consumption of your GPU jobs, since your jobs share worker nodes with jobs from other people (1, 2, 4 or 8 GPUs per node).
In addition, you should also avoid using “big” GPUs if your task does not require huge memory. You can specify the GPU model or GPU memory with the following options: -p "gpumodel='XX'" or -p "gpumem>'10000'" (note the importance of the single quotes around 10000).
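For instance, combining such a property with a walltime (the values below are only illustrative):
# Request a GPU with more than 10000 (as reported by the gpumem property) for 16 hours
oarsub -p "gpumem>'10000'" -l "walltime=16:0:0" <your_command>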
Reserving a GPU via the command sleep
can be tolerated but is highly discouraged.
Equalizer script: in the past, it happened that a single user took all empty resources with besteffort jobs, forcing besteffort jobs from other users to wait (when all users already had their 2 default jobs), which was not a fair distribution of resources. To avoid this, we set up an “equalizer” script that regularly and automatically kills the last submitted job of the user with the largest number of besteffort jobs whenever some besteffort jobs from other users are waiting, enforcing a redistribution of the resources. So do not be surprised if some of your besteffort jobs are killed and resubmitted from time to time.
Fixing OAR lockup: sometimes OAR becomes completely unresponsive. Jobs can still be submitted using “oarsub”, however the list of jobs in “Waiting” state keeps growing, even though resources are available. In those cases and when admins are not around, you can use the following command on edgar:
sudo /bin/OAR_hard_reset.sh
It will attempt to restart OAR forcefully. Double-check that OAR is locked up before running this command.
OAR has another way of specifying all options for a job. Just pass it a script name:
oarsub --scanscript ./myscript.sh
OAR will now process any line starting with #OAR
.
Here is an example:
#! /bin/bash
#OAR -l nodes=1/core=10
#OAR -n mybigbeautifuljob
#OAR -t besteffort
#OAR -t idempotent
echo "start at $(date +%c)"
sleep 10
echo "end at $(date +%c)"
In this way every job can be a fairly self-contained description of what to do and which resources are needed. In case you are not a big fan of wrapping all your python scripts in shell scripts, this seems to work with arbitrary text files:
#! /usr/bin/env python
#OAR -t besteffort
#OAR -t idempotent
#OAR -l nodes=1/core=10
import time
from datetime import datetime
print(datetime.now())
time.sleep(20)
print(datetime.now())
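You would then submit it in the same way as the shell script above (the file name is just an example):
oarsub --scanscript ./myscript.py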
Note: the script must be executable (chmod +x myscript.sh) and include the shebang line (e.g. #! /bin/bash) in the script itself. Also note that two separate -t options are needed (one for besteffort, one for idempotent).
Some useful scripts are available in the thoth_utils repository (available here) and on any machine of the team in the following directory: /home/thoth/utils/thoth_utils.
Here is a short description of their usage:
oarstat++.py is an improved version of oarstat. It displays the names of the nodes, their number of cores, and some statistics on the busy/free/waiting nodes as a function of their number of cores. For your convenience, just paste the following shortcut into your ~/.bashrc file: alias oarstat++="python /path/to/oarstat++.py". To get help about it, call oarstat++ -h. Idem for oarstatr.
oargetnode <nb_of_cores> [<job_name>] is an improved version of job submission. To use it, paste the following shortcut into your ~/.bashrc file: alias oargetnode="/path/to/oargetnode".
oardelall deletes all your jobs; to use it, run /path/to/oardelall. It will ask you for confirmation, first delete your waiting jobs, and finally delete your running jobs after a second confirmation.
Two useful commands to start interactive sessions on the CPU and GPU cluster, respectively:
###################################################
# Start an interactive CPU job in the current directory
# (2 cores by default)
# E.g. to make a 20 core job:
# $ c 20
###################################################
function c () {
    ssh -tA access2-cp "oarsub -I -d $(pwd) -l nodes=1/core=${1:-2}"
}
###################################################

###################################################
# Start an interactive GPU job in the current directory.
# I use gtx 980 for debugging to not block the more powerful GPUs
###################################################
alias g="ssh -t edgar oarsub -I -d $(pwd) -p \"\\\"gpumodel='gtx980'\\\"\""
###################################################