OAR Tutorial

On this page, you will learn how to properly use the CPU and GPU clusters of the team, and especially the OAR submission system.

In particular, you are highly encouraged to read the section about the “rules” and “good practices” when submitting jobs to ensure a fair use of resources among all users from the team.


Basics

The CPU cluster of the team is now part of the “shared cluster” of the center, where you can use computing resources from different teams. You have priority access to the computing nodes from THOTH.

The CPU cluster frontal node is called access2-cp. You can SSH to it, but you should not use it for computations (it is just a bridge). Instead, you have to submit jobs to the computation/worker nodes. A job is a task (generally a script in bash, Python, etc.) that you cannot run on your workstation (because of a lack of computing power, memory, etc.).

The GPU cluster is not shared: it is dedicated to our team. Its frontal node is called edgar. As before, it is not a machine to do computations on.

In both cases, you cannot SSH directly to the computation nodes; you can only use them by submitting jobs. Job management is handled by a program called OAR. OAR is a resource manager: it allows you to reserve computing resources (e.g., some CPU cores, a GPU) for a certain amount of time to perform a task (run your script). When you submit a job, OAR allocates the requested resources to you and your job runs on these resources.

Important: You have to estimate the resources and time that your job will require when you submit it. If you reserve 20 CPU cores for 6 hours, your job should finish before the end of its allocated time; if not, your computations will be lost. To avoid this, you can use checkpointing and intermediate saving, as sketched below.
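For illustration, here is a minimal bash sketch of a job script that resumes from a previous checkpoint if one exists; the script name, checkpoint file, and --resume flag are hypothetical placeholders for whatever your own code provides:

  #! /bin/bash
  # Minimal checkpointing sketch: if a previous run (e.g., killed at the
  # walltime) left a checkpoint behind, resume from it instead of restarting.
  # "train.py", "checkpoint.pt" and "--resume" are hypothetical placeholders.
  if [ -f checkpoint.pt ]; then
      python train.py --resume checkpoint.pt
  else
      python train.py
  fi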

Using OAR

The following commands will be useful to use OAR, especially to submit and monitor jobs:

oarstat

You can pipe the result to a grep on your username to print only your jobs, e.g. oarstat | grep username, or use the dedicated option oarstat -u username.

Note: any submitted job is assigned a job id or job number (denoted <job_ID> in the following) that can be used to monitor it.

To monitor a running or finished job, you can use (replace <job_ID> by your job number):

oarstat -fj <job_ID>
Interactive job submission

oarsub -I

If the cluster isn't full, your job will be executed almost immediately. The -I option stands for “interactive”: it opens a shell on a computation node where you can run your experiment. This interactive mode is useful to run some small tests (the session is limited in time, at most 2 hours) before using the standard submission mode.
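For instance, you can combine the interactive mode with a resource request, using the same -l syntax as described below:

oarsub -I -l "nodes=1/core=4"   # interactive shell with 4 cores on one node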

Standard job submission

To submit a script in batch mode, simply pass it to oarsub:

oarsub my_script.sh

The command gives you the prompt back immediately and the job is executed later: this is called “batch mode”. The standard output and error are redirected to OAR.<job_ID>.stdout and OAR.<job_ID>.stderr in the directory from which you submitted the job.

Attention: Your script my_script.sh must have execution rights (more on Linux file permissions can be found here). You can modify access rights with the following command: chmod u+x my_script.sh.
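For reference, here is a minimal sketch of what my_script.sh could contain (the experiment command is a hypothetical placeholder):

  #! /bin/bash
  echo "running on $(hostname)"
  python my_experiment.py   # placeholder for your actual experiment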

To cancel a job (you can get the <job_ID> from oarstat), use:

oardel <job_ID>

A visual overview of the nodes is given by the Monika software (cpu cluster and gpu cluster).

Note: Some tools (Python scripts) were designed in the team to parse oarstat results into a more human-readable format; check out the thoth_utils Gitlab webpage. These scripts are also available in the directory /home/thoth/gdurif/shared/thoth_utils (which you can add to the PATH in your .bashrc or equivalent, for instance). The most interesting scripts here are gpustat, which summarizes the current use of the GPU cluster, and oarstat++.py, which can be used on both clusters (oarstat++.py -s edgar or oarstat++.py -s access2-cp). Other options are documented (see oarstat++.py -h).

To connect to the node where one of your jobs is running (e.g., to inspect it):

oarsub -C <job_ID>

To specify the maximum duration (walltime) of a job, here 16 hours:

oarsub -l "walltime=16:0:0" <your_command>

Default and besteffort jobs

Unless specified otherwise, all jobs that you submit are default jobs. Once started, they run without interruption until they finish, crash, or reach their walltime. To ensure a fair share of the resources between users, the resources allocated to the default jobs of a single user are limited (cf. below).

To submit a job in best effort mode, you should add the options -t besteffort -t idempotent to your submission:

oarsub -l xxx -t besteffort -t idempotent <your_command>

A besteffort job will be killed whenever its resources are needed (for waiting default jobs) and restarted automatically (provided you did not forget the -t idempotent option). However, you have to handle checkpointing yourself.
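Putting everything together, a complete besteffort submission could look like the following (the resource values are just an example):

oarsub -l "nodes=1/core=8,walltime=16:0:0" -t besteffort -t idempotent ./my_script.sh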

CPU cluster

oarsub -l "nodes=1/core=16" <your_command>

To get 8 cores on 1 machine for at most 16 hours, use:

oarsub -l "nodes=1/core=8,walltime=16:0:0" <your_command>

You are sharing the RAM with other jobs. Make sure you specify “nodes=1”, so that all your cores (and the corresponding RAM) are reserved on the same machine.

Attention: Your main concern should be memory usage. Before submitting a job, you should estimate the memory it will require and request a computation node with sufficient memory (see below). Since OAR does not account for the memory required by submitted jobs, multiple memory-heavy jobs can be assigned to the same computation node, causing issues and potentially crashing the node. It is therefore recommended that your experiments do not waste memory (for safety, you can use the ulimit bash command in your script, as sketched below).

oarsub -l "nodes=1/core=8,walltime=16:0:0" -p "mem>64" <your_command>

Note: More information about the shared cluster can be found in the dedicated page, in particular regarding the use of the resources from our team (THOTH) or from other teams.

The CPU computation nodes are Xeons with 8 to 48 cores. They have at least 32 GB of RAM (you can find out how much RAM each machine has with the command oarnodes, property 'mem'). All nodes run Ubuntu 16.04 x86_64 with a set of common packages installed by default. You can install packages on the cluster nodes; however, if you feel you need them everywhere, you should seek the help of a system administrator. On all nodes, you can use /tmp and /dev/shm for (local) temporary files. Global data has to be stored on the scratches.

GPU cluster

Important: On the GPU cluster, many nodes have more than 1 GPU. You are responsible for using the GPU that was allocated to you by OAR.

To do so, you have to source the file gpu_setVisibleDevices.sh (available on all nodes) at the beginning of the script that you submit to OAR. This file sets the CUDA_VISIBLE_DEVICES environment variable. Thanks to this, you no longer have to specify a GPU ID when running your single-GPU experiments: for example, if you are assigned GPU number 5 on gpuhost21, it will show up as “gpuid=0” in your job. Thus, your (bash/zsh) script should look something like:

source gpu_setVisibleDevices.sh
# ... your setup ...
GPUID=0 # ALWAYS 0: the GPU allocated to you is remapped to ID 0
python my_cuda_code.py --gpu "$GPUID" # placeholder for your actual CUDA program
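To double-check the mapping, you can print the variable set by the sourced script at the beginning of your job:

echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"   # physical GPU(s) allocated by OAR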

Note 1: the previous procedure based on the script gpu_getIDs.sh is now deprecated.

Note 2: If you don't use a bash/zsh script but submit a command directly to OAR, you have to modify your submission to source gpu_setVisibleDevices.sh before running your command, for instance:

oarsub -p ... -l ... "source gpu_setVisibleDevices.sh; python myscript.py"

Attention: You have to check that the program/code you use is compatible with setting the CUDA_VISIBLE_DEVICES environment variable. Standard libraries such as TensorFlow, Caffe, and PyTorch (and anything based on CUDA) are compatible. If in doubt, you can contact your system administrators before submitting your job on the cluster.

To check GPU memory consumption and utilization, you can connect to the node where your computations run (with oarsub -C <job_ID>) and run:

nvidia-smi

If you are doing everything right, you will see a decent GPU-Util percentage for the GPU you are using.

To reserve 2 GPUs for a single job (if you run multi-GPU computations, for instance), you can use the option -l "host=1/gpuid=2" for oarsub, which can be combined with a walltime option, for instance:

oarsub -l "host=1/gpuid=2,walltime=48:0:0" <your_command>

In this case (multi-GPU code), sourcing gpu_setVisibleDevices.sh will set the visible GPU IDs accordingly, and your (bash/zsh) script should look something like:

source gpu_setVisibleDevices.sh
# ... your setup ...
GPUID=0,1 # ALWAYS 0,1: the two GPUs allocated to you are remapped to IDs 0 and 1
python my_cuda_code.py --gpus "$GPUID" # placeholder for your actual CUDA program

The next section gives you other examples of oarsub commands.

See the man pages or the online OAR documentation for more info.

The GPU worker nodes are called gpuhost1, ..., gpuhost22. Some of them are desktop machines, some of them are racked. They run the same base configuration as desktop computers, and you also have the rights to install packages. They generally have a scratch attached, which means you can get high-volume local storage if that is one of your computation requirements.

You can run the command oarnodes on edgar to see the available resources (e.g., which GPU model) on the cluster, or check the files listing the machines of the team (txt or spreadsheet).

The latest version (or a very recent one) of the NVIDIA drivers is installed on each node. However, you have to install your own version of CUDA somewhere in your scratchdir (or use a version installed by one of your colleagues) and correctly set PATH and LD_LIBRARY_PATH in your scripts to use it.
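For instance, assuming a CUDA install in your scratch directory (the path below is hypothetical), your script could contain:

  # Point to a personal CUDA install (hypothetical path); adapt to your setup.
  export PATH="/scratch/$USER/cuda/bin:$PATH"
  export LD_LIBRARY_PATH="/scratch/$USER/cuda/lib64:$LD_LIBRARY_PATH"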


GPU cluster OAR cheat sheet

oarsub -I # reserve 1 GPU interactively for 2 hours
oarsub -l "walltime=48:0:0" "/path/to/my/script.sh" # reserve 1 GPU for 48 hours
oarsub -p "gpumodel='titan_xp'" -l "walltime=48:0:0" "/path/to/my/script.sh" # reserve 1 Titan Xp for 48 hours
oarsub -p "host='gpuhost5'" -l "walltime=48:0:0" "/path/to/my/script.sh" # reserve 1 GPU on gpuhsot5 for 48 hours
oarsub -p "host='gpuhost3' and gpuid='1'" -l "walltime=48:0:0" "/path/to/my/script.sh" # reserve GPUID 1 on gpuhost3 for 48 hours

oarsub -p "host='gpuhost1'" -l "gpuid=4,walltime=48:0:0" "/path/to/my/script.sh" # reserve 4 GPUs on gpuhost1
oarsub -p "gpumodel='titan_black'" -l "host=1/gpuid=2" -l "walltime=48:0:0" "/path/to/my/script.sh" # reserve 2 Titan Blacks on a single host

oarsub -r "2015-05-24 15:20:00" -l "walltime=48:0:0" "/path/to/my/script.sh" # reserve a GPU in advance

oarsub -p "host='gpuhost3'" -l "host=1" -t besteffort -t idempotent "/path/to/my/script.sh" # reserve a whole node in besteffort (acceptable for CPU jobs)
oarsub -p "gpuid='0'" -l "walltime=48:0:0" "/path/to/my/script.sh" # reserve exclusively GPUID 0, thus no need to modify your code (default GPU is 0)

Etiquette

Some rules to use the cluster efficiently and to show courtesy towards your colleagues. :)

Do not run any jobs on the submission nodes (edgar, access{1,2}-cp), which are just gateways to the different clusters, nor on clear (the former frontend of the CPU cluster), which is a vital point of the THOTH network, notably as DHCP and NFS server (you don't want it to crash!).

The use of persistent terminal multiplexers (screen, tmux) is forbidden on the nodes themselves. Run your screen session from the OAR server or from your desktop machine.

Some of these rules are not enforced automatically, but if some jobs do not follow them, the administrators may suspend or kill them. Of course, exceptions can be made, especially during paper submission periods, but always by first discussing with your colleagues, and in a good spirit.

CPU cluster

Submitting a job on n cores means that your job must be able to fully use n cores (cf. parallelization, at code or thread/process level). An exception is made for jobs that require large amounts of RAM: in those cases, inflate the number of allocated cores in proportion to the RAM you use (for example, on a node with 8 cores and 32 GB of RAM, a job needing 16 GB of RAM should reserve at least 4 cores, even if it cannot use them all). If you use too much memory, the machine might stall completely (what we call swap thrashing), requiring a manual reboot by a sysadmin.

Avoid overloading the OAR server by launching too many jobs in a row. There should not be many more jobs in the waiting state than cores.

In the same spirit, you shouldn't occupy more than 100 cores on the CPU cluster with default jobs. If you need more, switch to besteffort jobs.

It is forbidden to reserve a node via the command sleep.

GPU cluster

You are limited to 3 default jobs on the GPU cluster, to prevent a single user from reserving all the resources. If you need more, you can use besteffort jobs.

You should control the memory (as in RAM) and CPU consumption of your GPU jobs, since they share worker nodes with jobs from other people (nodes have 1, 2, 4 or 8 GPUs).

In addition, you should avoid using “big” GPUs if your task does not require much memory. You can specify the GPU model or GPU memory with the options -p "gpumodel='XX'" or -p "gpumem>'10000'" (note the importance of the single quotes around 10000).
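For example, a submission targeting any GPU with more than 10 GB of memory could look like:

oarsub -p "gpumem>'10000'" -l "walltime=24:0:0" "/path/to/my/script.sh"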

Reserving a GPU via the command sleep can be tolerated but is highly discouraged.

Equalizer script: in the past, it happened that a single user took all the empty resources with besteffort jobs, forcing besteffort jobs from other users to wait (when all users already had their quota of default jobs), which was not a fair repartition of resources. To avoid this, we set up an “equalizer” script that regularly and automatically kills the last submitted job of the user with the largest number of besteffort jobs whenever besteffort jobs from other users are waiting, enforcing a redistribution of the resources. So do not be surprised if some of your besteffort jobs are killed and resubmitted from time to time.

Fixing OAR lockup: sometimes OAR becomes completely unresponsive. Jobs can still be submitted using “oarsub”, however the list of jobs in “Waiting” state keeps growing, even though resources are available. In those cases and when admins are not around, you can use the following command on edgar:

sudo /bin/OAR_hard_reset.sh

It will attempt to restart OAR forcefully. Double-check that OAR is locked up before running this command.


OAR Scripts

OAR has another way of specifying all options for a job. Just pass it a script name:

  oarsub --scanscript ./myscript.sh

OAR will then process any line starting with #OAR. Here is an example:

  #! /bin/bash
  
  #OAR -l nodes=1/core=10
  #OAR -n mybigbeautifuljob
  #OAR -t besteffort
  #OAR -t idempotent
  
  echo "start at $(date +%c)"
  sleep 10
  echo "start at $(date +%c)"

In this way, every job can be a fairly self-contained description of what to do and which resources are needed. In case you are not a big fan of wrapping all your Python scripts in shell scripts, this also seems to work with arbitrary text files:

  #! /usr/bin/env python
  
  #OAR -t besteffort
  #OAR -t idempotent
  #OAR -l nodes=1/core=10
  
  import time
  from datetime import datetime
  
  print(datetime.now())
  time.sleep(20)
  print(datetime.now())
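Assuming --scanscript handles such a file the same way as a shell script, the submission is unchanged:

  oarsub --scanscript ./myscript.py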

Notes

Utilities

Some useful scripts are available in the thoth_utils repository available here and on any machine of the team in the following directory /home/thoth/utils/thoth_utils.

Here is a short description of their usage:

Shell Aliases

Two useful commands to start interactive sessions on the CPU and GPU cluster, respectively:

  ###################################################
  # Start an interactive CPU job in the current directory
  #(2 cores by default)
  # E.g. to make a 20 core job:
  # $ c 20
  ###################################################
  function c () {
    ssh -tA access2-cp "oarsub -I  -d $(pwd) -l nodes=1/core=${1:-2}"
  }
  ###################################################
  
  
  ###################################################
  # Start an interactive GPU job in the current directory.
  # I use a gtx 980 for debugging to not block the more powerful GPUs.
  # Defined as a function (not an alias) so that $(pwd) is evaluated
  # when you call g, not when your shell configuration is sourced.
  ###################################################
  function g () {
    ssh -t edgar "oarsub -I -d $(pwd) -p \"gpumodel='gtx980'\""
  }
  ###################################################

Web visualisations