Taking advantage of multiple processors/cores

Many of the newer Lear machines have multiple processors/cores. We assume that each core counts as a separate CPU (which is how Linux presents them). To take advantage of them, the programmer must run several processes or threads in parallel. This document gives some simple recipes for doing this.

1. How many processors are there?

Lear machines have between 2 and 64 CPUs. To find out how many of them there are:

In bash:

ncpu=$(grep -c '^processor' /proc/cpuinfo)

In Python (alternatively, multiprocessing.cpu_count() returns the same number):

ncpu=len([1 for l in open("/proc/cpuinfo","r") if l.startswith("processor")])

In C:

#define _GNU_SOURCE   /* exposes sched_getaffinity */
#include <sched.h>

/* Returns the number of CPUs this process is allowed to run on
   (normally all of them, unless the affinity has been restricted). */
int count_cpu() {
  cpu_set_t set;
  sched_getaffinity(0, sizeof(cpu_set_t), &set);
  int i, count = 0;
  for(i = 0; i < CPU_SETSIZE; i++)
    if(CPU_ISSET(i, &set)) count++;
  return count;
}

Typically, for CPU-intensive operations, one runs as many tasks (threads or processes) as CPUs: fewer would under-use the available resources, and more would cause unnecessary context switches.

2. Distributing a set of operations

Here we assume that operations 0 to n-1 must be executed on ncpu cores.

2.1 In shell

This is a simple solution when each operation is an executable or a shell function that can be run in separate processes:

trap 'kill -HUP 0' 0 # kill remaining background jobs when the script exits
for((cpu=0; cpu<ncpu; cpu++)); do
  for((i=cpu; i<n; i+=ncpu)); do
     operation $i
  done &
done

wait
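
For example, the loop above could compress a list of files, with operation defined as a shell function beforehand (the file names here are hypothetical):

files=(*.log)                         # hypothetical list of input files
n=${#files[@]}
operation() { gzip "${files[$1]}"; }  # operation i compresses the i-th file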

There is no load balancing: the operations should all take approximately the same time, or n should be big enough so that the differences in computing times are averaged out.

To do load balancing, we can use a chain of lock files. For each i, exactly one process succeeds in renaming $prefix.$i to $prefix.$((i+1)); that process runs operation i, and the renamed file unlocks task i+1 for whichever free process claims it first:

prefix=/tmp/task_lock.$$ # some non-existing name
rm -f $prefix*
touch $prefix.0

trap 'kill -HUP 0' 0 # kill remaining background jobs when the script exits
for((cpu=0; cpu<ncpu; cpu++)); do
  for((i=0; i<n; i++)); do
     if mv $prefix.$i $prefix.$((i+1)) 2> /dev/null; then
       operation $i
     fi
  done &
done

wait

This relies on mv being an atomic operation. The lock files should therefore be kept on a local disk (/tmp), because files on NFS may get de-synchronized.

2.2 In shell, over several machines

The method above can also be used as a simple batch scheduler, to distribute jobs over several machines. This time the operation cannot be a shell function; it has to be a script available on the remote machines:

prefix=/tmp/task_lock.$$ # some non-existing name
rm -f $prefix*
touch $prefix.0

trap 'kill -HUP 0' 0 # kill remaining background jobs when the script exits
for machine in edmund cepheus anonio; do
  for((i=0; i<n; i++)); do
     if mv $prefix.$i $prefix.$((i+1)) 2> /dev/null; then
       ssh $machine "operation.sh $i"
     fi
  done &
done

wait

Similarly, on the cluster, if 10 nodes are reserved with oarsub -lnodes=10 -I (replace -I with a script name once everything works):

prefix=/tmp/task_lock.$$ # some non-existing name
rm -f $prefix*
touch $prefix.0

trap 'kill -HUP 0' 0 # kill remaining background jobs when the script exits

for machine in $( cat $OAR_NODEFILE ); do
  for((i=0; i<n; i++)); do
     if mv $prefix.$i $prefix.$((i+1)) 2> /dev/null; then
       oarsh $machine "operation.sh $i"
     fi
  done &
done

wait

2.3 In Python

This piece of code calls a function on multiple CPUs:

import multiprocessing
from multiprocessing.dummy import Pool as ThreadPool

def operation(i):
   # complicated processing on element i
   pass

pool = ThreadPool(multiprocessing.cpu_count())
pool.map(operation, range(n))

This provides load balancing. However, Python has a global interpreter lock (the GIL) that prevents Python code from executing concurrently in several threads. The lock is explicitly released by I/O-bound functions and by carefully wrapped C functions (like many numpy functions), but plain Python code will not run faster with multiple threads.

The solution is to use processes instead of threads: just use multiprocessing.Pool instead of the ThreadPool above.
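
For instance, here is a minimal process-based sketch of the same example (the body of operation is a placeholder; note that with multiprocessing.Pool, operation must be a top-level function, so that it can be pickled and sent to the worker processes):

import multiprocessing

n = 100   # number of tasks (placeholder)

def operation(i):
   # complicated processing on element i
   return i * i   # placeholder

if __name__ == "__main__":   # prevents workers from re-executing this block
   pool = multiprocessing.Pool(multiprocessing.cpu_count())
   results = pool.map(operation, range(n))
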
Finally, beware that errors are not correctly propagated by thread and process pools, so code should be validated as much as possible before being run through them.

2.4 In C

The easiest way is to use OpenMP. A loop can be distributed over ncpu threads with:

#include <omp.h> 

#pragma omp parallel for num_threads(ncpu) schedule(dynamic) 
   for(i=0; i<n; i++) {
     /* complicated processing on element i */ 
   }

If all operations are expected to take approximately the same time, the schedule(dynamic) clause can be left out. Be careful to declare the local variables used in the "complicated processing" part inside the braces, otherwise they are shared between threads (fireworks!); see the sketch below.

The C code has to be compiled with the -fopenmp flag (with gcc).
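
Here is a minimal sketch of the shared-variable pitfall (input, output, n and ncpu are hypothetical placeholders):

double tmp; /* declared outside the loop: shared between all threads */
int i;      /* the loop index itself is made private by OpenMP */

#pragma omp parallel for num_threads(ncpu) schedule(dynamic)
   for(i=0; i<n; i++) {
     tmp = 2.0 * input[i];  /* WRONG: all threads write into the same tmp */
     output[i] = tmp + 1.0;
   }

#pragma omp parallel for num_threads(ncpu) schedule(dynamic)
   for(i=0; i<n; i++) {
     double tmp = 2.0 * input[i];  /* correct: each thread has its own tmp */
     output[i] = tmp + 1.0;
   }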

An alternative is a helper function, compute_tasks, that calls a function on multiple CPUs. It is used like this:

void operation(void *arg, int tid, int i) {
   /* complicated processing on element i */
   /* tid is the index in 0..ncpu-1 of the thread */
   /* arg is an arbitrary argument */
}

compute_tasks(n, ncpu, &operation, arbitrary_argument);
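
The implementation of compute_tasks is not reproduced here; below is a minimal sketch of how it could be written with pthreads, handing tasks out through a mutex-protected counter (which also provides load balancing). Compile and link with -pthread.

#include <pthread.h>
#include <stdlib.h>

typedef struct {
  pthread_mutex_t lock;
  int next, n;                 /* next task to hand out, total task count */
  void (*f)(void *arg, int tid, int i);
  void *arg;
} task_pool_t;

typedef struct {
  task_pool_t *pool;
  int tid;                     /* index of this thread in 0..ncpu-1 */
} worker_arg_t;

static void *worker(void *p) {
  worker_arg_t *wa = p;
  task_pool_t *pool = wa->pool;
  for(;;) {
    pthread_mutex_lock(&pool->lock);
    int i = pool->next++;      /* claim the next undone task */
    pthread_mutex_unlock(&pool->lock);
    if(i >= pool->n) break;    /* all tasks have been handed out */
    pool->f(pool->arg, wa->tid, i);
  }
  return NULL;
}

void compute_tasks(int n, int ncpu,
                   void (*f)(void *arg, int tid, int i), void *arg) {
  task_pool_t pool = { .next = 0, .n = n, .f = f, .arg = arg };
  pthread_t *threads = malloc(ncpu * sizeof *threads);
  worker_arg_t *was = malloc(ncpu * sizeof *was);
  int t;
  pthread_mutex_init(&pool.lock, NULL);
  for(t = 0; t < ncpu; t++) {
    was[t].pool = &pool; was[t].tid = t;
    pthread_create(&threads[t], NULL, worker, &was[t]);
  }
  for(t = 0; t < ncpu; t++)
    pthread_join(threads[t], NULL);
  pthread_mutex_destroy(&pool.lock);
  free(was); free(threads);
}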

3. A few principles

3.1 Am I using all the CPUs I can?

top should display a total CPU usage of ncpu*100% for the program (pressing 1 in top shows the load of each CPU separately).

3.2 Levels

Usually, it is both easier and more efficient to multitask at the highest/coarsest level possible. In particular, with two nested loops, prefer parallelizing the outer loop when possible, as in the sketch below.
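
A minimal OpenMP illustration (process_element, n and m are hypothetical placeholders):

int i, j;

/* good: the threads are created once and each gets large chunks of work */
#pragma omp parallel for
for(i = 0; i < n; i++)
  for(j = 0; j < m; j++)
    process_element(i, j);

/* worse: the threads are created and joined n times, on small chunks */
for(i = 0; i < n; i++) {
  #pragma omp parallel for
  for(j = 0; j < m; j++)
    process_element(i, j);
}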

3.3 Data dependency

The operations must write into separate memory locations and/or files (for example, operation i only writes to out[i]), because there is no guarantee on the ordering of the operations.

3.4 Beware of the busy wait

Parallelized code typically uses event-based programming (like GUIs), the events being the start/end of a computation, user interaction, etc.

It is tempting to implement this by checking ("polling") every x seconds whether an event has occurred or a state has changed. This is usually not advisable: if x is small or 0, the checking consumes resources without doing anything useful most of the time ("busy wait"); if x is large, event processing will not be reactive.

To avoid busy waits, event processing code should wait for events like the end of a subprocess (wait in shell) or of a thread (pthread_join in C), data becoming available on a file descriptor (read or select in C), or synchronization events in threaded code (pthread_mutex_lock, pthread_cond_wait in C).
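
For example, here is a minimal sketch of waiting for the end of a computation with a condition variable instead of polling a flag (compile with -pthread):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t done_cond = PTHREAD_COND_INITIALIZER;
static int done = 0;

static void *worker(void *arg) {
  /* ... long computation ... */
  pthread_mutex_lock(&lock);
  done = 1;                        /* record the event... */
  pthread_cond_signal(&done_cond); /* ...and wake up the waiter */
  pthread_mutex_unlock(&lock);
  return NULL;
}

int main() {
  pthread_t t;
  pthread_create(&t, NULL, worker, NULL);
  pthread_mutex_lock(&lock);
  while(!done)                     /* sleeps without consuming CPU */
    pthread_cond_wait(&done_cond, &lock);
  pthread_mutex_unlock(&lock);
  pthread_join(t, NULL);
  printf("computation finished\n");
  return 0;
}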