ADLB

Asynchronous Dynamic Load Balancing




A paper about ADLB


Introduction

ADLB is a software library designed to help rapidly build scalable parallel programs.
The name (pronounced adlib) is the acronym for Asynchronous Dynamic Load Balancing. However, ADLB does not achieve scalability solely by load balancing. It also includes some features that exploit work-stealing as well. Indeed, we sometimes use the phrase instantaneous load balancing via work-stealing to describe ADLB.

ADLB will be available under the same license arrangement as MPICH2.

There are features of ADLB that are reminiscent of a variety of other systems, e.g. Linda.
However, ADLB is significantly different and unique in a number of ways.

Overview

Some Thoughts on Scalability

When one begins to ponder an approach to parallelization of a given algorithm, a natural alternative to consider is a Master/Slave Algorithm, much like depicted here:

However, one is forced to accept the fact that such an approach is not likely to be very scalable due to the probable bottleneck caused by slave ranks trying to access data on the shared queue which is managed by the master.

What one really wants is an approach where the shared queue is easily accessible to all ranks without need to go through an arbiter process. For example:

However, even in that case, it is possible for the shared queue to become a bottleneck if the shared queue requires synchronized access.

The ADLB approach to these problems is to develop a distributed shared queue. This distributed queue logically appears as a single queue to all the app ranks. Portions of the queue are kept at each of several server ranks. These servers are solely responsible for managing the queue and do not participate in the app's computation. The servers share information among themselves about the current state of the work queue. When an app rank wants a piece of data, it can contact a server to help it retrieve that data. Data can migrate to arbitrary servers depending on memory usage, load, etc. For example:

Some Initial Comments on ADLB

From the user's perspective, ADLB is relatively trivial to use. Any process can simply do a Put operation to place data into a shared space, and any process can later Reserve then Get that data. Thus, the typical mode of operation is for an application to treat the shared space as a repository for work units. Arbitrary processes can add work to, or retrieve work from, the pool as needed. In support of the basic operations, there are additional provisions to provide enhanced functionality. For example, when data is Put into the shared space, it can be marked as targeted for (only reservable by) a particular rank. There are also some special operations designed to boost performance as well as to gather statistical information about a run.

ADLB assumes the presence of MPI. For example, in the current implementation, the app is responsible for performing the MPI_Init operation before ADLB_Init. Further, ADLB uses MPI ranks to identify processses.

Details of the ADLB API are covered more fully in a later section below. In this section, we merely present a brief overview designed to quickly give you the basic idea. ADLB consists of only a very small number of procedures to invoke. The most widely used ones are:

Detailed Discussion of the API

Currently we support an API for C and Fortran programs. Thus, below we follow the format of some of the MPI documentation providing both the C and Fortran prototypes for each function.

Download

You can view the ADLB trunk from read-only svn: here.
Or, you can download it by executing this command in a terminal window: svn co http://svn.cs.mtsu.edu/svn/adlbm/trunk adlbm
We also try to keep an up-to-date tgz here: here.

The companion software (DMEM) can also be downloaded here: here.
Or, you can download it by executing this command in a terminal window: svn co http://svn.cs.mtsu.edu/svn/dmem/trunk dmem

Build

On a "typical" linux system, you should be able to just do the usual:

./configure
make

Note that since ADLB is still very much in development mode, there is currently no install target. You have to either work in the ADLB directory, or move the libraries to a directory where you want them, or create a Makefile elsewhere which points to the ADLB libraries.

For other systems, there are a few pre-defined arg sets that you can supply to configure. For example, on our sicortex, we use:

./configure --with-config-args=sicortex

This causes the file configargs/sicortex.cfg to supply arguments to the configure script. You can create your own customized configure options files specific to your own environment.

Testing

There are a number of small test programs that are targets in the Makefile. You might want to build and run the nq program (nqueens). For example:

make nq
mpiexec -l -n 2 nq
mpiexec -l -n 10 nq -q -n 9 -nservers 2

The first execution of nq prints the 2 solutions to a 4x4 nqueens puzzle. The program stalls for about 5 seconds, because nq halts only when ADLB has determined that it has reached exhaustion, i.e. no more processes are going to do any Put operations because they are all waiting for Reserves. At present, ADLB prints quite a bit of information at the beginning and end of each run. We will reduce the output later when we enter a production mode.

The second execution of nq runs 10 processes. Two of them are ADLB servers, and the remaining 8 are nq app ranks. The -n argument to nq says to do a 9x9 board puzzle.