Fast Startup with the Multipurpose Daemon and the ch_p4mpd Device



This device is experimental, and its version of mpirun is a little different from that of the other devices. To use this system, MPICH must have been configured with the ch_p4mpd device, and the daemons must have been started on the machines where you will be running. This section describes how the mpd system of daemons works, how to start the daemons, and how to run MPI programs using them.





Goals



The goal of the multipurpose daemon (mpd, together with the associated ch_p4mpd device) is to make mpirun behave like a single program even as it starts multiple processes to execute an MPI job. In what follows we distinguish between the mpirun process and the MPI processes it starts. Such behavior includes

* fast, scalable startup of MPI (and even non-MPI) processes. For those accustomed to using the ch_p4 device on TCP networks, this will be the most immediately noticeable change. Job startup is now much faster.
* collection of stdout and stderr from the MPI processes to the stdout and stderr of the mpirun process.
* delivery of mpirun's stdin to the stdin of MPI process 0.
* delivery of signals from the mpirun process to the MPI processes. This means that it is easy to kill, suspend, and resume your parallel job just as if it were a single process, with Ctrl-C, Ctrl-Z, and the bg and fg commands.
* delivery of command-line arguments to all MPI processes.
* copying of the PATH environment variable from the environment in which mpirun is executed to the environments in which the MPI processes are executed.
* use of an optional argument to provide other environment variables.
* use of a further optional argument to specify where the MPI processes will run (see below).





Introduction



The ch_p4 device relies by default on rsh for process startup on remote machines. The need for authentication at job startup time, combined with the sequential process by which contact information is collected from each remote machine and broadcast back to all machines, makes job startup unscalably slow, especially for large numbers of processes.

With Version 1.2.0 of mpich, we introduced a new method of process startup based on daemons. This mechanism, which requires configuration with a new device, has not yet been widely enough tested to become the default for clusters, but we anticipate that it eventually will become so. With Version 1.2.1 it has been significantly enhanced, and will now be installed when mpich is installed with make install. On systems with gdb, it supports a simple parallel debugger we call mpigdb.

The basic idea is to establish, ahead of job-startup time, a network of daemons on the machines where MPI processes will run, and also on the machine on which mpirun will be executed. Then job startup commands (and other commands) will contact the local daemon and use the pre-existing daemons to start processes. Much of the initial synchronization done by the ch_p4 device is eliminated, since the daemons can be used at run time to aid in establishing communication between processes.

To use the new startup mechanism, you must

* configure with the new device:
    configure -device=ch_p4mpd 

Add -opt=-g if you want to use the parallel debugger mpigdb, described below.
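For example, to select the device and enable debugging support in a single configure step (a sketch combining the two options above):
    configure -device=ch_p4mpd -opt=-g 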
* make as usual:
    make 

* go to the MPICH/mpid/mpd directory, where the daemon code is located and the daemons are built, or else put this directory in your PATH.
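For example, using setenv as in the examples later in this section (the path shown is illustrative; adjust it to your own installation):
    setenv PATH ${PATH}:/home/you/mpich/mpid/mpd 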
* start the daemons:

The daemons can be started by hand on the remote machines using the port numbers advertised by the daemons as they come up:

* On fire:
    fire% mpd & 

    [2] 23792 

    [fire_55681]: MPD starting 

    fire% 

* On soot:
    soot% mpd -h fire -p 55681 & 

    [1] 6629 

    [soot_35836]: MPD starting 

    soot% 


The mpd's are identified by a host and port number.

If the daemons do not advertise themselves, one can find the host and port by using the mpdtrace command:

* On fire:


    fire% mpd & 

    fire% mpdtrace 

    mpdtrace: fire_55681:  lhs=fire_55681  rhs=fire_55681  rhs2=fire_55681 

    fire% 

* On soot:
    soot% mpd -h fire -p 55681 & 

    soot% mpdtrace 

    mpdtrace: fire_55681:  lhs=soot_33239  rhs=soot_33239  rhs2=fire_55681 

    mpdtrace: soot_33239:  lhs=fire_55681  rhs=fire_55681  rhs2=soot_33239 

    soot% 

What mpdtrace shows is the ring of mpd's, identified by the hostname and port that can be used to introduce another mpd into the ring. The left and right neighbors of each mpd in the ring are shown as lhs and rhs respectively; rhs2 shows the daemon two steps away to the right (which in this two-mpd case is the daemon itself).

You can also use mpd -b to start the daemons as real daemons, disconnected from any terminal. This has advantages (the daemons survive after you log out) and disadvantages (their diagnostic output no longer appears on a terminal).
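A minimal sketch of this mode, using the option just described:
    fire% mpd -b 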


There is also a pair of scripts in the mpich/mpid/mpd directory that can help:
    localmpds <number>  

will start <number> mpds on the local machine. This is only really useful for testing. Usually you would do
    mpd & 

to start one mpd on the local machine. Then other mpd's can be started on remote machines via rsh, if that is available:
    remotempds <hostfile> 

where <hostfile> contains the names of the other machines to start the mpd's on. It is a simple list of hostnames only, unlike the format of the MACHINES files used by the ch_p4 device, which can contain comments and other symbols.
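For illustration, such a <hostfile> might contain nothing more than one hostname per line (the hostnames here are hypothetical):
    soot 
    ash 
    cinder 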

See also the startdaemons script, which will be installed when mpich is installed.

* Finally, start jobs with the mpirun command as usual:
    mpirun -np 4 a.out 


You can kill the daemons with the mpdallexit command.





Examples



Here are a few examples of using the mpirun that is built when MPICH is configured and built with the ch_p4mpd device.

* Run the cpi example
    mpirun -np 16 /home/you/mpich/examples/basic/cpi 

If you put /home/you/mpich/examples/basic in your path, with
    setenv PATH ${PATH}:/home/you/mpich/examples/basic 

then you can just do
    mpirun -np 16 cpi 

* You can get line labels on stdout and stderr from your program by including the -l option. Output lines will be labeled by process rank.
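For example (a sketch; placing -l before the program name, in the usual mpirun option style):
    mpirun -np 16 -l cpi 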

* Run the fpi program, which prompts for a number of intervals to use.
    mpirun -np 32 fpi 

The streams stdin, stdout, and stderr will be mapped back to your mpirun process, even if the MPI process with rank 0 is executed on a remote machine.

* Use arguments and environment variables.
    mpirun -np 32 myprog arg1 arg2 -MPDENV- MPE_LOG_FORMAT=SLOG \ 
                  GLOBMEMSIZE=16000000 

The argument -MPDENV- is a fence. All arguments after it are handled by mpirun rather than the application program.

* Specify where the first process is to run. By default, MPI processes are spawned by consecutive mpd's in the ring, starting with the one after the local one (the one running on the same machine as the mpirun process). Thus if you are logged into dion and there are mpd's running on dion and on belmont1, belmont2, ..., belmont64, and you type
    mpirun -np 32 cpi 

your processes will run on belmont1, belmont2, ..., belmont32. You can force your MPI processes to start elsewhere by giving mpirun optional location arguments. If you type
    mpirun -np 32 cpi -MPDLOC- belmont33 belmont34 ... belmont64 

then your job will run on belmont33, belmont34, ..., belmont64. In general, processes will only be run on machines in the list of machines after -MPDLOC-.

This provides an extremely preliminary and crude way for mpirun to choose locations for MPI processes. In the long run we intend to use the mpd project as an environment for exploring the interfaces among job schedulers, process managers, parallel application programs (particularly in the dynamic environment of MPI-2), and user commands.

* Find out what hosts your mpd's are running on:
    mpirun -np 32 hostname | sort | uniq 

This will run 32 instances of hostname (assuming /bin is in your path), regardless of how many mpd's there are. If there are more processes than mpd's, the extra processes wrap around the ring of mpd's.





How the Daemons Work



Once the daemons are started, they are connected in a ring.

A ``console'' process (mpirun, mpdtrace, mpdallexit, etc.) can connect to any mpd, which it does by using a Unix named socket set up in /tmp by the local mpd.

If it is an mpirun process, it requests that a number of processes be started, starting at the machine given by -MPDLOC- as described above. The location defaults to the mpd next in the ring after the one contacted by the console. Then the following events take place.

* The mpd's fork that number of manager processes (the executable is called mpdman and is located in the mpich/mpid/mpd directory). The managers are forked consecutively by the mpd's around the ring, wrapping around if necessary.
* The managers form themselves into a ring, and fork the application processes, called clients.
* The console disconnects from the mpd and reconnects to the first manager. stdin from mpirun is delivered to the client of manager 0.
* The managers intercept standard I/O from the clients, and deliver the command-line arguments and environment variables that were specified on the mpirun command. The sockets carrying stdout and stderr form a tree with manager 0 at the root.

At this point the situation looks something like Figure 1.


Figure 1: Mpds with console, managers, and clients

When the clients need to contact each other, they use the managers to find the appropriate process on the destination host. The mpirun process can be suspended, in which case it and the clients are suspended, but the mpd's and managers remain executing, so that they can unsuspend the clients when mpirun is unsuspended. Killing the mpirun process kills the clients and managers.
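A brief sketch of the suspend/resume behavior just described, using ordinary shell job control (the hostname and program are illustrative):
    donner% mpirun -np 8 cpi 
    ^Z                        # suspends mpirun and the clients; mpd's and managers keep running 
    Suspended 
    donner% bg                # resume the job in the background 
    donner% fg                # or bring it back to the foreground 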

The same ring of mpd's can be used to run multiple jobs from multiple consoles at the same time. Under ordinary circumstances, there still needs to be a separate ring of mpd's for each user. For security purposes, each user needs to have a .mpdpasswd file in the user's home directory, readable only by the user, containing a password. This file is read when the mpd is started. Only mpd's that know this password can enter a ring of existing mpd's.
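A sketch of creating such a file (the password shown is of course only an example):
    donner% echo mysecretword > ~/.mpdpasswd 
    donner% chmod 600 ~/.mpdpasswd 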

A new feature is the ability to configure the mpd system so that the daemons can be run as root. To do this, after configuring mpich you need to reconfigure in the mpich/mpid/mpd directory with --enable-root and remake. Then mpirun should be installed as a setuid program. Multiple users can use the same set of mpd's, which are run as root, although their mpirun, managers, and clients will be run as the user who invoked mpirun.
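A sketch of the steps just described (the exact configure invocation in that directory may vary with your installation):
    donner% cd mpich/mpid/mpd 
    donner% ./configure --enable-root 
    donner% make 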





Debugging



One of the commands supported by the mpd system is mpigdb, a simple parallel debugger that allows you to run all processes under the gdb debugger and interact with them one at a time or all together by redirecting stdin. Here is a simple example of running mpich/examples/cpi in this way:

donner% mpigdb -np 3    cpi                  # default is stdin bcast 

(mpigdb) b 33                                # set breakpoint for all 

0: Breakpoint 1 at 0x8049eac: file cpi.c, line 33. 

1: Breakpoint 1 at 0x8049eac: file cpi.c, line 33. 

2: Breakpoint 1 at 0x8049eac: file cpi.c, line 33. 

(mpigdb) r                                   # run all 

2: Breakpoint 1, main (argc=1, argv=0xbffffab4) at cpi.c:33 

1: Breakpoint 1, main (argc=1, argv=0xbffffac4) at cpi.c:33 

0: Breakpoint 1, main (argc=1, argv=0xbffffad4) at cpi.c:33 

(mpigdb) n                                   # single step all 

2: 43           MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD); 

0: 39               if (n==0) n=100; else n=0; 

1: 43           MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD); 

(mpigdb) z 0                                 # limit stdin to rank 0 

(mpigdb) n                                   # single step rank 0 

0: 41               startwtime = MPI_Wtime(); 

(mpigdb) n                                   # until caught up 

0: 43           MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD); 

(mpigdb) z                                   # go back to bcast 

(mpigdb) n                                   # single step all 

               ....                          # several times 

(mpigdb) n                                   # until interesting spot 

0: 52                   x = h * ((double)i - 0.5); 

1: 52                   x = h * ((double)i - 0.5); 

2: 52                   x = h * ((double)i - 0.5); 

(mpigdb) p x                                 # bcast print command 

0: $2 = 0.0050000000000000001                # 0's value of x 

2: $2 = 0.025000000000000001                 # 2's value of x 

1: $2 = 0.014999999999999999                 # 1's value of x 

(mpigdb) c                                   # continue all 

0: pi is approximately 3.1416009869231249, Error  0.0000083333333318 

0: Program exited normally. 

1: Program exited normally. 

2: Program exited normally. 

(mpigdb) q                                   # quit 

donner%  


