2. Environment Configuration



The Beehive system codebase and applications are contained in two directories. The following example is for codebase version 3.9.3.


2.1 CodeBase and TestProg Directory Structure

$BEEHIVE_HOME should be set to the directory that contains the Beehive directory. The Beehive directory holds two directories, one for the codebase and the other for application programs, as shown below:

        .
        |__ Beehive
            |-- BeehiveCodeBase
            |   |__ BeehiveCodeBase-V3.9.3
            |       |__ src
            |           |-- HashTable.thrift
            |           |-- schema.thrift
            |           |-- beehive
            |           |   |-- util
            |           |   |-- server
            |           |   |-- thrift
            |           |   |-- validationService
            |           |   |-- workpool
            |           |   |__ worker
            |           |
            |           |__ lib
            |               |__ <Dependencies for Thrift>
            |
            |__ BeehiveTestProgs
                |__ BeehiveTestProgs-V3.9.3
                    |__ src
                        |__ TestProgs
                            |-- GraphColoring
                            |-- ShortestPath
                            |-- MaxClique
                            |-- <Various application program examples>
                            |__ util
                                |__ BeehiveAppLoader.java
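
Before setting any environment variables, it can help to confirm that the tree above actually exists on disk. The following is a minimal sketch, not part of Beehive: the function name check_beehive_layout is hypothetical, and it assumes, as the exports in Section 2.2 do, that $BEEHIVE_HOME is the directory containing Beehive.

```shell
# check_beehive_layout is a hypothetical helper, not part of Beehive.
# $1 = candidate value for BEEHIVE_HOME, $2 = version string (e.g. 3.9.3)
check_beehive_layout() {
    home=$1
    version=$2
    status=0
    # Directories taken from the tree shown above.
    for dir in \
        "$home/Beehive/BeehiveCodeBase/BeehiveCodeBase-V$version/src/beehive" \
        "$home/Beehive/BeehiveCodeBase/BeehiveCodeBase-V$version/src/lib" \
        "$home/Beehive/BeehiveTestProgs/BeehiveTestProgs-V$version/src/TestProgs"
    do
        if [ ! -d "$dir" ]; then
            # Report every missing directory instead of stopping at the first.
            echo "missing: $dir" >&2
            status=1
        fi
    done
    return "$status"
}
```

Calling check_beehive_layout "$BEEHIVE_HOME" 3.9.3 prints each missing directory to standard error and returns a non-zero status if any of them is absent.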
    

2.2 Bash Environment Variables

The following environment variables should be set for bash, as shown below:

        export BEEHIVE_CODEBASE_VERSION="BeehiveCodeBase-V3.9.3"
        export BEEHIVE_TESTPROGS_VERSION="BeehiveTestProgs-V3.9.3"
        export BEEHIVE_CODEBASE="$BEEHIVE_HOME/Beehive/BeehiveCodeBase/$BEEHIVE_CODEBASE_VERSION/src"
        export BEEHIVE_TESTPROGS_SRC="$BEEHIVE_HOME/Beehive/BeehiveTestProgs/$BEEHIVE_TESTPROGS_VERSION/src"
        export BEEHIVE_TESTPROGS="$BEEHIVE_TESTPROGS_SRC/TestProgs"
        export THRIFT_JARS="$BEEHIVE_CODEBASE/lib/*"
    

To use the various example graphs for testing, GRAPH_HOME should be set appropriately, as shown in the example below:

        export GRAPH_HOME="/project/cluster16/GraphGen"
        export GRAPH_FGRAPH="$GRAPH_HOME/F"
    

The Java classpath should be set as follows:

        export CLASSPATH=".:$BEEHIVE_CODEBASE:$BEEHIVE_TESTPROGS_SRC:$GRAPH_FGRAPH:$GRAPH_HOME:$THRIFT_JARS:$CLASSPATH"
        export PATH=".:$JAVA_HOME:$THRIFT_JARS:$PATH"
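
A mistyped variable name or a missing $ in the exports above leaves a variable empty or pointing at the wrong place, and the error only surfaces later when java cannot find a class. As a minimal sketch, assuming the variable names used in this section, the hypothetical helper below (check_beehive_env is not part of Beehive) reports every variable that is unset or empty.

```shell
# check_beehive_env is a hypothetical helper, not part of Beehive.
# It reports every variable from this section that is unset or empty.
check_beehive_env() {
    missing=0
    for name in BEEHIVE_HOME BEEHIVE_CODEBASE_VERSION BEEHIVE_TESTPROGS_VERSION \
                BEEHIVE_CODEBASE BEEHIVE_TESTPROGS_SRC BEEHIVE_TESTPROGS \
                THRIFT_JARS GRAPH_HOME GRAPH_FGRAPH CLASSPATH
    do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "not set: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Running check_beehive_env after sourcing the exports returns a zero status only when all of the listed variables have a value. It does not check that the paths exist.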
    

2.3 Command Scripts for Executing an Application Program

An application program is launched using three command scripts, which are executed in the application's example code directory. For the GraphColoring problem, for instance, the scripts are executed in the following directory:

        $BEEHIVE_HOME/Beehive/BeehiveTestProgs/BeehiveTestProgs-V3.9.3/src/TestProgs/GraphColoring/
    

We first need to create a file containing the list of the cluster nodes on which we want to execute the parallel program. For example, to execute a program on a cluster of four nodes, we create a file, say named 4-nodes, containing the hostnames of the nodes as follows:

        Example hostlist file::4-nodes
        nuclear01.cs.umn.edu
        nuclear02.cs.umn.edu
        nuclear03.cs.umn.edu
        nuclear04.cs.umn.edu
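
When the node names follow a regular numbering scheme, as in the example above, the hostlist file can be generated rather than typed by hand. The sketch below assumes two-digit, zero-padded node numbers; the function name make_hostlist is hypothetical and not part of Beehive.

```shell
# make_hostlist is a hypothetical helper, not part of Beehive.
# $1 = hostname prefix, $2 = domain, $3 = number of nodes
make_hostlist() {
    i=1
    while [ "$i" -le "$3" ]; do
        # %02d zero-pads the node number, so 1 becomes 01.
        printf '%s%02d.%s\n' "$1" "$i" "$2"
        i=$((i + 1))
    done
}
```

Running make_hostlist nuclear cs.umn.edu 4 > 4-nodes writes the four hostnames listed above, one per line, to the file 4-nodes.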
    

For the GraphColoring problem, a Java program called GCTest.java will be executed on each node in the cluster.

The details of developing a parallel program such as GCTest.java are discussed in the following chapters.

Before we launch the execution, there are several other steps that need to be performed as shown below:

  1. Create a configuration file, say named configFile, and store it in the application program's directory, e.g. GraphColoring. The details of preparing the configFile are given in the next chapter.
  2. Start the ValidationService on a dedicated host. Suppose that we will execute it on a host named jupiter.cs.umn.edu. The details are given below:

    1. Log onto the host running the ValidationService and make sure the bash environment variables are correctly set. Execute the file called startValidator which contains the following command:
       java beehive.workpool.GlobalWorkpoolImpl configFile
    2. Confirm that the ValidationService is running: it prints out the configFile when it starts.

There are three scripts that launch the execution of the GraphColoring parallel program:

  1. runGC
  2. start-GC.sh
  3. run-command-cluster_GC.sh

The program execution is launched using run-command-cluster_GC.sh, which executes start-GC.sh on each node, which in turn executes runGC.

Before you launch the program, you need to follow the steps detailed below:

Step 1: Edit runGC

The structure of this file is shown below:

        File:: runGC
        java -Xms4096m -Xmx8192m -XX:+UseG1GC TestProgs.GraphColoring.GCTest 4-nodes configFile jupiter.cs.umn.edu $GRAPH_FGRAPH/fgraph-50000-100-100.pp 50000 configFile 2>&1 &
    

The command given in runGC is executed on each of the cluster nodes when the program execution is launched. Several options are passed to the JVM to set the initial and maximum heap memory and the garbage collector to be used. These options play a critical role when executing a program on large data sets. The command also specifies the input graph to be used, which in this case is fgraph-50000-100-100 (a 50K-node graph). The hostlist filename (4-nodes in this example) is given as an argument to the program. The hostname of the ValidationService, which is jupiter.cs.umn.edu in this example, is also given as one of the arguments to the GCTest program.
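
Because the runGC command is one long line, it is easy to misplace an argument when editing it. The sketch below rebuilds the same command from named pieces so that each argument described above is labeled. The function build_gc_command and its variable names are hypothetical; it only prints the command and does not run it. The last two arguments are copied verbatim from the example, since this section does not describe them.

```shell
# build_gc_command is a hypothetical helper, not part of Beehive.
# It prints the runGC command line without executing it.
build_gc_command() {
    # -Xms: initial heap size, -Xmx: maximum heap size, -XX:+UseG1GC: G1 collector
    jvm_opts="-Xms4096m -Xmx8192m -XX:+UseG1GC"
    hostlist="4-nodes"                      # file listing the cluster nodes
    config="configFile"                     # configuration file
    validator_host="jupiter.cs.umn.edu"     # host running the ValidationService
    graph_file="$GRAPH_FGRAPH/fgraph-50000-100-100.pp"   # input graph
    # "50000" and the second configFile are kept exactly as in the example.
    echo "java $jvm_opts TestProgs.GraphColoring.GCTest" \
         "$hostlist $config $validator_host $graph_file 50000 $config"
}
```

With GRAPH_FGRAPH set as in Section 2.2, build_gc_command prints the java command from runGC without the trailing redirection and background operator, which makes it convenient to compare against an edited runGC file.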

Step 2: Edit start-GC.sh

The structure of start-GC.sh is shown below:

        File:: start-GC.sh
        # This script is run on each cluster host machine in order to set up a beehive node on that machine.
        # To be used with run-command-cluster_GC.sh to set up beehive nodes on multiple machines.
        # This starts the Worker processes on the nodes.
        dir="$BEEHIVE_TESTPROGS/GraphColoring" #modify to change the directory
        cmd="./runGC"
        #tcsh
        # terminate any running rmiregistry
        pgrep rmiregistry | xargs kill -9
        ps -ef | grep java | grep GC | tr -s ' ' | cut -f2 -d' ' | xargs kill -9
        # run the command to start beehive process
        cd "$dir" || exit 1
        $cmd
    

You need to make sure that the variable dir points to the GraphColoring program directory and that cmd is set to runGC.

Step 3: Edit run-command-cluster_GC.sh

The structure of this shell command file is shown below. You need to make sure that the variable script_dir is correctly set to the GraphColoring directory.

        #!/usr/bin/env bash
        # $1 - node lists file
        script_dir="$BEEHIVE_TESTPROGS/GraphColoring"
        for node in $(cat "$1")
        do
            echo "running command on $node"
            ssh $node "sh $script_dir/start-GC.sh &" &
        done
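
Before launching on a real cluster, it may be useful to see which commands the loop above would issue. The following dry-run sketch prints one ssh command per host instead of executing it. The function name dry_run_cluster and its second parameter are hypothetical, not part of Beehive.

```shell
# dry_run_cluster is a hypothetical helper, not part of Beehive.
# $1 = hostlist file, $2 = directory containing start-GC.sh
dry_run_cluster() {
    # Read one hostname per line; the extra test handles a final line
    # that has no trailing newline.
    while IFS= read -r node || [ -n "$node" ]; do
        # Skip blank lines in the hostlist file.
        [ -n "$node" ] || continue
        # Print the command instead of running it.
        printf 'ssh %s "sh %s/start-GC.sh &" &\n' "$node" "$2"
    done < "$1"
}
```

For example, dry_run_cluster 4-nodes "$BEEHIVE_TESTPROGS/GraphColoring" prints four ssh lines, one for each host in the file, so that a wrong directory or hostname can be spotted before anything is started.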
    

Step 4: Launch Parallel Program Execution

You are now ready to start the parallel execution of the GraphColoring program on a 4-node cluster. Execute the following command in the GraphColoring directory:

        run-command-cluster_GC.sh 4-nodes
    

Step 5: Terminating Program Execution

To terminate the program execution, because of errors or for any other reason, execute the following command:

        clear_my_java.sh 4-nodes GC
    

The first argument is the hostlist file, and the second argument is a unique string appearing in the program name, which is GCTest.java in this example.
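
The contents of clear_my_java.sh are not shown in this section. Purely as an illustration of what a script with these two arguments might do, and not as a description of the real script, the sketch below prints, for each host, an ssh command that would kill java processes whose command line contains the given string. The function name print_clear_commands and the use of pkill are assumptions.

```shell
# print_clear_commands is a hypothetical helper; the real clear_my_java.sh
# is not shown in this section and may work differently.
# $1 = hostlist file, $2 = string identifying the program (e.g. GC)
print_clear_commands() {
    while IFS= read -r node || [ -n "$node" ]; do
        # Skip blank lines in the hostlist file.
        [ -n "$node" ] || continue
        # pkill -f matches the pattern against the full command line;
        # the command is only printed here, not executed.
        printf "ssh %s \"pkill -9 -f 'java.*%s'\"\n" "$node" "$2"
    done < "$1"
}
```

Note that a short string such as GC also matches any java command line that contains it elsewhere, for example in the -XX:+UseG1GC option, so the string should be specific enough to identify only the intended program.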