Environment Configuration and Command Scripts
Codebase and TestProgs Directory Structure
$BEEHIVE_HOME should be set to the path of the Beehive directory, which contains two directories, one for the codebase and the other for application programs, as shown below:
.
├── BeehiveCodeBase
│   └── Beehive
│       ├── lib
│       └── src
│           ├── beehive
│           │   ├── server
│           │   ├── thrift
│           │   ├── util
│           │   ├── validation
│           │   ├── validationService
│           │   ├── worker
│           │   └── workpool
│           ├── HashTable.thrift
│           └── schema.thrift
└── BeehiveTestProgs
    └── BeehivePrograms
        └── src
            └── TestProgs
                ├── GraphColoring
                ├── MaxClique
                ├── ShortestPath
                └── util
                    └── BeehiveAppLoader.java
Bash Environment Variables
The following environment variables should be set for bash, as shown below:
export BEEHIVE_HOME=""
export BEEHIVE_CODEBASE_VERSION="Beehive"
export BEEHIVE_TESTPROGS_VERSION="BeehivePrograms"
export BEEHIVE_CODEBASE="$BEEHIVE_HOME/Beehive/BeehiveCodeBase/$BEEHIVE_CODEBASE_VERSION/src"
export BEEHIVE_TESTPROGS_SRC="$BEEHIVE_HOME/Beehive/BeehiveTestProgs/$BEEHIVE_TESTPROGS_VERSION/src"
export BEEHIVE_TESTPROGS="$BEEHIVE_TESTPROGS_SRC/TestProgs"
export THRIFT_JARS="$BEEHIVE_CODEBASE/lib/*"
To use the various example graphs for testing, GRAPH_HOME should be set appropriately, as shown in the example below:
export GRAPH_HOME="/project/cluster16/GraphGen"
export GRAPH_FGRAPH="$GRAPH_HOME/Fgraph"
The Java classpath should be set as follows:
CLASSPATH=".:$BEEHIVE_CODEBASE\
:$BEEHIVE_TESTPROGS_SRC\
:$GRAPH_FGRAPH\
:$GRAPH_HOME\
:$THRIFT_JARS\
:$CLASSPATH"
export PATH=".:$THRIFT_JARS:$PATH"
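Before moving on, it can help to confirm that the variables named above are actually set in the current shell. The following is a minimal sketch and not part of the Beehive distribution: the function name check_beehive_env is hypothetical, and the list of variables is simply the set that appears in this section.

```shell
# Hypothetical helper: report which of the variables named in this section
# are unset or empty. Prints one line per missing variable and returns the
# number of missing variables (0 means everything is set).
check_beehive_env() {
    missing=0
    for name in BEEHIVE_HOME BEEHIVE_CODEBASE BEEHIVE_TESTPROGS_SRC \
                BEEHIVE_TESTPROGS THRIFT_JARS GRAPH_HOME GRAPH_FGRAPH; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}
```

Calling check_beehive_env after sourcing the settings lists anything still empty, including BEEHIVE_HOME if its placeholder value has not yet been filled in.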
Command Scripts for Executing an Application Program
An application program is launched using three command scripts. These scripts are executed in the directory of the example program. For the GraphColoring problem, for instance, they are executed in the following directory:
$BEEHIVE_HOME/BeehiveTestProgs/BeehiveTestProgs-V3.9.3/src/TestProgs/GraphColoring/
We first need to create a file containing the list of cluster nodes on which we want to execute the parallel program. Suppose that we want to execute a program on a cluster of four nodes. We then create a file, say named 4-nodes, containing the hostnames of the nodes as follows:
Example hostlist file:: 4-nodes
nuclear01.cs.umn.edu
nuclear02.cs.umn.edu
nuclear03.cs.umn.edu
nuclear04.cs.umn.edu
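A mistyped or duplicated hostname in this file only shows up later, when the launch fails on one node. The following sketch shows one way to inspect a hostlist file beforehand; the function names count_hosts and duplicate_hosts are hypothetical and not part of Beehive.

```shell
# Hypothetical helpers for inspecting a hostlist file such as 4-nodes.

# count_hosts FILE: print the number of non-blank lines (one hostname per line).
count_hosts() {
    grep -c '[^[:space:]]' "$1"
}

# duplicate_hosts FILE: print any hostname that appears more than once.
duplicate_hosts() {
    grep '[^[:space:]]' "$1" | sort | uniq -d
}
```

For the example file above, count_hosts 4-nodes should agree with the number of nodes you intend to use, and duplicate_hosts 4-nodes should print nothing.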
For the GraphColoring problem, a Java program called GCTest.java will be executed on each node in the cluster. The details of developing a parallel program such as GCTest.java are discussed in the following chapters.
Before we launch the execution, several other steps need to be performed, as shown below:
- Create a configuration file, say named configFile, and store it in the application program's directory, e.g. GraphColoring. The details of preparing the configFile are given in the next chapter.
- Start the ValidationService on a dedicated host. Suppose that we will execute it on a host named jupiter.cs.umn.edu. The details are given below:
  - Log onto the host running the ValidationService and make sure the bash environment variables are correctly set. Execute the file called startValidator, which contains the following command:
    java beehive.validation.GlobalValidationService configFile
  - Make sure the ValidationService is running: it will print out the configFile.
Executing an Application Program
Four scripts are used to run the GraphColoring parallel program. The first three launch the execution and the last one terminates it:
runGC
start-GC.sh
run-command-cluster_GC.sh
clear_my_java.sh
runGC
The structure of this file is shown below:
java -Xms4096m -Xmx8192m -XX:+UseG1GC TestProgs.GraphColoring.GCTest \
    4-nodes \
    configFile \
    jupiter.cs.umn.edu $GRAPH_FGRAPH/fgraph-50000-100-100.pp \
    50000 \
    configFile \
    2>&1 &
The command given in runGC is executed on each of the cluster nodes when the program execution is launched. Several options are passed to the JVM to set the initial and maximum heap size and the garbage collector to be used. These options play a critical role when executing a program on large data sets. The command also specifies the input graph to be used, which in this case is fgraph-50000-100-100 (a 50K-node graph). The hostlist filename (4-nodes in this example) is given as an argument to the program. The hostname for the ValidationService, which is jupiter.cs.umn.edu in this example, is also given as one of the arguments to the GCTest program.
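To make the role of each argument easier to see, the command can be assembled from named variables. This is only a sketch: the function build_gc_command is hypothetical, the argument order is copied from the runGC command in this section, and the meaning of the fifth argument (50000) is assumed to be the node count of the input graph. The configuration file appears twice because it appears twice in the runGC command shown here.

```shell
# Hypothetical helper: print the runGC java command for the given arguments.
# The JVM options and argument order are copied from the runGC command above.
build_gc_command() {
    hostlist=$1      # hostlist file, e.g. 4-nodes
    config=$2        # configuration file, e.g. configFile
    validator=$3     # host running the ValidationService
    graph=$4         # input graph file
    size=$5          # 50000 in the example; assumed to be the graph's node count
    printf '%s %s %s %s %s %s %s\n' \
        "java -Xms4096m -Xmx8192m -XX:+UseG1GC TestProgs.GraphColoring.GCTest" \
        "$hostlist" "$config" "$validator" "$graph" "$size" "$config"
}
```

Printing the command first, for example with build_gc_command 4-nodes configFile jupiter.cs.umn.edu "$GRAPH_FGRAPH/fgraph-50000-100-100.pp" 50000, lets you inspect it before putting it into runGC.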
start-GC.sh
The structure of start-GC.sh is shown below:
File:: start-GC.sh
# This script is run on each cluster host machine in order to setup a beehive node on that machine.
# To be used with run_command_cluster.sh to setup beehive nodes on multiple machines
# This starts the Worker processes on nodes
dir="$BEEHIVE_TESTPROGS/GraphColoring" #modify to change the directory
cmd="./runGC"
#tcsh
# terminate any running rmiregistry
pgrep rmiregistry | xargs kill -9
ps -ef | grep java | grep GC | tr -s ' ' | cut -f2 -d' ' | xargs kill -9
# run the command to start beehive process
cd $dir
$cmd
You need to make sure that the variable dir points to the GraphColoring program directory and that cmd is set to runGC.
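The cleanup line in start-GC.sh selects processes by text-matching the output of ps -ef, so it can also pick up the grep processes in the pipeline, and the pattern GC also matches any JVM started with -XX:+UseG1GC. The following is a sketch of a stricter filter, assuming the usual ps -ef column layout (PID in column 2, command starting in column 8). The function name java_pids is hypothetical.

```shell
# java_pids PATTERN: read `ps -ef`-style lines on stdin and print the PID
# (column 2) of every process whose executable (column 8) is java and whose
# arguments (column 9 onward) contain PATTERN as a plain substring.
java_pids() {
    awk -v pat="$1" '
        $8 ~ /(^|\/)java$/ {                 # executable is java, bare or with a path
            args = ""
            for (i = 9; i <= NF; i++) args = args " " $i
            if (index(args, pat)) print $2
        }'
}
```

A command such as ps -ef | java_pids GCTest then lists only the java processes running the test program, and that list can be reviewed before it is passed to kill.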
run-command-cluster_GC.sh
The structure of this shell command file is shown below. You need to make sure that the variable script_dir is correctly set to the GraphColoring directory.
#!/usr/bin/env bash
# $1 - node lists file
script_dir="$BEEHIVE_TESTPROGS/GraphColoring"
for node in `cat $1`
do
    echo "running command on $node"
    ssh $node "sh $script_dir/start-GC.sh &" &
done
You are now ready to start the parallel execution of the GraphColoring program on a 4-node cluster. Execute the following command in the GraphColoring directory:
run-command-cluster_GC.sh 4-nodes
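Before launching on a real cluster, it may be useful to see exactly which ssh commands the launch loop would issue. The following dry-run sketch prints them without contacting any host. The function name print_launch_commands is hypothetical, and the command text mirrors the loop in run-command-cluster_GC.sh.

```shell
# print_launch_commands HOSTFILE SCRIPT_DIR: print, without running it, the
# ssh command that the launch loop would issue for each host in HOSTFILE.
print_launch_commands() {
    # "|| [ -n "$node" ]" keeps a final line that lacks a trailing newline.
    while IFS= read -r node || [ -n "$node" ]; do
        if [ -n "$node" ]; then              # skip blank lines
            printf 'ssh %s "sh %s/start-GC.sh &" &\n' "$node" "$2"
        fi
    done < "$1"
}
```

For example, print_launch_commands 4-nodes "$BEEHIVE_TESTPROGS/GraphColoring" prints one line per host, which makes a wrong script_dir or an unexpected hostname easy to spot.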
clear_my_java.sh
If you need to terminate the program execution because of errors or for any other reason, execute the following command:
clear_my_java.sh 4-nodes GC
The first argument is the hostlist file and the second argument is a unique string appearing in the program name, which happens to be GCTest.java in this example.