Checkpoint Recovery
Checkpointing
Beehive supports checkpoint-recovery. That is, one can configure the system to periodically checkpoint its state and to recover from the latest checkpoint in the event of a failure or an intentional halt of the system. To configure the system to checkpoint periodically, the following flag needs to be set:
System.checkpoint=true
By default, the system checkpoints every five seconds. This interval is configurable:
# Value is in milliseconds
System.checkpointFrequency=5000
In order to checkpoint the data held in Beehive's storage system, all data objects must be serializable.
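As a minimal illustration, any application object kept in the storage system needs to implement java.io.Serializable (and hold only serializable fields) so that it can be written out to the checkpoint files. The class below is hypothetical and not part of Beehive:

import java.io.Serializable;

// Hypothetical application data class. Because it implements Serializable and
// all of its fields are serializable, the system can checkpoint it.
public class VertexState implements Serializable {
    private static final long serialVersionUID = 1L;

    private final int vertexId;
    private double rank;

    public VertexState(int vertexId, double rank) {
        this.vertexId = vertexId;
        this.rank = rank;
    }

    public int getVertexId() { return vertexId; }
    public double getRank() { return rank; }
    public void setRank(double rank) { this.rank = rank; }
}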
Checkpoint Files
The system checkpoints its state into the following file structure:
.
└── checkpoints
    ├── bak
    │   ├── host1
    │   │   ├── HashtableServer.ser
    │   │   └── LocalWorkpool.ser
    │   ├── host2
    │   ├── ...
    │   └── host-N
    ├── restore
    │   ├── host1
    │   │   ├── HashtableServer.ser
    │   │   └── LocalWorkpool.ser
    │   ├── host2
    │   │   ├── HashtableServer.ser
    │   │   └── LocalWorkpool.ser
    │   ├── ...
    │   └── host-N
    └── save
        ├── host1
        │   ├── HashtableServer.ser
        │   └── LocalWorkpool.ser
        ├── host2
        ├── ...
        └── host-N
There are three tiers of files. When the system takes a checkpoint, it moves all files within save/ to bak/ and all files within restore/ to save/ before any new checkpoint files are created. This way, if the system crashes while the checkpoint files are being written, two stable copies of the previous state remain from which the system can restore. When the system recovers, it only attempts to recover from restore/ and bak/.
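The rotation can be pictured with a short sketch. This is an illustration of the scheme described above, not Beehive's actual implementation, and it assumes the new .ser files are subsequently written under restore/, which the description above does not state explicitly:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Illustrative sketch of the directory rotation performed before a checkpoint:
// save/ becomes bak/ and restore/ becomes save/.
public class CheckpointRotation {

    static void rotate(Path checkpoints) throws IOException {
        Path bak = checkpoints.resolve("bak");
        Path save = checkpoints.resolve("save");
        Path restore = checkpoints.resolve("restore");

        deleteRecursively(bak);            // drop the oldest copy
        if (Files.exists(save)) {
            Files.move(save, bak);         // previous stable copy becomes the backup
        }
        if (Files.exists(restore)) {
            Files.move(restore, save);     // most recent checkpoint becomes the stable copy
        }
        Files.createDirectories(restore);  // assumed destination of the new checkpoint files
    }

    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(dir)) {
            // delete children before their parent directories
            walk.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
        }
    }
}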
Recovery
Automatic Recovery (Failure Detection)
Beehive has a failure detection server which can automatically detect whether any cluster node has failed. In the case of a failure, the server attempts to automatically recover the system from the last checkpoint. To configure the system to use the failure detection server, the following flags need to be set (a sketch of how these settings interact follows the parameter list below):
FailureDetection.graces=3
FailureDetection.enable=true
FailureDetection.hostname=nuclear01.cs.umn.edu
FailureDetection.heartbeatFrequency=1000 # milliseconds
FailureDetection.pollingFrequency=3000 # milliseconds
- graces: This value allows a cluster node to miss up to N reports to the server before the server determines that it has failed. By default, it is set to 3.
- hostname: This is the host name of the failure detection server.
- heartbeatFrequency: This value determines how often each cluster node must report to the failure detection server. By default, each cluster node must report to the server every second.
- pollingFrequency: This value determines how often the server checks for failures. By default, the server checks for failures every three seconds.
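A minimal sketch of how these settings could fit together on the server side follows; the class, its methods, and the exact way missed reports are counted are assumptions for illustration, not Beehive's API:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative bookkeeping for the failure detection server: nodes report every
// heartbeatFrequency ms, the server checks every pollingFrequency ms, and a node
// that misses more than `graces` consecutive checks is declared failed.
public class FailureDetectorSketch {
    private final int graces = 3;  // FailureDetection.graces
    private final Map<String, Integer> missedChecks = new ConcurrentHashMap<>();

    // Invoked whenever a heartbeat arrives from a cluster node.
    public void onHeartbeat(String node) {
        missedChecks.put(node, 0);
    }

    // Invoked every pollingFrequency milliseconds.
    public void poll() {
        for (Map.Entry<String, Integer> entry : missedChecks.entrySet()) {
            int missed = entry.getValue() + 1;
            missedChecks.put(entry.getKey(), missed);
            if (missed > graces) {
                // here the server would trigger recovery from the last checkpoint
                System.out.println(entry.getKey() + " declared failed");
            }
        }
    }
}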
Necessary Files
Starting the Failure Server
To start the failure detection server, one must run the following script on the machine that is to run the server:
./startFailureDetector <hostfile>
The script contains the following command:
java beehive.checkpoint.FailureDetectorServer configFile $1
Recovering the System
Three files are needed to automatically recover a cluster; each must be edited to include the application name and directory wherever necessary:
recover-cluster.sh
This file contains the following bash script:
#!/usr/bin/env bash
# $1 - node lists file
# $2 - master command to be run
master=0;
echo $master
script_dir="$BEEHIVE_TESTPROGS/<application directory>"
for node in `cat $1`
do
    echo "running command on $node"
    ssh $node "sh $script_dir/recover-<application-name>.sh &" &
done
recover-<application-name>.sh
This file contains the following bash script:
# This script is to be run on a machine in order
# to set up a beehive node on that machine.
# To be used with run_command_cluster.sh to set up beehive nodes on multiple machines
cmd="./recover<application-name>"
dir="$BEEHIVE_TESTPROGS/<application directory>"

# restart rmi registry
#pgrep rmiregistry | xargs kill -9
# kill any java process left over from a previous run (the grep pattern is
# application-specific; PreFlow is the example application here)
ps -ef | grep java | grep PreFlow | tr -s ' ' | cut -f2 -d' ' | xargs kill -9

# Line below commented out because we start rmiregistry in our program code itself
#rmiregistry &

# run the command to start the beehive process
cd $dir
$cmd
recover<application-name>
This file contains the Java command to start the application. You must include a command-line argument in your application's startup code that, when read, indicates that the system is starting in recovery mode. Then, one must use the following variant of the init method of the BeehiveAppLoader:
init(args, id, nodeClassNameMap, nodeClassMap, /* recover = */ true)
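A hedged sketch of such startup code is shown below. The -recover flag name, the argument positions, the map types, and the way the loader is obtained are assumptions to adapt to your existing startup code; only the init(args, id, nodeClassNameMap, nodeClassMap, recover) parameter list is taken from above:

import java.util.HashMap;
import java.util.Map;

public class RecoverableAppMain {
    public static void main(String[] args) {
        // Decide whether this run is a recovery from a checkpoint based on a
        // command-line argument (the flag name is an assumption).
        boolean recover = false;
        for (String arg : args) {
            if (arg.equals("-recover")) {
                recover = true;
            }
        }

        int id = Integer.parseInt(args[0]);  // this node's id (argument position assumed)

        // Build the maps exactly as in the normal (non-recovery) startup path;
        // the generic types here are assumptions.
        Map<String, String> nodeClassNameMap = new HashMap<>();
        Map<String, Class<?>> nodeClassMap = new HashMap<>();

        // Pass recover=true so the loader restores state from the checkpoint
        // files instead of starting fresh.
        BeehiveAppLoader loader = new BeehiveAppLoader();
        loader.init(args, id, nodeClassNameMap, nodeClassMap, recover);
    }
}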
Manual Recovery
If instead one wants to manually recover the system, then the following flag must be set to true before starting the application:
System.recovery=true