Checkpoint Recovery
Checkpointing
Beehive supports checkpoint-recovery. That is, one can configure the system to periodically checkpoint its state and to recover from the latest checkpoint in the event of a failure or an intentional halt of the system. To configure the system to checkpoint periodically, the following flag needs to be set:
System.checkpoint=true
By default, the system checkpoints every five seconds. This interval is configurable:
# Value is in milliseconds
System.checkpointFrequency=5000
In order to checkpoint the data held in Beehive's storage system, all data objects must be serializable.
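As a minimal illustration, any application object kept in the storage system needs to implement java.io.Serializable (and hold only serializable fields) so that it can be written out to the checkpoint files. The class below is hypothetical and not part of Beehive:

import java.io.Serializable;

// Hypothetical application data class. Because it implements Serializable and
// all of its fields are serializable, the system can checkpoint it.
public class VertexState implements Serializable {
    private static final long serialVersionUID = 1L;

    private final int vertexId;
    private double rank;

    public VertexState(int vertexId, double rank) {
        this.vertexId = vertexId;
        this.rank = rank;
    }

    public int getVertexId() { return vertexId; }
    public double getRank() { return rank; }
    public void setRank(double rank) { this.rank = rank; }
}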
Checkpoint Files
The system checkpoints its state into the following file structure:
.
└── checkpoints
    ├── bak
    │   ├── host1
    │   │   ├── HashtableServer.ser
    │   │   └── LocalWorkpool.ser
    │   ├── host2
    │   ├── ...
    │   └── host-N
    ├── restore
    │   ├── host1
    │   │   ├── HashtableServer.ser
    │   │   └── LocalWorkpool.ser
    │   ├── host2
    │   │   ├── HashtableServer.ser
    │   │   └── LocalWorkpool.ser
    │   ├── ...
    │   └── host-N
    └── save
        ├── host1
        │   ├── HashtableServer.ser
        │   └── LocalWorkpool.ser
        ├── host2
        ├── ...
        └── host-N
There are three tiers of files. When the system takes a checkpoint, it moves all files within save/ to bak/ and all files within restore/ to save/ before any new checkpoint files are created. This way, if the system crashes while the checkpoint files are being written, two stable copies of the previous state remain from which the system can restore. When the system recovers, it only attempts to recover from restore/ and bak/.
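The rotation can be pictured with a short sketch. This is an illustration of the scheme described above, not Beehive's actual implementation, and it assumes the new .ser files are subsequently written under restore/, which the description above does not state explicitly:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Illustrative sketch of the directory rotation performed before a checkpoint:
// save/ becomes bak/ and restore/ becomes save/.
public class CheckpointRotation {

    static void rotate(Path checkpoints) throws IOException {
        Path bak = checkpoints.resolve("bak");
        Path save = checkpoints.resolve("save");
        Path restore = checkpoints.resolve("restore");

        deleteRecursively(bak);            // drop the oldest copy
        if (Files.exists(save)) {
            Files.move(save, bak);         // previous stable copy becomes the backup
        }
        if (Files.exists(restore)) {
            Files.move(restore, save);     // most recent checkpoint becomes the stable copy
        }
        Files.createDirectories(restore);  // assumed destination of the new checkpoint files
    }

    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(dir)) {
            // delete children before their parent directories
            walk.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
        }
    }
}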
Recovery
Automatic Recovery (Failure Detection)
Beehive has a failure detection server which can automatically detect whether any cluster node has failed. In the case of a failure, the server attempts to automatically recover the system from the last checkpoint. To configure the system to use the failure detection server, the following flags need to be set (a sketch of how these settings interact follows the parameter list below):
FailureDetection.graces=3
FailureDetection.enable=true
FailureDetection.hostname=nuclear01.cs.umn.edu
FailureDetection.heartbeatFrequency=1000 # milliseconds
FailureDetection.pollingFrequency=3000 # milliseconds
- graces: This value allows a cluster node to miss up to N reports to the server before the server determines that it has failed. By default, it is set to 3.
- hostname: This is the host name of the failure detection server.
- heartbeatFrequency: This value determines how often each cluster node must report to the failure detection server. By default, each cluster node must report to the server every second.
- pollingFrequency: This value determines how often the server checks for failures. By default, the server checks for failures every three seconds.
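A minimal sketch of how these settings could fit together on the server side follows; the class, its methods, and the exact way missed reports are counted are assumptions for illustration, not Beehive's API:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative bookkeeping for the failure detection server: nodes report every
// heartbeatFrequency ms, the server checks every pollingFrequency ms, and a node
// that misses more than `graces` consecutive checks is declared failed.
public class FailureDetectorSketch {
    private final int graces = 3;  // FailureDetection.graces
    private final Map<String, Integer> missedChecks = new ConcurrentHashMap<>();

    // Invoked whenever a heartbeat arrives from a cluster node.
    public void onHeartbeat(String node) {
        missedChecks.put(node, 0);
    }

    // Invoked every pollingFrequency milliseconds.
    public void poll() {
        for (Map.Entry<String, Integer> entry : missedChecks.entrySet()) {
            int missed = entry.getValue() + 1;
            missedChecks.put(entry.getKey(), missed);
            if (missed > graces) {
                // here the server would trigger recovery from the last checkpoint
                System.out.println(entry.getKey() + " declared failed");
            }
        }
    }
}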
Necessary Files
Starting the Failure Server
To start the failure detection server, one must run the following script on the machine that is to run the server:
./startFailureDetector <hostfile>
The script contains the following command:
java beehive.checkpoint.FailureDetectorServer configFile $1
Recovering the System
Three files are needed to automatically recover a cluster; each must be edited to include the application name and directory wherever necessary:
recover-cluster.sh
This file contains the following bash script:
#!/usr/bin/env bash
# $1 - node lists file
# $2 - master command to be run
master=0;
echo $master
script_dir="$BEEHIVE_TESTPROGS/<application directory>"
for node in `cat $1`
do
    echo "running command on $node"
    ssh $node "sh $script_dir/recover-<application-name>.sh &" &
done
recover-<application-name>.sh
This file contains the following bash script:
# This script is to be run on a machine in order
# to set up a beehive node on that machine.
# To be used with run_command_cluster.sh to set up beehive nodes on multiple machines
cmd="./recover<application-name>"
dir="$BEEHIVE_TESTPROGS/<application directory>"

# restart rmi registry
#pgrep rmiregistry | xargs kill -9
# kill any java process left over from a previous run (the grep pattern is
# application-specific; PreFlow is the example application here)
ps -ef | grep java | grep PreFlow | tr -s ' ' | cut -f2 -d' ' | xargs kill -9

# Line below commented out because we start rmiregistry in our program code itself
#rmiregistry &

# run the command to start the beehive process
cd $dir
$cmd
recover<application-name>
This file contains the Java command to start the application. You must include a command-line argument in your application's startup code that, when read, indicates that the system is starting in recovery mode. Then, one must use the following variant of the init method of the BeehiveAppLoader:
init(args, id, nodeClassNameMap, nodeClassMap, /* recover = */ true)
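A hedged sketch of such startup code is shown below. The -recover flag name, the argument positions, the map types, and the way the loader is obtained are assumptions to adapt to your existing startup code; only the init(args, id, nodeClassNameMap, nodeClassMap, recover) parameter list is taken from above:

import java.util.HashMap;
import java.util.Map;

public class RecoverableAppMain {
    public static void main(String[] args) {
        // Decide whether this run is a recovery from a checkpoint based on a
        // command-line argument (the flag name is an assumption).
        boolean recover = false;
        for (String arg : args) {
            if (arg.equals("-recover")) {
                recover = true;
            }
        }

        int id = Integer.parseInt(args[0]);  // this node's id (argument position assumed)

        // Build the maps exactly as in the normal (non-recovery) startup path;
        // the generic types here are assumptions.
        Map<String, String> nodeClassNameMap = new HashMap<>();
        Map<String, Class<?>> nodeClassMap = new HashMap<>();

        // Pass recover=true so the loader restores state from the checkpoint
        // files instead of starting fresh.
        BeehiveAppLoader loader = new BeehiveAppLoader();
        loader.init(args, id, nodeClassNameMap, nodeClassMap, recover);
    }
}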
Manual Recovery
If instead one wants to manually recover the system, then the following flag must be set to true before starting the application:
System.recovery=true