Table of Contents




6. Configuration Files for Fault Tolerance and Self-Recovery

      6.1 Failure Agents

      6.2 Multiple Failure Agents

      6.3 SMS Agent Recovery

      6.4 Examples

        6.4.1 Two Failure Agents

        6.4.2 Two Failure Agents and SMS Agent Recovery




6.1 Failure Agents

A failure agent is an agent that contains the FailureEventDetector. This detector waits for heartbeats, in the form of AgentAliveEvents, sent by the other agents. Each AgentAliveEvent is triggered by the sending agent's TimerEventDetector. If the failure agent does not receive an AgentAliveEvent from an agent for a certain number of cycles, it tells the SMS Agent to relaunch the agent that failed. If the relaunched agent had subscribers, their subscriptions are restarted.
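The heartbeat logic described above can be pictured with a small stand-alone sketch. This is not the FailureEventDetector's actual code. The class name HeartbeatMonitor, its methods, and the max_missed_cycles threshold are hypothetical and only illustrate the idea of counting cycles without an agent alive event.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Toy model of a failure agent counting missed heartbeats (hypothetical)."""

    max_missed_cycles: int  # assumed threshold; the real value is not stated here
    missed: dict = field(default_factory=dict)

    def watch(self, agent_name: str) -> None:
        # Start tracking an agent with zero missed cycles.
        self.missed[agent_name] = 0

    def heartbeat(self, agent_name: str) -> None:
        # An agent alive event arrived: reset that agent's counter.
        if agent_name in self.missed:
            self.missed[agent_name] = 0

    def tick(self) -> list:
        """Advance one timer cycle and return the agents to relaunch."""
        failed = []
        for name in self.missed:
            self.missed[name] += 1
            if self.missed[name] >= self.max_missed_cycles:
                failed.append(name)
        for name in failed:
            # A relaunch was requested, so start counting afresh.
            self.missed[name] = 0
        return failed
```

In this sketch, an agent that calls heartbeat() before every tick() is never reported, while an agent that stays silent is reported once its counter reaches the threshold.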

How to adjust the configuration files:

The SMS Agent needs to know what to look for in order to restart an agent that has failed. Up to this point we have not included detectors or handlers in the SMS Agent. To do so, add the SMS Agent to the configuration file as a new agent. The SMS Agent uses its own block type, SMS_AGENT, which looks like the following:

SMS_AGENT{
   TARGET_HOST = URN:ans:plato.cs.umn.edu/jtucker/ellora.cs.umn.edu;
   AGENT_NAME = ~/ia;
   dburl = jdbc:mysql://archimedes.cs.umn.edu:10000/test;
   dbuser = mobile_agent;
   dbpasswd = user1000;
   TRIGGER_TABLE = /home/ugrad00/jtucker/konark/trigger;
   DETECTOR{
      network.detectors.RecoveryHandlerDetector, network.manager.RecoveryHandler, "HandlerOnly";
      network.detectors.TimerEventDetector, network.manager.EventHandler;
   }
}

In addition to the above, the SMS Agent's URL and name must be added to the system values. The URL is shown when you start the SMS Agent Server (note: it begins with one slash, not two). The name is the one given by AGENT_NAME inside the SMS_AGENT block (ia in the example above).

SYSTEM_VALUE{
   PREVIOUS SYSTEM_VALUES;
   SMS_AGENT_NAME = ia;
   SMS_AGENT_URL = /ellora.cs.umn.edu:20000/URN:ans:plato.cs.umn.edu/jtucker/ellora.cs.umn.edu;
}
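Because the single leading slash is easy to get wrong, a quick check before launching may help. The following is a minimal sketch, not part of the system. The function names are hypothetical, and the parsing is only an informal reading of the KEY = value; syntax shown above.

```python
def parse_system_values(block_text: str) -> dict:
    """Collect KEY = value; pairs from a SYSTEM_VALUE{...} block."""
    start = block_text.index("{") + 1
    end = block_text.rindex("}")
    values = {}
    for entry in block_text[start:end].split(";"):
        if "=" not in entry:
            continue  # skip blank entries and placeholders without a value
        key, value = entry.split("=", 1)
        values[key.strip()] = value.strip()
    return values


def check_sms_values(values: dict) -> list:
    """Return a list of problems with the SMS Agent system values."""
    problems = []
    for key in ("SMS_AGENT_NAME", "SMS_AGENT_URL"):
        if key not in values:
            problems.append(f"missing {key}")
    url = values.get("SMS_AGENT_URL", "")
    # The URL should begin with exactly one slash, not two.
    if url and (not url.startswith("/") or url.startswith("//")):
        problems.append("SMS_AGENT_URL should start with exactly one slash")
    return problems
```

An empty list from check_sms_values() means both values are present and the URL begins with a single slash. It does not confirm that the URL is the one reported by the SMS Agent Server.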

The SMS Agent has the RecoveryHandlerDetector, which lets it restart agents that have failed.

Each agent in the system (excluding the SMS Agent) that you wish to be recoverable should have both the TimerEventDetector and the AgentAliveEventDetector. The TimerEventDetector triggers an AgentAliveEvent, which is sent to the agents that have the FailureEventDetector (the Failure Agents). A Failure Agent should have the FailureEventDetector and the TimerEventDetector. Subscriptions to agents that have the AgentAliveEventDetector (including other Failure Agents) are set up automatically. A Failure Agent's configuration looks like this:

AGENT{
   TARGET_HOST = URN:ans:plato.cs.umn.edu/jtucker/failureAgent1;
   AGENT_NAME = ~/failureAgent1;
   dburl = jdbc:mysql://archimedes.cs.umn.edu:10000/test;
   dbuser = mobile_agent;
   dbpasswd = user1000;
   TRIGGER_TABLE = /home/ugrad00/jtucker/konark/trigger;
   DETECTOR{
      network.detectors.TimerEventDetector, network.manager.EventHandler;
      network.detectors.FailureEventDetector, network.manager.RemoteEventHandler;
   }
}
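Since a Failure Agent needs both the TimerEventDetector and the FailureEventDetector, it can be useful to check a block for them before launching. The script below is only a sketch under that assumption. It is not the system's own parser, the helper names are hypothetical, and it reads the DETECTOR section in the simplest possible way.

```python
import re

# Detectors this section says a Failure Agent should have.
REQUIRED_FOR_FAILURE_AGENT = {"TimerEventDetector", "FailureEventDetector"}


def detectors_in_block(block_text: str) -> set:
    """Return the short class names listed inside a DETECTOR{...} section."""
    match = re.search(r"DETECTOR\s*\{(.*?)\}", block_text, re.DOTALL)
    if match is None:
        return set()
    names = set()
    for entry in match.group(1).split(";"):
        entry = entry.strip()
        if not entry:
            continue
        # The detector class is the first comma-separated field of an entry.
        detector_class = entry.split(",")[0].strip()
        names.add(detector_class.rsplit(".", 1)[-1])
    return names


def missing_detectors(block_text: str, required: set) -> set:
    """Return the required detectors that the block does not list."""
    return required - detectors_in_block(block_text)
```

For the failureAgent1 block above, missing_detectors(block, REQUIRED_FOR_FAILURE_AGENT) would return an empty set.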





6.2 Multiple Failure Agents

The system can contain multiple Failure Agents, so that the failure of one Failure Agent does not prevent future recovery. As long as at least one Failure Agent is still running, the other(s) can be restarted.

How to adjust the configuration files:

Add another Failure Agent as described in 6.1. Each Failure Agent also needs the AgentAliveEventDetector and the AgentAliveEventHandler, so that the Failure Agents can monitor one another.




6.3 SMS Agent Recovery

SMS Agent recovery is similar to regular agent recovery. The SMS Agent sends out an SMS Agent Alive Event (SMSAgentAliveEvent). If the SMS Failure Agent(s) do not receive this event, the SMS Agent is relaunched. Because no other agent in the system has access to the SMS Agent's configuration file, the agent that relaunches the SMS Agent must read a backup configuration file generated by the SMS Agent. This backup configuration file is generated whenever the configuration is changed from the GUI. To enable backups, enter the full path to the backup directory and a base filename in the configuration launcher.

How to adjust the configuration files:

The SYSTEM_VALUE block needs two additional entries. CHECK_POINT_DIR is the directory in which you want the system backups stored. It should match the information given to the GUI when specifying the configuration (see the picture above; CHECK_POINT_DIR should match the Directory field in the Backup Information section). CHECK_POINT_FILE should match the Base filename field in the Backup Information section. SMS_AGENT_URL and SMS_AGENT_NAME are the same as in 6.1.

SYSTEM_VALUE{
   CHECK_POINT_DIR = /home/ugrad00/jtucker/konark/backup;
   CHECK_POINT_FILE = cp;
   SMS_AGENT_NAME = ia;
   SMS_AGENT_URL = /ellora.cs.umn.edu:20000/URN:ans:plato.cs.umn.edu/jtucker/ellora.cs.umn.edu;
}

The SMS Agent should have the SMSAgentAliveEventDetector.

The SMS Failure Agent should have the following added:

network.detectors.SMSFailureEventDetector, network.manager.EventHandler;

SUBSCRIPTION{
        AGENT_NAME = ~/SMS_Agent_Name;
        EVENT{
            network.events.SMSAgentAliveEvent, network.manager.EventHandler;
        }
}

The SMS Recovery Agent should have the following added:

network.detectors.SMSRecoveryHandlerDetector, network.manager.SMSRecoveryHandler, "HandlerOnly";
SUBSCRIPTION{
        AGENT_NAME = ~/SMS_Failure_Agent;
        EVENT{
            network.events.SMSFailureEvent, network.manager.EventHandler;
        }
}




6.4 Examples

The first example will demonstrate two failure agents in a system and the second example will demonstrate two failure agents and SMS Agent recovery.






6.4.1 Two Failure Agents

The example configuration can be found here.

The file should be changed in the same way as the example shown in section three. Once the system is running, you can kill agents and/or failure agents. After killing an agent, restart its agent server (see section 3.2 for a discussion of how to start the agent servers). A new agent will be sent to this agent server.






6.4.2 Two Failure Agents and SMS Agent Recovery

The example configuration can be found here.

The file is a simple extension of the example in 6.4.1. The relevant fields in the configuration file should be changed to suit your configuration.

The system needs to generate a backup configuration file. To make it generate one, add a detector/event or otherwise modify the system (see section 4 for ways to modify the system). After you have done this, the backup configuration should be generated (you can check the directory you specified to make sure the backup file is there). At this point the SMS Agent can be killed. After killing it, restart an agent server. The recovery agent will relaunch an SMS Agent with the configuration specified in the backup configuration file.
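Checking that a backup exists before killing the SMS Agent can be done by hand, or with a small script such as the sketch below. It assumes that backup file names begin with the base filename (CHECK_POINT_FILE). That naming is a guess, not something this manual states, and the function name find_backups is hypothetical.

```python
from pathlib import Path


def find_backups(check_point_dir: str, check_point_file: str) -> list:
    """List files in the backup directory whose names start with the base filename."""
    directory = Path(check_point_dir)
    if not directory.is_dir():
        # The directory does not exist yet, so no backup can be present.
        return []
    return sorted(
        p.name
        for p in directory.iterdir()
        if p.is_file() and p.name.startswith(check_point_file)
    )
```

With the values from 6.3, the call would be find_backups("/home/ugrad00/jtucker/konark/backup", "cp"). An empty result suggests that no backup has been generated yet.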



