
Data Representation

Node

Beehive provides a Java class called Node which represents a vertex in a graph. Each Node in Beehive maintains the following information:

nodeId
A unique identifier for the node so that it can be retrieved from the StorageSystem, either from a remote server in the cluster or locally.
TS
The logical timestamp at which this node was created or last updated
neighbors
Adjacent Nodes of this particular instance. The connections are represented by a mapping from the adjacent node id to an Edge object, a class provided by Beehive. Edges are directed, outgoing edges.

Programmers can extend this class in order to add application-specific functionality by writing a subclass of Node. For example, consider the GraphColoring problem, which requires each Node to maintain its color and whether or not it has been marked. A possible subclass, then, would be GCNode, where additional fields and behaviors can be defined.

public class GCNode extends Node implements Externalizable {
    boolean marked;
    int color;
    // Auxiliary methods
}

It is important to note that Node is Externalizable, since nodes can be retrieved from a remote server or created on one server but deposited on another. The fields that need to persist across the network must therefore be saved and restored in the writeExternal and readExternal methods, respectively. Using GCNode as an example:

public void writeExternal(ObjectOutput out) throws IOException {
    super.writeExternal(out);           // Required
    out.writeBoolean(this.marked);
    out.writeInt(this.color);
}

public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
    super.readExternal(in);             // Required
    this.marked = in.readBoolean();
    this.color = in.readInt();
}

Note that the fields provided by the Node class are not automatically saved and restored; thus the call to super is required. It is also important to note that Beehive stores all graph vertices as Node objects, so that several subclasses of Node can coexist in the StorageSystem. Therefore, all node retrievals return a Node object. It is up to the programmer to type-check the nodes and make the appropriate type-cast(s) if necessary.
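The serialization contract and the type-check-and-cast step can be seen together in the following self-contained sketch. The base Node here is a simplified stand-in that only models nodeId (the real Beehive Node also carries a timestamp and a neighbor map), and the roundTrip helper is a hypothetical substitute for shipping a node between servers:

```java
import java.io.*;

// Simplified stand-in for Beehive's Node: only nodeId is modeled here;
// the real class also maintains a timestamp and a neighbor map.
class Node implements Externalizable {
    String nodeId;
    public Node() {}  // Externalizable requires a public no-arg constructor

    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeUTF(nodeId == null ? "" : nodeId);
    }
    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        nodeId = in.readUTF();
    }
}

class GCNode extends Node {
    boolean marked;
    int color;
    public GCNode() {}

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        super.writeExternal(out);   // Required: base fields are not saved automatically
        out.writeBoolean(marked);
        out.writeInt(color);
    }
    @Override
    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        super.readExternal(in);     // Required
        marked = in.readBoolean();
        color = in.readInt();
    }
}

public class RoundTrip {
    // Serialize and deserialize a node, as happens when a node is shipped
    // between servers; the result is statically typed as Node.
    static Node roundTrip(Node node) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(node);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Node) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        GCNode original = new GCNode();
        original.nodeId = "v42";
        original.marked = true;
        original.color = 3;

        Node retrieved = roundTrip(original);
        if (retrieved instanceof GCNode) {       // type-check before casting
            GCNode gc = (GCNode) retrieved;
            System.out.println(gc.nodeId + " " + gc.marked + " " + gc.color);  // v42 true 3
        }
    }
}
```

Because the subclass chains to super in both methods, the base fields and the subclass fields survive the round trip together; omitting either call would silently corrupt the stream.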


NodeData

Beehive provides a NodeData class that supports fine-grained operations within the application. The NodeData class provides the capability to query and update specific properties of a Node without having to retrieve the entire Node object from the StorageSystem. This reduces communication costs when subclasses of Node are complex and only a subset of their properties is needed. A NodeData achieves this by maintaining the following:

nodeId
The id of the Node the NodeData is associated with
timestamp
The timestamp at which the NodeData was constructed for validation purposes
nodeClassId
A unique byte identifier for the Class object for the Node the NodeData is associated with. See the AppLoader section for more details
properties

A mapping from a node's properties to values.

Note

When the RPC_MECHANISM of the system is set to THRIFT, the types of the properties are restricted to the following: String, Boolean, Integer, Long

This mapping can be used in one of three ways:

  1. Querying specific properties of a Node
    • In this case, the programmer would map the names of the desired properties to null and query the StorageSystem with the NodeData. The StorageSystem will then return a NodeData object where the mappings are now from the desired properties to their actual values from the Node.
  2. Updating specific properties of a Node
    • In this case, the programmer would map the names of properties to update to their desired values and give the NodeData object to the StorageSystem.
  3. Executing specific methods on a Node
    • In this case, the programmer would map the names of the methods to their respective parameters and give the NodeData object to the StorageSystem. The StorageSystem will then invoke the desired methods on the target node. A NodeData object is then returned where the mappings are now from the method names to their return values. Note that a method invoked by exec can only have one parameter.
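The query pattern in case 1 can be sketched with a plain map standing in for a NodeData's properties. The getNodeProperties method below is a hypothetical, purely local simplification of what the StorageSystem does on the server side; it is not the real Beehive API:

```java
import java.util.HashMap;
import java.util.Map;

public class NodeDataQuerySketch {
    // Stand-in for the stored state of one node on the server.
    static final Map<String, Object> nodeState = new HashMap<>();
    static {
        nodeState.put("color", 3);
        nodeState.put("marked", true);
    }

    // Hypothetical local stand-in for StorageSystem.getNodeProperties:
    // each requested (null-valued) key is filled in from the node's state.
    static Map<String, Object> getNodeProperties(Map<String, Object> query) {
        Map<String, Object> answered = new HashMap<>();
        for (String property : query.keySet()) {
            answered.put(property, nodeState.get(property));
        }
        return answered;
    }

    public static void main(String[] args) {
        // Query pattern: map the desired property names to null...
        Map<String, Object> query = new HashMap<>();
        query.put("color", null);

        // ...and the returned mapping carries the actual values.
        Map<String, Object> answered = getNodeProperties(query);
        System.out.println(answered.get("color"));  // 3
    }
}
```

Only the requested property crosses the (simulated) network boundary, which is the cost saving the NodeData mechanism is designed for.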

An example of NodeData usage can be found in the GraphColoring Example program.


Edges

Beehive provides an Edge class to represent the connections between graph vertices. The base Edge class maintains no information; thus, using it directly serves no purpose other than representing a connection between two nodes. Programmers can extend this class in order to add application-specific functionality by writing a subclass of Edge. For example, in the Single Source Shortest Path problem, edges maintain the distance between two adjacent nodes. Thus, a possible subclass would be SPEdge, where additional fields and behaviors can be defined.

public class SPEdge extends Edge implements Externalizable {
    int weight;
    // Auxiliary methods
}

It is important to note that each Edge is Externalizable, since edges are stored within a Node, which is also Externalizable. The fields that need to persist across the network must therefore be saved and restored in the writeExternal and readExternal methods, respectively. Using SPEdge as an example:

public void writeExternal(ObjectOutput out) throws IOException {
    super.writeExternal(out);       //Required
    out.writeInt(this.weight);
}

public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
    super.readExternal(in);         //Required
    this.weight = in.readInt();
}

Note that even though the Beehive Edge class does not maintain any information, a call to super is required, in the event that further development of the Beehive model requires edges to maintain some state. It is also important to note that Beehive stores all connections as Edge objects in each Node, to facilitate applications that involve hypergraph structures. Therefore, all edge retrievals return an Edge object. It is up to the programmer to type-check the edges and make the appropriate type-cast(s) if necessary.


StorageSystem

Beehive provides the StorageSystem class as an abstraction to store all of the nodes needed for the application's lifetime. The programmer can use the StorageSystem to add, update, remove, locate, and relocate nodes in the application. The StorageSystem provides all of its functionality with location transparency, so that programmers do not need to be concerned with the location of nodes. This is achieved by having each server in the cluster maintain a local StorageSystem that keeps track of its peers. The full listing of the functionality provided by the StorageSystem can be found in the Java documentation. Here we present a summary of the methods of importance to the programmer:

putNode(String nodeId, Node node)
Puts a node into the local or a remote StorageSystem under the given nodeId.
getNode(String nodeId)
Retrieves a node associated with the given id
updateNodeProperties(NodeData nodeData)
Updates the properties of a given node without having to retrieve and redeposit it. See NodeData for more information.
getNodeProperties(NodeData nodeData)
Retrieves specific properties of a given node. See NodeData for more information.
exec(NodeData nodeData)
Executes specific methods of a Node and retrieves the results without having to retrieve the entire node. See NodeData for more information.
getAllKeys(String url)
Retrieves all of the node ids local to the server whose hostname/URL is provided

Excluding getAllKeys, the StorageSystem provides a bulk variant for each method just described. For example, one could provide a list of nodes to put, get, update, or execute upon in parallel. See Chapter 6 for more details.
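The shape of the put/get contract can be illustrated with a hypothetical, purely local stand-in (SimpleNode and StoreSketch are invented for this sketch; the real StorageSystem takes Beehive's Node, adds location transparency across the cluster, and provides the NodeData-based methods and bulk variants described above):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal node type for the sketch; the real methods take and return
// Beehive's Node class.
class SimpleNode {
    final String nodeId;
    SimpleNode(String nodeId) { this.nodeId = nodeId; }
}

// Hypothetical local stand-in for the StorageSystem put/get contract.
public class StoreSketch {
    private final Map<String, SimpleNode> nodes = new HashMap<>();

    void putNode(String nodeId, SimpleNode node) { nodes.put(nodeId, node); }

    SimpleNode getNode(String nodeId) { return nodes.get(nodeId); }

    public static void main(String[] args) {
        StoreSketch store = new StoreSketch();
        store.putNode("v1", new SimpleNode("v1"));
        System.out.println(store.getNode("v1").nodeId);  // v1
    }
}
```

In the real system the same two calls work unchanged whether the node lives on the local server or on a peer; that is what location transparency buys the programmer.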


Task

As mentioned in The Programming Model, a Task is an abstraction in the Beehive computation model that maintains information specific to the application problem. Each task reads and/or updates the data of some vertices in the graph. Upon completion, a task can create additional tasks. Each Task maintains the following information:

nodeId
The id of the node that the computation is to be performed on
phase

The phase the task is in during the computation. This is application-driven, so that programmers can define several phases for their tasks. For example, we define several phases in the base Task class:

STARTUP
The task is currently preprocessing the graph or creating the initial computation tasks to distribute
GENERIC
The task is currently doing some generic computation
TERMINATION
The task is currently terminating
status

The status of the task during its lifetime. There are three statuses:

TASK_INIT
The task was just created
TASK_WAITING
The task is in the workpool waiting to be picked up for execution
TASK_FINISHED
The task has been validated, committed, and removed from the workpool
target location
Which workpool in the cluster the task should ideally be deposited to.
affinity level

How strongly the task should be bound to the target location when deposited. There are three levels of affinity:

NO_AFFINITY
Deposit the task in an arbitrary workpool in the cluster
WEAK_AFFINITY
Deposit the task at the target location if possible
STRONG_AFFINITY
Deposit the task at the target location

Programmers can extend this class in order to add application-specific computations by writing a subclass of Task. For example, consider the Single Source Shortest Path problem. Each task can maintain the current shortest distance from a node to the source node, as well as the weight of an edge that could potentially decrease that distance. Thus, a possible subclass would be SPTask, where additional fields and behaviors can be defined:

public class SPTask extends Task {
    int distFromSrc;
    int edgeWeight;
    String senderId;

    // Auxiliary methods
}

It is important to note that each task is stored as a Task object in the workpool. This is so that multiple types of task can co-exist in the workpool for more abstract applications. As a result, it is up to the programmer to type-check the tasks and make the appropriate type-cast(s) if necessary. In general, each task carries the information that is needed for the computation that is to be done on a node by the Worker class. Each task can also carry the computation logic that can be invoked by the Worker class at run time.


LocalWorkpool

Beehive stores all created tasks in a workpool. Each server in the cluster maintains its own LocalWorkpool, but can deposit Tasks into its peers' workpools. A LocalWorkpool maintains two task-pools. The first is a ready-pool, containing tasks that are ready to be picked up and executed. The second is an active-pool, containing tasks that have been picked up and are currently being executed. In addition to maintaining two task-pools, the LocalWorkpool is also responsible for interfacing with the Validator to validate any completed tasks. Completed tasks that are valid are removed from the LocalWorkpool entirely. Those that are invalid may either be retried immediately or placed into the ready-pool for re-execution. A full listing of the functionality provided by a LocalWorkpool can be found in the Java documentation. Here we present a summary of the functionality that is of importance to the programmer:

broadcast(Task t)
Broadcasts a task to each of its peers
addTask(Task t)
Adds the task into the ready-pool
addTasks(Vector<Task> tasks)
Distributes tasks into the ready-pools of the different cluster nodes
getTask()
Retrieves the current WorkStatus of the workpool. The status may or may not contain a task, depending on the state of the workpool, e.g. NO_TASK_READY
removeTask(Task t)
Removes a task from the active-pool if it exists
validate(Task t, RWSetInfo rw)
Validates the completed task. Information about RWSetInfo can be found in the Transactions section
reportCompletion(Task t, HashSet<Task> created)
Reports the completion of a valid task and distributes any created tasks to its peers