The main thrust of this activity has been the development of scalable
transaction management mechanisms to support multi-key transactions with ACID
(atomicity, consistency, isolation, durability) properties on key-value based data
storage systems, like those provided by traditional database systems. The
specific focus of this work was on the Hadoop/HBase system. The approach adopted
in this work is based on the Snapshot Isolation (SI) model, which is an
optimistic concurrency technique based on multi-version data. This project
builds upon the multi-version data management model of HBase and its single-key
transaction primitive. In the system
architectures developed here, the transaction management protocols are decoupled
from the storage system and integrated with the application-level processes.
These protocols are designed to support a model of cooperative recovery in which
any process can complete a partially executed sequence of commit/abort actions
on behalf of another process that is suspected to have failed. Because the SI
model can lead to non-serializable transaction executions, this work
investigated two conflict detection techniques for ensuring serializability.
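As a rough illustration of these checks, the Java sketch below shows commit-time
validation against a multi-version store: the standard SI first-committer-wins test
on the write set, and a conservative read-set test that catches the read-write
conflicts responsible for non-serializable executions. The VersionStore interface
and method names are hypothetical stand-ins, not the actual structures of this work.

```java
import java.util.Set;

/** Hypothetical interface over a multi-version store (illustrative only). */
interface VersionStore {
    // Commit timestamp of the latest committed version of a key.
    long lastCommitTs(String key);
}

/** Sketch of commit-time validation for an SI transaction. */
class SiValidator {
    private final VersionStore store;

    SiValidator(VersionStore store) { this.store = store; }

    /** Basic SI rule (first-committer-wins): no key in the write set may
     *  have been committed by a concurrent transaction after our snapshot. */
    boolean writeSetValid(Set<String> writeSet, long snapshotTs) {
        for (String key : writeSet)
            if (store.lastCommitTs(key) > snapshotTs) return false;
        return true;
    }

    /** Additional, conservative check for serializability: reject the
     *  transaction if any key it read was overwritten after its snapshot,
     *  which creates the read-write conflicts SI by itself does not prevent. */
    boolean readSetValid(Set<String> readSet, long snapshotTs) {
        for (String key : readSet)
            if (store.lastCommitTs(key) > snapshotTs) return false;
        return true;
    }
}
```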
For scalability under the scale-out model of cloud
computing platforms, two system architectures were investigated. In the first,
all transaction management functions are executed in a fully decentralized
manner by the application processes. The second is a hybrid approach in which
the conflict detection functions are performed by a dedicated
service. This work is reported in the paper titled
Scalable Transaction Management with Snapshot
Isolation for NoSQL Data Storage System.
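The cooperative recovery model noted above can be illustrated with a minimal
sketch, assuming a hypothetical shared status table in which each transaction
records its commit/abort progress; the exact records and steps of the published
protocol differ. Any process that suspects the owner of a transaction has failed
reads the recorded status and rolls the remaining actions forward or back:

```java
/** Hypothetical transaction states recorded in a shared status table. */
enum TxnStatus { ACTIVE, COMMITTING, COMMITTED, ABORTING, ABORTED }

/** Sketch of cooperative recovery: any process may complete the
 *  commit/abort actions of a transaction whose owner is suspected failed. */
class CooperativeRecovery {
    private final StatusTable statusTable;   // hypothetical shared table
    private final WriteApplier applier;      // hypothetical write-back helper

    CooperativeRecovery(StatusTable t, WriteApplier a) {
        this.statusTable = t; this.applier = a;
    }

    void recover(String txnId) {
        TxnStatus status = statusTable.read(txnId);
        switch (status) {
            case COMMITTING:
                // Roll forward: re-apply any unapplied writes, then finalize.
                applier.applyPendingWrites(txnId);
                statusTable.update(txnId, TxnStatus.COMMITTED);
                break;
            case ACTIVE:
            case ABORTING:
                // Roll back: discard uncommitted versions, then finalize.
                applier.discardWrites(txnId);
                statusTable.update(txnId, TxnStatus.ABORTED);
                break;
            default:
                // Already COMMITTED or ABORTED: nothing left to do.
                break;
        }
    }

    interface StatusTable {
        TxnStatus read(String txnId);
        void update(String txnId, TxnStatus s);
    }
    interface WriteApplier {
        void applyPendingWrites(String txnId);
        void discardWrites(String txnId);
    }
}
```

Because each step is recorded in the shared store and the actions are idempotent,
a recovery attempt that itself fails midway can simply be repeated by another process.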
The
investigation of transaction management techniques for partially replicated data
management in large-scale systems resulted in the development of the Partitioned
Causal Snapshot Isolation (PCSI) protocol. The design of this protocol is
reported in a paper titled
Transaction Management with Causal Snapshot Isolation in Partially Replicated
Databases
published in IEEE SRDS’2014.
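A central ingredient of causal snapshot isolation is that a site applies a remote
update only after all updates it causally depends on are locally visible. The
sketch below shows the standard vector-clock delivery condition for this; the
names are illustrative, and the published PCSI protocol tracks dependencies in
terms of partitions and per-site sequence numbers rather than this simplified
per-site counter.

```java
/** Sketch of the causal delivery check used when propagating updates
 *  between replication sites (illustrative names, not the PCSI structures). */
class CausalDelivery {
    // localClock[i] = number of updates from site i applied locally.
    private final int[] localClock;

    CausalDelivery(int numSites) { localClock = new int[numSites]; }

    /** A remote update from 'sender' carrying vector clock 'deps' can be
     *  applied only if (a) it is the next update from that sender and
     *  (b) every update it causally depends on has already been applied. */
    synchronized boolean canApply(int sender, int[] deps) {
        if (localClock[sender] != deps[sender] - 1) return false;
        for (int i = 0; i < deps.length; i++)
            if (i != sender && localClock[i] < deps[i]) return false;
        return true;
    }

    synchronized void applied(int sender) { localClock[sender]++; }
}
```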
Further refinements of this protocol to scale out its performance
on a large cluster computer are reported in another publication titled
Scalable
Transaction Management for Partially
Replicated Data
published in the IEEE Cloud Computing conference, where it
received the Best Paper Award.
This
project also developed transaction management techniques supporting a spectrum
of consistency models, ranging from eventual consistency and causal consistency to
serializability. Building upon the PCSI transaction management protocol, a model
was developed in this project for simultaneously supporting transactions with
different consistency levels in an application. This model supports transactions
in four classes of consistency levels: serializable, causal snapshot isolation
(CSI), causal snapshot isolation with concurrent commutative updates (CSI-CM),
and asynchronous updates with eventual/causal consistency. This work is reported
in a paper titled
A Transaction Model for Management of Replicated Data with Multiple Consistency
Levels
published in IEEE BigData’2015.
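As a minimal sketch of how an application might declare these classes, the enum
below tags each transaction with its consistency level; the names are hypothetical
rather than the model's actual API. The CSI-CM class is the interesting middle
ground: updates such as counter increments commute, so concurrent updates at
different replicas can be merged without coordination.

```java
/** The four consistency classes supported by the model (names illustrative). */
enum ConsistencyLevel {
    SERIALIZABLE,   // strongest: globally serializable execution
    CSI,            // causal snapshot isolation
    CSI_CM,         // CSI with concurrent commutative updates
    ASYNC           // asynchronous updates with eventual/causal consistency
}

/** Hypothetical transaction handle carrying its declared consistency class. */
class Txn {
    final ConsistencyLevel level;
    Txn(ConsistencyLevel level) { this.level = level; }
}

class Example {
    public static void main(String[] args) {
        // A commutative counter increment can run under CSI_CM, while a
        // balance transfer requiring global ordering runs as SERIALIZABLE.
        Txn increment = new Txn(ConsistencyLevel.CSI_CM);
        Txn transfer  = new Txn(ConsistencyLevel.SERIALIZABLE);
    }
}
```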
Data and the associated transactions
are organized in a hierarchy which is based on consistency levels. Certain rules
are imposed on transactions to constrain information flow across data at
different levels in this hierarchy to ensure the required consistency
guarantees. An example of the use of this model in groupware applications is
reported in another publication titled
A Transaction Model with Multilevel Consistency for Shared Data in Distributed
Groupware Systems
published in IEEE CIC’2016.
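As a toy illustration of such flow rules, the check below prevents data maintained
at a weaker consistency level from being read by a transaction at a stronger
level; this is one plausible rule of the kind described, not necessarily the exact
formulation used in this work.

```java
/** Consistency classes ordered from weakest to strongest (illustrative). */
enum Level { ASYNC, CSI_CM, CSI, SERIALIZABLE }

/** Toy flow-rule check: data at a weaker level must not flow into
 *  transactions (and hence data items) at a stronger level. */
class FlowRules {
    static boolean mayRead(Level txnLevel, Level dataLevel) {
        // ordinal() follows declaration order: ASYNC < CSI_CM < CSI < SERIALIZABLE.
        return dataLevel.ordinal() >= txnLevel.ordinal();
    }
}
```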
A Transactional Model for Parallel Programming of Graph Problems on Cluster
Computers (open source software released under the GNU GPL v3 license)
This project investigated scalable transaction management
techniques for developing a parallel programming framework called Beehive for
graph data analytics problems on cluster computing platforms.
The primary goal of this work has been to
harness fine-grained amorphous parallelism in graph problems through speculative
parallel execution of computation tasks as transactions.
The Beehive framework provides a simple and robust model for
programming large-scale graph data analytics problems on cluster computers. It
provides an object-oriented model for defining nodes, edges, and computation
tasks in a graph problem. It supports multigraphs for problems where multiple
edges may exist between nodes, representing different types of relationships
between them. The graph data is maintained and managed in a key-value based
distributed data storage system maintained in the RAM of the cluster nodes. The
intermediate results of the parallel computations are stored in this shared
in-memory storage. In this model, tasks are scheduled to execute in parallel on
cluster nodes, and each task is executed as a serializable transaction updating
the graph data. The Beehive framework also provides facilities for checkpointing
of computations for recovery/restart from node crashes.
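The flavor of the programming model can be sketched with a shortest-path
relaxation task, shown below; the Task, GraphStore, and doCompute names are
hypothetical stand-ins rather than Beehive's actual API. The framework executes
each task as a serializable transaction over the in-memory graph store, aborting
and re-executing it if it conflicts with a concurrent task, and schedules any new
tasks the computation produces.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical node object stored in the in-memory key-value store. */
class Node {
    long id;
    long distance = Long.MAX_VALUE;        // current shortest-path estimate
    List<Edge> outEdges = new ArrayList<>();
}

class Edge {
    long targetId;
    long weight;
}

/** Hypothetical base class: the framework runs doCompute() as a
 *  serializable transaction and may spawn the tasks it returns. */
abstract class Task {
    abstract List<Task> doCompute(GraphStore graph);
}

interface GraphStore {
    Node get(long nodeId);    // transactional read
    void put(Node node);      // transactional write
}

/** One relaxation step of single-source shortest paths as a task. */
class RelaxTask extends Task {
    private final long nodeId;
    RelaxTask(long nodeId) { this.nodeId = nodeId; }

    @Override
    List<Task> doCompute(GraphStore graph) {
        Node n = graph.get(nodeId);
        List<Task> next = new ArrayList<>();
        for (Edge e : n.outEdges) {
            Node m = graph.get(e.targetId);
            if (n.distance != Long.MAX_VALUE
                    && m.distance > n.distance + e.weight) {
                m.distance = n.distance + e.weight;
                graph.put(m);
                next.add(new RelaxTask(m.id));   // propagate the improvement
            }
        }
        return next;
    }
}
```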
Utilizing the transactional computing model of Beehive, this project
investigated and developed incremental parallel computing techniques for
large-scale graph structures which are dynamic and evolving. This work
investigated and showed the benefits of the incremental computing approach for
several graph problems such as
single-source-shortest paths, k-nearest neighbors, maximal cliques, and
vertex coloring. These results are
reported in a paper titled
Incremental Parallel Computing for Continuous Queries in Dynamic Graphs Using a
Transactional Model, published in
Wiley Concurrency and Computation: Practice and Experience.
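As an informal illustration of the incremental approach for continuous
shortest-path queries, the sketch below (reusing the hypothetical types from the
earlier task example) reacts to a single edge insertion by seeding relaxation
only at the affected vertex, instead of recomputing the query from scratch; it is
not the paper's actual algorithm.

```java
/** On a graph update (here, a new edge), incremental recomputation only
 *  schedules work for the region of the graph the change can affect. */
class IncrementalSssp {
    private final GraphStore graph;   // hypothetical store from the sketch above
    IncrementalSssp(GraphStore graph) { this.graph = graph; }

    /** A newly inserted edge (u, v, w) can only shorten paths passing
     *  through v, so a single relaxation task seeded at u suffices. */
    java.util.List<Task> onEdgeInserted(long u, long v, long w) {
        Node src = graph.get(u);
        Edge e = new Edge();
        e.targetId = v;
        e.weight = w;
        src.outEdges.add(e);
        graph.put(src);
        return java.util.List.of(new RelaxTask(u));
    }
}
```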
This project also
investigated the use of the Beehive framework and its transactional computing
model as a graph-database engine for building location-based publish/subscribe
architectures. This investigation is reported in a paper titled
Design of a Location-based
Publish/Subscribe Service using a Graph-based Computing Model, published
in IEEE CIC’2017.