File Loader

Motivation

The file loaders are programs that read the input file and load the graph into the memory of each host. Each graph problem has its own version of the file loader, because the structure of the Node and Edge differs from problem to problem. For example, the file loader for the Graph Coloring problem is written against that problem's own types, i.e., GCNode, GCEdge, etc.

In all file loaders, the id of the host where the node is to be loaded in memory is decided by a modulo function:

hostId = nodeId % totalHosts.
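
As a minimal sketch of this placement rule (Python is used here purely for illustration; the function name `host_for_node` is hypothetical and not taken from the loaders' source, and node ids are assumed to be non-negative integers):

```python
def host_for_node(node_id: int, total_hosts: int) -> int:
    """Return the id of the host that should hold node_id in memory."""
    if total_hosts <= 0:
        raise ValueError("total_hosts must be positive")
    return node_id % total_hosts


# With 4 hosts, consecutive node ids are spread round-robin across the hosts.
placement = {node_id: host_for_node(node_id, 4) for node_id in range(8)}
```

Because the rule depends only on the node id and the host count, any host can compute where a node lives without consulting the master.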

Here are a few file loaders:

InputFileReaderV2
This is one of the early file loaders and is for the Max Flow problem. This file loader does not take the preprocessed (.pp) file as input; instead, it takes the fgraph file. It creates a new node object for each node, builds a list of Edge objects covering all of that node's neighbors, and stores the list in the node. The nodes are not batched before being flushed to the storage system. It is a single-threaded program (run by the master).
InputFileReaderBatch
This is a file loader for the Graph Coloring problem. This file loader does not take the preprocessed (.pp) file as input; instead, it takes the fgraph file. It creates a new node object for each node, builds a list of Edge objects covering all of that node's neighbors, and stores the list in the node. The nodes are batched, and when the batch size reaches 1000, the batch is sent to the storage server.
InputFileReader
This is a file loader for the Page Rank problem. This file loader does not take the preprocessed (.pp) file as input; instead, it takes the fgraph file. It creates a new node object for each node, builds a list of Edge objects covering all of that node's neighbors, and stores the list in the node. The nodes are batched, and when the batch size reaches 100, the batch is sent to the storage server.
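
The batching used by the batched loaders can be sketched roughly as follows. This is an illustration only, not the loaders' actual code: the function `load_in_batches`, the dictionary-shaped nodes, and the `flush` callback standing in for the storage server are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def load_in_batches(
    nodes: Iterable[Dict],
    batch_size: int,
    flush: Callable[[List[Dict]], None],
) -> int:
    """Group nodes into batches of batch_size and hand each batch to flush.

    Returns the number of batches flushed.
    """
    batch: List[Dict] = []
    flushed = 0
    for node in nodes:
        batch.append(node)
        if len(batch) == batch_size:
            flush(batch)
            flushed += 1
            batch = []  # start a fresh list; the flushed one is not reused
    if batch:  # send the final, partially filled batch
        flush(batch)
        flushed += 1
    return flushed


# Hypothetical stand-in for the storage server: record each batch's size.
sent_sizes: List[int] = []
nodes = ({"id": i, "edges": []} for i in range(250))
batches = load_in_batches(nodes, 100, lambda b: sent_sizes.append(len(b)))
```

With 250 nodes and a batch size of 100, the sketch sends two full batches and one final batch of 50. The last flush matters: without it, any nodes left over after the final full batch would never reach storage.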
InputFileReaderParallel
This is a file loader for the Graph Coloring problem. This file loader takes a preprocessed (.pp) file as input. One thread on the master reads the input file. Ten additional threads (also run by the master) each create the node and edge objects, batch the node objects (batch size 100), and flush each batch to the storage system.
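
One way to picture this reader/worker split is a shared queue between the reading thread and the ten worker threads. The sketch below is an assumption-laden illustration rather than the loader's implementation: the line format (`<nodeId> <neighborId> ...`), the names `parse_line`, `worker`, and `load_parallel`, the queue size, and the `flush` callback are all hypothetical, and the calling thread plays the role of the reader.

```python
import queue
import threading
from typing import Callable, Iterable, List

NUM_WORKERS = 10  # worker thread count, as in the description above
BATCH_SIZE = 100  # batch size, as in the description above
_DONE = object()  # sentinel telling a worker that no more lines will arrive


def parse_line(line: str) -> dict:
    # Assumed line format: "<nodeId> <neighborId> <neighborId> ..."
    ids = [int(tok) for tok in line.split()]
    return {"id": ids[0], "edges": ids[1:]}


def worker(lines: queue.Queue, flush: Callable[[List[dict]], None]) -> None:
    batch: List[dict] = []
    while True:
        item = lines.get()
        if item is _DONE:
            break
        batch.append(parse_line(item))
        if len(batch) == BATCH_SIZE:
            flush(batch)
            batch = []
    if batch:  # flush whatever is left when the input ends
        flush(batch)


def load_parallel(source: Iterable[str], flush: Callable[[List[dict]], None]) -> None:
    lines: queue.Queue = queue.Queue(maxsize=1000)  # bounded so the reader cannot run far ahead
    workers = [
        threading.Thread(target=worker, args=(lines, flush))
        for _ in range(NUM_WORKERS)
    ]
    for t in workers:
        t.start()
    for line in source:  # the calling thread acts as the single reader
        if line.strip():
            lines.put(line)
    for _ in workers:  # one sentinel per worker
        lines.put(_DONE)
    for t in workers:
        t.join()


# Hypothetical stand-in for the storage system; flush must be thread-safe.
stored: List[int] = []
lock = threading.Lock()


def record(batch: List[dict]) -> None:
    with lock:
        stored.extend(node["id"] for node in batch)


load_parallel((f"{i} {i + 1}" for i in range(1050)), record)
```

Each worker keeps its own batch, so workers only contend on the queue and on whatever synchronization the storage client needs.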
InputFileReaderDistributed
This is a file loader for the Graph Coloring problem. This file loader takes a preprocessed (.pp) file as input. Each host reads the input file, and the host's storage server loads a node only if that host is the node's destination; otherwise the node is ignored. This avoids the bottleneck of a single master node loading the data into remote servers. However, since every host reads the entire file and loads only a fraction of it into its local storage server, there is a lot of unnecessary traffic, and hence this file loader does not scale well to larger graphs.
InputFileReaderBucketed
This is a file loader for the Graph Coloring problem. This file loader takes a preprocessed (.pp) file as input. The idea is that the input file is bucketized into smaller files, and each host then processes one of the bucket files to load its respective nodes into its local storage system. The bucketized files are created from the .pp files using the FileBucketizer program (/project/cluster15/GraphGen/FileBucketizer).
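
To illustrate the bucketizing idea, here is a rough sketch that splits an input file into one bucket file per host using the same modulo rule as above. It does not describe how FileBucketizer itself works: the function `bucketize`, the `bucket_<host>.pp` file names, one bucket per host, and the line format (node id as the first whitespace-separated token) are all assumptions made for the example.

```python
import os
import tempfile
from typing import List


def bucketize(input_path: str, out_dir: str, total_hosts: int) -> List[str]:
    """Split input_path into one bucket file per host, keyed by nodeId % total_hosts."""
    paths = [os.path.join(out_dir, f"bucket_{h}.pp") for h in range(total_hosts)]
    buckets = [open(p, "w") for p in paths]
    try:
        with open(input_path) as src:
            for line in src:
                if not line.strip():
                    continue  # skip blank lines
                node_id = int(line.split()[0])  # assumed: node id is the first token
                buckets[node_id % total_hosts].write(line)
    finally:
        for f in buckets:
            f.close()
    return paths


# Small demonstration on a temporary four-node file split across two hosts.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "graph.pp")
    with open(src_path, "w") as f:
        f.write("0 1 2\n1 0\n2 0 3\n3 2\n")
    bucket_paths = bucketize(src_path, tmp, 2)
    with open(bucket_paths[1]) as f:
        bucket_one = f.read().splitlines()
```

Under these assumptions the input is scanned once up front, and each host afterwards reads only the lines it will actually load, rather than the whole file as in the distributed loader.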