In this project, I create a parallel version of the R-tree on the GPU. My implementation supports both parallel R-tree construction and parallel query execution on the GPU.

R-trees are well-known spatial indexing structures and have been widely adopted in many applications for indexing 2D or higher-dimensional data. While processing a single spatial query on an R-tree is typically sub-linear, many applications involve bulk-loading for R-tree construction and batched queries on R-trees. Moreover, as datasets grow increasingly large, single-threaded spatial queries sometimes cannot meet the performance requirements. Parallelization therefore benefits both R-tree construction and R-tree querying.

A number of parallel R-tree construction and query processing algorithms have been proposed for different parallel architectures, but most of them are based on CPU parallelization techniques such as OpenMP. Unlike CPUs, GPU devices usually have a much larger number of processing cores and higher computing power, which has attracted considerable interest in implementing both R-tree bulk-loading construction and R-tree-based spatial window queries on the GPU.

The original data of an R-tree is typically a set of multi-dimensional points. The first step of constructing an R-tree is to generate minimum bounding boxes for these points. The next step is to group nearby rectangles and represent each group with another minimum bounding rectangle at the next higher level of the tree. This process continues until we reach the root of the R-tree, and is also called bulk-loading.

Constructing an R-tree on the GPU is different. One of the main goals is to maximize the usage of CUDA threads while still ensuring the correctness of the construction. There are two approaches to bulk loading on the GPU, the bottom-up approach and the top-down approach. Both approaches are composed of the same basic steps, **sorting** and **packing**. In this project, I use the bottom-up approach to construct my R-tree.

The sorting step orders the rectangles by their x coordinates; its purpose is to prepare the data for the packing phase.

The packing phase scans the sorted list of rectangles. Successive rectangles are assigned to the same R-tree leaf node until that node reaches its fanout; then a new leaf node is created and the scan of the sorted list continues. The same process is applied to the resulting MBRs to build the next level of the R-tree, and is repeated until the root node is obtained.
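As a sequential illustration of these two steps (sort by x, then pack every `fanout` consecutive rectangles into a parent MBR, level by level), here is a C++ sketch; the `Rect` type and helper names are my own, not the project's:

```cpp
#include <algorithm>
#include <vector>

struct Rect { float xmin, ymin, xmax, ymax; };

// Union of two rectangles: the smallest rectangle covering both.
Rect merge(const Rect& a, const Rect& b) {
    return { std::min(a.xmin, b.xmin), std::min(a.ymin, b.ymin),
             std::max(a.xmax, b.xmax), std::max(a.ymax, b.ymax) };
}

// One round of packing: every `fanout` consecutive rectangles
// become one MBR on the next level.
std::vector<Rect> pack_level(const std::vector<Rect>& level, int fanout) {
    std::vector<Rect> next;
    for (size_t i = 0; i < level.size(); i += fanout) {
        Rect mbr = level[i];
        for (size_t j = i + 1; j < level.size() && j < i + fanout; ++j)
            mbr = merge(mbr, level[j]);
        next.push_back(mbr);
    }
    return next;
}

// Bulk-load: sort by xmin, then pack level by level up to the root.
Rect bulk_load_root(std::vector<Rect> rects, int fanout) {
    std::sort(rects.begin(), rects.end(),
              [](const Rect& a, const Rect& b) { return a.xmin < b.xmin; });
    while (rects.size() > 1)
        rects = pack_level(rects, fanout);
    return rects[0];
}
```

On the GPU, the inner loop of `pack_level` is what each thread of the packing kernel computes for its own node, and the sort would be done with a parallel sort.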

This process is illustrated in the following figure:

Here is a simplified kernel showing how packing is done in parallel:

```
__global__ void level_kernel(rtree_node* next_level) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    rtree_node* n = &next_level[i]; // i is the node number in the next level
    init_node(n);
    n->num = get_node_length(i, RTREE_NODE_SIZE); // the number of children it has
    #pragma unroll
    for (size_t j = 0, jend = n->num; j != jend; ++j) {
        /* update the MBR of n using the MBRs of its children */
    }
}
```

Before performing queries on the GPU, we first copy the relevant data structures to GPU global memory. I use three arrays:
a **node_array** to store all the nodes of the R-tree and a **rect_array** to store the MBR of each node.

A third array, **edge_array**, stores the child indices of every R-tree node. Each node has a **starting** attribute,
which is the index of its first child in the edge_array, and a **num_edges** attribute. If a node is a leaf node, its starting attribute is -1.

All of these data structures are constructed on the CPU and copied to the GPU for further use.
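Concretely, the flattened layout described above could be declared as follows (the field names follow the text; the exact types are my assumption):

```cpp
#include <vector>

struct Rect { float xmin, ymin, xmax, ymax; };

// One R-tree node in the flattened representation.
struct RtreeNode {
    int starting;   // index of the first child in edge_array, or -1 for a leaf
    int num_edges;  // number of children
};

// The three arrays copied to GPU global memory:
std::vector<RtreeNode> node_array; // all nodes, root first
std::vector<Rect>      rect_array; // rect_array[i] is the MBR of node_array[i]
std::vector<int>       edge_array; // child indices, grouped per parent node
```

For a root with two leaf children, node_array would hold `{0, 2}` for the root and `{-1, 0}` for each leaf, with edge_array `{1, 2}` pointing at the leaves.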

After all the data structures have been copied to the GPU, we begin the search. We start with the root node and push it into a frontier queue. Then we examine the MBRs of all nodes in the frontier; if an MBR intersects the query rectangle, we put that node's children into a new frontier array. The search continues with the new frontier array until we have found all the overlapping rectangles. The following steps describe how I implement parallel searching on the R-tree:

1. Set the root index in the **search_front** array to 1.
2. For each thread i, if search_front[i] is set:
   - For each node j (looked up from **edge_arr**) that overlaps the query rectangle:
     - If the child is a leaf node, mark it as part of the output.
     - If the child is not a leaf, mark it in the **search_nextfront** array.
3. Copy the **search_nextfront** to the **search_front**. The search finishes when the search_front array is empty.

Take the following figure as an example: the search begins from the root; we pop the root from the frontier and find that Node B and Node C overlap the query rectangle, so they become the new frontier; when we reach the leaf level, we put all the MBRs that overlap the query rectangle into the result array.

This process exploits node-based parallelism, meaning that each thread is dedicated to one node. The following code shows how the search is performed in parallel:

```
__global__ void Kernel(Rect cur) {
    // each thread represents one node
    int tid = blockIdx.x * MAX_THREADS_PER_BLOCK + threadIdx.x;
    if (tid >= num_nodes)
        return;
    // whether the query rectangle overlaps the MBR of this node
    bool flag = overlap(rect[tid], cur);
    // node is in the frontier and its MBR overlaps the query rectangle
    if (d_search_front[tid] && flag) {
        // remove it from the frontier
        d_search_front[tid] = false;
        // reached the leaf level
        if (d_nodes[tid].starting == -1) {
            res[tid] = 1;
            return;
        }
        // put its children into search_nextfront
        for (int i = d_nodes[tid].starting; i < d_nodes[tid].starting + d_nodes[tid].num_edges; i++) {
            int id = d_edges[i];
            d_search_nextfront[id] = true;
        }
    }
}
```

Another kernel function copies the search_nextfront to the search_front. Note the **d_over** flag in this second kernel: it is reset to false on the host before each iteration, and the kernel sets it to true whenever it moves at least one node into the frontier, because a non-empty frontier means the search needs to continue.

```
__global__ void
Kernel2(bool *d_over, bool *d_search_front, bool *d_search_nextfront) {
    int tid = blockIdx.x * MAX_THREADS_PER_BLOCK + threadIdx.x;
    if (tid < num_nodes && d_search_nextfront[tid]) {
        // copy to search_front and reset search_nextfront
        d_search_front[tid] = true;
        *d_over = true;
        d_search_nextfront[tid] = false;
    }
}
```

Currently my project only supports searching for overlap. There are many other types of R-tree queries, but most of them are variants of overlap search.
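For example, containment and point queries can be phrased as small variations of the same overlap test. The predicate names below are my own, sketched for illustration:

```cpp
struct Rect { float xmin, ymin, xmax, ymax; };

// Overlap: the test the search kernel already uses.
bool overlap(const Rect& a, const Rect& b) {
    return a.xmin <= b.xmax && b.xmin <= a.xmax &&
           a.ymin <= b.ymax && b.ymin <= a.ymax;
}

// Containment variant: does rectangle a fully contain rectangle b?
bool contains(const Rect& a, const Rect& b) {
    return a.xmin <= b.xmin && b.xmax <= a.xmax &&
           a.ymin <= b.ymin && b.ymax <= a.ymax;
}

// Point-query variant: a point is a degenerate rectangle, so it reuses overlap.
bool contains_point(const Rect& a, float x, float y) {
    return overlap(a, {x, y, x, y});
}
```

Swapping the predicate inside the search kernel would turn the overlap search into one of these variants without changing the traversal itself.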

The results contain two parts: the performance of R-tree construction and the performance of R-tree queries. I conducted experiments on a host machine with six 2.4 GHz Intel Xeon processors and 15 MB of cache. A single NVIDIA GeForce GTX 480 with 1.5 GB of global memory was used to run the CUDA applications. I downloaded an R-tree sequential implementation from here as the baseline implementation.

The first benchmark for construction is to test the speedup with randomly generated rectangles. I found the following construction times:

The GPU time includes the time to copy the results back from GPU memory to CPU memory. From the graph, we can see that the CPU outperforms the GPU for small datasets. The reason is that copying the original data from the CPU to the GPU takes a lot of time, so the speedup is limited by bandwidth. However, more speedup is achieved as the problem size grows. For the biggest benchmark, I achieve almost 30x speedup.

Another test constructs the R-tree from real-world 2D spatial data. The test cases include roads of Germany, railways of France, and hypsography data, which I downloaded from here. Each test case is a list of rectangles. The following graph shows the construction time for these three real-world datasets:

The above graph shows that GPUs are capable of achieving better performance when bulk-loading real-world datasets, since most of those datasets are quite large.

The next benchmark tests the performance of spatial window query processing on the GPU. As mentioned in the previous section, data transfer between the CPU and GPU is expensive, so the CPU can outperform the GPU when querying only a small number of rectangles. Thus one assumption of this benchmark is that the number of query rectangles is large enough to amortize the overhead of transferring data and results between the CPU and GPU. From my experiments, the number of query rectangles should be at least 100. Another issue is that the R-tree constructed in parallel on the GPU may have a different structure from the R-tree constructed on the CPU, since the construction algorithms are very different. Thus I compare three types of queries: serial query on the R-tree constructed on the CPU, serial query on the R-tree constructed on the GPU, and parallel query. Here is the result I observed:

The result shows that parallel query can outperform serial query to some extent, but the result is not very promising. First, the speedup is modest: in my experiments it reaches at most 25 percent compared with serial query. Second, if the number of rectangles is smaller than 8 million or larger than 16 million, the performance of the parallel query is even worse than serial query on the CPU. I see three reasons:

Parallel searching on the R-tree needs to update two global arrays on every iteration, which brings significant overhead.

Some of the queries do not require extensive computation. Especially when the query rectangle does not have many overlaps, the query on the CPU can converge very fast.

Data transfer between the CPU and GPU is expensive. A large number of query rectangles also means a large amount of result data transferred between the CPU and GPU for each query.

The results suggest that the parallel R-tree is more suitable for dense datasets with lots of overlap. To verify this speculation, I made the test data denser and increased the amount of overlap with the query rectangles by enlarging both the R-tree rectangles and the query rectangles by a factor of 150 percent, and then ran the above benchmark again. Here is the result I observed:

The result shows that the parallel query now does achieve significant speedup. Therefore, for dense datasets or queries with lots of overlap, the parallel R-tree is the better choice. But if the data is sparse, or most of the workload consists of small query rectangles, serial querying on the CPU is enough to do the work.

- Kunjir, Mayuresh, and Aditya Manthramurthy. *Using graphics processing in spatial indexing algorithms.* Research report, Indian Institute of Science, Database Systems Lab, 2009.
- You, Simin, Jianting Zhang, and L. Gruenwald. "GPU-based spatial indexing and query processing using R-Trees." Proc. ACM SIGSPATIAL BigSpatial'13 Workshop, 2013.
- Papadopoulos, Apostolos, and Yannis Manolopoulos. "Parallel bulk-loading of spatial data." *Parallel Computing* 29.10 (2003): 1419-1444.
- Luo, Lijuan, Martin D. F. Wong, and Lance Leong. "Parallel implementation of R-trees on the GPU." *2012 17th Asia and South Pacific Design Automation Conference (ASP-DAC)*, IEEE, 2012.
- "Thinking Parallel, Part II: Tree Traversal on the GPU." http://devblogs.nvidia.com/parallelforall/thinking-parallel-part-ii-tree-traversal-gpu/

The first aim of this project is to implement parallel construction of the R-tree on the GPU. Several papers discuss how to implement it. In the first weeks, I read these papers carefully and tried to find the method best suited to my requirements, and I found that bottom-up construction is a good choice. In the following week, I implemented the bottom-up construction of the R-tree on the GPU. This process was not easy, but I finished it several days before the checkpoint. After my first implementation, I also tried some experiments, such as constructing the R-tree heterogeneously with both the CPU and the GPU and comparing their performance under different conditions.

After successfully constructing the R-tree, my next step was testing. I use an R-tree implementation from here as a comparison for simple tests of my parallel GPU R-tree. I compared the construction performance of these two versions, and the result is quite good: I observe very high speedups, close to 10x with five million rectangles to insert and 24x with ten million rectangles to insert.

So far I find benchmarking a little tricky. Beyond the construction speedup mentioned in the last section, sometimes the performance improvement does not come entirely from parallelism but from implementation details.

My current implementation involves some unnecessary data movement between CPU and GPU, which will have some impact on the overall speedup. Ideally, there should be only two data movements, which are copying the original data from CPU to GPU and copying the results after construction from GPU to CPU. In the next phase, I will try to achieve this goal.

As mentioned in the first section, I've successfully implemented the parallel bottom-up construction of the R-tree on the GPU and benchmarked it against serial construction on the CPU with different workloads. Here are the results I observed:

The above graph shows that when the number of rectangles to insert is small, the CPU serial version outperforms the GPU parallel version, but once the number of rectangles exceeds one million, the CPU version becomes very slow while the GPU version shows only a small increase in construction time and is 3x faster. In fact, at ten million rectangles the GPU version needs only 787 ms while the CPU version needs 19090 ms, which is 24x faster!

In the next phase, my primary goal is to implement parallel R-tree queries on the GPU. Several papers discuss topics such as parallel tree traversal on the GPU and the design of parallel tree queries on the GPU. I will study those papers and integrate parallel queries into my current R-tree implementation. Besides that, I will find better methods and better baselines for benchmarking: I will try to reduce irrelevant factors in the experiments and perform more thorough experiments to measure performance.

For the presentation, I'll show the construction process and the parallel query of my R-tree, along with other examples to illustrate my implementation. I'll also have graphs comparing performance and some implementation details to demonstrate the basic idea.

- **Friday, April 15**: Read related papers and understand the ideas and methods of implementing an R-tree on the GPU.
- **Saturday, April 18**: Complete the first version of the implementation and ensure the R-tree provides most of the functionality.
- **Sunday, April 26**: Study and implement parallel R-tree query on the GPU. Benchmark the performance of parallel query on the GPU against sequential query on the CPU.
- **Friday, May 1**: Further optimize all operations, reduce the sequential parts within each operation to a minimum, and perform more rigorous benchmarking.
- **Friday, May 8**: Finish the final report. Do further optimization, such as using shared memory.

In this project, I will create a parallel R-tree on NVIDIA GPU.

R-tree is an index data structure for the efficient management of spatial data such as rectangles. The key idea of the data structure is to group nearby objects and represent them with their minimum bounding rectangle in the next higher level of the tree; the "R" in R-tree is for rectangle. Since all objects lie within this bounding rectangle, a query that does not intersect the bounding rectangle also cannot intersect any of the contained objects. For example, to manage the shaded rectangles in the left figure, we can construct an R-tree as shown in the right figure.

There has been a lot of parallel work on R-tree queries for different multiprocessor architectures, all of it trying to maximize parallelism for large range queries on a particular architecture. However, only a few papers discuss implementing R-trees on the GPU. Modern GPUs are very efficient at manipulating computer graphics and image processing, and their highly parallel structure makes them more effective than general-purpose CPUs for algorithms that process large blocks of data in parallel. Therefore it is very tempting to implement the R-tree on the GPU for much better parallelism.

The problem is challenging, first, because of the GPU architecture. Programming GPUs is not easy, since most of us are more familiar with programming in a traditional CPU environment. Also, the architecture of a GPU differs from that of a CPU, which makes it more challenging to properly utilize all the available resources.

Another challenge is understanding and implementing the R-tree itself. Although I have taken a database course, this is still my first time completely implementing an R-tree, not to mention implementing it on a GPU. So some operations, like bulk-loading and parallel queries, may be hard to implement.

For machines, I'll use the GHC machines, which have NVIDIA GTX 480 GPUs. Specifically, the NVIDIA GeForce GTX 480 graphics processor is a collection of 15 multiprocessors with 32 streaming processors each. Other useful resources include papers discussing R-tree implementations on GPUs, such as this.

The project has three main areas: implementing a basic R-tree on the GPU, exploring and maximizing the parallelism of the R-tree on the GPU, and benchmarking the implementation against other R-tree implementations.

Specifically:

**Implement basic R-tree on GPU**

- Implement all the basic functionality that a normal R-tree data structure should have.
- Provide an easy-to-use API to use and test the R-tree.
- I assume that the spatial data fits into GPU global memory.

**Explore and maximize parallelism of R-tree on GPU**

- Implement parallel R-tree query.
- Implement parallel R-tree construction.

**Benchmarking**

- Compare my GPU R-tree implementation with other implementations, including a serial implementation on the CPU, a parallel implementation on the CPU, a serial implementation on the GPU, etc.

CUDA makes sense because we've already learned it in class, and I've learned about R-trees in my database and data mining classes. There are also enough GHC machines with NVIDIA GTX 480 GPUs.

- **Friday, April 15**: Read related papers and understand the ideas and methods of implementing an R-tree on the GPU.
- **Saturday, April 18**: Complete the first version of the implementation and ensure the R-tree provides most of the functionality.
- **Friday, April 24**: Implement parallel R-tree query and benchmark its performance against the basic implementation and the query performance of other implementations.
- **Wednesday, April 29**: Implement parallel construction, e.g., parallel bulk-loading of the R-tree on the GPU, and benchmark the performance against other implementations.
- **Friday, May 8**: Finish the final report. Do further optimization, such as using shared memory.