Wednesday, February 17, 2010

Spatial Index RTree

For location-based search, it is very common to search for objects based on their spatial location. e.g. find all restaurants within 5 miles of my current location, or find all schools within the zipcode of 95110 ... etc.

Every spatial object can be represented by an object id, a minimal bounded rectangle (MBR), as well as other attributes. So the space can be represented by a collection of spatial objects. (here we use 2 dimension to illustrate the idea, but the concept can be extended to N dimensions.)

A query can be represented as another rectangle. The query is about locating the spatial objects whose MBR overlaps with the query rectangle.

RTree is a spatial indexing technique such that given a query rectangle, we can quickly locate the spatial object results.

The concept is similar to BTree. We group spatial objects that are close by each other and form a tree whose intermediate nodes contains "close-by" objects. Since the MBR of the parent node contains all MBR of its children, the Objects are close by if their parent's MBR is minimized.


Start from the root, we examine each children's MRB to see if it overlaps with the query MBR. We skip the whole subtree if there is no overlapping, otherwise, we recurse the search by drilling into each child.

Notice that unlike other tree algorithm where only traverse down one path. Our search here needs to traverse down multiple path if the overlaps happen. Therefore, we need to structure the tree to minimize the overlapping as high in the tree as possible. This means we want to minimize the sum of MBR areas along each path (from the root to the leaf) as much as possible.


To insert a new spatial object, starting from the root node, pick the children node whose MBR will be extended least if the new spatial object is added, walk down this path until reaching the leaf node.

If the leaf node has space left, insert the object to the leaf node and update the MBR of the leaf node as well as all its parents. Otherwise, split the leaf node into two (create a new leaf node and copy some of the content of the original leaf node to this new one). And then add the newly created leaf node to the parent of the original leaf node. If the parent has no space left, the parent will be split as well.

If the split goes all the way to the root, the original root will then be split and a new root is created.


Deleting a spatial node will first search for the containing leaf node. Remove the spatial node from the leaf node's content and update its MBR and its parent's MBR all the way to the root. If the leaf node now has less than m node, then we need to condense the node by marking the leaf node to be delete. And then we remove the leaf node from its parent's content as well as updating the . If the parent is now less than m node, we mark the parent to be delete also and remote the parent from the parent's parent. At this point, all the node that is marked delete is removed from the RTree.

Notice that the content with these delete node is not all garbage, since they still have some children that are valid nodes (but were removed from the tree). Now we need to reinsert all these valid nodes back in the tree.

Finally, we check if the root node contains only one child, we throw away the original root and use its own child to become the new root.


Update happens when an existing spatial node changes its dimension. One way is to just change the spatial node's MBR but not change the RTree. A better way (but more expensive) is to delete the node, modify it MBR and then insert it back to the RTree.


J. Andrew Rogers said...

Unfortunately, R-trees have limited utility due to their poor scaling characteristics. Update concurrency, query selectivity, and distributability all rapidly degrade to pathological levels.

The limitation to small, read-mostly indexes has marginalized this algorithm in real-world applications.

nvasil said...

For machine learning applications I would recommend using kd-trees or metric-trees. They are much more stable and they have shown huge speedups on kmeans, knn classifier etc.