Tuesday, May 14, 2013

Interval Tree

Question: Find the intervals from a set of intervals in which a given point lies

Given a set of intervals such as (10,20), (15,25), (28,40), (50,70), (0,9), (60,90), build a data structure. Then query the data structure with a point x to find all the intervals that contain x.

The trivial solution is to visit each interval and test whether it intersects the given point or interval, which requires Θ(n) time, where n is the number of intervals in the collection.
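As a baseline, the linear scan can be sketched in a few lines of Python. This is only an illustration: the function name stab_brute_force is made up here, and the intervals are assumed to be closed (both endpoints included), which the question does not state.

```python
def stab_brute_force(intervals, x):
    """Return every (start, end) interval containing x by checking all n of them."""
    # Assumption: intervals are closed, so x == start or x == end counts as a hit.
    return [(lo, hi) for (lo, hi) in intervals if lo <= x <= hi]


intervals = [(10, 20), (15, 25), (28, 40), (50, 70), (0, 9), (60, 90)]
print(stab_brute_force(intervals, 17))  # [(10, 20), (15, 25)]
```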

In a simple case, the intervals do not overlap and they can be inserted into a simple binary search tree and queried in O(log n) time. However, with arbitrarily overlapping intervals, there is no way to compare two intervals for insertion into the tree since orderings sorted by the beginning points or the ending points may be different. A naive approach might be to build two parallel trees, one ordered by the beginning point, and one ordered by the ending point of each interval. This allows discarding half of each tree in O(log n) time, but the results must be merged, requiring O(n) time. This gives us queries in O(n + log n) = O(n), which is no better than brute-force.
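For the non-overlapping case, here is a rough sketch. It swaps the binary search tree for a sorted list plus binary search (Python's bisect module), which gives the same O(log n) lookup for a static set. The names disjoint, starts and find_disjoint are hypothetical, and closed intervals are assumed.

```python
from bisect import bisect_right

# Non-overlapping intervals, sorted once by their start point.
disjoint = sorted([(28, 40), (0, 9), (10, 20), (50, 70)])
starts = [lo for lo, _ in disjoint]


def find_disjoint(x):
    """Return the single interval containing x, or None if there is none."""
    # Rightmost interval whose start is <= x; only that one can contain x.
    i = bisect_right(starts, x) - 1
    if i >= 0 and disjoint[i][1] >= x:
        return disjoint[i]
    return None


print(find_disjoint(15))  # (10, 20)
print(find_disjoint(45))  # None
```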

Interval trees solve this problem.

To construct an interval tree from the input given in the question, take the following steps:

  1. First, arrange the endpoints of all the intervals in increasing order. For the input above this gives 0 9 10 15 20 25 28 40 50 60 70 90. If there are N intervals, there are 2N endpoints, so sorting takes O(N log N) time. The entire range of all the intervals is now 0-90.

  2. We start by taking the entire range of all the intervals and dividing it in half at x_center (in practice, x_center should be picked to keep the tree relatively balanced). This gives three sets of intervals, those completely to the left of x_center which we'll call S_left, those completely to the right of x_center which we'll call S_right, and those overlapping x_center which we'll call S_center.

  3. The intervals in S_left and S_right are recursively divided in the same manner until there are no intervals left.

  4. The intervals in S_center that overlap the center point are stored in a separate data structure linked to the node in the interval tree. This data structure consists of two lists, one containing all the intervals sorted by their beginning points, and another containing all the intervals sorted by their ending points.

The result is a binary tree with each node storing:

  • A center point
  • A pointer to another node containing all intervals completely to the left of the center point
  • A pointer to another node containing all intervals completely to the right of the center point
  • All intervals overlapping the center point sorted by their beginning point
  • All intervals overlapping the center point sorted by their ending point
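The construction above might look roughly like this in Python. Treat it as a sketch under a few assumptions: the names Node and build are illustrative, intervals are closed, and x_center is taken to be the median endpoint (one way of keeping the tree balanced) rather than the midpoint of the range. The by_end list is kept in descending order so that a later query can scan it from the front.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Interval = Tuple[int, int]


@dataclass
class Node:
    center: int                 # x_center
    left: Optional["Node"]      # subtree built from S_left
    right: Optional["Node"]     # subtree built from S_right
    by_start: List[Interval]    # S_center sorted by start, ascending
    by_end: List[Interval]      # S_center sorted by end, descending


def build(intervals: List[Interval]) -> Optional[Node]:
    if not intervals:
        return None
    # Assumption: use the median endpoint as x_center.
    points = sorted(p for iv in intervals for p in iv)
    center = points[len(points) // 2]
    s_left = [iv for iv in intervals if iv[1] < center]
    s_right = [iv for iv in intervals if iv[0] > center]
    s_center = [iv for iv in intervals if iv[0] <= center <= iv[1]]
    return Node(
        center,
        build(s_left),
        build(s_right),
        sorted(s_center, key=lambda iv: iv[0]),
        sorted(s_center, key=lambda iv: iv[1], reverse=True),
    )


intervals = [(10, 20), (15, 25), (28, 40), (50, 70), (0, 9), (60, 90)]
root = build(intervals)
print(root.center, root.by_start)  # 28 [(28, 40)]
```

Because x_center is always an endpoint of some interval, S_center is never empty, so both recursive calls receive strictly fewer intervals and the recursion terminates.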

To find the intervals in which a given number 'x' lies we do the following:

  1. For each tree node, x is compared to x_center, the midpoint used in node construction above. If x is less than x_center, the leftmost set of intervals, S_left, is considered. If x is greater than x_center, the rightmost set of intervals, S_right, is considered.

  2. As each node is processed while we traverse the tree from the root to a leaf, the ranges in its S_center are examined. If x is less than x_center, we know that all intervals in S_center end after x, or they could not also overlap x_center. Therefore, we need only find those intervals in S_center that begin before x, using the list sorted by beginning points. Suppose we find the closest beginning point no greater than x in this list. All ranges from the beginning of the list to that point overlap x because they begin before x and end after x (as we know because they overlap x_center, which is larger than x). Thus, we can simply enumerate intervals from the front of the list until a beginning point exceeds x.

  3. Likewise, if x is greater than x_center, we know that all intervals in S_center must begin before x, so we find those intervals that end after x using the list sorted by interval endings.

  4. If x exactly matches x_center, all intervals in S_center can be added to the results without further processing and tree traversal can be stopped.
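The four query steps can be sketched as follows. The block repeats a compact tree builder so that it runs on its own. The names Node and stab are illustrative, intervals are assumed closed, and x_center is assumed to be the median endpoint.

```python
class Node:
    """One interval-tree node; built recursively from a non-empty interval list."""

    def __init__(self, intervals):
        points = sorted(p for iv in intervals for p in iv)
        self.center = points[len(points) // 2]  # assumption: median endpoint
        center = [iv for iv in intervals if iv[0] <= self.center <= iv[1]]
        left = [iv for iv in intervals if iv[1] < self.center]
        right = [iv for iv in intervals if iv[0] > self.center]
        self.by_start = sorted(center)  # ascending by start
        self.by_end = sorted(center, key=lambda iv: iv[1], reverse=True)
        self.left = Node(left) if left else None
        self.right = Node(right) if right else None


def stab(node, x):
    """Collect all intervals containing x while walking from the root down."""
    result = []
    while node is not None:
        if x < node.center:
            # Every interval here ends at or after center > x; keep those starting <= x.
            for iv in node.by_start:
                if iv[0] > x:
                    break
                result.append(iv)
            node = node.left
        elif x > node.center:
            # Every interval here starts at or before center < x; keep those ending >= x.
            for iv in node.by_end:
                if iv[1] < x:
                    break
                result.append(iv)
            node = node.right
        else:
            # x == x_center: all of S_center matches, and no subtree can.
            result.extend(node.by_start)
            node = None
    return result


intervals = [(10, 20), (15, 25), (28, 40), (50, 70), (0, 9), (60, 90)]
root = Node(intervals)
print(sorted(stab(root, 17)))  # [(10, 20), (15, 25)]
```

Each node contributes only the intervals it actually reports plus one failed comparison, so on a balanced tree a query costs O(log n + k) for k reported intervals.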

Intersection with Interval

First, we can reduce the case where an interval R is given as input to the simpler case where a single point is given as input. We first find all ranges with beginning or end points inside the input interval R using a separately constructed tree. In the one-dimensional case, we can use a simple tree containing all the beginning and ending points in the interval set, each with a pointer to its corresponding interval.

A binary search in O(log n) time for the beginning and end of R reveals the minimum and maximum points to consider. Each point within this range references an interval that overlaps our range and is added to the result list. Care must be taken to avoid duplicates, since an interval might both begin and end within R. This can be done using a binary flag on each interval to mark whether or not it has been added to the result set.

The only intervals not yet considered are those overlapping R that do not have an endpoint inside R, in other words, intervals that enclose it. To find these, we pick any point inside R and use the point-query algorithm described above to find all intervals intersecting that point (again, being careful to remove duplicates).
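A possible sketch of this reduction is shown below, with several stand-ins that are assumptions rather than part of the description: a sorted list of endpoints searched with bisect replaces the endpoint tree, a set of interval indices replaces the per-interval flag, and the point query is passed in as a callable named stab. In the demo, stab is a brute-force placeholder where an interval-tree point query would normally go. EndpointIndex and overlapping are hypothetical names, and intervals are assumed closed.

```python
from bisect import bisect_left, bisect_right


class EndpointIndex:
    """Sorted endpoints, each paired with the index of its interval."""

    def __init__(self, intervals):
        self.intervals = intervals
        self.endpoints = sorted(
            (p, i) for i, iv in enumerate(intervals) for p in iv
        )
        self.keys = [p for p, _ in self.endpoints]

    def overlapping(self, q_lo, q_hi, stab):
        """Return intervals overlapping [q_lo, q_hi]; stab(x) yields indices containing x."""
        # Step 1: intervals with a start or end point inside the query range.
        first = bisect_left(self.keys, q_lo)
        last = bisect_right(self.keys, q_hi)
        hits = {i for _, i in self.endpoints[first:last]}  # the set removes duplicates
        # Step 2: intervals enclosing the range contain any point of it, e.g. q_lo.
        hits.update(stab(q_lo))
        return [self.intervals[i] for i in sorted(hits)]


intervals = [(10, 20), (15, 25), (28, 40), (50, 70), (0, 9), (60, 90)]
index = EndpointIndex(intervals)


def stab(x):
    # Placeholder point query (linear scan); an interval tree would replace this.
    return [i for i, (lo, hi) in enumerate(intervals) if lo <= x <= hi]


print(index.overlapping(18, 30, stab))  # [(10, 20), (15, 25), (28, 40)]
print(index.overlapping(62, 65, stab))  # [(50, 70), (60, 90)]
```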

The complete Java implementation can be found here:

http://thekevindolan.com/2010/02/interval-tree/
