Sunday, March 20, 2011

Big File containing billions of numbers

Q: There is a big file containing billions of numbers. How do we find the maximum 10 numbers from this file?

A priority queue (heap) is a natural data structure for this question. Of course, loading all the data and building one giant heap is not a good idea. The main challenge is how we manage chunks of data in memory and still arrive at the answer.

Divide the data into chunks of 1000 elements, build a heap over each chunk, and extract the top 10 elements from each heap one by one. Finally, sort all of these sets of 10 and take the top 10 among them. But where do we store the 10 elements from each heap? With billions of numbers there are millions of chunks, so those intermediate sets of 10 may themselves require a large amount of memory.

Reusing the top 10 elements from the earlier heap in each subsequent chunk solves this problem. Take a first block of 1000 elements; each subsequent block then consists of 990 new elements plus the 10 carried over from the previous heap. The last block may contain 990 elements or fewer, and taking the top 10 elements from the final heap gives you the answer.

Time complexity: O( (n/1000) × (cost of heap-selecting 10 from a block of 1000) ) = O( n ), since the per-block cost is a constant.


Bharath said...

Although heaps are a good choice for this, don't you think the selection algorithm would be as good, if not faster, for this problem? It also has linear time complexity. So I don't quite understand how priority queues are the "best" data structure for this problem.
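The selection algorithm Bharath mentions can be sketched as a quickselect that partitions in descending order until the 10 largest elements occupy the front of the array. This is an illustrative sketch, not code from the post; the function name `quickselect_top_k` is assumed, and it works in place on an in-memory list (for a file that doesn't fit in memory it would apply per chunk).

```python
import random
from typing import List

def quickselect_top_k(a: List[int], k: int) -> List[int]:
    """Hoare-style selection (assumed helper): partition around random
    pivots until positions 0..k-1 hold the k largest. Expected O(n)."""
    lo, hi = 0, len(a) - 1
    while lo < hi:
        pivot = a[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:
            while a[i] > pivot:     # descending partition: big values left
                i += 1
            while a[j] < pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        # Narrow to the side that still contains index k-1
        if k - 1 <= j:
            hi = j
        elif k - 1 >= i:
            lo = i
        else:
            break                   # k-1 lies in the pivot-equal middle
    return a[:k]
```

Note the returned k elements are not sorted among themselves; a final sort of 10 items is trivial if order matters.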

Eric Pruneau said...

Why would you want to keep a heap of 1000 elements when you only need to keep the 10 biggest? A sorted array of your 10 biggest elements so far is enough.
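Eric's suggestion can be sketched as a single streaming pass that maintains a sorted array of the 10 biggest elements seen so far. A minimal sketch, assuming an in-memory iterable stands in for the file stream; the name `top10_running` is mine.

```python
import bisect
from typing import Iterable, List

def top10_running(numbers: Iterable[int]) -> List[int]:
    """Keep a sorted list of the 10 biggest so far (assumed helper).
    Each element costs at most a binary search plus an insert into 10 slots."""
    best: List[int] = []            # ascending; best[0] is the 10th-largest
    for x in numbers:
        if len(best) < 10:
            bisect.insort(best, x)
        elif x > best[0]:
            bisect.insort(best, x)  # insert the newcomer in sorted position
            best.pop(0)             # evict the old smallest of the ten
    return best
```

Since the array never exceeds 10 entries, each step is O(1) in practice, and the whole pass is O(n) with only one comparison per non-qualifying element.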