Tuesday, March 15, 2011

External Merge Sort

If data to be sorted doesn’t fit into main memory external sorting is applicable. External merge sort can be separated into two phases:

  1. Create initial runs (run is a sequence of records that are in correct relative order or in other words sorted chunk of original file).
  2. Merge created runs into single sorted file.

Let’s assume that M records at the same time are allowed to be loaded into main memory. One of the ways to create initial runs is to successively read M records from original file, sort them in memory and write back to disk.

The core structure behind this algorithm is priority queue. Taking one by one current minimum element out of the queue forms ordered sequence.

The algorithm can be described as follows:

  1. Create two priority queues (that will contain items for current and next runs respectively) with capacity of M records.
  2. Prefill current run priority queue from unsorted file with M records.
  3. Create current run if there are elements in the current run priority queue:
    1. Take minimum element out of current run priority queue and append it to current run (basically write it to disk).
    2. Take next element from unsorted file (this is the replacement part of the algorithm) and compare it with just appended element.
    3. If it is less then it cannot be part of the current run (or otherwise order will be destroyed) and thus it is queued to the next  run priority queue.
    4. Otherwise it is part of the current run and it is queued to the current run priority queue.
    5. Continue steps 1 through 4 until current run priority queue is empty.
  4. Switch current and next runs priority queues and repeat step 3.

At any given moment at most M records are loaded into main memory as single written element into current run is replaced with single element from unsorted file if any (depending on comparison it either goes into current or next run).

Next step is to merge created initial runs. For the merge step we will use simplified algorithm based on k-way merge.

  1. Append created runs into a queue.
  2. Until there are more than one run in the queue:
    1. Dequeue and merge K runs into a single run and put it into the queue.
  3. Remaining run represents sorted original file.

No comments: