Monday, February 18, 2013

How to detect duplicate documents in billions of urls

Question: You have a billion urls, each pointing to a huge page. How do you detect duplicate documents? Two documents are duplicates if:

  1. They have exactly the same links inside the page
  2. They have the same title
  3. Their creation times are the same


  1. Pages are huge, so bringing all of them into memory is a costly affair. We need a shorter representation of pages in memory. A hash is an obvious choice for this.
  2. Billions of urls exist, so we don’t want to compare every page with every other page (that would be O(n²)).

Based on the above two observations we can derive an algorithm which is as follows:

  1. Iterate through the pages and compute the hash of each one.
  2. Check if the hash value is in the hash table. If it is, throw out the url as a duplicate. If it is not, keep the url and insert it into the hash table.

This algorithm will provide us a list of unique urls. But wait, can this fit on one computer?
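
As a minimal single-machine sketch of those two steps (the String.hashCode call stands in for a real page hash, and the sample urls are made up):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DedupSketch {

    // Returns the urls whose page-content hashes have not been seen before.
    static List<String> uniqueUrls(Map<String, String> urlToPage) {
        Set<Integer> seenHashes = new HashSet<Integer>();
        List<String> unique = new ArrayList<String>();
        for (Map.Entry<String, String> e : urlToPage.entrySet()) {
            int hash = e.getValue().hashCode(); // placeholder for a real page hash
            if (seenHashes.add(hash)) {         // add() returns false if the hash was already present
                unique.add(e.getKey());         // new hash: keep the url
            }                                   // duplicate hash: throw the url out
        }
        return unique;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<String, String>();
        pages.put("http://a.com", "<html>same content</html>");
        pages.put("http://b.com", "<html>same content</html>"); // duplicate of a.com
        pages.put("http://c.com", "<html>other content</html>");
        System.out.println(uniqueUrls(pages)); // [http://a.com, http://c.com]
    }
}
```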

  • How much space does each page take up in the hash table?
    • Each page hashes to a four byte value.
    • Each url is an average of 30 characters, so that’s another 30 bytes at least.
    • Each url takes up roughly 34 bytes.
  • 34 bytes * 1 billion = 31.6 gigabytes. We’re going to have trouble holding that all in memory!

What do we do?

  • We could split this up into files. We’ll have to deal with the file loading / unloading—ugh.
  • We could hash to disk. Size wouldn’t be a problem, but access time might: a hash table on disk would require a random-access read for each check and a write to store each viewed url, and each access could take milliseconds waiting on seek and rotational latency. Elevator algorithms could eliminate some of the random bouncing from track to track.
  • Or, we could split this up across machines, and deal with network latency. Let’s go with this solution, and assume we have n machines.
    • First, we hash the document to get a hash value v
    • v%n tells us which machine this document’s hash table can be found on.
    • v/n is the value in the hash table that is located on its machine.
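
The routing above can be sketched in a few lines; the machine count and the sample hash value here are arbitrary assumptions for illustration:

```java
public class ShardRouter {

    // Which of the n machines holds this document's hash table entry.
    static int machineFor(long v, int n) {
        return (int) (v % n);
    }

    // The key under which the entry is stored on that machine.
    static long keyOnMachine(long v, int n) {
        return v / n;
    }

    public static void main(String[] args) {
        int n = 8;               // assume 8 machines
        long v = 3405691582L;    // some document's hash value
        System.out.println("machine " + machineFor(v, n) + ", key " + keyOnMachine(v, n));
        // prints: machine 6, key 425711447
    }
}
```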

Minimum Manhattan distance

You have a grid with certain intersections marked. We'd like to find the intersection that is closest to all the marked intersections; that is, we need to find a point such that the sum of Manhattan distances from it to all the marked points is minimum.



The cool thing about the Manhattan distance is that it comprises two independent components: the distance along the x coordinate and the distance along the y coordinate. Thus you can solve two simpler tasks and merge their results to obtain the desired result.

The simpler task is this: given points on a line, find the point on the line that minimizes the sum of the absolute distances to all the points. If there are many, find all of them (they always turn out to form a single segment, which is easy to prove). The segment is determined by the median point (or points) of the set. By median I mean a point that has an equal number of points to its left and to its right. When the number of points is even there is no single such point, and the two middle points form the endpoints of the segment.

Here are examples of solutions to this simpler task. If the points on the line are as follows (each | marks an empty integer position):
-4 | | | 0 | 2 3 4
the solution is the single point 2. If the points on the line are:
-4 | | | 0 | 2 3
the whole segment [0, 2] is the solution.
You solve this task separately for the x and the y coordinate and then merge the results to obtain the rectangle of minimum-distance points.

And now an example of the solution for the initial task. Imagine you want to find the points with minimum total Manhattan distance to the set (0, 6), (1, 3), (3, 5), (3, 3), (4, 7), (2, 4).
You form the two simpler tasks:
For x:
0 1 2 3 3 4
And here the solution is the segment [2, 3] (note that the point 3 appears twice in this multiset).
For y:
3 3 4 5 6 7
Here the solution is the segment [4, 5].
Finally we get that the solution of the initial task is the rectangle:
2 <= x <= 3; 4 <= y <= 5

Let's speak about complexity.

The complexity of the whole task is the same as that of the simpler task (because, as discussed, the solution consists of solving two simpler tasks). Many people will solve it by sorting and then choosing the medians. However, that costs O(n log n), where n is the number of points in the input set.

This can be improved by using a better algorithm for finding the kth element, which follows the same partitioning approach as quicksort and runs in expected O(n) time. Even in the case of two median points this remains linear (two runs of the same algorithm), so the total complexity of the algorithm becomes O(n). The task cannot be solved any faster, since merely reading the input already takes linear time.
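
The median procedure above can be sketched in Java; for brevity this version sorts (so O(n log n)) rather than using the linear selection just described:

```java
import java.util.Arrays;

public class ManhattanMedian {

    // Returns {lo, hi}: the segment of points minimizing the total absolute
    // distance to all the given points on a line.
    static int[] medianSegment(int[] points) {
        int[] sorted = points.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        if (n % 2 == 1) {
            // odd count: the middle point is the unique median
            int m = sorted[n / 2];
            return new int[] { m, m };
        }
        // even count: any point between the two middle values is optimal
        return new int[] { sorted[n / 2 - 1], sorted[n / 2] };
    }

    public static void main(String[] args) {
        // the example set (0,6), (1,3), (3,5), (3,3), (4,7), (2,4)
        int[] xs = { 0, 1, 3, 3, 4, 2 };
        int[] ys = { 6, 3, 5, 3, 7, 4 };
        int[] xSeg = medianSegment(xs);
        int[] ySeg = medianSegment(ys);
        System.out.println(xSeg[0] + " <= x <= " + xSeg[1] + "; "
                + ySeg[0] + " <= y <= " + ySeg[1]);
        // prints: 2 <= x <= 3; 4 <= y <= 5
    }
}
```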

Friday, February 8, 2013

LRUCache in java

The simplest way to create a cache using LinkedHashMap is to extend it. The constructor takes as an argument the maximum number of entries we want a Cache object to hold. The superclass constructor has three arguments: the initial capacity of the map, the load factor, and a boolean argument that tells the LinkedHashMap constructor to keep entries in access order instead of the default insertion order.

The removeEldestEntry() method overrides a default implementation in LinkedHashMap and is where we determine the policy for removing the oldest entry. In this case, we return true when the cache has more entries than our defined capacity.

Hit rate is usually expressed as a percentage; a hit rate above 80% is usually pretty good for a cache. We can add a few things to our Cache class to make it possible to monitor performance, so that we can tune the cache by setting an optimum size. We need a counter for the number of accesses and another for the number of hits, getter methods to retrieve those values after the cache has been running for a while, and finally we must override the get() method to update the counts.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

public class LRUCache extends LinkedHashMap<Object, Object> {

    /** serialUID */
    private static final long serialVersionUID = -4412392853140727526L;

    private final int capacity;
    private final AtomicLong accessCount = new AtomicLong(0);
    private final AtomicLong hitCount = new AtomicLong(0);

    /**
     * Constructs a caching object with access order instead of insertion order.
     *
     * @param capacity the capacity of the cache, after which the eldest entry
     *                 will be removed
     */
    public LRUCache(int capacity) {
        /*
         * Passing a load factor of 1.2f avoids resizing of the map until it
         * reaches full capacity; passing true as accessOrder ensures the
         * eldest entry is determined by access order, not insertion order.
         */
        super(capacity + 1, 1.2f, true);
        this.capacity = capacity;
    }

    @Override
    public Object get(Object key) {
        accessCount.incrementAndGet();
        if (containsKey(key)) {
            hitCount.incrementAndGet();
            return super.get(key);
        }
        return null;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Object, Object> eldest) {
        return size() > capacity;
    }

    /**
     * Returns the number of times the cache was accessed.
     *
     * @return the access count
     */
    public long getAccessCount() {
        return accessCount.get();
    }

    /**
     * Returns the number of times an entry was found in the cache.
     *
     * @return the hit count
     */
    public long getHitCount() {
        return hitCount.get();
    }
}
Appending the last N nodes of a linked list to the beginning

Question: Append the last n nodes of a linked list to the beginning of the list

eg: 1->2->3->4->5->6

if n=2, the result is 5->6->1->2->3->4


Solution: This is the same approach as the previous problem. Traverse until you have passed n elements, then start a second pointer from the head in the same loop; there will always be a gap of n nodes between the two pointers. When the first pointer reaches the end of the linked list, the second pointer will be n nodes behind it. Then just rewire the pointers.

node *prepend(node *root, int k)
{
    node *prev, *curr;
    curr = root;
    /* advance curr k nodes ahead of the head */
    for (int i = 0; i < k; i++) {
        curr = curr->next;
        if (curr == NULL)
            return NULL;
    }
    prev = root;
    /* move both pointers until curr reaches the last node; prev then
       points to the node just before the final k nodes */
    while (curr->next != NULL) {
        curr = curr->next;
        prev = prev->next;
    }
    curr->next = root;       /* last node now points at the old head */
    root = prev->next;       /* new head is the first of the last k nodes */
    prev->next = NULL;       /* terminate the list */
    return root;
}

Printing Last n lines in a file.

Question: Print the last n lines of a file in one iteration

Solution: This problem is similar to finding the nth element from the end of a linked list. We use two readers, one starting at the beginning of the file and one trailing behind. First move the first reader n lines ahead in the file. Then advance both readers one line at a time until the first reader hits EOF. At that point the second reader is at the nth line from the end of the file, so we print every remaining line it reads.


import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

class PrintLastN {
    public void printLastNLinesOfFile(String fileName, int numberOfLines) {
        BufferedReader pointer1 = null;
        BufferedReader pointer2 = null;
        try {
            if (numberOfLines <= 0)
                throw new IllegalArgumentException(
                        "Invalid Number of Lines: Number of lines has to be greater than 0");
            int i = 0;
            pointer1 = new BufferedReader(new FileReader(fileName));
            // Walk pointer1 through the file; once it is numberOfLines lines in,
            // open pointer2 and keep it exactly numberOfLines lines behind.
            while (pointer1.readLine() != null) {
                if (i == numberOfLines) {
                    pointer2 = new BufferedReader(new FileReader(fileName));
                    pointer2.readLine();
                } else if (i > numberOfLines) {
                    pointer2.readLine();
                }
                i++;
            }
            if (pointer2 != null) {
                // pointer2 now sits numberOfLines lines before the end of the file
                String line;
                while ((line = pointer2.readLine()) != null) {
                    System.out.println(line);
                }
            } else {
                System.out.println("Invalid Number of Lines: Number of lines are less than expected");
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        } finally {
            try {
                if (pointer1 != null)
                    pointer1.close();
                if (pointer2 != null)
                    pointer2.close();
            } catch (IOException e) {
                // nothing more to do if close fails
            }
        }
    }
}
Tuesday, February 5, 2013

Kth Largest element in a sorted 2D matrix

Question: Given an NxN matrix of numbers such that all rows are sorted and all columns are sorted, find the K-th largest number in the matrix.

Solution: The simplest approach is to sort the entire matrix by converting it into a single array; that takes O(N^2 log N). A better approach is to merge the N sorted rows, which takes O(N^2). Even that seems costly. Can we do better than that?

We can construct an external structure like a priority queue and use it to do a best-first traversal of the matrix. It is an O(K log K) solution. But remember K can be as large as N * N.


import java.util.Comparator;
import java.util.PriorityQueue;

// given a 2D NxN matrix where both rows and columns are sorted, find the kth largest element.

// Solution - use a priority queue to do a best-first traversal from the largest corner. O(K log K).

class Node {
    public int num;
    public int i;
    public int j;

    public Node(int a, int b, int c) {
        this.num = a;
        this.i = b;
        this.j = c;
    }
}

public class PQProblem {

    public static int N = 4;

    public static int getKth(int[][] arr, PriorityQueue<Node> pq, int count, int k, boolean[][] bit) {
        Node top = pq.poll(); // largest remaining candidate
        count++;
        if (count == k)
            return top.num;

        // push the two neighbors that are the next smaller candidates
        if (top.i - 1 >= 0 && !bit[top.i - 1][top.j]) {
            pq.offer(new Node(arr[top.i - 1][top.j], top.i - 1, top.j));
            bit[top.i - 1][top.j] = true;
        }
        if (top.j - 1 >= 0 && !bit[top.i][top.j - 1]) {
            pq.offer(new Node(arr[top.i][top.j - 1], top.i, top.j - 1));
            bit[top.i][top.j - 1] = true;
        }
        return getKth(arr, pq, count, k, bit);
    }

    public static void main(String[] args) {
        // sample matrix; the last three rows are reconstructed example data
        int[][] arr = { { 1, 5, 10, 12 },
                        { 4, 7, 13, 16 },
                        { 8, 9, 15, 18 },
                        { 11, 14, 17, 20 } };
        boolean[][] bit = new boolean[N][N];
        bit[N - 1][N - 1] = true;

        // get the k-th largest element from the 2D matrix.
        int k = 9;

        // max-heap: larger num first
        PriorityQueue<Node> pq = new PriorityQueue<Node>(k + 1,
                new Comparator<Node>() {
                    public int compare(Node a, Node b) {
                        return b.num - a.num;
                    }
                });

        pq.offer(new Node(arr[N - 1][N - 1], N - 1, N - 1));

        int res = getKth(arr, pq, 0, k, bit);
        System.out.println(k + "th largest element is " + res);
    }
}

Friday, February 1, 2013

Local Minimum of an array

Problem: Given an array ‘a’ of N distinct integers, design an O(log N) algorithm to find a local minimum: an index i such that a[i-1] > a[i] < a[i+1].

Solution: If a[0] < a[1] or a[N - 2] > a[N - 1] then a[0] or a[N - 1] is the local minimum, respectively. Pick a[mid] where mid = N / 2. If it's not the answer, we have three cases:

   1)  If a[mid - 1] < a[mid] < a[mid + 1], lower half will contain the local minimum.

   2)  If a[mid - 1] > a[mid] > a[mid + 1], upper half will contain the local minimum.

   3)  If a[mid - 1] < a[mid] > a[mid + 1], either half will contain the local minimum.

Search on the new interval recursively.


Equivalently, with 1-indexed entries: if a[1] < a[2], a[1] is a local minimum (LM). Otherwise, if a[n-1] > a[n], a[n] is an LM. Otherwise, the array starts out descending (going from a[1] to a[2]) and ends up ascending (going from a[n-1] to a[n]), so an LM must exist somewhere in between. Examine a[n/2]. If a[n/2] > a[n/2+1], then by the same reasoning there is an LM in the upper half of the array. Similarly, if a[n/2] < a[n/2+1] there is an LM in the lower half. Solve recursively in the appropriate half.

public class LocalMinimum {

    public static int findLocalMinimum(int[] elements, int lowIndex, int highIndex) {
        if (lowIndex > highIndex) {
            return -1;
        }
        if (lowIndex == highIndex) {
            return lowIndex;
        }
        if (lowIndex + 1 == highIndex) {
            return (elements[lowIndex] < elements[highIndex]) ? lowIndex : highIndex;
        }

        int midIndex = (lowIndex + highIndex) / 2;

        // mid is itself a local minimum
        if (elements[midIndex] <= elements[midIndex - 1]
                && elements[midIndex] <= elements[midIndex + 1]) {
            return midIndex;
        }

        // otherwise recurse toward a smaller neighbor
        if (elements[midIndex] >= elements[midIndex + 1]) {
            return findLocalMinimum(elements, midIndex, highIndex);
        } else {
            return findLocalMinimum(elements, lowIndex, midIndex);
        }
    }

    public static void main(String[] args) {
        int[] arr = { 8, 5, 4, 3, 1, 2, 6, 9 };
        int index = findLocalMinimum(arr, 0, arr.length - 1);
        System.out.println("local minimum is " + arr[index]);
    }
}