Tuesday, September 27, 2011

Design a web crawler

A web crawler is a program that traverses the hyperlink structure of web pages. The crawler visits each web page by following the hyperlinks on previously visited pages. While it is crawling pages to extract hyperlinks, a web crawler also saves each page it has visited.

Crawler basic algorithm

  1. Remove a URL from the list of unvisited URLs
  2. Determine the IP address of its host name
  3. Download the corresponding document
  4. Extract any links contained in it
  5. For each extracted link, if the URL is new, add it to the list of unvisited URLs
  6. Process the downloaded document
  7. Go back to step 1 (see the sketch below)
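
A minimal single-threaded sketch of this loop in C++. The helpers fetchPage, extractLinks, and processDocument are hypothetical stubs standing in for a real HTTP client and HTML parser; this is a sketch of the algorithm above, not a production crawler:

    #include <queue>
    #include <set>
    #include <string>
    #include <vector>

    // Hypothetical stubs standing in for a real HTTP client and HTML parser.
    std::string fetchPage(const std::string& /*url*/) { return ""; }
    std::vector<std::string> extractLinks(const std::string& /*html*/) { return {}; }
    void processDocument(const std::string& /*html*/) {}

    void crawl(const std::vector<std::string>& seeds) {
        std::queue<std::string> unvisited;
        std::set<std::string> seen;
        for (const auto& url : seeds) {
            unvisited.push(url);
            seen.insert(url);
        }
        while (!unvisited.empty()) {
            std::string url = unvisited.front();         // step 1: remove a URL
            unvisited.pop();
            std::string html = fetchPage(url);           // steps 2-3: resolve host, download
            for (const auto& link : extractLinks(html))  // step 4: extract links
                if (seen.insert(link).second)            // step 5: only new URLs
                    unvisited.push(link);
            processDocument(html);                       // step 6: process the document
        }                                                // step 7: back to step 1
    }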

Single Crawler

[Figure: single crawler architecture]

Multithreaded Crawler

[Figure: multithreaded crawler architecture]

Parallel Crawler

[Figure: parallel crawler architecture]

Search Engine: architecture

[Figure: search engine architecture]

Search Engine: major components

  • Crawlers
    Collect documents by recursively fetching links from a set of starting pages. Each crawler has different policies, so the pages indexed by various search engines differ.
  • The Indexer
    Processes pages, decides which of them to index, and builds various data structures representing the pages (inverted index, web graph, etc.); the representations differ among search engines. It might also build additional structures such as LSI (latent semantic indexing). A toy inverted-index sketch follows this list.
  • The Query Processor
    Processes user queries and returns matching answers in an order determined by a ranking algorithm
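
To make the indexer's inverted index concrete, here is a toy sketch (not any particular engine's implementation): it maps each term to the ids of the documents that contain it. Real indexers also store positions, term frequencies, and more:

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <vector>

    // Toy inverted index: maps each term to the ids of the documents that
    // contain it.
    std::map<std::string, std::vector<int>> buildIndex(
            const std::vector<std::string>& docs) {
        std::map<std::string, std::vector<int>> index;
        for (int id = 0; id < static_cast<int>(docs.size()); ++id) {
            std::istringstream words(docs[id]);
            std::string term;
            while (words >> term) {
                std::vector<int>& postings = index[term];
                if (postings.empty() || postings.back() != id)
                    postings.push_back(id);  // at most one entry per document
            }
        }
        return index;
    }

    int main() {
        auto index = buildIndex({"web crawler design", "web search engine"});
        for (int id : index["web"]) std::cout << id << ' ';  // prints: 0 1
    }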

Issues in crawler design

  • General architecture
  • What pages should the crawler download?
  • How should the crawler refresh pages?
  • How should the load on the visited web sites be minimized?
  • How should the crawling process be parallelized?

Distributed crawler

The crawler system should consist of a number of crawler entities, which run on distributed sites and interact in peer-to-peer fashion. Each crawler entity must know its own URL subset, as well as the mapping from URL subsets to the network addresses of the corresponding peer crawler entities.

Whenever a crawler entity encounters a URL from a different URL subset, it should forward that URL to the appropriate peer crawler entity, based on the URL-subset-to-entity lookup.

Each crawler entity should maintain its own database, which stores only the URLs from the URL subset assigned to that entity. The databases are disjoint and can be combined offline when the crawling task is complete.
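
A hedged sketch of one simple partitioning scheme for the URL-subset-to-entity mapping: hash the host name modulo the number of entities. The post does not specify a scheme, and the names hostOf and ownerEntity are illustrative:

    #include <functional>
    #include <string>

    // Illustrative helper: extracts the host part of a URL such as
    // "http://example.com/page".
    std::string hostOf(const std::string& url) {
        std::size_t start = url.find("//");
        start = (start == std::string::npos) ? 0 : start + 2;
        std::size_t end = url.find('/', start);
        return url.substr(start,
                          end == std::string::npos ? std::string::npos : end - start);
    }

    // Maps a URL to the crawler entity that owns it; hashing by host keeps
    // every URL of a given site at the same entity.
    std::size_t ownerEntity(const std::string& url, std::size_t numEntities) {
        return std::hash<std::string>{}(hostOf(url)) % numEntities;
    }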

Crawler Entity Architecture

Each crawler entity consists of a number of crawler threads, a URL handling thread, a URL packet dispatcher thread, and a URL packet receiver thread. The URL set assigned to each crawler entity is further divided into subsets, one per crawler thread.

[Figure: crawler entity architecture]

Each crawler thread should have its own pending-URL list and DNS cache. Each thread picks an element from its pending-URL list, checks whether the IP address for the URL is available in the DNS cache (doing a DNS lookup if not), issues an HTTP fetch request, gets the page, and finally extracts more URLs from the fetched page and puts them in the job queue of the URL handling thread.

The URL handling thread has a job queue. It gets a URL from the job queue and checks whether the URL belongs to the URL set of this crawler entity; if it belongs to another entity, it puts the URL on the dispatcher queue and gets a new URL from its job queue. If the URL belongs to its own set, it first checks the URL-seen cache; if that test fails, it queries the SQL database to check whether the URL has been seen, and puts the URL in the URL-seen cache. If the URL has not been seen before, it then puts it into the pending-URL list of one of the crawler threads. URLs are assigned to crawler threads based on domain names: each domain name is serviced by only one thread, so only one connection is maintained with any given server. This ensures that the crawler doesn't overload a slow server.

The URL dispatcher thread sends each URL to its corresponding crawler entity.

The URL receiver thread collects URLs received from other crawler entities (i.e., sent via the dispatcher threads of those entities) and puts them on the job queue of the URL handling thread.

Buffer Overflow

Question: What is a buffer overflow and how do you exploit it?

What is a Buffer Overrun?

A buffer overrun refers to the problem where a program can be made to write beyond its allotted buffer. Not all buffer overruns can be exploited as security vulnerabilities.

What could it lead to?

Buffer overflows can be triggered by inputs that are designed to execute code or alter the way the program operates. In the best case you get an access violation (AV) and your site becomes unstable. In the worst case, the attacker injects code into your application, executes it, and, for example, makes everyone an administrator of your site.

Types of Buffer Overrun

  • Stack Overruns (sketched after this list)
  • Heap Overruns
  • Array Indexing Errors
  • Format String
  • Unicode and ANSI Buffer size mismatch
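
As a concrete illustration of the first type: a stack overrun writes past the end of a fixed-size local buffer and corrupts adjacent stack memory, including the saved return address, which is what an exploit overwrites to redirect execution. A deliberately unsafe sketch, paired with a bounded alternative (do not use the unsafe pattern in real code):

    #include <cstring>
    #include <iostream>

    // Deliberately unsafe: strcpy does no bounds checking, so any input longer
    // than 15 characters (plus the terminator) writes past `buffer` and
    // corrupts adjacent stack memory, including the saved return address.
    void vulnerable(const char* input) {
        char buffer[16];
        std::strcpy(buffer, input);  // classic stack overrun
        std::cout << buffer << '\n';
    }

    // Safer variant: bound the copy to the buffer size.
    void safer(const char* input) {
        char buffer[16];
        std::strncpy(buffer, input, sizeof(buffer) - 1);
        buffer[sizeof(buffer) - 1] = '\0';  // strncpy may not null-terminate
        std::cout << buffer << '\n';
    }

    int main(int argc, char** argv) {
        if (argc > 1) vulnerable(argv[1]);  // attacker-controlled input
    }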

A colleague of mine created a PPT on buffer overflows, which we deal with here on a daily basis. I removed some of the proprietary code and some other internal information.

Monday, September 26, 2011

Max Black sub square matrix

Imagine there is a square matrix with n x n cells. Each cell is either filled with a black pixel or a white pixel. Design an algorithm to find the maximum subsquare such that all four borders are filled with black pixels.
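
One common approach (a sketch, not the only solution): precompute, for every cell, how many consecutive black cells extend to the right and downward; then, for each top-left corner, try candidate sizes from large to small and verify the bottom and right borders in O(1) using the precomputed counts. Overall O(n^3):

    #include <algorithm>
    #include <vector>

    // matrix[r][c] == true means the cell is black. Returns the side length
    // of the largest subsquare whose four borders are entirely black.
    int maxBlackBorderSquare(const std::vector<std::vector<bool>>& matrix) {
        int n = static_cast<int>(matrix.size());
        // right[r][c] / down[r][c] = number of consecutive black cells
        // starting at (r, c) and extending right / down, inclusive.
        std::vector<std::vector<int>> right(n + 1, std::vector<int>(n + 1, 0));
        std::vector<std::vector<int>> down(n + 1, std::vector<int>(n + 1, 0));
        for (int r = n - 1; r >= 0; --r)
            for (int c = n - 1; c >= 0; --c)
                if (matrix[r][c]) {
                    right[r][c] = right[r][c + 1] + 1;
                    down[r][c]  = down[r + 1][c] + 1;
                }
        int best = 0;
        for (int r = 0; r < n; ++r)
            for (int c = 0; c < n; ++c) {
                // Top and left borders limit the candidate size from (r, c).
                int maxSide = std::min(right[r][c], down[r][c]);
                for (int side = maxSide; side > best; --side) {
                    // Check the bottom border (extending right) and the
                    // right border (extending down).
                    if (right[r + side - 1][c] >= side &&
                        down[r][c + side - 1] >= side) {
                        best = side;
                        break;
                    }
                }
            }
        return best;
    }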

Algorithm to Exhibit Did You Mean Feature of Google Search Engine

When people search at Google, they often type two words without a space. For example, instead of typing "binary tree", someone may type "binarytree". Given an input, what is a good way of finding whether it is a combination of two valid words? In effect, you have to add a space between the two words. If the dictionary words have been inserted into a data structure such as a trie, you would maintain a boolean flag marking the end of a word (eow).
A more general example: let's say you have a phrase without any spaces, e.g. "thisisawesome". Given a dictionary, how would you add spaces to this string?

You can solve this problem using a trie data structure. A trie implementation can be found here in my previous blog post; the rest is just searching the trie.
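
For completeness, a minimal sketch of the trie and the backtracking segmentation. This is a simplified stand-in, not the implementation from the linked post, and it assumes lowercase ASCII words:

    #include <iostream>
    #include <memory>
    #include <string>
    #include <vector>

    // Minimal trie node: 26 lowercase children plus the end-of-word flag
    // ("eow") mentioned above.
    struct TrieNode {
        std::unique_ptr<TrieNode> child[26];
        bool eow = false;
    };

    void insert(TrieNode* node, const std::string& word) {
        for (char ch : word) {
            int i = ch - 'a';
            if (!node->child[i]) node->child[i] = std::make_unique<TrieNode>();
            node = node->child[i].get();
        }
        node->eow = true;
    }

    // Tries every dictionary word that is a prefix of s[pos..]; the chosen
    // words accumulate in `out` when segmentation succeeds.
    bool segment(const TrieNode* root, const std::string& s, std::size_t pos,
                 std::vector<std::string>& out) {
        if (pos == s.size()) return true;
        const TrieNode* node = root;
        std::string cur;
        for (std::size_t i = pos; i < s.size(); ++i) {
            node = node->child[s[i] - 'a'].get();
            if (!node) break;           // no dictionary word continues here
            cur += s[i];
            if (node->eow) {
                out.push_back(cur);
                if (segment(root, s, i + 1, out)) return true;
                out.pop_back();         // backtrack, keep trying longer prefixes
            }
        }
        return false;
    }

    int main() {
        TrieNode root;
        const std::vector<std::string> dict = {"this", "is", "awesome"};
        for (const std::string& w : dict) insert(&root, w);
        std::vector<std::string> words;
        if (segment(&root, "thisisawesome", 0, words)) {
            for (const std::string& w : words) std::cout << w << ' ';
            std::cout << '\n';  // prints: this is awesome
        }
    }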

Sunday, September 25, 2011

LRU Cache Implementation
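
A common design is a doubly linked list holding entries in recency order plus a hash map from key to list position, giving O(1) get and put. A minimal sketch, assuming integer keys and values:

    #include <cstddef>
    #include <list>
    #include <unordered_map>
    #include <utility>

    // LRU cache: the list keeps (key, value) pairs in recency order, most
    // recent at the front; the map gives O(1) access to a key's position.
    class LRUCache {
        std::size_t capacity_;
        std::list<std::pair<int, int>> items_;
        std::unordered_map<int, std::list<std::pair<int, int>>::iterator> pos_;

    public:
        explicit LRUCache(std::size_t capacity) : capacity_(capacity) {}

        // Returns the value for key, or -1 if absent; a hit moves the entry
        // to the front (most recently used).
        int get(int key) {
            auto it = pos_.find(key);
            if (it == pos_.end()) return -1;
            items_.splice(items_.begin(), items_, it->second);
            return it->second->second;
        }

        void put(int key, int value) {
            auto it = pos_.find(key);
            if (it != pos_.end()) {            // update in place and refresh
                it->second->second = value;
                items_.splice(items_.begin(), items_, it->second);
                return;
            }
            if (items_.size() == capacity_) {  // evict least recently used
                pos_.erase(items_.back().first);
                items_.pop_back();
            }
            items_.emplace_front(key, value);
            pos_[key] = items_.begin();
        }
    };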

Linked List is a palindrome or not

Question: Write a program to check if the given linked list is a palindrome or not

METHOD 1 (using recursion)
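
One way to realize the recursive method, as a sketch: recurse to the tail and, while unwinding, compare each node from the back against a front pointer that advances one node per comparison.

    // Singly linked list node.
    struct Node {
        int data;
        Node* next;
    };

    // Recurse to the tail; while unwinding, compare the current "back" node
    // with a "front" pointer that advances one node per comparison.
    bool checkRecursive(Node*& front, Node* back) {
        if (back == nullptr) return true;             // ran past the tail
        if (!checkRecursive(front, back->next)) return false;
        if (front->data != back->data) return false;  // front/back mismatch
        front = front->next;                          // move front forward
        return true;
    }

    bool isPalindrome(Node* head) {
        Node* front = head;
        return checkRecursive(front, head);
    }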


METHOD 2 (Using Stack)




  • Get the middle of the linked list.

  • Push the elements of the first half onto the stack, up to the middle.

  • Compare the stack contents with the second half, as sketched below.
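
A sketch of the stack method, using the slow/fast-pointer trick to find the middle:

    #include <stack>

    // Same node type as in Method 1.
    struct Node {
        int data;
        Node* next;
    };

    // Push the first half onto a stack while slow/fast pointers locate the
    // middle, then compare the stack (first half reversed) with the rest.
    bool isPalindrome(Node* head) {
        std::stack<int> firstHalf;
        Node* slow = head;
        Node* fast = head;
        while (fast != nullptr && fast->next != nullptr) {
            firstHalf.push(slow->data);
            slow = slow->next;
            fast = fast->next->next;
        }
        if (fast != nullptr) slow = slow->next;  // odd length: skip middle node
        while (slow != nullptr) {
            if (firstHalf.top() != slow->data) return false;
            firstHalf.pop();
            slow = slow->next;
        }
        return true;
    }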

Thursday, September 22, 2011

Next bigger number

Q: You are given a positive number. You have to find the number that is immediately bigger than it and uses exactly the same digits (each digit used the same number of times). If no such bigger number is possible, return -1.

For example: (1) given the number 7585, your output should be 7855; (2) for 7111, return -1.

Algorithm:
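
One standard approach is the classic next-permutation algorithm:

  1. Scan the digits from right to left and find the first digit that is smaller than the digit to its right; call its position i. If there is no such digit, the digits are in descending order and the answer is -1.
  2. Among the digits to the right of position i, find the smallest digit that is larger than digit i, and swap the two.
  3. Reverse (equivalently, sort in ascending order) the digits to the right of position i.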


Code:
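
A compact sketch in C++; std::next_permutation performs exactly the three steps above on the digit string:

    #include <algorithm>
    #include <iostream>
    #include <string>

    // Returns the smallest number greater than n that uses exactly the same
    // digits, or -1 if the digits are already in descending order.
    long long nextBigger(long long n) {
        std::string digits = std::to_string(n);
        if (std::next_permutation(digits.begin(), digits.end()))
            return std::stoll(digits);
        return -1;
    }

    int main() {
        std::cout << nextBigger(7585) << "\n";  // 7855
        std::cout << nextBigger(7111) << "\n";  // -1
    }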