A web crawler (also called a search engine spider) is an automated program that systematically browses the internet and stores a copy of each page it visits. Search engines use crawlers to build the index that makes the search process fast. A crawler can index millions of pages per day, but because search engines use different crawling and ranking algorithms, they produce different search results.
How a Crawler Works
During a crawl, a crawler visits not only a website's landing and primary pages but also the links found on those pages. The content and keywords on a page are also vital signals that attract the crawler. Every search engine has a scheduler that tells the crawler how frequently to crawl and which documents to fetch next.
The crawler explores and indexes the pages of your website so that when a user searches for a relevant key phrase, your site appears in the search results. Crawling also involves assembling a comprehensive list of keywords into the search engine's database. Once the site is indexed, the search engine applies a number of ranking algorithms to determine how visible your website is in its results.
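The fetch-extract-enqueue loop described above can be sketched in a few lines of Python. This is a minimal, illustrative model, not a production crawler: the `fetch` function is injected (in practice it would perform HTTP requests and honour robots.txt), and "indexing" here is just splitting page text into words.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from collections import deque

class LinkExtractor(HTMLParser):
    """Collects link targets and visible text from one HTML page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL
                    self.links.append(urljoin(self.base_url, value))

    def handle_data(self, data):
        self.text_parts.append(data)

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl. fetch(url) returns an HTML string or None.
    Returns {url: list of words found on that page}."""
    frontier = deque([start_url])
    seen = {start_url}          # never enqueue the same URL twice
    index = {}
    while frontier and len(index) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkExtractor(url)
        parser.feed(html)
        index[url] = " ".join(parser.text_parts).split()
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index
```

Because the fetcher is a plain function, the loop can be exercised against an in-memory dictionary of pages before ever touching the network.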
The Crawling Policy
- A selection policy that decides which pages to download
- A re-visit policy that determines when to check pages for changes
- A politeness policy that avoids overloading websites
- A parallelization policy that coordinates distributed crawler processes so the same page is not downloaded more than once