As a webmaster, you may not know much more about SEO than what you practice. You do SEO work every day and have mastered solid skills, but have you ever thought about what it all means? It is no surprise that most SEOs don't really know how search engines actually work. Here, we will reveal some of their secrets.
Secret 1: Indexing vs. Crawling
What does it really mean when we say a search engine has indexed a site? It means the site shows up in a [site:www.site.com] query on that search engine. When a page is indexed, it has been added to the search engine's database. Technically, though, that doesn't mean the search engine has crawled the page. That is why we sometimes see this:
"A description for this result is not available because of this site’s robots.txt."
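That message typically appears when a page is blocked from crawling in robots.txt yet still indexed because of external signals. A minimal robots.txt that could trigger it might look like this (the path is purely illustrative):

```text
User-agent: *
Disallow: /private/
```

Any page under /private/ can still end up in the index via links pointing to it, but because the search engine cannot fetch its content, it has no description to show.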
There is an order of priority: URLs have to be discovered before they can be crawled, and they have to be crawled before they can be indexed. An index is a list of words or phrases; it doesn't contain the documents themselves. For each entry in the index, there is a reference to all the documents related to that word or phrase. In other words, the words found in a document point back to that document.
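The structure described above is commonly called an inverted index. Here is a toy sketch in Python; the document names and contents are made-up examples, not anything a real search engine stores:

```python
# Toy inverted index: maps each word to the set of documents containing it.
from collections import defaultdict

docs = {
    "page1.html": "cheap flights to paris",
    "page2.html": "paris travel guide",
    "page3.html": "cheap hotels",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

# Looking up a word returns the documents that contain it,
# without storing the documents inside the index itself.
print(sorted(index["paris"]))
print(sorted(index["cheap"]))
```

Note that a query only touches the word's entry; the documents are fetched separately, which is exactly why an index can answer "which pages mention X" without holding page content.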
Another way to describe this: once a search engine learns of a set of URLs, it adds them to its crawl-scheduling system. The engine then dedupes the list, rearranges the URLs in priority order, and crawls them in that order. When a page is crawled, a separate algorithmic process determines whether the page should be stored in the index. From this we know that search engines don't crawl every page they learn about, and don't index every page they crawl.
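The discover-dedupe-prioritize-crawl pipeline above can be sketched with a priority queue. The URLs and priority scores below are invented for illustration; real schedulers weigh many more signals:

```python
# Sketch of a crawl scheduler: discovered URLs are deduplicated,
# ordered by priority, then crawled in that order.
import heapq

discovered = [
    ("https://example.com/", 1),         # lower number = higher priority
    ("https://example.com/blog", 2),
    ("https://example.com/", 1),         # duplicate, removed by dedupe
    ("https://example.com/contact", 3),
]

# Dedupe, keeping the best (lowest) priority seen for each URL.
best = {}
for url, priority in discovered:
    if url not in best or priority < best[url]:
        best[url] = priority

# Crawl in priority order.
queue = [(priority, url) for url, priority in best.items()]
heapq.heapify(queue)

crawl_order = []
while queue:
    _, url = heapq.heappop(queue)
    crawl_order.append(url)

print(crawl_order)
```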
- Robots.txt can tell search engines not to crawl a page. When a page is blocked by robots.txt, search engines can only associate it with words gathered from signals such as internal links, instead of crawling the page's content.
- There is a difference between Google and other search engines such as Bing and Yahoo. Google may still show a blocked page in publicly available results if other signals are strong enough that the page should be indexed. Bing and Yahoo, however, respect the block.
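The robots.txt behavior described above can be checked with Python's standard-library robots.txt parser. The rules and URLs below are made up for illustration:

```python
# Parse a hypothetical robots.txt and test which URLs may be crawled.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Blocked section: a crawler obeying robots.txt will not fetch this page,
# even though the page may still be discovered and indexed via links.
print(parser.can_fetch("*", "https://www.example.com/private/page.html"))

# Everything else remains crawlable.
print(parser.can_fetch("*", "https://www.example.com/blog/post.html"))
```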
Secret 2: PageRank
The next secret about search engines that may confuse you is links and how they are processed. You should know that links are not processed during crawling; PageRank is therefore computed separately from crawling.
PageRank is a measure of the quantity and quality of links. Referrer-based blocking does not stop the flow of PageRank from one page to another, and you can't control PageRank with any kind of referrer-based tracking. Contrary to popular belief, PageRank can be passed through a 302 redirect.
Four things that do stop PageRank from passing:
- A nofollow directive set on the link.
- A disallow directive in the robots.txt.
- A 404 error on the originating page.
- A 404 error on the destination page.
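The four blockers above can be restated as a simple predicate. The function name, parameters, and inputs below are hypothetical, purely to express the list in code:

```python
def link_passes_pagerank(rel_attrs, disallowed_by_robots,
                         source_status, target_status):
    """Return True only if none of the four PageRank blockers apply.

    rel_attrs: values of the link's rel attribute (e.g. ["nofollow"])
    disallowed_by_robots: True if robots.txt disallows the destination
    source_status / target_status: HTTP status codes of the two pages
    """
    if "nofollow" in rel_attrs:      # 1. nofollow on the link
        return False
    if disallowed_by_robots:         # 2. robots.txt disallow
        return False
    if source_status == 404:         # 3. originating page 404s
        return False
    if target_status == 404:         # 4. destination page 404s
        return False
    return True


print(link_passes_pagerank([], False, 200, 200))            # ordinary link
print(link_passes_pagerank(["nofollow"], False, 200, 200))  # blocked by nofollow
```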
Above, we've introduced two core secrets of how search engines work. Which of the two surprised you the most? Either way, keep these two secrets in mind, and you will have a deeper understanding of search engines and SEO.