Monday, June 06, 2011

Control the contents to be indexed/crawled

In SharePoint, you can control which content on a page is indexed and whether the links on the page are crawled. Use the HTML <META> tag to tell the indexer whether to crawl the page contents and whether to follow the links on the page.

For example, <META name="robots" content="noindex, nofollow"> tells the indexer not to crawl the page contents and not to follow the links on the page.
Similarly, <META name="robots" content="index, nofollow"> means crawl the contents but ignore the links on the page.
<META name="robots" content="noindex, follow"> means don't crawl the contents of this page, but do crawl the pages that it links to.

If no robots META tag is specified, the indexer treats the page as content="index, follow" by default.
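As a minimal sketch of where the tag goes: the robots META tag belongs in the page's <head> section. The page title, body text and link URL below are placeholders made up for illustration, not taken from a real site.

```html
<!DOCTYPE html>
<html>
<head>
  <title>Sample page (placeholder title)</title>
  <!-- Ask the indexer to crawl this page's contents but not follow its links -->
  <meta name="robots" content="index, nofollow">
</head>
<body>
  <p>This text may be crawled.</p>
  <!-- Hypothetical URL, used only to show a link on the page -->
  <a href="/archive/old-page.aspx">This link should not be followed.</a>
</body>
</html>
```

Swapping the content value for "noindex, nofollow" or "noindex, follow" gives the other two behaviours described above.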

You can also control whether specific content on the page is crawled. To do this, apply the "noindex" class to a <div> tag. Any content within <div class="noindex"> tags will not be crawled. This is similar to "noindex, nofollow", i.e. no content within that div tag will be crawled and no links will be followed.

For example:-
<div class="noindex">
Contents in this div tag will not be crawled
This link will not be crawled either
</div>


But there is an exception to this. The <div class="noindex"> will not apply to nested div tags.

For example:-
<div class="noindex">
<div>

Contents of this tag will be crawled because they are inside the child tag.
This link will be crawled too
</div>
</div>
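One commonly suggested workaround, which is an assumption here since it is not covered above, is to repeat the "noindex" class on every nested div tag that should be excluded. A rough sketch (the link URL is hypothetical, and the behaviour is worth verifying against your own SharePoint version):

```html
<div class="noindex">
  Contents directly inside the outer div will not be crawled.
  <!-- Assumption: the class is repeated here because it is not inherited by nested div tags -->
  <div class="noindex">
    Contents of the nested div should now be skipped as well.
    <a href="/pages/internal.aspx">This link should not be followed.</a>
  </div>
</div>
```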
