Shopping Cart Forum

03-09-2007, 07:41 AM
Tinkerbell
robots.txt

Here is some useful info I found. In particular, pay attention to the bit about preventing Googlebot from crawling your images. This can help deter people from stealing your images and using your bandwidth, which has happened to me when they have used my images in forum chats! Of course, if you find the culprits you can always replace the image with something else.

What on Earth is a robots.txt File?

A robots.txt is a file placed on your server to tell the various search engine spiders not to crawl or index certain sections or pages of your site. You can use it to prevent indexing totally, to prevent certain areas of your site from being indexed, or to issue individual indexing instructions to specific search engines.

The file itself is a simple text file, which can be created in Notepad. It needs to be saved to the root directory of your site, that is, the directory where your home page or index page is.
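
Because the file always sits in the root directory, its address can be worked out from any page on the site. Here is a minimal sketch of that idea in Python; the function name `robots_txt_url` and the example.com address are made up for illustration, not taken from the article.

```python
from urllib.parse import urljoin


def robots_txt_url(page_url):
    """Return the robots.txt URL for the site that serves page_url."""
    # The leading slash makes urljoin drop the page's own path,
    # so the result always points at the site root.
    return urljoin(page_url, "/robots.txt")


print(robots_txt_url("http://www.example.com/shop/cakes/item.htm"))
# prints: http://www.example.com/robots.txt
```

A robots.txt saved in a subdirectory such as /shop/ would not be found at that address, so spiders would not see it.
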

Why Do I Need One?

All search engines, or at least all the important ones, now look for a robots.txt file as soon as their spiders or bots arrive on your site. So even if you currently do not need to exclude the spiders from any part of your site, having a robots.txt file is still a good idea; it can act as a sort of invitation into your site.

There are a number of situations where you may wish to exclude spiders from some or all of your site.

You are still building the site, or certain pages, and do not want the unfinished work to appear in search engines.
You have information that, while not sensitive enough to bother password protecting, is of no interest to anyone but those it is intended for and you would prefer it did not appear in search engines.
Most people will have some directories they would prefer were not crawled. For example, do you really need to have your cgi-bin indexed, or a directory that simply contains thank-you or error pages?
If you are using doorway pages (similar pages, each optimized for an individual search engine) you may wish to ensure that individual robots do not have access to all of them. This is important in order to avoid being penalized for spamming a search engine with a series of overly similar pages.
You would like to exclude some bots or spiders altogether, for example those from search engines you do not want to appear in or those whose chief purpose is collecting email addresses.
The very fact that search engines are looking for them is reason enough to put one on your site. Have you looked at your site statistics recently? If your stats include a section on 'files not found', you are sure to see many entries where search engine spiders looked for, and failed to find, a robots.txt file on your site.
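
If your stats package does not break out 'files not found', a rough sketch like this one can count the failed robots.txt requests straight from the raw log. It assumes the server writes Apache-style common log format lines; the sample lines, the addresses in them and the function name `count_missing_robots` are invented for illustration.

```python
import re

# Picks the requested path and the status code out of a log line such as:
#   ... "GET /robots.txt HTTP/1.1" 404 209
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) [^"]*" (\d{3})')


def count_missing_robots(log_lines):
    """Count requests for /robots.txt that were answered with a 404."""
    count = 0
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match and match.group(1) == "/robots.txt" and match.group(2) == "404":
            count += 1
    return count


# Made-up sample lines (assumption: common log format).
sample_log = [
    '192.0.2.1 - - [03/Sep/2007:07:41:00 +0000] "GET /robots.txt HTTP/1.1" 404 209',
    '192.0.2.1 - - [03/Sep/2007:07:41:02 +0000] "GET /index.htm HTTP/1.1" 200 5120',
    '198.51.100.7 - - [03/Sep/2007:07:45:10 +0000] "GET /robots.txt HTTP/1.0" 404 209',
]
print(count_missing_robots(sample_log))
# prints: 2
```
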

Creating the robots.txt file

There is nothing difficult about creating a basic robots.txt file. It can be created using Notepad or whatever your favorite text editor is. Each entry has just two lines:

User-Agent: [Spider or Bot name]
Disallow: [Directory or File Name]

The Disallow line can be repeated for each directory or file you want to exclude, and the whole entry can be repeated for each spider or bot you want to exclude.
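
If you would rather generate the file than type it by hand, the entry format is simple enough to build from a small table of rules. This is only a sketch: the helper name `build_robots_txt` and the idea of keeping the rules in a dictionary are my own assumptions, not something from the article.

```python
def build_robots_txt(rules):
    """Render {user_agent: [paths]} as robots.txt text.

    An empty path list becomes a bare "Disallow:" line, which excludes nothing.
    """
    blocks = []
    for user_agent, paths in rules.items():
        lines = [f"User-agent: {user_agent}"]
        # One Disallow line per excluded path; a bare one if nothing is excluded.
        lines.extend(f"Disallow: {path}".rstrip() for path in (paths or [""]))
        blocks.append("\n".join(lines))
    # Entries are separated from each other by a blank line.
    return "\n\n".join(blocks) + "\n"


# Hypothetical rules, reusing the directory names mentioned in this post.
rules = {
    "Googlebot": ["/private/privatefile.htm"],
    "*": ["/newsection/", "/cgi-bin/"],
}
print(build_robots_txt(rules))
```

The printed text can then be saved as robots.txt in the root directory of the site.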

1. Exclude a file from an individual Search Engine

You have a file, privatefile.htm, in a directory called 'private' that you do not wish to be indexed by Google. You know that the spider that Google sends out is called 'Googlebot'. You would add these lines to your robots.txt file:

User-Agent: Googlebot
Disallow: /private/privatefile.htm
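
One way to check an entry like this before uploading it is Python's standard `urllib.robotparser` module. The sketch below feeds it the two lines above; the example.com address and the bot name "SomeOtherBot" are placeholders. Bear in mind this shows how Python's parser reads the rules, which is a reasonable guide but not a guarantee of how every search engine behaves.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-Agent: Googlebot
Disallow: /private/privatefile.htm
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

url = "http://www.example.com/private/privatefile.htm"
googlebot_allowed = parser.can_fetch("Googlebot", url)
other_allowed = parser.can_fetch("SomeOtherBot", url)  # placeholder bot name

print(googlebot_allowed, other_allowed)
# prints: False True
```

The file is blocked for Googlebot only; a bot with no matching entry is still allowed in.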

2. Exclude a section of your site from all spiders and bots

You are building a new section to your site in a directory called 'newsection' and do not wish it to be indexed before you are finished. In this case you do not need to specify each robot that you wish to exclude; you can simply use a wildcard character, '*', to exclude them all.

User-Agent: *
Disallow: /newsection/

Note that there is a forward slash at the beginning and end of the directory name, indicating that you do not want any files in that directory indexed.
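
The effect of those slashes can be seen with a quick test using Python's standard `urllib.robotparser` module. This is a sketch of how that parser matches paths (by prefix); the page names and the bot name "AnyBot" are invented for the example.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-Agent: *
Disallow: /newsection/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Hypothetical paths: two inside the directory, two outside it.
paths = ["/newsection/", "/newsection/page.htm", "/newsection.htm", "/index.htm"]
results = {path: parser.can_fetch("AnyBot", path) for path in paths}

for path, allowed in results.items():
    print(path, "allowed" if allowed else "blocked")
```

Everything whose path starts with /newsection/ is blocked, while a file that merely has a similar name, such as /newsection.htm, is not.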

3. Allow all spiders to index everything

Once again you can use the wildcard, '*', to let all spiders know they are welcome. Just leave the second line, the Disallow line, empty; that way you disallow nothing.

User-agent: *
Disallow:

4. Allow no spiders to index any part of your site

This requires just a tiny change from the command above - be careful!

User-agent: *
Disallow: /

If you use this command while building your site, don't forget to remove it once your site is live!
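
Since the two files differ by a single character, it is worth testing which one you have actually uploaded. Here is a small sketch using Python's standard `urllib.robotparser`; the helper name `can_crawl` and the bot name "AnyBot" are just for illustration.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt, user_agent, path):
    """Return True if the given robots.txt text lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


allow_all = "User-agent: *\nDisallow:\n"    # example 3: empty Disallow
block_all = "User-agent: *\nDisallow: /\n"  # example 4: a single slash

print(can_crawl(allow_all, "AnyBot", "/index.htm"))
# prints: True
print(can_crawl(block_all, "AnyBot", "/index.htm"))
# prints: False
```
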

And now for the bit I like!
Now you want to keep Google away from those images. Google grabs these images with a separate bot from the one that indexes pages generally, called Googlebot-Image. You have a couple of choices here:

User-Agent: Googlebot-Image
Disallow: /images/

That will work if you are very organized and keep all your images strictly in the images folder.

User-Agent: Googlebot-Image
Disallow: /

This one will prevent the Google image bot from indexing any of your images, no matter where they are in your site.
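
To see that this entry only affects the image bot and leaves the ordinary Googlebot alone, you can run it through Python's standard `urllib.robotparser`. This is a sketch of how that parser matches bot names, not a statement of exactly what Google does; the image path is made up.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-Agent: Googlebot-Image
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

image_path = "/photos/cake.jpg"  # hypothetical image outside /images/
image_bot_allowed = parser.can_fetch("Googlebot-Image", image_path)
page_bot_allowed = parser.can_fetch("Googlebot", image_path)

print(image_bot_allowed, page_bot_allowed)
# prints: False True
```
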