I, Robots.txt -- Microsoft Certified Professional Magazine Online

I, Robots.txt

A robot may not allow a well-behaved Web crawler or spider to expose confidential information.

By Russ Cooper
02/14/2005

In a recent bout of stupidity, the U.S. Department of Energy apparently published, by accident, confidential Homeland Security Department documents marked "For Official Use Only," and the documents remain visible via Google's Web cache.

To avoid situations like this, be sure you've created a properly configured robots.txt file on your Web servers. While it won't prevent confidential documents from being placed on a publicly available server, it is at least one way to prevent such documents from being available in Google's Web cache from now until eternity.

The robots.txt file isn't based on any officially recognized standard, but it has been in existence since 1993 and is generally accepted. Full details can be found at http://www.robotstxt.org.

The robots.txt file is placed on a Web server to provide instructions to well-behaved Web crawlers or spiders. Anyone can run a crawler, but they're most often used by search engines to collect information about Web sites. The file specifies which directories or files the crawler should not index. It's built from two basic lines:

User-Agent:
Disallow:

These lines can be repeated within the same file. The User-Agent: line indicates which crawler type the subsequent Disallow: lines apply to. You can specify a particular crawler by indicating its User-Agent value (found in your Web logs), or simply specify "*" to indicate all crawlers.

Following the User-Agent: line are one or more Disallow: lines, typically indicating directories. Files can also be specified if desired. Here's a sample robots.txt file:

User-Agent: *
Disallow: /

These two lines, if placed in the robots.txt file at the root of your Web site, tell crawlers to ignore your site.
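To see how a compliant crawler reads these lines, here is a minimal sketch that uses the robots.txt parser in Python's standard library. The helper name allowed and the crawler names ExampleBot, OtherBot and AnyBot are hypothetical, chosen only for illustration, and the second file is an assumed variant that isn't part of the sample above.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# The two-line sample: every crawler, the whole site.
BLOCK_ALL = "User-Agent: *\nDisallow: /\n"

# Hypothetical variant: only a crawler calling itself "ExampleBot" is shut out.
BLOCK_ONE = "User-Agent: ExampleBot\nDisallow: /\n"

if __name__ == "__main__":
    print(allowed(BLOCK_ALL, "AnyBot", "/index.html"))      # False
    print(allowed(BLOCK_ONE, "ExampleBot", "/index.html"))  # False
    print(allowed(BLOCK_ONE, "OtherBot", "/index.html"))    # True
```

The last call shows why the "*" value matters: a file that names only one crawler leaves every other crawler unrestricted.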

It's important to understand that a robots.txt file isn't a security mechanism; it does nothing to prevent crawlers or individuals from searching your site for files to index or view. Only polite crawlers will request the file and honor its contents.
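The point is easier to see once you notice that the check lives entirely in the crawler's own code. The following sketch, which assumes a hypothetical crawler called ExampleBot and made-up URLs on www.example.com, filters a list of URLs through robots.txt only when the crawler chooses to be polite.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content for the illustration.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /Private/
"""


def plan_crawl(urls, robots_txt, agent="ExampleBot", polite=True):
    """Return the URLs this (hypothetical) crawler would go on to fetch."""
    if not polite:
        # An impolite crawler never consults robots.txt, and nothing stops it.
        return list(urls)
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # A polite crawler drops every URL the file disallows for its agent name.
    return [u for u in urls if parser.can_fetch(agent, u)]


if __name__ == "__main__":
    urls = [
        "http://www.example.com/index.html",
        "http://www.example.com/Private/plan.doc",
    ]
    print(plan_crawl(urls, ROBOTS_TXT, polite=True))
    print(plan_crawl(urls, ROBOTS_TXT, polite=False))
```

The server plays no part in either branch, which is why the file can only ever be a request.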

This column was originally published in our weekly Security Watch newsletter.

If you want some of your site to be found in search engines, but have other files you want to keep out, you should disallow all directories except the ones you want to make available in the search engine. For example, if you have the following structure on your Web root:

  "/": Publicly available information to be put into search engines
  "/Dev": Stuff you're working on but don't want published
  "/Private": Stuff you definitely don't want published

Your robots.txt file would look like this:

  User-Agent: *
  Disallow: /Dev/
  Disallow: /Private/
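Before deploying a file like this, it can help to test it against a few paths. This is a minimal sketch using Python's standard-library parser; the helper check_paths and the file names other than FOO.ASP are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

SAMPLE = """\
User-Agent: *
Disallow: /Dev/
Disallow: /Private/
"""


def check_paths(robots_txt, paths, agent="*"):
    """Map each path to True (crawlable) or False (disallowed) for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {p: parser.can_fetch(agent, p) for p in paths}


if __name__ == "__main__":
    paths = ["/", "/index.html", "/Dev/FOO.ASP", "/Private/notes.txt", "/dev/FOO.ASP"]
    for path, ok in check_paths(SAMPLE, paths).items():
        print(f"{path}: {'allowed' if ok else 'blocked'}")
```

This parser compares path prefixes case-sensitively, so it reports /dev/FOO.ASP as allowed even though /Dev/FOO.ASP is blocked. That's worth keeping in mind if your Web server treats the two spellings as the same directory.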

To be extra secure, you should put some form of authentication on both the /Dev and /Private sub-directories.

Finally, you might have specified that nothing should be crawled, yet you find crawlers still reading directory pages that should be inaccessible. This usually means there's still a link to a page on your site somewhere on the Internet.

Using the previous example, let's say you've got a file named FOO.ASP in the /Dev directory. According to your robots.txt file, it shouldn't be crawled. If, however, some other site offers up a link like this:

"http://www.yoursite.com/Dev/FOO.ASP"

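Then the address is public. Crawlers that ignore robots.txt will follow that link to your FOO.ASP page and include it in their searches, and even a search engine that honors the file may list the bare URL because it learned the address from the other site. Nothing in robots.txt can stop this. That's why authentication is a necessary extra step to prevent access.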

About the Author

Russ Cooper is a senior information security analyst with Verizon Business, Inc. He's also founder and editor of NTBugtraq, www.ntbugtraq.com, one of the industry's most influential mailing lists dedicated to Microsoft security. One of the world's most-recognized security experts, he's often quoted by major media outlets on security issues.