Security Watch
I, Robots.txt
A robot may not allow a well-behaved Web crawler or spider to expose confidential information.
In a recent bout of stupidity, the U.S. Department of Energy apparently
published, by accident, confidential Department of Homeland Security documents
marked "For Official Use Only", and the documents remain visible via Google's
Web cache.
To avoid situations like this, be sure you've created a properly configured
robots.txt file on your Web servers. While it won't
prevent confidential documents from being placed on a publicly available server,
it is at least one way to prevent such documents from being available in Google's
Web cache from now until eternity.
The robots.txt file began as an informal convention rather than an officially
recognized standard (it was later written up as RFC 9309), but it has been in
existence since 1993 and is generally accepted. Full details can be found at
http://www.robotstxt.org.
The robots.txt file is placed on a Web server to provide instructions to well-behaved
Web crawlers or spiders. Anyone can run a crawler, but crawlers are most often used
by search engines to collect information about Web sites. The file tells a crawler
which directories or files it should not index. It is built from two basic kinds
of line:
User-Agent:
Disallow:
These lines can be repeated within the same file. The User-Agent:
line indicates which crawler type the subsequent Disallow:
lines apply to. You can specify a particular crawler by indicating its User-Agent
value (found in your Web logs), or simply specify "*" to indicate
all crawlers.
Following the User-Agent: line are one or more
Disallow: lines, typically indicating directories.
Files can also be specified if desired. Here's a sample robots.txt file:
User-Agent: *
Disallow: /
These two lines, if placed in the robots.txt file at the root of your Web site,
tell crawlers to ignore your site.
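To see how a crawler might read these rules, here is a minimal sketch using Python's standard urllib.robotparser module. The crawler names ExampleBot and OtherBot, the page index.html, and the second rule set (which shuts out one crawler by name and leaves the rest unrestricted by using an empty Disallow: line) are assumptions made up for illustration. They are not taken from any real crawler or site.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Parse robots.txt text directly, without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# The sample file from the article: every crawler is told to stay out.
BLOCK_ALL = """\
User-Agent: *
Disallow: /
"""

# Hypothetical variant: block one named crawler, leave everyone else alone.
# An empty Disallow: value means "nothing is disallowed".
BLOCK_ONE = """\
User-Agent: ExampleBot
Disallow: /

User-Agent: *
Disallow:
"""

everyone = build_parser(BLOCK_ALL)
only_one = build_parser(BLOCK_ONE)
url = "http://www.yoursite.com/index.html"  # hypothetical page

print(everyone.can_fetch("ExampleBot", url))  # False
print(everyone.can_fetch("OtherBot", url))    # False
print(only_one.can_fetch("ExampleBot", url))  # False
print(only_one.can_fetch("OtherBot", url))    # True
```

The parser only reports what the file asks for. Whether a crawler acts on the answer is up to the crawler.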
It's important to understand that a robots.txt file isn't a security mechanism;
it does nothing to prevent crawlers or individuals from searching your site
for files to index or view. Only polite crawlers will request the file and honor
its contents.
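The difference between a polite crawler and an impolite one can be sketched in a few lines of Python. Everything here is hypothetical: the fetch function is a stand-in that only records which URLs were requested, ExampleBot is an invented crawler name, and secret.doc is an invented file. The point is simply that honoring robots.txt is a choice the crawler makes, not something the server enforces.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-Agent: *
Disallow: /
"""


def make_fake_fetch(log):
    """Return a stand-in for an HTTP fetch that just records requested URLs."""
    def fetch(url):
        log.append(url)
        return "<html>contents of " + url + "</html>"
    return fetch


def polite_crawl(urls, robots_txt, user_agent, fetch):
    """Fetch only the URLs that robots.txt permits for this user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [fetch(u) for u in urls if parser.can_fetch(user_agent, u)]


def impolite_crawl(urls, fetch):
    """Fetch everything and never consult robots.txt."""
    return [fetch(u) for u in urls]


urls = [
    "http://www.yoursite.com/",
    "http://www.yoursite.com/secret.doc",  # hypothetical confidential file
]

polite_log, rude_log = [], []
polite_pages = polite_crawl(urls, ROBOTS_TXT, "ExampleBot", make_fake_fetch(polite_log))
rude_pages = impolite_crawl(urls, make_fake_fetch(rude_log))

print(len(polite_pages))  # 0
print(len(rude_pages))    # 2
```

The same robots.txt file stops the first crawler completely and has no effect on the second.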
This column was originally published in our weekly Security Watch newsletter.
If you want some of your site to be found in search engines, but have other
files you want to keep out, you should disallow all directories except the ones
you want to make available in the search engine. For example, if you have the
following structure on your Web root:
"/": Publicly available information to be put into search
engines
"/Dev": Stuff you're working on but don't want published
"/Private": Stuff you definitely don't want published
Your robots.txt file would look like this:
User-Agent: *
Disallow: /Dev/
Disallow: /Private/
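One way to check a file like this before deploying it is to run a few paths through Python's standard urllib.robotparser module. This is a rough sketch: the crawler name ExampleBot and the files products.html and budget.xls are invented for illustration, while FOO.ASP is the example file used later in this column.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-Agent: *
Disallow: /Dev/
Disallow: /Private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path):
    """Ask whether a (hypothetical) polite crawler may fetch this path."""
    return parser.can_fetch("ExampleBot", "http://www.yoursite.com" + path)


for path in ["/", "/products.html", "/Dev/FOO.ASP", "/Private/budget.xls"]:
    print(path, allowed(path))
# / True
# /products.html True
# /Dev/FOO.ASP False
# /Private/budget.xls False
```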
To be extra secure, you should put some form of authentication on both the
/Dev and /Private sub-directories.
Finally, you might have specified that nothing should be crawled, yet you find
crawlers still reading pages that should be off-limits. This means there's
still a link to a page on your site somewhere on the Internet.
Using the previous example, let's say you've got a file named FOO.ASP in the
/Dev directory. According to your robots.txt file, it shouldn't be crawled.
If, however, some other site offers up a link like this:
"http://www.yoursite.com/Dev/FOO.ASP"
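then a crawler that ignores robots.txt will follow that link straight to your
FOO.ASP page, and even a search engine that honors the file may still list the
bare URL, because it learned the address from the other site. A robots.txt file
can do nothing about this. That's why authentication is a necessary extra step
to prevent access.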
About the Author
Russ Cooper is a senior information security analyst with Verizon Business, Inc.
He's also founder and editor of NTBugtraq, www.ntbugtraq.com,
one of the industry's most influential mailing lists dedicated to Microsoft security.
One of the world's most-recognized security experts, he's often quoted by major
media outlets on security issues.