
Sids are Spider Killers

Continuing the discussion of session IDs from our last article, History of the Session IDs: you, like many other webmasters, may be wondering what is wrong with passing a session ID in the query string. After all, it sounds like a great way to keep sessions intact when surfers do not have 'cookies' enabled in their web browser.

Well, the problem is that no one clued the search engines in. Or perhaps it's better to say that the search engines didn't have the foresight to see this coming and develop a better way to handle it.

To understand the problem, you first need to understand how search engines get the pages of your website into their index for searchers to find. I've oversimplified this process in the following description so it will be easier to follow along.

Search engines like Google usually use two different programs when building their index. For illustrative purposes, we'll call the first program the 'gatherer'. Google sends the gatherer out to each website it knows about on the internet. It's the gatherer's job to follow every single link it can find on each website. While doing that, it compiles two lists: a list of 'external' links and a list of 'internal' links. It uses its external list to maintain and update its grand list of all the websites it knows about across the internet. It uses its internal list to get an idea of all of the pages that exist within your website.
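To make the two lists concrete, here is a minimal sketch in Python of how a gatherer-style program might sort the links on one page into 'internal' and 'external'. It is only an illustration of the simplified description above, not how any real search engine is written; the function name classify_links and the example.com URLs are made up for the example.

```python
from urllib.parse import urljoin, urlparse


def classify_links(page_url, hrefs):
    """Split the links found on page_url into internal and external sets.

    Hypothetical helper: a link counts as internal when it resolves to the
    same host as the page it was found on.
    """
    site_host = urlparse(page_url).netloc.lower()
    internal, external = set(), set()
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        if urlparse(absolute).netloc.lower() == site_host:
            internal.add(absolute)
        else:
            external.add(absolute)
    return internal, external


# Made-up page and links, for illustration only.
internal, external = classify_links(
    "http://www.example.com/index.php",
    ["products.php", "/about.php", "http://other.example.org/"],
)
```

Here the two relative links end up on the internal list, and the link to the other host ends up on the external list.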

The second search engine program is what I call the 'parser'. Google sends its parser out to each website that it knows about across the internet, and the parser starts going through that website's internal list of URLs that the gatherer put together earlier. For each URL in the list, the parser reads in the content of the page. Once the parser has read the content for all of the URLs on the internal list, Google sends the gatherer back out a few days later to make sure it didn't miss any URLs.

Here's where the problem occurs. Search engine programs like the gatherer and the parser don't have 'cookies' enabled. The website they are visiting assigns a session ID to the visit, not knowing that the visitor is actually a search engine. Since that visitor doesn't have cookies enabled, the session is kept alive by passing the session ID through the query string.
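The fallback described above can be sketched roughly as follows. This is a simplified Python illustration under my own assumptions, not the code of any particular shopping cart: the parameter name 'sid' and the helpers start_session and link_for are hypothetical.

```python
import secrets
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

SESSION_PARAM = "sid"  # hypothetical parameter name


def start_session(cookies):
    """Return (session_id, via_cookie) for an incoming request.

    If the visitor sent a session cookie, reuse it. Otherwise (cookies
    disabled, or a search engine) create a fresh session ID.
    """
    if SESSION_PARAM in cookies:
        return cookies[SESSION_PARAM], True
    return secrets.token_hex(16), False


def link_for(url, session_id, via_cookie):
    """Build a link for the page; append the session ID when there is no cookie."""
    if via_cookie:
        return url
    parts = urlparse(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((SESSION_PARAM, session_id))
    return urlunparse(parts._replace(query=urlencode(query)))


# A visitor without cookies gets the session ID added to every link.
sid, via_cookie = start_session({})
link = link_for("http://www.example.com/products.php?cat=3", sid, via_cookie)
```

A visitor with cookies gets clean links back, while a visitor without them (which includes the gatherer and the parser) sees every link carrying a freshly generated session ID.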

The first time the gatherer goes through the site really isn't much of a problem (except that all of your pages in the search engine's index will have the query string attached; for an example, see http://www.google.com/search?q=%22powered+by+oscommerce%22+inurl%3A%22si...).
However, the second time the gatherer goes through the site, after the parser is done, it is assigned a different session ID, because the first one has usually expired by then. Because the URLs are different (a different session ID in the query string), the gatherer thinks it has found a whole new set of URLs that it missed earlier, even though they are the exact same pages as before, just with a different session ID added to them. So it adds all of these new URLs to its internal list and promptly sends the parser back out again.
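A small Python sketch shows why the second pass looks like a set of new pages. The URLs, the 'sid' parameter name and the session ID values are made up; strip_session_id is a hypothetical helper included only to show that the pages are the same underneath, not a claim about what any search engine actually does.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

SESSION_PARAM = "sid"  # hypothetical parameter name


def strip_session_id(url):
    """Return the URL with the session ID parameter removed."""
    parts = urlparse(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != SESSION_PARAM
    ]
    return urlunparse(parts._replace(query=urlencode(query)))


# The same two pages, seen on two visits with different session IDs.
first_crawl = [
    "http://www.example.com/products.php?cat=3&sid=aaa111",
    "http://www.example.com/about.php?sid=aaa111",
]
second_crawl = [
    "http://www.example.com/products.php?cat=3&sid=bbb222",
    "http://www.example.com/about.php?sid=bbb222",
]

# Comparing raw URLs: every URL from the second visit looks new.
naive_list = set(first_crawl) | set(second_crawl)

# Ignoring the session ID: only two distinct pages exist.
real_pages = {strip_session_id(url) for url in first_crawl + second_crawl}
```

Compared as raw strings, the internal list grows to four URLs after two visits, and it keeps growing with every new session ID, even though the site only has two pages.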

Ultimately, this causes Google and other search engines to go into an endless loop of gathering and parsing the pages of your website. This can be costly to the website owner for several reasons.

First, left unresolved, I've seen this endless loop consume over 100 GB of bandwidth in less than two weeks. Most hosting accounts charge extra for this much transfer.

Second, you could end up with each page of your site indexed in Google more than once. That doesn't sound too bad until you remember that Google will penalize a website for duplicate content. This could cause a page to actually be de-indexed.

Third, once Google realizes that it gets into an endless loop by going to your website, it may decide not to come back. Your listings will then go stale as your content changes.

In October 2002, I worked with some other developers and came up with a script that identifies a visitor as a search engine by using a combination of the browser's user agent and known search engine IP addresses. If the visitor was identified as a search engine, then no session ID was assigned.
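The idea behind that script can be sketched in a few lines of Python. This is not the original script, only a rough illustration of the approach: the user agent tokens and the IP range below are placeholders (the range is one reserved for documentation), the function names are hypothetical, and I am assuming that a match on either signal is enough to treat the visitor as a search engine.

```python
import ipaddress

# Placeholder values; a real list would have to be maintained over time.
SPIDER_UA_TOKENS = ("googlebot", "slurp", "msnbot")
SPIDER_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]  # documentation range


def is_spider(user_agent, remote_ip):
    """Guess whether a visitor is a search engine.

    Assumption: either a known user agent token or a known IP range
    is sufficient to count as a match.
    """
    ua = (user_agent or "").lower()
    if any(token in ua for token in SPIDER_UA_TOKENS):
        return True
    try:
        address = ipaddress.ip_address(remote_ip)
    except ValueError:
        return False  # unparseable address: treat as a normal visitor
    return any(address in network for network in SPIDER_NETWORKS)


def should_assign_session_id(user_agent, remote_ip):
    """Only regular visitors get a session ID; search engines do not."""
    return not is_spider(user_agent, remote_ip)
```

With a check like this in front of the session code, a search engine sees the same URLs on every visit, so the gatherer has nothing new to add to its internal list.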

Technically, at the time, this was against Google's terms of service, as Google insisted that its programs be shown the exact same content and URLs as a regular surfer. But, as Google didn't have (and still really doesn't have) a way to resolve the endless loop created by session IDs, we were left with no alternative.

The script I helped develop was modified several times, and was eventually adopted into the core code of not only osCommerce, but also Zen-Cart and other derivatives. It's not the perfect solution, but if you want to allow people to use your website even if they have cookies disabled, it's as close as I've seen to a solution.

These days, surfers are much more accustomed to allowing cookies to be placed on their computers, and fewer and fewer people actually have them disabled. One day, perhaps everyone will be so used to allowing cookies that this won't be an issue. That's probably about the time that Google and other search engines will finally get around to solving the problem.
