Why an ending slash on a URL matters (or not)

Recently I read a report from Google about the status of their own Web sites in terms of Search Engine Optimization (SEO). Uhh, Google, the search engine giant, is looking at how its own sites behave when a Web crawler stumbles over them! To be honest, it’s perfectly clear to me that in a company as big as Google not everything is perfect in the sense of a consistent corporate appearance. There are different projects, different teams with different strengths and weaknesses, different priorities and, of course, different managers. To summarize the report: in some areas Google does a good job, but in most it doesn’t.

How to optimize

Although SEO has even become a business model for some companies, I believe it’s not so hard to do. In my humble opinion, the most important points a Web site owner should take care of are the following six:

  1. Usage of Web standards like XHTML 1.0 Transitional and making sure the Web site conforms to them.
  2. Usage of the <title> tag.
  3. Usage of meta tags like description and keywords.
  4. Usage of the heading tags <h1>, <h2>, and so on.
  5. Adding canonical URL information to every page of the Web site, if a specific page is reachable from more than one URL.
  6. Writing good content.

Although 6. is quite obvious, it’s the most important point. People often forget about it and then wonder why there is no traffic on their Web site at all.

Items 1. to 5. are technical aspects, and if a Web site owner is using e.g. WordPress, the Web site should already be in good shape. Of course this depends a little on the theme used and the plug-ins the user has installed. For item 1. I always suggest filing a bug report if a theme or a plug-in doesn’t conform to the standards, as I have done for the theme used in this blog. Conformance can easily be tested with the W3C Validator.

Why one should use HTML tags like the title, meta and heading tags is also easy to understand. A Web crawler isn’t a human, so it can’t recognize structural information just because e.g. the font size is different. Helping it by semantically marking up some of the text with the available HTML tags is therefore a good idea. For the same reason, using HTML tables to lay out a page is a bad idea. Although this was standard in Netscape 4.0 times, it isn’t necessary anymore these days.
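To make this a bit more concrete, here is a minimal sketch in Python of what a crawler can read directly from the markup. It is only an illustration built on the standard library's html.parser, not how any real search engine works; the class name OutlineParser and the sample page are made up for this example.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collect the title, named meta tags and headings of a page."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}         # meta name -> content
        self.headings = []     # list of (tag, text) pairs
        self._current = None   # tag whose text is being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("name")
        if tag == "meta" and name:
            self.meta[name.lower()] = attrs.get("content") or ""
        elif tag == "title" or tag in self.HEADINGS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        # Text is only kept while we are inside a title or heading tag.
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        text = " ".join("".join(self._buffer).split())
        if tag == "title":
            self.title = text
        else:
            self.headings.append((tag, text))
        self._current = None


# Hypothetical sample page: one paragraph only *looks* like a heading.
SAMPLE = """<html><head>
<title>Why an ending slash matters</title>
<meta name="description" content="Trailing slashes and crawlers" />
</head><body>
<h1>Slash or no slash</h1>
<p><font size="7">Looks like a heading</font></p>
<h2>Conclusion</h2>
</body></html>"""

parser = OutlineParser()
parser.feed(SAMPLE)
parser.close()
print(parser.title)
print(parser.meta)
print(parser.headings)
```

The paragraph with the huge font never shows up in the collected headings. To a program that only looks at the tags, it is ordinary text, no matter how big it is rendered in a browser.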

Slash or no slash

Item 5. is about giving the Web crawler a clear idea of the structure of your Web site as a whole. To a human, 64k-tec.de/test and 64k-tec.de/test/ don’t seem to be very different. From a technical point of view, they are. Considering that the Web itself grew up in a UNIX environment, the former points to a file and the latter to a directory. This means that for a Web crawler two different pages are targeted. The easiest way to fix this is to decide on one naming scheme and use it globally: either the one form or the other. WordPress uses the “ending with a slash” variant (most of the time). By the way, this is also important for 64k-tec.de/test/index.html and other variants. Another way is to tell the Web crawler the canonical address, even if the page is served from another URL. This can be done by adding a link tag with rel set to canonical to the head of the page. On my homepage this looks as follows:


As you can see, even for the root of the domain a slash is added. WordPress has done this automatically for you since version 2.3. For older versions, plug-ins for this task are available. The canonical tag is a good way to make clear which address is the base URL of a specific page. On the other hand, I see some potential for improvement. I have found two places in my blog where the base address isn’t targeted correctly. The first one is the link tag for the site index relationship. On my blog it is written as follows:


The second one is the link on the logo shown at the top of every page of the blog. It uses the following link:

64k

As you can see, the ending slash is missing both times. It is not really a problem, because the page itself uses the canonical tag. The second mistake is clearly a fault of the theme. It’s not entirely clear to me whether the first wrong target is a fault of the theme or of WordPress itself. It also happens with the default WordPress theme (version checked is 2.9.2).
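A check like this can be automated. The following Python sketch collects the canonical URL and all other links of a page and reports those which differ from the canonical address only by the ending slash. Everything in it is an assumption for illustration: the names LinkCollector and differs_only_by_slash are invented, and the sample page uses example.org, not the real markup of this blog.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Record the canonical URL and every other href found in a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if tag not in ("a", "link") or not href:
            return
        rel = (attrs.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel:
            self.canonical = href
        else:
            self.hrefs.append(href)


def differs_only_by_slash(url, canonical):
    """True if the paths differ, but only in their trailing slashes."""
    a, b = urlsplit(url), urlsplit(canonical)
    same_host = (a.scheme, a.netloc.lower()) == (b.scheme, b.netloc.lower())
    same_rest = (a.query, a.fragment) == (b.query, b.fragment)
    return (same_host and same_rest and a.path != b.path
            and a.path.rstrip("/") == b.path.rstrip("/"))


# Hypothetical page: canonical URL with a slash, two links without one.
PAGE = """<html><head>
<link rel="canonical" href="http://example.org/" />
<link rel="index" href="http://example.org" />
</head><body>
<a href="http://example.org">Logo</a>
<a href="http://example.org/test/">Test</a>
</body></html>"""

collector = LinkCollector()
collector.feed(PAGE)
collector.close()
mismatches = [h for h in collector.hrefs
              if differs_only_by_slash(h, collector.canonical)]
print(mismatches)
```

The sketch compares the strings literally, so it also flags the bare host name. Under the usual URL normalization rules an empty path and "/" are equivalent for http URLs, which fits the observation that the missing slash on the root address is not really a problem. For a path like /test versus /test/ no such rule applies, and the two stay different addresses.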

Conclusion

Creating a Web site which is easily understandable by a Web crawler isn’t magic. Of course you could make a science out of it, and there are tons of plug-ins for WordPress available. On the other hand, following some simple rules will help a lot. Reading Google’s hints or using Google Webmaster Tools might help, too, even for Web crawlers from other companies.

On a final note, here is a nice article about how to effectively keep users away from your blog ;).