Optimizing Web Site Navigation

While outlining universal nodes and topical anchors, discussing link placement, and supporting search engine crawls, I've dropped a few snippets of information on internal linking in a Web site's (outer) navigation. Now let's draw the whole picture by looking at the impact navigational links have on search engine placements.


Web site navigation obviously must be user friendly. User friendliness, plus a few tweaks and shortcuts implemented for search engine spiders, makes up a search engine friendly navigation. Laying out navigational links to lead users straight to the content they're searching for allows some fair search engine optimizing. What you should never do is tweak the linkage for the engines when this results in a loss of usability and visitor support.


Technically, outer navigation elements are a part of the page's template (see page partitioning and link placement). Search engines can distinguish templated page areas from the body's (unique) content. As a matter of fact, they weight text and links differently depending on the page area. How much power navigational links have with regard to ranking depends on the site's architecture. On sites where the outer navigation is very repetitive, that is, the menus get duplicated over and over with very few page-specific items, those navigational links are treated like artificial links and their power gets downgraded to next to nothing. You will find this kind of flawed design at many eCommerce sites, where the static outer navigation (i.e. links to product lines and the home page) is identical on most pages. The in-depth linkage is represented by the dynamically generated inner navigation, that is, links nearby or within the page body.


Since it makes no sense to deal with impotent page areas, clever developers balance the linking power by placing as much in-depth navigation as possible in the outer navigation areas. The goal is to drill down the outer navigation to the last node (e.g. product), while restricting the inner navigation to within-the-node linkage (e.g. product sizes and colors). Sometimes it's even possible to integrate a complex node's internal navigation with the outer navigation. This approach enables powerful linking in the peripheral areas, because every node comes with a different menu, that is, less repetition (link duplication).
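
To make this concrete, here is a hypothetical sketch of what a node-specific outer menu might look like for a product node; all names and URLs are invented for illustration:

<!-- Outer navigation, drilled down to the current product node -->
<ul id="menu">
  <li><a href="/">Home</a></li>
  <li><a href="/widgets/">Widgets</a>
    <ul>
      <li><a href="/widgets/blue-widget/">Blue Widget</a></li>
      <li><a href="/widgets/red-widget/">Red Widget</a></li>
    </ul>
  </li>
</ul>
<!-- Inner navigation stays within the node, e.g. links to sizes and colors -->

Each node gets its own variant of this menu, so the peripheral link blocks differ from page to page instead of repeating one sitewide menu.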


Another advantage of node-specific outer navigation is that it supports internal authority hubs. With less than a handful of links leading to upper levels, most of the linking power gets used to strengthen on-topic (navigational) links. Additionally, a node-specific outer navigation, e.g. a left-hand menu, bread crumbs near the top, and horizontal links at the bottom, develops enough linking power (mostly received from deep inbound links) to support the root and the main sections as well. Thus having a search engine unfriendly DHTML menu or Flash-based navigation at the top or right side of the page no longer hurts; it may even help to establish topical authority hubs.


To demonstrate the impact keywords placed in a page's templated area can have on rankings, here is an example of a node-specific outer navigation where a search engine assigns a lot of weight to navigational anchor text. The search term is "Internet Google", which is, by the way, a totally useless #4 spot, because nobody searches for it. I've picked it because it pulls 68.5 million results at Google, because "Internet" is not closely related to the on-page content (the word "Internet" appears only once, in a navigational link and the URL), and because it at least looks like a popular search. Here is the SERP:


Google's first SERP for 'Internet Google'

Look at the snippet and the screenshot of the page at the time of indexing. The sequence of keywords in the bread crumbs' anchor text makes (most of) the placement on the SERP.


This page is about 'Google Sitemap Validation' and has not so much to do with 'Internet'

It works fine with a few other useless keyword phrases taken from the page's bread crumbs too:

* Utterly Useless Keywords: smart internet business google
* Utterly Useless Keywords: it consulting internet google
* Utterly Useless Keywords: smart internet google
* Utterly Useless Keywords: consulting internet google

... but don't expect it to be that easy to achieve top rankings in competitive markets. However, keywords within the navigation can help to define themes and topics, so you should use your most important keywords in prominent navigational anchor text.


Bread crumbs are a great way to break down a site's theme into topics and sub-topics. This You are here path to the root index page, placed near the top of each page, can act as an authority hub's sole connection to its upper hierarchy levels. It's not even necessary to repeat the complete path to the root in the left-hand menu.
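
In markup, such a breadcrumb trail might look like the following sketch (the path segments are illustrative, loosely modeled on the example page above):

<!-- "You are here" path near the top of each page -->
<p class="breadcrumbs">
  <a href="/">Home</a> &raquo;
  <a href="/consulting/">Consulting</a> &raquo;
  <a href="/consulting/internet/">Internet</a> &raquo;
  Google Sitemap Validation
</p>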


By the way, consistently linking the current node is neither lazy nor useless, because in complex nodes the current page is often different from the node's point of entry.


Other important navigation elements are top-level links, stored popular searches, and horizontal views. The number of top-level links, leading to the home page and main sections, must be kept as low as possible to avoid diluting the topical authority built around the rich nodes in deeper levels. There is nothing to be said against nicely formatted top-level links which aren't spiderable, e.g. in JavaScript-based drop-down menus, if they improve the surfing experience. From an SEO point of view they are (in most cases) pretty much useless.


Horizontal views are, for example, indexes of all image galleries, all tutorials related to a product line, or all articles related to a broader theme. These indexes may or may not reflect a part of the site's hierarchy, but mostly they are used as content-type-oriented rather than theme-specific layers. Like site maps, horizontal views should not contain more than 100 links per page; 15-25 links plus descriptions and/or previews are a proven limit. The content linked on a site map page or on a horizontal index should be describable with a short catchword (phrase) in a menu item. If that doesn't work, the collection of links is probably useless altogether.

Methods to Support Search Engines in Crawling and Ranking

Let's recap the basic methods of steering and supporting search engine crawling and ranking:

# Provide unique content. A lot of unique content. Add fresh content frequently.
# Acquire valuable inbound links from related pages on foreign servers, regardless of their search engine ranking. Actively acquire deep inbound links to content pages, but accept home page links. Do not run massive link campaigns if your site is rather new. Let the amount of relevant inbound links grow smoothly and steadily to avoid red-flagging.
# Put in carefully selected outbound links to on-topic authority pages on each content page. Ask for reciprocal links, but do not dump your links if the other site does not link back.
# Implement a surfer-friendly, themed navigation. Go for text links to support deep crawling. Give each page at least one internal link from a static page, for example from a site map page.
# Encourage other sites to make use of your RSS feeds and the like. To protect the uniqueness of your site's content, do not put text snippets from your site into feeds or submitted articles. Write short summaries instead, using different wording.
# Use search engine friendly, short but keyword rich URLs. Hide user tracking from search engine crawlers.
# Log each crawler visit and keep these data forever. Develop smart reports querying your logs and study them frequently. Use these logs to improve your internal linking.
# Make use of the robots exclusion protocol to keep spiders away from internal areas. Do not try to hide your CSS files from robots.
# Make use of the robots META tag to ensure that only one version of each page on your server gets indexed. When it comes to pages carrying partial content of other pages, make your decision based on common sense, not on any SEO bible.
# Use rel="nofollow" in your links, when you cannot vote for the linked page (user submitted content in guestbooks, blogs ...). Do not hoard PageRank™.
# Make use of Google Sitemaps as a 'robots inclusion protocol'.
# Do not cheat the search engines.

About /robots.txt

In a nutshell

Web site owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

It works like this: a robot wants to visit a Web site URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.

There are two important considerations when using /robots.txt:

* robots can ignore your /robots.txt. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention.
* the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use.

So don't try to use /robots.txt to hide information.

See also:

* Can I block just bad robots?
* Why did this robot ignore my /robots.txt?
* What are the security implications of /robots.txt?

The details

The /robots.txt is a de-facto standard, and is not owned by any standards body. There are two historical descriptions:

* the original 1994 A Standard for Robot Exclusion document.
* a 1997 Internet Draft specification A Method for Web Robots Control

In addition there are external resources:

* HTML 4.01 specification, Appendix B.4.1
* Wikipedia - Robots Exclusion Standard

The /robots.txt standard is not actively developed. See What about further development of /robots.txt? for more discussion.

The rest of this page gives an overview of how to use /robots.txt on your server, with some simple recipes. To learn more see also the FAQ.
How to create a /robots.txt file
Where to put it

The short answer: in the top-level directory of your web server.

The longer answer:

When a robot looks for the "/robots.txt" file for a URL, it strips the path component from the URL (everything from the first single slash), and puts "/robots.txt" in its place.

For example, for "http://www.example.com/shop/index.html", it will remove the "/shop/index.html", and replace it with "/robots.txt", and will end up with "http://www.example.com/robots.txt".

So, as a web site owner you need to put it in the right place on your web server for that resulting URL to work. Usually that is the same place where you put your web site's main "index.html" welcome page. Where exactly that is, and how to put the file there, depends on your web server software.
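
If you want to compute that location programmatically, here is a minimal Python sketch (the function name is mine, not part of any standard):

from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    # Keep the scheme and host (including any port); drop path, query, fragment.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://www.example.com/shop/index.html"))
# -> http://www.example.com/robots.txt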

Remember to use all lower case for the filename: "robots.txt", not "Robots.TXT".

See also:

* What program should I use to create /robots.txt?
* How do I use /robots.txt on a virtual host?
* How do I use /robots.txt on a shared host?

What to put in it
The "/robots.txt" file is a text file, with one or more records. Usually contains a single record looking like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

In this example, three directories are excluded.

Note that you need a separate "Disallow" line for every URL prefix you want to exclude -- you cannot say "Disallow: /cgi-bin/ /tmp/" on a single line. Also, you may not have blank lines in a record, as they are used to delimit multiple records.

Note also that globbing and regular expressions are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".

What you want to exclude depends on your server. Everything not explicitly disallowed is considered fair game to retrieve. Here follow some examples:
To exclude all robots from the entire server

User-agent: *
Disallow: /


To allow all robots complete access

User-agent: *
Disallow:

(or just create an empty "/robots.txt" file, or don't use one at all)
To exclude all robots from part of the server

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

To exclude a single robot

User-agent: BadBot
Disallow: /

To allow a single robot

User-agent: Google
Disallow:

User-agent: *
Disallow: /

To exclude all files except one
This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:

User-agent: *
Disallow: /~joe/stuff/

Alternatively you can explicitly disallow all disallowed pages:

User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html


Robots Exclusion Protocol: now with even more flexibility

This is the third and last in my series of blog posts about the Robots Exclusion Protocol (REP). In the first post, I introduced robots.txt and the robots META tags, giving an overview of when to use them. In the second post, I shared some examples of what you can do with the REP. Today, I'll introduce two new features that we have recently added to the protocol.

As a product manager, I'm always talking to content providers to learn about your needs for REP. We are constantly looking for ways to improve the control you have over how your content is indexed. These new features will give you flexible and convenient ways to improve the detailed control you have with Google.

Tell us if a page is going to expire
Sometimes you know in advance that a page is going to expire in the future. Maybe you have a temporary page that will be removed at the end of the month. Perhaps some pages are available free for a week, but after that you put them into an archive that users pay to access. In these cases, you want the page to show in Google search results until it expires, then have it removed: you don't want users getting frustrated when they find a page in the results but can't access it on your site.

We have introduced a new META tag that allows you to tell us when a page should be removed from the main Google web search results: the aptly named unavailable_after tag. This one follows a similar syntax to other REP META tags. For example, to specify that an HTML page should be removed from the search results after 3pm Eastern Standard Time on 25th August 2007, simply add the following tag to the head section of the page:

<META NAME="GOOGLEBOT" CONTENT="unavailable_after: 25-Aug-2007 15:00:00 EST">

The date and time is specified in the RFC 850 format.

This information is treated as a removal request: it will take about a day after the removal date passes for the page to disappear from the search results. We currently only support unavailable_after for Google web search results.

After the removal, the page stops showing in Google search results but it is not removed from our system. If you need a page to be excised from our systems completely, including any internal copies we might have, you should use the existing URL removal tool which you can read about on our Webmaster Central blog.

Meta tags everywhere
The REP META tags give you useful control over how each webpage on your site is indexed. But they only work for HTML pages. How can you control access to other types of documents, such as Adobe PDF files, video and audio files, and other types? Well, now the same flexibility for specifying per-URL tags is available for all other file types.

We've extended our support for META tags so they can now be associated with any file. Simply add any supported META tag to a new X-Robots-Tag directive in the HTTP Header used to serve the file. Here are some illustrative examples:

* Don't display a cache link or snippet for this item in the Google search results:

X-Robots-Tag: noarchive, nosnippet

* Don't include this document in the Google search results:

X-Robots-Tag: noindex

* Tell us that a document will be unavailable after 7th July 2007, 4:30pm GMT:

X-Robots-Tag: unavailable_after: 7 Jul 2007 16:30:00 GMT

You can combine multiple directives in the same document. For example:

* Do not show a cached link for this document, and remove it from the index after 23rd July 2007, 3pm PST:

X-Robots-Tag: noarchive
X-Robots-Tag: unavailable_after: 23 Jul 2007 15:00:00 PST
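
On Apache, one common way to emit such headers is mod_headers; this sketch (the file pattern and directive values are illustrative) would apply REP directives to every PDF on a site:

# Requires mod_headers; applies the tags to all PDF files.
<FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex, noarchive"
</FilesMatch>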

Our goal for these features is to provide more flexibility for indexing and inclusion in Google's search results. We hope you enjoy using them.

I Robot | Robots.txt Help | SebastianX of Sebastians Pamphlets

Hobo - Right Sebastian! What do you think you are doing, calling me out on a slight bit of “misinformation” in a post I made for a bit of branding? Just who do you think you are, spamming my content with useful, original and interesting content?

Don’t you realize that @ 1,500 stumblers and Twitters visited my site as a result of this slapping?? You trying to discredit me? :)

Sebastian - Howdy Shaun, - I’m so sorry that I discredited you, that was really not my intention.

I couldn’t resist coz robots.txt is kinda pet peeve of mine. Thanks for the opportunity to spam your neat blog with my links thoughts, though. :)

Hobo: That post was about how expert SEO people were using Robots.txt - I should have put a disclaimer at the bottom saying I didn’t know a thing about robots.txt files and that I had nicked mine some time ago from Michael Gray and forgotten about it. And spam my blog all you like with that kind of content, although I’ve got Lucia’s Linky Love installed, so generally spam doesn’t get much of a foothold about these parts (actually I am not even sure if that is working properly).

OK - you seem to know what you’re on about when it comes to robots.txt. Fancy educating me and the Hobo team as to what you’ve learned and know about these often misunderstood files? You know, all that stuff that took you years to learn. Let me have it….now!

Hobo - WTF is a Robots.txt file, Sebastian, in simple idiot’s terms?

Well, the “idiot’s version” will lack interesting details, but it will get you started.

Robots.txt is a plain text file. You must not edit it with HTML editors, word processors, or any applications other than a plain text editor like vi (OK, notepad.exe is allowed too). You shouldn’t embed images and such; any HTML code is strictly forbidden.

Hobo - Why shouldn’t I edit it with my Dreamweaver FTP client, for instance?

Because all those fancy apps insert useless crap like formatting, HTML code and whatnot. Most probably search engines aren’t capable of interpreting a robots.txt file like:

{\b\lang2057\langfe1031\langnp2057\insrsid6911344\charrsid11089941 User-agent: Googlebot}{ \lang2057\langfe1031\langnp2057\insrsid6911344\charrsid11089941 \line Disallow: / \line Allow: }{\cs15\i\lang2057\langfe1031\langnp2057\insrsid6911344\charrsid2903095 /}{\i\lang2057\langfe1031\langnp2057\insrsid6911344\charrsid2903095 content}{ \cs15\i\lang2057\langfe1031\langnp2057\insrsid6911344\charrsid2903095 /} …
(OK, OK, I’ve made up this example, but it represents the raw contents of text files saved with HTML editors and word processors.)

Hobo - Where Do I put the damn thing?

Robots.txt resides in the root directory of your Web space, that’s either a domain or a subdomain, for example “/web/user/htdocs/example.com/robots.txt” resolving to http://example.com/robots.txt.

Hobo - Can I use robots.txt in subdirectories?

Of course you’re free to create robots.txt files in all your subdirectories, but you shouldn’t expect search engines to request or obey those. If you for some weird reason use subdomains like crap.example.com, then example.com/robots.txt is not exactly a suitable instrument to steer crawling of subdomains, so ensure each subdomain serves its own robots.txt.

When you upload your robots.txt, make sure to do it in ASCII mode; your FTP client usually offers “ASCII|Auto|Binary” - choose “ASCII”, even when you’ve used an ANSI editor to create it.

Hobo - Why?

Because plain text files contain ASCII content only. Sometimes standards that say “upload *.htm *.php *.txt .htaccess *.xml files in ASCII mode to prevent them from inadvertent corruption during the transfer, storage with invalid EOL codes, etc.” do make sense. (You asked for the idiot version, didn’t you?)

Hobo - What about if I am on a Free Host?

If you’re on a free host, robots.txt is not for you. Your hosting service will create a read-only robots.txt “file” that’s suited to stealing even more traffic than the ads you can’t remove from your headers and footers.

Now, if you’re still interested in the topic, you must learn how search engines work to understand what you can achieve with a robots.txt file and what’s just myths posted on your favorite forum.

Hobo - Sebastian, Do you know how search engines work, then?

Yep, to some degree. ;) Basically, a search engine has three major components:

1. A crawler that burns your bandwidth fetching your unchanged files over and over until you’re belly up.
2. An indexer that buries your stuff unless you’re Matt Cutts or blog on a server that gained search engine love by making use of the cruelest black hat tactics you can think of.
3. A query engine that accepts search queries and pulls results from the search index but ignores your stuff coz you’re neither me nor Matt Cutts.

Hobo - What goes into the robots.txt file?

Your robots.txt file contains useful but pretty much ignored statements like
# Please don't crawl this site during our business hours!
(the crawler is not aware of your time zone and doesn’t grab your office hours from your site), as well as actual crawler directives. In other words, everything you write in your robots.txt is a directive for crawlers (dumb Web robots that can fetch your contents but nothing more), not indexers (highly sophisticated algorithms that rank only brain farts from Matt and me).

Hobo - I say index, you say crawl. You say tomato, I say….ah! I see!

Currently, there are only three statements you can use in robots.txt:

1. Disallow: /path
2. Allow: /path
3. Sitemap: http://example.com/sitemap.xml

Some search engines support other directives like “crawl-delay”, but that’s utter nonsense, hence safely ignore those.

The content of a robots.txt file consists of sections dedicated to particular crawlers. If you’ve nothing to hide, then your robots.txt file looks like:
User-agent: *
Disallow:
Allow: /
Sitemap: http://example.com/sitemap.xml

If you’re comfortable with Google but MSN scares you, then write:
User-agent: *
Disallow:

User-agent: Googlebot
Disallow:

User-agent: msnbot
Disallow: /

Please note that you must terminate every crawler section with an empty line. You can gather the names of crawlers by visiting a search engine’s Webmaster section.

From the examples above you’ve learned that each search engine has its own section (at least if you want to hide anything from a particular SE), that each section starts with a
User-agent: [crawler name]
line, and that each section is terminated with a blank line. The user agent name “*” stands for the universal Web robot; that means that if your robots.txt lacks a section for a particular crawler, it will use the “*” directives, and that when you have a section for a particular crawler, it will ignore the “*” section. In other words, if you create a section for a crawler, you must duplicate all statements from the “all crawlers” (“User-agent: *”) section before you edit the code.

Now to the directives. The most important crawler directive is
Disallow: /path

“Disallow” means that a crawler must not fetch contents from URIs that match “/path”. “/path” is either a relative URI or a URI pattern (“*” matches any string and “$” marks the end of a URI). Not all search engines support wildcards; for example, MSN lacks any wildcard support (they might grow up some day).

URIs are always relative to the Web space’s root, so if you copy and paste URLs then remove the http://example.com part but not the leading slash.

Allow: /path
refines Disallow: statements, for example
User-agent: Googlebot
Disallow: /
Allow: /content/
allows crawling only within http://example.com/content/

Sitemap: http://example.com/sitemap.xml
points search engines that support the sitemaps protocol to the submission files.
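
For reference, a minimal submission file per the sitemaps.org protocol looks like this (the URL and date are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/content/</loc>
    <lastmod>2007-08-25</lastmod>
  </url>
</urlset>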

Please note that all robots.txt directives are crawler directives that don’t affect indexing. Search engines do index disallow’ed URLs, pulling title and snippet from foreign sources, for example ODP (DMOZ - The Open Directory) listings or the Yahoo directory. Some search engines provide a method to remove disallow’ed contents from their SERPs on request.

Hobo - Say I want to keep a file / folder out of Google. Exactly what would I need to do?

You’d check each HTTP request for Googlebot and serve it a 403 or 410 HTTP response code. Or put a “noindex,noarchive” Googlebot meta tag on the page:
<meta name="Googlebot" content="noindex,noarchive" />
Robots.txt blocks with Disallow: don’t prevent indexing. Don’t block crawling of pages that you want deindexed, unless you fancy using Google’s robots.txt based URL terminator every six months.

Hobo - Sebastian, thanks so much for your invaluable insights into this pesky but powerful file. Your blog was recently cited by Jim Boykin as a favourite destination of Jim’s. If I had to ask you to tell the readers 5 of your favourite posts on your own website, which ones would you pick?

Being a greedy link-whore, of course I’d pick my Canonical SEO definitions. I hope you don’t mind that this link points to my sitemap (sheesh, you’ve spammed the rest of my blog anyways - Hobo …. :)), which some folks have even sphunn. OK, that’s zero, and here is the list of 5 posts that I consider somewhat useful, either because they’re interesting from a technical point of view, or because they tell something about me.

0. The anatomy of a server sided redirect: 301, 302 and 307 illuminated SEO wise
0. Shit happens, your redirects hit the fan!
0. Why proper error handling is important
1. Analyzing search engine rankings by human traffic
2. If you free-host your blog flee now!
3. Microsoft funding bankrupt Live Search experiment with porn spam
4. SEOs home alone - Google’s nightmare
5. My plea to Google - Please sanitize your REP revamps

I should have mentioned earlier that counting somewhat challenges me when it comes to limits of link lists. Of course I like a few more of my posts, but I can resist quoting my blog’s site map.

Hobo - Where online do you hang out?

At Sphinn and Google’s Webmaster Help Group. For the latter some folks call me a slimy Google groupie, but I can perfectly live with that. Google’s SEO forum is a nice place to help noobs and discuss interesting topics as well.

Hobo - Who do you read every day/week?

Oh well. That’s a very long list. Probably the OPML file would be too large to email. I read (sometimes skim) my friends’ posts daily, or, when I’m swamped, at least weekly. I guess the best way to get a grip on my reading preferences is my shared feed, my list of stumbles, bookmarks, and sphinns.

Hobo - Tell me who your favourite music band is? Mine is the Stone Roses, have you heard of them?

Today that’s Ten Years After, yesterday it was Bob Dylan. Stone Roses is not on my radar, maybe I missed out on a great band?

Hobo - What else are you interested in online?

Tough question. What can a lonely geek do online? Viewing porn of course. Seriously, I consume more technical stuff than smut.

Hobo - I’ll send you a couple of links complete with free passwords I confiscated off my Managing Director, Michael ;)

Can’t wait for this list. If it contains passwords from one of my adult sites I’ll sue Michael! ;)

Hobo - If someone wants to know more about robots.txt, where do they go?

Honestly, I don’t know a better resource than my brain, partly dumped here. I even developed a few new robots.txt directives and posted a request for comments a few days ago. I hope that Google, the one and only search engine that seriously invests in REP evolvements, will not ignore this post because of the sneakily embedded “Google bashing”. I plan to write a few more posts, not that technical and with real world examples.

Hobo - Can I ask you how you auto generate and mask robots.txt, or is that not for idiots? Is that even ethical?

Of course you can ask, and yes, it’s for everybody and 100% ethical. It’s a very simple task; in fact, it’s plain cloaking. The trick is to make the robots.txt file a server sided script. Then check all requests for verified crawlers and serve the right contents to each search engine. A smart robots.txt even maintains crawler IP lists and stores raw data for reports. I recently wrote a manual on cloaked robots.txt files at the request of a loyal reader.
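
As a rough illustration of the idea (not Sebastian’s actual implementation), a server-side robots.txt script in Python might verify crawlers via reverse DNS with a forward-DNS double check and pick a ruleset per engine; all names and host patterns below are assumptions:

import socket

# Hypothetical per-crawler rulesets, keyed by verified host domain.
CRAWLER_RULES = {
    "googlebot.com": "User-agent: Googlebot\nDisallow:\n",
    "search.msn.com": "User-agent: msnbot\nDisallow: /\n",
}
DEFAULT_RULES = "User-agent: *\nDisallow:\n"

def rules_for(ip):
    """Pick a robots.txt body for the requesting IP address."""
    try:
        host = socket.gethostbyaddr(ip)[0]
        # Forward-confirm the reverse entry to defeat spoofed PTR records.
        if socket.gethostbyname(host) != ip:
            return DEFAULT_RULES
    except (socket.herror, socket.gaierror):
        return DEFAULT_RULES
    for domain, rules in CRAWLER_RULES.items():
        if host == domain or host.endswith("." + domain):
            return rules
    return DEFAULT_RULES

A real script would also log each visit (IP, user agent, timestamp) to feed the crawler reports mentioned above.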

Hobo - Think Disney will come after you for your avatar now you are famous after being interviewed on the Hobo blog?

I’m sure they will try, since your blog will become an authority on grumpy red crabs called Sebastian. I’m not too afraid though, because I use as my icon/avatar only a tiny thumbnailed version of an image created by a designer who –hopefully– didn’t scrape it from Disney. If they become nasty, I’ll just pay a license fee and change my avatar on all social media sites, but I doubt that’s necessary. To avoid such hassles I bought an individually drawn red crab from an awesome cartoonist last year. That’s what you see on my blog, and I use it as an avatar as well, at least with new profiles.

Hobo - What’s your day job? Who do you work for?

I’m a freelancer loosely affiliated with a company that sells IT consulting services in several industries. I do Web developer training, software design / engineering (mostly the architectural tasks), and grab development / (technical) SEO projects myself to educate yours truly. I’m a dad of three little monsters, working at home. If you want to hire me, drop me a line. ;)

Sebastian, a big thanks for slapping me about Robots.txt and indeed for helping me craft the Idiot’s Guide To Robots.txt. I certainly learned a lot from talking to you for a day, and I hope some others can learn from this snippet article. You’re a gentleman spammer. :)

Internal Links - Only The First Link Counts in Google?

I thought I would share the results of another simple test I did to see how Google treats internal links.

What does Google count when it finds two links on the same page going to the same internal destination page?

I surmised:

1. Google might count one link, the first it finds as it indexes a page
2. Google might count them all (I think unlikely)
3. Google might count perhaps 55 characters of ALL of the available links (could be useful)

OK - From this test, and the results on this site anyways, testing links internal to this site, it seems Google only counted the first link when it came to ranking the target page.

In much the same way as my recent SEO test, where I tested how many words you should put in a link, I relied on the “These terms only appear in links pointing to this page” message (shown when you click on the cache) that Google helpfully displays when the word isn’t on the page.

Again, I pointed 2 everyday words at a page, words that don’t appear on the page or in links to the page, and searched for the page in Google using a term I knew it would rank highly for (Shaun Anderson) plus my modifier keywords. I left it for quite some time, and checked the results every now and again.

[Screenshot: Google Cache]

Searching for “shaun anderson” + “Keyword 1” returned the page (cache shown above).

[Screenshot: Cartoonist]

Searching for the term “shaun anderson” + “keyword 2” did not return the page at all, only the page with the actual link on it, further down the SERPs.

[Screenshot: Fireman]

Not even in a site search.

[Screenshot: Site Search]

It’s not exactly Google terrorism to identify this, so here is the actual test page where you can see the simple test in action.

So today :) on this site :) in internal links :), Google only counted the first link as far as anchor text transfer is concerned :)

How can you use this to your advantage?

1. Perhaps you could place your navigation below your text
2. This lets you vary the anchor text to important internal pages on your site within the text content, instead of ramming one anchor text link (usually high in the navigation) down Google’s throat
3. Varying anchor text naturally optimises the page, to an extent, for long tail ‘human’ searches you might overlook when writing the actual target page text
4. Of course, I assume links within text surrounded by text are more important than links in navigation menus
5. It makes use of your internal links to rank a page for more terms, especially useful if you link to your important pages often and don’t have a lot of incoming natural links to achieve a similar benefit

Long Tail SEO

Credit - Graphic first sourced at Search Engine Land and created by Elliance, an eMarketing firm.

Works for me anyways, when I’m building new sites; it’s especially useful on longtail searches, when there’s plenty of editorial content being added to the site for me to link to a few sales pages.

Note: I would think Google analyses everything it finds, so it would find it easy to spot the spammy techniques we’ve all seen on sites trying to force Google to take multiple link anchor texts to one page.

25 Web Form Optimization Tips

Stop for a moment and consider the goals of your website. Regardless of whether it’s a purchase through a shopping cart, a lead generation form, a white paper download, or an email opt-in, I’m going to bet every one of these actions requires a customer to use a web form.

With web forms playing such an important role in completing these goals, it goes without saying that we should optimize the heck out of them. Below are 25 tips for doing just that.

1. Ditch the Captchas: Captchas are great for blocking spam, but some evidence suggests they are just as good at blocking conversions. A little spam isn’t the end of the world, and it definitely isn’t worth losing conversions over. If you must use a Captcha, make sure it’s easy to read.
2. Remove Unnecessary Fields: Do you really need to ask for your customers’ date of birth and gender? Even if your customers aren’t concerned about privacy issues, odds are they’re lazy and might just abandon your excessively inquisitive form. Here’s some great advice from Get Elastic on registration forms.
3. Keep It Simple: Just because we can use CSS to do all sorts of fancy things with text boxes, doesn’t mean we should. Keeping form fields simple will ensure that customers understand their purpose and won’t confuse them with design elements.
4. Clear the Clear Button: Having a clear button next to the submit button just makes it easier for customers to accidentally delete what they’ve entered. Skip this unnecessary feature.
5. Cancel the Cancel Button: In the case of long or multi-part form pages, such as checkouts, don’t give customers the option to cancel their decision. That’s equivalent to a commission driven salesperson asking, “are you sure you really want to buy this?”
6. Label Required Fields: People want to do as little as possible. For this reason, let your customers know what they are required to fill out with an asterisk or similar label.
7. Use Point of Action References: If customers are getting confused by the information you’re asking for in a particular field, include a small note with a popup link with more information. For example, one of the most common POA references is an explanation of the 3 digit CVV code found on the back of credit cards.
8. Show Formatting Examples: Some fields should have a note showing how to format them, depending on your database requirements. For example, you might want phone numbers formatted in a certain way, with or without parentheses, dashes, etc. In general though, keep these formatting requirements to a minimum in order to keep it simple for customers.
9. Make it International Friendly: Forms requiring an address can be confusing if they’re built only with US residents in mind. Check out these detailed guidelines for building international friendly forms.
10. Allow Easy Forward and Backward Movement: Customers rarely maneuver through our websites the way we intend them to. In other words, they hit the back button, the forward button, refresh, etc. Depending on how your forms pass data, this could cause error messages such as “this page has expired”. Make sure you test the forward and backward flow of any multiple page forms on your site.
11. Logical Tab Sequence: Don’t you hate it when you hit the tab button, and rather than going to the next field, the focus moves somewhere else on the page? This problem is likely due to the way the form is laid out with HTML tables. Make sure your form tabs in a logical sequence to prevent customers from accidentally skipping fields.
12. Server Side Validation: Basically, there are 2 ways to ensure that your visitors are entering correct data into fields. You can use client-side scripting (such as JavaScript, which is browser dependent) or server side error processing. In addition to server side validation being less reliant on the user’s browser settings, it is also preferable from a security point of view.
13. Clear Error Messages: When displaying error messages when customers enter invalid data, make sure your messages are clear and well placed. This means saying “Please enter an email address” rather than something vague like “you must fill out all fields.” A best practice is taking them right back to the field with incorrect data, and displaying the error message next to it.
14. Show What’s Needed When It’s Needed: It’s best to hide form fields until you know they are absolutely needed. For example, if you already know your user is from the US, you can dynamically hide the province field and show the state drop down box instead.
15. Logical List Order: When using drop down lists or radio button lists, make sure you order them in a logical way, listing items higher if they are selected more often. In other words, if 90% of your customers buy from the USA, don’t list Afghanistan as your first option, and United States at the very bottom.
16. AJAX Validation: Some sites have begun to validate form inputs as soon as the user tabs out of the field. This can be very effective, since it does not break the flow of the process. In other words, it’s easier to correct an error immediately after entering it rather than after the whole form is completed.
17. Remember Me Feature: For login forms, always allow customers to choose a “remember me” option, which uses a cookie to fill in login information the next time. Who wants to remember all those passwords anyway?
18. Set Focus: When a page loads containing a form, sending the cursor to the first required field will prevent users from having to click into the field in order to start typing. This can be accomplished with a simple JavaScript function.
19. Avoid Obnoxious Password Requirements: Ever received this annoying error? “Your password must contain at least one letter, one number, and be at least X digits.” Requiring passwords to be formatted in a certain way may help security, but it will likely discourage return visits since visitors must now remember a new password they are not used to.
20. Progress Indicators: For any forms that span multiple pages, make sure to include a progress indicator letting people know where they are in the process. These are most commonly seen during checkout and would include steps such as “Shipping Info > Payment Info > Receipt Confirmation.”
21. Minimize Scrolling & Pages: A good case can be made for limiting the number of pages in a multi-part form in order to prevent customers from abandoning it. However, an opposing case can also be made that ridiculously long, single-page forms that require scrolling can scare off customers. There’s no sure-fire rule here; it’s a perfect opportunity to perform your own A/B test.
22. Strong Call to Action Buttons: Sometimes “Submit” just doesn’t cut it. In other words, be specific and action oriented with your form buttons.
23. Use External Labels: Have you ever used a form that labeled the field with text that disappears when you click into it? This can be a great space saver, but extremely confusing if a customer forgets what the field is for since the label has disappeared. Here’s a great example of why external form labels are more effective.
24. Prioritize Size and Location of Multiple Button Forms: On a form with multiple action buttons, make sure you emphasize the most important button leading to the conversion. For example, if your final order confirmation screen has 2 buttons, “Finalize Order” and “Edit Order”, make sure the “Finalize Order” button is larger and more prominent.
25. Clear Confirmations: Have you ever filled out a long, tedious form, clicked submit, only to be returned to what seems like the same page with the form empty? You can do everything right with your form, but if you drop the ball on the confirmation, your customers will be helplessly confused. In addition to making a clear confirmation message, check out these other tips to prevent wasting your confirmation page.
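
A minimal HTML sketch pulling a few of these tips together (required-field labels, a formatting example, a point-of-action reference, and an action-oriented button; all names and URLs are invented):

<form action="/checkout" method="post">
  <label for="phone">Phone number *</label>
  <input type="text" id="phone" name="phone" />
  <span class="hint">Format: 555-123-4567</span>

  <label for="cvv">Card security code (CVV) *</label>
  <input type="text" id="cvv" name="cvv" size="4" />
  <a href="#cvv-help">What's this?</a>

  <input type="submit" value="Continue to Payment" />
</form>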

Image Optimization Part 1: The Importance of Images

This is the first in a series of posts about image optimization. In this series, I’ll explore how images affect web site performance and what you can do to your images in order to improve page loading times. (I won’t say how many posts will be in this series, so that I can claim later that I underpromised and overdelivered…).

When you think about improving page response time, one of the first obvious things to think about is the page weight. It’s obvious that, all things being equal, the heavier a page is the slower it will be. If we take this to the extreme, we can say that the fastest page you can possibly have is the blank page. Once you start adding stuff to the blank page, you’re only making it slower.

On a more serious note, it really is up to you how much content you want to put on a page, so let’s focus on what comes next. After you’ve settled on the content, it’s your job to make sure the content and components are as small as possible. Following our Yahoo! performance best practices, you should make sure that all plain text components (HTML, XML, CSS, JavaScript…) are sent compressed over the wire and that you minify CSS and JavaScript.
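
On Apache, for instance, the compression part of that advice is commonly handled by mod_deflate; a one-line sketch (the MIME type list is illustrative):

# Requires mod_deflate; compresses plain-text components on the fly.
AddOutputFilterByType DEFLATE text/html text/xml text/css application/x-javascript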

But what about the images? How can you speed them up without sacrificing quality and looks? And first of all, does it really matter?
How important are the images?

Before we start, let’s see if we should even bother with images. Lately we’ve been witnessing the rise of rich internet applications with lots of JavaScript — by “lots” meaning sometimes 300K or more worth of JavaScript code. In other cases, especially in advertising, Flash seems to be the weapon of choice. So, on average, how much of the page weight is images? It’s easy to answer this question by just looking at Alexa’s top 10 websites in the world (as of October 2008) and using YSlow to check what percent of the total page weight is images. The results are given below.
Percentage of page weight that goes to images (average: 46.6%)

 1. Yahoo!      39%
 2. Google      75%
 3. YouTube     37%
 4. Live.com    94%
 5. Facebook    39%
 6. MSN         59%
 7. MySpace     36%
 8. Wikipedia   34%
 9. Blogger     28%
10. Yahoo! JP   25%

On average, 46.6% of the page weight for these popular sites consists of images, included either inline with IMG tags or via CSS stylesheets. Other studies show that this percentage can be even higher, depending on the cross section of sites you examine. The exact number is not important, because every site is unique and different from the average; for example, Amazon’s home page was 75% images at the time of the experiment.

This is a massive percentage and it tells us one thing: There’s huge potential to improve the performance of websites if we can improve the way we handle the image payload. By focusing on images you can make a difference and delight your site visitors with a faster and more pleasant experience.
To be continued…

Over the course of the following weeks, we’ll be publishing more about image optimization. The topics for discussion include:

* different image formats and how to pick the right one
* ways to put your images on a diet without compromising quality
* optimizing generated images
* the effect of using AlphaImageLoader
* favicons
* CSS sprites
* serving images faster

The series of posts will not require Photoshop or other designers’ domain-specific knowledge, so it should be pretty easy for anyone to learn and apply these techniques. More to come soon…

Google Adds RSS Feeds For Web Search Results

Google RSS Feed screenshot

As expected, Google has added an RSS feed for web search results to the Google Alerts service. As seen in the screenshot above, when creating a new alert, you can now choose to get the alert via email or RSS feed. RSS feed alerts are only available to logged-in Google account holders.

As we reported earlier this month, Google is the last major search engine to offer its web search results via RSS.

This is a good addition, but I have to agree with Google Operating System today: “The new feature from Google Alerts is useful, but Google should’ve provided an option to subscribe to feeds for each search result.”

Removing your entire website using a robots.txt file

You can use a robots.txt file to request that search engines remove your site and prevent robots from crawling it in the future. (It's important to note that if a robot discovers your site by other means - for example, by following a link to your URL from another site - your content may still appear in our index and our search results. To entirely prevent a page from being added to the Google index even if other sites link to it, use a noindex meta tag.)

To prevent robots from crawling your site, place the following robots.txt file in your server root:

User-agent: *
Disallow: /

To remove your site from Google only and prevent just Googlebot from crawling your site in the future, place the following robots.txt file in your server root:

User-agent: Googlebot
Disallow: /

Each port must have its own robots.txt file. In particular, if you serve content via both http and https, you'll need a separate robots.txt file for each of these protocols. For example, to allow Googlebot to index all http pages but no https pages, you'd use the robots.txt files below.

For your http protocol (http://yourserver.com/robots.txt):

User-agent: *
Allow: /

For the https protocol (https://yourserver.com/robots.txt):

User-agent: *
Disallow: /

Source Code for Web Robot Spiders

Robots (also known as spiders, wanderers, worms, crawlers, gatherers, intelligent agents) follow links from one web page to another. They work with indexing code to store data for later searching.
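
To make the core loop concrete, here is a minimal breadth-first crawler sketch in Python; the bot name, contact URL, and politeness delay are assumptions, and a production robot would add per-host queues, duplicate detection, and error handling:

import time
import urllib.robotparser
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import Request, urlopen

class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    # Honor the Robots Exclusion Protocol before fetching anything.
    rp = urllib.robotparser.RobotFileParser(urljoin(seed, "/robots.txt"))
    rp.read()
    queue, seen = deque([seed]), {seed}
    while queue and max_pages > 0:
        url = queue.popleft()
        if not rp.can_fetch("ExampleBot", url):
            continue
        req = Request(url, headers={
            "User-Agent": "ExampleBot/0.1 (+http://example.com/bot.html)"})
        html = urlopen(req).read().decode("utf-8", "replace")
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urldefrag(urljoin(url, href))[0]  # strip #anchors
            if link.startswith(seed) and link not in seen:  # crude same-site check
                seen.add(link)
                queue.append(link)
        max_pages -= 1
        time.sleep(10)  # be polite: roughly one request per ten seconds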

Robot Source Code

There is a good deal of free open source code available -- you don't have to start from scratch. Take a look at some of the options below, in the programming language best suited for your needs. If you'd like to contract your robot out, see the Robots Consultants page.
Useful Links

* Robot Spider Coding Checklist at SearchTools.com

* Bot2001
o In Search Of Search Bots by Brian Profitt
Describes a presentation by Sundar Kadayam, CTO of Intelliseek, on the nature of sophisticated search bots, thinking beyond simply gathering static data. Describes how an advanced metadata agent (such as Intelliseek's) works: selecting the best information sources, sending the query and receiving results, post-processing to organize the results, presenting them, and offering updates on the query in the future.

o BotSpot Feb. 14 2001 Newsletter
conference panel suggestions for learning to program robots

Perl

Harvest NG
The Gatherer module is the robot which follows the links
Combine Harvesting Robot
Powerful and flexible robot control
Libwww (Perl 5) and Libwww (perl 4)
Perl modules for accessing Web pages, including some examples of following links.
Agent Perl WebReview.com, August 29, 1997 by Ben Smith
Nice tutorial about writing a search indexing spider or robot using Libwww.
MOMspider (Multi-owner Maintenance Spider)
Designed for checking links on multiple servers.
WWW-Robot 0.021 (alternate 0.011 version)
Configurable web traversal engine

Java

Class Acme.Spider
A web-robot that performs a breadth-first crawl and returns URLConnections. Written by the inimitable Jef Poskanzer.


Writing a Web Crawler in the Java Programming Language Java Developer Connection, January 1998 by Muscle Fish developers
Describes an example program following links to get files, keeping track of those already found. Honors robots.txt. Source code available.

BDDBot
Java robot / search engine / web server

NQL (Network Query Language) Java version

SPHINX: A Framework for Creating Personal, Site-Specific Web Crawlers
Sophisticated article from WWW7 conference about the issues involved in robot crawling. The implementation is in WebSPHINX.

C and C++

W3C Webbot - Libwww Robot
HTTP robot source code in C based on "Libwww", primarily designed to test HTTP/1.1 pipelining, but usable for other purposes.

ht://Dig
Full-featured search engine in C++, contains a sophisticated robot.

SWISH-E
Another full search engine with a robot spider.

Pavuk
A program designed to copy entire sites by following links and gathering the pages. Implemented with an interface for Mac OS X Server as epicware WebGrabber.

Pre-emptive Multithreading Web Spider MFC Programmer's SourceBook article, June 21, 1998 by Sim Ayers
Tutorial article on making a spider in MFC with a lot of explanation.

Other

TkWWW Robot
Robot code in Tcl/Tk

Commercial Products

Tenmax Dataplex Robot
High capacity web spider can handle millions of pages per day, complex HTML and even JavaScript.

Checklist for Search Robot Crawling and Indexing

This document provides both technical information and some background and insight into what search engine indexing robots should expect to encounter. Technically, the problems arise from misunderstandings and exploitation of anomalies by HTML creators (direct tagging, WYSIWYG and automated systems), and the tendency of browser applications to be very forgiving in their interpretation of pages and links. Therefore, it's impossible to simply read the HTML and HTTP specifications and follow the rules there -- the real world is much messier than that.

Related page: Source code for Web Robot Spiders

Servers, Hosts and Domains


For best results, you should work with the servers and conform to their expectations, derived from the behavior of other search engines. But you'll also need to defend against some tricks that have been developed to improve rankings (search engine spam).
Virtual Hosts and Shadow Servers


Virtual hosts and virtual domains allow one server and IP address to act as though it is many servers. To access this, be sure to include the HTTP/1.1 "Host" field in all requests. Most web hosting services use these features to accommodate client hosts. In those cases, you should be sure to accept these URLs and index them using their own host name, although the IP address may be the same.

However, a few search engine spammers will create multiple hosts and even domains, and point them all at the same pages (occupying more of the desirable high rankings in the results). In addition, they may submit or create links to the IP address and even a hex or ten-digit version of the address. The alternate version is sometimes used to get around firewalls and proxies, but it's also used for search engine spam. You may want to do a random check on IP addresses and make sure that similar IP address pages are not just duplicating data, or at least design for future spam checks on this issue.
User-Agent
Your robot should always include a consistent HTTP Header "User-Agent" field with your spider's name, version information and contact information (either a web page or an email address). The spec says that you should put an email address in the "From" field as well.
Referer Field


Another helpful way of working with webmasters is to include the referring page in the HTTP header "Referer" field (yes, it's misspelled). This is the page containing the link you are following. You may want to add this only when you are doing a first crawl, so you don't have to include it in the index database. In any case, it will help webmasters who read their logs to locate bad links and generally understand what you are doing and how you got where you are.
Accept
The HTTP "Accept" field lets your request define the MIME types of files you want to see. Few robots really want audio files, telnet links, compressed binaries, conferencing and so on, but may accidentally follow those links. Using this header field lets the server return a 406 ("not acceptable") status code when the file requested is not one of the desired types, instead of wasting bandwidth and processing time on both server and client sides.
Robots.txt


First, read the Web Robots FAQ and the Guidelines for Robot Writers. They're old but still definitive. There are a couple of additional checks you may find useful:

* make sure you can read the robots.txt file whether the line-break characters are LF, CR or CR/LF.
* Assume that the web server is not case-sensitive (better to be conservative)
* some people forget to put a slash / to indicate the root directory: if you see a disallow without a slash, assume it starts at the root.

In general, you should check the robots.txt file before any indexing. If you are reading only a few pages a day, you could check less often, perhaps weekly.
META Robots Tag


In addition to the robots.txt file, the Robots Exclusion Protocol allows page creators to set up robot controls within the header of each page using the Robots META tag. Be sure you recognize these options:

* meta name="robots" content="noindex" do not index the contents of the page, but do follow the links.
* , meta name="robots" content="nofollow"you can index the page but do not follow the links
* meta name="robots" content="noindex,nofollow", you should neither index the contents nor follow the links.
* meta name="robots" content="index,follow", not required, default behavior.

Be sure that you can handle spaces between noindex and nofollow, capitalization variations, and even a different order (nofollow, noindex).

Do not mark these pages as off-limits forever: the settings may change, so you should check them again in future index updates.
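
A tiny Python sketch of such tolerant parsing, normalizing case, spacing, and token order (the function name is mine):

def parse_robots_meta(content):
    # "NoIndex , NOFOLLOW" and "nofollow,noindex" must behave identically.
    tokens = {t.strip().lower() for t in content.split(",")}
    may_index = "noindex" not in tokens
    may_follow = "nofollow" not in tokens
    return may_index, may_follow

assert parse_robots_meta("NOINDEX, NoFollow") == (False, False)
assert parse_robots_meta("index,follow") == (True, True)
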
Indexing Speed


Although many web servers are perfectly capable of responding to hundreds of requests a second, your indexing robot should be relatively conservative. I recommend that you do one page per ten seconds per site: it will be much faster than most of the other search engines (they tend to do very slow crawls these days). One per ten seconds will make sense to whoever is reviewing the web server log.

Avoid hammering servers! Web hosting providers may have multiple virtual hosts per IP address, and multiple IP addresses per physical machine. They are likely to disallow your robot via robots.txt or IP access controls if you overload them.
Server Mirroring and Clustering
Many sites are concerned about server overload, so they use multiple servers. In most cases, it won't matter to the robot, because the pages will be distributed and the links will remain static. However, DNS-based load-balancing servers switch among multiple hosts based on current load, so a robot could get to a page on www1.domain.com that is really the same as the page on www2.domain.com. I don't know of a good solution here, as the IP addresses are different. You may want to set up periodic checks on your index for large amounts of overlap.
Following Links


Simple HTML links are straightforward; however, there are many that are not simple at all.
HREF Links


Here's a checklist of things to watch out for in HREF links:

* Port numbers (<a href="http://www.domain.com:8010">). This is entirely legitimate and the robot should follow this.
* Anchor tags (<a href="page.html#section">). In this case, the robot should strip the text after the # and simply go to the page.
* Extra attributes such as mouseovers, which you should simply ignore:

<a href="folder/mainpage.html"
onMouseOver="display(6);display(8);self.status='status message'; return true"
onMouseOut="display(5);display(7);"><img src="image.gif" align="MIDDLE" border="0" width="24" height="23" naturalsizeflag="3" /></a>

* Quotes in URLs - some content-management programs will insert single or double quotes (', ") in a URL string. In general, stripping these characters should give you a working URL.

Relative Links


Absolute links (those starting with http://) are easy. Relative links are trickier: page creators often misunderstand them, content-management programs generate weird ones, and browsers go to great lengths to decode them.

The rules are described in RFC 1808 and RFC 2396: the scheme is essentially Unix path addressing, where the location starts from the current directory and goes down through child directories using slashes (/) and up through parent directories using two dots (..). A variation starts with a slash, meaning the root directory of the host. With these tools any page can refer to any other page on the site, but it can get confusing:

* Some content-management systems add a dot (.), meaning the current directory, to relative links, even though it's not necessary. Ignore it.
* Confused page creators can add too many parent directories, accidentally pointing above the root level of the host. Browsers compensate by stopping at the root, so your robot should do the same (the sketch after this list shows the standard resolution rules).
* Other bizarre combinations can occur, such as this one reported by a search engine administrator: <a href="foldername/http://mydomain">. Your robot should be robust enough to ignore these without problems.
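
You don't have to implement these rules by hand: Python's urljoin applies the RFC resolution algorithm, including clamping excess parent directories at the root the way browsers do (Python 3.6 and later follow RFC 3986 here):

    from urllib.parse import urljoin

    base = "http://www.example.com/subdir/page.html"
    urljoin(base, "other.html")        # http://www.example.com/subdir/other.html
    urljoin(base, "./other.html")      # same result: the leading dot is harmless
    urljoin(base, "../up.html")        # http://www.example.com/up.html
    urljoin(base, "../../../too-far")  # clamped: http://www.example.com/too-far
    urljoin(base, "/from-root.html")   # http://www.example.com/from-root.html
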

Capitalization and Case Sensitivity
Some web servers will match words in the path in either upper or lower case letters. Others require exact matches, so a link to www.example.com/SubDir/SpecialPage.html is different from www.example.com/subdir/specialpage.html. Therefore, your robot should store and request pages using the exact case of the original URL. Note that domain and host names are defined by the specification to be case-insensitive, so you don't have to worry about them. (Thanks to Peter Eriksen for reminding me of this problem.)
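
In code, that policy looks like this Python sketch: lowercase only the parts the specifications define as case-insensitive, and leave the path alone:

    from urllib.parse import urlsplit, urlunsplit

    def canonical_url(url):
        # Scheme and host are case-insensitive by definition; the path
        # may not be, so preserve its exact case.
        p = urlsplit(url)
        return urlunsplit((p.scheme.lower(), p.netloc.lower(),
                           p.path, p.query, p.fragment))

    canonical_url("HTTP://WWW.Example.COM/SubDir/SpecialPage.html")
    # -> 'http://www.example.com/SubDir/SpecialPage.html'
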
BASE Tag


Stored in the <head> section of an HTML document, the BASE tag defines an alternate starting location for relative links. So if the page is at www.example.com but includes <base href="http://www.example.com/subdir/index.html">, all relative links in this document should start from the directory subdir rather than the root.
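
A sketch of honoring the BASE tag when resolving relative links (the regular expression is deliberately simple; a real HTML parser is more robust):

    import re
    from urllib.parse import urljoin

    BASE_TAG = re.compile(r'<base\s+href\s*=\s*["\']([^"\']+)["\']',
                          re.IGNORECASE)

    def resolve_link(page_url, html_text, href):
        # Use the <base href> as the starting point if present,
        # otherwise the page's own URL.
        m = BASE_TAG.search(html_text)
        base = m.group(1) if m else page_url
        return urljoin(base, href)
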
Other Kinds of Links

Link Tag
The rarely used LINK tag, which can appear only in the HEAD section of the document, can point to alternate versions of a document, such as translations or printable versions. It includes the familiar HREF attribute plus metadata such as the title, MIME type, and relationship that you might find useful.
Links to Frames


The links to pages in framesets appear in src attributes, like this: <frame src="page.html">, within a larger <frameset> tag. Just treat the SRC as an HREF and you're fine. There may also be normal text, including tags, within the <noframes> tag.

Note that the linked framed page may be somewhat unintelligible by itself, as when the search engine returns a link in a results list, and the end-user clicks on that link. Smart page creators work around this with navigation and JavaScript, but there's not much a search engine can do about it without storing a lot of context information.
Image Links


In general, IMG SRC attributes do not contain links to indexable pages, so requesting them would waste processor power, bandwidth and time.

Client-side image maps are HTML coordinates in <area> tags within a larger <map> tag, using the familiar HREF format. All you have to do is ignore the shape and coords attributes.

Server-side image maps are special files that exist only on the server. They also map coordinates to URLs, but the mapping is harder to get at. You have three choices:

* ignore these links
* try sending all or a random set of coordinates to the server and see what happens (for example "http://www.acme.com/cgi-bin/competition?10,27")
* ask the server to send you the image map file and try to decode it yourself. For an example, see the NCSA tutorial.

Object Links
Objects are the generic class of which Images are specific instances, but they also include Java applets, graphical data and so on.
JavaScript Links


JavaScript can generate text using the commands document.write and document.writeln. If you see those commands, I'd recommend parsing through to the end of the JavaScript and extracting any links you can locate, probably indicated by HREF, http://, .html or .htm.

JavaScript is also used for menus and scrolling navigation links. In this example, the links live in option values, and an onChange handler assembles the destination URL dynamically:

"p"Choose a Page:"/p"
"form name="jsMenu""
"select name="select"
onChange= "if(options[selectedIndex].value)
window.location.href= (options[selectedIndex].value)" "
"option value="..index.html""Home"/option"
"option value="a.html""Page A"/option"
"option value="b.html""Page B"/option"
"option value="c.html""Page C"/option"
"/select"
"/form"

Again, the best that you're likely to be able to do is parse through this text and locate the ".html" (and other known page extensions).
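
A rough Python sketch of that scan; the extension list is illustrative, not exhaustive:

    import re

    JS_LINK = re.compile(
        r'["\'](https?://[^"\']+|[^"\']*\.(?:html?|shtml|asp|cfm)'
        r'(?:\?[^"\']*)?)["\']', re.IGNORECASE)

    def links_from_javascript(js_text):
        # Pull anything quoted that looks like a URL or a page name
        # out of JavaScript source (option values, document.write, etc.).
        return JS_LINK.findall(js_text)
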

Page Redirection
There are two ways that webmasters can indicate to the clients that a page has moved or is not really available at the original URL.
Server Redirects
Server redirects send back an HTTP status code of 301 (moved permanently) or 302 (moved temporarily), along with a Location field containing the new URL. The robot should use the new URL and, for a 301, avoid asking for the old one again.
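
With Python's urllib, for example, you could hook the redirect handler to record permanent moves so the crawl queue can be updated (a sketch; real crawlers also cap redirect chains):

    import urllib.request

    permanent_moves = {}  # old URL -> new URL

    class RecordingRedirectHandler(urllib.request.HTTPRedirectHandler):
        def redirect_request(self, req, fp, code, msg, headers, newurl):
            if code == 301:
                # Remember the move so we stop asking for the old URL.
                permanent_moves[req.full_url] = newurl
            return super().redirect_request(req, fp, code, msg,
                                            headers, newurl)

    opener = urllib.request.build_opener(RecordingRedirectHandler)
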
Meta Refresh


The META refresh tag is designed to update the contents of a page, perhaps after some processing or as part of a graphic interaction. Some sites use it to redirect clients from one page to another, although the HTML 4.01 spec says not to do this. However, it may be the only option for page creators without access to server redirects. The syntax is:

"meta equiv="Refresh" content="1; URL=page.html""

This slightly odd construction packs both the delay before the refresh and the link to the target page into one attribute; note that there are no quotes around the value after URL=, only at the end of the whole string. Your robot should follow this link, but it is debatable whether it should index the contents of the referring page; perhaps only if the page is over a certain length, or if the refresh interval is over a certain time. Some search engine optimization gateway pages use this technique to improve their rankings while still showing browser clients the full contents of their pages.
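
A sketch of recognizing the construction in Python (quoting and spacing vary in the wild, so a real parser should be even more forgiving):

    import re

    META_REFRESH = re.compile(
        r'<meta\s+http-equiv\s*=\s*["\']refresh["\']\s+'
        r'content\s*=\s*["\'](\d+)\s*;\s*url\s*=\s*([^"\']+)["\']',
        re.IGNORECASE)

    def meta_refresh_target(html_text):
        # Returns (delay_in_seconds, target_url), or None if the page
        # does not use a META refresh.
        m = META_REFRESH.search(html_text)
        return (int(m.group(1)), m.group(2).strip()) if m else None
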
Directory Listings
Some servers will display a listing of the contents of a directory rather than a web page (for example, Apache's mod_autoindex module). The robot should follow these links normally.
File Name Extensions


The robot should accept pages which end in ".txt" with a MIME type of text/plain -- they are almost always useful and worth indexing. However, if the page ends in ".log", it's probably a web server log file, which should not be indexed.

Most HTML pages with a MIME type of text/html which do not end in ".htm" or ".html" are dynamically generated by a script or program on the server. But they are generally straightforward HTML and should be indexed normally.

Common file name extensions include:

* .ssi and .shtml - Server-Side Includes
* .pl - Perl
* .cfm - Cold Fusion
* .asp - Active Server Pages
* .lasso - Lasso Web Application Server
* .nclk - NetCloak
* .xml - XML text files (MIME type text/xml, becoming increasingly important!)

Dynamic Data Issues
Dynamic HTML pages are generated by server applications when a specific URL is requested. There is no definitive way to know if a page is dynamically generated, but those with URLs including the characters ? (question mark), = (equals sign), & (ampersand), $ (dollar sign) and ; (semicolon) tend to be dynamic. While a few of these pages are rendered as Java applets or JavaScript, most are just HTML assembled on the fly and are easily indexable.
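
The heuristic is easy to state in code; this sketch simply checks for the telltale characters:

    DYNAMIC_HINTS = "?=&$;"

    def looks_dynamic(url):
        # URLs containing any of ? = & $ ; are usually assembled
        # on the fly by a server application.
        return any(ch in url for ch in DYNAMIC_HINTS)
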
Entry IDs
These IDs help web server analysts track the movement of clients through the site: they tend to be simple data at the end of the URL, such as WebMonkey's ?tw=frontdoor. The resulting URLs point to duplicate pages, however, so you may want to strip these suffixes when you recognize a pattern.
Session IDs
These are trickier: they are generated automatically when a client enters a site, to create a session or state on top of the stateless HTTP protocol. So every time a robot visits a page, it will see URLs that differ from those it has seen before, and it will attempt to follow those links and index those "new" pages. Session IDs (usually from ASP or Java servers) tend to include text such as $sessionID. If possible, your indexing robot should recognize this string and compensate for the apparent discrepancies.
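
One way to compensate is to strip known session and entry-ID parameters before comparing URLs. The parameter names below are illustrative placeholders, not a definitive list:

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    SESSION_PARAMS = {"sessionid", "$sessionid", "jsessionid", "tw"}

    def strip_session_params(url):
        # Drop session/entry-ID query parameters so the same page is
        # not treated as new on every visit.
        parts = urlsplit(url)
        kept = [(k, v) for k, v in parse_qsl(parts.query)
                if k.lower() not in SESSION_PARAMS]
        return urlunsplit((parts.scheme, parts.netloc, parts.path,
                           urlencode(kept), parts.fragment))
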
Cookies


Cookies are a much more sophisticated way to store state information about a client. For more information, see the Builder.com article. In general, if you store the cookies a site sends and return them on subsequent requests, your interaction with sites that use them will be smoother and more straightforward. Excalibur stores cookies and sends them back correctly.
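
In Python, for instance, a cookie-aware opener is a few lines:

    import http.cookiejar
    import urllib.request

    # Stores the cookies a site sets and sends them back on later
    # requests to that site.
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar))
    # opener.open("http://www.example.com/") now participates in cookies.
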
Domino URLs
Lotus Domino is a web server that generates multiple dynamic views of its pages so that users can show or hide parts of a page. For an example, see the Notes.net author index: clicking on any combination of the triangles generates a different URL and a different version of the same page. To avoid this, you may want to automatically ignore all URLs that contain but do not end in "OpenDocument" or "OpenView&CollapseView", and ignore all URLs that contain "ExpandView" or "OpenView&Start". Excalibur and Ultraseek have settings to do this automatically, and AltaVista Search includes examples of writing these rules.
Infinite Links
Infinite autolinks are often generated by server applications that simply respond to requests for linked pages. For example, the WebEvent calendar at UMBC provides links to the next month, the next year, and so on, more or less forever. A human will notice that there are no events scheduled for 2010, for example, but a robot will not: it will simply continue to follow links until stopped. You may want to put a limit on pages per top directory, limit the links followed from a single page, or use other throttling techniques to keep your robot under control.
Infinite Loops
Infinitely expanding loops generally occur when a server has a special error page (HTTP status 404, file not found) that itself contains relative links to other pages which are not found. This can create URLs that add directory names to the link forever: /error/error/error/error/error.html. I recommend evaluating URLs before following them, and never going more than three or four levels deep with the same directory name.
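
A sketch of that pre-follow check in Python:

    from urllib.parse import urlsplit

    def looks_like_loop(url, limit=3):
        # Flag URLs such as /error/error/error/... where the same
        # directory name repeats more than `limit` times in a row.
        segments = urlsplit(url).path.split("/")
        run = 1
        for prev, cur in zip(segments, segments[1:]):
            run = run + 1 if cur and cur == prev else 1
            if run > limit:
                return True
        return False
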
Default Pages


Sites will probably link to a single page both as a directory URL (www.example.com/dir/) and as a default name within the directory (www.example.com/dir/index.html). The server automatically serves the default name when the directory is requested. The most common names are:

* index.html or index.htm
* default.html or default.htm or default.asp
* main.html or main.htm
* home.html or home.htm

You may want to index these separately, or check for duplicates and delete one version or the other (a sketch of folding the two forms together follows).
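
A Python sketch of folding the two forms together before indexing, using the common names listed above:

    from urllib.parse import urlsplit, urlunsplit

    DEFAULT_NAMES = {"index.html", "index.htm", "default.html",
                     "default.htm", "default.asp", "main.html",
                     "main.htm", "home.html", "home.htm"}

    def fold_default_page(url):
        # Map /dir/index.html and friends onto /dir/ so both forms
        # of the same page collapse to one index entry.
        parts = urlsplit(url)
        head, _, leaf = parts.path.rpartition("/")
        if leaf.lower() in DEFAULT_NAMES:
            return urlunsplit((parts.scheme, parts.netloc, head + "/",
                               parts.query, parts.fragment))
        return url
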
Problem Link Status Codes
The HTTP/1.1 standard includes a set of status and error messages returned by web servers to clients, including browsers and robots. Your robot should recognize these codes and handle them correctly.
2xx: Successful


Web servers will reply with a 200 when they serve a normal page, or when the client sends an "If-Modified-Since" request and the page has been changed since the indicated date. Other 2xx codes can be safely treated the same way.
404: Page Not Found and 410: Gone
When a URL cannot be resolved, web servers are supposed to return a 404 code, indicating that there is no page at this address. This may be permanent or temporary: there's no way to be sure. Many search engines track how many times this code is returned, and purge the page from their system after three or four consecutive errors. If you see a 410 code, the page is gone and there is no way to get it, so you shouldn't try again.
Other 4xx Status Codes
These codes indicate problems with the URL or the request. I recommend that you track these URLs and retry them no more than once per month to once per year, unless a recrawl is specifically requested.
5xx Status Codes
These codes tend to indicate transitory problems, so you can retry them more often, perhaps once a day or once a week.
Updating the Index


To update the index, you should revisit the pages periodically to locate new and changed pages.

It's not clear how often is right. Many search engines track how often various sites and pages change, and revisit them according to their own internal schedules. Note that some servers pay transmission fees by the byte, especially those in low-technology regions, so constant revisiting can cost them significant amounts of money.
Expires


HTTP/1.1 includes an Expires field which tells you when the information on the page is no longer current. This is mainly used for caching but is also good for indexers, which can revisit the site on the expiration date.
If-Modified-Since
Servers using HTTP/1.1 (and extensions of HTTP/1.0) allow clients to send an If-Modified-Since field with a date and only get the contents of a page if the page is marked as modified after that date. If the page is older than the date, the server returns a status code of 304. This is quite efficient, reducing the CPU and network load on both server and client machines.
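
A minimal conditional-GET sketch in Python:

    import urllib.request
    import urllib.error

    def fetch_if_modified(url, last_seen_http_date):
        # Returns the page bytes, or None if the server answered 304
        # (unchanged since last_seen_http_date).
        req = urllib.request.Request(
            url, headers={"If-Modified-Since": last_seen_http_date})
        try:
            return urllib.request.urlopen(req, timeout=10).read()
        except urllib.error.HTTPError as err:
            if err.code == 304:
                return None
            raise
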
Modification Date Problems


Some servers, especially those sending dynamic data, always set the modification date to the current date and time. Some search engines therefore check each page against the contents of the index before adding it: if the page is identical to one already in the index, the new copy is considered a duplicate and is not added. Other engines simply re-index the page with the new date.

Note: many pages change only in the content of the links to advertising banners. To avoid excess index updating load, the robot could ignore offsite link changes in IMG tags when computing a checksum.


Indexing Pages

Indexing is not just about following links; it's also about understanding HTML pages and making good decisions based on common practices on the Web.

Attributes
In addition to indexing plain text surrounded by tags, search engines should generally index certain tag attribute text. The most important is the ALT attribute of the IMG tag, which contains a textual description of the image linked into the page: an excellent piece of additional data. Much rarer, but still useful, is the LONGDESC attribute, a link to a page containing more information about the image or object; you may want to index that page separately, or incorporate it into the contents of the linking page.
JavaScript Text


JavaScript text contains not only links but also textual content. In general, you should look for the JavaScript commands document.write or document.writeln and then look for text in single or double quotes. Note that the backslash (\) escapes quote marks so they are treated as literals rather than as begin and end points. For more information see ProjectCool's JavaScript Structure Guidelines.

Note that some search engines index whole JavaScripts, without doing any parsing, which includes a lot of junk.
XML
You can index XML by simply indexing the data between the tags. It's not wonderful, but it's a lot better than ignoring it. In future versions, tracking the fields and hierarchy will make searching much more interesting.
Style Sheets
Style sheets (most commonly CSS, Cascading Style Sheets) let web designers make their pages look nice. Indexers should ignore them, but some do not: never follow links to pages ending in .css, and never index or extract text that appears within a page's <style>...</style> tags.
NOINDEX tag
There is no standard way for page creators to tell search engines not to index parts of pages, such as navigation and copyright text. Therefore, I recommend that you recognize a pseudo-tag by ignoring all text within the comments <!-- noindex --> and <!-- /noindex -->, and that you write your indexer so it's easy to add other tags or markers for text that should not be indexed.
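
A sketch of recognizing that pseudo-tag in Python:

    import re

    NOINDEX_BLOCK = re.compile(
        r'<!--\s*noindex\s*-->.*?<!--\s*/noindex\s*-->',
        re.IGNORECASE | re.DOTALL)

    def strip_noindex(html_text):
        # Remove everything between <!-- noindex --> and
        # <!-- /noindex --> before indexing.
        return NOINDEX_BLOCK.sub("", html_text)
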
HTML extended characters
You should recognize and handle HTML extended characters (for example, &nbsp; for a non-breaking space and &amp; for an ampersand), both when indexing and when storing a page extract for later display in the results listing.
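
In Python, the standard library already knows the entity table:

    import html

    text = html.unescape("Fish&nbsp;&amp;&nbsp;Chips")
    # -> "Fish & Chips", with U+00A0 non-breaking spaces
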
Handling Multiple Languages


Indexing text works fairly well even if you don't know much about the language. The more information you can add regarding word breaks and punctuation, the better it works, so consider designing your system in ways that allow future extensions in this area.

HTML gives page creators a way to specify the character encoding, via the Content-Type field (in the HTTP header or the equivalent META tag).
Unicode
HTML and XML support Unicode, the almost-universal encoding system. Your indexer should do the same.
Language Codes
The language can be set for a whole page (<html lang="de">) or for blocks within a page (<p lang="es">). This is rare but should be recognized when found. For more information, see the HTML spec's Language section.
Metadata
In the HTML context, metadata is text in the META tags in the "head" section of the HTML page.
META ROBOTS tags


As described in the META Robots Tag section above, the Robots Exclusion Protocol also lets page creators set robot controls within the header of each page, using noindex and nofollow directives; be sure you recognize all the variations listed there.

I'd recommend re-fetching these pages and checking them in future indexing crawls, however, because the settings may change. For more information, see the HTML Author's Guide to the Robots META tag.
Keywords


Page creators can add keywords to their pages to describe them more clearly, and these should be indexed together with the text of the page. For example, a page on cats and dogs as pets might have the following keywords:

"META NAME="keywords" CONTENT="cats, dogs, small animals, pets,
companions, chat, catz, dogz""

Some keyword lists are delimited by spaces, some by commas, and some by both; extra white space should be ignored. Where the creator has used commas, I recommend treating each entry as a phrase. You can also use this information to help weight the page in the results rankings, but beware of keyword spamming (many repetitions of a word).
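
A sketch of that splitting rule:

    def parse_keywords(content):
        # Comma-delimited entries are treated as phrases; otherwise
        # fall back to whitespace-separated words.
        if "," in content:
            return [p.strip() for p in content.split(",") if p.strip()]
        return content.split()

    parse_keywords("cats, dogs, small animals, pets")
    # -> ['cats', 'dogs', 'small animals', 'pets']
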
Description
Page creators can also provide a description of their page, so you don't have to extract one. You should include the text of this description in the index.

"meta name="description" content="All about living with cats and dogs.""

Publication Dates
Web page publication dates are currently derived from the page modification date. This is often wrong, since a page can be opened and saved without any changes. Future systems will store a publication date in the metadata, so an indexer should be ready to read the date from the contents of the page when that system becomes standardized.
Dublin Core


The Dublin Core initiative is extending the META tags to standardize information on authorship, publication, scope, rights and other information. For example, the SearchTools home page could be indexed with the following tags:

"meta name="DC.Title" content="Guide to Site, Intranet and Portal Search""
"meta name="DC.Creator" content="Avi Rappoport""
"meta name="DC.Publisher" content="Search Tools Consulting""
"meta name="DC.Language" content="en""

While you may not want to index these tags now, be sure to design for future compatibility.
Page Descriptions


Most search services display some text from the matched pages, allowing searchers to learn something about the contents before they click on the link. This can come from one or more of several sources:

* the contents of the Meta Description tag, written by the page creator
* matching lines: sentences which contain the search terms
* first useful text: avoiding navigation information by locating the first header tag and extracting text starting there
* summary text: uses a special formula designed to find the sentences which best summarize the page content
* top text: text from the top of the page

The first three options are the best: search engines which attempt to find the "best" sentences usually fail; those which extract the first text from the page often display useless navigation information or even JavaScript or CSS.

How To Handle Redirecting default.asp in IIS? Duplicate Content

A Google Groups thread has discussion from SEOs and a Googler on removing default.asp from your web site through a redirect method on an IIS server.

What is the typical issue with IIS servers and redirecting? As the thread creator said:

I just realized that I have a different page rank for www.mywebsite.com and for www.mywebsite.com/default.asp. I would like to combine them into only one: www.mywebsite.com

From my default.asp page (which is my default page, of course...), do you know a way to make a 301 redirect to the root / without creating an endless loop?

John Honeck has tips at his blog post named 301 Redirects in ASP on an IIS Server. Does this answer all the questions? The forum discussion makes it sound like it does not.

Googler JohnMu called this the "big issue with IIS". John suspects the "newest version of IIS can handle things a bit better," but he said that most hosting companies are not yet running that version. So what options do you have?

John explains that using sessions to manage this won't work for search engines, because spiders don't handle sessions well. Thus if a spider finds it, it will just run into an infinite loop on your site, and that can be bad for many reasons.

John recommends the following:

The best solution is to make sure that there is absolutely no mention of "/default.asp(x)" on your site, instead only mentions of "/". You can confirm this by using a crawler such as Xenu's Link Sleuth. However, take care that you do not use forms anywhere, because they will generally return their results back to the file ("/default.asp(x)").

For more details on this issue and troubleshooting it for your site, I highly recommend you read the whole thread.

5 Tools for On-page Image Usage Analysis

Image optimization is vital both for “search engine friendliness” and for web accessibility. Let’s look at a few top tools that can help you analyze both aspects of proper image usage:

3 out of 5 Juicy Studio Image Analyser is a handy online tool that will look at each image on a given page and evaluate the following parameters:

  • image width / height;
  • alternative text;
  • a URL to the image’s long description.

Juicy Studio: Image Analyser

Note that some of the “errors” found by the tool do not necessarily need to be corrected (e.g. an image very seldom needs a long-description URL), so use it for informational purposes rather than as a call to action.

3 out of 5 Alt Text Checker (by Durham University) will list the alt text next to each image found on the page:

Information Technology Service : Alt Text Checker - Durham University

4 out of 5 Page Size Extractor will give you a quick idea of how a page’s images influence its size, and hence its load time, by reporting:

  • total number of on-page images;
  • the largest image size;
  • the total image size.

Page Size Extractor - Image size analyzer

Web Developer Toolbar for Firefox offers an array of image analysis tools:

  • display alt attributes;
  • display image dimensions;
  • display image sizes;
  • display image paths;
  • find broken images;
  • outline images missing alt attributes;
  • hide images / background images;

Web Developer Toolbar - Image Analyzer

Firefox Accessibility Extension offers a most useful feature that summarizes all page images in a handy table (the feature can be found under “Text equivalents” => “List of images”). The table is extremely easy to use because (1) it highlights the problematic images and (2) it can be sorted by any of the following parameters:

  • Image alt text;
  • Image source link;
  • Image width;
  • Image height;

Accessibility Extension

Beyond Link Building Tools

How did folks build links before tools were available? Just a few years ago, there was no way to identify hubs, authorities, vortals, or spokes, rims, chutes and ladders. (Click here for the full effect of that sentence via a ten second mp3 audio message.) When I wander around conference expo halls, I see booth after booth of tools that generate reams and reams of data. But I never see a booth with a person offering to show you the exact process by which you can obtain a single high-value link that’s been identified as being perfect for your particular content. Where’s that booth? Where is the speaker who will go beyond the tools and get to the heart of what’s important about finding links? Where is the service with the slogan, “Wicked Cool - Not Actionable”? Finally, where is the person who will rise above it all and remind everyone that tools are dead weight when it comes time to finally, y’know, make something real happen?

To put this in more actionable terms: if you pull competitor linking data and find out one of your competitors has 288 backlinks with two-word anchor text from PageRank 3 or better sites, and you have 281, then all you need to do is get yourself eight more links and your problems are over, right? Right? Or wait, maybe they don’t have any keyword anchor backlinks from PageRank 5 or better sites. All you need is one of those to win, right? Or maybe you notice they have 22 backlinks without any anchor text from .edu’s, whereas you only have 21. Could it really be that your path to winning the rankings game is as simple as two more .edu links? And if it is, what technique are you going to use to actually get those links? Real or fake? Hmmmm. Your sudden interest in offering a university discount program reeks of paid links in disguise. Don’t laugh, it’s been done to death already. Have a look. And if this technique did work, is it sustainable? Do you want your rankings to be based on a technique you’d never thought of until two minutes ago, one based on a perceived white hat loophole?

Speaking of technique, it’s important to note that if the doctor doing your open heart surgery is using a HeartFixerPRO3000 but his name is Homer J. Simpson, the tool just doesn’t matter.

On another topic, when another site links to a page on your site, there are four ways you can know this happened. No more, no less.

  1. The link is clicked and thus a trail is left in your server logs.
  2. The link has not been clicked but spiders have crawled the page on which that link lives, meaning link searches, alertbots, and curious competitors can find it.
  3. The link has not been clicked or crawled, but the person putting the link in place tells you about it.
  4. The link has not been clicked or crawled, but you happen to find it.

That’s it. There is no way any site owner can ever know with 100% accuracy exactly how many links there are from other sites pointing to his site.

I made a totally unplanned comment at SMX East that resulted in questions. My last column, The Great Link Race Has Begun, But To Where?, was my attempt to explain what I meant, using identical quintuplets as a metaphor. Here’s a different example. You know how, on busy highways, McDonald’s, Burger King, Taco Bell, Sonic, and sometimes Wendy’s and KFC can all be found within blocks of each other? They all seem to do fine. Near my house, five have coexisted for over 15 years, so they must be making money. But you never see two McDonald’s literally right next door to each other, even though they could be. Why? There’s certainly plenty of money being spent at other places. More to the point, if you and I decided we too wanted a piece of the fast food action (or search result action) on that busy highway, what would we have to do to succeed?

8 Social Media Sites for Local Networking

Should small/local businesses bother with social media? Or is local search where it’s at — targeting potential customers in their own cities and towns?

The good news is that small business owners don’t have to choose one or the other. Listed below are eight sites at the intersection of social media and local search, the places where these two roads meet. When you get there, keep in mind that it’s not about sales pitches and spamming; it’s about making real connections with other human beings. Think of it as the online version of going to a chamber of commerce meeting. That’s the first piece of offline marketing advice you were given; here’s how to take the same idea online.

Social Media & Local Search: Where Two Roads Meet

1. Flickr

You may think of Flickr as a photo storage/sharing site, but the heart of Flickr is its groups. Flickr has tens of thousands — maybe hundreds of thousands — of groups, and many groups are local in nature. These groups offer a great opportunity to connect with your neighbors, potential customers who might be interested in your products/services.

If you’re a small business owner in Columbus, Ohio, for example, you might want to join this Flickr Meetup - Columbus Ohio group.

flickr screenshot

This group has 624 members at the moment; no doubt some of those folks are inactive, but it’s still a great way to connect with local residents. These members are uploading local photos, of course, but they’re also talking about local events, local news, and local businesses in the group’s discussion board. Have a look:

flickr screenshot

Note that all of these discussions are active, with posts within the past two days. And if you owned an independent hotel in Columbus, or maybe a Bed & Breakfast, wouldn’t you like to answer the question from the user who started the “any suggestions for a hotel?” thread? That’s a gift-wrapped opportunity to start a conversation with a potential customer!

Visit the Flickr Groups section and do a search for your city. You’ll probably find an overwhelming number of matches (there are almost 600 for Columbus, Ohio). Look for groups with recently active discussions — that’s the most important thing. The number of members is important, but not as important as joining an active group where your neighbors are talking.

2. Facebook

When you join Facebook, you have to list a hometown. Facebook automatically puts you in a “network” with everyone else in your hometown. That’s the good news. The bad news? It used to be easy to browse through your local network to meet neighbors, but now it’s a chore. Still, Facebook is such a popular site, it’s probably worth the effort to try to make local connections, even with the added hassles.

Click on the “Settings” button at the top of the screen, then choose “Account Settings.” Then click the “Networks” tab. This will tell you how many people are in the local network, and you can click to browse the network membership. Have a look at my Tri-Cities network:

facebook screenshot

You can use the panel on the right to further sort the members of your network; if you’re looking to connect with adult males (because you own a fishing shop), you can do that.

3. StumbleUpon

As with Facebook, you have to list a hometown when you join StumbleUpon. The cool thing is that, on your profile page, your hometown shows up as a clickable link that lists all StumbleUpon users from your hometown. If you lived in Seattle, you’d end up on a page like this:

stumbleupon screenshot

Note, too, that you can also separate users by gender on StumbleUpon (in case you own a women’s clothing boutique, for example, and only want to connect with women).

4. Twitter

Unlike some of the other social media sites listed here, Twitter is all about conversations, making it possibly the best place to reach out and find people in your neighborhood to connect with. Twitter’s advanced search page includes a geo-search option. Give it a city and state (or a zip code), and get back a list of messages (“tweets”) from people in that area. Have a look:

twitter screenshot

TwitterLocal offers a similar service, but it doesn’t appear to update nearly as quickly as Twitter’s own advanced search.

5. Yahoo Answers

Yahoo Answers can be a productive marketing tool for service-oriented businesses, or for anyone whose knowledge and expertise is a primary selling point. But the site gets so much traffic that it can be overwhelming for a local business. Fortunately, a local business owner can bypass most of that and get right into the Q&A that matters — the stuff about your hometown. If you’re a photographer in Atlanta, you might have something to say about the top question in the Atlanta Q&A section:

yahoo answers screenshot

To find the local sections of Yahoo Answers, look for the “Local Businesses” category on the home page, and then drill down until you find the right city for you. The drawback here is that only major cities are covered with specific categories.

6. outside.in

You’ve probably heard about the benefits of reaching out and starting relationships with bloggers. outside.in is one of two sites I’ll mention that can help you find local bloggers. outside.in is a content aggregator; they show content from both traditional media and blogs.

outside.in screenshot

You can’t contact the local bloggers directly through outside.in; it’s just for locating them. You’ll want to visit the local blogs you find, start reading them regularly, leave quality comments, and eventually introduce yourself and start that relationship.

7. Placeblogger

Placeblogger is a simple directory for local blogs. If you’re a local business owner in Houston, and you’re looking for local blogs, the Houston directory page is the place to go:

placeblogger screenshot

You’ll want to research the blogs listed to make sure they’re still active, then follow the steps mentioned above in the outside.in section.

8. LinkedIn

LinkedIn is a great social site for business professionals; it’s not a place to sell products, but you may be able to connect with people looking for your area of expertise. The advanced search page lets you look for other members in your area. Here’s a search for people in the computer repair industry in Los Angeles:

linkedin screenshot

The ability to drill down to find people in your area and industry can help you find new business partners, employees, and other opportunities you may not know of yet.

Final Thoughts

Many small/local business owners think social media is a waste of time. And for some, jumping into social media without looking for the local angle would indeed be a waste of time. But the sites above sit at the intersection of social media and local search. They offer something almost any small/local business should be interested in: the opportunity to find and connect with potential customers in their hometown.

No matter which social media site fits best, it’s important to get involved without an eye toward sales pitches and spam; no one likes that. The idea is to connect with local people, not to alienate them. Focus not on what you can get from the community you join, but on what you can give. That’s the best recipe for local-social success.