URLs are simple things. Or so you'd think. Let's say you wanted to detect a URL in a block of text and convert it into a bona fide hyperlink. No problem, right?
Visit my website at http://www.example.com, it's awesome!
To locate the URL in the above text, a simple regular expression should suffice -- we'll look for a string at a word boundary beginning with http://, followed by one or more non-space characters:
\bhttp://[^\s]+
Piece of cake -- and it seems to work. Plenty of forum and discussion software out there auto-links using exactly this approach. But while it mostly works, it's far from perfect. What if the text block looked like this?
My website (http://www.example.com) is awesome.
This URL will be incorrectly hyperlinked, with the final paren included. This, by the way, is an extremely common way for average, everyday users to include URLs in their text.
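To see the failure concretely, here's a quick sketch (in Python for illustration, though the snippet later in this article is C#) running the naive pattern against both sample sentences:

```python
import re

# the naive pattern: word boundary, http://, then any run of non-space characters
NAIVE_URL = re.compile(r"\bhttp://\S+")

print(NAIVE_URL.search("Visit my website at http://www.example.com, it's awesome!").group())
# -> http://www.example.com,   (trailing comma swallowed)

print(NAIVE_URL.search("My website (http://www.example.com) is awesome.").group())
# -> http://www.example.com)   (trailing paren swallowed)
```

Because `\S+` greedily eats every non-space character, any punctuation touching the end of the URL gets swept into the hyperlink.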
What's truly aggravating is that parens in URLs are perfectly legal. They're part of the spec and everything:
only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL.
Certain sites, most notably Wikipedia and MSDN, love to generate URLs with parens. The sites are lousy with the damn things:
http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software)
http://msdn.microsoft.com/en-us/library/aa752574(VS.85).aspx
URLs with actual parens in them mean we can't take the easy way out and simply ignore the final paren. You could force users to escape the parens, but that's sort of draconian, and it's a little unreasonable to expect your users to know how to escape characters in a URL:
http://en.wikipedia.org/wiki/PC_Tools_%28Central_Point_Software%29
http://msdn.microsoft.com/en-us/library/aa752574%28VS.85%29.aspx
To detect URLs correctly in most cases, you have to come up with something more sophisticated. Granted, this isn't the toughest problem in computer science, but it's one that many coders get wrong -- even coders with years of experience, like, say, Paul Graham.
If we're more clever in constructing the regular expression, we can do a better job.
\(?\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]
1. The primary improvement here is that we accept only a whitelist of known good URL characters. Allowing arbitrary characters in URLs is setting yourself up for XSS exploits -- I can tell you that from personal experience. Don't do it!
2. We only allow certain characters to "end" the URL. Common trailing punctuation (period, exclamation point, comma, semicolon) is treated as end-of-sentence punctuation rather than part of the URL, and is excluded from the hyperlink.
3. Parens, if present, are allowed in the URL -- and we absorb the leading paren, if it is there, too.
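A quick check (again in Python, purely for illustration) confirms that the improved pattern behaves as described:

```python
import re

# the whitelist pattern from above: optional leading paren, http://, a run of
# allowed URL characters, ending only on a character that can't be sentence punctuation
GOOD_URL = re.compile(r"\(?\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]")

# trailing comma is no longer swallowed...
print(GOOD_URL.search("Visit http://www.example.com, it's awesome!").group())
# -> http://www.example.com

# ...but legitimate parens inside the URL are kept
print(GOOD_URL.search("See http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software) now.").group())
# -> http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software)
```

The trick is in the final character class: the greedy middle class can consume punctuation, but the regex engine backtracks until the last character is one that can legitimately end a URL.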
I couldn't come up with a way for the regex alone to distinguish between URLs that legitimately end in parens (a la Wikipedia) and URLs that the user has enclosed in parens. Thus, there has to be a bit of postfix code to detect and discard user-enclosed parens from the matched URLs:
// if the entire match is wrapped in parens, the user (not the URL)
// supplied them -- strip the outer pair
if (s.StartsWith("(") && s.EndsWith(")"))
{
    return s.Substring(1, s.Length - 2);
}
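Putting the pieces together, an end-to-end extraction pass might look like the following sketch (in Python for brevity; `extract_urls` is a name invented here, not from the article):

```python
import re

# whitelist pattern from above
URL_PATTERN = re.compile(r"\(?\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]")

def extract_urls(text):
    """Find URLs, stripping parens only when the user wrapped the whole URL in them."""
    urls = []
    for match in URL_PATTERN.finditer(text):
        s = match.group()
        # user-enclosed parens: the match both starts and ends with one
        if s.startswith("(") and s.endswith(")"):
            s = s[1:-1]
        urls.append(s)
    return urls

print(extract_urls("My site (http://www.example.com) is awesome."))
# -> ['http://www.example.com']
print(extract_urls("See http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software)!"))
# -> ['http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software)']
```

The Wikipedia-style URL keeps its parens because only a match that both starts and ends with one is treated as user-enclosed.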
That's a whole lot of extra work, just because the URL spec allows parens. We can't fix Wikipedia or MSDN and we certainly can't change the URL spec. But we can ensure that our websites avoid becoming part of the problem. Avoid using parens (or any unusual characters, for that matter) in URLs you create. They're annoying to use, and rarely handled correctly by auto-linking code.
The Problem With URLs
Sunday, November 2, 2008 at 9:34 PM Posted by Vasu
Labels: SEO
October 2008 Web Server Survey
at 8:30 PM Posted by Vasu
In the October 2008 survey we received responses from 182,226,259 sites, which reflects growth of 948 thousand since last month.
Apache once again shows the largest growth, gaining 463 thousand sites this month. ThePlanet.com gains 1.3 million sites this month — nearly all of which are running on Apache — but this includes a large number of 'link farm' sites that use .pl domains to propagate search terms using pornographic phrases.
Google shows the next largest growth and boosts its total by 411 thousand sites. Google now runs 10.5 million sites on its own webserver software, which is used to host its own services in addition to user-generated applications and blogs. Some server names include:
* GFE/1.3, which is used by Google's Blogger service to publish third party blogs under the blogspot.com domain, and spreadsheets and other documents under docs.google.com.
* GWS-GRFE/0.50, which runs Google Groups.
* gws. This simple, lowercase name is used by Google's main search site at google.com and Google Image Search.
* Google Frontend, which is used to run third party applications on Google App Engine (often using the appspot.com domain) and Google Mashups.
[Graph: Total Sites Across All Domains, August 1995 - October 2008]
[Graph: Market Share for Top Servers Across All Domains, August 1995 - October 2008]
Top Developers
Developer    September 2008   Percent   October 2008   Percent   Change
Apache           91,425,295    50.43%     91,888,508    50.43%    -0.01
Microsoft        62,374,823    34.41%     62,766,928    34.44%     0.04
Google           10,076,405     5.56%     10,487,607     5.76%     0.20
lighttpd          3,095,928     1.71%      3,072,457     1.69%    -0.02
Active Sites
Developer    September 2008   Percent   October 2008   Percent   Change
Apache           33,719,369    46.75%     33,310,242    46.26%    -0.49
Microsoft        25,155,273    34.88%     25,594,704    35.55%     0.67
Google            7,714,617    10.70%      7,645,615    10.62%    -0.08
lighttpd            144,499     0.20%       134,161      0.19%    -0.01
[Graph: Totals for Active Servers Across All Domains, June 2000 - October 2008]