Duplicate content

Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin. Examples of non-malicious duplicate content could include:

  • Discussion forums that can generate both regular and stripped-down pages targeted at mobile devices
  • Store items shown or linked via multiple distinct URLs
  • Printer-only versions of web pages
However, in some cases, content is deliberately duplicated across domains in an attempt to manipulate search engine rankings or win more traffic. Deceptive practices like this can result in a poor user experience, when a visitor sees substantially the same content repeated within a set of search results.
Google tries hard to index and show pages with distinct information. This filtering means, for instance, that if your site has a "regular" and "printer" version of each article, and neither of these is blocked in robots.txt or with a noindex meta tag, we'll choose one of them to list. In the rare cases in which Google perceives that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we'll also make appropriate adjustments in the indexing and ranking of the sites involved. As a result, the ranking of the site may suffer, or the site might be removed entirely from the Google index, in which case it will no longer appear in search results.
There are some steps you can take to proactively address duplicate content issues, and ensure that visitors see the content you want them to.
  • Consider blocking pages from indexing: Rather than letting Google's algorithms determine the "best" version of a document, you may wish to help guide us to your preferred version. For instance, if you don't want us to index the printer versions of your site's articles, disallow those directories or make use of wildcard pattern matching in your robots.txt file.
  • Use 301s: If you've restructured your site, use 301 redirects ("RedirectPermanent") to smartly redirect users, Googlebot, and other spiders. (In Apache, you can do this with an .htaccess file; in IIS, you can do this through the administrative console.)
  • Be consistent: Try to keep your internal linking consistent. For example, don't link to http://www.example.com/page/ and http://www.example.com/page and http://www.example.com/page/index.htm.
  • Use top-level domains: To help us serve the most appropriate version of a document, use top-level domains whenever possible to handle country-specific content. We're more likely to know that www.example.de contains Germany-focused content, for instance, than www.example.com/de or de.example.com.
  • Syndicate carefully: If you syndicate your content on other sites, Google will always show the version we think is most appropriate for users in each given search, which may or may not be the version you'd prefer. However, it is helpful to ensure that each site on which your content is syndicated includes a link back to your original article. You can also ask those who use your syndicated material to block the version on their sites with robots.txt.
  • Use Webmaster Tools to tell us how you prefer your site to be indexed: You can tell Google your preferred domain (for example, http://www.example.com or http://example.com).
  • Minimize boilerplate repetition: For instance, instead of including lengthy copyright text on the bottom of every page, include a very brief summary and then link to a page with more details.
  • Avoid publishing stubs: Users don't like seeing "empty" pages, so avoid placeholders where possible. For example, don't publish pages for which you don't yet have real content. If you do create placeholder pages, use robots.txt to block these from being crawled.
  • Understand your content management system: Make sure you're familiar with how content is displayed on your web site. Blogs, forums, and related systems often show the same content in multiple formats. For example, a blog entry may appear on the home page of a blog, in an archive page, and in a page of other entries with the same label.
  • Minimize similar content: If you have many pages that are similar, consider expanding each page or consolidating the pages into one. For instance, if you have a travel site with separate pages for two cities, but the same information on both pages, you could either merge the pages into one page about both cities or you could expand each page to contain unique content about each city.
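As a rough illustration of the robots.txt advice above, here is a minimal Python sketch that checks whether a printer version would be blocked. The /print/ directory name and the example URLs are assumptions made for the example, and the standard-library parser used here only approximates how Googlebot reads the file (it does not handle wildcard patterns).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the /print/ directory is an assumed layout
# for the printer-only versions of each article.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


regular = is_crawlable(ROBOTS_TXT, "http://www.example.com/articles/widgets.html")
printer = is_crawlable(ROBOTS_TXT, "http://www.example.com/print/widgets.html")
print(regular, printer)
```

With these assumed rules, the regular article stays crawlable while the printer copy is blocked, so only one version is left to be listed.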
Duplicate content on a site is not grounds for action on that site unless it appears that the intent of the duplicate content is to be deceptive and manipulate search engine results. If your site suffers from duplicate content issues, and you don't follow the advice listed above, we do a good job of choosing a version of the content to show in our search results.
However, if our review indicated that you engaged in deceptive practices and your site has been removed from our search results, review your site carefully and consult our webmaster guidelines for more information. Once you've made your changes and are confident that your site no longer violates our guidelines, submit your site for reconsideration.
If you find that another site is duplicating your content by scraping (misappropriating and republishing) it, it's unlikely that this will negatively impact your site's ranking in Google search results pages. If you do spot a case that's particularly frustrating, you are welcome to file a DMCA request to claim ownership of the content and request removal of the other site from Google's index.

Google Offers SEO Starter Guide

Google is getting into the SEO consulting business. Well, not quite. But, Google is now formally offering an “SEO Starter Guide” with practical advice for webmasters about improving search engine visibility and increasing traffic to a web site.

It comes in the form of a 22-page PDF announced today on the Webmaster Central blog and at the PubCon show in Las Vegas. According to the Google reps at PubCon, this is the same guide Google uses internally for its own sites (YouTube, etc.).

The guide is well written and geared toward webmasters and business owners who need a basic training in SEO. Topics covered include:

* Page Titles
* Description Meta Tag
* URL Structure
* Site Navigation
* Creating Quality Content
* Anchor Text
* Heading Tags (H1s, H2s, etc.)
* Image Optimization
* Robots.txt
* Rel="nofollow"
* Website Promotion
* Webmaster Tools
* Analytics
* More Resources

As a longtime SEO, I'm most struck by the fact that an entire section of the guide is devoted to how to use the rel="nofollow" attribute on individual links. Many in the SEO industry have thought this attribute is a red flag, something that tells Google that a professional SEO has been tweaking the page, and not something that an average webmaster would even know about. That's clearly not the case anymore; rel="nofollow" is more mainstream now thanks to a full page of explanation in this SEO guide. Have a look:

[Image: Google Explains Rel=Nofollow in SEO Guide]

There are one or two other minor issues that some SEO consultants might quibble about (like the recommendation to use an XML Sitemap, something that many SEOs feel is unnecessary for most websites). But on the whole, it’s an excellent document covering many of the things you’d want in an SEO 101-type guide.

What do the robots.txt file analysis results mean?

When you test a URL against a robots.txt file, you will see one of the following results:

* Allowed— Googlebot will crawl the URL.
* Blocked— Googlebot will not crawl the URL.
* Not in domain— This URL is not on the same domain as the robots.txt file and therefore, you cannot block it.
* Syntax not understood— Googlebot does not recognize this as a valid URL.

Additionally you may see the following message:

* Detected as a directory; specific files may have different restrictions— Although this directory is blocked or allowed, there may be other, more specific rules in the file that block or allow URLs in the directory, so you will want to check those as well.

If Googlebot has difficulty understanding parts of your robots.txt file, you will see one of the following parsing results, which you will want to fix:

* Accepted, but should be Disallow— You misspelled "Disallow."
* Accepted, but should be User-agent— You misspelled "user-agent."
* Accepted, but correct syntax includes a colon (Rule: path)— You forgot to put a colon between "Allow" or "Disallow" and the path.
* Rule ignored by Googlebot— This is not a rule that Googlebot follows (for example, "Crawl-delay").
* No user-agent specified— You have rules that aren't associated with a user-agent.
* Syntax not understood— Googlebot does not understand this line.
* robots.txt file does not appear to be valid— Googlebot doesn't understand any parts of this file and therefore doesn't recognize it as a valid robots.txt file.
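
To make the parsing results above more concrete, here is a small Python sketch that flags similar problems in a robots.txt file. It is only an approximation written for illustration, not Googlebot's actual parser: the function name, the spelling-similarity cutoff, and the decision to accept "Sitemap" lines silently are all assumptions.

```python
import difflib

CORRECT_SPELLING = {"user-agent": "User-agent", "disallow": "Disallow"}
IGNORED_FIELDS = {"crawl-delay"}  # rules the list above says Googlebot ignores


def lint_robots_txt(text):
    """Return (line_number, message) pairs for lines that look problematic."""
    findings = []
    have_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            first_word = line.split()[0].lower()
            if first_word in ("allow", "disallow"):
                findings.append(
                    (number, "Accepted, but correct syntax includes a colon (Rule: path)"))
            else:
                findings.append((number, "Syntax not understood"))
            continue
        field = line.split(":", 1)[0].strip().lower()
        if field == "user-agent":
            have_user_agent = True
        elif field == "sitemap":
            continue  # assumption: accepted without comment
        elif field in IGNORED_FIELDS:
            findings.append((number, "Rule ignored by Googlebot"))
        elif field in ("allow", "disallow"):
            if not have_user_agent:
                findings.append((number, "No user-agent specified"))
        else:
            # Assumed heuristic: treat near-misses as misspellings.
            guess = difflib.get_close_matches(field, list(CORRECT_SPELLING), n=1, cutoff=0.8)
            if guess:
                if guess[0] == "user-agent":
                    have_user_agent = True
                findings.append(
                    (number, "Accepted, but should be " + CORRECT_SPELLING[guess[0]]))
            else:
                findings.append((number, "Syntax not understood"))
    return findings


SAMPLE = "User-agent: *\nDisalow: /tmp/\nDisallow /private/\nCrawl-delay: 10\nfoo bar\n"
for number, message in lint_robots_txt(SAMPLE):
    print(number, message)
```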

Top Blog List

Rank Site Category
1 John Battelle Searchblog SEO / SEM
2 Problogger Blogging
3 SEOmoz SEO / SEM
4 Matt Cutts SEO / SEM
4 Copyblogger Internet Marketing
6 Search Engine Land SEO / SEM
7 Search Engine Watch SEO / SEM
8 Performancing Blogging
9 Shoemoney Make Money Online
10 John Chow Make Money Online
11 Entrepreneurs Journey Internet Marketing
12 Dosh Dosh Social Media
13 SEO Book SEO / SEM
14 Search Engine Journal SEO / SEM
15 Net Profits Today Affiliate Marketing
16 Daily Blog Tips Blogging
16 Top Rank Blog SEO / SEM
18 Search Engine Guide SEO / SEM
19 Pronet Advertising Social Media
20 Search Engine Roundtable SEO / SEM
21 WebProNews SEO / SEM
22 Marketing Pilgrim Internet Marketing
23 ClickZ SEO / SEM
24 Chris Garrett Blogging
25 Chris Brogan Social Media
26 Quick Sprout Social Media
27 Stuntdubl SEO / SEM
28 SEO Scoop SEO / SEM
29 Stephan Spencer SEO / SEM
30 Spark Plugging Make Money Online
31 HubSpot Internet Marketing Internet Marketing
31 Blog Storm SEO / SEM
33 SEO by the Sea SEO / SEM
34 Courtney Tuttle Internet Marketing
35 Web Ink Now Social Media
35 Bruce Clay SEO / SEM
35 Blogging Tips Blogging
35 ReveNews Make Money Online
39 Sugarrae SEO / SEM
39 Andy Beard Internet Marketing
41 The Link Spiel Link Building
41 Carl Ocab Make Money Online
41 Jim Kukral Internet Marketing
44 Small Business SEM SEO / SEM
44 Blueverse Make Money Online
46 Social Media Explorer Social Media
47 Vanessa Fox Nude SEO / SEM
47 10e20 Social Media
47 Redfly Marketing SEO / SEM
50 Affiliate Tip Affiliate Marketing
50 PPC Hero SEO / SEM
50 Strategic Profits Make Money Online
50 Search Marketing Gurus SEO / SEM
50 SEO Fast Start SEO / SEM
50 Stone Temple SEO / SEM
56 Freelance Folder Copywriting
57 Techipedia Social Media
57 Dean Hunt Internet Marketing
57 Ian Fernando Make Money Online
60 Blogger Unleashed Blogging
60 SEO Black Hat SEO / SEM
60 Conversation Marketing Internet Marketing
63 Winning the Web Internet Marketing
63 SEO Design Solutions SEO / SEM
63 Andy Wibbels Internet Marketing
66 Marketing Tips Blog Internet Marketing
66 Wiep.net Link Building
66 45n5 Make Money Online
66 eXtra For Every Publisher Make Money Online
66 Graywolf SEO Blog SEO / SEM
71 Search Engine People SEO / SEM
71 Brent Csutoras Social Media
71 Michelle MacPhearson Social Media
71 Dave Naylor SEO / SEM
75 Garry Conn Make Money Online
75 Caroline Middlebrook Make Money Online
75 Local SEO Guide SEO / SEM
78 Solo SEO SEO / SEM
78 Blogsessive Blogging
80 Affiliate Watcher Affiliate Marketing
80 Balkhis Make Money Online
80 Jim Boykin Link Building
83 SEO Smarty SEO / SEM
84 aimClear SEO / SEM
84 Super Affiliate Mindset Affiliate Marketing
84 Dan Zarrella Social Media
87 Traffikd Social Media
88 Yoast SEO / SEM
88 Stephan Miller Make Money Online
88 CDF Networks Affiliate Marketing
88 5 Star Affiliate Programs Affiliate Marketing
92 Cre8pc SEO / SEM
92 John Andrews SEO / SEM
94 SEO ROI SEO / SEM
94 ViperChill Social Media
94 Grownup Geek Make Money Online
97 Self Made Minds Make Money Online
97 Slightly Shady SEO SEO / SEM
97 Cash Tactics Make Money Online
97 Eric Ward Link Building
101 Shimon Sandler SEO / SEM
101 Why Do Work Make Money Online
101 Cornwall SEO Social Media
101 SearchRank SEO / SEM
101 Who Is Andrew Wee Make Money Online
106 John Cow Make Money Online
106 Hamlet Batista SEO / SEM
108 Vinny Lingham Internet Marketing
108 Josh Spaulding Internet Marketing
110 Huomah SEO / SEM
110 MEMWG Make Money Online
110 SEO Theory SEO / SEM
113 Bill Hartzer SEO / SEM
113 SEOish SEO / SEM
115 SEO Chicks SEO / SEM
115 Thou Shall Blog Blogging
115 Google Lady Affiliate Marketing
118 SeoPedia SEO / SEM
118 Eric Lander SEO / SEM
120 Here.org.uk Affiliate Marketing
120 Sebastian's Pamphlets SEO / SEM
122 The Net Fool Make Money Online
122 Blogging Bits Blogging
122 Nate Whitehill Make Money Online
125 Collective Thoughts Social Media
125 Mason World Internet Marketing
125 Can I Make Big Money Online Make Money Online
125 Affiliate Confession Affiliate Marketing
125 Blue Hat SEO SEO / SEM
125 Aojon Affiliate Marketing
131 Muhammad Saleem Social Media
132 Pure Blogging Blogging
132 Terry Dean Internet Marketing
132 3 Dog Media Social Media
132 Small Fuel Internet Marketing
132 NowSourcing Social Media
132 Ades Blog Make Money Online
138 The University Kid Make Money Online
139 Jon Waraas Make Money Online
139 The Mad Hat SEO / SEM
139 Just Make Money Online Make Money Online
142 Scott Monty Social Media
142 Retire at 21 Make Money Online
142 Blogging Experiment Blogging
145 Social Desire Social Media
146 Blogtrepreneur Blogging
147 Collin Lahay Link Building
147 Jangro Internet Marketing
149 SEOptimise SEO / SEM
149 SEM Clubhouse SEO / SEM
151 eMonetized Make Money Online
151 Zac Johnson Affiliate Marketing
151 The Niche Store Builder Make Money Online
154 Palatnik Factor Internet Marketing
154 Dazzlin Donna Make Money Online
156 Cath Lawson Make Money Online
156 Mr Javo Internet Marketing
158 All Things SEM SEO / SEM
159 CPA Affiliates Affiliate Marketing
159 ShandyKing SEO / SEM
161 Problogineer Blogging
161 Stand Out Blogger Blogging
163 Jim Karter Make Money Online
164 Tyler Cruz Make Money Online
165 Nickycakes Affiliate Marketing
165 Net Business Blog Make Money Online
167 Scribbles and Words Blogging
167 Jonathan Volk Affiliate Marketing
167 Rajaie Talks Make Money Online
170 Slyvisions Make Money Online
170 Gather Success Make Money Online
170 Dat Money Make Money Online
170 Sabahan Make Money Online
174 IM with Joe Internet Marketing
174 Darin.cc SEO / SEM
176 Blogging Secret Blogging
176 Tim Nash SEO / SEM
178 Ask Shane Internet Marketing
179 Lost Art of Blogging Blogging
180 Preblogging Blogging
180 SEO Refugee SEO / SEM
182 Utah SEO Pro SEO / SEM
183 Uber Affiliate Affiliate Marketing
183 David Dalka Internet Marketing
185 Search For Blogging Blogging
186 Infected by Bugs Make Money Online
187 Newest on the Net Internet Marketing
187 Big Marketing Blog Internet Marketing
187 The Income Academy Make Money Online
190 Onreact SEO SEO / SEM
191 That Pam Chick SEO / SEM
191 Etienne Teo Make Money Online
191 Blog About Your Blog Blogging
191 Money Bites Make Money Online
195 Yimto Affiliate Marketing
196 Turnip of Power Make Money Online
196 Karl Ribas SEO / SEM
198 Bukiki HomeBiz Make Money Online
198 Marcus Hochstadt Make Money Online
200 Pat B. Doyle Make Money Online
201 We Build Pages Internet Marketing
201 Feed Flare Blogging
203 The Blog Entrepreneur Blogging
203 Enkay Blog Make Money Online
203 It's Write Now Copywriting
206 Search and Social SEO / SEM
207 Chris Hooley SEO / SEM
208 Blogging Fingers Blogging
209 SEO Scientist SEO / SEM
210 Richard Lee Make Money Online
210 Gonzo SEO SEO / SEM
210 One Man's Goal Make Money Online
213 Cash Quests Make Money Online
214 PQ Internet Internet Marketing
214 Christian Affiliate Marketers Affiliate Marketing
216 The Writers Manifesto Copywriting
217 Copy Brighter Social Media
218 Ask Kalena SEO / SEM
219 Affiliate Toolbox Affiliate Marketing
219 Internet Babel Make Money Online
221 Big Ben Patton Internet Marketing
222 Ryan Shamus Internet Marketing
223 Pimp My PageRank SEO / SEM
224 Fraser's Affiliate Marketing Blog Affiliate Marketing
225 Nick Wilsdon SEO / SEM
226 Cyber Cashology Make Money Online
227 The Home Business Archive Internet Marketing
228 Wingnut SEO SEO / SEM
229 Out of My Gord SEO / SEM
230 Sitemost SEO / SEM
231 Bruce Hopkins Make Money Online
232 Think Like an SOB Affiliate Marketing
233 Mubin Ahmed Make Money Online
234 James D. Brausch Internet Marketing

Canonical URL Issues

There's a potential canonical URL issue that we've not touched on often, if ever. It's the kind of thing that might cause indexing issues or split PageRank into different "piles" - and even, potentially, generate duplicate URL problems.

This canonical problem comes from adding a period to the end of a domain name - http://www.example.com. - and that can trigger a cascade of potential problems. If the trailing period is at the end of the domain name and the site's navigation uses relative URLs, then the extra period gets carried forward, and forward, and forward, through succeeding links.
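
A quick way to see the carry-forward effect is to resolve a few relative links the way a browser or crawler would. The Python sketch below does that with the standard library; the link paths are made up for the example.

```python
from urllib.parse import urljoin, urlsplit


def follow_links(start_url, relative_links):
    """Resolve each relative link against the page reached before it."""
    urls = [start_url]
    for link in relative_links:
        urls.append(urljoin(urls[-1], link))
    return urls


# Hypothetical click path through a site that uses relative URLs.
chain = follow_links(
    "http://www.example.com./",
    ["products/", "widget.html", "../about.html"],
)
hosts = [urlsplit(url).hostname for url in chain]

for url in chain:
    print(url)
```

Every URL in the chain keeps the hostname with the trailing period, so a crawler that enters through one bad link can discover a whole parallel copy of the site.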

There's a new thread in our Apache Forum that touches on the issue, and it also shares a fix - http://www.webmasterworld.com/apache/3718084.htm. As moderator jdMorgan observes, even google.com. has this problem!

This kind of link can be generated innocently enough by forum software that automatically creates links for text strings that look like URLs but are at the end of a sentence. And many servers will not have a problem resolving that URL with an extra period.

So, for the sake of a complete reference, I'd like to collect the potential canonical URL issues all in one place.

Canonical URL Issues

Different domain names serving the same content (302 redirects can make this kind of mess)
Different hostnames within one domain, such as "with-www" and "no-www" versions
With and without "index.html" for the domain root or a subdirectory root
Different protocols - https and http
Trailing period on the domain name
Double forward slash in the file path - http://example.com//page.html
Swapping the order of query string parameters
URL rewrite that allows typos for the "keyworded" virtual directory name
Any forum software or CMS that generates alternate URLs for the same content
URLs that include session parameters, click path tracking, etc.
Adding a port number to the domain name: example.com:443
URLs with unneeded query strings or extra parameters in the query string
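
Several of the items above can be folded into a single normalization step. The following Python sketch shows one possible approach, under stated assumptions: the preferred scheme and hostname, the list of session parameter names, and the index file names are all hypothetical choices, and issues such as typo-tolerant rewrites or separate domain names are not covered.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}  # hypothetical parameter names
INDEX_FILES = ("index.html", "index.htm")           # assumed directory index names


def canonicalize(url, scheme="http", host="www.example.com"):
    """Map common variants of one URL onto a single preferred form."""
    parts = urlsplit(url)
    # hostname is lowercased and excludes the port; strip the trailing period.
    seen_host = (parts.hostname or "").rstrip(".")
    bare = host[4:] if host.startswith("www.") else host
    if seen_host in (bare, "www." + bare):
        seen_host = host  # fold with-www and no-www together
    # Drop default ports only; any other port is kept as part of the address.
    netloc = seen_host if parts.port in (None, 80, 443) else f"{seen_host}:{parts.port}"
    # Collapse doubled slashes and remove a trailing index file name.
    path = re.sub(r"/{2,}", "/", parts.path) or "/"
    for name in INDEX_FILES:
        if path.endswith("/" + name):
            path = path[: -len(name)]
            break
    # Remove session parameters and put the rest in a stable order.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    )
    return urlunsplit((scheme, netloc, path, urlencode(query), ""))


variants = [
    "http://example.com/page/index.htm?b=2&a=1",
    "https://www.example.com.:443//page/?a=1&b=2&sid=xyz",
    "http://WWW.EXAMPLE.COM/page/?b=2&a=1",
]
print({canonicalize(url) for url in variants})
```

In practice the same mapping would usually be enforced with 301 redirects on the server; a function like this is mainly useful for auditing a crawl or a list of inbound links to see how many variants point at the same page.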