Duplicate Content Issues

What is Duplicate Content?

Duplicate content is identical (or near identical) content that can be found in more than one place on the Internet. Although Google will not release the exact figures, Go Up estimate that if 70% of the content on a page is identical to the content of a different page, this page will be considered a duplicate.
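As a rough illustration of how such a percentage might be measured, the Python sketch below compares two pages by their overlapping three-word sequences. The measure, the function names and the three-word window are assumptions made for illustration. The 70% threshold is the Go Up estimate quoted above, not a figure published by Google, and nothing here claims to reproduce how a search engine actually scores pages.

```python
import re


def shingles(text: str, size: int = 3) -> set:
    """Return the set of overlapping word sequences ("shingles") in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def overlap_ratio(page: str, other: str, size: int = 3) -> float:
    """Share of the first page's shingles that also appear in the other page."""
    a, b = shingles(page, size), shingles(other, size)
    return len(a & b) / len(a) if a else 0.0


def looks_duplicate(page: str, other: str, threshold: float = 0.70) -> bool:
    # 0.70 mirrors the Go Up estimate; it is an assumption, not a Google figure.
    return overlap_ratio(page, other) >= threshold
```

Running `looks_duplicate` over pairs of pages from your own site gives a quick, if crude, first pass before reaching for the dedicated tools covered later.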

Why is it bad?

Duplicate content has been a problem in Inbound Marketing and SEO for a long time, but with the introduction of Google’s Panda Update in February 2011 the problem has become a lot more significant. Prior to Google Panda, duplicate content issues would usually only affect the ranking capacity of the duplicate pages. Post-Panda, however, duplicate content can seriously affect the ranking ability of the entire site. Because of this, duplicate content problems have gone from being relatively serious to extremely serious.

Here are a few of the issues with Duplicate content:

- The Search Engines do not want to show more than one version of the same content in the search results; this would reduce the results’ relevancy and diversity.

- The Search Engines try to resolve this by finding the canonical (original) version of the content and only showing this. To achieve this they assume that the page with the highest Page Metrics and Domain Metrics is the original. This is often not the case.

- Duplicate content can often mean Copyright infringement.

- It can be considered Web-Spam, incurring severe Search Engine penalties.

- It can result in a loss of Link Juice and traffic, or division of link juice between different URLs.

- The Search Engines can’t decide between directing the Link Metrics to a single version of the page, or to multiple duplicates of the same page.

- Since the launch of Google’s Panda, duplicate content is now considered to be an indicator of a poor quality site, offering poor user experience. Google wishes to avoid returning poor quality sites in its results pages, and so penalises sites with excessive duplicate content.

- Duplicate pages will use up your crawl allowance. Although there is no exact limit on the number of pages crawled by the Google bots in each crawl session, there are definite patterns in how crawl time is allocated. For example, Google will usually allocate more time for a site with high authority to be crawled than it would for a site with low authority. Too many pages of duplicate content use up precious crawl allowances, often leading to more important pages not being crawled.

Duplicate content can cause websites to suffer rankings and traffic losses, incur spam penalties, or even legal issues. It can make Search Engines reveal less relevant results. It should therefore be treated very seriously.

.

Duplicate Content caused by the website architecture:

These are some examples of duplicate content problems caused by search-unfriendly site structure and coding.

.

1. Multiple URLs for the same site

This is extremely common, occurring when identical versions of a site exist on separate URLs instead of diverting to one canonical URL. It splits domain metrics, page metrics and link metrics across different sites, forcing the website to compete with different versions of itself! Common examples are:

http://GoUp.co.uk/

https://GoUp.co.uk/

http://www.GoUp.co.uk/

https://www.GoUp.co.uk/

http://GoUp.co.uk/index.php

https://GoUp.co.uk/index.php

http://www.GoUp.co.uk/index.php

https://www.GoUp.co.uk/index.php

.

To check if this is an issue, type in the aforementioned URLs and see if they divert to the canonical version of the website, or if they all exist as duplicates.
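For a site of your own, the list of variants to test can be generated rather than typed out by hand. The Python sketch below is one way to do it; the function name and the default index filename are assumptions, and it only builds the list, leaving the actual checking to your browser or HTTP client of choice.

```python
from itertools import product


def homepage_variants(domain: str, index_file: str = "index.php") -> list:
    """List the scheme, www and index-file combinations for a bare domain."""
    variants = []
    # product() varies the last argument fastest, so the schemes alternate first,
    # then the www prefix, then the index file, matching the list above.
    for path, host, scheme in product(("", index_file),
                                      (domain, "www." + domain),
                                      ("http", "https")):
        variants.append(f"{scheme}://{host}/{path}")
    return variants
```

Calling `homepage_variants("GoUp.co.uk")` returns the eight URLs listed above, in the same order.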

The Fix:

In short:

- 301-redirect all duplicate pages

- Canonical tag all duplicate pages

- Set your preferred domain in Google Webmaster Tools

- Robots.txt & Robots NoIndex Tag all duplicate pages

- NoFollow all links to duplicate pages

NB this does not work for duplicate problems related to index.php. This is covered later in this section.

.

Explained:

.

Use the 301-redirect.

Identify the canonical version of the website. This is usually the http://www.GoUp.co.uk version. 301-redirect all of the pages on the non-canonical URLs to their counterparts on the canonical website. When multiple duplicate pages are 301-redirected to form a single page, they stop competing with one another and combine their metrics to generate a stronger relevancy and popularity signal for the search engines.

NEVER use 302-Redirects as these are considered temporary redirects and thus do not pass link juice.

Note: This does not apply to the http://www.GoUp.co.uk/index.php version. We will explain how to rectify this problem a little later.
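The redirect rule itself is normally written in your server configuration, but the logic is easy to sketch in Python. The example below assumes the http://www. version is canonical; the constant and function names are hypothetical. It only normalises the scheme and host, leaving index.php alone as per the note above.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_SCHEME = "http"          # assumption: the http://www. version is canonical
CANONICAL_HOST = "www.goup.co.uk"  # hostnames are case-insensitive, so compare in lower case


def redirect_for(url: str):
    """Return (301, target) if url is on a non-canonical scheme or host, else None."""
    parts = urlsplit(url)
    if parts.scheme == CANONICAL_SCHEME and parts.netloc.lower() == CANONICAL_HOST:
        return None  # already canonical: no redirect, so no loop
    # Keep the path and query so each page lands on its own counterpart.
    target = urlunsplit((CANONICAL_SCHEME, CANONICAL_HOST,
                         parts.path or "/", parts.query, ""))
    return 301, target
```

The important design point is the early return: a request that is already canonical must never be redirected, otherwise the spider is sent round in circles.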

.

And/ or

.

Use the Rel=Canonical Tag.

The Rel=Canonical Tag works for Bing and Google. It also passes link juice but can take much less time to implement, especially when dealing with larger sites. The Go Up best practice recommendation is to use both the 301-Redirect and the Rel=Canonical Tag whenever possible and practical.

Note: This does not apply to the http://www.GoUp.co.uk/index.php version. We will explain how to rectify this problem a little later.
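For reference, the tag is a single link element placed in the head of each duplicate page, pointing at the canonical URL. On larger sites it is usually generated by a template rather than typed by hand; the small Python helper below is a sketch of that idea, with a hypothetical function name.

```python
from html import escape


def canonical_link_tag(canonical_url: str) -> str:
    """Build the <link> element that belongs in the <head> of each duplicate page."""
    # escape() protects the attribute if the URL contains & or quote characters.
    return f'<link rel="canonical" href="{escape(canonical_url, quote=True)}" />'
```

Every duplicate of a page should emit the same tag, pointing at one and only one canonical URL.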

.

Set your preferred domain in Google Webmaster Tools.

Setting your preferred domain in Google Webmaster Tools helps to ensure that the listing shown in the search results is your preferred listing.

.

Use Robots.txt, Robots NoIndex and NoFollow.

Use Robots Tags and Robots.txt to block the search engines from crawling the duplicate versions of the page. You can also use the Robots NoIndex to stop the crawlers from indexing your page. If you use this method, make sure to use NoFollow Tags on all of your links to these pages, to stop the leaking of link juice.

.

Example: Meta Robots Code:

.

<head>
<meta name="robots" content="noindex, follow" />
</head>

.

A common but often overlooked example:

.

HTTPS duplicate content

E-commerce websites using the HTTPS secure server URL often run into duplicate content issues. The problem arises when pages on the secure https:// connection (often used at ‘checkout’ on e-commerce sites) link back to the main site using relative links instead of absolute links. A relative link inherits the scheme of the page it sits on, so on a secure page it resolves to an https:// URL, whereas an absolute link can point explicitly at the http:// version. This results in your homepage being called as https://www.GoUp.co.uk, creating a duplicate version of the site!

.

The Fix

Make sure that all links from the secure pages of your site call the http://www.GoUp.co.uk version of your site. If you do not wish to use absolute links to resolve this, you can use 301-redirects or canonical URL tags instead.

You can also use robots.txt to limit or prevent crawler access to the duplicate https:// version of your website. For this, you will need to create a robots.txt file separate from any existing one on the http:// version. To allow the crawling of all pages on your http:// version and not on your https:// version, use this code:

.

HTTP:

User-agent: *
Disallow:

.

HTTPS:

User-agent: *
Disallow: /
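Before uploading the two files it is worth confirming that they behave as intended. One way, sketched below, is to run them through the robots.txt parser in Python's standard library; the helper name and the two constants are assumptions for illustration. Note that the User-agent and Disallow lines are kept together with no blank line between them, since parsers such as this one treat a blank line as the end of a record.

```python
from urllib.robotparser import RobotFileParser


def allows_crawling(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Parse robots.txt text and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The two files from the example above.
HTTP_ROBOTS = "User-agent: *\nDisallow:\n"     # an empty Disallow allows everything
HTTPS_ROBOTS = "User-agent: *\nDisallow: /\n"  # Disallow: / blocks the whole site
```

With these, `allows_crawling(HTTP_ROBOTS, "http://www.GoUp.co.uk/")` is true, while the same check against `HTTPS_ROBOTS` and an https:// URL is false.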

.

/index.php duplicate content problems

Many websites link to their homepage in a form similar to: http://www.GoUp.co.uk/index.php

However, because most inbound (external) links will link to your site’s homepage using: http://www.GoUp.co.uk/, duplicate content issues may arise, splitting the site’s link juice.

.

The Fix

 

This problem takes a bit of know-how to resolve. Do not use a traditional 301 redirect straight away; if done here, you could trap the search spider in an endless loop between the canonical page and the index.php page. (Talk about ways to ensure your site never gets crawled…gulp!)

 

1. Firstly you need to copy the contents of the index.php to another file, e.g. ‘sitehome.php’

 

2. After this, you must make an Apache DirectoryIndex directive for your document root and set it to sitehome.php. Avoid setting this directive server wide; this will cause issues for other folders still requiring index.php as a directory index.

 

3. Thirdly you must put the directive in an .htaccess file in your document root: DirectoryIndex sitehome.php. If you are not using per-directory context files, it should instead be placed in your httpd.conf, as per the example below:

.

<Directory /your/document/root/GoUp.co.uk/>

DirectoryIndex sitehome.php

</Directory>

.

Replace all the contents of your original index.php file with this coding:

.

<?php header("Location: http://www.GoUp.co.uk/", true, 301); exit; ?>

.

From now on, when the canonical URL is typed in, sitehome.php will be read. The index.php will no longer be the default filename. Any and all requests for index.php from old links should now be 301 redirected without causing an endless loop. Hooray!

 

CMS users beware: When finished with this ensure that all internal links point to the canonical URL to avoid further loop problems.

 

Spam issues

.

A common tactic used by spammers and black-hat SEOs is to buy multiple URLs targeting specific keywords applicable to different local markets or specific products. They direct these URLs to unique landing pages, which then lead onto the same canonical site. An example of this would be a car dealer called www.londonlocalcars.co.uk wishing to appear more local to different areas in London (thus getting that ‘local feel’ whilst also ranking for the local results) by buying up the URLs www.belgravialocalcars.co.uk, www.kensingtonlocalcars.co.uk, www.hounslowlocalcars.co.uk etc. They would create a semi-unique landing page for each of these, which would then link onto the main site. This is also practised by businesses who wish to rank as a leading niche vendor, so www.bestplumbersever.com would also own www.electric-boilers-r-us.com.

This leads to poor link juice distribution. In extreme cases it can be considered as spammy behaviour, resulting in severe penalties. Avoid at all costs.

.

2. Content Sort Orders

Some clever websites, such as Amazon, sort content on a page depending on user requests. For instance, there may be a pull down in which users can change the sorting of the content by item price (lowest to highest or highest to lowest), item size, manufacture date etc. The downside is that this is not nearly as optimised as using a straightforward navigational structure, and can result in duplicate content issues, as the crawlers find different versions of the same content.

.

The Fix:

Use the Rel=Canonical Tag to let the search spiders know that these different sort orders are the same content as the original version of the page.

Or

Require a cookie for the viewing of the different sort orders of the page. Search engine spiders do not accept cookies and so will not be able to crawl or index these different sort orders. This can also be useful if you want to prevent search engines from accessing certain content on your site, whilst allowing users to still see it. A word of warning though: used incorrectly, this could be seen as Cloaking, which the search engines are not at all fond of.

.

3. Printer-Friendly versions

Printer-friendly versions of content often lead to the printer friendly version being indexed along with the normal version. This causes duplicate content issues.

.

The Fix:

NoFollow all internal links to the printer-friendly versions, then use the Robots Tags or Robots.txt to block the search engines from crawling these pages.

.

4. URL Parameters

URL parameters such as click tracking and some analytics code can cause duplicate content issues.

.

The Fix:

.

Block crawler access to the URL.

Block URLs with parameters from Google and Co’s crawlers using Robots Tags or Robots.txt. Note: Make absolutely sure that the canonical URL (which does not have the tracking code) is not blocked as well. It is also worth noting that this will cut off any link juice from inbound links to the URL with parameters.

.

5. Session IDs

Many sites track their visitors by adding unique Session IDs to the end of their URLs. Because the end of each URL will be different, the search spiders see them as completely different URLs containing identical content.

.

The Fix:

If you need to use Session IDs, use the canonical tag to tell Google and pals which is the original.

You could also use 301-redirects to redirect URLs with parameters to the canonical version.
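Whether you use the canonical tag or a 301-redirect, you first need to work out the canonical URL for any given request. The Python sketch below strips session and tracking parameters and leaves everything else intact. The parameter names are hypothetical examples, so substitute whatever your own site appends.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names; replace them with the ones your site really uses.
NON_CANONICAL_PARAMS = {"sessionid", "sid", "utm_source", "utm_medium", "utm_campaign"}


def canonical_url(url: str) -> str:
    """Drop session and tracking parameters, keeping other parameters in order."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() not in NON_CANONICAL_PARAMS]
    # The fragment is dropped as well, since it is never sent to the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

Two visitors with different Session IDs now map to one and the same URL, which is the value to put in the canonical tag or to use as the redirect target.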

.

6. Different TLDs, same content

This mainly applies to businesses wishing to rank internationally. An example would be Go Up attempting to rank in France, the UK and the US, and so using the .fr, .co.uk and .com TLDs. If the content on each TLD is too similar to that of the other TLDs, then there will be duplicate content issues, and it is more than likely that Google will not rank their .fr or .co.uk TLDs at all, losing all rankings in these two countries.

.

The Fix:

If you are going to target multiple countries by attempting to rank across multiple TLDs, make sure that the content on each site is as unique as possible.

.

7. CMSs which create multiple URLs for the same content

Some search-unfriendly CMSs will create multiple versions of the same page spread across different URLs. This causes duplicate content issues.

.

The Fix

If your CMS is adaptable, figure out how to prevent it from creating this problem. If it is not, it may be worth investing in a CMS that does not cause this problem. Delete the duplicate pages, then 301-redirect the defunct pages to their canonical versions. Alternatively, you could use a canonical tag to tell the search engines that they are duplicates, or use a Robots NoIndex tag or a robots.txt Disallow rule, and NoFollow all of the links. The first option is definitely the best.

.

8. Using the same page for multiple store locations.

This follows on from ‘Spam issues’ in ‘Multiple URLs for the same site’. Spammers often use multiple unique URLs, which point to unique landing pages that in turn link to the main site. This is deemed manipulative and carries a penalty. Perhaps more common is when a business has multiple shops or outlets. It will have a ‘locations’ page, on which it lists all of its locations. This misses an opportunity to rank highly in keywords and local search for each different outlet.

.

The Fix:

301-redirect all superfluous URLs, pointing them to a suitable counterpart on the canonical version of the site. Instead of having multiple locations on one page, leverage these locations for local search ranking by creating a separate, unique page for each store. Use unique maps, addresses, directions, shop details etc for each page, and make them rich with non-cannibalised local keywords. Optimise the pages making good use of unique Header and Title Tags, Alt Attributes etc, and watch as they now start to individually rank in local search.

.

9. Paginated Archives and articles

The most common example of this is in news stories or articles. If an article is long, and the Webmaster decides to spread it across multiple pages, as opposed to keeping it on a single page, you will often see a ‘next’ and ‘previous’ button at the end of each page. This then takes you to the continuation of the article on the following page.

This has caused duplicate content problems for a long time now. Google does not know that these separate pages are just continuations of the same article/archive, and sees multiple variations of the same topic, treating them as duplicate content. It also has no means of determining which of these pages is the ‘first’ page, and so which to show in the indices.

.

The Fix:

There are two options:

1. Keep all of the article or article archive to a single, extremely long page. This can look a bit messy, but it is better than running into search engine mayhem.

2. Rel=Prev/ Rel=Next Function. This is a brand new fix, as of September 2011, and it works fantastically. Read more at Rel=Prev/ Rel=Next Function.
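To give a flavour of the second option, each page in the series carries link elements in its head pointing at its neighbours: the first page has only a ‘next’, the last page only a ‘prev’. The Python sketch below generates them. The function name and the ?page=N URL scheme are assumptions, so adapt them to however your own pagination is built.

```python
def pagination_links(base_url: str, page: int, last_page: int) -> list:
    """Return the rel="prev" and rel="next" <link> tags for one page of a series."""
    if not 1 <= page <= last_page:
        raise ValueError("page must be between 1 and last_page")

    def page_url(n: int) -> str:
        # Assumed URL scheme: page 1 is the bare URL, later pages add ?page=N.
        return base_url if n == 1 else f"{base_url}?page={n}"

    tags = []
    if page > 1:
        tags.append(f'<link rel="prev" href="{page_url(page - 1)}" />')
    if page < last_page:
        tags.append(f'<link rel="next" href="{page_url(page + 1)}" />')
    return tags
```

A single-page article gets no tags at all, which is exactly what you want.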

.

10. Content Syndication causing duplicate content.

Content Syndication is a common way to fill out your website content. It is when you literally ‘buy’ ready-made content for your site from a content syndicator. You can then publish this content on your site. The problem is that this content is often bought by several sites, or published on the syndicating site, resulting in duplicate versions across the web.

.

The Fix:

There are several ways to fix this.

1. Make sure that you are the sole owner: The best way is to ensure that you are the sole owner and publisher of the syndicated content. When purchasing the content, request that this is put in the contract. If there is no contract, request that they confirm it in an email, so that you have a copy for your records.

This really should be a prerequisite for any content you purchase. If there are other existing copies of the content across the web, the SEO benefits for publishing this content will be all but worthless.

2. Use the Syndication-Source tag: If you have purchased the content for reasons other than SEO benefit, perhaps simply to enhance user experience, then you can use the Syndication-Source tag. This tag is similar to the Rel=canonical tag, but is specifically for syndicated content. Read more at Syndication/ Source Tag.

.

11. Duplicate content caused by using multiple servers.

If you host your website across multiple servers, chances are you will suffer from some extremely severe duplicate content issues. The only way to resolve this is to transfer all of your web data onto a single server, even if it has to be a particularly giant one!

.

On Page Duplicate Content Issues

.

1. Keyword cannibalism

Keyword cannibalism creates a scenario in which a site is competing with itself for search engine rankings. It occurs when multiple pages from the same website target the same keywords, phrases or topic areas. Google then must decide which of the pages is the most relevant. For instance, if you run an online pet shop, and four of your categories concentrate on dogs, all targeting similar keywords, anchor text and content, these pages will be unnecessarily taking traffic away from one another. Problems include:

Inbound links: External linkers who wish to link to your dogs page, will have to decide between four different categories. This splits your link juice.

Anchor text: Since there are so many pages targeting the same topics, you will be unable to focus all of your internal anchor text towards one specific page, again splitting the link juice.

Visitors’ user experience: Your visitors’ site experience is also going to suffer as they become bored reading scattered, navigationally confusing content.

Conversion rates: If one page on a topic is converting well, and different pages on the same topic are not converting so well, it is much better to consolidate all of the content onto one page, deleting any repetitive or superfluous content.

.

The Fix:

a. Efficiently categorise each topic. In this case, you could have an umbrella category page for ‘dogs’. In this, provide a general topic overview, and then create subfolders to the more specific pages. For instance: ‘Spaniels,’ ‘Retrievers,’ ‘Labradors,’ etc.

b. If you already have this issue, and wish to consolidate your cannibalised pages, identify the less converting pages and integrate any crucial content onto the well converting page, deleting any repetitions. 301-redirect the now defunct pages to the new main page, transferring any link juice and traffic. Make sure to delete any internal links pointing to the now defunct pages.

c. If multiple pages target similar keywords, find out which of these pages has the highest conversion rate for these terms. Link back to the relevant page from the cannibal pages. The back link should be placed within the content using a hyperlink from the first mention of the cannibalised keyword. For instance, with the Go Up website, you will notice that on every page which mentions ‘SEO,’ there is a hyperlink from the first mention of the term SEO to our most converting page for that term.

.

2. Manufacturers’ product descriptions

On ecommerce sites it is common to use the manufacturer’s product description to describe the products. Naturally, this leads to en-masse duplicate content issues, as each web-vendor uses the same product description for the product.

.

The Fix:

For smaller sites: Write unique product descriptions in place of the generic ones. This can be very time consuming but it is definitely worth it. It will allow you to rank in the SERPs for each individual item.

For large sites: The more products you stock, the less practical it becomes to write individual unique product descriptions. Instead, take advantage of user generated reviews and product descriptions in conjunction with the manufacturer’s description. When there are enough user-generated reviews and descriptions, the page will no longer be seen as a duplicate and will begin to rank. This technique can also be used to great effect by smaller sites seeking great long tail keyword targeting. It is also much appreciated by consumers, as they consider customer reviews to be much less biased than those of the manufacturer/retailer.

.

3. Meta Tags

Often web-developers or SEOs will target the same keywords throughout all of the site’s Meta Tags. This leads to keyword cannibalism issues, as well as duplicate content issues. As well as forcing the site to compete with itself for rankings, it is often penalised by the engines as spammy behaviour.

.

The Fix:

Create unique, personalised Meta Tags for each individual page. Make sure that they are relevant for the page. This not only fixes a problem but also brings many rewards in the rankings.

.

4. Page text

When the text and content of a page is too similar to that of another page it is considered a duplicate, and is subject to all of the aforementioned duplicate content problems. This is often the result of lazy webmasters, content syndication or out and out content theft. We will deal with each problem in turn:

a. Lazy webmasters. This occurs when a Webmaster either has no unique content for each page, so simply rehashes the content from a different site page, or when he creates too many similarities between all of the pages on his site.

.

The Fix

Create fresh, unique content for each individual page

.

b. Content Syndication. This is when someone creates content with the intent of having it published on someone else’s site in return for a reward (often a link). The problem is that this content is often available on the original publisher’s site as well, and perhaps on multiple other sites that it has been sold to, causing duplicate content.

.

The Fix

If you are a publisher, do not syndicate content to multiple sites or place content that you have syndicated out on your site. If you are the purchaser and the publisher also has the content on their site, request that it is taken down. If this is not possible, place a link to the original content within the copy of the content on your site. This tells Google that it is not the original and avoids duplicate issues.

c. Content Scraping and Copyright issues. This is when someone steals the content of your site, and duplicates it on their site. It causes duplicate content issues and can sometimes result in the search engines mistaking the stealing site to be the original, penalising the real publishing site!

.

The Fix

A great way to see if your content is being scraped is to use Duplichecker, or better yet Copyscape, which checks for duplicates over the web. File a DMCA Infringement Request with Google, Bing and Yahoo. This will inform the search engines of the problem, often resulting in the offending page being removed from the indices. If you file the DMCA infringement request with the site’s hosting company you will often find that they will remove the page themselves. If this is unsuccessful, you could always threaten legal action unless the page is removed.

.

5. User generated content causing duplicate content.

Sites that take advantage of UGC (user generated content) often find that their users post duplicate content. Sometimes this can even cause copyright problems.

.

The Fix:

This is very difficult to control. Put a notice on the page stating that users are not to post duplicate content, and that any duplicate content will be removed. Also state that any user who illegally infringes copyright on your site will be barred. Place a notice on your site stating that, when notified, any duplicate content will be removed as soon as possible. This will give you an opportunity to remove any UGC (or non-UGC for that matter) on your site before the owner of the content takes further action.

.

Other methods to deal with duplicate content.

.

URL Removal

.

Google

You can request that Google remove the page from the index. You will have to plug in each page separately.

Note: This is definitely a last resort, as it is entirely up to Google’s discretion and possible that they will ignore your request.

You must first Robots.txt block, Meta NoIndex or 404 Page Not Found the page before requesting removal.

After you have done this, go to Google Webmaster Tools, click “Site Configuration” – “Crawler Access” – “Remove URL”.

BING

Follow the same initial steps that you would for Google URL removal. 404, Robots.txt block or NoIndex the pages. Then go to Bing Webmaster Central. Click on “Index” – “Block URLs” – “Block URL and Cache”.

Bing actually gives you more options than Google, allowing you to block a page, directory or even a whole site.

.

Google Parameter Blocking

This is definitely the quickest way to block multiple duplicate pages from the index. It does, however, have some pretty major drawbacks, as we explain in our page description. Please see Google Parameter Blocking for more details.

.

FAQs

Does Google consider duplicate coding to be duplicate content? No. Google and friends are much more interested in the actual text on a page than they are in the page’s coding.

Are common page elements (Navigational bar etc) considered duplicate content? Common page elements are page elements shared across a website. These are not considered to be duplicate content and are easily identifiable by the search engines.

.

Diagnosing Duplicate Content Issues

The easiest way to prevent duplicate content is to make sure that you do not create any when building the site. Keep a list of all of the pages on your website, perhaps in a Word document, so that you can delete pages and add new ones without any hassle. Include a brief outline of the content of these pages, and categorise them. You will quickly spot if any two pages have identical themes or content. Make sure that you follow all of the aforementioned instructions and you should be okay. Of course, your website may well be well established or extremely large, which makes exact mapping of the website difficult. In this case, you can use some of the following software to help:

.

Google Webmaster Tools

Google Webmaster Tools provides a great function for getting an overview of your site’s duplicate content problems. It allows you to view duplicate Meta Tags, especially Title Tags, throughout your site. In your Google Webmaster Tools account, go to “Diagnostics” > “HTML Suggestions” to pull up a summary of the issues it has found.

Clicking on ‘duplicate Meta descriptions’ or ‘duplicate Title Tags’ will flag up a list of the duplicate descriptions and their pages. This is often a very good place to start.
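If you keep your own inventory of pages and their titles, the same kind of check can be run offline. The Python sketch below groups URLs by Title Tag and reports any title used more than once. The function name and the shape of the input (a dictionary of URL to title) are assumptions, and it is no substitute for the report in Google Webmaster Tools.

```python
from collections import defaultdict


def duplicate_titles(pages: dict) -> dict:
    """Map each title used by more than one URL to the sorted list of those URLs."""
    by_title = defaultdict(list)
    for url, title in pages.items():
        # Normalise case and whitespace so trivial differences do not hide duplicates.
        by_title[" ".join(title.lower().split())].append(url)
    return {title: sorted(urls) for title, urls in by_title.items() if len(urls) > 1}
```

The same function works just as well for Meta descriptions: pass in a dictionary of URL to description instead.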

If you want a more thorough survey of your site’s duplicate content, there are many other software vendors out there that can do this.

.

Copyscape

www.copyscape.com

Copyscape allows you to type in a URL and identify whether the copy on that URL has been duplicated anywhere across the Internet. It only checks the HTML content, not the design or Meta Data.

.

SEOmoz Pro

SEOmoz has some great duplicate content tracking software. It comes at a price, so it is perhaps best used only by the SEO professional.

.

Written by Edward Coram-James

Copyright © 2012 Go Up Ltd. All rights reserved.

