Duplicate content on a site is where two or more pages host the same content, whether that's a repeated block of text or an entire page being accessible through multiple URLs.
When this happens, it poses a problem for search engines, whose job is to provide the most relevant page for the user. If the same page is available through multiple URLs, those URLs may well end up competing with each other, eventually causing problems for how the content ranks.
Though it won't lead to a Google duplicate content penalty (unless done in an obviously spammy manner), it's still a pain that should be dealt with and avoided on any website.
Why Does Duplicate Content Matter?
There are various reasons why duplicate pages matter for both search engines and site owners.
For Search Engines
Duplicate content presents problems for search engines, including:
- Not understanding which version(s) of the page should be included or excluded from indexing.
- Not knowing whether to direct link metrics (trust, authority, link equity, anchor text and so on) to one page or keep them separated across multiple versions.
- Not knowing which version(s) to rank in search engine results pages.
For Site Owners
Duplicate content on your website can lead to a decline in rankings and losses in organic traffic. These losses come from two sources:
- To give users the most relevant results, search engines rarely show multiple versions of the same content, instead choosing the version they feel is the best result. This means your duplicate content isn't visible to users.
- Your link equity becomes diluted as other sites must choose between the duplicate pages. Instead of inbound links pointing to one piece of content, they link to the various versions, spreading that link equity between the duplicate pages. Inbound links are a Google ranking factor, so this dilution impacts the search visibility of the content.
Because of all this, a piece of content does not achieve the search visibility it would if duplicates didn’t exist.
How is Duplicate Content Created?
Duplicate content can spawn accidentally through on-site factors – it's not just a case of articles being plastered across multiple pages or sites. One common example is URL parameters, often seen on ecommerce sites, where the URLs generated by filters need to be handled properly to avoid a massive duplicate content issue.
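As an illustration – example.com and the parameter names here are placeholders rather than a real site – a single category page can end up reachable through several filtered URLs that all serve essentially the same content:
https://www.example.com/dresses
https://www.example.com/dresses?colour=blue
https://www.example.com/dresses?colour=blue&sort=price
https://www.example.com/dresses?sort=price&colour=blue
To a search engine, each of those is a distinct URL, even though the content barely changes between them.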
Other examples include:
Lack of a Preferred Domain: With any site, it's important to ensure that each page has a single preferred URL, and that this rule is applied across the entire domain. If you've got a page which is available through multiple variants – such as with and without www., over HTTP and HTTPS, or with and without capital letters – they'll be treated as separate URLs. This really becomes a problem if you're internally linking to the different variants, which is a common mistake.
Scraped Content: Content on a website includes not only blog posts and editorials, but also product pages. Scrapers republishing your content on their own sites are a common source of duplicate content. E-commerce sites face a similar problem with product information: if many websites sell the same items, they're all likely to be using the same manufacturer descriptions, which gives the impression that every one of those sites has copied content from the same source.
Boilerplate Content: Google refers to this as repetitive swathes of text, such as lengthy copyright text included at the bottom of every page. Preferably, you'd include a very brief summary and link through to a page containing the full text.
Different Regions: Some sites have different pages for different regions, some of which may be in the same language, without any indication – usually a hreflang tag, as sketched after this list – that there's an intended difference between the versions.
Session IDs: These are used to keep track of users as they browse your site – in some cases, every internal link on the website can end up with that session ID appended to the URL, creating a raft of new URLs that all serve the same page.
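To expand on the regions point above, here's a rough sketch of what hreflang annotations look like in a page's HTML head – the URLs are placeholders, not taken from a real site:
<link rel="alternate" hreflang="en-gb" href="https://www.example.co.uk/page/" />
<link rel="alternate" hreflang="en-us" href="https://www.example.com/page/" />
<link rel="alternate" hreflang="x-default" href="https://www.example.com/page/" />
This doesn't remove the duplication, but it tells search engines which version is intended for which audience.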
How to Identify Duplicate Content
As with any other issue your site may have, it's vital to identify duplicate content as soon as possible.
When it comes to finding duplicate content in the form of parameters or title issues, Screaming Frog would be the first port of call.
A real favourite of ours for technical audits and general site overviews, Screaming Frog will flag up any duplicate titles/descriptions on the site, which usually leads to finding the offending pages. Duplicate content can also be found using various other tools, another favourite of ours being SEMrush.
When it comes purely down to content, Copyscape is a handy tool for finding large swathes of duplicate text. Running the URL through Copyscape will provide you with other pages that have the same text.
Though not necessarily a tool, a quick site:domain search can work wonders when it comes to finding duplication issues. You can find pages with duplicate titles/descriptions and parameter issues, and you can also include a line of text to track down any potential boilerplate content issues.
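For instance, either of these searches – with example.com and the quoted text standing in as placeholders – will surface potential duplicates:
site:example.com intitle:"duplicate page title"
site:example.com "a sentence of suspected boilerplate text"
If several URLs come back for what should be a single page, you've likely found a duplication issue.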
Once you know about an issue on the site – or you're looking to catch the smoke before there's a fire – here's how you can deal with and avoid duplicate content:
Implement Canonical Tags
When it comes to dealing with duplicate content, the go-to solution is generally the use of canonical tags.
A canonical tag provides search engines with the preferred URL for that page by using the rel=canonical tag within its code. By setting the preferred URL, it tells search engines to divert any attention through to the canonical URL, consolidating all signals and acting like a 301 redirect in the sense that all “link juice” is passed through to the preferred URL.
An example of this tag can be seen on our very own home page – it actually had to be updated recently following our move over to HTTPS:
<link rel="canonical" href="https://www.ricemedia.co.uk/" />
This becomes very useful for ecommerce sites, particularly for handling URL parameters on category pages. For example, let's take a look at an ecommerce site very dear to us: Diamond Heaven.
The following URL has parameters based on options chosen within the filter (Infinity collection, Rubover setting, in case you're interested):
https://www.diamond-heaven.co.uk/diamond-rings/solitaire#collectionID=12&settingID=16
If you check the page source, you'll see that the canonical tag for this page has been set, ensuring that a separate URL isn't indexed for every different option in the filter.
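As a sketch of what that looks like – the exact markup on the live site may differ – the filtered URL carries a canonical tag in its head pointing back to the clean category URL:
<link rel="canonical" href="https://www.diamond-heaven.co.uk/diamond-rings/solitaire" />
All of the filter variations then consolidate their signals into that one URL.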
Set Up 301 Redirects
This works on a similar principle, in that both 301 redirects and canonicalisation divert all attention and consolidate all signals through to the target page. The difference is that a 301 redirect actually sends users and search engines to the target URL, rather than leaving the duplicate accessible.
The 301 redirect is usually set up within the .htaccess file, though it can also be done through plugins for CMSs such as WordPress.
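As a minimal sketch, assuming an Apache server – the paths here are placeholders rather than rules from a real site:
# Permanently redirect a duplicate or retired URL to the preferred one
Redirect 301 /old-page/ https://www.example.com/new-page/
Any links or bookmarks pointing to the old URL are then sent straight through to the preferred page.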
301 redirects can also be set up for cannibalising content (sounds odd, but this refers to two pages targeting the same subject/keyword). It's not necessarily duplicate content, but #12 in this list of SEO tips offers more insight into this handy idea.
Meta Robots NoIndex
One meta tag that is useful in dealing with duplicate content is meta robots, when combined with the values "noindex, follow". Also known as Meta Noindex, Follow and written in code as content="noindex,follow", this tag can be added to the HTML head of every page that you want to exclude from a search engine's index.
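In practice, the tag sits in the head of the page like this:
<meta name="robots" content="noindex,follow" />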
The meta robots tag lets search engines crawl the links on a page but stops them from including the page itself in their index. It's important that the duplicate page can still be crawled, even though you've asked Google not to index it – Google advises against restricting crawl access to duplicate content, because if crawlers can't reach a page, they can't see the noindex directive or recognise it as a duplicate. Keeping it crawlable also allows search engines to see everything on your site, just in case there's an error elsewhere in your code.
Search Console Parameters
Though already mentioned in the canonicalisation section, parameters can be dealt with in another way, by telling Google directly how to handle them.
This is done via the URL Parameters section of Google Search Console.
Here, you can give Google instructions on how to handle the parameters that appear within the URLs on your site.
In the example shown above, a parameter creates multiple pages within an ecommerce site for its products, and it has been configured to count as paginated content.
Google does note that you should be careful with this tool, as a mistake within a set of instructions could mean key URLs stop being crawled from then on.
Learn more: https://support.google.com/webmasters/answer/6080548
Set Up A Preferred Domain
As mentioned earlier, having a single preferred version of each page is key – part of this comes down to the multiple variants that can be created for the URL of each page.
Say you have a page which is accessible through both HTTP and HTTPS, neither version has been set as preferred, and both have been internally linked to. These will be treated as different URLs, and thus as duplicates of one another.
This could cause issues when it comes to search engines deciding which one should be displayed within search results. To avoid this, non-preferred versions of each URL should redirect through to the preferred version, or at the very least carry a canonical tag pointing to it.
For example – if you try going to an HTTP version of our latest blog post on the recent Google algorithm update, you're redirected through to the preferred HTTPS version, avoiding any confusion.
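If you're on an Apache server with mod_rewrite enabled, a rough .htaccess sketch for forcing the HTTPS, www. version looks something like this – example.com is a placeholder, and the exact rules will vary between setups:
RewriteEngine On
# Send non-HTTPS and non-www requests to the preferred https://www. version
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]
A canonical tag on each page pointing at the preferred URL is a sensible safety net on top of this.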
If your business’s site is struggling with duplicate content and also needs support with creating unique content for your target audience, then do not hesitate to get in touch with Birmingham SEO Agency, Ricemedia. Our technical SEO team will identify and fix all issues regarding duplicate content to help you rank better.