Beginner’s Guide to XML Sitemaps

When carrying out a technical audit for a new client, I find that sitemap issues come up time and time again. Because a sitemap isn’t a requirement, it often gets put on the back burner. But don’t worry, a missing sitemap is easily rectified – you just need to know how. This guide will also show you how to improve a sitemap you may already have, ensuring search engines have the best possible information about your important pages. The suggestions are based on Google’s guidelines. Let’s begin!

1. Introduction to Sitemaps

1.1 What is a sitemap?

A sitemap is a file that provides a list of URLs on your website for search engines. This list of URLs will help Google understand more about your website when it crawls it, in terms of its organisation and site structure. In most cases, a sitemap is formatted as an XML file. Whilst Google does support sitemaps created as an RSS feed or a .txt file, the XML file is the most common one.

1.2 What does a sitemap look like?

A standard sitemap in XML format as shown within Sitemaps.org looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.example.com/</loc>
<lastmod>2005-01-01</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
</urlset>

The main components of a sitemap are:

<urlset> (required)
Encapsulates the file and references the current protocol standard.

<url> (required)
The container or parent tag of each URL. All elements associated with an individual URL will be nested within this.

<loc> (required)
The URL of the page. It should be a full URL containing the domain name.

<lastmod> (optional)
The last modified date of the URL. Should be in W3C Datetime format.

<changefreq> (optional)
How frequently a page changes. Often ignored by search engines.

<priority> (optional)
Priority of the URL relative to others in the list. Ranges from 0.0 (least important) to 1.0 (most important). It’s been confirmed that Google ignores this; however, search engines like Bing do use it.
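
To make the components above concrete, here is a minimal sketch that reads the tags out of a sitemap using Python’s standard library. The function name `read_sitemap` and the `SAMPLE` constant are illustrative assumptions, not part of any sitemap tool; the sample data is the Sitemaps.org example shown earlier.

```python
import xml.etree.ElementTree as ET

# Namespace declared on <urlset> in the Sitemaps.org 0.9 protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# The example sitemap from this section, as bytes.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
"""

def read_sitemap(xml_bytes):
    """Return one dict per <url> entry; optional tags that are absent map to None."""
    root = ET.fromstring(xml_bytes)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        entries.append({
            tag: url.findtext(f"sm:{tag}", default=None, namespaces=SITEMAP_NS)
            for tag in ("loc", "lastmod", "changefreq", "priority")
        })
    return entries

for entry in read_sitemap(SAMPLE):
    print(entry["loc"], entry["lastmod"])
```

Only <loc> is required, so any real script should expect the other three values to be missing.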

1.3 Sitemap Guidelines

When making a sitemap, there are limits on the file size, the number of URLs and the format of the URLs that can be added:

  • Be aware of size limitations – Each sitemap should contain no more than 50,000 URLs, and the file size should not exceed 50MB (uncompressed). Although these are the limits, ideally you wouldn’t want to reach either of them. Break the sitemap down into smaller sitemaps to prevent potential issues with multiple requests to your sitemap.
  • Use a sitemap index file if needed – In the case of multiple sitemaps, this allows for just one sitemap URL to be submitted to Google. (See point 1.4)
  • Only use canonical URLs – Only provide Google with your preferred URLs that contain your full domain. (See point 4.1)
  • Use the correct format – All URLs should be properly formatted and escaped if certain characters are used. (See point 4.9)
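
As a rough illustration of the size guideline above, the sketch below splits a long list of URLs into groups that each fit within the 50,000 URL limit. The function name and the choice of chunk size are assumptions made for the example; it does not check the 50MB file size limit.

```python
MAX_URLS_PER_SITEMAP = 50_000  # per-sitemap URL limit quoted in the guidelines above

def split_into_child_sitemaps(urls, chunk_size=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into consecutive groups of at most chunk_size."""
    if not 0 < chunk_size <= MAX_URLS_PER_SITEMAP:
        raise ValueError("chunk_size must be between 1 and 50,000")
    urls = list(urls)
    return [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]

# Hypothetical site with 120,001 pages: three child sitemaps are needed.
pages = [f"https://www.example.com/page-{n}/" for n in range(120_001)]
chunks = split_into_child_sitemaps(pages)
print([len(chunk) for chunk in chunks])
```

Passing a smaller `chunk_size` (for example 10,000) keeps each file comfortably below the limit, in line with the advice not to sit right at the maximum.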

1.4 What are Index and Child Sitemaps?

A standard sitemap XML file is a list of URLs for your website. However, if your website has a lot of URLs, or has specific URLs for different sections, you may need to create a Sitemap Index and Child Sitemap files.

A Sitemap Index file is a listing page containing links to individual sitemap files. Having an index file allows you to be able to submit just the one index file to Google rather than multiple individual sitemaps.

Example of a sitemap index file

A Child Sitemap is a sitemap file which is linked to from an index file. It’s in exactly the same format as a normal sitemap. An index file can contain up to 50,000 child sitemap references, while sitemaps can contain 50,000 URLs each.
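
For illustration, a sitemap index can be generated with a few lines of Python. This is a sketch under assumptions: the child sitemap URLs are made up, and the helper name `build_sitemap_index` is hypothetical. The <sitemapindex>, <sitemap> and <loc> element names come from the Sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap_index(child_sitemap_urls):
    """Return sitemap index XML (as a string) listing each child sitemap once."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_XMLNS)
    for child_url in child_sitemap_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = child_url
    return ET.tostring(index, encoding="unicode")

# Hypothetical child sitemaps grouped by section.
index_xml = build_sitemap_index([
    "https://www.example.com/sitemap-blog.xml",
    "https://www.example.com/sitemap-products.xml",
])
print(index_xml)
```

The output omits the XML declaration line; add it when writing the file to disk. Each <loc> here points at a child sitemap, never at another index file.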

Creating child sitemaps can be useful if you want to group certain URLs in different sections, for blogs or products. Looking at these individual child sitemaps in Search Console will highlight how many of these URLs are submitted and then indexed. If there is a large difference for certain sitemaps there may be issues to investigate.

Don’t be tempted to nest your sitemap or sitemap index files. Adding a sitemap within a sitemap index file is fine, but you shouldn’t nest sitemap index files within other sitemap index files as it’s not supported and URLs may not be read.

2. Sitemap Benefits

Having a sitemap will not affect the rankings of a page, and sitemaps are not a requirement by Google. Additionally, including a URL within a sitemap does not necessarily mean that it will be indexed. Instead, it’s a strong hint to Google that you consider this URL important enough to be considered for indexation.

So, what’s the point?

The key thing to remember is that a sitemap can help Google better understand your website. Examples of this include being able to notify Google about any new or recently changed pages and help them find key URLs from websites with large or complicated structures. These things can help increase visibility within Google’s index.

Including tags within your sitemap such as <lastmod> will notify crawlers when a page was last updated, which can signal to a crawler that it may need to be prioritised higher in the crawl list. Sitemaps are also key for new websites or site migrations when you want to provide Google with the new list of URLs to crawl and index.

Although sitemaps are not a requirement, it’s particularly recommended to include one if:

  • Your website is large with a complicated URL structure and many internal links
  • You have a new website which has few or no backlinks
  • Your website has recently migrated
  • Your website is constantly changing with pages being added, removed and changed, such as an e-commerce website

3. Finding Your Sitemap

In this section, we’ll go through how to find your sitemap and the differences between a dynamic and static sitemap file.

3.1 How do I find my sitemap?

Generally, a sitemap file is found on the root – the first level – of your website at /sitemap.xml. E.g https://www.ricemedia.co.uk/sitemap.xml

The URL, however, can be whatever you want it to be, as long as the file is in the correct XML format. The sitemap should be located at the root of the location of your sitemap URLs. So, if your sitemap was at website.com/sitemap.xml, it could include all URLs within website.com. If the sitemap was located in a folder such as website.com/folder/sitemap.xml, the URLs in the sitemap could only reference URLs located in that folder, e.g. website.com/folder/page-1.html but not website.com/page-2.html.
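
The location rule above can be sketched as a small check. The function name `in_sitemap_scope` is an assumption made for this example, and it models only the folder rule as described here, not any exceptions a search engine may allow.

```python
import posixpath
from urllib.parse import urlsplit

def in_sitemap_scope(sitemap_url, page_url):
    """True if page_url sits in the same folder as the sitemap, or below it."""
    sitemap, page = urlsplit(sitemap_url), urlsplit(page_url)
    # Protocol and host must match exactly.
    if (sitemap.scheme, sitemap.netloc) != (page.scheme, page.netloc):
        return False
    folder = posixpath.dirname(sitemap.path)  # "/folder" or "/"
    if not folder.endswith("/"):
        folder += "/"
    return page.path.startswith(folder)

print(in_sitemap_scope("https://website.com/folder/sitemap.xml",
                       "https://website.com/folder/page-1.html"))
print(in_sitemap_scope("https://website.com/folder/sitemap.xml",
                       "https://website.com/page-2.html"))
```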

If you’re still not sure where your sitemap is, you can also check your robots.txt file. The robots.txt file is always found in the same location at the root of your website e.g https://www.ricemedia.co.uk/robots.txt. In the robots.txt file, there is sometimes a reference to a sitemap file which may be of help.

If you are still unable to find one, it may have a custom name or one does not currently exist. It’s best to check with your website developers or CMS in that case.
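
If you’d rather script the robots.txt check, the following sketch pulls out any sitemap references from the text of a robots.txt file. The function name and the sample file contents are assumptions for illustration; fetching the file itself is left out.

```python
def sitemaps_from_robots(robots_txt):
    """Return the URLs given on 'Sitemap:' lines of a robots.txt file."""
    found = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and whitespace
        field, _, value = line.partition(":")  # split on the first colon only
        if field.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found

# Hypothetical robots.txt contents.
example_robots = """User-agent: *
Disallow: /wp-admin/

Sitemap: https://www.example.com/sitemap_index.xml
"""
print(sitemaps_from_robots(example_robots))
```

An empty list means there is no sitemap reference, which ties in with the point above: the sitemap may have a custom name or may not exist.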

3.2 Do I have a static or dynamic sitemap?

If you’ve got a sitemap the next thing to consider is whether your sitemap file is dynamic or static. A static sitemap file is one that has been generated via a tool, such as XML Sitemaps or Screaming Frog, and is essentially a snapshot of your website at the time the sitemap is generated. This is an easy way to get a sitemap created and submitted to Google. The downside is that if you are adding, removing or changing pages in your website regularly, this will soon be out of date. The changed URLs will either 404 or 301, and you’ll soon see errors start to pop up within Search Console.

With a static sitemap, you can normally tell as it’ll include the tool used within the file itself, as shown below with a Screaming Frog sitemap.

A dynamic sitemap is generated by the website and stays up to date – adding, changing and removing URLs as needed. This is definitely the preferred option for a sitemap as Google will always have an up to date list. That being said, it can also create a lot of issues if incorrect settings are used in its generation.

Issues like using the wrong domain, HTTP instead of HTTPS, canonicalised URLs and accidentally including pages you didn’t even realise existed can all cause you a lot of problems. We’ll go through what you need to look out for in the next step.

4. What to include & avoid in your Sitemap

Once you have located your sitemap, you should analyse it for any potential issues. In this section, we’ll run through what to include, avoid and look out for within your sitemaps.

4.1 Only Include Canonical URLs

A canonical tag is a way of telling Google the preferred URL to be used for indexing, to help prevent duplicate content issues. If you’ve got similar instances of the same page, you can specify the canonical version to Google via the rel=canonical HTML element on the page. For example, it’s common for a product to be in different categories and therefore have different URLs:

/red-dresses/a-red-dress/
/maxi-dresses/a-red-dress/
/sale-dresses/a-red-dress/

/product/a-red-dress/ – Canonical

To prevent duplicate content you would want a canonical tag on all of the above URLs, pointing to the one canonical version – the /product/a-red-dress/ URL in this instance. When a URL is described as ‘canonicalised’ it means that its canonical tag does not match its own URL, meaning that it’s a duplicate page and not the preferred version to index.

Sitemaps are extremely useful for search engines as they help them to crawl your website more intelligently. But if the sitemaps contain extra URLs that are not used within the website or are canonicalised, it can have a negative effect by giving search engines more URLs to crawl.

Make sure that canonicalised URLs are not included in your sitemap – only include the preferred or canonical URLs. If your website has got multiple URLs available for the same page but no canonical tags are in place, then adding canonical tags should go to the top of your priority list!

How to Test: Crawl your sitemap using Screaming Frog or Deepcrawl and check that there are no canonicalised URLs.
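
If you want to spot-check a page without a crawler, this sketch shows the idea behind the test: read the rel=canonical element from a page’s HTML and compare it with the page’s own URL. The class and function names are assumptions, the HTML is a made-up example, and the comparison is a strict string match, so it will not treat equivalent URLs written differently as the same.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> found."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rels = (attrs.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rels and self.canonical is None:
            self.canonical = attrs.get("href")

def is_canonicalised(page_url, html):
    """True if the page declares a canonical URL other than itself."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical is not None and finder.canonical != page_url

page_html = ('<html><head><link rel="canonical" '
             'href="https://www.example.com/product/a-red-dress/"></head></html>')
print(is_canonicalised("https://www.example.com/red-dresses/a-red-dress/", page_html))
print(is_canonicalised("https://www.example.com/product/a-red-dress/", page_html))
```

URLs for which this returns True are the ones to leave out of the sitemap.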

4.2 Only use your preferred URL format

This is an important one. Make sure that your sitemap URLs are absolute links (meaning they contain the domain name in the URL) and use your preferred URL format. If your website is using HTTPS, so should your sitemap URLs. Whether you’ve decided to use www or not, and trailing or non-trailing slashes at the end of your URLs, your sitemap URLs should match your choice. Often when a website has migrated to HTTPS, we find that the sitemap URLs have not been updated to the secure protocol. This is often quite easily fixed, but for WordPress websites it can be caused by plugin conflicts, so it’s important to check after a migration if your website is using a dynamic sitemap.

How to Test: Crawl your sitemap using Screaming Frog and check that all URLs in the list include your preferred URL format.
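
The same check can be sketched in code. The defaults below (HTTPS, a www host of www.example.com and trailing slashes) are assumptions standing in for your own preferences, and the function name is hypothetical. URLs whose last segment looks like a file name (contains a dot) are skipped for the trailing slash check.

```python
from urllib.parse import urlsplit

def format_problems(url, scheme="https", host="www.example.com", trailing_slash=True):
    """List the ways a sitemap URL differs from the preferred URL format."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        return ["not an absolute URL"]
    problems = []
    if parts.scheme != scheme:
        problems.append(f"scheme is {parts.scheme}, expected {scheme}")
    if parts.netloc != host:
        problems.append(f"host is {parts.netloc}, expected {host}")
    path = parts.path or "/"
    last_segment = path.rsplit("/", 1)[-1]
    if path != "/" and "." not in last_segment and path.endswith("/") != trailing_slash:
        problems.append("trailing slash mismatch")
    return problems

print(format_problems("https://www.example.com/blog/"))
print(format_problems("http://example.com/blog"))
```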

4.3 Remove any 301’s, noindex or 404 URLs

The reason for this is to help Google to easily crawl all the URLs within your sitemap. If the crawler finds a URL within the sitemap, views it and sees a noindex tag, then it’s a waste of its time – you’ve already told Google that you don’t want this page to be indexed.

It’s important to optimise your crawl budget as much as possible. Including redirecting URLs instead of the final URL, or a URL that 404s, all contribute to unnecessary URLs being crawled. If you have a particularly low crawl budget, including a lot of these types of URLs may mean that not all of your sitemap URLs will be crawled.

How to Test: Crawl your sitemap using Screaming Frog and check for any 301 or 404 response codes being returned. Also check that none of the URLs have a noindex tag applied. Deepcrawl will also highlight broken and noindex URLs within its Sitemap Reporting section.

One case where you may want 301 URLs within your sitemap is during a site migration. You can submit the old sitemap including the old URLs along with the new sitemap to help Google find and crawl the new URLs. During a site migration, you would have set up redirects from all these old URLs to the new versions, so it’s helpful for Google to see the changes that have been made. With both sitemap versions submitted, you’ll be able to see the index count rising for the new URLs and dropping for the old URLs.

This should only be done as a temporary measure – make sure you remove these sitemaps after six months. This was confirmed by John Mueller in a Google Hangout and reported by Search Engine Roundtable.

4.4 Investigate orphaned pages

An orphaned sitemap URL is a URL that is added within the sitemap file but is not linked internally within the website.

Including orphaned URLs gives search engines more URLs to crawl but if the extra URLs are incorrect – i.e shouldn’t be viewed by users – having them in the sitemap file means they can still be indexed and compete against other correct URLs.

Another issue to be aware of is that orphaned URLs in the sitemap can be viewed as doorway pages. A doorway page is a page which is accessible to search engines (submitted via a sitemap file) but is hard to find for users (not linked internally).

As described by Google, here are two questions which describe doorway pages:

  • Do the pages duplicate useful aggregations of items (locations, products, etc.) that already exist on the site for the purpose of capturing more search traffic?
  • Do these pages exist as an “island?” Are they difficult or impossible to navigate to from other parts of your site? Are links to such pages from other pages within the site or network of sites created just for search engines?

For larger websites it may be tempting to include URL combinations which are valid pages but hard to link internally. However, these would still be seen as doorway pages, as confirmed by John Mueller on Twitter.

Consider updating your link structure to include these URLs in a natural way (i.e don’t just include them all within one page which is hard to find) or removing these URLs from your sitemap.

Typically with a CMS like WordPress, you can enable sitemap functionality through plugins such as Yoast. This is great, but you absolutely should check what is being pulled through to your sitemap and update the settings according to your website’s needs.

Example of custom sitemap options using the Yoast plugin

By default, generated sitemap settings will normally include all accessible pages and resources within your website. This will include all pages that are not manually set to noindex – that’s good, right? Not necessarily. Say you’ve made a landing page just for paid search and you don’t want search engines to index this page – it’s not been linked to within the website so that users can’t find it, but you haven’t set the page to noindex. Unfortunately, it’s highly likely that this page will be currently sitting within your sitemap file. When you submit the sitemap, you’re presenting Google with this URL to crawl, so not only are you giving Google an extra URL to crawl but you’re also allowing this page to potentially be indexed.

That was an example of one page, but you’d be surprised at the URLs you may not realise are sitting in your sitemap and being presented to search engines. And it’s not only search engines: as the majority of sitemap files are easily accessible, your competitors could also be crawling your sitemap files to see what pages are in there.

If you’re using the Yoast plugin, you can manually specify posts to not be included via their post id or setting the page manually to noindex.

If you would rather users were not able to find your sitemap file, consider giving it a custom URL. While search engines will generally crawl the common URL name such as sitemap.xml, you can call it anything you like when you submit to Google, as long as it’s a valid XML file. If you choose to do this, make sure you don’t then add a sitemap reference in your robots.txt file, that’s a big giveaway! Be sure to manually submit the sitemap to all search engines as well if using this method.

How to Test: Crawl your website and sitemap individually using Screaming Frog and compare the pages returned. Or use Deepcrawl and check the orphaned page report.

4.5 Remove the bloat

So you’ve made sure there are no errors and only valid pages are in the sitemap. The next step is to remove any links that are not important for search engines to see within the sitemap. Not every URL needs to be included in your sitemap, only the important ones.

Bear in mind that pages or sections that are not included in the sitemap but are linked internally will still be crawled. Any sections that are largely duplicate content (i.e. no unique text or optimisation), such as blog tags, blog author URLs and product filter URLs, should generally be removed from the sitemap.

With dynamic sitemaps, especially on WordPress, you may also see that child sitemaps have been generated for pages you would not expect to see, like slider sections and reviews. You should definitely remove these. Plugins like Yoast give you a lot of options in terms of what files you want to keep and remove from your sitemap.

How to Test: Crawl your sitemap using Screaming Frog, or manually check the sitemap files for layout files or pages that don’t need to be included.

4.6 Look for missing pages or sections

So you’ve removed unimportant URLs, is the important stuff actually there? Are there page types missing? A simple setting in the past may have accidentally removed key pages or sections from the sitemap, so double check page types like categories and key landing pages are still within the sitemap.

How to Test: Crawl your website and sitemap using Screaming Frog and see if any important pages are missing from the sitemap (some won’t need to be there!) Deepcrawl also provides a Pages Not in Sitemap report.

4.7 Include accurate Last Modified date

The <lastmod> tag is a field you can add to each sitemap URL. It specifies the timestamp of when the URL was last updated.

Example of a sitemap URL containing a last modified date

John Mueller has previously confirmed that while most sitemap tags, such as priority and changefreq, are not really taken into consideration, the last modified date is a small signal that can be used to speed up re-crawling of URLs. This is because a recent lastmod date tells Google that the page is more likely to have new content and thus may need to be re-crawled sooner than other URLs with an older lastmod date. It will have larger benefits for websites with a lot of URLs in their sitemaps, as it helps Google to better understand which URLs may need to be prioritised.

A question asked to John Mueller about quicker recrawling of pages

However, this modified date must be accurate. If identical or inaccurate last modified dates are used in the sitemaps they will be ignored as confirmed during a recent Webmaster Hangout.
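
As a sketch of how an accurate value might be produced, the helper below turns a real modification date into the W3C Datetime format that <lastmod> expects. The function name is an assumption, and where the modification date comes from (your CMS or database) is up to you; the point is that it should reflect the actual last change rather than the time the sitemap was generated.

```python
from datetime import date, datetime, timezone

def lastmod_value(modified):
    """Format a date or timezone-aware datetime as a W3C Datetime string."""
    if isinstance(modified, datetime):
        if modified.tzinfo is None:
            raise ValueError("attach a timezone before writing a full timestamp")
        # Drop microseconds so the value reads YYYY-MM-DDThh:mm:ss+hh:mm.
        return modified.replace(microsecond=0).isoformat()
    return modified.isoformat()  # plain date: YYYY-MM-DD

print(lastmod_value(date(2017, 5, 20)))
print(lastmod_value(datetime(2017, 5, 20, 18, 0, 15, tzinfo=timezone.utc)))
```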

4.8 Don’t worry *too much* about the Priority tag

You’ll often see Priority tags in a sitemap file. The initial purpose of the tag was to allow website owners to assign priority to URLs relative to others within the sitemap. It was supposed to allow you to tell Google which are the more important URLs to crawl first. However, it’s been confirmed that Google ignores this, really just wanting to see the URL and an accurate last modified date. Bing, however, still specifies the priority tag within its sitemap documentation, so if Bing is a priority for you, it’s worth keeping it. If you do, make sure the URLs have the correct priority levels and are not all the same.

4.9 Use correct URL encoding

Sitemap files should be UTF-8 encoded, and certain characters such as ‘&’ must use entity escape codes within the sitemap URL. Any non-ASCII characters must also be escaped. If this isn’t done, you’ll see parsing errors when you try to submit within Search Console.

For example, using & within a URL must be swapped out with &amp; within a sitemap URL in order to be read properly.

Original URL
http://www.example.com/test-category&subcategory=books

Sitemap URL with Escape Codes Added
http://www.example.com/test-category&amp;subcategory=books
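
The escaping shown above can be sketched with Python’s standard library. The function name `escape_loc` is an assumption. The sketch first percent-encodes any non-ASCII characters as UTF-8 (leaving existing percent escapes alone), then applies the XML entity escapes.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

# Characters that are legal in a URL and should not be percent-encoded here.
URL_SAFE = ":/?#[]@!$&'()*+,;=%"

def escape_loc(url):
    """Prepare a URL for use inside a <loc> element."""
    percent_encoded = quote(url, safe=URL_SAFE)
    # escape() handles &, < and >; quotes are added as extra entities.
    return escape(percent_encoded, {"'": "&apos;", '"': "&quot;"})

print(escape_loc("http://www.example.com/test-category&subcategory=books"))
print(escape_loc("http://www.example.com/ümlat.html"))
```

XML libraries that build the file for you normally apply the entity escaping automatically, so only run a URL through a helper like this once.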

How to Test: Check for errors within Search Console once submitted, or crawl your sitemap file using Screaming Frog. Look at the URI > Non-ASCII Characters report.

4.10 Add Images where possible

It’s possible to add images associated with pages to your sitemap file as well. John Mueller recommended making sure these have alt text and captions added to provide more information to Google.

5. Testing Your Sitemap

Now you know what to look out for, how do you test your sitemaps?

One of our favourite ways to test XML sitemaps is to use Screaming Frog and its awesome sitemap crawling tool. Once in the tool, go to Mode and change the setting to List. Next, you can click on the Upload button and choose from either Download Sitemap or Download Sitemap Index.

If you just want to crawl one sitemap, click on “Download Sitemap”. If you have an index file containing lots of child sitemaps and want to crawl them all, select “Download Sitemap Index”.

For both options, you just need to add the URL of the Sitemap or Sitemap Index file in the URL field and click go. It’ll then bring back the list of all the URLs found.

If you’re happy with those URLs you can then click OK and those URLs will be crawled by Screaming Frog. Once the URLs have been crawled you’ll be able to see via the tool if there are any issues with the URLs such as URLs which redirect, are 404s or are canonicalised.

To check for orphaned pages, you need to compare the list of URLs generated by a web crawl against the sitemap crawl. The web crawl will bring back any pages that are internally linked within the website and therefore accessible to search engines.

To do this make a normal web crawl of the website using Screaming Frog by making sure Mode is set to Spider. Add this list to a spreadsheet. Then add the URLs generated from the sitemap crawl to another column. By comparing the two columns you’ll be able to see if there are any URLs only located within the sitemap.
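
If the lists are long, the spreadsheet comparison can be replaced with a few lines of code. This is a sketch that assumes you have exported both crawls as plain lists of URLs; the function name and the sample URLs are made up for illustration.

```python
def compare_crawls(site_crawl_urls, sitemap_urls):
    """Compare URLs found by a website crawl with URLs listed in the sitemap."""
    linked, listed = set(site_crawl_urls), set(sitemap_urls)
    return {
        # In the sitemap but not linked internally: orphaned sitemap URLs.
        "orphaned": sorted(listed - linked),
        # Linked internally but absent from the sitemap.
        "missing_from_sitemap": sorted(linked - listed),
    }

site_crawl = ["https://www.example.com/", "https://www.example.com/blog/"]
sitemap_crawl = ["https://www.example.com/", "https://www.example.com/ppc-landing-page/"]
print(compare_crawls(site_crawl, sitemap_crawl))
```

The comparison is an exact string match, so make sure both exports use the same URL format (protocol, www and trailing slash) before comparing.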

The web crawler Deepcrawl also has a fantastic Sitemap analysis section. You can add your sitemap URLs during setup and during the crawl of the website it will perform both a website and sitemap URL crawl. The resulting report will notify you of issues within the sitemap such as errors (size issues, missing pages, canonical, 301s, 404s etc) and there is even an orphaned sitemap URLs section ready for you to view.

If you have a large website which has multiple sitemaps available it may be preferable to only check a small sample or individual sitemaps at a time. Crawling your sitemap URLs is the same as crawling your website so make sure that it doesn’t affect your website performance. If it does, consider using Speed settings to slow down the crawl.

Screaming Frog speed settings

6. How to submit a Sitemap to Google

Once you’re happy with your sitemap and have fixed any issues, you can head over to Search Console to either submit or resubmit your sitemap.

6.1 Submitting a new XML sitemap to Google

Go to Crawl > Sitemaps in the menu on the left and click Add/Test Sitemap in the top right-hand corner.

6.2 Resubmitting an XML sitemap to Google

If you’ve edited your existing sitemap you can resubmit this by ticking the box next to it and clicking Resubmit.

Once you submit, you’ll be able to see if there are some kinds of errors straight away such as an invalid URL. John Mueller has confirmed that submitted sitemaps are validated straight away.

Google will signal both errors and warnings for sitemap issues. When possible, it will also provide an example of the URL affected.

6.3 Common Sitemap XML Errors

Below is a list of the most common errors you will encounter within Search Console.

| Error/Warning | Issue | Solution |
| --- | --- | --- |
| URLs not accessible | This will show when Google has encountered an error trying to access a URL in your sitemap. | Use fetch & render to test the URL to make sure it exists. If it doesn’t, it’s likely due to incorrect 404 URLs being included in the sitemap. |
| URL not allowed | The URLs in the sitemap may have the wrong domain specified, such as http instead of https or www instead of non-www. The sitemap could also be on a different level to the URLs. | Make sure that the domain used in your sitemap URLs matches the Search Console account where it is being submitted. Also, check that the sitemap file is on the same level as the URLs. |
| Some URLs in the Sitemap have a high response time | Your sitemap URLs are slow to load. | Test the URLs using a page speed testing tool such as Google PageSpeed Insights or GTMetrix. |
| Sitemap File Size Error | Your sitemap file is larger than the maximum 50MB limit when uncompressed. | Break the sitemap down into child sitemaps and submit the sitemap index file. |
| Invalid Date | A sitemap URL has an invalid date or format error. | Make sure the <lastmod> dates use W3C Datetime encoding and are in the right format, e.g. 2017-05-20 or 2017-05-20T18:00:15+00:00 |
| Invalid URL | A URL in your sitemap is not valid. | Check that your URL does not contain unsupported characters, spaces or quote characters. Try accessing it using a browser. |
| Parsing Error | Google could not parse certain URLs within the sitemap. | This may be due to certain characters not being properly escaped. URLs containing characters such as & should use entity escape codes within the sitemap URL. The entity escape code for the & symbol is &amp; |
| Too many sitemaps in sitemap index file | Your index file contains more than 50,000 sitemaps. | Split your sitemap index file into multiple sitemap index files. |
| Too many URLs in sitemap | Your sitemap has more than 50,000 URLs. | Split your sitemap into multiple sitemaps, and consider using a sitemap index file to manage them. |

7. Analysis of your Sitemap in Search Console

The sitemap area in Search Console is an important place to keep track of how URLs are being indexed in Google and will highlight any errors or issues such as 404s or high response times in your sitemaps. As the sitemaps give Google an important list of URLs to crawl it is vital to make sure that the sitemap list is as clean and efficient as possible.

7.1 Check Indexed vs Submitted Count

The submitted versus indexed count is one of the most important figures within the Sitemaps section in Search Console. As the names suggest, this tells you how many of the submitted URLs in your sitemap files are indexed.

A submitted URL is a URL which has been provided in a Sitemap for Google to crawl.
An indexed URL is a Sitemap URL which has been indexed by Google.

Ideally, you want the submitted and indexed URL count to be nearly the same, as this tells you that Google has found all of the URLs you’ve provided useful and unique enough to be indexed. If the submitted and indexed count is far apart, and there have been no recent URL changes, it suggests there could be problems with the sitemaps.

Two examples of a similar and different sitemap indexed/submitted count

If you do have a large difference in submitted vs indexed, you should go through the earlier list of potential sitemap issues and then resubmit the sitemap if any changes are made.

Also look for indexation drops over time and investigate the cause. For example, one website moved its images to a CDN without a custom URL. Its indexed URL count dropped because the images were no longer attributed to the main website domain.

If you have child sitemaps available you can more clearly see individual submitted and indexed counts. This will allow you to see if there are indexing issues with certain sections or categories for example.

If you’re seeing that more pages are indexed than submitted, then this is probably due to the same URL being within more than one sitemap. Make sure a URL is only listed once.
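
A quick way to find such duplicates is sketched below. It assumes you already have the URLs from each sitemap as lists, keyed by sitemap file name; the function name and sample data are made up for illustration.

```python
from collections import defaultdict

def urls_in_multiple_sitemaps(sitemaps):
    """Map each URL that appears in more than one sitemap to those sitemaps."""
    seen = defaultdict(set)
    for sitemap_name, urls in sitemaps.items():
        for url in urls:
            seen[url].add(sitemap_name)
    return {url: sorted(names) for url, names in seen.items() if len(names) > 1}

duplicates = urls_in_multiple_sitemaps({
    "sitemap-blog.xml": ["https://www.example.com/blog/", "https://www.example.com/about/"],
    "sitemap-pages.xml": ["https://www.example.com/about/"],
})
print(duplicates)
```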

Regularly check your Sitemaps within Search Console for any errors that may show up, especially if your sitemap is dynamic. If you have recently migrated, you should keep checking your sitemap index list to make sure that the new URLs are being indexed.

7.2 Check / Tidy up your current Sitemaps

If you already have a large sitemap section within Search console, it’s worth spending a bit of time having a tidy up. Make sure there are no individual sitemaps submitted which are already submitted within an index section. Submitting a child sitemap on its own and then submitting the index sitemap file which includes that child sitemap will raise the URLs submitted count giving a false number of URLs submitted.

There may be old sitemaps which are no longer being used. If they are not part of a migration within the last six months, we would recommend deleting these.

8. Summary

Sitemaps are a fantastic tool to use to help Google find and understand the important URLs within your website. However, it’s vital to make sure that the URLs within a sitemap are correctly formatted and contain the correct URLs within the size limits allowed.

  • Ensure you have a sitemap file for your website – utilise child sitemaps where possible
  • Check your sitemap for errors – such as incorrectly formatted, canonicalised, broken or redirecting URLs
  • Improve your sitemap – include an accurate last modified date & images
  • Investigate your sitemap URLs – look for orphaned sitemap URLs or missing pages/sections
  • Submit your sitemap to search engines – analyse which pages are being indexed
  • Regularly test your sitemap – make sure no new errors or unexpected pages crop up

If you would like some advice regarding your website’s sitemap or any aspects of Technical SEO please don’t hesitate to contact us!
