Last Updated: October 2, 2019
When completing a technical audit for a new client, I’ve found that sitemaps are a constantly recurring issue. Because a sitemap isn’t a requirement, adding one to your website is a task that can get put on the back burner, especially if it doesn’t seem high on a priority list. Existing sitemaps can also easily get forgotten about and, more importantly, they could be sending the wrong information about your website to search engines.
This extensive guide will tell you what sitemaps are, as well as their SEO benefits. We’ll be covering the sitemap guidelines and what to watch out for, but also how you can improve any existing sitemaps. We’ll also explain how to submit sitemaps in Search Console and how to analyse them using the Index Coverage report. This guide focuses specifically on XML sitemaps, and our recommendations are based on Google’s guidelines. Let’s begin!
Contents
- 1. Introduction to Sitemaps
- 2. Sitemap Benefits
- 3. Finding Your Sitemap
- 4. What to include & avoid in your Sitemaps
- 4.1 Only include canonical URLs
- 4.2 Use preferred URL format
- 4.3 Remove 301s, noindex or 404 URLs
- 4.4 Investigate orphaned pages
- 4.5 Remove the bloat
- 4.6 Check for missing pages/sections
- 4.7 Include accurate last modified date
- 4.8 Don’t worry about the priority tag
- 4.9 Use correct URL encoding
- 4.10 Beware of your Sitemap sizes
- 4.11 Categorise Sitemaps where possible
- 4.12 Add supported media
- 4.13 Indicate alternate country/language URLs
- 5. Testing Your Sitemap
- 6. How to submit a Sitemap to Google
- 7. Analysis of your Sitemap in Search Console
- 8. Summary
1. Introduction to Sitemaps
1.1 What is an XML sitemap?
A sitemap is a file that provides a list of URLs on your website for search engines. This list of URLs will help Google understand more about your website when it crawls it, in terms of its organisation and site structure. In most cases, a sitemap is formatted as an XML file. Whilst Google does support sitemaps created as an RSS feed or a .txt file, the XML file is the most common one.
1.2 What does a sitemap look like?
A standard sitemap in XML format as shown within Sitemaps.org looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
The main components of a sitemap are:
- <urlset> (required) – Encapsulates the file and references the current protocol standard.
- <url> (required) – The container or parent tag of each URL. All elements associated with an individual URL are nested within this.
- <loc> (required) – The URL of the page. It should be a full URL containing the domain name.
- <lastmod> (optional) – The last modified date of the URL, in W3C Datetime format.
- <changefreq> (optional) – How frequently a page changes. Often ignored by search engines.
- <priority> (optional) – Priority of the URL compared to others in the list, ranging from least to most important (0.0 to 1.0). It’s been confirmed that Google ignores this; however, search engines like Bing do use it.
1.3 Sitemap Guidelines
When making a sitemap, there are limits on the file size, the number of URLs and the format of the URLs that can be added, but also some optional extras you can include.
Limits / Rules
- Be aware of size limitations – Each sitemap should contain no more than 50,000 URLs, and the file size should not exceed 50MB (uncompressed). Although these are the limits, ideally you wouldn’t want to get anywhere near either of them. Break the sitemap down into smaller sitemaps to prevent potential issues when your sitemap is requested.
- Use a sitemap index file if needed – In the case of multiple sitemaps this allows for just one sitemap URL to be submitted to Google (See point 1.4)
- Only use canonical URLs – Only provide Google with your preferred URLs that contain your full domain. (See point 4.1)
- Use the correct format – All URLs should be properly formatted and escaped if certain characters are used. (See point 4.9)
Optional Extras
- Add media types – You can point to media types using sitemaps such as Images, News and Videos. (See point 4.12)
- Indicate alternate languages/regions (if applicable) – specify which pages are for different countries and languages using hreflang tags (see point 4.13)
1.4 What are Index and Child Sitemaps?
A standard sitemap XML file contains a list of URLs from your website. However, if your website has a lot of URLs or has specific URLs for different sections you may need to create a Sitemap Index and Child Sitemap files. A Sitemap Index file is a listing page containing links to individual sitemap files. Having an index file allows you to be able to submit just the one index file to Google rather than submitting multiple individual sitemaps.
Example of a sitemap index file, following the Sitemaps.org protocol (a minimal illustration with placeholder URLs and sitemap names):
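<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap-products.xml</loc>
    <lastmod>2019-10-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap-categories.xml</loc>
    <lastmod>2019-10-01</lastmod>
  </sitemap>
</sitemapindex>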
A child sitemap is a sitemap file which is linked to from an index file. It’s in exactly the same format as a normal sitemap. An index file can contain up to 50,000 child sitemap references, while sitemaps can contain 50,000 URLs each. Creating child sitemaps can be useful if you want to group certain URLs in different sections, for blogs or products.
Don’t be tempted to nest your sitemap or sitemap index files. Adding a sitemap within a sitemap index file is fine, but you shouldn’t nest sitemap index files within other sitemap index files as it’s not supported and URLs may not be read.
We support sitemap index files, but not nested sitemap files (sitemap in sitemap index is OK, sitemap in sitemap not)
— 🍌 John 🍌 (@JohnMu) 13 August 2017
2. Sitemap Benefits
Having a sitemap will not directly affect the rankings of a page, and sitemaps are not a requirement by Google. Additionally, including a URL within a sitemap does not necessarily mean that it will be indexed. Instead, it’s a strong hint to Google that you consider this URL important enough to be considered for indexing.
So, what’s the point? The key thing to remember is that a sitemap can help Google better understand your website. Examples of this include being able to notify Google about any new or recently changed pages and help them find key URLs from websites with large or complicated structures. These things can help increase visibility within Google’s index.
In June 2019, Gary Illyes again confirmed that XML sitemap files are the second most important source of URLs for Googlebot.
Sitemaps are the second Discovery option most relevant for Googlebot @methode #SOB2019
— Enrique Hidalgo (@EnriqueStinson) 15 June 2019
This clearly explains that even though an XML sitemap is not a requirement, it should be high on the priority list for every website owner. Including individual attributes for each URL within your sitemap, such as <lastmod>, will notify crawlers when a page was last updated, which can signal to a crawler that it may need to be prioritised higher in the crawl list.
Sitemaps are also key for new websites or site migrations when you want to provide Google with the new list of URLs to crawl and index. Although sitemaps are not a requirement, it’s particularly recommended to include one if:
- Your website is large with a complicated URL structure and many internal links
- You have a new website which has little or no backlinks
- Your website has recently migrated
- Your website is constantly changing with pages being added, removed and changed, such as an e-commerce website
3. Finding Your Sitemap
In this section, we’ll go through how to find your sitemap and the differences between a dynamic and static sitemap file.
3.1 How do I find my sitemap?
Generally, a sitemap file is found on the root – the first level – of your website at /sitemap.xml. E.g https://www.ricemedia.co.uk/sitemap.xml
The URL, however, can be whatever you want it to be, as long as it’s in the correct XML format. A sitemap can only contain URLs at or below its own location. So, if your sitemap was at website.com/sitemap.xml it could include all URLs within website.com. If the sitemap was located in a folder such as website.com/folder/sitemap.xml, it could only reference URLs located in that folder, e.g website.com/folder/page-1.html but not website.com/page-2.html.
If you’re still not sure where your sitemap is, you can also check your robots.txt file. The robots.txt file is always found in the same location at the root of your website, e.g https://www.ricemedia.co.uk/robots.txt. In the robots.txt file, there is sometimes a reference to a sitemap file which may be of help.
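If present, the reference is a single Sitemap directive, which looks like this (placeholder domain shown):
Sitemap: https://www.example.com/sitemap.xml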
You can also check in Search Console to see if one has previously been submitted to Google. Once in Search Console, go to the Sitemaps section and see if there is an entry under Submitted Sitemaps. If you are still unable to find one, it may have a custom name or one does not currently exist. It’s best to check with your website developers or CMS in that case.
3.2 Do I have a static or dynamic sitemap?
If you’ve got a sitemap the next thing to consider is whether your sitemap file is dynamic or static. A static sitemap file is one that has been generated via a tool, such as XML Sitemaps or Screaming Frog, and is essentially a snapshot of your website at the time the sitemap is generated. This is an easy way to get a sitemap created and submitted to Google.
The downside is that if you are adding, removing or changing pages on your website regularly, this will soon be out of date. The changed URLs will either 404 or 301, and you’ll soon see errors start to pop up within Search Console. You can normally tell a static sitemap because the tool used to generate it is often referenced within the file itself, as with sitemaps created by Screaming Frog.
A dynamic sitemap is generated by the website and stays up to date – adding, changing and removing URLs as needed. This is definitely the preferred option for a sitemap as Google will always have an up to date list. That being said, it can also create a lot of issues if incorrect settings are used in its generation.
Issues like using the wrong domain, HTTP instead of HTTPS, canonicalised URLs and accidentally including pages you didn’t even realise existed can all cause you a lot of problems. We’ll go through what you need to look out for in the next step.
4. What to include & avoid in your Sitemaps
Once you have located your sitemap, you should analyse it for any potential issues. In this section, we’ll run through what to include, avoid and look out for within your sitemaps.
4.1 Only Include Canonical URLs
A canonical tag is a way of telling Google the preferred URL to be used for indexing, to help prevent duplicate content issues. If you’ve got duplicate instances of the same page, you can specify the canonical version to Google via the rel=canonical HTML element on the page.
For example, it’s common for a product to be in different categories and therefore have different URLs:
/red-dresses/a-red-dress/
/maxi-dresses/a-red-dress/
/sale-dresses/a-red-dress/
/product/a-red-dress/ – Canonical
To prevent duplicate content you would want a canonical tag on all of the above URLs, pointing to the canonical version, e.g the /product/a-red-dress/ URL in this instance. When a URL is described as ‘canonicalised’ it means that its canonical tag does not match the current URL, meaning that it’s a duplicate page and not the preferred version to index.
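For illustration, the canonical tag that would sit in the <head> of each of the duplicate dress URLs above might look like this (placeholder domain assumed):
<link rel="canonical" href="https://www.example.com/product/a-red-dress/" />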
Sitemaps are extremely useful for search engines as they help them to crawl your website more intelligently. But if the sitemaps contain extra URLs that are not used within the website or contain canonicalised URLs, it can have a negative effect by giving search engines more URLs to crawl.
#Sitemap errors don't impact #rankings…but they can slow down #indexing 😐 says @JohnMu here https://t.co/k4r09guntQ
— DeepCrawl (@DeepCrawl) May 18, 2017
John Mueller has confirmed that Google uses the URLs in the XML sitemaps to help decide on the canonical version to use. Make sure that canonicalised URLs are not included in your sitemap – only include the preferred or canonical URLs. If your website has got multiple URLs available for the same page but no canonical tags are in place, then adding canonical tags should go to the top of your priority list!
How to Test: Crawl your sitemap using Screaming Frog, Deepcrawl or Sitebulb and check that there are no canonicalised URLs.
4.2 Only use your preferred URL format
This is an important one. Make sure that your sitemap URLs are absolute links (meaning they contain the domain name) and use your preferred URL format. If your website is using HTTPS, so should your sitemap URLs. Whether you’ve decided to use www or not, or trailing slashes at the end of your URLs, your sitemap URLs should match that choice.
Often when a website has migrated to HTTPS, we find that the sitemap URLs have not been updated to the secure protocol. This is often quite easily fixed, but for WordPress websites can be caused by plugin conflicts, so it’s important to check after a migration if your website is using a dynamic sitemap.
How to Test: Crawl your sitemap using Screaming Frog, Deepcrawl or Sitebulb and check that all URLs in the list include your preferred URL format.
4.3 Remove any 301s, noindex or 404 URLs
The reason for this is to help Google easily crawl all the URLs within your sitemap. If the crawler finds a URL within the sitemap, visits it and sees a noindex tag, then it’s a waste of its time – you’ve already told Google that you don’t want this page to be indexed. It’s important to optimise your crawl budget as much as possible.
Including redirecting URLs instead of the final URL, or a URL that 404s, all contribute to unnecessary URLs being crawled. If you have a particularly low crawl budget, including a lot of these types of URLs may mean that not all of your sitemap URLs will be crawled.
How to Test: Crawl your sitemap using Screaming Frog and check for any 301 or 404 response codes being returned. Also, check that none of the URLs have a noindex tag applied. Deepcrawl and Sitebulb will also highlight broken and noindex URLs within their sitemap reporting sections. Search Console will also highlight submitted URLs which have a noindex tag.
Why you *might* want to include 301, noindex or 404 URLs within your Sitemap
One case where you may want 301 URLs within your sitemap is during a site migration. You can submit the old sitemap containing the old URLs alongside the new sitemap to help Google find and crawl the new URLs. During a site migration you will have set up redirects from all these old URLs to the new versions, so it’s helpful for Google to see the changes that have been made.
By adding both sitemap versions, you’ll be able to see the index count rising for the new URLs and dropping for the old URLs. This should only be done as a temporary measure – make sure you remove these sitemaps after six months. This was confirmed via a Google Hangout with John Mueller and explained here by Search Engine Roundtable.
You can also help to speed up de-indexing of 404 and noindex URLs by submitting them within a separate Sitemap. As discussed later, by having them in their own sitemap you can monitor how quickly they are getting removed using the Index Coverage report.
“One way to speed this up could be to submit a temporary sitemap file listing these URLs with the last modification date (eg, when you changed them to 404 or added a noindex), so that we know to recrawl & reprocess them.” “This is something you’d just want to do for a limited time (maybe a few months), and then remove, so that you don’t end up in the long run with a sitemap file that’s not needed by your site” John Mueller, Jan 2019
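Following that suggestion, a temporary sitemap for removed URLs is just a standard sitemap whose <lastmod> values record when each page was 404’d or noindexed. A minimal sketch with a placeholder URL and date:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/discontinued-product/</loc>
    <!-- the date the URL was changed to 404 or set to noindex -->
    <lastmod>2019-09-15</lastmod>
  </url>
</urlset>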
4.4 Investigate orphaned pages
An orphaned sitemap URL is a URL that is included in the sitemap file but is not linked internally within the website. Not only do orphaned URLs give search engines more URLs to crawl, but often these URLs are incorrect – i.e they shouldn’t be viewed by users – and having them in the sitemap file means they can still be indexed and compete against other, correct URLs.
Another issue to be aware of is that orphaned URLs in the sitemap can be viewed as doorway pages. A doorway page is a page which is accessible to search engines (submitted via a sitemap file) but is hard to find for users (not linked internally). As described by Google, here are three examples of doorway pages:
- Having multiple domain names or pages targeted at specific regions or cities that funnel users to one page
- Pages generated to funnel visitors into the actual usable or relevant portion of your site(s)
- Substantially similar pages that are closer to search results than a clearly defined, browseable hierarchy
For larger websites it may be tempting to include URL combinations which are valid pages but hard to link internally; however, these would still be seen as doorway pages, as confirmed below by John Mueller on Twitter. Consider updating your link structure to include these URLs in a natural way (i.e don’t just include them all within one page which is hard to find) or removing these URLs from your sitemap.
Typically with a CMS like WordPress, you can enable sitemap functionality through plugins such as Yoast. This is great, but you should absolutely make sure you check what is being pulled through to your sitemap and update the settings according to your website’s needs.
By default, generated sitemap settings will normally include all accessible pages and resources within your website. This will include all pages that are not manually set to noindex – that’s good, right? Not necessarily. Say you’ve made a landing page just for paid search and you don’t want search engines to index this page – it’s not been linked to within the website so that users can’t find it, but you haven’t set the page to noindex.
Unfortunately, it’s highly likely that this page will be currently sitting within your sitemap file. When you submit the sitemap, you’re presenting Google with this URL to crawl, so not only are you giving Google an extra URL to crawl but you’re also allowing this page to potentially be indexed.
To remove certain page types in Yoast, a popular SEO plugin, go to the individual page type settings: any page type for which you select ‘no’ under ‘Show [page type] in search results?’ will automatically be removed from the XML sitemap. You’d be surprised at the URLs you may not realise were sitting in your sitemap being presented to search engines. And it’s not only search engines: as the majority of sitemap files are easily accessible, your competitors could also be checking out your sitemap files to see what pages are in there.
If you would rather users were not able to find your sitemap file, consider giving it a custom URL. While search engines will generally crawl the common URL name such as sitemap.xml, you can call it anything you like when you submit to Google, as long as it’s a valid XML file. If you choose to do this, make sure you don’t then add a sitemap reference in your robots.txt file, that’s a big giveaway! Be sure to manually submit the sitemap to all search engines as well if using this method.
How to Test: Crawl your website and sitemaps using a website crawler such as Screaming Frog, Deepcrawl or Sitebulb, look at the orphaned URLs report.
4.5 Remove the bloat
The next step is to remove any links that are not important for search engines to see within the sitemap. Not every URL needs to be included in your sitemap, only the important ones. Bear in mind that pages or sections that are not included in the sitemap but are linked internally will still be crawled.
Any sections that are largely duplicate content (i.e no unique text or optimisation) such as blog tags, blog author URLs and product filter URLs should generally be removed from the sitemap. With dynamic sitemaps, especially on WordPress, you may also see that child sitemaps have been generated for pages you would not expect to see, like slider sections and reviews. You should definitely remove these.
How to Test: Crawl your sitemap using Screaming Frog, Deepcrawl or Sitebulb or manually check the sitemap files to see whether there are layout files or pages that don’t need to be included.
4.6 Look for missing pages or sections
So you’ve removed unimportant URLs, is the important stuff actually there? Are there page types missing? A simple setting in the past may have accidentally removed key pages or sections from the sitemap, so double-check page types like categories and key landing pages are still within the sitemap.
How to Test: Crawl your website and sitemaps using Screaming Frog, Deepcrawl or Sitebulb and check the Pages Not in Sitemaps report.
4.7 Include accurate Last Modified date
The <lastmod> tag is a field you can add to each sitemap URL; it specifies the timestamp of when the URL was last updated. An example of a sitemap URL entry containing a last modified date (placeholder URL shown) looks like this:
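<url>
  <loc>https://www.example.com/blog/sitemap-guide/</loc>
  <lastmod>2019-10-02T14:30:00+00:00</lastmod>
</url>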
John Mueller has previously confirmed that while most sitemap tags, such as priority and changefreq, are not really taken into consideration, the last modified date is a small signal that can be used to speed up re-crawling of URLs. This is because the lastmod tag tells Google that a page is likely to have new content and thus may need to be re-crawled sooner than other URLs with an older lastmod date.
It will have larger benefits for websites with a lot of URLs in their sitemaps, as it helps Google to better understand which URLs may need to be prioritised.
However, this modified date must be accurate. If identical or inaccurate last modified dates are used in the sitemaps they will be ignored as confirmed during a recent Webmaster Hangout.
Identical last modified dates in #Sitemaps for all pages will be ignored says @JohnMu here: https://t.co/Q0JqpaAPgt #SEO
— DeepCrawl (@DeepCrawl) 13 March 2017
4.8 Don’t worry *too much* about the Priority tag
You’ll often see priority tags in a sitemap file. The initial purpose of the tag was to allow website owners to assign priority to URLs relative to others within the sitemap. It was supposed to allow you to tell Google which are the more important URLs to crawl first.
However, it’s been confirmed that Google ignores this tag, really just wanting to see the URL and an accurate last modified date. Bing, on the other hand, still specifies the priority tag within its sitemap documentation, so if Bing is a priority for you it’s worth keeping – just make sure the URLs have sensible priority levels and are not all the same.
We ignore priority in sitemaps.
— 🍌 John 🍌 (@JohnMu) August 17, 2017
4.9 Use correct URL encoding
Sitemap files should be UTF-8 encoded, and certain characters such as ‘&’ should use entity escape codes within the sitemap URL. Any non-ASCII characters must also be escaped. If this isn’t done, you’ll see parsing errors when you try to submit within Search Console.
For example, an ampersand within a URL must be swapped out with the entity escape code &amp; within a sitemap URL in order to be read properly.
Original URL: http://www.example.com/test-category&subcategory=books
Sitemap URL with escape code added: http://www.example.com/test-category&amp;subcategory=books
John Mueller confirmed more about parsing errors affecting sitemaps in a Google Webmaster Hangout from June 2019:
“If one individual URL element within an XML sitemap has an error, this will not impact the way Google is able to parse and read the sitemap as a whole. However, if the element is broken in a way that impacts the parsing of the rest of the sitemap, then the XML file becomes unreadable and will not be usable as a sitemap.”
How to Test: Check for errors within Search Console once submitted, or crawl your sitemap file using Screaming Frog. Look at the URI > Non-ASCII Characters report.
4.10 Beware of your Sitemap size
As mentioned earlier and within the Google documentation, sitemap files can be up to 50MB (uncompressed) and hold up to 50,000 URLs. If you have more, you should create additional sitemap files and link to them via an index sitemap. In practice, however, you would want your sitemap files to avoid going anywhere near these limits.
A sitemap near that size would take a long time to download and it’s not certain that Google would end up crawling all of the URLs it contains, especially if there are issues with some URLs.
When we are auditing sitemaps we recommend reducing the size and number of URLs of large sitemaps where possible. Barry Adams from Polemic Digital suggests limiting large sitemap files to 10,000 URLs each, having found that this number has led to higher degrees of indexing.
4.11 Categorise Sitemaps where possible
To better analyse the performance of your sitemap, it’s also recommended to group similar URLs into individual sitemaps, e.g product URLs in one sitemap, categories in another etc. This helps because you can view the Index Coverage report for those specific URL groups in Search Console, but it also keeps things organised and is another way to reduce the number of URLs in each sitemap file.
“Another thing that sometimes helps is to split the sitemap files up into logical chunks for your website, so that you understand more about where pages are not being indexed. Then you can see: are the products not being indexed, or the categories not being indexed? And then you can drill down more and more and figure out where there might be problems. That said, we don’t guarantee indexing, so just because a sitemap file has a bunch of URLs it doesn’t mean that we will index all of them. That’s still something to keep in mind, but obviously you can try to narrow things down a little bit and see where you could improve that situation.” John Mueller, 2017
4.12 Add supported media to your Sitemaps (optional)
You can also extend your sitemaps to include media types such as News, Images and Videos. Creating a Google News Sitemap lets you control what you submit to Google News; there are some fairly strict requirements, so make sure you check the guidelines before creating one. For images and videos on your pages, you can either embed them within an existing sitemap or create a separate sitemap for them.
Adding the images used on your pages to a sitemap is encouraged where possible, as it specifies to Google that you would definitely like these crawled and indexed within Image Search results. This is especially important if Google is struggling to find them during a crawl due to JavaScript issues. John Mueller recommended making sure these images have alt text and captions added to provide more information to Google.
Want to make your #sitemaps more informative? #SEO Tip: Add images says @JohnMu here: https://t.co/5sU94N9Nu6 #Google
— DeepCrawl (@DeepCrawl) May 10, 2017
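To give an idea of the markup, here is a sketch of a page entry using the image sitemap extension (placeholder URLs assumed; note the additional image namespace declared on <urlset>):
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://www.example.com/red-dresses/</loc>
    <image:image>
      <image:loc>https://www.example.com/images/a-red-dress.jpg</image:loc>
      <image:caption>A model wearing the red maxi dress</image:caption>
    </image:image>
  </url>
</urlset>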
4.13 Indicate alternate country/language URLs (optional)
If you have multiple versions of a page created for users in different regions or languages, you can specify these variations to Google via hreflang tags. You can implement these in three ways: within the HTML, via HTTP headers or within the sitemap. Google recommends choosing only one method of implementation to avoid potential errors or confusion; the most common appears to be the HTML method. If adding hreflang within a sitemap file, you would add an xhtml:link child element for each language/region version, which looks like the below:
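A sketch with placeholder URLs – note that each <url> entry lists every language/region variant, including itself, and that the xhtml namespace is declared on <urlset>:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://www.example.com/en-gb/</loc>
    <xhtml:link rel="alternate" hreflang="en-gb" href="https://www.example.com/en-gb/"/>
    <xhtml:link rel="alternate" hreflang="de-de" href="https://www.example.com/de-de/"/>
  </url>
  <url>
    <loc>https://www.example.com/de-de/</loc>
    <xhtml:link rel="alternate" hreflang="en-gb" href="https://www.example.com/en-gb/"/>
    <xhtml:link rel="alternate" hreflang="de-de" href="https://www.example.com/de-de/"/>
  </url>
</urlset>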
Find out more in the Google hreflang documentation and also this fantastic hreflang tags implementation guide by Eoghan Henn. Within this guide, he states that he tends to see more issues when users implement hreflang tags via a sitemap than with the other two methods.
5. Testing Your Sitemap
Now you know what to look out for, how do you test your sitemaps?
5.1 Crawl your Sitemap individually
When I’m auditing a large website, I find that it’s best to crawl sitemaps on their own first so that you can get an idea of any issues that need to be addressed such as 404s, redirects and canonicalised URLs. There are a few tools which can help with this.
One of our favourites is Screaming Frog and its awesome sitemap crawling tool. To crawl just your sitemaps, go to Mode and change the setting to List. Next, click on the Upload button and choose either Download Sitemap or Download Sitemap Index.
If you just want to crawl one sitemap, click on “Download Sitemap”. If you have an index file containing lots of child sitemaps and want to crawl them all, select “Download Sitemap Index”. For both options, you just need to add the URL of the sitemap or sitemap index file in the URL field and click Go. It’ll then bring back a list of all the URLs found.
This is also a good way to quickly check the number of URLs in a sitemap file if you are worried they might be too large. If you’re happy with those URLs you can then click OK and those URLs will be crawled by Screaming Frog. Once the URLs have been crawled you’ll be able to see via the tool if there are any issues with the URLs, such as URLs which redirect, are 404s or are canonicalised.
5.2 Crawl your Sitemap during a web crawl
In order to check for orphaned pages or URLs that are missing from the sitemaps, you need to compare the URLs in your sitemap to those in your web crawl. The more manual way of doing this is to compare the list of URLs generated by a web crawl against the sitemap crawl. You would crawl the website with a tool such as Screaming Frog in Spider mode.
You would then delete the irrelevant URLs that you wouldn’t want to appear in the sitemap, such as filter or paginated URLs and add the final web crawl list to a spreadsheet. You would then add the URLs from your earlier sitemap crawl to the spreadsheet and compare the two for any orphaned or missing URLs.
The other option is to use a crawling tool that will crawl the website and the sitemap XML URLs at the same time. The tool then does the comparison for you and will find orphaned URLs and URLs missing from the sitemap. Our favourites are listed below:
Screaming Frog
Screaming Frog is a great desktop website crawler which can crawl your sitemap files on their own or combined with a web crawl. It has a free version of up to 500 URLs. Before you start your crawl, update your settings at Configuration > Spider.
If your sitemap file is linked to in your robots.txt file, you can tick ‘Auto Discover XML Sitemaps via robots.txt’. If it’s not linked in the robots.txt, or you would rather add the link yourself, tick ‘Crawl These Sitemaps’ and add the link there. Once the website is crawled, for the XML sitemaps to be analysed you’ll need to go to Crawl Analysis > Start. Once finished, you’ll see the issues listed within the sitemaps section on the right-hand side.
If you have a large website with multiple sitemaps available, it may be preferable to only check a small sample or individual sitemaps at a time. Crawling your sitemap URLs is the same as crawling your website, so make sure it doesn’t affect your website’s performance. If it does, consider using the speed settings in Screaming Frog to slow down the crawl.
Check out this useful guide on how to audit your XML sitemaps using Screaming Frog. It explains all the different settings you can use and how you can crawl a sitemap or sitemap index file on its own or combined with a web crawl.
Sitebulb
Sitebulb is another brilliant website crawler which has a lot of great features including the ability to crawl and analyse your sitemap files during a website crawl. In the Audit Setup screen make sure you tick ‘XML Sitemaps’ in the Select URLs sources to Audit section. If you click the down arrow on the right you can see the sitemap it has automatically selected. You can keep this or delete and add your own.
Once the crawl has finished, you can see the data associated with the sitemaps in the XML sitemaps section in a very easy to understand way.
After a crawl, Sitebulb provides suggestions or issues in the form of ‘hints’. As Sitebulb explains in its documentation, any hints given for an XML sitemap crawl will normally be errors which need to be fixed.
Deepcrawl
Deepcrawl is a cloud-based web crawler which also has a fantastic Sitemap analysis section. During setup, you can find sitemaps automatically via the robots.txt or add them manually. Once complete you can see Sitemap specific reports such as orphaned URLs and pages not in Sitemaps.
Deepcrawl has some detailed guides on how you can use the software to monitor and audit your XML sitemaps. One of the great things about Deepcrawl is that you can schedule your XML sitemap crawls on a regular basis. You can also set up tasks to monitor issues, which will then send an email summary.
5.3 Manually check findings
While using the tools mentioned above will highlight orphaned URLs or URLs which are not included within the sitemap, make sure that you manually check the lists before specifying the actions to be taken. For example, if some URLs in the web crawl are highlighted as not included in the Sitemap, double-check to make sure these are actually URLs that you do want to be included.
Often you will find paginated or filter URLs shown as not in the sitemaps, but you wouldn’t actually want these added to your sitemap. Sometimes orphaned URLs can be in the sitemap accidentally – for example, old URLs or ones created for testing purposes or paid search. In that case, you would want to remove them from the sitemap and potentially noindex, 404 or redirect them rather than linking to them within the website.
5.4 Additional actions after Sitemap URL removal
Leading on from the previous point, if you’ve found URLs in the sitemap that shouldn’t be seen by search engines you will have to take more action than just removing them from the sitemap file. This is because removing a URL from the XML sitemap will not automatically stop it being crawled once it has been found by Google.
Let’s use the example of a page created especially as a landing URL for a paid search ad; this shouldn’t be indexed and is just for users. When making this page in a CMS like WordPress, unless you specifically set it to noindex upon creation, the CMS would automatically add it to the XML sitemap (if you have this set up). If you want the page to be removed from the index but still visible to users, and you don’t mind people potentially seeing the page in the short term, use the noindex, nofollow robots meta tag.
If you want to stop potential searchers seeing the page immediately, you could delete it (but don’t add a redirect), which will result in the original URL returning a 404, and then move the content onto a new URL which has the noindex tag. It will still take time for the old URL to be removed from the search results, so in the meantime you could use the URL removal tool in Search Console to temporarily hide it.
6. How to submit a Sitemap to Google
Once you’re happy with your sitemap and have fixed any issues, you can head over to Google Search Console for submission.
6.1 Submitting a new Sitemap in Search Console
In Search Console, head to Index > Sitemaps within the left sidebar. Add the URL of your index sitemap file, or the URL of an individual sitemap, into the Add a New Sitemap box and press Submit. Once you submit, you’ll be able to see any errors straight away; John Mueller has confirmed that submitted sitemaps are validated directly after submission.
Working on #Sitemaps? @JohnMu says #Google validates them immediately after submission here https://t.co/jXEpOuxK0k #SoQuick!
— DeepCrawl (@DeepCrawl) September 11, 2017
Once submitted, you’ll be able to see it within the list. For each sitemap it will display the name, sitemap type, when it was last read, status, number of discovered URLs and a link to view more details. Upon submission, the status is the main thing you need to check: if there is an error with the sitemap file it will show up here. There are three possible statuses.
- Success: The sitemap was loaded and processed, and the URLs will be submitted for indexing.
- Has errors: The sitemap was parsed successfully but there are errors. Any URLs that were parsed successfully will be submitted. The sitemap report will identify any specific issues with the URLs.
- Couldn’t fetch: The sitemap couldn’t be reached. Often this is due to an incorrect URL being submitted which returns a 404. Recheck the URL and that the sitemap is live at the URL you have submitted; you can use the URL Inspection Tool in Search Console to analyse it further.
Another thing to note is that although you can submit image, video and news URLs within a sitemap, the Sitemap Report in Search Console does not currently show any data for these URL types.
6.2 Removing a Sitemap from Search Console
If you would like to remove a sitemap from Search Console, this is also easy to do. You may want to remove a sitemap for various reasons:
- It contains legacy URLs that were used to help Google during a site migration. Google recommends keeping these for up to six months after a migration to assist with indexing the new website URLs.
- The sitemap is an old version/no longer exists
- It’s a duplicate or no longer useful
- The sitemap URLs have been duplicated into a different sitemap. It’s not recommended to include the same URL within multiple sitemaps.
To remove a sitemap, click the name of the sitemap you wish to remove within the Submitted sitemaps list. Once you are in your chosen sitemap, go to the top right and click the three dots next to Open Sitemap. An option will then appear saying ‘Remove Sitemap’, which will remove it from the list.
Be aware that if the sitemap being removed is a child sitemap, resubmitting the index sitemap will add it to the list again. The Google sitemap report documentation states that removing a sitemap from Search Console will not stop Google from potentially looking at the sitemap in the future. To make sure it is no longer visited by Googlebot, the recommendation is to use a robots.txt disallow to prevent it from being read or, most obviously, to actually delete the sitemap file from your website.
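As a sketch, a robots.txt disallow for a retired sitemap might look like this (the filename here is hypothetical):
User-agent: *
Disallow: /old-sitemap.xml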
No, that’s wrong. We’ll drop your sitemap file if the URL OF YOUR SITEMAP FILE does not work. If /sitemap.xml is a 404, we’ll stop fetching it over time.
— 🍌 John 🍌 (@JohnMu) 17 January 2019
6.3 How to resubmit a Sitemap in Search Console
This option is no longer available in Search Console. According to Google, there is no longer a need to resubmit a sitemap that Google knows about (e.g has been submitted) as they will notice any changes the next time your website is crawled.
7. Analysis of your Sitemap in Search Console
The sitemap area in Search Console is an important place to keep track of how URLs are being indexed in Google and will highlight any errors or issues such as 404s or high response times in your sitemaps. As the sitemaps give Google an important list of URLs to crawl it is vital to make sure that the sitemap list is as clean and efficient as possible.
7.1 Common XML Sitemap Errors in Search Console
Once submitted, your sitemap will be listed as Success, Has errors or Couldn’t fetch. There are quite a few different errors that may appear within the Sitemap report in Search Console once Google has checked your submitted sitemap. These include URL not accessible, URL not allowed, Parsing error and Invalid date. Please check the Google sitemap report documentation for the full list.
7.2 How to view Index Coverage report for all Sitemap URLs
Index Coverage is a fantastic report in Search Console which can explain how Google is interpreting and indexing your URLs. You can view a specific Index Coverage report just for URLs submitted in your sitemap which is so useful! To do this, go to Index > Coverage in the left-hand menu of Search Console. Click on the All Known Pages dropdown at the top and then select All Submitted URLs from the dropdown.
7.3 How to view Index Coverage report for individual Sitemaps
Where suitable, it’s best to categorise and group similar URLs so that they are within their own sitemaps files, e.g products, categories etc. This way you are able to just analyse a smaller subset of URLs using the Index Coverage report. You can access the Index Coverage for an individual sitemap in two ways:
- Go to Index > Sitemaps in the left-hand menu of Search Console. Within the Submitted sitemaps list, find the sitemap you wish to view and click on the graph icon at the end of the row. Alternatively, if you click on the sitemap name, this will take you to more details about the sitemap, where there is also an Index Coverage link.
- Go to Index > Coverage in the left-hand menu of Search Console. Click on the All Known Pages dropdown at the top and then select your sitemap from the list.
Once you have selected your sitemap, the name will appear at the top and you’ll be looking at the Index Coverage report just for that specific sitemap. Please note that to view the Index Coverage report for child sitemaps submitted as part of an index file, you have to additionally submit each child sitemap you wish to view individually within Search Console and then wait for the ‘See Index Coverage’ button to appear.
If only the index file has been submitted you’ll only be able to view the Index Coverage report for all URLs within the sitemap index file and won’t be able to separate them on a child sitemap basis.
7.4 Sitemap analysis using Index Coverage report
As we’ve covered earlier in the guide, Google uses URLs within a sitemap as a key way to discover new URLs and the ones that you value within a website. As well as helping you to understand how Google is viewing and indexing these URLs, using Index Coverage for your sitemaps helps you to check whether you are giving Google the right information. It can help you refine and improve the sitemap by showing whether there are any URLs that are missing, need to be adjusted or even removed.
Remember that there is a limit of 1,000 rows for each report so if you have a large website, you may not be able to see the full list of URLs affected by the reports in the Error, Valid and Excluded sections.
This is where your sitemap analysis using other tools such as Sitebulb, Screaming Frog and Deepcrawl can help, especially if you need a full list of noindex, 404 or blocked by robots.txt URLs. There are three main categories in the Index Coverage report: Error, Valid and Excluded.
‘Errors’ reasons for Sitemaps
The majority of URLs showing in the Error category can often be fixed, as they are due to robots.txt blocking, 404s etc. The reasons are:
- Submitted URL blocked by robots.txt
- Submitted URL marked ‘noindex’
- Submitted URL not found (404)
- Submitted URL seems to be a Soft 404
- Submitted URL has crawl issue
If there are any URLs showing for the first three reasons in the Error category, these should have been picked up during your sitemap tests, and they can easily be double-checked with the tools we discussed earlier. However, the URLs in the Index Coverage report do not update instantly, so once an issue is fixed it will take a while to be reflected in Search Console. You can request Validate Fix for reasons in the Error category, which will send them to Google for checking, or manually submit individual URLs via the URL Inspection tool.
The URLs shown under the soft 404 and crawl issue reasons are useful to look at, as they give you information that the tools can’t: how Google is categorising the URLs. A soft 404 is a page that doesn’t return a 404 response code but looks like an error page to search engines. Pages like empty categories can often be seen as soft 404s if they contain little to no unique content.
If soft 404s are in your sitemap you should analyse those URLs to understand why they might be being viewed as a soft 404 and fix if necessary. If it’s a valid page but being seen as a soft 404 it may need to be reviewed, in terms of content and quality.
If a submitted URL is being included in the Crawl Issue reason list, it means it encountered a crawling error that doesn’t fit into any of the other reasons. You can examine the URL using the URL Inspection Tool to see if it gives any more information.
If you are seeing a URL under a reason that you know is fixed, it might have been fixed AFTER the last time Google crawled the URL, which you can check via the crawl date. You can request re-indexing for individual URLs or select Validate Fix if they are in the Error category.
‘Excluded’ reasons for Sitemaps
Google has stated that you shouldn’t expect to be able to fix everything in the Excluded category, as having canonicalised URLs or URLs with redirects is expected and normal within a typical website.
However, when you are looking at the report for a sitemap specifically, you would want the Error and Excluded sections to be as clear as possible. If you have followed the points mentioned earlier, your sitemaps should not contain canonicalised or duplicate URLs, and every URL in them should be considered important enough for indexing.
If Google disagrees, it can point to errors, quality issues, pages that look like duplicates or a website problem that Google is encountering when trying to access the URLs. The key ‘Excluded’ reasons for sitemap URLs are:
- Duplicate, submitted URL not selected as canonical
- Discovered – currently not indexed
- Crawled – currently not indexed
Duplicate, submitted URL not selected as canonical
Google’s explanation for this reason is that there are duplicate URLs without an explicitly marked canonical:
“The URL is one of a set of duplicate URLs without an explicitly marked canonical page. You explicitly asked this URL to be indexed, but because it is a duplicate, and Google thinks that another URL is a better candidate for canonical, Google did not index this URL. Instead, we indexed the canonical that we selected.”
Where possible, investigate the individual URLs using the URL Inspection tool (this will also show the canonical that Google has picked), check whether the URLs are unique and valid for indexing and have a correct canonical tag, and look out for duplicate or thin content.
In our experience, a URL can sometimes appear in this list if it’s either canonicalised or has had a redirect added but is still within the sitemap. Checking the URL using the URL Inspection Tool, or visiting it in the browser or with a website crawler, can normally give better clarity. If it’s due to a redirect, you can remove it from the sitemap and wait for it to drop out, or try resubmitting via the URL Inspection tool. Again, if it’s canonicalised and the canonical is correct, you can simply remove it from the sitemap.
John Mueller also gave some insight when a webmaster was discovering this was causing a large number of deindexed URLs:
“Usually this happens when we run across a number of URL patterns on a site that all lead to substantially the same content. If this all happened during a short time, it might be that there was something misconfigured that caused this, and in that case, it’ll settle back down over time as our algorithms confirm that these URLs are actually separate. “
https://www.seroundtable.com/google-canonical-urls-wrong-26872.html
Discovered – currently not indexed
The official statement for this reason in the help documentation refers mainly to the site being overloaded:
“Discovered – currently not indexed: The page was found by Google, but not crawled yet. Typically, Google tried to crawl the URL but the site was overloaded; therefore Google had to reschedule the crawl. This is why the last crawl date is empty on the report.”
John Mueller gave more information on this in a Webmaster Hangout in 2018:
“So in general this sounds a bit like something where we’re seeing a lot of pages. And our systems are just not that interested in indexing all of these where they think maybe it’s not worthwhile to actually go through and crawl and index all of these. So especially if you’re seeing discovered but currently not indexed that means we know about that page, that could be through a sitemap file, it could be through internal linking. But our systems have decided it’s not worth the effort, at least at the moment, for us to crawl and index this particular page.”
“So if you’re auto-generating content, if you’re taking content from a database and just putting it all online, then that might be something where we look at that and say well there’s a lot of content here but the pages are very similar or they’re very similar to other things that we already have indexed, it’s probably not worthwhile to kind of jump in and pick all of these pages up and put them into your search results.” “First make sure that you’re not accidentally generating too many URLs. Make sure that the internal linking is working well. And try to reduce the number of pages and kind of combine the content to make it much stronger.”
https://www.seroundtable.com/google-discovered-currently-not-indexed-help-26697.html
If this is affecting your sitemap URLs, he mentions that it could be an issue of too many similar or auto-generated URLs. The quality of the pages might also need to be made stronger; he suggested that if this is affecting a large percentage of your website, the number of pages may need to be reduced or combined and the content on the remaining pages improved. Also check that the URLs on the list are not orphaned URLs or ones that could look like soft 404s, and that they are reachable on the website via internal linking.
Crawled – currently not indexed
Google gives this reason the following description: “The page was crawled by Google, but not indexed. It may or may not be indexed in the future; no need to resubmit this URL for crawling.” In other words, Google has crawled the URL but hasn’t indexed it. It may do so in the future… but it may not.
This is another good time to check the URL Inspection tool and also the quality of the content on these URLs, to see if there are any duplication or thin content issues. Another thing to note is that we have found in the past that a URL can be listed here, but a manual check via the URL Inspection tool shows it is actually indexed.
7.5 Check / Tidy up your current Sitemaps
If you already have a large sitemap section within Search Console, it’s worth spending a bit of time having a tidy up. There may be old sitemaps which are no longer being used; if they are not part of a migration within the last six months, we would recommend deleting these.
8. Summary
Sitemaps are a fantastic tool to use to help Google find and understand the important URLs within your website. However, it’s vital to make sure that a sitemap is correctly formatted, contains the correct URLs and stays within the size limits allowed.
- Ensure you have a sitemap file for your website – utilise child sitemaps where possible
- Check your sitemap for errors – such as incorrectly formatted, canonicalised, broken or redirecting URLs
- Check your sitemap sizes – make sure they are not too large and also be wary of including too many with a small number of URLs
- Improve your sitemap – include an accurate last modified date & images
- Investigate your sitemap URLs – look for orphaned sitemap URLs or missing pages/sections
- Extend your sitemap – based on your website consider adding media types like news, images and videos. Add hreflang if applicable
- Submit your sitemap to search engines – check for errors upon submission
- Analyse your sitemaps using Search Console – check to see how Google is treating the URLs within your sitemaps and make adjustments if necessary. To view the Index Coverage report for individual child sitemaps, submit those separately along with the index file.
- Regularly test your sitemap – make sure no new errors crop up or unexpected pages
If you would like some advice regarding your website’s sitemap or any aspects of SEO, Technical SEO or PPC please don’t hesitate to contact us!