In SEO, there are several areas which are considered the bread and butter of a campaign, whether you’re an experienced agency or a fresh-faced startup trying to optimise your own site. On-page optimisation, building links, putting content together – the list goes on. But one subject which isn’t mentioned all too often is managing the crawl budget of a site. It can sometimes be just an afterthought; something that isn’t controllable and gets handled entirely by search engines.
Well, crawl optimisation – ensuring that you’re in complete control over what is and isn’t being crawled and indexed by search engines – should be a key part of any SEO campaign.
Throughout this post, we’ll take a look at precisely what a crawl budget is, and what you can do to properly manage your crawl budget to ensure that you’re in charge of what is and isn’t being crawled on your site.
What is a Crawl Budget?
A crawl budget is the number of pages that search engines like Google will crawl across your entire site during any given period of time – in other words, the number of URLs that search engines can and want to crawl.
Going back to basics for a second – a search engine will send a crawler, or a spider, out to trawl through the links of sites, finding content for their gargantuan indexes. The most commonly referenced one would be Google’s aptly named Googlebot.
The links included in their indexes – the ones you see in search results pages – are discovered through internal links, external links, sitemap files and so on, with crawlers also looking at the robots.txt file and meta robots tags for rules as to what should and shouldn’t be crawled or indexed.
Managing this is essential, as you need to ensure that your key pages are the ones being crawled and indexed. If you fail to control the number of pages being crawled and indexed within a website, especially when it comes to massive e-commerce sites, it can cause trouble with regards to your overall crawl budget, the URLs that are being crawled and those which are being indexed.
How is a crawl budget determined?
In this blog by Gary Illyes, it’s actually mentioned that the crawl budget is generally something which most publishers don’t have to worry about. If the site is well managed, it will be crawled and subsequently indexed. He then provides further information on the process that is undertaken by Google when they’re determining crawl budget and what they want to crawl on a site.
Essentially, there are two factors: the crawl rate and the crawl demand. The crawl rate represents how much time is taken between fetches/crawls, as well as the number of resources used to crawl the site. This is determined by the health of each crawl – if the site responds slowly and is littered with crawl errors, future crawls will use fewer resources – and a limit set by the user in Search Console, if they so choose.
The crawl demand looks at the popularity of the site and how stale the site is. Pages which have been linked to and shared more often will be crawled more often, while staleness refers to a lack of updates from within the site. These two combine to determine the crawl budget of the site – the links which bots can and want to crawl.
How do you optimise your crawl budget?
Here are a few areas which you should be paying attention to in order to really manage how your site is being crawled and the crawl budget you’re being allocated each time search engines trawl through your site.
Manage Your URL Parameters
As mentioned earlier on, crawl management must be considered for ecommerce sites, especially ones that generate massive amounts of URLs through filters or the search function.
These URLs, while not hugely different in terms of content, with only a slight change in products displayed or results provided, will appear as individual URLs. If you have an ecommerce store with thousands of products and a filter which breaks them down into an array of categories, this will cause issues with regards to both duplicate content and the crawling of your site.
I’ve seen many cases, and I’m sure you have too, where a relatively small ecommerce site has thousands upon thousands of URLs indexed within Google, simply because they haven’t managed their parameters properly.
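One quick way to gauge how bad the problem is: take a URL list from a crawl and count how many parameterised variants collapse onto each clean path. A minimal sketch, using only the Python standard library – the URLs and parameter names below are hypothetical examples:

```python
# Group a crawl's URL list by clean path (query string stripped)
# to see how many near-duplicate parameter URLs each page generates.
from urllib.parse import urlsplit
from collections import Counter

def parameter_counts(urls):
    """Count how many distinct URLs collapse onto each clean path."""
    counts = Counter()
    for url in urls:
        parts = urlsplit(url)
        counts[parts.scheme + "://" + parts.netloc + parts.path] += 1
    return counts

# Hypothetical crawl export from an ecommerce category:
crawl = [
    "https://example.com/shoes?colour=red",
    "https://example.com/shoes?colour=blue",
    "https://example.com/shoes?colour=red&sort=price",
    "https://example.com/shoes",
    "https://example.com/about",
]
```

If one path accounts for dozens of variants, that’s a strong signal the parameter needs handling.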
You can actually manage parameters within Google Search Console, which will let Googlebot know precisely how you want these parameters to be handled:
There’s also the more manual route, where you can use canonical tags, as well as including the parameter within your robots.txt file as a disallow rule, preventing the crawling of the parameter URLs.
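As a sketch of the robots.txt route – the parameter names here are hypothetical examples, so substitute your own:

```
# robots.txt – block crawling of filter/sort parameter URLs
User-agent: *
Disallow: /*?colour=
Disallow: /*&colour=
Disallow: /*?sort=
Disallow: /*&sort=
```

Bear in mind a disallow rule prevents crawling but won’t remove already-indexed URLs; a canonical tag on the parameter pages pointing at the clean category URL handles the duplication side.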
Avoid Redirect Chains
One issue that often crops up when it comes to a lack of control over the crawl budget is the use of redirect chains.
With a redirect, the aim should be to direct the user to the intended page with just one step, or as few steps as possible. For example, if you try to go to the non-secure version of the Rice homepage, there’s a simple redirect in place:
In some cases, you may well see a massive chain of redirects, usually done due to older redirects piling up over countless migrations and structural changes. Each URL, or each link in the chain, is an unnecessary use of the allocated crawl budget.
Recently, while checking a new client’s site, we found a chain that was created when you tried to access a certain URL which was linked in the main navigation. Due to the sheer number of links in this chain, the final page wouldn’t even load. Absolute nightmare.
Though the record would probably go to Barry Adams, who spotted this and posted it on Twitter a few weeks ago:
This is a new record for me: a 20-hop redirect chain that still resolves to a 200 OK page in the end.
— Barry Adams 📈 (@badams) March 31, 2017
When a crawl bot sees a chain like this, there’s a good chance that they’ll drop off before even reaching the destination page – Google have stated that they’ll only follow roughly five steps in a chain before dropping off – which is a massive issue, as it means that they’re not crawling the latest content on the site, making it harder for it to be fully indexed.
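The hop-counting logic above can be sketched without touching the network, by resolving a chain against a simple source-to-destination mapping. This is a minimal illustration, not how Googlebot actually works internally; the five-hop cutoff mirrors the limit mentioned above, and the URLs are hypothetical:

```python
# Resolve a redirect chain from a URL-to-target mapping, giving up
# after five hops (roughly the limit Google has said a crawler will
# follow before dropping off).
MAX_HOPS = 5

def resolve_chain(url, redirects, max_hops=MAX_HOPS):
    """Follow `redirects` (dict of source -> destination URLs).
    Returns (final_url, hops_taken, resolved)."""
    hops = 0
    while url in redirects:
        if hops >= max_hops:
            return url, hops, False  # bot likely abandons the chain here
        url = redirects[url]
        hops += 1
    return url, hops, True

# Hypothetical chain built up over two site migrations:
chain = {
    "http://example.com/old": "https://example.com/old",
    "https://example.com/old": "https://example.com/new",
    "https://example.com/new": "https://www.example.com/new/",
}
```

The fix, of course, is to collapse every chain so each old URL points straight at the final destination in one hop.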
Manage Your Site Speed
This is an area which is mentioned quite frequently in SEO, and deservedly so, given its importance. It plays a role in how users engage and interact with a site, and also plays a role in the crawling process.
In the aforementioned post put together by Gary Illyes, he mentioned that a faster site will also have an effect on the rate at which the site is crawled:
“Making a site faster improves the users’ experience while also increasing crawl rate. For Googlebot, a speedy site is a sign of healthy servers, so it can get more content over the same number of connections. On the flip side, a significant number of 5xx errors or connection timeouts signal the opposite, and crawling slows down.”
John Mueller has also mentioned something along these lines in the past:
“We’re seeing an extremely high response time for requests made to your site (at times, over 2 seconds to fetch a single URL). This has resulted in us severely limiting the number of URLs we’ll crawl from your site, and you’re seeing that in Fetch as Google as well. My recommendation would be to make sure that your server is fast & responsive across the board. As our systems see a reduced response time, they’ll automatically ramp crawling back up (which gives you more room to use Fetch as Google too). “
Having a snappy response time for your site will help aid the crawling process. If your site isn’t currently up to scratch in that regard, Google will recognise any improvements going forward, adjusting the crawl rate accordingly.
Manage Your Sitemap
Your sitemap file plays a key role in determining which content from your site will be crawled and indexed. It provides a direct route to the pages on your site and is prioritised above the standard site crawl in terms of identifying links, as the links in a sitemap are deemed more important.
With this in mind, you want to ensure that there isn’t any unnecessary clutter within your sitemap file, nor should there be malformed URLs, non-existent pages and the like. Pages included in the sitemap which no longer exist or are redirecting create unneeded clutter and take up part of your overall crawl allocation.
You should routinely check your sitemap file for any errors or unnecessary URLs that have been included. This can be done by checking out the Sitemaps section within Google Search Console, as well as running a crawl of the sitemap in Screaming Frog, identifying any issues under the Response Codes tab, as well as the Directives section.
If your site has a massive amount of URLs, or you have a separate section for a blog, or you just want to keep things as tidy as possible – you can look into creating multiple sitemaps within a sitemap index. This is commonly done within Yoast for WordPress, as shown below for the Rice site:
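For reference, a sitemap index is simply a small XML file listing the child sitemaps. A minimal sketch – the filenames and dates here are hypothetical examples, and plugins like Yoast generate this for you:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sitemap index splitting pages and blog posts into separate files -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/page-sitemap.xml</loc>
    <lastmod>2017-04-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/post-sitemap.xml</loc>
    <lastmod>2017-04-01</lastmod>
  </sitemap>
</sitemapindex>
```

Each child sitemap then lists its own set of URLs, and you submit just the index file in Search Console.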
For a complete look at putting sitemaps together properly, check out our guide to XML sitemaps.
Internal Linking and Site Structure
A key component of crawl management – and of ensuring that key content gets found – is building a proper internal linking structure within the site, something that should be considered when the site is being put together.
If your site has a proper hierarchy and well-built architecture, with key content being linked to, it makes it far easier for search engines to crawl your content and for it to subsequently be indexed. Search engines should be able to find each page within a certain number of clicks. If content is buried deep inside the architecture of the site, it’s much harder to find.
To see which pages are being internally linked to the most, you can see this in Search Console, under Internal Links. Generally, your homepage should be the most linked-to page, with key areas following it.
In terms of determining the number of clicks it takes to reach a page on the site, you can check this by running a Screaming Frog crawl of the site and clicking on the Site Architecture tab. I know I keep referencing Screaming Frog, but it is pretty damn good.
Here, you’ll be shown info such as the clicks taken to reach the page from the starting URL, as well as a graph displaying this information across all pages of the site as a whole:
Making search engines trawl through reams upon reams of pages to find new content makes that content much harder to find, and requires far more effort to crawl.
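Click depth is just the shortest path from the homepage through your internal links, so it can be sketched as a breadth-first search over a link graph. The site structure and URLs below are hypothetical examples:

```python
# Compute click depth (minimum clicks from the homepage) for each
# page, given a hand-built internal-link graph.
from collections import deque

def click_depths(start, links):
    """BFS over `links` (dict of page -> list of linked pages).
    Returns each reachable page's minimum click depth from `start`."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit = shortest path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Hypothetical internal-link graph:
site = {
    "/": ["/shoes/", "/blog/"],
    "/shoes/": ["/shoes/red-trainers/"],
    "/blog/": ["/blog/crawl-budget/"],
    "/blog/crawl-budget/": ["/shoes/red-trainers/"],
}
```

Any page that only shows up at depth four or five is a candidate for a link from a category page or the main navigation.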
Helpful Links + Further Reading:
What Crawl Budget Means for Googlebot – Gary Illyes
If you’d like some advice regarding your site’s Crawl Budget and Technical SEO then do not hesitate to get in touch with one of our tech wizards!