A Guide to Crawl Budgets and Crawl Optimisation

In SEO, there are several areas which are considered the bread and butter of a campaign, whether you’re an experienced agency or a startup trying to optimise your own site: on-page optimisation, building links, content marketing, the list goes on. One subject which isn’t mentioned as often is crawl budget management. It can sometimes just be an afterthought – something assumed to be out of your control and handled entirely by search engines.

Crawl optimisation, and ensuring that you’re in control of what is and isn’t being crawled and indexed by search engines, should be a key consideration for many SEO projects, especially for bigger sites.

Throughout this post, we’ll take a look at precisely what a site’s crawl budget actually is, as well as what you can do to properly manage your crawl budget, ensuring that you’re in charge of what is and isn’t being crawled on your site.

What is a Crawl Budget?

A crawl budget is the number of pages that search engines like Google will crawl across your site during a given period of time. Google defines it as the number of URLs that Googlebot can and wants to crawl.

Going back to basics for a second, a search engine will send a crawler, or a spider, out to trawl through links throughout the web, finding content for their gargantuan indexes. The most commonly referenced one would be Google’s aptly named Googlebot.

The URLs included in their indexes – the ones you see in search results – come from internal links, external links, sitemap files and so on. Crawlers will also look at the robots.txt file and meta robots tags for rules as to what should and shouldn’t be crawled or indexed. Botify had an interesting look into how Google crawls the web at SMX Paris 2018. 

Managing this is essential, as you need to make sure your key pages are the ones being crawled and indexed. If you fail to control the number of pages being crawled and indexed within a website, especially when it comes to massive e-commerce sites, it can cause trouble with your overall crawl budget, with which URLs are being crawled and with which are being indexed.

So, how exactly is the crawl budget of a website determined?

How is a site’s crawl budget determined?

In this blog by Google’s Gary Illyes, it’s mentioned that crawl budget is generally something which most publishers don’t have to worry about. This relates to smaller sites: if you’ve got a site with a handful of pages, there really isn’t that much to manage, and if the site is well managed, it will be crawled and subsequently indexed. He then provides further information on the process Google goes through when determining crawl budget and what they want to crawl on a site.

Essentially, there are two factors: the crawl rate and the crawl demand. The crawl rate represents the number of simultaneous connections Googlebot may use to crawl the site, as well as how long it waits between fetches. This is determined by the health of each crawl – if the site responds slowly or is littered with server errors, future crawls will use fewer resources – and by a limit set by the site owner in Search Console, if they so choose.

The crawl demand looks at the popularity of your URLs, as well as how stale they are. Pages which have been linked to and shared more often will be crawled more often, while staleness refers to how out of date Google’s indexed copy of a page has become.

These two combine to determine the crawl budget of the site – the number of URLs which bots can and want to crawl.

Now that we know the factors that go into determining a crawl budget, what can we do to optimise our site’s crawl budget?

How can you optimise your crawl budget?

Here are a few areas which you should be paying attention to in order to manage how your site is being crawled and improve its general crawl health.

Check Your Log Files

Before getting into the primary areas of crawl budget optimisation, you need to identify precisely what is being crawled on your site. The best way to do this would be to analyse your log files.

Log files, specifically web server log files, are essentially a collection of the requests made for your site. Each entry records information such as the requested URL, HTTP status code, IP address, timestamp, and user agent.

This information is vital when analysing the crawl behaviour of search engines across your site. Looking into this will give you direct insight into what search engines are actually crawling.
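
As a rough illustration of what this analysis looks like, here is a minimal Python sketch that counts Googlebot requests per top-level directory. It assumes your server writes the common "combined" access log format, and the sample lines, IPs and paths are made up for the example rather than taken from a real site.

```python
import re
from collections import Counter

# Regex for the "combined" access log layout (an assumption; adjust it if
# your server logs in a different format).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<user_agent>[^"]*)"'
)


def googlebot_hits_by_directory(lines):
    """Count requests whose user agent mentions Googlebot, per top-level directory."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if not match or "Googlebot" not in match.group("user_agent"):
            continue
        # Drop the query string so /search/?q=a and /search/?q=b group together.
        path = match.group("path").split("?", 1)[0]
        segments = [s for s in path.split("/") if s]
        # "/blog/post-1" -> "/blog/", "/search/" -> "/search/", "/about" -> "/"
        if segments and (len(segments) > 1 or path.endswith("/")):
            directory = f"/{segments[0]}/"
        else:
            directory = "/"
        counts[directory] += 1
    return counts


GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

# Hypothetical log lines, invented for this example.
sample_log = [
    f'66.249.66.1 - - [10/Oct/2018:13:55:36 +0000] "GET /blog/post-1 HTTP/1.1" 200 5120 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/Oct/2018:13:55:40 +0000] "GET /search/?q=shoes HTTP/1.1" 200 2048 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/Oct/2018:13:55:44 +0000] "GET /search/?q=boots&sort=price HTTP/1.1" 200 2048 "-" "{GOOGLEBOT_UA}"',
    f'203.0.113.7 - - [10/Oct/2018:13:56:01 +0000] "GET /blog/post-1 HTTP/1.1" 200 5120 "-" "{BROWSER_UA}"',
    f'66.249.66.1 - - [10/Oct/2018:13:56:10 +0000] "GET / HTTP/1.1" 200 1024 "-" "{GOOGLEBOT_UA}"',
]

hits = googlebot_hits_by_directory(sample_log)
for directory, count in hits.most_common():
    print(directory, count)
```

Bear in mind that a user agent string can be spoofed, so for anything beyond a quick look it’s worth verifying that the requests really come from Google (for example with a reverse DNS lookup) before drawing conclusions.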

When in the process of optimising how your site is crawled, you can find any potential issues by checking which URLs are being crawled, and the rate at which they’re crawled. Here’s a quick example from a site I looked at fairly recently – the data sample size was rather small, though when looking into the directories of the site and their respective crawl rates, I saw that there were a relatively large number of requests being made for cruft URLs:

[Image: log file analysis showing requests by directory]

A robots.txt rule was subsequently set up for the URLs generated within this directory, improving the overall crawl health of the site.
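
To sanity-check a rule like this before it goes live, you could test it against a few URLs with Python’s built-in `urllib.robotparser`. This is only a sketch: the `/search/` directory and the example.com URLs are hypothetical stand-ins, not the rule used on the site mentioned above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; "/search/" stands in for the cruft directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url, user_agent="Googlebot"):
    """Return True if the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(is_crawlable("https://www.example.com/search/?q=shoes"))  # False
print(is_crawlable("https://www.example.com/blog/post-1"))      # True
```

One caveat: the standard library parser matches rules as plain path prefixes, so it won’t reflect wildcard patterns the way Googlebot interprets them. For simple directory rules like the one above, that difference doesn’t matter.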

For more information, Built Visible has a fantastic guide on server log file analysis.

There are also many different tools out there which can help you in your server log file analysis:

Screaming Frog’s Log File Analyser

Deepcrawl

OnCrawl

Manage Your URL Parameters

As mentioned earlier on, crawl management must be considered for ecommerce sites, especially ones that generate massive amounts of URLs through filters or the search function.

These URLs, while not hugely different in terms of content, with only a slight change in products displayed or results provided, will each be treated as an individual URL. If you have an ecommerce store with thousands of products and a filter which breaks them down into an array of categories, this will cause issues with regards to both duplicate content and the crawling of your site.

I’ve seen many cases, and I’m sure you have too, where a relatively small ecommerce site has thousands upon thousands of URLs indexed within Google, simply because they haven’t managed their parameters properly. It’s vital that you manage the crawling of these URLs. 

You can actually manage parameter URLs within Google Search Console, which will let Googlebot know precisely how you want these parameters to be handled:

[Image: URL parameter settings in Google Search Console]

There’s also the Let Googlebot Decide option, which essentially brings the parameter to their attention. Googlebot can analyze your site to determine how best to handle the parameter.

As mentioned in Rachel Costello’s Search Leeds talk on conflicting website signals, the parameter handling signals are particularly strong.

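There’s also the more manual route, where you can use canonical tags, or include the parameter within your robots.txt file as a disallow rule, preventing the crawling of the parameter URLs. Bear in mind that a crawler can’t see a canonical tag on a URL it has been blocked from crawling, so it’s best to pick one approach per parameter.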
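
To get a feel for how many parameter URLs collapse into the same page, here is a small Python sketch that strips filter and sort parameters to work out the URL a canonical tag would point to. The parameter names (`colour`, `size`, `sort`) and the example.com URLs are assumptions made for illustration; your own list will depend on how your platform builds its filters.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that only filter or reorder a listing.
NON_CANONICAL_PARAMS = {"colour", "size", "sort"}


def canonical_url(url):
    """Return the URL with filter/sort parameters (and any fragment) removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in NON_CANONICAL_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def group_by_canonical(urls):
    """Group crawled URLs under the canonical URL they collapse into."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonical_url(url)].append(url)
    return dict(groups)


crawled = [
    "https://www.example.com/shoes",
    "https://www.example.com/shoes?colour=red",
    "https://www.example.com/shoes?colour=red&sort=price",
    "https://www.example.com/shoes?page=2&size=9",
]

for canonical, variants in group_by_canonical(crawled).items():
    print(canonical, len(variants))
```

Here the first three URLs all collapse into `/shoes`, while `page=2` is kept because pagination usually does change the content. Whether that holds for your site is a judgment call.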

Avoid Redirect Chains

One issue that often comes up where there’s a lack of control over the crawl budget is redirect chains.

With a redirect, the aim should be to direct the user to the intended page with just one step, or as few steps as possible. For example, if you try to go to the non-secure version of the Rice homepage, there’s a simple redirect in place:

[Image: redirect from the non-secure version of the Rice homepage]

In some cases, you may well see a massive chain of redirects, usually caused by older redirects piling up over countless migrations and structural changes. Each URL, or each link in the chain, is an unnecessary use of the allocated crawl budget.

Recently, while checking a new client’s site, we found a chain that was created when you tried to access a certain URL which was linked in the main navigation. Due to the sheer number of links in this chain, the final page wouldn’t even load. Absolute nightmare.

Though the record would probably go to Barry Adams, who spotted this and posted it on Twitter: 

When a crawl bot sees a chain like this, there’s a good chance that it’ll drop off before even reaching the destination page. Google have stated that they’ll only follow roughly five steps in a chain before dropping off. This is a massive issue, as it means that the latest content on the site isn’t being crawled, making it harder for it to be fully indexed.

In order to clean these up, it’s worth going through your internal links to identify any that point to redirecting pages. This is good housekeeping anyway, but some of them may well be redirect chains – further worsening the issue.

You can also address these by working on legacy redirects. If your site has a few miles on the clock and has undergone plenty of changes, you can go into these older redirects and update their destination URLs.
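
If you can export your redirect rules as a simple source-to-destination list, the clean-up described here can be sketched in a few lines of Python. The redirect map below is invented for the example, and the code works purely on that map rather than making any live requests.

```python
def resolve_chain(url, redirects):
    """Follow a URL through the redirect map and return every URL visited."""
    chain = [url]
    while chain[-1] in redirects:
        target = redirects[chain[-1]]
        if target in chain:
            raise ValueError(f"Redirect loop detected at {target}")
        chain.append(target)
    return chain


def flatten_redirects(redirects):
    """Point every legacy redirect straight at its final destination."""
    return {source: resolve_chain(source, redirects)[-1] for source in redirects}


# Hypothetical legacy redirects left behind by several migrations.
legacy_redirects = {
    "/old-shoes": "/shoes-2016",
    "/shoes-2016": "/footwear",
    "/footwear": "/shoes",
}

chain = resolve_chain("/old-shoes", legacy_redirects)
print(" -> ".join(chain))   # /old-shoes -> /shoes-2016 -> /footwear -> /shoes
print(len(chain) - 1)       # 3 hops

print(flatten_redirects(legacy_redirects))
# {'/old-shoes': '/shoes', '/shoes-2016': '/shoes', '/footwear': '/shoes'}
```

After flattening, every old URL reaches the final page in a single step, which is the aim set out at the start of this section.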

Improve Your Site Speed

This is an area which is mentioned quite frequently in SEO, though it’s deserved due to its importance. It plays a role in how users engage and interact with a site and also plays a role in the crawling process.

In the aforementioned post put together by Gary Illyes, he mentioned that a faster site will also have an effect on the rate at which the site is crawled:

Making a site faster improves the users’ experience while also increasing crawl rate. For Googlebot, a speedy site is a sign of healthy servers, so it can get more content over the same number of connections. On the flip side, a significant number of 5xx errors or connection timeouts signal the opposite, and crawling slows down.

John Mueller has also mentioned something along these lines in the past:

We’re seeing an extremely high response time for requests made to your site (at times, over 2 seconds to fetch a single URL). This has resulted in us severely limiting the number of URLs we’ll crawl from your site, and you’re seeing that in Fetch as Google as well. My recommendation would be to make sure that your server is fast & responsive across the board. As our systems see a reduced response time, they’ll automatically ramp crawling back up (which gives you more room to use Fetch as Google too).

Having a snappy response time for your site will aid the crawling process. If your site isn’t currently up to scratch in that regard, Google will recognise any improvements going forward, adjusting the crawl rate accordingly.

Manage Your Sitemap

Your sitemap file plays a key role in determining which content from your site will be crawled and indexed. It provides a direct route to the pages on your site and is prioritised above the standard site crawl in terms of identifying links, as the links in a sitemap are deemed as more important.

With this in mind, you want to ensure that there isn’t any unnecessary clutter within your sitemap file, such as malformed URLs, non-existent pages or redirecting URLs. Every one of these that gets crawled through the sitemap takes up a part of your overall crawl allocation.

You should routinely check your sitemap file for any errors or unnecessary URLs that have been included. This can be done by checking out the Sitemaps section within Google Search Console. You can also run a crawl of the sitemap in Screaming Frog, identifying any issues under the Response Codes tab, as well as the Directives section.
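
As a rough idea of what this check involves, the Python sketch below pulls the URLs out of a sitemap and flags any that don’t return a 200. The sitemap content and the status codes are made up for the example; in practice the status codes would come from your own crawl or log data.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every <loc> value listed in a urlset sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]


def flag_sitemap_issues(urls, status_codes):
    """Return sitemap URLs whose known status code is anything other than 200."""
    return {
        url: status_codes.get(url)
        for url in urls
        if status_codes.get(url) != 200
    }


# Hypothetical sitemap and crawl results, invented for this example.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/old-page</loc></url>
  <url><loc>https://www.example.com/deleted-page</loc></url>
</urlset>
"""

crawl_status = {
    "https://www.example.com/": 200,
    "https://www.example.com/old-page": 301,
    "https://www.example.com/deleted-page": 404,
}

issues = flag_sitemap_issues(sitemap_urls(SAMPLE_SITEMAP), crawl_status)
for url, status in issues.items():
    print(status, url)
```

Anything flagged here is a candidate for removal from the sitemap, or for replacing with its final destination URL if it redirects.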

If your site has a massive number of URLs, or you have a separate section for a blog, or you just want to keep things as tidy as possible, you can look into creating multiple sitemaps within a sitemap index. This is commonly done within Yoast for WordPress, as shown below for the Rice site:

[Image: Yoast sitemap index for the Rice site]

For a complete look at putting sitemaps together properly, check out our guide to XML sitemaps.

Create a Clear Site Structure

A key component of crawl management, and of ensuring that key content is crawled, is to build a proper internal linking structure within the site, something that should be considered when the site is being put together.

If your site has a proper hierarchy and well-built architecture, with key content being linked to, it makes it far easier for that content to be crawled and subsequently indexed. Search engines should be able to find each key page within a limited number of clicks. If content is buried deep inside the architecture of the site, it’s much harder to find, and likely won’t be crawled as often.

You can see which pages are being internally linked to the most in Search Console, under Internal Links. Generally, your homepage should be the most linked-to page, with key pages/categories following it.

[Image: Internal Links report in Search Console]

To determine the number of clicks it takes to reach a page on the site, you can run a Screaming Frog crawl of the site and click on the Site Architecture tab. I know I keep referencing Screaming Frog, but it is pretty damn good.

Here, you’ll be shown info such as the clicks taken to reach the page from the starting URL. You’ll also see a graph displaying this information across all pages of the site as a whole:

[Image: Screaming Frog graph of clicks from the starting URL]

Making search engines trawl through reams upon reams of pages to find new content makes that content much more difficult to find. Plus, it takes a lot more effort to crawl it.
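
If you’d like to see how click depth is worked out, the idea can be sketched as a breadth-first search over your internal links. The link graph below is a made-up example; a real one would come from a crawl export, and this isn’t a description of how Screaming Frog itself does it.

```python
from collections import deque


def click_depths(links, start="/"):
    """Return the minimum number of clicks needed to reach each page from start."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest route
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical internal link graph: page -> pages it links to.
internal_links = {
    "/": ["/shoes", "/blog"],
    "/shoes": ["/shoes/trainers"],
    "/blog": ["/blog/post-1"],
    "/blog/post-1": ["/shoes/trainers", "/blog/archive/2012"],
    "/orphan": ["/"],
}

depths = click_depths(internal_links)
for page, depth in sorted(depths.items(), key=lambda item: item[1]):
    print(depth, page)

# Pages that exist in the graph but can't be reached from the homepage.
unreachable = set(internal_links) - set(depths)
print(unreachable)  # {'/orphan'}
```

Pages with a high depth, or ones that never show up at all, are the ones most likely to be crawled less often, so they’re good candidates for extra internal links.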

Helpful Links + Further Reading:

What Crawl Budget Means for Googlebot – Gary Illyes

Crawl Budget Optimisation: You Are What Googlebot Eats – AJ Kohn
