We’ve built a tool that lets us check how many pages on a given site are indexed in Google.
So far, we’ve checked hundreds of websites, and the tool has helped us diagnose SEO issues our clients were dealing with, such as those related to crawl budget and indexing.
We often encounter data anomalies when investigating these problems and see many websites with severe mistakes in their sitemaps.
How could this affect your website?
If your sitemap is not implemented properly, Googlebot can spend a lot of time crawling low-quality URLs, which is a waste of crawl budget. As a result, many valuable URLs on your website may not be indexed in Google, because it won’t have sufficient resources to crawl them.
What mistakes are popular websites making in their sitemaps, and how do you avoid them to ensure Google is not wasting the crawl budget on irrelevant content?
Let’s dig in.
What is the crawl budget?
First, let me explain what crawl budget is and how exactly it’s relevant for website indexing.
Google is able to crawl a lot of content, but its resources are not infinite – so it needs to make choices with the resources it has.
That’s why Google assigns every website a crawl budget – the number of URLs Googlebot can and wants to crawl.
A site’s crawl budget depends on two metrics:
- Crawl capacity limit – calculated to crawl all important content on a website without overwhelming its server’s limits – and,
- Crawl demand – determined by a website’s size, popularity, and update frequency.
If the site slows down or responds with server errors, the limit goes down and Googlebot crawls less. (source: Google’s documentation)
Because of Googlebot’s limited capabilities, you should plan which URLs Googlebot crawls on your website.
The key to adjusting which URLs are crawled is explained in Google’s documentation:
Manage your URL inventory: Use the appropriate tools to tell Google which pages to crawl and which not to crawl. If Google spends too much time crawling URLs that aren’t appropriate for the index, Googlebot might decide that it’s not worth the time to look at the rest of your site. (source: Google’s documentation)
To recap – here is what we know so far:
- If your website is slow, Google may crawl fewer URLs, hence fewer URLs will find their way into Google’s index,
- If Google is able to discover lots of low-quality URLs when crawling your site, it may decide that the overall quality of your site is low.
Here is a crucial takeaway:
With tons of low-quality URLs for Google to crawl, Googlebot may lose lots of time on crawling them and may not be able to crawl many high-quality URLs on your website.
This holds the most weight for large or rapidly changing websites because they need to be crawled often and extensively in order to attract traffic.
How are sitemaps important for your crawl budget?
As I’ve explained, optimizing your crawl budget is an extremely important step for your site’s indexing.
One of the ways to manage your URL inventory is by creating and maintaining a well-optimized sitemap.
A sitemap is a file where you provide information about the pages, videos, and other files on your site, and the relationships between them […]. A sitemap tells Google which pages and files you think are important in your site, and also provides valuable information about these files. For example, when the page was last updated and any alternate language versions of the page. (source: Google’s documentation)
However, tons of websites fail to create well-optimized sitemaps. Luckily, we can learn from their mistakes.
What mistakes should you avoid in your sitemap?
I analyzed many popular sites and found that a lot of them make mistakes in their sitemaps that negatively affect their crawl budget, which could lead to issues with their Index Coverage.
Here is my breakdown of mistakes to avoid when creating a sitemap.
Submitting malformed URLs
One of the mistakes I discovered concerned the structure of URLs in sitemaps.
Let’s analyze it by looking at a specific example.
When I saw statistics collected by our software, I was stunned: it showed that 0% of whisky.de’s pages submitted in sitemaps were indexed in Google.
I knew this couldn’t be true, so I investigated the data further.
Most URLs in whisky.de’s sitemaps seemed valid:
- They were canonical,
- They weren’t blocked by the noindex robots meta tag,
- They weren’t blocked by the disallow directive in robots.txt,
- They were responding with a 200 status code.
But then I noticed that all the URLs had double slashes following the top-level domain – take a look at this sample:
The double slash looks like an obvious programmatic mistake made while generating the sitemaps, and one that’s easy to fix.
However, the pages included in the sitemaps have canonical tags pointing to their correct single-slash versions.
As a result, it’s highly probable that Google is visiting twice as many URLs as intended: both the double-slash versions from the sitemaps and the single-slash canonical versions.
Google has mechanisms to spot faulty patterns in URLs, and technically speaking, it’s possible that Google spotted the mistake. So, it could be crawling whisky.de accordingly and indexing the correctly structured URLs. But there’s no way for us to check that without access to the website’s Google Search Console account or server logs.
In practice, you shouldn’t rely on Google’s algorithms to fix your mistakes – practices like the one I described can put a strain on your crawl budget and even keep your pages out of Google’s index.
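As a quick sanity check during sitemap generation, you could normalize every URL before writing it out. Here is a minimal sketch (the example URL is hypothetical, chosen only to resemble the double-slash pattern described above):

```python
import re

def fix_double_slashes(url: str) -> str:
    """Collapse repeated slashes in the path while leaving the
    '://' scheme separator intact."""
    scheme, _, rest = url.partition("://")
    return f"{scheme}://{re.sub(r'/{2,}', '/', rest)}"

# Hypothetical URL resembling the faulty pattern:
print(fix_double_slashes("https://www.example.com//whisky/glenlivet/"))
# → https://www.example.com/whisky/glenlivet/
```

Running every generated URL through a normalizer like this (or, better, adding it as an assertion in your sitemap build pipeline) catches the problem before Googlebot ever sees it.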
Submitting thin content URLs
There is a plague of websites that include thin content pages in their sitemaps.
Let me show you an example.
I discovered this mistake on AnnTaylor.com, a top-rated women’s clothing store.
I wanted to check how many of their product categories were indexed in Google, so I investigated their sitemap dedicated to category pages.
The initial check showed that only 46% of the category pages were indexed in Google.
So, I looked into this in more detail and learned that most of their category pages were soft 404s.
Specifically, these pages displayed the message “We stylishly searched and no luck”:
It was no surprise that Google didn’t want to index them!
The next logical step was to exclude soft 404s from my sample. To do that, I checked the indexing status of the same sitemap, but used a filter that excluded pages containing the phrase “We stylishly searched and no luck.”
It turned out that after excluding soft 404 URLs, as much as 82% of the pages in the category sitemap were indexed.
Still, 18% of the category pages aren’t indexed in Google – that’s what their SEOs should focus on investigating.
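You can run a similar filter yourself: fetch the pages from a sitemap and flag any whose HTML contains a known “no results” message. A minimal sketch (the URLs and HTML below are invented for illustration; fetching is left out to keep it self-contained):

```python
# Phrases that indicate a soft 404 – a "no results" page served with a 200 status.
SOFT_404_PHRASES = [
    "We stylishly searched and no luck",
]

def looks_like_soft_404(html: str) -> bool:
    """Flag a page whose HTML contains a known 'no results' message."""
    return any(phrase in html for phrase in SOFT_404_PHRASES)

# Filter a list of (url, html) pairs fetched from sitemap URLs:
pages = [
    ("https://example.com/cat/a", "<p>We stylishly searched and no luck</p>"),
    ("https://example.com/cat/b", "<p>42 products found</p>"),
]
indexable = [url for url, html in pages if not looks_like_soft_404(html)]
print(indexable)  # → ['https://example.com/cat/b']
```

Pages the filter flags are candidates for removal from the sitemap (or for a proper 404/410 response) rather than something you want Googlebot spending crawl budget on.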
AnnTaylor’s situation is serious for the following reasons:
- First of all, Google is wasting crawl budget on crawling thin content.
- Additionally, it’s well known that Google judges quality on three levels: page, section, and site-wide. Google may decide that the category pages, in general, are of low quality, and all of them could get deindexed. In the past, this happened to websites like Giphy, Instagram, and Pinterest, as I described in one of my articles. Let’s hope it won’t happen to AnnTaylor.
Skipping valuable URLs
As I mentioned already, sitemaps help Google understand your website better and crawl it more intelligently.
However, I noticed many websites don’t include their most valuable URLs in sitemaps.
Here is one example.
I checked a general sample (taken from all URLs in GoodReads’ sitemaps) and found that just 35% of them were indexed.
I was very surprised, as I know it’s a very high-quality website. I’m certainly not the only one who visits GoodReads to read reviews and find out whether a particular book is worth reading.
Then, I saw the sample we checked had no URLs with books included. So I decided to download all their sitemaps.
The result: no URLs with books in sitemaps.
Why is it a bad sign?
There is a risk that Google prioritizes the URLs found in sitemaps and, as a result, skips visiting the book pages.
Disclaimer: GoodReads is not our client. So, technically speaking, it is possible that they have a private sitemap submitted to Google Search Console.
Overusing the <lastmod> parameter
One of the parameters you can include in your sitemap file is <lastmod>, specifying the last time a page was updated. This way, Google can easily pick out URLs that changed recently.
However, some websites overuse this parameter. Doing so can have adverse effects because, as Google’s guidelines state, “Google uses the <lastmod> value if it’s consistently and verifiably (for example by comparing to the last modification of the page) accurate.”
Let’s look at an example of a site that overuses the <lastmod> parameter.
I looked at Avon’s product sitemap and all listed URLs have the same <lastmod> parameter – the current day:
It’s safe to assume that not all of Avon’s URLs change daily, so Google may learn to distrust the <lastmod> values and be reluctant to index its pages.
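The fix is to emit <lastmod> from a page’s real modification time rather than stamping every entry with the build date. A minimal sketch of such a generator (the URL and timestamp are made up; in practice the modification time would come from your CMS or database):

```python
from datetime import datetime, timezone

def lastmod_entry(url: str, modified: datetime) -> str:
    """Emit a sitemap <url> element whose <lastmod> reflects the page's
    actual modification time, formatted as a W3C datetime."""
    stamp = modified.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S+00:00")
    return (f"  <url>\n"
            f"    <loc>{url}</loc>\n"
            f"    <lastmod>{stamp}</lastmod>\n"
            f"  </url>")

# Hypothetical page last edited on 2023-05-04:
print(lastmod_entry("https://example.com/p/1",
                    datetime(2023, 5, 4, 12, 0, tzinfo=timezone.utc)))
```

If you can’t determine when a page genuinely changed, it’s safer to omit <lastmod> for that entry than to fill in a value Google will learn to ignore.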
Linking to your staging environment within sitemaps
It’s quite common for Google to index staging URLs.
It is usually a mystery how Google finds links to such pages. But a common explanation is that these URLs are linked directly from sitemaps.
Note that acehardware.com has since updated the sitemaps and addressed the mistake below.
Here is a sample I initially checked.
As you can see, I found that they were linking to the staging site from their sitemap.
Why is it bad to include your staging environment in a sitemap?
- Google crawls unnecessary URLs.
- If staging URLs get indexed, they can confuse users who stumble upon them in search results while looking for particular information.
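A simple safeguard is to check every sitemap URL against your production host before publishing. Here is a sketch, assuming a hypothetical production domain and staging subdomain:

```python
from urllib.parse import urlparse

PRODUCTION_HOST = "www.example.com"   # assumption: your live domain

def is_staging_url(url: str) -> bool:
    """Flag sitemap URLs that don't point at the production host,
    e.g. staging or preview subdomains."""
    return urlparse(url).netloc != PRODUCTION_HOST

sitemap_urls = [
    "https://www.example.com/p/1",
    "https://staging.example.com/p/1",
]
staging = [u for u in sitemap_urls if is_staging_url(u)]
print(staging)  # → ['https://staging.example.com/p/1']
```

Running a check like this in your deployment pipeline (and failing the build if it finds anything) prevents staging links from ever shipping in a sitemap.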
Best practices to follow in sitemaps
You’ve gone through my overview of things to avoid when creating and managing a sitemap for a website.
So now, what are some practices that you should follow?
Here are some best practices I recommend:
– Only include canonical URLs in your sitemaps.
– Keep each sitemap to a maximum of 50,000 URLs – if you have more, break them up into multiple smaller sitemaps.
– Don’t include session IDs in the URLs you submit – this reduces duplicate crawling of the same pages.
– Use consistent and complete URLs – include absolute rather than relative URLs.
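The 50,000-URL limit is easy to enforce programmatically. A minimal sketch of splitting a URL list into sitemap-sized chunks (the URLs are invented for illustration):

```python
def chunk_sitemaps(urls, max_per_file=50_000):
    """Split a URL list into chunks that respect the 50,000-URL
    per-sitemap limit; each chunk becomes one sitemap file,
    referenced from a sitemap index."""
    return [urls[i:i + max_per_file] for i in range(0, len(urls), max_per_file)]

# Hypothetical site with 120,000 product URLs:
chunks = chunk_sitemaps([f"https://example.com/p/{n}" for n in range(120_000)])
print(len(chunks))                      # → 3
print(len(chunks[0]), len(chunks[-1]))  # → 50000 20000
```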
As I’ve mentioned, make sure your sitemaps only include valuable URLs. You can perform a full website crawl to check if any URLs found in a crawl are missing from your sitemap.
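Once you have both URL sets – one from a full crawl and one parsed from your sitemaps – the comparison is a simple set difference. A sketch with made-up URLs:

```python
# Hypothetical URL sets: one from a full site crawl, one parsed from sitemaps.
crawled = {
    "https://example.com/",
    "https://example.com/book/1",
    "https://example.com/book/2",
}
in_sitemaps = {"https://example.com/"}

# Valuable URLs the crawler found that the sitemaps never mention:
missing_from_sitemaps = sorted(crawled - in_sitemaps)
print(missing_from_sitemaps)
# → ['https://example.com/book/1', 'https://example.com/book/2']
```

The reverse difference (`in_sitemaps - crawled`) is just as useful: it surfaces sitemap entries that no internal link points to, which are often orphaned or stale pages.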
This is just the tip of the iceberg when it comes to optimizing your sitemap – for further recommendations, read our ultimate guide to XML sitemaps. And for a full technical SEO audit, contact us for technical SEO services.
Sitemaps are valuable for every website.
Yet, as you can see from the examples of sites I listed, many popular websites don’t have optimized sitemaps, which comes at a cost – their Index Coverage is heavily impacted.
Also, keep in mind that SEO mistakes in sitemaps can negatively affect your crawl budget, which is crucial if you have a medium or large website.
I hope now you know what mistakes to avoid and you will be on your way to creating a sitemap that helps Google crawl your site more efficiently, leading to improved Index Coverage.