Guide: Improving Your Site’s Crawlability and Indexing for SEO

In SEO, crawlability refers to how easily search engines can find and access pages and resources on your website. Since only crawled and indexed pages can appear in the search results, this is a major factor in SEO and the reason it’s the first tier in Maslow’s hierarchy of SEO needs.

In modern SEO, crawlability problems will hold your entire website back from achieving strong rankings, not just the pages that can’t be crawled. If you block Google from seeing the value you offer, then they have no choice but to assume your website is weak — you’ll be ranked accordingly.

TL;DR: If search engines can’t crawl your website, they can’t index or rank your content. Excessive crawl errors will lead to poor search rankings.

Crawling vs indexing

A quick clarification before we go any deeper. You’ll come across a lot of references to ‘crawling’ and ‘indexing’ in SEO, and they’re often used interchangeably. Although they’re closely related, they’re not quite the same thing.

Crawling: This is when a search engine (or other bot) accesses the pages and resources on your website. The crawler will ‘look at’ those pages, ‘crawling’ from one page to another via your sitemap and internal links, hence the name.

Indexing: Indexing is when search engines take a snapshot of each page and store it in their database (index) for future reference. When your website is ranked and shown in search results, it’s this indexed snapshot that their algorithm references.

For a page to be ranked, it must be both crawlable and indexable. If either is blocked in some way, the page cannot appear in search results until the problem is resolved.

How to check for crawlability and indexing problems

There are 2 main tools that most SEOs use to check for crawlability and indexing problems, and both of them are free.

Screaming Frog SEO Spider

Screaming Frog is an OG SEO tool that essentially acts like a search engine’s crawler. It’ll crawl from one page to another, using internal links and your sitemap(s) and report back on its findings.

For this topic, what we’re interested in are the Status Codes and Indexability Status columns. Let’s take a closer look at how to set up one of these crawls and check the status:

  1. Download and install Screaming Frog SEO Spider (note: A paid license will be required if you have >500 URLs)
  2. Type your URL into the address bar at the top and click Start
    A screenshot showing how to initiate a crawl in Screaming Frog SEO Spider
  3. Wait for the crawl to run

As the crawl runs, you’ll start to see the results populate one row at a time. Once the progress bar in the bottom right corner shows 100% complete, start by clicking the Indexability column header to sort results by that column.

A Screaming Frog crawl with results sorted by Indexability Status

What you’re looking at now is every URL (page, image, CSS file, PDF etc.) on your website, sorted to show non-indexable URLs at the top. Anything marked as “Non-Indexable” in this report won’t be stored in a search engine’s index — either because crawlers are blocked from accessing it, or because they’ve been told not to index it.

In the next column, you’ll see the reason for each URL being non-indexable. Generally, it’s going to be at least one of these:

  • Blocked by robots.txt: Crawlers are being blocked from accessing these pages via the robots.txt file. This text file gives instructions to crawlers — generally telling them what they can and cannot access, as well as where to find your sitemap(s). To find yours (assuming you have one), just add /robots.txt to the end of your website address. For example, nytimes.com/robots.txt. There’s also a quick way to test these rules sketched just after this list.
  • Noindex: The noindex tag (<meta name="robots" content="noindex">) has been placed in the <head> section of these pages, explicitly telling crawlers not to index them.
  • Canonicalised: The canonical tag tells search engines which version of a page is the ‘master copy’ and that the other versions should be ignored. More on this below.
  • Redirected: A redirect has been set up for these URLs, meaning that if a user tries to visit that URL, they’ll be redirected to a different one.
  • Client Error: The URL returned a 4xx error code, preventing Screaming Frog from accessing it. Most commonly this is a 404 (the page doesn’t exist) or a 403 (the server denied access, often because a security feature is blocking the crawler).
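
If you want to double-check a robots.txt rule outside of Screaming Frog, here’s a minimal sketch using Python’s built-in urllib.robotparser module. The domain and paths below are placeholders — swap in your own site and the URLs you’re unsure about.

from urllib.robotparser import RobotFileParser

# Point the parser at your own robots.txt (placeholder domain below)
robots = RobotFileParser("https://www.example.com/robots.txt")
robots.read()  # fetches and parses the file

# can_fetch() returns False when the rules block that user-agent/URL combination
print(robots.can_fetch("Googlebot", "https://www.example.com/some-page/"))
print(robots.can_fetch("*", "https://www.example.com/ads-lp/october"))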

We’ll get into fixing these below.

Google Search Console

The other free tool we often use to check for crawl errors is Google Search Console (GSC). This is Google’s free platform that gives you all kinds of data around your website’s appearance and performance in the search results.

While Screaming Frog is showing you a real-time snapshot of what’s theoretically visible, GSC shows what Google has actually observed over time. The caveat is that some of what you’ll see in there can be safely ignored, which makes the reports a little less black and white.

One of the first snapshots you’ll see from the Overview tab is the Indexing report. This gives you a snapshot of your indexable and non-indexable URLs at a glance.

Google Search Console's indexing report

By clicking ‘Full report’, you’ll get a more detailed breakdown. You can also access this report by clicking Pages under the Indexing portion in the left navigation menu.

Screenshot of the more detailed Indexing report in Google Search Console

At the bottom, you can see a breakdown of how many pages fall into each category:

  • Not found (404): The page doesn’t exist.
  • Excluded by ‘noindex’ tag: A noindex tag (<meta name="robots" content="noindex">) has been placed in the <head> section of these pages, explicitly telling crawlers not to index them.
  • Crawled – currently not indexed: Google has crawled this page and decided it’s not a valuable addition to their index. Take a look at Moz’s guide for a more detailed look at the topic.
  • Page with redirect: These URLs just redirect to a different page. For example, if you set up a redirect from /blog/old-post-version to /blog/new-post-version, the old version will show up as a ‘page with redirect’ in this report.
  • Alternate page with proper canonical tag: These are pages that have successfully used the canonical element to point search engines to a different page. Canonicalization is a complex topic — take a look at Google’s guide for more info here.
  • Duplicate without user-selected canonical: Google has found two or more pages it deems to be duplicates, but no canonical tag was used to indicate which is the main version. Google will decide for you and leave the other versions out of its index.
  • Blocked by robots.txt: Crawlers are being blocked from accessing these pages via the robots.txt file (see the explanation and quick test in the Screaming Frog section above, plus the dedicated robots.txt section below).
  • Blocked due to access forbidden (403): For one reason or another, Google was not granted access to view these pages. This is generally due to a server or plugin security feature that’s blocking unnecessary access.
  • Discovered – currently not indexed: Google is aware of these pages but hasn’t crawled or indexed them yet. This is generally because Google has decided they’re too low quality to look at any closer. Ahrefs has a great breakdown of this topic.

Note that the goal of this report isn’t to reduce your ‘not indexed’ number to 0. When done correctly, it can be a good thing to have select pages excluded by noindex. 404s can be perfectly fine and it’s not inherently bad to have URLs blocked by robots.txt.

The key here isn’t to do away with all blocked pages; it’s to make sure the URLs that show up in this report are there deliberately, not through error.

How to improve your crawlability

Now that we know the scope of your crawlability and indexation problems, let’s get to work fixing them. While you could start going through one page at a time and addressing them, it’s better to take a long term, holistic approach.

Clean up your site architecture

If your site and URL architecture are a mess, then you run the risk of further problems down the track. Overhauling things from the top down is often the most time efficient way to resolve a lot of these problems.

An effective site architecture means:

  • A taxonomy that’s simple for users to navigate
  • A page hierarchy that’s as flat as practical
  • A URL structure that matches this taxonomy. Short, simple and easy to understand.

For example, if your URLs look something like shoestore.com/category/black/leather/mens and you have an equally complex navigation menu, simplify.

shoestore.com/mens-shoes/leather is a much cleaner, easier to navigate structure.

The simpler your site architecture, the lower your risk of near-duplicate pages and accidental errors.

Pro Tip: Don’t forget to set up 301 redirects from your old URL structure to the new versions, or you’ll lose rankings that would otherwise stay intact.

Fix internal broken and redirected links

The best place to identify these is in Screaming Frog. Once you have a completed crawl, sort the Status Code column from high to low and look at any URL that has a 3xx or 4xx code (e.g. 301, 302, 403, 404).

Any URL with a 3xx status code means you’re linking to that page, which then redirects to another.

URLs with a 4xx status code mean you’re linking to that page and it no longer exists.

The most efficient way to address these is to gather up the inlinks, export to CSV and drop them into a spreadsheet to work your way through. Here’s how to do it:

  1. Select all URLs with a 3xx or 4xx status code
  2. Click the Inlinks tab at the very bottom of the Screaming Frog window
  3. Click the Export button
    A screenshot of the inlinks tab in Screaming Frog SEO Spider
  4. Choose a file name and save location, then click Save
  5. Open a new spreadsheet
  6. Import the CSV you just saved from Screaming Frog (File -> Import)

What you’re now left with is a spreadsheet that shows every internal link on your website that points to a broken or redirected page.

Screaming Frog inlinks exported to Google Sheets

The From column shows you the page where you’ll find the offending link, the To column shows you which 3xx or 4xx page the link currently points to, and the Anchor Text column is the clickable text used in that link. For example, in a link to our SaaS SEO page where the visible text reads “SaaS SEO”, that text is the anchor text. You can use Ctrl + F to search for that anchor text on the ‘From’ page to make finding the link easier.
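
If the export is large, it can help to see which broken or redirected destinations attract the most internal links before you start editing. Here’s a rough sketch using the pandas library — it assumes your export keeps the From, To and Status Code column names described above, so adjust them to match your file.

import pandas as pd

# Load the inlinks export you saved from Screaming Frog (use your own file name)
df = pd.read_csv("inlinks.csv")

# Keep only links pointing at redirected (3xx) or broken (4xx) destinations
problem_links = df[df["Status Code"].between(300, 499)]

# Count how many internal links point at each problem destination
summary = (
    problem_links.groupby(["To", "Status Code"])
    .size()
    .sort_values(ascending=False)
)
print(summary.head(20))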

Now for the tedious part. You need to go through each of these rows and update the link to a valid URL.

For 3xx links, just update the link to the new destination. For example, if your link points to /old-page/ which then redirects to /new-page/, edit the link so it points directly to /new-page/, removing the need to redirect.
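
If you’re not sure where a redirecting URL finally lands, a quick sketch like this — using the Python requests library, with the placeholder URL from the example above — will print the redirect chain and the final destination you should link to instead:

import requests

# Placeholder URL: replace with the redirecting URL from your spreadsheet
resp = requests.get("https://website.com/old-page/", timeout=10)

for hop in resp.history:               # each redirect in the chain
    print(hop.status_code, hop.url)
print("Final destination:", resp.url)  # update your internal link to point here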

For 4xx links, the fix is less obvious. Since the page being linked to no longer exists, you’ll have to either identify the most suitable page to point to, or remove the link entirely. For any URLs where you’re unsure what was on that page, the Wayback Machine will show you what the page used to look like. This makes it much easier to find a suitable replacement.
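
To speed up that detective work, the Wayback Machine also exposes a simple availability endpoint you can query programmatically. This is only a rough sketch — the endpoint and response shape assumed below come from the public Wayback availability API, so check it against the current documentation, and swap in a real URL from your 4xx list.

import requests

dead_url = "https://website.com/some-deleted-page/"  # placeholder URL from your 4xx list

# Ask the Wayback Machine whether it holds an archived snapshot of this URL
resp = requests.get("https://archive.org/wayback/available", params={"url": dead_url}, timeout=10)
snapshot = resp.json().get("archived_snapshots", {}).get("closest")

if snapshot:
    print("Archived copy:", snapshot["url"])
else:
    print("No snapshot found for", dead_url)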

Review your robots.txt file

The instructions you give in your robots.txt file can have major implications for the crawlability of your website. At worst, a single line can block your entire website from being crawled:

User-agent: *
Disallow: /

What this says is ‘all crawlers, disallow access to the entire website’. Not great if you’re working on your SEO!

If you have no robots.txt file, then access is wide open for crawlers. This isn’t the worst thing in the world, but you’re losing some efficiency and control. Ideally, you’ll use this file to block access to any content you don’t want search engines to see, focusing their resources on the right pages.

For example, you might have 10 minor variations of the same page on your website, used for Google Ads. Since you don’t want Google thinking you have 10 near-identical pages, it’s best to block crawler access to them entirely. If all of those landing pages sit in the /ads-lp/ folder (e.g. website.com/ads-lp/october), then you can block access to them all using a single line:

User-agent: *
Disallow: /ads-lp/
Sitemap: https://website.com/sitemap.xml

Note the inclusion of the sitemap reference in the above example, too. This is good practice because it makes it easier for search engines to locate your sitemap, since not all sites use the standard /sitemap.xml location.

The correct setup of your robots.txt will be very specific to your website and CMS. Take a look at Google’s Search Central guide on creating your robots.txt file for a detailed explanation.

Note: While disallowing crawl access via robots.txt will generally exclude that page from search results, there are some circumstances where that won’t be the case. If you need to keep sensitive content out of search results, use the noindex tag instead.
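
To confirm a page really is carrying a noindex directive, a quick check like the sketch below can help. It uses the requests and beautifulsoup4 packages (you’ll need both installed), and the URL is a made-up placeholder — it looks for the meta robots tag in the HTML and for the X-Robots-Tag response header, which is the other common way noindex is delivered.

import requests
from bs4 import BeautifulSoup

url = "https://website.com/private-page/"  # placeholder URL
resp = requests.get(url, timeout=10)

# Look for a <meta name="robots"> tag in the page's HTML
meta = BeautifulSoup(resp.text, "html.parser").find("meta", attrs={"name": "robots"})
print("Meta robots tag:", meta.get("content") if meta else "none found")

# Noindex can also be sent as an HTTP header rather than a meta tag
print("X-Robots-Tag header:", resp.headers.get("X-Robots-Tag", "none"))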

Check your use of canonicalization

The canonical tag tells search engines which version of a page is the ‘master copy’ and that the other versions should be ignored.

A common example of this is when you’re using UTM codes to monitor the success of your marketing campaigns. You might end up with URLs like these on your site:

https://website.com/accounting
https://website.com/accounting/?utm_source=facebook&utm_medium=social&utm_campaign=accounting-lp
https://website.com/accounting/?utm_source=instagram&utm_medium=social&utm_campaign=accounting-lp
https://website.com/accounting/?utm_source=linkedin&utm_medium=social&utm_campaign=accounting-lp
https://website.com/accounting/?utm_source=tiktok&utm_medium=social&utm_campaign=accounting-lp

All of these URLs are valid and they all lead to the exact same page. As users, we understand that it’s the same page, but what search engines see is 5 different URLs (pages) with identical information. Since we don’t want search engines thinking we have batches of identical pages on our website, we use the below canonical tag in the <head> section of each page to tell them the top one is ‘the’ page and any links or traffic going to those UTM variants should be attributed to that main version.

<link rel="canonical" href="https://website.com/accounting" />

Note that you must get the URL exactly right here. If you accidentally type http:// instead of https:// or you forget a trailing slash when your URLs use them, this could cause disaster. It’s usually best to copy/paste the URL to avoid human error.
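
One way to catch those small mismatches is to check what canonical URL a page actually declares. Here’s a minimal sketch using the requests and beautifulsoup4 packages (both need to be installed), pointed at the hypothetical UTM variant from the example above:

import requests
from bs4 import BeautifulSoup

# One of the hypothetical UTM variants from the example above
url = "https://website.com/accounting/?utm_source=facebook&utm_medium=social&utm_campaign=accounting-lp"

html = requests.get(url, timeout=10).text
canonical = BeautifulSoup(html, "html.parser").find("link", rel="canonical")

print("Declared canonical:", canonical.get("href") if canonical else "none found")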

The other reason SEOs use canonical tags is for directing backlink value. If each of the URLs above gained 2 backlinks from different websites, our canonical tag example makes sure the majority of that link value is being pointed directly at the /accounting page rather than spreading it across all 5 URLs. Effectively, it’s concentrating that link value into one place.

Take a look at Ahrefs’ guide to better understand and implement canonical tags on your website.

Review your existing content and remove or rewrite low quality pages

If your site is packed with old, dusty content that offers little to no value, don’t just let it sit there. Either remove, canonicalise or rewrite those pages.

By cutting the clutter from your website, you’re making it easier to crawl and maintain while also boosting the overall quality of your website. If you are going to delete pages though, it’s a good idea to redirect those URLs to a relevant page to retain any link value you might have pointing to them.

Frequently Asked Questions

How do you improve a website’s crawlability?

To improve your website’s crawlability, clean up your website’s architecture, remove broken links and redirects and implement the canonical tag correctly. You’ll also want to check your robots.txt file and have a look at the Pages report in Google Search Console to identify individual errors.

Why is crawlability important in SEO?

Crawlability is important in SEO because if a page can’t be crawled, it can’t appear in the search results! By improving your site’s crawlability, you’re boosting search engines’ ability to see the content on your site, index it and show it to your audience when they search for it.

What do crawlability and indexability mean?

Crawlability and indexability refer to how easy it is for search engines and other crawlers to navigate through your website and ‘look at’ each page.

Although they’re often used interchangeably, they’re not quite the same. If a URL is crawlable, that means the crawler can see that it exists. If a URL is indexable, then the crawler can ‘look at’ that page and take a snapshot of it. Just because a page is crawlable doesn’t necessarily mean it’s indexable.

Where can I check my site’s crawlability?

The two best places to check your site’s crawlability are Google Search Console’s Page Indexing report and Screaming Frog SEO Spider.

You’ll find more info above on how to leverage each of these reports to review your crawlability in more detail.