Phases and Elements of the Crawling Process

Crawling, spiders, bots — these are terms any SEO is used to handling day to day, and they carry essential weight in any ranking strategy, because if this phase fails, the rest will too.
Let's look in detail at what a web crawling process consists of.
What does it mean to crawl a website?
Before moving on, let's define the process of crawling a website, showing the importance it holds within any attempt to appear in Google's search results.
Crawling a website is understood as the process by which spiders or crawlers travel through the different pages of a site, gathering all the accessible information in order to store it, process it, and later classify it.
It's worth highlighting a few fundamental terms within the definition we've just laid out:
- Journey: Picture an actual spider. This friendly creature has to pass through as many pages as possible to extract as much information as it can. To go from one page to another, it travels through the internal links that connect them. Hence the importance of correct internal linking that allows these spiders to "discover", if not the entirety, at least the most relevant pages for us (a toy crawler illustrating this journey follows this list).
- Accessibility: The information has to be accessible to these spiders. That is, if we are limiting their access in some way, intentionally or by mistake, we will be preventing the spiders from processing all the content, and therefore from understanding and ultimately classifying it. This blocking or limitation of page content can occur in several different ways, which we'll explain further along in this post.
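To make the journey idea concrete, here is a toy crawler: a minimal Python sketch that starts at one page and discovers others by following internal href links. The start URL is a placeholder, and a real crawler would add politeness delays, robots.txt checks, and better error handling:

from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class HrefExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href" and value)

def crawl(start_url, limit=20):
    """Breadth-first "journey": follow internal links page by page."""
    site = urlparse(start_url).netloc
    seen, queue = {start_url}, [start_url]
    while queue and len(seen) < limit:
        url = queue.pop(0)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # inaccessible page: the spider simply cannot process it
        extractor = HrefExtractor()
        extractor.feed(html)
        for href in extractor.hrefs:
            link = urljoin(url, href)
            # Stay on the same domain: this is the "internal linking" path
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

print(crawl("https://www.example.com/"))  # placeholder start URL

Notice that the spider only reaches what the links expose to it: a page with no internal link pointing at it simply never enters the queue.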
The crawlers
We've talked about spiders, also known as crawlers or bots. We can define them as programs that analyze the documents on our website, that is, they are like "librarians" that search, classify and organize. Their main function is therefore to build databases. There are several types, depending on the kind of information they collect. Let's mention some of the most common.
Googlebot: The spider in charge of crawling our content and categorizing it within the organic results (SERPs). For SEOs, it's the most important one.
Within this type we can distinguish some subtypes:
- Googlebot (smartphone): Mobile version
- Googlebot (desktop): Desktop version
- Googlebot Images: In charge of crawling images
- Googlebot News: For news
- Googlebot Video: Now it's the turn of videos
Example of a bot identified in our logs:
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
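Keep in mind that anyone can spoof this user-agent string. If you want to confirm that a hit in your logs really came from Google, the verification Google itself documents is a reverse DNS lookup followed by a forward confirmation. A minimal sketch in Python (the IP is a placeholder from Google's published crawl ranges):

import socket

def is_real_googlebot(ip):
    """Reverse-resolve the IP, check the hostname, then confirm it resolves back."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)  # reverse DNS lookup
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]  # forward confirmation
    except socket.gaierror:
        return False

print(is_real_googlebot("66.249.66.1"))  # placeholder IP from Google's crawl range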
They're not the only ones; there are others such as AdsBot, the AdSense crawler, etc. We've already covered the ones most relevant to SEO, and differentiating the rest is not the focus of this article, but you can find additional information in Google's official crawler documentation.
Phases of Google's crawling and indexing process
Now that we know what crawling is and who carries it out, let's look at the process in more specific detail.
First phase: crawling and classification
The process by which our pages appear in Google's results goes through a first phase of crawling, as we've seen, performed by the spiders (crawlers), so that they read, interpret, index and classify our content.
It's this last word we want to analyze in detail: classify. Google has to understand our content perfectly, simply and quickly because, as we'll see later, Google spends a finite amount of time on our website, and in that time it must "understand" our content and associate it with users' different search intents.
That's why in modern SEO the word "Search Intent" is heard so often, since Google will take it into account in that classification and it will define the position our pages occupy in the SERP rankings.
That's why the crawling process has to be clean, simple, fast, without obstacles, etc., so that everything is clear and we are classified correctly.
Second phase: indexing
We can't forget the indexing phase, which precedes classification and also plays a fundamental role, since it's the step where Google adds our content to its database, that is, indexes it.

Blocking Google's robots
We mentioned earlier that there are ways in which we could be limiting these spiders' access to our content. To control this, there is an element of vital importance in SEO known as robots.txt.
The robots.txt file is a text file we upload to our server, in which we give precise instructions to the different spiders to allow or block them from crawling URLs on our site. This blocking can be applied:
- to the entire domain
- to a specific path
- to a specific URL
- or to a set of URLs that match a certain pattern.
Let's see an example configuration of this file:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://www.example.com/sitemap.xml
As we can see, it has a first line where we specify the user-agent (the name of the crawler we want to block or allow, from those we saw earlier), followed by Disallow directives to prohibit entry or Allow directives to permit it.
In the specific case we see, by indicating with a * we are saying "all crawlers", without exception. We are prohibiting them from entering the /wp-admin/ path, but within that path we want to allow them to enter /admin-ajax.php.
An incorrect configuration of this file can cause us to block important parts of our content. It's a common mistake to block the entire website while it's being developed and then forget to remove that block after putting it into production, leaving it inaccessible to Google.
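A quick way to catch this kind of mistake is to test the rules programmatically. Here's a minimal sketch using Python's built-in robots.txt parser, with our example rules inlined and the domain as a placeholder. One caveat: Python's parser applies rules in file order (first match wins), while Googlebot uses the most specific matching rule, so the Allow line goes first when testing locally:

from urllib import robotparser

# Allow listed first: Python's parser is first-match, Googlebot is longest-match
rules = """User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

for path in ("/wp-admin/settings.php", "/wp-admin/admin-ajax.php", "/blog/"):
    allowed = rp.can_fetch("Googlebot", "https://www.example.com" + path)  # placeholder domain
    print(path, "->", "allowed" if allowed else "blocked")

If a URL you care about prints "blocked", you've found a problem before Google does.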
Another problem Google's spiders might encounter when crawling our content is not being able to follow the internal links on our website, and therefore not reaching the rest of the URLs. This happens when we build those links with JavaScript elements instead of an "href" attribute. The practice is very common, since JS brings many advantages at the user-experience level, but if it's applied to internal links incorrectly, Google may not be able to follow them.
In the SEO world this is known as "link obfuscation". As of today, it's an open debate whether Google is capable of crawling and rendering pages made in JS correctly.
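To make the difference concrete, here's a small Python sketch using the standard library's HTML parser: a crawler that relies on href attributes will discover the first link below and never see the second (the markup is illustrative):

from html.parser import HTMLParser

page = """
<a href="/services/">Crawlable link: a real href</a>
<span onclick="window.location='/services/'">JS pseudo-link: no href to follow</span>
"""

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only <a> tags with an href expose a followable URL
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

collector = LinkCollector()
collector.feed(page)
print(collector.links)  # ['/services/'] -- the onclick "link" is invisible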
Server response codes
To continue understanding this process well, we can't overlook a concept SEOs have to deal with daily: server response codes.
Earlier we saw the cycle by which Google finds us, but how does this happen? A user performs a search (a query) on Google. The search engine goes to its database and shows the most relevant results (SERPs), according to the classification it has made for that search.
Once the user sees the different results (impressions), they click on the one that, in their judgment, best fits what they need. At that moment a request reaches the server where the website is hosted so that it "serves" the content; Google's crawlers trigger the same kind of request every time they visit a URL.
When this occurs, the server responds with the corresponding code. Let's name the most relevant ones that, as SEOs, we must take into account (a small sketch for checking these codes follows the list):
- 200: This response code tells Google that the page exists, has content and that there's no problem showing it. It's the most desired by SEOs, as long as the content of that page returning a 200 is optimal.
- 30x: The 30x status code family corresponds to redirects. The most notable are 301 (permanent), 302 and 307 (temporary). Basically they tell Google "hey, this URL A that you've requested isn't here anymore, it's now this other URL B". There are more, but they're not the focus of the concept we're developing. It's important to know that, as SEOs, the preferred ones are 301s, which transfer all the authority.
Recommended reading: Tutorial on 301 redirects
- 40x: Error codes. The least desired by SEOs. The most common is the famous 404. When this code appears, we're telling Google, in response to its request for a URL, that it no longer exists and is therefore an error.
- 410: We've singled this one out from the 40x family for its SEO value. When we respond to a request from Google with this code, we're telling it that the URL is "gone for good". It's interesting because, unlike the 404, Google understands the URL will never return and stops trying to crawl it, while with a 404 it will crawl the URL again, assuming we may want to fix it.
- 50x: This type of response is linked to server errors. When our machine fails for some reason and Google tries to request the content of some URL, the server returns a 50x status code, such as 500 (internal server error) or 503 (service unavailable).
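If you want to see these codes first-hand, you can request your own URLs and inspect the raw responses. A minimal sketch with Python's standard library (the URLs are placeholders); http.client doesn't follow redirects, so 30x codes stay visible:

import http.client
from urllib.parse import urlparse

def status_of(url):
    """Issue a HEAD request and return the raw status code (redirects not followed)."""
    parts = urlparse(url)
    conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    conn.request("HEAD", parts.path or "/")
    status = conn.getresponse().status
    conn.close()
    return status

# Placeholder URLs: a live page, an old redirected one, a missing one
for url in ("https://www.example.com/",
            "https://www.example.com/old-page/",
            "https://www.example.com/no-such-page/"):
    print(url, "->", status_of(url))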
Crawl Budget
At this point in the post, we still need to address a term that became popular a couple of years ago in the SEO world, known as crawl budget.
The crawl budget refers to the time Google's spiders spend crawling a website and all its URLs. It is, as we said earlier, a finite time. Hence the importance of having our website optimized, in order to make it easier for it to see the most relevant pages of our site in that time.
This time that crawlers spend going through our website is not a fixed value, it will grow or decrease depending on aspects such as the frequency with which we update the content, the authority of our domain (popularity), etc.
The higher the quality of our website, the greater its authority, and the fresher its content, the more relevant Google will consider us, and the more budget it will allocate to crawling us.
With crawling programs such as Screaming Frog, we can simulate ideal crawls of our website, that is, as if the spiders had all the time in the world to go through each and every one of our URLs.
But this isn't how it works with Googlebot: every time Google visits our website, it will visit some URLs more than others, and there may be some it doesn't visit at all. We can analyze this with what are known as server logs: records of which URLs Google has requested, and how often, in a given period (a minimal log-parsing sketch follows).
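As a starting point for that log analysis, here's a hedged sketch that tallies Googlebot hits per URL from an access log in the common "combined" format. The file name and log layout are assumptions; adapt them to your server (and ideally verify the bot as shown earlier):

import re
from collections import Counter

# Combined log format: ip - - [date] "METHOD /path HTTP/x" status size "referer" "user-agent"
LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[^"]*" \d{3} .*"(?P<ua>[^"]*)"$')

hits = Counter()
with open("access.log", encoding="utf-8") as log:  # assumed file name
    for line in log:
        match = LINE.search(line)
        if match and "Googlebot" in match.group("ua"):
            hits[match.group("path")] += 1

# The URLs Googlebot visits most often; anything missing here got no visits at all
for path, count in hits.most_common(10):
    print(f"{count:6d}  {path}")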
That wraps up our analysis of what crawling is and the different elements that make up Google's crawling system.
Any questions or suggestions? As always... we'd love to hear from you!
Author: David Kaufmann

I've spent the last 10+ years completely obsessed with SEO — and honestly, I wouldn't have it any other way.
My career hit a new level when I worked as a senior SEO specialist for Chess.com — one of the top 100 most visited websites on the entire internet. Operating at that scale, across millions of pages, dozens of languages, and one of the most competitive SERPs out there, taught me things no course or certification ever could. That experience changed my perspective on what great SEO really looks like — and it became the foundation for everything I've built since.
From that experience, I founded SEO Alive — an agency for brands that are serious about organic growth. We're not here to sell dashboards and monthly reports. We're here to build strategies that actually move the needle, combining the best of classical SEO with the exciting new world of Generative Engine Optimization (GEO) — making sure your brand shows up not just in Google's blue links, but inside the AI-generated answers that ChatGPT, Perplexity, and Google AI Overviews are delivering to millions of people every single day.
And because I couldn't find a tool that handled both of those worlds properly, I built one myself — SEOcrawl, an enterprise SEO intelligence platform that brings together rankings, technical audits, backlink monitoring, crawl health, and AI brand visibility tracking all in one place. It's the platform I always wished existed.