In March 2026, Google senior engineer Gary Illyes pulled back the curtain on the inner workings of Googlebot, the web crawler that powers the company’s search index. His post, titled “Inside Googlebot: demystifying crawling, fetching, and the bytes we process,” offers a detailed look at how Google’s crawling ecosystem operates, the limits it imposes on fetched content, and what those limits mean for site owners and developers.
The Many Faces of Googlebot
When people think of Googlebot, they often picture a single, monolithic crawler. In reality, Google runs a fleet of specialized crawlers, each designed for a particular type of content or task. The term “Googlebot” is therefore a shorthand for a complex system that includes:
- HTML crawlers that discover and index web pages;
- Image crawlers that fetch and analyze photographs, icons, and other visual assets;
- Video crawlers that gather metadata and thumbnails for video content;
- PDF crawlers that read and index PDF documents;
- Fetchers that retrieve resources for rendering and testing.
Google documents its various user agents and fetchers in detail. Knowing which crawler is requesting your pages can help you troubleshoot indexing issues and optimize your content for the crawler that actually consumes it.
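As a rough illustration of telling the crawlers apart, the sketch below labels a request by looking for Google’s documented user agent tokens (Googlebot, Googlebot-Image, Googlebot-Video) in the User-Agent header. The function name and the labels are hypothetical, and a string match alone does not prove a request really came from Google, since the header can be spoofed.

```python
# Substrings to look for in the User-Agent header, most specific first.
# The tokens are ones Google documents; the labels are our own (hypothetical).
CRAWLER_TOKENS = [
    ("Googlebot-Image", "image crawler"),
    ("Googlebot-Video", "video crawler"),
    ("Googlebot", "general web crawler"),
]


def classify_user_agent(user_agent: str) -> str:
    """Return a rough label for a User-Agent string, or 'other'."""
    for token, label in CRAWLER_TOKENS:
        if token in user_agent:
            return label
    return "other"


ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
print(classify_user_agent(ua))  # general web crawler
```

The more specific tokens are checked first because “Googlebot” is also a substring of “Googlebot-Image” and “Googlebot-Video”.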
Size Limits and What They Mean for Your Site
One of the most frequently asked questions from webmasters is: “How big can my pages be before Google stops crawling them?” Gary Illyes clarified that Googlebot enforces strict byte limits on the resources it fetches. These limits are designed to balance the need for comprehensive indexing with the practical constraints of bandwidth and processing power.
Below is a concise summary of the current limits as of 2026:
- HTML pages: up to 2 MB per URL (including HTTP headers). Anything beyond that is truncated at the 2 MB boundary.
- PDF files: up to 64 MB per document.
- Images and videos: thresholds vary widely depending on the product that consumes the asset; no single universal limit applies.
- Other content types: default limit is 15 MB unless a specific crawler defines a different value.
These limits apply to the bytes Googlebot fetches, not to the total size of the file on your server. If a page is larger than 2 MB, Googlebot does not reject it outright; it fetches the first 2 MB and then stops.
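To make the arithmetic concrete, here is a minimal sketch that estimates how much of an HTML response would fall inside the 2 MB limit. It assumes 1 MB means 1,048,576 bytes (the post does not say which definition it uses) and that header bytes count toward the limit, as the summary above states; the function name and its parameters are hypothetical.

```python
MB = 1024 * 1024      # assumption: binary megabytes; the source does not specify
HTML_LIMIT = 2 * MB   # HTML limit from the summary above


def truncation_report(body_size: int, header_size: int = 0,
                      limit: int = HTML_LIMIT) -> dict:
    """Estimate how many bytes of a response fit inside the fetch limit."""
    total = body_size + header_size   # headers count toward the limit
    fetched = min(total, limit)       # the fetch stops at the limit
    return {
        "fetched": fetched,
        "truncated": total > limit,
        "bytes_dropped": total - fetched,
    }


# A 3 MB page with 500 bytes of headers: everything past 2 MB is dropped.
print(truncation_report(3 * MB, header_size=500))
```

A page whose headers and body together stay under the limit comes back with `truncated` set to `False` and zero dropped bytes.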
The Crawling Process Explained
Once Googlebot reaches a URL, it follows a three‑step process to determine whether and how the content will be indexed:
- Partial fetching: Googlebot downloads up to the configured byte limit. For an HTML page exceeding 2 MB, the fetch stops exactly at the 2 MB cutoff, including the HTTP request headers.
- Processing the cutoff: The downloaded portion—whether it’s the first 2 MB of an HTML page, the first 64 MB of a PDF, or whatever the limit allows—is handed off to Google’s indexing pipeline. The Web Rendering Service (WRS) also receives this data to render the page as a user would see it.
- Indexing decision: Based on the fetched content, Google determines whether the page is valuable enough to index. If the truncated portion contains enough signals (text, links, structured data), the page may still be indexed. If critical information lies beyond the cutoff, the page might be partially indexed or omitted entirely.
Because Googlebot’s fetch is limited, place the most important content (keywords, headings, structured data) within the first 2 MB of the page. That way, the signals Google needs to index and rank the page fall inside the portion that is actually fetched.
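One way to check this on your own pages is to simulate the cutoff locally and see which key snippets survive it. The sketch below is only an approximation under stated assumptions: 1 MB is taken as 1,048,576 bytes, header bytes are subtracted from the budget, and the function name and marker strings are hypothetical examples rather than anything Google publishes.

```python
MB = 1024 * 1024  # assumption: binary megabytes; the source does not specify


def markers_within_limit(html: bytes, markers: list[bytes],
                         header_size: int = 0, limit: int = 2 * MB) -> dict:
    """Report which byte markers appear fully inside the fetched portion."""
    budget = max(limit - header_size, 0)   # bytes left for the body
    visible = html[:budget]                # simulate the truncated fetch
    # A marker that straddles the cutoff is not found in `visible`,
    # so it is reported as missing.
    return {marker: marker in visible for marker in markers}


# Example page: a heading up front, then 2 MB of padding, then structured data.
page = (b"<h1>Title</h1>"
        + b" " * (2 * MB)
        + b'<script type="application/ld+json">{}</script>')

print(markers_within_limit(page, [b"<h1>", b"application/ld+json"]))
# {b'<h1>': True, b'application/ld+json': False}
```

In this example the heading is seen but the structured data is not, which is the situation the advice above is meant to prevent.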
