
Crawl Budget Optimisation: Capacity Limits and Demand Signals

How crawl capacity and demand determine which pages search engines prioritise, and practical strategies to eliminate waste on large-scale sites.

For sites with thousands or millions of pages, search engines must prioritise what to crawl—and some pages may never be indexed at all. This article covers how crawl capacity and demand determine which pages get crawled, and practical strategies to reduce waste.

What is crawl budget?

Crawl budget refers to the number of pages a search engine will crawl on your site within a given timeframe. It's determined by two key factors: crawl capacity limit (how much crawling your server can handle) and crawl demand (how much Google wants to crawl based on perceived value).

Google's crawlers don't have unlimited resources. They must balance crawling billions of pages across the web while being respectful to individual servers. For most small to medium sites, crawl budget isn't a concern; Google will find and index your content. However, for large sites with thousands or millions of pages, crawl budget directly determines how quickly new content reaches search results, and whether some pages get indexed at all.

Within a broader technical SEO strategy, crawl budget optimisation is often one of the highest-impact interventions we make on large sites.

The two components of crawl budget

Crawl capacity limit

This is the maximum crawling Google will do without negatively impacting your server. If your site responds slowly, returns errors, or explicitly limits crawling via robots.txt, Google will reduce its crawl rate. Factors include:

  • Server response time and stability
  • Hosting infrastructure capacity
  • Crawl-delay directives (though Googlebot ignores Crawl-delay entirely; other crawlers may honour it)
  • Rate of 5xx errors encountered

Crawl demand

Even if your server could handle more, Google only crawls what it deems valuable. Crawl demand increases when:

  • Pages are popular (receive external links, traffic)
  • Content is updated frequently
  • New pages are discovered
  • The site has demonstrated quality and authority

Does crawl budget affect your site?

For most sites under 10,000 pages with decent server performance, crawl budget is not a limiting factor. Google will typically crawl and index your content without intervention.

Note: Focus on crawl budget optimisation only when you have evidence of indexing delays or incomplete coverage. The diagnostic steps below help confirm whether crawl budget is actually your bottleneck.

Crawl budget optimisation becomes relevant when:

  • Your site has 50,000+ pages
  • Google Search Console shows significant "Discovered – currently not indexed" URLs
  • Log analysis reveals Googlebot not reaching important sections
  • You've recently migrated or restructured a large site
  • Your server consistently shows high response times during crawls

Factors that influence crawl budget

Each factor below is rated by the scale of its impact on crawl budget:

  • Server response time (critical): slow responses mean fewer pages crawled per session
  • HTTP status codes (critical): 5xx errors waste budget; clean 404s are fine
  • Duplicate content (high): exact and near-duplicates fragment crawl attention
  • URL parameters (high): infinite combinations can trap crawlers
  • Faceted navigation (high): filter combinations explode URL counts exponentially
  • Soft 404s (high): pages returning 200 but showing error content get re-crawled repeatedly
  • Redirect chains (medium): each hop consumes crawl resources
  • XML sitemap quality (medium): signals canonical preference and freshness to crawlers
  • Click depth (medium): pages many clicks from the homepage get crawled less frequently
  • Page freshness (medium): frequently updated pages attract more crawls
  • Orphan pages (medium): pages with no internal links waste crawl resources

Diagnosing crawl budget issues

Before optimising, confirm that crawl budget is actually your problem. Many sites blame crawl budget for indexing issues that stem from content quality or technical errors.

Using Google Search Console

Navigate to Settings → Crawl Stats to review Googlebot's behaviour on your site; a sketch for checking these thresholds against exported data follows the list:

  • Total crawl requests: Look for declining trends over 90 days. A drop exceeding 20% warrants investigation.
  • Average response time: Target under 500ms. Response times exceeding 1 second trigger crawl throttling.
  • Response code distribution: Healthy sites show 95%+ responses as 200 OK, under 3% redirects, under 2% client errors.
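
The sketch below flags these thresholds from exported Crawl Stats data. The column names (date, total_requests, avg_response_ms) are assumptions about the export, so adjust them to match the actual file.

import pandas as pd

def check_crawl_stats(csv_path):
    """Flag declining crawl trends and slow responses in exported crawl stats."""
    # Assumed columns: date, total_requests, avg_response_ms
    df = pd.read_csv(csv_path, parse_dates=['date']).sort_values('date')
    recent = df.tail(90)

    # Compare the first and last 30-day windows to spot a decline over ~90 days
    first_window = recent.head(30)['total_requests'].mean()
    last_window = recent.tail(30)['total_requests'].mean()
    change_pct = (last_window - first_window) / first_window * 100
    if change_pct < -20:
        print(f"Crawl requests down {abs(change_pct):.0f}% over the period; investigate")

    # Average response time target: under 500ms
    avg_response = recent['avg_response_ms'].mean()
    if avg_response > 500:
        print(f"Average response time {avg_response:.0f}ms exceeds the 500ms target")

    return change_pct, avg_response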

The Index Coverage report (Indexing → Pages) reveals how Google handles discovered URLs:

  • Discovered – currently not indexed: Google found these URLs but hasn't crawled them yet. Large numbers here suggest crawl budget constraints.
  • Crawled – currently not indexed: Google crawled these but chose not to index them, typically a content quality signal rather than a crawl budget issue.

Quick diagnostic calculation

Divide your total indexable pages by the average daily crawl requests from GSC. If the result exceeds 10, crawl budget may be limiting how quickly Google processes your site.

Total pages: 250,000
Average daily crawls: 2,500
Ratio: 100

→ At current crawl rate, reaching every page once takes ~100 days

A ratio under 3 typically indicates crawl budget isn't a bottleneck.
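
The same check takes a few lines of Python. A minimal sketch, assuming you already know your indexable page count and have read the average daily crawl requests from the Crawl Stats report:

def crawl_coverage_ratio(total_indexable_pages, avg_daily_crawls):
    """Days for Googlebot to request every indexable page once at the current rate."""
    ratio = total_indexable_pages / avg_daily_crawls
    if ratio > 10:
        verdict = "crawl budget may be limiting indexing speed"
    elif ratio < 3:
        verdict = "crawl budget is unlikely to be a bottleneck"
    else:
        verdict = "borderline; monitor trends"
    return ratio, verdict

# The worked example above: 250,000 pages at 2,500 crawls per day
print(crawl_coverage_ratio(250_000, 2_500))
# (100.0, 'crawl budget may be limiting indexing speed')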

Server log analysis

For definitive answers, analyse server logs to see exactly which URLs Googlebot requests. GSC provides aggregated data; logs show individual requests.

import pandas as pd

def parse_crawl_logs(log_file):
    """
    Parse server logs to analyse Googlebot behaviour.
    Assumes Combined Log Format.
    """
    crawl_data = []

    with open(log_file, 'r') as f:
        for line in f:
            # Substring match is a quick filter; genuine Googlebot should be
            # verified via reverse DNS before acting on the results
            if 'Googlebot' in line:
                # Combined Log Format quotes the request, referer and user agent,
                # so splitting on '"' puts the request in parts[1] and the
                # status code and size in parts[2]
                parts = line.split('"')
                if len(parts) >= 3:
                    request = parts[1].split()  # e.g. ['GET', '/page', 'HTTP/1.1']
                    if len(request) >= 2:
                        crawl_data.append({
                            'url': request[1],
                            'status': parts[2].strip().split()[0],
                            'timestamp': line.split('[')[1].split(']')[0]
                        })

    df = pd.DataFrame(crawl_data)

    print("Status code distribution:")
    print(df['status'].value_counts())

    print("\nMost crawled URL patterns:")
    print(df['url'].value_counts().head(20))

    return df

Key patterns to look for (a sketch for surfacing them from the parsed logs follows the list):

  • Parameter URLs consuming disproportionate crawl share
  • Important pages crawled infrequently
  • Repeated requests to soft 404 pages
  • Crawl activity concentrated on low-value sections
  • Redirect chains consuming multiple requests
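
Building on the DataFrame returned by parse_crawl_logs() above, a rough aggregation by top-level path and parameter presence can surface several of these patterns. The grouping by first path segment is an illustrative assumption; adapt the depth and thresholds to your own URL structure.

import pandas as pd
from urllib.parse import urlsplit

def crawl_share_by_section(df):
    """Summarise Googlebot requests by top-level path segment and flag parameter URLs."""
    parsed = df['url'].apply(urlsplit)
    df = df.assign(
        # '/category/page?x=1' -> '/category'; bare '/' stays as '/'
        section=parsed.apply(lambda u: '/' + u.path.strip('/').split('/')[0] if u.path.strip('/') else '/'),
        has_params=parsed.apply(lambda u: bool(u.query)),
    )
    summary = (df.groupby('section')
                 .agg(requests=('url', 'size'), param_share=('has_params', 'mean'))
                 .sort_values('requests', ascending=False))
    summary['crawl_share'] = summary['requests'] / summary['requests'].sum()
    return summary

# Usage with the DataFrame returned earlier:
# print(crawl_share_by_section(parse_crawl_logs('access.log')).head(15))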

Common crawl budget killers

Certain patterns waste crawl resources at scale. Identifying and eliminating these typically produces the largest improvements.

Faceted navigation

Product filters create URL combinations that explode exponentially. A category page with 10 colours, 15 sizes, and 20 brands generates 3,000 unique URLs from filter combinations alone. Add sorting options and pagination, and a single category can spawn tens of thousands of crawlable URLs, most serving near-identical content.
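
The arithmetic behind that explosion is easy to sketch. The facet counts below mirror the example above; the sorting and pagination figures are purely illustrative.

# Facet counts from the example above; sort/pagination figures are assumptions
facets = {'colour': 10, 'size': 15, 'brand': 20}
sort_options = 4
pages_per_combination = 5

combinations = 1
for options in facets.values():
    combinations *= options           # 10 * 15 * 20 = 3,000 filter combinations

crawlable_urls = combinations * sort_options * pages_per_combination
print(f"{combinations:,} filter combinations -> ~{crawlable_urls:,} crawlable URLs")
# 3,000 filter combinations -> ~60,000 crawlable URLs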

Infinite scroll and deep pagination

Paginated series without proper controls trap crawlers in endless loops. A blog archive with thousands of posts creates /page/2/, /page/3/, through /page/500/, each requiring a crawl request while providing diminishing unique value.

Soft 404s

When a page returns HTTP 200 but displays "Product not found" or similar error content, Google can't rely on the status code to know the page is invalid. These soft 404s get re-crawled repeatedly because Google never receives the definitive signal that a proper 404 or 410 provides.
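
One way to hunt for soft 404s is to fetch a sample of URLs that return 200 and check the response body for error phrasing. A minimal sketch using the requests library; the phrase list is an assumption to tailor to your own templates.

import requests

ERROR_PHRASES = ('product not found', 'no results found', 'this page is unavailable')

def find_soft_404s(urls, timeout=10):
    """Return URLs that respond 200 OK but contain error-style copy."""
    suspects = []
    for url in urls:
        try:
            resp = requests.get(url, timeout=timeout)
        except requests.RequestException:
            continue
        if resp.status_code == 200 and any(p in resp.text.lower() for p in ERROR_PHRASES):
            suspects.append(url)
    return suspects

# Example (hypothetical URL):
# print(find_soft_404s(['https://example.com/product/discontinued-widget']))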

Orphan pages

Pages with no internal links pointing to them typically indicate forgotten or low-value content. Google may discover them through sitemaps or old external links, but the absence of internal linking signals low importance. Crawling orphan pages wastes budget on content your own site doesn't prioritise.

Redirect chains

Each redirect hop consumes a crawl request. A chain of three redirects (A → B → C → D) uses four requests to reach one destination. Google typically follows up to five hops per session but may abandon longer chains entirely, leaving destination pages undiscovered.
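
Chain length for a list of URLs can be measured with the requests library, which records each intermediate response in response.history. A minimal sketch; the three-hop flag below is an arbitrary threshold.

import requests

def redirect_chain(url, timeout=10):
    """Return the hop count and the full chain of URLs for a starting URL."""
    resp = requests.get(url, timeout=timeout, allow_redirects=True)
    chain = [r.url for r in resp.history] + [resp.url]
    return len(resp.history), chain

# Example (hypothetical URL):
# hops, chain = redirect_chain('https://example.com/old-category/')
# if hops >= 3:
#     print(f"{hops} hops: {' -> '.join(chain)}; link directly to the final URL")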

Warning: Index bloat (where Google knows about far more URLs than are worth indexing) compounds these issues. A site with 100,000 discovered URLs but only 10,000 indexable pages has a 10:1 bloat ratio. Reducing bloat by eliminating crawl traps often produces faster indexing improvements than server optimisations.

Best practices for crawl budget optimisation

1. Improve server performance

Fast, reliable responses are the foundation of good crawl efficiency (a TTFB spot-check sketch follows the targets below):

Target metrics:
- Time to First Byte (TTFB): < 200ms
- Server uptime: > 99.9%
- 5xx error rate: < 0.1%
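
TTFB can be spot-checked by timing how long the server takes to return response headers. A minimal sketch using the requests library, where response.elapsed approximates time-to-first-byte; use dedicated monitoring for precise, sustained measurement.

import requests

def spot_check_ttfb(url, samples=5, timeout=10):
    """Approximate TTFB by timing header arrival over several requests."""
    timings = []
    for _ in range(samples):
        # stream=True defers the body download, so elapsed roughly
        # reflects time until the response headers arrive
        resp = requests.get(url, stream=True, timeout=timeout)
        timings.append(resp.elapsed.total_seconds() * 1000)
        resp.close()
    avg_ms = sum(timings) / len(timings)
    print(f"{url}: avg {avg_ms:.0f}ms over {samples} requests (target < 200ms)")
    return avg_ms

# spot_check_ttfb('https://example.com/')  # hypothetical URL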

2. Eliminate duplicate and near-duplicate URLs

Consolidate variations that serve substantially similar content:

  • Use canonical tags consistently
  • Handle trailing slashes uniformly
  • Manage URL parameters through consistent internal linking and canonical tags (Google Search Console's URL Parameters tool has been retired)
  • Avoid session IDs and tracking parameters in URLs

3. Configure robots.txt strategically

Block crawlers from low-value URL patterns, not individual pages; a sketch for testing these rules follows the example:

User-agent: *

# Block faceted navigation that creates duplicates
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?color=
Disallow: /search?

# Block internal utility pages
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/

# Keep sitemaps accessible
Sitemap: https://example.com/sitemap.xml
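
Before deploying rules like these, test them against a sample of real URLs. Python's built-in urllib.robotparser does not understand Google-style wildcards, so the sketch below converts each Disallow pattern to a regular expression instead; it is a simplified approximation of Google's matching, not a complete implementation.

import re

RULES = ['/*?sort=', '/*?filter=', '/*?color=', '/search?', '/cart/', '/checkout/', '/account/']

def rule_to_regex(pattern):
    """Convert a Google-style Disallow pattern (* wildcard, $ anchor) to a compiled regex."""
    anchored = pattern.endswith('$')
    body = pattern[:-1] if anchored else pattern
    regex = '^' + re.escape(body).replace(r'\*', '.*')
    if anchored:
        regex += '$'
    return re.compile(regex)

COMPILED = [rule_to_regex(rule) for rule in RULES]

def is_blocked(path_and_query):
    """Check whether a URL path (including query string) matches any Disallow rule."""
    return any(rule.search(path_and_query) for rule in COMPILED)

print(is_blocked('/shoes/?sort=price'))   # True: caught by /*?sort=
print(is_blocked('/shoes/running/'))      # False: crawlable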

4. Maintain clean XML sitemaps

Your sitemaps should only include:

  • Indexable pages (200 OK status, no noindex)
  • Canonical versions of pages
  • Pages you actually want ranked

Remove from sitemaps:

  • Redirected URLs
  • Noindexed pages
  • Paginated archives (usually)
  • Parameter variations

5. Minimise click depth for important pages

Click depth (the number of clicks required to reach a page from the homepage) affects crawl priority more than URL structure. A page at /category/subcategory/product/ can be one click away if linked directly from the homepage. Ensure important pages are reachable within 3-4 clicks; pages buried deeper get crawled less frequently regardless of their URL path.
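
Click depth can be computed from your internal link graph with a breadth-first search from the homepage. The sketch below assumes you already have the graph as a dict of page to outlinks (for example, exported from a crawler); it is illustrative rather than a full crawler.

from collections import deque

def click_depths(link_graph, homepage):
    """Breadth-first search: minimum click depth of each reachable URL from the homepage."""
    depths = {homepage: 0}
    queue = deque([homepage])
    while queue:
        page = queue.popleft()
        for target in link_graph.get(page, []):
            if target not in depths:      # first visit is the shortest path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Hypothetical link graph
graph = {
    '/': ['/category/', '/featured-product/'],
    '/category/': ['/category/subcategory/'],
    '/category/subcategory/': ['/category/subcategory/product/'],
}
print(click_depths(graph, '/'))
# {'/': 0, '/category/': 1, '/featured-product/': 1,
#  '/category/subcategory/': 2, '/category/subcategory/product/': 3}

Any sitemap URL that never appears in the result is unreachable through internal links and therefore an orphan-page candidate.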

6. Monitor crawl stats regularly

Use Google Search Console's crawl stats report to identify:

  • Response time trends
  • Crawl request patterns
  • File type distribution
  • Response code breakdown

Large site considerations

Sites exceeding 100,000 pages require more structured approaches to crawl budget management.

URL prioritisation tiers

Categorise pages by business value to focus crawl resources:

  • Tier 1 (critical): Revenue-generating pages such as products, services, and key landing pages. Ensure strong internal linking and sitemap inclusion.
  • Tier 2 (supporting): Content that supports conversions, including blog posts, guides, and category pages. Include in sitemaps with appropriate <lastmod> signals.
  • Tier 3 (utility): Necessary but low-value pages like about, contact, and legal pages. Minimal sitemap priority.
  • Tier 4 (candidates for removal): Thin content, outdated pages, duplicate variations. Consider noindexing or removing entirely.

Sitemap segmentation

For large sites, segment sitemaps by content type rather than arbitrary splits. This enables per-segment monitoring in Search Console: when product page indexing drops, you see it immediately in the products sitemap report rather than buried in aggregate data.
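
Generating segmented sitemaps plus an index file needs nothing beyond the standard library. A minimal sketch, assuming you can supply URLs grouped by content type; a real implementation would also populate <lastmod> and respect the 50,000-URL-per-file limit.

from xml.sax.saxutils import escape

def write_sitemap(path, urls):
    """Write a simple urlset sitemap for one content segment."""
    with open(path, 'w', encoding='utf-8') as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in urls:
            f.write(f'  <url><loc>{escape(url)}</loc></url>\n')
        f.write('</urlset>\n')

def write_sitemap_index(path, sitemap_urls):
    """Write the index file that points at each segment sitemap."""
    with open(path, 'w', encoding='utf-8') as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in sitemap_urls:
            f.write(f'  <sitemap><loc>{escape(url)}</loc></sitemap>\n')
        f.write('</sitemapindex>\n')

# Hypothetical segments keyed by content type
segments = {'products': ['https://example.com/product/widget/'],
            'guides': ['https://example.com/guides/crawl-budget/']}
for name, urls in segments.items():
    write_sitemap(f'sitemap-{name}.xml', urls)
write_sitemap_index('sitemap.xml',
                    [f'https://example.com/sitemap-{name}.xml' for name in segments])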

Server log analysis at scale

GSC data alone is insufficient for sites with millions of pages. Dedicated log analysis tools (or custom solutions using tools like ELK stack) reveal:

  • Which URL patterns consume the most crawl requests
  • Time-of-day crawl patterns and server load correlation
  • Googlebot behaviour differences across site sections
  • Redirect chain frequency and depth

Tip: For sites with extensive faceted navigation, consider implementing filters via JavaScript without generating crawlable URLs. When filters update content via AJAX without changing the URL, Googlebot sees only the base category page, not thousands of filter combinations.

Signs of a healthy crawl budget

When crawl budget is well-optimised:

  • High crawl-to-index conversion: Most crawled pages end up indexed. If Google crawls 1,000 pages and 900+ get indexed, that's healthy.
  • Rapid indexing of new content: New pages appear in search results within days rather than weeks.
  • Stable or increasing crawl requests: GSC Crawl Stats show consistent or growing activity, not declining trends.
  • Clean response code distribution: 95%+ of crawl requests return 200 OK, with minimal 4xx/5xx errors.
  • Crawl activity aligned with content updates: When you publish or update pages, Googlebot visits them promptly.
  • Low "Discovered – not indexed" counts: The gap between discovered URLs and indexed URLs remains narrow.

Common misconceptions

"Blocking pages in robots.txt saves crawl budget": Blocking URLs doesn't eliminate crawl demand. Google still checks robots.txt for those URLs. If you want pages truly ignored, use noindex and let them be crawled once, or remove them entirely. Blocking can also prevent link equity from flowing through those pages.

"More pages = more crawl budget needed": Quality matters more than quantity. A site with 100,000 high-quality, well-linked pages may be crawled more efficiently than a site with 10,000 thin, duplicate, or orphaned pages. Pruning low-value pages often improves crawl efficiency for the remaining content.

"Submitting sitemaps increases crawl budget": Sitemaps help Google discover URLs and understand their relative priority, but they don't increase your total crawl allocation. A sitemap full of low-quality URLs won't accelerate crawling.

"Googlebot crawls sites on a fixed schedule": Crawl frequency is dynamic and page-specific. Some pages might be crawled multiple times per day; others, once per month. This depends on perceived freshness, importance, and historical change patterns.

"4xx errors waste crawl budget": Proper 404 and 410 responses are efficient. Google requests the URL, receives a clear "not found" signal, and moves on. The problem is soft 404s (pages returning 200 OK while displaying error messages) which get re-crawled because Google never receives a definitive status code.

"Blocking pages temporarily reallocates crawl budget elsewhere": Google explicitly warns against using robots.txt to shift crawl resources. Blocking pages doesn't automatically increase crawling of other pages. The only reliable ways to increase crawl budget are improving server capacity and improving content quality.

Key takeaways

  1. Diagnose before optimising: Use GSC Crawl Stats and the pages-to-crawls ratio to confirm crawl budget is actually your bottleneck
  2. Eliminate crawl traps first: Faceted navigation, soft 404s, and orphan pages typically waste more budget than slow servers
  3. Server speed enables everything else: Response times under 500ms allow Googlebot to crawl more pages per session
  4. Proper 404s are efficient: Return real 404/410 status codes for removed content; soft 404s get re-crawled repeatedly
  5. Large sites need structured approaches: URL prioritisation tiers and segmented sitemaps provide both efficiency and monitoring visibility

Frequently asked questions

How long does it take to see results from crawl budget optimisation?

Expect different timelines for different changes:

  • robots.txt changes: Googlebot typically re-fetches robots.txt within 24-48 hours
  • Crawl rate improvements: 1-2 weeks to stabilise after server optimisations
  • Indexing improvements: 2-4 weeks for measurable changes in coverage
  • Full impact: 1-3 months for comprehensive optimisation efforts

Do CDNs help with crawl budget?

Indirectly. CDNs reduce server response times, allowing Googlebot to crawl more pages per session. They also improve reliability by distributing load. However, a CDN doesn't directly increase your crawl budget allocation; the benefit comes through improved server performance.

Should I block pagination from crawling?

It depends on the content. For blog archives, allowing the first 10-15 pages while blocking deeper pagination is reasonable, since most value is in recent content. For e-commerce categories, consider "view all" pages or ensure each paginated page has unique, valuable products. Forum threads with unique discussions should generally remain crawlable.

What's the difference between "Discovered – not indexed" and "Crawled – not indexed"?

"Discovered – currently not indexed" means Google found the URL but hasn't crawled it yet, often indicating crawl budget constraints. "Crawled – currently not indexed" means Google fetched the page but chose not to index it, typically signalling content quality issues rather than crawl budget problems.

