For sites with thousands or millions of pages, search engines must prioritise what to crawl—and some pages may never be indexed at all. This article covers how crawl capacity and demand determine which pages get crawled, and practical strategies to reduce waste.
What is crawl budget?
Crawl budget refers to the number of pages a search engine will crawl on your site within a given timeframe. It's determined by two key factors: crawl capacity limit (how much crawling your server can handle) and crawl demand (how much Google wants to crawl based on perceived value).
Google's crawlers don't have unlimited resources. They must balance crawling billions of pages across the web while being respectful to individual servers. For most small to medium sites, crawl budget isn't a concern; Google will find and index your content. However, for large sites with thousands or millions of pages, crawl budget directly determines how quickly new content reaches search results, and whether some pages get indexed at all.
Within a broader technical SEO strategy, crawl budget optimisation is often one of the highest-impact interventions we make for large sites.
The two components of crawl budget
Crawl capacity limit
This is the maximum crawling Google will do without negatively impacting your server. If your site responds slowly, returns errors, or explicitly limits crawling via robots.txt, Google will reduce its crawl rate. Factors include:
- Server response time and stability
- Hosting infrastructure capacity
- Crawl-delay directives (Google ignores these, though other search engines may honour them)
- Rate of 5xx errors encountered
Crawl demand
Even if your server could handle more, Google only crawls what it deems valuable. Crawl demand increases when:
- Pages are popular (receive external links, traffic)
- Content is updated frequently
- New pages are discovered
- The site has demonstrated quality and authority
Does crawl budget affect your site?
For most sites under 10,000 pages with decent server performance, crawl budget is not a limiting factor. Google will typically crawl and index your content without intervention.
Crawl budget optimisation becomes relevant when:
- Your site has 50,000+ pages
- Google Search Console shows significant "Discovered – currently not indexed" URLs
- Log analysis reveals Googlebot not reaching important sections
- You've recently migrated or restructured a large site
- Your server consistently shows high response times during crawls
Factors that influence crawl budget
| Factor | Impact | Why it matters |
|---|---|---|
| Server response time | Critical | Slow responses = fewer pages crawled per session |
| HTTP status codes | Critical | 5xx errors waste budget; clean 404s are fine |
| Duplicate content | High | Exact and near-duplicates fragment crawl attention |
| URL parameters | High | Infinite combinations can trap crawlers |
| Faceted navigation | High | Filter combinations explode URL counts exponentially |
| Soft 404s | High | Pages returning 200 but showing error content get re-crawled repeatedly |
| Redirect chains | Medium | Each hop consumes crawl resources |
| XML sitemap quality | Medium | Signals canonical preference and freshness to crawlers |
| Click depth | Medium | Pages many clicks from the homepage get crawled less frequently |
| Page freshness | Medium | Frequently updated pages attract more crawls |
| Orphan pages | Medium | Pages with no internal links waste crawl resources |
Diagnosing crawl budget issues
Before optimising, confirm that crawl budget is actually your problem. Many sites blame crawl budget for indexing issues that stem from content quality or technical errors.
Using Google Search Console
Navigate to Settings → Crawl Stats to review Googlebot's behaviour on your site:
- Total crawl requests: Look for declining trends over 90 days. A drop exceeding 20% warrants investigation.
- Average response time: Target under 500ms. Response times exceeding 1 second trigger crawl throttling.
- Response code distribution: Healthy sites show 95%+ responses as 200 OK, under 3% redirects, under 2% client errors.
The Index Coverage report (Indexing → Pages) reveals how Google handles discovered URLs:
- Discovered – currently not indexed: Google found these URLs but hasn't crawled them yet. Large numbers here suggest crawl budget constraints.
- Crawled – currently not indexed: Google crawled these but chose not to index them, typically a content quality signal rather than a crawl budget issue.
Quick diagnostic calculation
Divide your total indexable pages by the average daily crawl requests from GSC. If the result exceeds 10, crawl budget may be limiting how quickly Google processes your site.
Total pages: 250,000
Average daily crawls: 2,500
Ratio: 100
→ At current crawl rate, reaching every page once takes ~100 days
A ratio under 3 typically indicates crawl budget isn't a bottleneck.
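If you prefer to script this check against figures exported from GSC, here is a minimal sketch; the page and crawl counts are the illustrative figures from the example above, and the thresholds mirror the heuristics in this section:

```python
def crawl_coverage_days(total_indexable_pages, avg_daily_crawl_requests):
    """Estimate how many days Googlebot needs to request every page once."""
    ratio = total_indexable_pages / avg_daily_crawl_requests
    if ratio > 10:
        verdict = "crawl budget may be a constraint"
    elif ratio < 3:
        verdict = "crawl budget is unlikely to be the bottleneck"
    else:
        verdict = "borderline - keep monitoring crawl stats"
    return ratio, verdict

ratio, verdict = crawl_coverage_days(250_000, 2_500)
print(f"~{ratio:.0f} days to cover every page once: {verdict}")
```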
Server log analysis
For definitive answers, analyse server logs to see exactly which URLs Googlebot requests. GSC provides aggregated data; logs show individual requests.
```python
import pandas as pd

def parse_crawl_logs(log_file):
    """
    Parse server logs to analyse Googlebot behaviour.
    Assumes Combined Log Format.
    """
    crawl_data = []
    with open(log_file, 'r') as f:
        for line in f:
            # Matching the user-agent string alone doesn't verify the request
            # really came from Google; reverse-DNS verification would.
            if 'Googlebot' in line:
                # Combined Log Format quotes the request line and user-agent,
                # so splitting on '"' isolates the request and the status/bytes field.
                parts = line.split('"')
                if len(parts) >= 3:
                    request = parts[1].split()  # e.g. ['GET', '/page', 'HTTP/1.1']
                    if len(request) >= 2:
                        crawl_data.append({
                            'url': request[1],
                            'status': parts[2].strip().split()[0],
                            'timestamp': line.split('[')[1].split(']')[0]
                        })

    df = pd.DataFrame(crawl_data)
    print("Status code distribution:")
    print(df['status'].value_counts())
    print("\nMost crawled URL patterns:")
    print(df['url'].value_counts().head(20))
    return df
```
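A minimal usage example, assuming an access log at a path like /var/log/nginx/access.log (adjust for your server):

```python
df = parse_crawl_logs('/var/log/nginx/access.log')

# Share of Googlebot requests hitting parameterised URLs -- a rough proxy for crawl waste
param_share = df['url'].str.contains(r'\?').mean()
print(f"Parameter URLs: {param_share:.1%} of Googlebot requests")
```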
Key patterns to look for:
- Parameter URLs consuming disproportionate crawl share
- Important pages crawled infrequently
- Repeated requests to soft 404 pages
- Crawl activity concentrated on low-value sections
- Redirect chains consuming multiple requests
Common crawl budget killers
Certain patterns waste crawl resources at scale. Identifying and eliminating these typically produces the largest improvements.
Faceted navigation
Product filters create URL combinations that explode exponentially. A category page with 10 colours, 15 sizes, and 20 brands generates 3,000 unique URLs from filter combinations alone. Add sorting options and pagination, and a single category can spawn tens of thousands of crawlable URLs, most serving near-identical content.
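The combinatorics are easy to underestimate; a quick back-of-the-envelope calculation (the facet counts mirror the example above, while the sort and pagination multipliers are illustrative assumptions):

```python
from math import prod

facets = {'colour': 10, 'size': 15, 'brand': 20}
sort_options = 3           # assumed: relevance, price, newest
avg_pages_per_listing = 5  # assumed pagination depth

filter_urls = prod(facets.values())  # 3,000 filter combinations
crawlable_urls = filter_urls * sort_options * avg_pages_per_listing
print(f"{filter_urls:,} filter URLs -> {crawlable_urls:,} crawlable URLs in one category")
```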
Infinite scroll and deep pagination
Paginated series without proper controls trap crawlers in endless loops. A blog archive with thousands of posts creates /page/2/, /page/3/, through /page/500/, each requiring a crawl request while providing diminishing unique value.
Soft 404s
When a page returns HTTP 200 but displays "Product not found" or similar error content, Google can't rely on the status code to know the page is invalid. These soft 404s get re-crawled repeatedly because Google never receives the definitive signal that a proper 404 or 410 provides.
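One rough way to surface candidate soft 404s is to fetch suspect URLs and flag pages that return 200 while displaying error phrasing. The URL and phrase list below are illustrative assumptions; tune them to your own templates:

```python
import requests

ERROR_PHRASES = ('product not found', 'no longer available', '0 results found')  # assumed patterns

def find_soft_404s(urls):
    """Return URLs that respond 200 OK but look like error pages."""
    suspects = []
    for url in urls:
        resp = requests.get(url, timeout=10)
        if resp.status_code == 200 and any(p in resp.text.lower() for p in ERROR_PHRASES):
            suspects.append(url)
    return suspects

print(find_soft_404s(['https://example.com/discontinued-product']))
```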
Orphan pages
Pages with no internal links pointing to them typically indicate forgotten or low-value content. Google may discover them through sitemaps or old external links, but the absence of internal linking signals low importance. Crawling orphan pages wastes budget on content your own site doesn't prioritise.
Redirect chains
Each redirect hop consumes a crawl request. A chain of three redirects (A → B → C → D) uses four requests to reach one destination. Google typically follows up to five hops per session but may abandon longer chains entirely, leaving destination pages undiscovered.
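To audit chains directly, you can disable automatic redirect following and count the hops yourself. A sketch that caps at the five hops mentioned above, using a placeholder URL:

```python
import requests

def redirect_chain(url, max_hops=5):
    """Follow redirects manually and return the full chain of URLs."""
    chain = [url]
    for _ in range(max_hops):
        resp = requests.get(chain[-1], allow_redirects=False, timeout=10)
        if resp.status_code not in (301, 302, 303, 307, 308):
            break
        # Location may be relative, so resolve it against the current URL
        chain.append(requests.compat.urljoin(chain[-1], resp.headers['Location']))
    return chain

# Anything longer than A -> B is worth flattening into a single redirect
print(redirect_chain('http://example.com/old-page'))
```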
Best practices for crawl budget optimisation
1. Improve server performance
Fast, reliable responses are the foundation of good crawl efficiency (a quick TTFB check follows the targets below):
Target metrics:
- Time to First Byte (TTFB): < 200ms
- Server uptime: > 99.9%
- 5xx error rate: < 0.1%
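A simple way to spot-check TTFB from a script: requests' response.elapsed measures the time until response headers arrive, which is a reasonable proxy when you only need the trend. The URL below is a placeholder.

```python
import requests

def average_ttfb_ms(url, samples=5):
    """Approximate TTFB by timing until response headers arrive (body not downloaded)."""
    timings = []
    for _ in range(samples):
        resp = requests.get(url, stream=True, timeout=10)  # stream=True defers the body download
        timings.append(resp.elapsed.total_seconds() * 1000)
        resp.close()
    return sum(timings) / len(timings)

print(f"Average TTFB: {average_ttfb_ms('https://example.com/'):.0f} ms (target: under 200 ms)")
```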
2. Eliminate duplicate and near-duplicate URLs
Consolidate variations that serve substantially similar content (a URL normalisation sketch follows this list):
- Use canonical tags consistently
- Handle trailing slashes uniformly
- Manage URL parameters consistently (GSC's URL Parameters tool was retired in 2022, so rely on canonicals and internal linking instead)
- Avoid session IDs and tracking parameters in URLs
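During an audit it helps to normalise URLs the way you hope Google will canonicalise them, so duplicates group together. A minimal sketch; the tracking-parameter list is an assumption to extend for your stack:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PARAMS = {'utm_source', 'utm_medium', 'utm_campaign', 'gclid', 'sessionid'}  # assumed list

def normalise_url(url):
    """Strip tracking parameters and trailing slashes so duplicate URLs collapse together."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k not in TRACKING_PARAMS]
    path = parts.path.rstrip('/') or '/'
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(query), ''))

print(normalise_url('https://example.com/shoes/?utm_source=newsletter&size=9'))
# -> https://example.com/shoes?size=9
```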
3. Configure robots.txt strategically
Block crawlers from low-value URL patterns, not individual pages:
```
User-agent: *
# Block faceted navigation that creates duplicates
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?color=
Disallow: /search?
# Block internal utility pages
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
# Keep sitemaps accessible
Sitemap: https://example.com/sitemap.xml
```
4. Maintain clean XML sitemaps
Your sitemaps should only include:
- Indexable pages (200 OK status, no noindex)
- Canonical versions of pages
- Pages you actually want ranked
Remove from sitemaps (a filtering sketch follows this list):
- Redirected URLs
- Noindexed pages
- Paginated archives (usually)
- Parameter variations
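When sitemaps are generated automatically, it is worth filtering the URL list before writing it out. A sketch of that filter, checking live responses; the noindex detection is deliberately crude and the URLs are placeholders, so run it against samples rather than millions of URLs:

```python
import requests

def is_sitemap_worthy(url):
    """Keep only URLs that return 200 directly and carry no noindex directive."""
    resp = requests.get(url, allow_redirects=False, timeout=10)
    if resp.status_code != 200:
        return False  # drops redirects, 404s and server errors
    if 'noindex' in resp.headers.get('X-Robots-Tag', '').lower():
        return False
    head = resp.text.lower().split('</head>')[0]  # crude check for a meta robots noindex
    return 'noindex' not in head

candidates = ['https://example.com/product-1', 'https://example.com/old-page']
sitemap_urls = [u for u in candidates if is_sitemap_worthy(u)]
```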
5. Minimise click depth for important pages
Click depth (the number of clicks required to reach a page from the homepage) affects crawl priority more than URL structure. A page at /category/subcategory/product/ can be one click away if linked directly from the homepage. Ensure important pages are reachable within 3-4 clicks; pages buried deeper get crawled less frequently regardless of their URL path.
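Click depth can be computed from a crawl export as a breadth-first search over the internal link graph, starting at the homepage. A minimal sketch, assuming you already have a mapping of each URL to the URLs it links to:

```python
from collections import deque

def click_depths(link_graph, homepage):
    """Breadth-first search: shortest click path from the homepage to each URL."""
    depths = {homepage: 0}
    queue = deque([homepage])
    while queue:
        page = queue.popleft()
        for target in link_graph.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Toy graph: the product page is 1 click deep because the homepage links to it directly
graph = {
    '/': ['/category/', '/category/subcategory/product/'],
    '/category/': ['/category/subcategory/'],
    '/category/subcategory/': ['/category/subcategory/product/'],
}
print(click_depths(graph, '/'))
```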
6. Monitor crawl stats regularly
Use Google Search Console's crawl stats report to identify:
- Response time trends
- Crawl request patterns
- File type distribution
- Response code breakdown
Large site considerations
Sites exceeding 100,000 pages require more structured approaches to crawl budget management.
URL prioritisation tiers
Categorise pages by business value to focus crawl resources:
- Tier 1 (critical): Revenue-generating pages such as products, services, and key landing pages. Ensure strong internal linking and sitemap inclusion.
- Tier 2 (supporting): Content that supports conversions, including blog posts, guides, and category pages. Include in sitemaps with appropriate <lastmod> signals.
- Tier 3 (utility): Necessary but low-value pages like about, contact, and legal pages. Minimal sitemap priority.
- Tier 4 (candidates for removal): Thin content, outdated pages, duplicate variations. Consider noindexing or removing entirely.
Sitemap segmentation
For large sites, segment sitemaps by content type rather than arbitrary splits. This enables per-segment monitoring in Search Console: when product page indexing drops, you see it immediately in the products sitemap report rather than buried in aggregate data.
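A simple way to generate segmented sitemaps is to group URLs by content type and write one file per group, each referenced from a sitemap index. A sketch, with the type-detection rule, file names, and URLs as assumptions:

```python
from collections import defaultdict

def segment_urls(urls):
    """Group URLs by content type based on path prefix (assumed convention)."""
    segments = defaultdict(list)
    for url in urls:
        if '/product/' in url:
            segments['products'].append(url)
        elif '/blog/' in url:
            segments['blog'].append(url)
        else:
            segments['other'].append(url)
    return segments

def write_sitemap(filename, urls):
    """Write one sitemap file for a segment."""
    entries = ''.join(f'  <url><loc>{u}</loc></url>\n' for u in urls)
    xml = ('<?xml version="1.0" encoding="UTF-8"?>\n'
           '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
           f'{entries}</urlset>\n')
    with open(filename, 'w') as f:
        f.write(xml)

all_urls = ['https://example.com/product/trail-shoe', 'https://example.com/blog/crawl-budget']
for name, group in segment_urls(all_urls).items():
    write_sitemap(f'sitemap-{name}.xml', group)  # then reference each file from a sitemap index
```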
Server log analysis at scale
GSC data alone is insufficient for sites with millions of pages. Dedicated log analysis tools (or custom solutions using tools like ELK stack) reveal:
- Which URL patterns consume the most crawl requests
- Time-of-day crawl patterns and server load correlation
- Googlebot behaviour differences across site sections
- Redirect chain frequency and depth
Signs of a healthy crawl budget
When crawl budget is well-optimised:
- High crawl-to-index conversion: Most crawled pages end up indexed. If Google crawls 1,000 pages and 900+ of them get indexed, that's healthy.
- Rapid indexing of new content: New pages appear in search results within days rather than weeks.
- Stable or increasing crawl requests: GSC Crawl Stats show consistent or growing activity, not declining trends.
- Clean response code distribution: 95%+ of crawl requests return 200 OK, with minimal 4xx/5xx errors.
- Crawl activity aligned with content updates: When you publish or update pages, Googlebot visits them promptly.
- Low "Discovered – not indexed" counts: The gap between discovered URLs and indexed URLs remains narrow.
Common misconceptions
"Blocking pages in robots.txt saves crawl budget": Blocking URLs doesn't eliminate crawl demand. Google still checks robots.txt for those URLs. If you want pages truly ignored, use noindex and let them be crawled once, or remove them entirely. Blocking can also prevent link equity from flowing through those pages.
"More pages = more crawl budget needed": Quality matters more than quantity. A site with 100,000 high-quality, well-linked pages may be crawled more efficiently than a site with 10,000 thin, duplicate, or orphaned pages. Pruning low-value pages often improves crawl efficiency for the remaining content.
"Submitting sitemaps increases crawl budget": Sitemaps help Google discover URLs and understand their relative priority, but they don't increase your total crawl allocation. A sitemap full of low-quality URLs won't accelerate crawling.
"Googlebot crawls sites on a fixed schedule": Crawl frequency is dynamic and page-specific. Some pages might be crawled multiple times per day; others, once per month. This depends on perceived freshness, importance, and historical change patterns.
"4xx errors waste crawl budget": Proper 404 and 410 responses are efficient. Google requests the URL, receives a clear "not found" signal, and moves on. The problem is soft 404s (pages returning 200 OK while displaying error messages) which get re-crawled because Google never receives a definitive status code.
"Blocking pages temporarily reallocates crawl budget elsewhere": Google explicitly warns against using robots.txt to shift crawl resources. Blocking pages doesn't automatically increase crawling of other pages. The only reliable ways to increase crawl budget are improving server capacity and improving content quality.
Key takeaways
- Diagnose before optimising: Use GSC Crawl Stats and the pages-to-crawls ratio to confirm crawl budget is actually your bottleneck
- Eliminate crawl traps first: Faceted navigation, soft 404s, and orphan pages typically waste more budget than slow servers
- Server speed enables everything else: Response times under 500ms allow Googlebot to crawl more pages per session
- Proper 404s are efficient: Return real 404/410 status codes for removed content; soft 404s get re-crawled repeatedly
- Large sites need structured approaches: URL prioritisation tiers and segmented sitemaps provide both efficiency and monitoring visibility
Frequently asked questions
How long does it take to see results from crawl budget optimisation?
Expect different timelines for different changes:
- robots.txt changes: Googlebot typically re-fetches robots.txt within 24-48 hours
- Crawl rate improvements: 1-2 weeks to stabilise after server optimisations
- Indexing improvements: 2-4 weeks for measurable changes in coverage
- Full impact: 1-3 months for comprehensive optimisation efforts
Do CDNs help with crawl budget?
Indirectly. CDNs reduce server response times, allowing Googlebot to crawl more pages per session. They also improve reliability by distributing load. However, a CDN doesn't directly increase your crawl budget allocation; the benefit comes through improved server performance.
Should I block pagination from crawling?
It depends on the content. For blog archives, allowing the first 10-15 pages while blocking deeper pagination is reasonable, since most value is in recent content. For e-commerce categories, consider "view all" pages or ensure each paginated page has unique, valuable products. Forum threads with unique discussions should generally remain crawlable.
What's the difference between "Discovered – not indexed" and "Crawled – not indexed"?
"Discovered – currently not indexed" means Google found the URL but hasn't crawled it yet, often indicating crawl budget constraints. "Crawled – currently not indexed" means Google fetched the page but chose not to index it, typically signalling content quality issues rather than crawl budget problems.
Further reading
- Large site owner's guide to managing your crawl budget: Google's guidance on crawl budget for large sites with practical recommendations
- Googlebot crawlers overview: Official documentation on Googlebot behaviour, crawl frequency, and how it accesses sites
- URL Inspection tool documentation: How to use Search Console's URL Inspection to diagnose indexing and crawl issues