What is crawl budget?
Crawl budget refers to the number of pages a search engine will crawl on your site within a given timeframe. It's determined by two key factors: crawl capacity limit (how much crawling your server can handle) and crawl demand (how much Google wants to crawl based on perceived value).
Google's crawlers don't have unlimited resources. They must balance crawling billions of pages across the web while being respectful to individual servers. For most small to medium sites, crawl budget isn't a concern—Google will find and index your content. However, for large sites with thousands or millions of pages, understanding crawl budget becomes critical.
In the context of an overall Technical SEO strategy, crawl budget optimisation is often one of the highest-impact interventions we make.
The two components of crawl budget
Crawl capacity limit
This is the maximum crawling Google will do without negatively impacting your server. If your site responds slowly, returns errors, or explicitly limits crawling via robots.txt, Google will reduce its crawl rate. Factors include:
- Server response time and stability
- Hosting infrastructure capacity
- Crawl-delay directives (Googlebot ignores these, though some other crawlers honour them)
- Rate of 5xx errors encountered
Crawl demand
Even if your server could handle more, Google only crawls what it deems valuable. Crawl demand increases when:
- Pages are popular (receive external links, traffic)
- Content is updated frequently
- New pages are discovered
- The site has demonstrated quality and authority
Factors that influence crawl budget
| Factor | Impact | Why it matters |
|---|---|---|
| Server response time | Critical | Slow responses = fewer pages crawled per session |
| HTTP status codes | Critical | 5xx errors waste budget; clean 404s are fine |
| Duplicate content | High | Near-duplicates fragment crawl attention |
| URL parameters | High | Infinite combinations can trap crawlers |
| Redirect chains | Medium | Each hop consumes crawl resources |
| XML sitemap quality | Medium | Signals priority and freshness to crawlers |
| Internal link depth | Medium | Deeply buried pages get crawled less frequently |
| Page freshness | Medium | Frequently updated pages attract more crawls |
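Two of the factors in the table, redirect chains and server status codes, are easy to spot-check from a script before committing to full log analysis. The sketch below is a rough illustration using Python's requests library; the sample URLs are placeholders, not real pages:

```python
import requests

# Spot-check redirect chains and final status codes for a few URLs.
# The URLs below are placeholders; substitute representative pages from your own site.
URLS = [
    "http://example.com/old-page",
    "https://example.com/category/shirts/",
]

for url in URLS:
    response = requests.get(url, timeout=10)   # follows redirects by default
    hops = len(response.history)               # each hop costs a crawler an extra request
    chain = " -> ".join(r.url for r in response.history + [response])
    if hops > 1 or response.status_code >= 500:
        print(f"INVESTIGATE ({hops} hops, final {response.status_code}): {chain}")
    else:
        print(f"OK ({hops} hops, final {response.status_code}): {url}")
```

A single hop (for example, HTTP to HTTPS) is normal; chains of two or more hops are worth collapsing into a direct redirect.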
Common misconceptions
"Crawl budget matters for every site": For sites under 10,000 pages with decent server performance, crawl budget is rarely a limiting factor. Google will typically crawl and index your content without intervention. Focus on crawl budget only when you have evidence of indexing delays or incomplete coverage.
"Blocking pages in robots.txt saves crawl budget": Blocking URLs doesn't eliminate crawl demand—Google still needs to check robots.txt for those URLs. If you want pages truly ignored, use noindex and let them be crawled once, or remove them entirely. Blocking can also prevent link equity from flowing through those pages.
"More pages = more crawl budget needed": Quality matters more than quantity. A site with 100,000 high-quality, well-linked pages may be crawled more efficiently than a site with 10,000 thin, duplicate, or orphaned pages. Pruning low-value pages often improves crawl efficiency for the remaining content.
"Submitting sitemaps increases crawl budget": Sitemaps help Google discover URLs and understand their relative priority, but they don't increase your total crawl allocation. A sitemap full of low-quality URLs won't get them crawled faster—it may actually dilute the signal.
"Googlebot crawls sites on a fixed schedule": Crawl frequency is dynamic and page-specific. Some pages might be crawled multiple times per day; others, once per month. This depends on perceived freshness, importance, and historical change patterns.
Best practices for crawl budget optimisation
1. Improve server performance
Fast, reliable responses are the foundation of good crawl efficiency:
Target metrics:
- Time to First Byte (TTFB): < 200ms
- Server uptime: > 99.9%
- 5xx error rate: < 0.1%
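A quick way to spot-check these targets is to time a handful of representative URLs from a script. This is a single-location measurement rather than what Googlebot actually experiences, and it assumes the requests library and placeholder URLs:

```python
import requests

# Rough TTFB and status check for a few representative URLs (placeholders).
URLS = [
    "https://example.com/",
    "https://example.com/category/shirts",
    "https://example.com/sitemap.xml",
]

for url in URLS:
    response = requests.get(url, timeout=10)
    # response.elapsed measures time until the response headers arrive,
    # which serves as a rough proxy for TTFB
    ttfb_ms = response.elapsed.total_seconds() * 1000
    flag = "  <- investigate" if ttfb_ms > 200 or response.status_code >= 500 else ""
    print(f"{response.status_code}  {ttfb_ms:6.0f} ms  {url}{flag}")
```

For ongoing visibility, proper uptime and response-time monitoring is more useful than ad hoc checks like this.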
2. Eliminate duplicate and near-duplicate URLs
Consolidate variations that serve substantially similar content:
- Use canonical tags consistently
- Handle trailing slashes uniformly
- Keep URL parameters under control with canonicals, robots.txt rules, and consistent internal linking (Google Search Console's URL Parameters tool has been retired)
- Avoid session IDs and tracking parameters in URLs (a small normalisation sketch follows this list)
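The sketch referenced above shows one way to normalise URLs before they reach internal links, canonicals, and sitemaps. It uses only the standard library; the list of tracking parameters is a hypothetical example and should reflect whatever your own site appends:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical set of parameters that never change the content served
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid", "sessionid"}

def canonicalise(url: str) -> str:
    """Strip tracking parameters and normalise the trailing slash."""
    parts = urlsplit(url)
    # Keep only parameters that actually change what the page serves
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    path = parts.path if parts.path == "/" else parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(query), ""))

print(canonicalise("https://example.com/shirts/?utm_source=newsletter&page=2"))
# -> https://example.com/shirts?page=2
```

Running every internally generated URL through a function like this keeps one canonical form in circulation, so crawlers encounter fewer near-duplicate variants.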
3. Configure robots.txt strategically
Block crawlers from low-value URL patterns, not individual pages:
```
User-agent: *

# Block faceted navigation that creates duplicates
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?color=
Disallow: /search?

# Block internal utility pages
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/

# Declare the sitemap location
Sitemap: https://example.com/sitemap.xml
```
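Before deploying rules like these, it's worth spot-checking that they block what you intend and nothing more. The sketch below is only a simplified approximation of Google's wildcard matching (rules anchored at the start of the path, with * matching any character sequence); Google's production robots.txt parser is open source at github.com/google/robotstxt if you need exact behaviour:

```python
import re

# Simplified approximation of robots.txt wildcard matching for the rules above.
DISALLOW_RULES = [
    "/*?sort=", "/*?filter=", "/*?color=", "/search?",
    "/cart/", "/checkout/", "/account/",
]

def is_blocked(path_and_query: str) -> bool:
    for rule in DISALLOW_RULES:
        # '*' becomes '.*'; everything else is treated literally, anchored at the start
        pattern = "^" + re.escape(rule).replace(r"\*", ".*")
        if re.search(pattern, path_and_query):
            return True
    return False

# Spot-check representative URLs before deploying the rules
for path in ["/shirts?sort=price", "/search?q=shoes", "/shirts/blue-shirt", "/cart/"]:
    print(f"{path} -> {'blocked' if is_blocked(path) else 'allowed'}")
```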
4. Maintain clean XML sitemaps
Your sitemaps should only include:
- Indexable pages (200 OK status, no noindex)
- Canonical versions of pages
- Pages you actually want ranked
Remove from sitemaps:
- Redirected URLs
- Noindexed pages
- Paginated archives (usually)
- Parameter variations
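To audit an existing sitemap against these criteria, a rough script can flag obvious problems. The sketch below assumes the requests library and a placeholder sitemap URL, and its noindex detection is deliberately crude; a production version should parse the robots meta tag and the X-Robots-Tag header properly:

```python
import requests
from xml.etree import ElementTree

SITEMAP_URL = "https://example.com/sitemap.xml"  # placeholder
NAMESPACE = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

xml = requests.get(SITEMAP_URL, timeout=10).text
urls = [loc.text for loc in ElementTree.fromstring(xml).findall(".//sm:loc", NAMESPACE)]

for url in urls[:50]:  # sample only; rate-limit a full run against your own site
    response = requests.get(url, timeout=10, allow_redirects=False)
    problems = []
    if response.status_code != 200:
        problems.append(f"status {response.status_code}")  # redirects and errors don't belong in a sitemap
    elif "noindex" in response.text.lower():
        problems.append("possible noindex")  # crude string check for illustration only
    if problems:
        print(f"{url}: {', '.join(problems)}")
```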
5. Flatten your site architecture
Ensure important pages are reachable within 3-4 clicks from the homepage. Deep pages get crawled less frequently.
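Click depth can be estimated with a breadth-first search over your internal-link graph. The graph below is a tiny hypothetical example; in practice you would build it from a crawl of your own site:

```python
from collections import deque

# Hypothetical internal-link graph: page -> pages it links to
links = {
    "/": ["/category/shirts", "/about"],
    "/category/shirts": ["/category/shirts/page-2", "/product/blue-shirt"],
    "/category/shirts/page-2": ["/product/green-shirt"],
    "/product/blue-shirt": [],
    "/product/green-shirt": [],
    "/about": [],
}

def click_depth(graph, start="/"):
    """Breadth-first search from the homepage, returning each page's click depth."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in graph.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth

for page, d in sorted(click_depth(links).items(), key=lambda item: item[1]):
    flag = "  <- consider surfacing higher" if d > 3 else ""
    print(f"{d}  {page}{flag}")
```

Pages that never appear in the output are orphaned from the homepage: not reachable through internal links and unlikely to be crawled often.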
6. Monitor crawl stats regularly
Use Google Search Console's crawl stats report to identify:
- Response time trends
- Crawl request patterns
- File type distribution
- Response code breakdown
Analysing crawl patterns
For large sites, server log analysis provides the deepest insights. Here's a basic approach using Python, which can be adapted for your specific scenario:
```python
import pandas as pd

def parse_crawl_logs(log_file):
    """
    Parse server logs to analyse Googlebot behaviour.
    Assumes Combined Log Format.
    """
    crawl_data = []
    with open(log_file, 'r') as f:
        for line in f:
            # Note: matching the user-agent string alone will include spoofed
            # requests; verify with reverse DNS if you need exact figures
            if 'Googlebot' in line:
                parts = line.split('"')
                if len(parts) >= 3:
                    request = parts[1].split()
                    if len(request) >= 2:
                        crawl_data.append({
                            'url': request[1],
                            'status': parts[2].strip().split()[0],
                            'timestamp': line.split('[')[1].split(']')[0]
                        })

    df = pd.DataFrame(crawl_data)

    # Identify potential issues
    print("Status code distribution:")
    print(df['status'].value_counts())
    print("\nMost crawled URL patterns:")
    print(df['url'].value_counts().head(20))

    return df
```
Key patterns to look for:
- URLs being crawled excessively (infinite loops, parameter combinations)
- High 404 or 5xx rates
- Important pages being crawled infrequently
- Redirect chains consuming multiple requests
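Building on the DataFrame returned by parse_crawl_logs above, grouping requests by the first path segment is a rough way to see where Googlebot spends its requests and which sections it neglects:

```python
def crawl_share_by_section(df):
    """Share of Googlebot requests per top-level URL section (uses the df from parse_crawl_logs)."""
    sections = df['url'].str.split('/').str[1].replace('', '(homepage)')
    summary = sections.value_counts(normalize=True).mul(100).round(1)
    print("Share of Googlebot requests by top-level section (%):")
    print(summary.head(15))
    return summary

# Example usage:
# df = parse_crawl_logs('access.log')
# crawl_share_by_section(df)
```

Comparing this distribution with where your important content lives highlights sections Googlebot is under-crawling.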
When to prioritise crawl budget
Focus on crawl budget optimisation when:
- Your site has 50,000+ pages
- Google Search Console shows significant "Discovered - currently not indexed" URLs
- Log analysis reveals Googlebot not reaching important sections
- You've recently migrated or restructured a large site
- Your server consistently shows high response times during crawls
Key takeaways
Crawl budget = capacity and demand combined: How much crawling your server can handle, bounded by how much Google actually wants to crawl based on perceived value
- Most sites don't need to worry: Under 10,000 pages with decent performance? Crawl budget isn't your bottleneck
- Server speed is foundational: Slow responses mean fewer pages crawled per session
Robots.txt blocking isn't a blanket fix: Blocked URLs can still be indexed if linked externally, and Google can't read noindex or canonical signals on pages it can't fetch
- Quality beats quantity: Pruning low-value pages often improves crawl efficiency for remaining content
Frequently asked questions
Does submitting a sitemap increase my crawl budget?
No. Sitemaps help Google discover URLs and understand priority, but they don't increase your total crawl allocation. A sitemap full of low-quality URLs won't accelerate crawling.
Should I block low-value pages in robots.txt to save crawl budget?
Not necessarily. Blocking prevents crawling but not indexing—external links to blocked URLs can still cause them to appear in search results. Use noindex for pages you don't want indexed, or remove them entirely.
How do I know if crawl budget is limiting my site?
Check Google Search Console for "Discovered - currently not indexed" URLs. If important pages remain in this state for weeks, or log analysis shows Googlebot not reaching key sections, crawl budget may be a factor.
Further reading
- Large site owner's guide to managing your crawl budget
Google's guidance on crawl budget for large sites with practical recommendations