What log file analysis reveals
Server logs record every request made to your infrastructure. When Googlebot fetches a URL, that request appears in your logs—with the exact timestamp, status code returned, response time, and bytes transferred.
This is ground truth. Search Console provides sampled data with reporting delays. Third-party crawl tools show you what their crawler sees, not what Googlebot actually does. Logs show you precisely what happened, when it happened, and how your server responded.
For sites where crawl budget is a genuine constraint—typically those with hundreds of thousands or millions of pages—log analysis is the primary diagnostic tool. It answers questions that no other data source can:
- Which URLs is Googlebot actually requesting?
- How frequently does it return to specific sections?
- What status codes is it receiving?
- How long are responses taking?
- Is it fetching the resources needed to render JavaScript?
But logs have boundaries. Understanding what they can't tell you is as important as knowing what they reveal.
Crawling, indexing, and ranking are distinct
SEOs routinely conflate "Googlebot fetched it" with "Google indexed it." This is a category error that leads to flawed diagnoses.
| Stage | What happens | What logs tell you |
|---|---|---|
| Crawling | Googlebot requests the URL | Yes—complete visibility |
| Rendering | Google executes JavaScript, constructs DOM | Partial—you see resource requests, not render outcome |
| Indexing | Google evaluates content, selects canonical, adds to index | No—logs cannot confirm index inclusion |
| Ranking | Google returns page for relevant queries | No—entirely outside log scope |
A URL crawled daily can remain unindexed indefinitely. Google may fetch it, evaluate the content, and decide it doesn't meet quality thresholds or duplicates another page. Conversely, a URL crawled once six months ago can rank well if it passed evaluation and accumulated signals.
For indexing status, Search Console's index coverage report is authoritative. The diagnostic power comes from combining both: logs reveal whether crawl access is a bottleneck; Search Console reveals whether pages that are crawled make it into the index.
Anatomy of a log entry
Most web servers default to Combined Log Format, which captures the essential fields for SEO analysis:
66.249.66.1 - - [15/Dec/2025:09:23:41 +0000] "GET /products/widget-blue/ HTTP/1.1" 200 45232 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Breaking this down:
| Field | Value | Meaning |
|---|---|---|
| IP address | 66.249.66.1 | Client IP (Google's range for Googlebot) |
| Identity | - | Unused field (RFC 1413 identity) |
| User | - | Authenticated user (typically empty) |
| Timestamp | [15/Dec/2025:09:23:41 +0000] | Request time with timezone |
| Request | "GET /products/widget-blue/ HTTP/1.1" | Method, path, protocol |
| Status code | 200 | Server response code |
| Bytes | 45232 | Response size in bytes |
| Referrer | "-" | Referring URL (often empty for bots) |
| User agent | "Mozilla/5.0 (compatible; Googlebot/2.1; ...)" | Client identification |
For SEO analysis, the critical fields are: timestamp, request path, status code, bytes transferred, and user agent. Response time isn't included in Combined Log Format by default but can be added through server configuration—it's valuable for diagnosing performance-related crawl issues.
| Field | Why it matters | What to look for |
|---|---|---|
| IP address | Verify legitimate Googlebot via reverse DNS | 66.249.x.x range (but always verify) |
| Timestamp | Track crawl frequency and patterns over time | Gaps, spikes, time-of-day patterns |
| Request Path | Which URLs are being crawled | Parameter URLs, redirect sources, resource files |
| Status Code | Server response health | 4xx/5xx errors, redirect rates |
| Bytes | Response size validation | Suspiciously small (soft 404?), suspiciously large |
| User-Agent | Identify crawler type | Smartphone vs Desktop, AdsBot vs Search |
Log format variations
Different servers and CDNs produce different formats:
- Apache/nginx: Typically Combined Log Format, configurable
- Cloudflare: JSON-structured logs with additional fields (edge response time, cache status)
- AWS CloudFront: Tab-separated with CDN-specific fields
- Vercel/Netlify: Platform-specific formats, often JSON
Before analysing logs, identify your format and map fields accordingly. Most analysis tools expect Common or Combined Log Format and require configuration for alternatives.
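If your CDN or platform delivers JSON logs, a small normalisation step keeps the rest of the analysis format-agnostic. A minimal sketch, using hypothetical JSON key names (substitute whatever your provider actually emits) mapped onto the Combined Log Format fields used throughout this piece:

```python
import json
import pandas as pd

# Hypothetical JSON field names - replace the keys with the ones your
# CDN actually emits, keeping the Combined Log Format names as targets.
FIELD_MAP = {
    "client_ip": "ip",
    "timestamp": "timestamp",
    "request_uri": "url",
    "status": "status",
    "bytes_sent": "bytes",
    "user_agent": "user_agent",
}

def load_json_log(filepath):
    """Read newline-delimited JSON logs and rename fields to a common schema."""
    records = []
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                records.append(json.loads(line))
    df = pd.DataFrame(records)
    return df.rename(columns=FIELD_MAP)[list(FIELD_MAP.values())]
```

Once everything is in the same schema, the same filters and aggregations work regardless of where the logs came from.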
Identifying search engine crawlers
Filtering for search engine bots is the first step in any analysis. User-agent strings identify the crawler:
| Bot | User-agent pattern | Notes |
|---|---|---|
| Googlebot (web) | Googlebot/2.1 | Primary web crawler |
| Googlebot Smartphone | Googlebot/2.1 (with Android and Mobile Safari tokens) | Mobile-first indexing crawler |
| Googlebot Images | Googlebot-Image/1.0 | Image search crawler |
| Googlebot Video | Googlebot-Video/1.0 | Video search crawler |
| Googlebot News | Googlebot-News | News indexing |
| AdsBot | AdsBot-Google | Ads landing page quality |
| Bingbot | bingbot/2.0 | Bing's primary crawler |
| Yandex | YandexBot/3.0 | Russian search engine |
Verifying legitimate crawlers
User-agent strings are trivially spoofed. Scrapers, competitors, and security scanners frequently impersonate Googlebot to bypass rate limiting or robots.txt restrictions.
To verify legitimate Googlebot requests:
- Reverse DNS lookup on the IP address—should resolve to *.googlebot.com or *.google.com
- Forward DNS lookup on the hostname—should resolve back to the original IP
# Verify a suspected Googlebot IP
host 66.249.66.1
# Expected: 1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com
host crawl-66-249-66-1.googlebot.com
# Expected: crawl-66-249-66-1.googlebot.com has address 66.249.66.1
For large-scale analysis, maintain a verified IP list rather than performing DNS lookups per request. Google publishes IP ranges for Googlebot, though these change periodically.
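A sketch of the two-step verification in Python using only the standard library; cache results, since per-request lookups are slow:

```python
import socket
from functools import lru_cache

@lru_cache(maxsize=None)
def is_verified_googlebot(ip):
    """Reverse DNS, check the hostname, then forward DNS back to the same IP."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return ip in socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False

print(is_verified_googlebot("66.249.66.1"))  # True for a genuine Googlebot IP
```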
What crawlers request: positive signals
Once you've filtered for verified crawler requests, aggregate patterns reveal how search engines prioritise your site.
Crawl frequency by section
Group requests by URL path prefix to see which sections receive attention:
| Section | Requests/day | % of crawl |
|---|---|---|
| /products/ | 12,450 | 62% |
| /categories/ | 4,200 | 21% |
| /blog/ | 2,100 | 10% |
| /pages/ | 890 | 4% |
| Other | 560 | 3% |
This distribution should roughly align with your content priorities. If your blog drives significant organic traffic but receives only 10% of crawl attention while a low-value section dominates, investigate why.
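With a parsed request table (such as the DataFrame produced in the tools section below), the section breakdown is a short groupby. A sketch, assuming a `url` column holding the request path:

```python
import pandas as pd

def crawl_share_by_section(googlebot):
    """Share of Googlebot requests by first path segment, e.g. products, blog."""
    sections = googlebot["url"].str.extract(r"^/([^/?]+)")[0].fillna("(root)")
    counts = sections.value_counts()
    return pd.DataFrame({
        "requests": counts,
        "% of crawl": (counts / counts.sum() * 100).round(1),
    })

# Stand-in data; in practice pass the parsed, verified Googlebot DataFrame
sample = pd.DataFrame({"url": ["/products/a", "/products/b", "/blog/post", "/"]})
print(crawl_share_by_section(sample))
```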
Status code distribution
Aggregate status codes by crawler:
| Status | Count | % | Interpretation |
|---|---|---|---|
| 200 | 18,450 | 87% | Successful responses |
| 301 | 1,240 | 6% | Redirects (investigate if high) |
| 304 | 890 | 4% | Not modified (caching working) |
| 404 | 420 | 2% | Not found (some expected; spikes warrant review) |
| 500 | 200 | 1% | Server errors (investigate immediately) |
Elevated redirect rates suggest internal linking to non-canonical URLs. High 404 rates may indicate deleted content still linked internally or externally. Any 5xx errors to Googlebot degrade crawl efficiency and should be resolved urgently.
Response time analysis
If your logs include response time (often as %D in Apache or $request_time in nginx), analyse latency patterns:
- By URL pattern: Are certain page types consistently slow?
- By time of day: Does performance degrade during peak hours?
- By crawler: Is Googlebot receiving slower responses than users?
Google reduces crawl rate when servers respond slowly. Response times consistently above 200ms for Googlebot warrant performance investigation. See crawl budget basics for how server performance affects crawl allocation.
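If you have added response time to your log format, the three slices above fall out of a couple of groupbys. A sketch, assuming a DataFrame with `url` and `timestamp` columns plus a `response_time` column in seconds (not part of the default Combined Log Format parsing shown later):

```python
import pandas as pd

def latency_report(googlebot):
    """Median and 95th-percentile response time by section and by hour of day."""
    df = googlebot.copy()
    df["section"] = df["url"].str.extract(r"^/([^/?]+)")[0].fillna("(root)")
    df["hour"] = pd.to_datetime(
        df["timestamp"], format="%d/%b/%Y:%H:%M:%S %z"
    ).dt.hour

    by_section = df.groupby("section")["response_time"].quantile([0.5, 0.95]).unstack()
    by_hour = df.groupby("hour")["response_time"].quantile([0.5, 0.95]).unstack()
    return by_section, by_hour
```

Run the same report filtered to Googlebot and to regular user traffic to answer the third question: whether the crawler is being served more slowly than visitors.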
Resource requests
For JavaScript-rendered sites, examine whether Googlebot requests the resources needed for rendering:
- JavaScript bundles (/static/js/*.js)
- CSS files (/static/css/*.css)
- API endpoints called during render (/api/*)
- Web fonts and images referenced in CSS
If these resources aren't appearing in Googlebot's requests, check robots.txt for inadvertent blocks or verify that resource URLs are accessible.
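A quick way to check this in logs is to bucket Googlebot requests by resource type. A sketch, assuming the parsed DataFrame from the tools section with a `url` column; the patterns are illustrative and should match your own asset paths:

```python
# Count Googlebot requests per resource type to confirm JS/CSS/API fetches happen at all.
RESOURCE_PATTERNS = {
    "javascript": r"\.js(\?|$)",
    "css": r"\.css(\?|$)",
    "api": r"^/api/",
    "fonts": r"\.(woff2?|ttf|otf)(\?|$)",
}

def resource_fetch_counts(googlebot):
    return {
        label: int(googlebot["url"].str.contains(pattern, regex=True, na=False).sum())
        for label, pattern in RESOURCE_PATTERNS.items()
    }

# A count of zero for a resource type your pages depend on is worth investigating.
```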
What crawlers don't request: absence as signal
What's missing from logs often reveals more than what's present. Build analyses specifically to surface URLs that should appear but don't.
Internally linked but never crawled
Compare URLs receiving internal links against URLs Googlebot has requested:
URLs with 10+ internal links: 45,000
URLs with 10+ internal links AND Googlebot request (90 days): 38,000
Gap: 7,000 URLs linked but not crawled
This gap suggests:
- JavaScript rendering issues: Links exist in rendered DOM but not server-sent HTML
- Nofollow at scale: Internal links may carry rel="nofollow" unexpectedly
- Crawl prioritisation: Google is choosing not to follow these links despite seeing them
- Link discovery failure: Links are in locations Googlebot doesn't parse (JavaScript event handlers, non-standard attributes)
Investigate a sample manually. Use Google's URL Inspection tool to see whether Google is aware of these pages through other discovery paths.
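A sketch of the gap analysis itself, assuming you can export internally linked URLs from your crawl tool of choice (the `url` and `inlinks` column names are placeholders) and that both sources are normalised to paths rather than absolute URLs:

```python
import pandas as pd

def linked_but_uncrawled(internal_links_csv, googlebot, min_inlinks=10):
    """URLs with significant internal links that Googlebot never requested."""
    links = pd.read_csv(internal_links_csv)  # placeholder columns: 'url', 'inlinks'
    well_linked = set(links.loc[links["inlinks"] >= min_inlinks, "url"])
    crawled = set(googlebot["url"])
    return sorted(well_linked - crawled)

# gap = linked_but_uncrawled("internal_links.csv", googlebot)
# print(len(gap), "URLs linked but not crawled")
```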
Sitemap URLs never fetched
Cross-reference your XML sitemaps against crawl logs:
URLs in sitemap: 125,000
Sitemap URLs crawled (90 days): 98,000
Never fetched: 27,000
Possible causes:
- Sitemap not processed: Verify sitemap appears in Search Console with correct URL count
- Low priority signals: Google may deprioritise URLs based on historical quality or update patterns
- Crawl budget exhaustion: Google allocates finite crawl resources; lower-priority URLs may be deferred indefinitely
- URL pattern issues: Parameter variations or trailing slash inconsistencies between sitemap and canonical URLs
The sitemap is a request for crawling, not a guarantee. Google's sitemap documentation explicitly notes that submission doesn't ensure crawling or indexing.
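A sketch of the cross-reference using the standard library's XML parser; it assumes a single sitemap file (not a sitemap index) and compares paths, so query strings are stripped from log URLs before the set difference:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_paths(sitemap_file):
    """Extract URL paths from an XML sitemap."""
    tree = ET.parse(sitemap_file)
    return {
        urlparse(loc.text.strip()).path
        for loc in tree.iter(f"{SITEMAP_NS}loc")
    }

def never_fetched(sitemap_file, googlebot):
    crawled_paths = set(googlebot["url"].str.split("?").str[0])
    return sorted(sitemap_paths(sitemap_file) - crawled_paths)
```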
Canonical targets with zero bot hits
If you specify rel="canonical" from page A to page B, but Googlebot never requests page B directly, this may indicate:
- Google disagrees with your canonical: It may have selected a different canonical based on its own signals
- Canonical target is inaccessible: The URL may be blocked, redirecting, or erroring
- Signal confusion: Conflicting canonical signals across the site
Check URL Inspection for Google's selected canonical versus your declared canonical.
Referenced resources never requested
For JavaScript sites, identify resources referenced in HTML that Googlebot never fetches:
<!-- In your HTML -->
<script src="/static/js/app.bundle.js"></script>
<link rel="stylesheet" href="/static/css/main.css">
If these URLs never appear in Googlebot's requests:
- robots.txt block: Check for patterns inadvertently blocking static resources
- Response errors: Resources may be returning 4xx or 5xx to bots specifically
- Conditional serving: Server may be sending different HTML to different user agents
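A sketch that pulls script and stylesheet URLs out of the served HTML and checks whether each ever appears in Googlebot's requests; it uses the standard library's HTML parser and assumes same-origin resource paths that match the log's request paths:

```python
from html.parser import HTMLParser

class ResourceExtractor(HTMLParser):
    """Collect src/href values from <script> and <link rel="stylesheet"> tags."""
    def __init__(self):
        super().__init__()
        self.resources = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("src"):
            self.resources.add(attrs["src"])
        if tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.resources.add(attrs["href"])

def unfetched_resources(html_text, googlebot):
    parser = ResourceExtractor()
    parser.feed(html_text)
    crawled = set(googlebot["url"])
    return sorted(r for r in parser.resources if r not in crawled)
```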
Diagnosing common issues
Crawl budget waste
Identify URL patterns receiving disproportionate crawl attention relative to their value:
| Pattern | Requests/day | Index value |
|---|---|---|
| /search?q=* | 8,200 | None (internal search) |
| /products/*?sort=* | 5,100 | Low (sorted variants) |
| /products/*?color=* | 4,800 | Low (filtered variants) |
| /products/* (canonical) | 3,200 | High (actual product pages) |
When parameter variations receive more crawl attention than canonical product pages, you're wasting budget. Solutions include:
- robots.txt disallow rules for low-value parameter patterns
- Canonical tags from parameter variants to clean URLs
- Reducing internal links to parameter URLs so they stop being rediscovered
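To quantify this kind of waste in your own logs, a rough sketch that buckets Googlebot requests into parameter patterns versus clean URLs; the bucket definitions are illustrative and should be adapted to your site's URL structure:

```python
import pandas as pd

# Illustrative buckets - define patterns that match your own low-value URL types.
WASTE_BUCKETS = {
    "internal search": r"^/search\?",
    "sort parameters": r"[?&]sort=",
    "filter parameters": r"[?&](color|size|price)=",
}

def crawl_waste_report(googlebot):
    urls = googlebot["url"]
    rows = {
        label: int(urls.str.contains(pattern, regex=True, na=False).sum())
        for label, pattern in WASTE_BUCKETS.items()
    }
    rows["clean URLs (no query string)"] = int((~urls.str.contains(r"\?", na=False)).sum())
    return pd.Series(rows, name="requests").sort_values(ascending=False)
```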
Redirect chain detection
Identify URLs where Googlebot receives 301/302 responses repeatedly:
/old-page → 301 (appears 45 times in 30 days)
/legacy/product → 301 (appears 120 times in 30 days)
Repeated redirect responses indicate:
- Internal links pointing to redirect sources: Update internal links to target final destinations
- External links you can't control: The redirects are necessary, but chains may exist
- Redirect loops or chains: Follow the redirect destinations to verify they resolve
Each redirect consumes a crawl request. A page requiring three hops (A→B→C→D) consumes four requests to reach the destination.
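A sketch that surfaces the URLs Googlebot is repeatedly redirected from, which usually doubles as the internal-link cleanup list:

```python
def repeated_redirects(googlebot, min_hits=10):
    """URLs that returned a redirect status to Googlebot at least min_hits times."""
    redirects = googlebot[googlebot["status"].isin([301, 302, 307, 308])]
    counts = redirects["url"].value_counts()
    return counts[counts >= min_hits]
```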
Soft 404 detection
Soft 404s occur when missing pages return 200 status with error content. Logs show the symptom; diagnosis requires examining response content:
- Identify candidate URLs: 200 responses on URL patterns that suggest non-existent content (e.g., /product/deleted-sku-12345)
- Check response size: Soft 404 pages often have consistent, small response sizes (your error template)
- Verify with URL Inspection: Google's tool specifically reports soft 404 detection
# Suspicious pattern: 200 status with identical small response size
/products/xyz123 200 2,450 bytes
/products/abc789 200 2,450 bytes
/products/def456 200 2,450 bytes
If multiple non-existent URLs return 200 with identical byte counts, your error handling is likely producing soft 404s.
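A sketch of the byte-size check: group 200 responses by response size and flag sizes shared by many distinct URLs, which is the signature of a single error template being served everywhere. The thresholds are illustrative:

```python
def suspected_soft_404s(googlebot, min_urls=20, max_bytes=5000):
    """URLs returning 200 with a small response size shared by many other URLs."""
    ok = googlebot[(googlebot["status"] == 200) & (googlebot["bytes"] <= max_bytes)]
    by_size = ok.groupby("bytes")["url"].nunique()
    suspicious_sizes = by_size[by_size >= min_urls].index
    return ok[ok["bytes"].isin(suspicious_sizes)][["url", "bytes"]].drop_duplicates()
```

Confirm a sample of the flagged URLs with URL Inspection before changing error handling; some templated pages legitimately share a byte count.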
Googlebot rendering resource access
For JavaScript-rendered sites, verify Googlebot can access everything needed to render:
- Extract resource URLs from a rendered page (JavaScript, CSS, fonts, API calls)
- Check logs for Googlebot requests to these resources
- Verify robots.txt doesn't block any paths
- Test in URL Inspection to see Google's rendered version
Missing resources in logs combined with rendering differences in URL Inspection confirms a resource access problem.
Correlating logs with other data sources
Log analysis becomes more powerful when combined with other data:
Logs + Search Console index coverage
| Log status | Index status | Interpretation |
|---|---|---|
| Crawled frequently | Indexed | Working as expected |
| Crawled frequently | Not indexed | Quality or canonical issues |
| Never crawled | Not indexed | Crawl access is the bottleneck |
| Never crawled | Indexed | Discovered via sitemap or links; crawled before your log window |
This correlation identifies whether crawl access or content evaluation is causing indexing gaps.
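A sketch of the correlation, assuming you have exported an indexing report from Search Console to CSV; the 'URL' and 'Status' column names below are placeholders to adjust to whatever your export contains:

```python
import pandas as pd

def crawl_vs_index(googlebot, coverage_csv):
    """Cross-tabulate crawl activity in the log window against reported index status."""
    coverage = pd.read_csv(coverage_csv)  # placeholder columns: 'URL', 'Status'
    coverage["path"] = coverage["URL"].str.replace(r"^https?://[^/]+", "", regex=True)
    crawled_paths = set(googlebot["url"].str.split("?").str[0])
    coverage["crawled_in_window"] = coverage["path"].isin(crawled_paths)
    return coverage.groupby(["crawled_in_window", "Status"]).size().unstack(fill_value=0)
```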
Logs + Sitemaps
Compare sitemap freshness signals against actual crawl patterns:
- URLs with recent <lastmod> but no recent crawls: Freshness signals may not be trusted
- URLs without <lastmod> receiving frequent crawls: Google has other freshness signals for these pages
- New URLs in sitemap never crawled: Discovery through sitemap isn't guaranteed
Logs + Backlink data
High-authority backlink targets that receive no Googlebot requests warrant investigation:
- Is the URL accessible? (Check for blocks, errors)
- Is the backlink actually followed? (Check for nofollow, JavaScript links)
- Has Google devalued the linking source?
External links typically prompt crawling. Targets that remain uncrawled despite quality backlinks suggest access problems.
Logs + Analytics
Compare crawled pages against pages receiving organic traffic:
- Crawled, no traffic: Indexed but not ranking, or not indexed at all
- Traffic, rarely crawled: Stable rankings; Google sees no need to recrawl frequently
- High traffic, high crawl: Important pages receiving appropriate attention
Anomalies in this correlation (heavily crawled pages with zero traffic) may indicate wasted budget or indexing issues.
Choosing the right data source
| Question | Answer source | Why |
|---|---|---|
| Is Googlebot crawling this URL? | Server logs | Logs show actual requests |
| What status code is Googlebot receiving? | Server logs | Direct server response |
| Is the page indexed? | Search Console | Only Google knows index inclusion |
| Which canonical did Google select? | Search Console | Google's selection may differ from yours |
| Can Googlebot render this page? | Logs + Search Console | Logs show resource fetches; URL Inspection shows result |
| Why isn't this page ranking? | Logs → then → Search Console | First verify crawl access; then check index/quality |
Tools and approaches
The right tooling depends on your scale and technical resources.
Command-line basics (small sites, quick queries)
For sites under 100,000 pages or one-off investigations:
# Count Googlebot requests by status code
grep "Googlebot" access.log | awk '{print $9}' | sort | uniq -c | sort -rn
# Find most-crawled URLs
grep "Googlebot" access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -50
# Extract Googlebot requests for specific date
grep "Googlebot" access.log | grep "17/Dec/2025" > googlebot-dec17.log
Command-line tools handle moderate log volumes efficiently. For larger files, tools like ripgrep (rg) offer significant performance improvements over standard grep.
Python for structured analysis
For repeatable analysis or larger datasets, Python with pandas provides flexibility:
import re
import pandas as pd

# Combined Log Format fields captured: ip, timestamp, method, url, status, bytes, user-agent
LOG_PATTERN = re.compile(
    r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+) "[^"]*" "([^"]*)"'
)
FIELDS = ['ip', 'timestamp', 'method', 'url', 'status', 'bytes', 'user_agent']

def load_combined_log(filepath):
    """Parse Combined Log Format into a DataFrame, skipping malformed lines."""
    rows = []
    with open(filepath, encoding='utf-8', errors='replace') as f:
        for line in f:
            match = LOG_PATTERN.match(line)
            if match:
                rows.append(match.groups())
    df = pd.DataFrame(rows, columns=FIELDS)
    df['status'] = pd.to_numeric(df['status'], errors='coerce')
    df['bytes'] = pd.to_numeric(df['bytes'], errors='coerce')
    return df

def filter_googlebot(df):
    """Filter for Googlebot requests (user-agent based; verify IPs separately)."""
    return df[df['user_agent'].str.contains('Googlebot', case=False, na=False)]

# Example analysis
logs = load_combined_log('access.log')
googlebot = filter_googlebot(logs)

# Status code distribution
print(googlebot['status'].value_counts())

# Most crawled URL patterns
print(googlebot['url'].value_counts().head(20))
For verified Googlebot analysis, add IP verification against Google's published ranges.
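A sketch of range-based verification with the standard library's ipaddress module, assuming you have downloaded Google's published Googlebot ranges (the googlebot JSON file linked under further reading) locally and that it keeps its current shape: a prefixes list with ipv4Prefix/ipv6Prefix entries.

```python
import ipaddress
import json

def load_googlebot_networks(path="googlebot.json"):
    """Parse Google's published Googlebot IP ranges into network objects."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    networks = []
    for prefix in data.get("prefixes", []):
        cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks

def ip_in_ranges(ip, networks):
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks if addr.version == net.version)
```

Refresh the downloaded ranges periodically, since the published list changes over time.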
Dedicated log analysis tools
For enterprise-scale sites or teams without engineering resources:
- Screaming Frog Log File Analyser: GUI-based, handles large files, built-in bot verification
- Botify / JetOctopus / Lumar / OnCrawl: Cloud-based log analysis with Search Console integration
- Custom ELK stack: Elasticsearch, Logstash, Kibana for ongoing monitoring at scale
The trade-off is typically flexibility versus setup time. Dedicated tools provide faster time-to-insight but less customisation than scripted approaches.
BigQuery for massive scale
Sites generating gigabytes of logs daily often export to BigQuery or similar data warehouses:
-- Googlebot crawl frequency by URL pattern (first directory)
SELECT
REGEXP_EXTRACT(url, r'^/([^/]+)/') AS section,
COUNT(*) AS requests,
COUNT(DISTINCT DATE(timestamp)) AS days_active
FROM `project.dataset.logs`
WHERE user_agent LIKE '%Googlebot%'
AND timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY section
ORDER BY requests DESC
LIMIT 20
BigQuery handles terabyte-scale log analysis efficiently, with SQL providing accessible query syntax for non-engineers.
Operationalising log analysis
Moving from occasional audits to ongoing monitoring multiplies the value of log analysis.
Automated alerting
Configure alerts for anomalies that warrant immediate attention:
- Crawl volume drop: >50% reduction in daily Googlebot requests
- Error rate spike: >5% of Googlebot requests returning 5xx
- New 404 patterns: Significant increase in 404 responses to bots
- Response time degradation: Average response time to Googlebot exceeding thresholds
These alerts can be implemented through log management platforms (Datadog, Splunk, CloudWatch) or custom scripts feeding into notification systems.
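As a sketch of the first two checks, a daily script can compare today's Googlebot volume and 5xx rate against a trailing baseline and return messages for your notifier of choice; the thresholds mirror the ones listed above:

```python
import pandas as pd

def crawl_alerts(googlebot, volume_drop_threshold=0.5, error_rate_threshold=0.05):
    """Return alert messages based on daily Googlebot volume and 5xx rate."""
    df = googlebot.copy()
    df["date"] = pd.to_datetime(
        df["timestamp"], format="%d/%b/%Y:%H:%M:%S %z"
    ).dt.date

    daily = df.groupby("date").size().sort_index()
    alerts = []
    if len(daily) >= 8:
        baseline = daily.iloc[-8:-1].mean()  # trailing 7-day average
        today = daily.iloc[-1]
        if today < baseline * (1 - volume_drop_threshold):
            alerts.append(f"Crawl volume drop: {today} vs baseline {baseline:.0f}")

    latest = df[df["date"] == daily.index[-1]]
    error_rate = (latest["status"] >= 500).mean()
    if error_rate > error_rate_threshold:
        alerts.append(f"5xx rate to Googlebot: {error_rate:.1%}")
    return alerts
```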
Regular reporting cadence
Establish periodic reviews with consistent metrics:
Weekly:
- Crawl volume trend (compared to previous weeks)
- Status code distribution
- Top crawled URLs (watch for unexpected patterns)
Monthly:
- Section-by-section crawl analysis
- Negative space analysis (URLs that should be crawled but aren't)
- Correlation with Search Console index coverage changes
Quarterly:
- Historical trend analysis
- Crawl efficiency metrics (useful crawls vs. wasted crawls)
- Recommendations for robots.txt or architectural changes
Dashboard integration
For teams managing SEO at scale, incorporate log metrics into SEO dashboards alongside:
- Search Console performance data
- Index coverage trends
- Core Web Vitals
- Ranking position tracking
This unified view enables correlation analysis that single-source dashboards miss.
FAQs
How much log history do I need?
Minimum 30 days for basic analysis; 90 days for trend identification; 12 months for seasonal pattern analysis. Google recrawls pages at varying frequencies—some pages may only be visited monthly or less frequently. Short analysis windows miss infrequently crawled URLs entirely.
What if I use a CDN—where do I get logs?
CDNs typically provide their own logging. Cloudflare offers Logpush, CloudFront provides access logs to S3, Fastly has real-time log streaming. CDN logs often contain richer data (edge location, cache status) but may use different formats. Ensure you're capturing origin requests, not just edge cache hits—or analyse both to understand cache behaviour.
How do I handle log rotation and storage costs?
Compress historical logs (gzip reduces size ~90%). For analysis, sample rather than process complete logs when volumes are extreme—a 10% sample of Googlebot requests usually provides statistically valid patterns. Archive raw logs to cold storage (S3 Glacier, similar) for compliance while keeping recent data accessible.
Can I analyse logs if my site is on shared hosting?
Shared hosting typically provides access logs through control panels (cPanel, Plesk), though often with limited history and no customisation. For serious SEO work on high-traffic sites, shared hosting is limiting regardless of log access. Consider upgrading to VPS or managed hosting that provides full log access and retention control.
Do logs tell me if a page is indexed?
No. Logs tell you whether Googlebot requested a URL and what response it received. A page can be crawled repeatedly without being indexed (quality thresholds, canonical selection, manual actions). Use Search Console's index coverage or URL Inspection tool for indexing status. Logs confirm crawl access; Search Console confirms index inclusion.
What about Googlebot variants—do I need to track them separately?
Yes, when relevant. Googlebot Smartphone (mobile-first indexing) versus desktop Googlebot may show different patterns. AdsBot-Google has different behaviour and purposes than web search crawlers. Filter by user-agent variant when diagnosing specific issues, but aggregate for overall crawl volume analysis.
Key takeaways
- Logs show crawl behaviour, not indexing outcomes: Googlebot fetching a URL doesn't mean it's indexed. Use logs to verify crawl access; use Search Console for index status.
- Verify before analysing: User-agent strings are commonly spoofed. Perform DNS verification on IP addresses before drawing conclusions about Googlebot's behaviour.
- Absence reveals as much as presence: URLs that should be crawled but aren't—sitemap URLs never fetched, internally linked pages never requested, canonical targets never hit—often indicate more significant issues than what is being crawled.
- Correlate across data sources: Logs combined with Search Console, sitemaps, and backlink data enable diagnoses that no single source supports. Match crawl patterns against index coverage to identify whether crawl access or content evaluation is the bottleneck.
- Scale your approach appropriately: Command-line tools suffice for small sites and quick queries. Dedicated tools or data warehouses become necessary at scale. Match tooling complexity to actual requirements.
- Operationalise for ongoing value: One-off audits provide snapshots; automated monitoring and regular reporting cadences catch issues before they compound.
Further reading
- Verifying Googlebot and other Google crawlers: Official documentation on DNS verification for legitimate Googlebot requests
- Google's IP ranges for crawlers: Published IP ranges for Googlebot and other Google crawlers
- Apache mod_log_config documentation: Reference for Apache log format customisation
- nginx log_format directive: Reference for nginx log format configuration