robots.txt is the oldest and most misunderstood file on the web. It’s a per-host file at /robots.txt that tells well-behaved crawlers which paths they may not fetch. It is not a security boundary, not a regex engine, not a noindex mechanism, and not enforced — it’s a polite request. Most production robots.txt files have at least one of the bugs in this guide.
Why it matters
Five mistakes that range from “annoying” to “site-deindexed”:
- Accidental Disallow: /. Usually shipped from a staging server’s robots.txt that wasn’t replaced at deploy. Googlebot stops crawling everything. Rankings collapse within days.
- Blocking /wp-content/ or /static/ or /_next/. Kills CSS and JS fetching. Googlebot can’t render the page. The crawler sees an unstyled, half-built version and indexes that, or fails entirely.
- Using robots.txt to “hide” pages. Disallowing a URL prevents crawling, but if the URL is linked from elsewhere, Google can still index a URL-only entry with no content. Use noindex for hiding, robots.txt for crawl management.
- Treating it as regex. robots.txt uses simple wildcards (* for any sequence and $ for end-of-URL), not regex. Disallow: /*.php$ works, and the dot is literal so it needs no backslash; Disallow: /[0-9]+/ does not.
- Forgetting it’s case-sensitive. Disallow: /Admin/ does not block /admin/. Paths in robots.txt are case-sensitive.
How to detect it
Fetch and review:
curl -s https://example.com/robots.txt
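Then check specific URLs in Search Console. The robots.txt report (Search Console → Settings → robots.txt) shows the file Google last fetched along with any parse errors, and the URL Inspection tool reports whether a given URL is blocked by robots.txt. Google has retired its standalone robots.txt tester, so to see exactly which line allows or blocks a URL, which is invaluable for debugging precedence, run the file through an open-source robots.txt parser.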
Critical sanity checks:
# Does Googlebot have access to JS and CSS?
curl -s https://example.com/robots.txt | grep -iE 'Disallow.*(\.js|\.css|/static|/_next|/wp-content|/assets)'
# Any accidental site-wide blocks?
curl -s https://example.com/robots.txt | grep -iE '^Disallow:[[:space:]]*/[[:space:]]*$'
The first command should return nothing. The second should return nothing unless the Disallow: / is deliberate: for example, a group for a bot you want to block entirely, or a group that pairs it with explicit Allow: rules for the paths you do want crawled.
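The grep checks are line-based, so they cannot tell which User-agent group a Disallow: / belongs to. Below is a minimal TypeScript sketch of a group-aware version of the same two checks. The function names, the RENDER_HINTS list, and the message wording are illustrative assumptions, and the matching is as coarse as the grep (plain substrings, no wildcard handling).

```typescript
// One group: the user agents it names and the rules that follow them.
interface ParsedGroup {
  agents: string[];
  rules: { type: "allow" | "disallow"; path: string }[];
}

// Parse robots.txt text into groups. Consecutive User-agent lines share a group.
function parseGroups(text: string): ParsedGroup[] {
  const groups: ParsedGroup[] = [];
  let current: ParsedGroup | null = null;
  let lastWasAgent = false;
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, "").trim(); // strip comments
    const colon = line.indexOf(":");
    if (colon < 0) continue;
    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();
    if (field === "user-agent") {
      if (current === null || !lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if (field === "allow" || field === "disallow") {
      if (current !== null) {
        current.rules.push({ type: field === "allow" ? "allow" : "disallow", path: value });
      }
      lastWasAgent = false;
    }
  }
  return groups;
}

// Substrings that hint at rendering-critical resources (assumed list, same as the grep).
const RENDER_HINTS = [".js", ".css", "/static", "/_next", "/wp-content", "/assets"];

// Return one message per suspicious Disallow rule, labelled with its group.
function sanityCheck(text: string): string[] {
  const problems: string[] = [];
  for (const group of parseGroups(text)) {
    const agents = group.agents.join(", ");
    const hasAllow = group.rules.some((r) => r.type === "allow" && r.path !== "");
    for (const rule of group.rules) {
      if (rule.type !== "disallow") continue;
      if (rule.path === "/" && !hasAllow) {
        problems.push("site-wide block for " + agents);
      }
      if (RENDER_HINTS.some((hint) => rule.path.indexOf(hint) >= 0)) {
        problems.push("rendering path blocked for " + agents + ": " + rule.path);
      }
    }
  }
  return problems;
}
```

Calling sanityCheck with the body of the fetched file returns an empty array when neither check fires. A Disallow: / in a group that also carries Allow: rules is treated as deliberate.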
The fix
Universal — robots.txt syntax
# Comments start with #
User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /search?
Allow: /
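# A crawler that matches a specific group ignores the * group entirely,
# so repeat every rule that should still apply to Googlebot
User-agent: Googlebot
Disallow: /admin/
Disallow: /cart/
Disallow: /search?
Allow: /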
Sitemap: https://example.com/sitemap.xml
Rules:
- User-agent: * matches all crawlers. A crawler that matches a more specific User-agent group follows only that group; the wildcard rules are not merged in.
- The longest matching rule wins (not the first), so a longer Allow can open a single path inside a blocked directory. If an Allow and a Disallow of equal length both match, Allow takes precedence.
- One or more Sitemap: declarations, conventionally at the end, each an absolute URL.
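The precedence rules can be sketched in TypeScript as a two-step lookup: pick the group, then pick the longest matching rule inside it. This is a simplified illustration rather than a reference implementation. The type and function names are assumptions, rules are matched as plain prefixes (no wildcards), and a crawler is matched to a group by a case-insensitive substring test on its name.

```typescript
type Verdict = "allow" | "disallow";
interface PathRule { type: Verdict; path: string }
interface AgentGroup { agent: string; rules: PathRule[] }

// Step 1: the most specific matching group wins; "*" is only the fallback.
function selectGroup(groups: AgentGroup[], crawler: string): AgentGroup | null {
  const name = crawler.toLowerCase();
  let best: AgentGroup | null = null;
  for (const g of groups) {
    const agent = g.agent.toLowerCase();
    const matches = agent === "*" || name.indexOf(agent) >= 0;
    if (!matches) continue;
    const rank = agent === "*" ? 0 : agent.length;
    const bestRank = best === null ? -1 : best.agent === "*" ? 0 : best.agent.length;
    if (rank > bestRank) best = g;
  }
  return best;
}

// Step 2: the longest matching rule wins; Allow wins a tie.
function isAllowed(groups: AgentGroup[], crawler: string, path: string): boolean {
  const group = selectGroup(groups, crawler);
  if (group === null) return true; // no applicable group: nothing is blocked
  let winner: PathRule | null = null;
  for (const rule of group.rules) {
    // An empty rule matches nothing; otherwise this is a plain prefix match.
    if (rule.path === "" || path.indexOf(rule.path) !== 0) continue;
    if (
      winner === null ||
      rule.path.length > winner.path.length ||
      (rule.path.length === winner.path.length && rule.type === "allow")
    ) {
      winner = rule;
    }
  }
  return winner === null || winner.type === "allow";
}

// Sample data: a wildcard group plus a Googlebot group that lists only /admin/.
const sampleGroups: AgentGroup[] = [
  {
    agent: "*",
    rules: [
      { type: "disallow", path: "/admin/" },
      { type: "disallow", path: "/cart/" },
      { type: "allow", path: "/" },
    ],
  },
  {
    agent: "Googlebot",
    rules: [
      { type: "disallow", path: "/admin/" },
      { type: "allow", path: "/" },
    ],
  },
];
```

With this sample data, isAllowed(sampleGroups, "Bingbot", "/cart/checkout") is false, but the same path is allowed for "Googlebot", because the Googlebot group replaces the wildcard group instead of extending it.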
Wildcards (not regex)
- The * wildcard matches any character sequence.
- The $ anchor matches the end of the URL.
- Nothing else. No character classes, no quantifiers, no groups.
# Block all .pdf URLs
Disallow: /*.pdf$
# Block all URLs with ?sessionid=
Disallow: /*?sessionid=
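To make the "wildcards, not regex" point concrete, here is a small TypeScript sketch that translates a robots.txt pattern into an equivalent RegExp: * becomes .*, a trailing $ becomes an end anchor, and every other character is escaped so it matches literally. The function names patternToRegExp and matchesPattern are illustrative assumptions, not part of any library.

```typescript
// Translate a robots.txt path pattern into an equivalent RegExp.
function patternToRegExp(pattern: string): RegExp {
  // Only a trailing "$" is an anchor.
  const anchored = pattern.length > 0 && pattern.charAt(pattern.length - 1) === "$";
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const source = body
    .split("*") // "*" is the only wildcard
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")) // everything else is literal
    .join(".*");
  // Patterns always match from the start of the path.
  return new RegExp("^" + source + (anchored ? "$" : ""));
}

// True when the URL path (including any query string) matches the pattern.
function matchesPattern(pattern: string, urlPath: string): boolean {
  return patternToRegExp(pattern).test(urlPath);
}
```

Under this translation, matchesPattern("/*.pdf$", "/docs/guide.pdf") is true and matchesPattern("/*.pdf$", "/docs/guide.pdf?download=1") is false. A pattern such as /[0-9]+/ matches only a path containing those literal characters, so it does not match /123/.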
What to block
- Admin panels (/admin/, /wp-admin/)
- Internal search results (/search?)
- Faceted navigation parameters that explode crawl budget (?color=, ?size=, ?sort=)
- Cart and checkout flows (/cart/, /checkout/)
What NOT to block
- CSS, JS, fonts, images required to render the page
- Anything you want noindexed (use noindex instead)
- Anything sensitive (robots.txt is public; blocking /admin-secrets/ advertises it)
noindex vs robots.txt — the critical distinction
| Goal | Tool |
|---|---|
| Page never crawled | Disallow: in robots.txt |
| Page crawled but never indexed | <meta name="robots" content="noindex"> |
| Page hidden from public | Authentication, not robots.txt |
Important: if a URL is blocked in robots.txt, Google cannot crawl it — which means it cannot see the noindex meta tag. So robots.txt + noindex on the same URL means the URL can still appear in the index with no snippet. To deindex, use noindex alone, without disallowing.
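The interaction can be written down as a tiny decision function, which is handy for auditing a list of pages. This TypeScript sketch only encodes the rules stated above. The PageState shape, the outcome labels, and the function names are assumptions made for illustration.

```typescript
interface PageState {
  path: string;
  disallowed: boolean; // blocked by robots.txt
  noindex: boolean; // page carries a noindex meta tag
}

type IndexOutcome = "eligible for indexing" | "not indexed" | "may be indexed URL-only";

// A disallowed page is never fetched, so its noindex tag is never seen.
function indexOutcome(page: PageState): IndexOutcome {
  if (page.disallowed) return "may be indexed URL-only";
  return page.noindex ? "not indexed" : "eligible for indexing";
}

// Pages where someone tried to combine both tools: the noindex has no effect.
function findConflicts(pages: PageState[]): string[] {
  return pages.filter((p) => p.disallowed && p.noindex).map((p) => p.path);
}
```

Any path returned by findConflicts needs its Disallow rule lifted before the noindex can take effect.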
WordPress
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /?s=
Disallow: /search/
Sitemap: https://example.com/wp-sitemap.xml
Never block /wp-content/ or /wp-includes/ — themes and plugins serve required CSS and JS from there.
Next.js (App Router)
// app/robots.ts
import type { MetadataRoute } from "next";

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      { userAgent: "*", allow: "/", disallow: ["/admin/", "/api/"] },
    ],
    sitemap: "https://example.com/sitemap.xml",
  };
}
Don’t disallow /_next/ — it serves the JS bundles required for rendering.
Nginx — serving robots.txt by host
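location = /robots.txt {
    # The inline response below takes its Content-Type from default_type
    default_type text/plain;
    # Staging host: block all crawling
    if ($host = "staging.example.com") {
        return 200 "User-agent: *\nDisallow: /\n";
    }
    # Every other host gets the production file
    alias /var/www/robots.txt;
}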
Critical for staging environments — the wrong robots.txt at the wrong host is how Disallow: / ends up in production.
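If the host split lives in application code rather than in Nginx, the same idea can be sketched in TypeScript as a pure function that returns the file body for a given Host header. The NON_PRODUCTION_HOSTS list and the function name robotsTxtFor are hypothetical; the production rules are a shortened version of the universal example.

```typescript
// Hypothetical list of hosts that must never be crawled.
const NON_PRODUCTION_HOSTS = ["staging.example.com"];

// Build the robots.txt body for the host that received the request.
function robotsTxtFor(host: string): string {
  const name = host.toLowerCase().replace(/:\d+$/, ""); // drop any port
  if (NON_PRODUCTION_HOSTS.indexOf(name) >= 0) {
    // Non-production: block everything.
    return "User-agent: *\nDisallow: /\n";
  }
  // Production: normal crawl rules.
  return [
    "User-agent: *",
    "Disallow: /admin/",
    "Allow: /",
    "Sitemap: https://example.com/sitemap.xml",
    "",
  ].join("\n");
}
```

A stricter variant would invert the test and serve the permissive file only to an explicit list of production hosts, so an unknown host is blocked by default.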
Common pitfalls
- Staging robots.txt deployed to production. Set up an automated check in CI that fetches /robots.txt from prod after every deploy and fails if it contains Disallow: / under User-agent: *.
- Trying to deindex via robots.txt. Blocked URLs can still be indexed (URL-only). Use noindex to deindex.
- Blocking /api/ when JSON-LD or content comes from an API. Googlebot fetches JSON resources too. If your structured data or content is API-driven, those paths must be allowed.
- Forgetting subdomains have their own robots.txt. m.example.com/robots.txt is independent of example.com/robots.txt. Same for blog.example.com.
- Trailing slash bugs. Disallow: /admin is a prefix match: it blocks /admin and /admin/foo, but also /administrator and /admin.html. Disallow: /admin/ blocks /admin/ and everything under it, but not /admin itself. Pick the right form.
- Robots.txt returning HTML. Some routers serve a friendly 404 page. Verify Content-Type: text/plain.
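Several of these pitfalls can be caught by one post-deploy check on the production response. The TypeScript sketch below validates an already-fetched response; the RobotsResponse shape, the function name, and the failure messages are assumptions. Its Disallow test is deliberately coarse: it flags any bare Disallow: / line without looking at which User-agent group it sits in.

```typescript
// The parts of the HTTP response the check needs.
interface RobotsResponse {
  status: number;
  contentType: string; // value of the Content-Type header, "" if absent
  body: string;
}

// Return a list of failures; an empty list means the response looks sane.
function validateRobotsResponse(res: RobotsResponse): string[] {
  const failures: string[] = [];
  if (res.status !== 200) {
    failures.push("unexpected status " + res.status);
  }
  if (res.contentType.toLowerCase().indexOf("text/plain") !== 0) {
    failures.push("Content-Type is not text/plain: " + res.contentType);
  }
  // A friendly 404 or SPA shell usually starts with a doctype or <html> tag.
  if (/^\s*<(!doctype|html)/i.test(res.body)) {
    failures.push("body looks like HTML");
  }
  // Coarse check: any line that is exactly "Disallow: /".
  if (/^[ \t]*disallow:[ \t]*\/[ \t]*$/im.test(res.body)) {
    failures.push("contains a site-wide Disallow: /");
  }
  return failures;
}
```

In CI, the deploy job would fetch /robots.txt from production, pass the status, header, and body to this function, and fail the build if the returned list is not empty.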