Crawling & Indexing Critical

robots.txt mistakes beyond AI bots

Blocking JS/CSS kills rendering. `Disallow: /` deindexes the site. robots.txt is not regex, not enforcement, and not the right tool to hide secrets.

robots.txt is one of the oldest and most misunderstood files on the web. It’s a per-host file at /robots.txt that tells well-behaved crawlers which paths they may not fetch. It is not a security boundary, not a regex engine, not a noindex mechanism, and not enforced: it’s a polite request. Many production robots.txt files contain at least one of the bugs in this guide.

Why it matters

Five mistakes that range from “annoying” to “site-deindexed”:

  1. Accidental Disallow: /. Usually shipped from a staging server’s robots.txt that wasn’t replaced at deploy. Googlebot stops crawling the whole site, and rankings can collapse within days.
  2. Blocking /wp-content/ or /static/ or /_next/. Kills CSS and JS fetching. Googlebot can’t render the page. The crawler sees an unstyled, half-built version and indexes that — or fails entirely.
  3. Using robots.txt to “hide” pages. Disallowing a URL prevents crawling, but if the URL is linked from elsewhere, Google can still index a URL-only entry with no content. Use noindex for hiding, robots.txt for crawl management.
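  4. Treating it as regex. robots.txt uses two simple wildcards (* for any sequence of characters and $ for end-of-URL), not regex. Disallow: /*.php$ works; Disallow: /[0-9]+/ does not, because the brackets and the plus sign are matched literally. A backslash is not an escape character either, so there is no need to write \. for a dot.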
  5. Forgetting it’s case-sensitive. Disallow: /Admin/ does not block /admin/. Paths in robots.txt rules are case-sensitive, even though directive names such as Disallow are not.

How to detect it

Fetch and review:

curl -s https://example.com/robots.txt

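Then check how Google reads the file in Search Console (Settings → robots.txt). The robots.txt report there lists the robots.txt files Google found for the site, when each was last fetched, and any parse warnings or errors. To check a specific URL, use the URL Inspection tool, which reports whether the page is blocked by robots.txt. Google has retired its older standalone robots.txt tester, so debugging which rule wins for a given URL now means applying the precedence rules yourself or using a third-party parser.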

Critical sanity checks:

# Does Googlebot have access to JS and CSS?
curl -s https://example.com/robots.txt | grep -iE 'Disallow.*(\.js|\.css|/static|/_next|/wp-content|/assets)'

# Any accidental site-wide blocks?
curl -s https://example.com/robots.txt | grep -iE '^Disallow:[[:space:]]*/[[:space:]]*$'

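The first command should return nothing. The second should return nothing too, unless the match belongs to a group you mean to block entirely (for example, a specific bot’s User-agent group) or the same group carries explicit Allow rules for the paths you do want crawled. grep cannot tell which group a line belongs to, so inspect any match by hand.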
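Because grep has no notion of User-agent groups, a small parser can make the same two checks group-aware. The following is a minimal sketch, not a full robots.txt parser: the names `parseGroups`, `auditRobots`, and `RENDER_HINTS` are illustrative, and the hint list simply mirrors the grep pattern above, so it can produce false positives (for example `.js` also matches `.json`).

```typescript
// Sketch: group-aware version of the two grep sanity checks.
// All names here are illustrative, not part of any library.

type Rule = { type: "allow" | "disallow"; path: string };
type Group = { agents: string[]; rules: Rule[] };

// Substrings that suggest a rule touches rendering-critical resources.
const RENDER_HINTS = [".js", ".css", "/static", "/_next", "/wp-content", "/assets"];

function parseGroups(text: string): Group[] {
  const groups: Group[] = [];
  let current: Group | null = null;
  let lastWasAgent = false;

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // strip comments
    const colon = line.indexOf(":");
    if (colon === -1) continue; // blank or malformed line
    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();

    if (field === "user-agent") {
      // Consecutive User-agent lines share one group.
      if (current === null || !lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if (field === "allow" || field === "disallow") {
      if (current !== null) current.rules.push({ type: field, path: value });
      lastWasAgent = false;
    }
    // Other fields (such as Sitemap) are ignored in this sketch.
  }
  return groups;
}

function auditRobots(text: string): string[] {
  const problems: string[] = [];
  for (const group of parseGroups(text)) {
    const agents = group.agents.join(", ");
    for (const rule of group.rules) {
      if (rule.type !== "disallow") continue;
      if (rule.path === "/") {
        problems.push(`site-wide block for ${agents}`);
      } else if (RENDER_HINTS.some((hint) => rule.path.toLowerCase().includes(hint))) {
        problems.push(`possible render block for ${agents}: ${rule.path}`);
      }
    }
  }
  return problems;
}
```

Feeding it the output of the curl command above returns one message per suspicious rule, each labelled with the group it belongs to, so a deliberate block of a single bot is easy to tell apart from a block under User-agent: *.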

The fix

Universal — robots.txt syntax

# Comments start with #
User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /search?
Allow: /

# Allow all rendering-critical paths
User-agent: Googlebot
Disallow: /admin/
Allow: /

Sitemap: https://example.com/sitemap.xml

Rules:

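  • User-agent: * matches all crawlers. A crawler obeys only the most specific group that matches its name, so a Googlebot group replaces the wildcard group for Googlebot rather than adding to it.
  • When Allow and Disallow both match a URL, the longest (most specific) matching path wins, regardless of the order in the file. If the two paths are the same length, Allow wins. This makes Allow useful for opening a single path inside a blocked directory.
  • One or more Sitemap: declarations, each an absolute URL. They are independent of User-agent groups and conventionally go at the end.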
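The precedence rule is easier to reason about as code. This is a minimal sketch that handles plain path prefixes only (no wildcards) and assumes the rules passed in already belong to the single group that applies to the crawler; `isAllowed` and the `Rule` type are illustrative names.

```typescript
// Sketch of longest-match precedence for plain prefix rules.
type Rule = { type: "allow" | "disallow"; path: string };

// Returns true when `path` may be crawled under `rules`.
function isAllowed(rules: Rule[], path: string): boolean {
  let best: Rule | null = null;
  for (const rule of rules) {
    // An empty path matches nothing; otherwise rules are case-sensitive prefixes.
    if (rule.path === "" || !path.startsWith(rule.path)) continue;
    if (
      best === null ||
      rule.path.length > best.path.length ||
      (rule.path.length === best.path.length && rule.type === "allow")
    ) {
      best = rule; // longer match wins; Allow wins a tie
    }
  }
  // No matching rule means the path is crawlable.
  return best === null || best.type === "allow";
}
```

With the rules `Disallow: /wp-admin/` and `Allow: /wp-admin/admin-ajax.php`, the sketch allows `/wp-admin/admin-ajax.php` (the Allow path is longer) and blocks `/wp-admin/options.php`, whichever order the two lines appear in.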

Wildcards (not regex)

  • * matches any character sequence.
  • $ anchors to end of URL.
  • Nothing else. No character classes, no quantifiers, no groups.

# Block all .pdf URLs
Disallow: /*.pdf$

# Block all URLs with ?sessionid=
Disallow: /*?sessionid=
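
One way to check what a wildcard rule really matches is to translate it into an equivalent regular expression in which everything except * and a trailing $ is escaped. The sketch below does that; `patternToRegExp` and `matches` are illustrative helpers, not part of any crawler’s published code.

```typescript
// Sketch: translate a robots.txt path pattern into a RegExp.
// Only "*" and a trailing "$" are special; every other character is literal.
function patternToRegExp(pattern: string): RegExp {
  const anchored = pattern.endsWith("$");
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const escaped = body
    .split("*")
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")) // escape regex metacharacters
    .join(".*"); // "*" matches any sequence, including an empty one
  // Rules always match from the start of the path.
  return new RegExp("^" + escaped + (anchored ? "$" : ""));
}

function matches(pattern: string, urlPath: string): boolean {
  return patternToRegExp(pattern).test(urlPath);
}
```

Under this translation `/*.pdf$` matches `/docs/report.pdf` but not `/docs/report.pdf?download=1`, and `/[0-9]+/` matches only a path that literally contains `[0-9]+`. It also shows a limit of the second example above: `/*?sessionid=` matches `/cart?sessionid=abc` but not `/cart?page=2&sessionid=abc`, where the parameter follows an ampersand instead of the question mark.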

What to block

  • Admin panels (/admin/, /wp-admin/)
  • Internal search results (/search?)
  • Faceted navigation parameters that explode crawl budget (?color=, ?size=, ?sort=)
  • Cart and checkout flows (/cart/, /checkout/)

What NOT to block

  • CSS, JS, fonts, images required to render the page
  • Anything you want noindexed (use noindex instead)
  • Anything sensitive (robots.txt is public; blocking /admin-secrets/ advertises it)

noindex vs robots.txt — the critical distinction

| Goal | Tool |
| --- | --- |
| Page never crawled | Disallow: in robots.txt |
| Page crawled but never indexed | <meta name="robots" content="noindex"> |
| Page hidden from public | Authentication, not robots.txt |

Important: if a URL is blocked in robots.txt, Google cannot crawl it — which means it cannot see the noindex meta tag. So robots.txt + noindex on the same URL means the URL can still appear in the index with no snippet. To deindex, use noindex alone, without disallowing.
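
The interaction can be written down as a small decision function, which is handy when auditing a list of URLs. This is a sketch of the reasoning in this section only, not a model of Google’s indexing pipeline; `PageConfig`, `expectedOutcome`, and `findConflicts` are hypothetical names, and the input flags are assumed to come from your own crawl data.

```typescript
// Sketch: what to expect for each combination of Disallow and noindex.
type PageConfig = { url: string; disallowed: boolean; noindex: boolean };
type Outcome = "indexable" | "dropped from index" | "may be indexed as URL-only";

function expectedOutcome(page: PageConfig): Outcome {
  // A disallowed page is never fetched, so any noindex tag on it is never seen.
  if (page.disallowed) return "may be indexed as URL-only";
  return page.noindex ? "dropped from index" : "indexable";
}

// URLs carrying both signals: the noindex on these cannot take effect.
function findConflicts(pages: PageConfig[]): string[] {
  return pages.filter((p) => p.disallowed && p.noindex).map((p) => p.url);
}
```

Any URL returned by `findConflicts` is a candidate for removing the Disallow rule so that the noindex tag can be crawled and honoured.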

WordPress

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /?s=
Disallow: /search/

Sitemap: https://example.com/wp-sitemap.xml

Never block /wp-content/ or /wp-includes/ — themes and plugins serve required CSS and JS from there.

Next.js (App Router)

// app/robots.ts
import type { MetadataRoute } from "next";

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      { userAgent: "*", allow: "/", disallow: ["/admin/", "/api/"] },
    ],
    sitemap: "https://example.com/sitemap.xml",
  };
}

Don’t disallow /_next/ — it serves the JS bundles required for rendering.

Nginx — serving robots.txt by host

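location = /robots.txt {
  # Bodies sent by "return" take their Content-Type from default_type
  default_type text/plain;
  # Staging: block all crawlers
  if ($host = "staging.example.com") {
    return 200 "User-agent: *\nDisallow: /\n";
  }
  # Every other host serves the real file
  alias /var/www/robots.txt;
}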

Critical for staging environments — the wrong robots.txt at the wrong host is how Disallow: / ends up in production.
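
Where the application rather than Nginx serves the file, the same host switch can be sketched as a pure function. The version below inverts the logic so that it fails closed: only hosts on an explicit production list get the real rules, and everything else is blocked. The host list and the name `robotsBodyForHost` are assumptions for illustration.

```typescript
// Sketch: choose the robots.txt body by host, blocking anything not known to be production.
// The host names below are assumptions; replace them with your own.
const PRODUCTION_HOSTS = new Set(["example.com", "www.example.com"]);

const BLOCK_ALL = "User-agent: *\nDisallow: /\n";

function robotsBodyForHost(host: string, productionBody: string): string {
  // Normalise case and drop any port from the Host header value.
  const name = host.toLowerCase().replace(/:\d+$/, "");
  return PRODUCTION_HOSTS.has(name) ? productionBody : BLOCK_ALL;
}
```

Listing production hosts instead of staging hosts means a new preview or staging host is blocked by default, and a production host missing from the list shows up as a site-wide Disallow that a post-deploy check can catch.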

Common pitfalls

  • Staging robots.txt deployed to production. Set up an automated check in CI that fetches /robots.txt from prod after every deploy and fails if it contains Disallow: / under User-agent: *.
  • Trying to deindex via robots.txt. Blocked URLs can still be indexed (URL-only). Use noindex to deindex.
  • Blocking /api/ when JSON-LD or content comes from an API. Googlebot fetches JSON resources too. If your structured data or content is API-driven, those paths must be allowed.
  • Forgetting subdomains have their own robots.txt. m.example.com/robots.txt is independent of example.com/robots.txt. Same for blog.example.com.
  • Trailing slash bugs. Disallow: /admin is a prefix match: it blocks /admin and /admin/foo, but also /administrator and /admin.html. Disallow: /admin/ only blocks paths under the directory, such as /admin/foo, not /admin itself. Pick the right form.
  • Robots.txt returning HTML. Some routers serve a friendly 404 page. Verify Content-Type: text/plain.
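
Several of these pitfalls can be caught by one post-deploy check. The sketch below covers the staging-file and HTML-response cases; it takes the already-fetched response as plain data so the logic stays testable, and the names `RobotsResponse` and `checkProductionRobots` are illustrative.

```typescript
// Sketch: validate a production robots.txt response after deploy.
type RobotsResponse = { status: number; contentType: string | null; body: string };

function checkProductionRobots(res: RobotsResponse): string[] {
  const failures: string[] = [];
  if (res.status !== 200) failures.push(`unexpected status ${res.status}`);
  if (!(res.contentType ?? "").toLowerCase().startsWith("text/plain")) {
    failures.push(`unexpected Content-Type: ${res.contentType ?? "none"}`);
  }

  let inWildcardGroup = false; // true while the current group includes "User-agent: *"
  let lastWasAgent = false;
  for (const raw of res.body.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, "").trim(); // strip comments
    const colon = line.indexOf(":");
    if (colon === -1) continue;
    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();

    if (field === "user-agent") {
      if (!lastWasAgent) inWildcardGroup = false; // a new group starts here
      if (value === "*") inWildcardGroup = true;
      lastWasAgent = true;
    } else if (field === "allow" || field === "disallow") {
      lastWasAgent = false;
      if (field === "disallow" && value === "/" && inWildcardGroup) {
        failures.push("Disallow: / under User-agent: *");
      }
    }
  }
  return failures;
}
```

In CI, the three fields could be filled from an HTTP request to the production /robots.txt (status code, Content-Type header, response text), with the job failing whenever the returned list is not empty. A site-wide block aimed at a single named bot does not trigger the check.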