AI crawlers and paywalled content

AI crawlers can only ingest what your server returns to them. For paywalled or metered content, that depends on whether the page is gated by hard access control or by a soft, client-side wall. robots.txt asks compliant crawlers to stay out; only real authentication or server-side gating actually prevents an AI crawler from reading the full text.

Crawlers see what the server sends

An AI crawler ingests the bytes your server returns. If a paywall is enforced only in the browser — the full article is in the HTML but hidden by CSS or removed by JavaScript — then a crawler that reads the raw response can ingest the entire text, paywall or not.
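One way to test for this is to look at the raw response the way a crawler would, without running any CSS or JavaScript. The sketch below is only an illustration: the helper name `raw_html_contains`, the sample markup, and the probe sentence are all hypothetical, and it assumes you already have the raw HTML of a response as a string.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes from raw HTML, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def raw_html_contains(html: str, probe: str) -> bool:
    """Return True if `probe` (a sentence from the gated part of an article)
    appears in the text of the raw markup, whether or not CSS would hide it."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Normalise whitespace so line breaks in the markup do not hide a match.
    text = " ".join(" ".join(parser.chunks).split())
    return " ".join(probe.split()) in text


# Hypothetical responses: one hides the body with CSS, one never sends it.
soft_wall = (
    '<article><p>Intro.</p>'
    '<div class="paywalled" style="display:none"><p>The secret ending.</p></div>'
    '</article>'
)
hard_wall = '<article><p>Intro.</p><p>Subscribe to continue.</p></article>'

soft_leaks = raw_html_contains(soft_wall, "The secret ending.")
hard_leaks = raw_html_contains(hard_wall, "The secret ending.")
```

Here `soft_leaks` is true because the hidden text is still in the markup, while `hard_leaks` is false. This sketch skips script contents, so full text embedded as JSON inside a script tag would need a separate check.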

The practical rule: client-side paywalls are a presentation choice, not an access control. To keep full content away from crawlers, the gating must happen on the server before the bytes are sent.
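A minimal sketch of that rule follows, assuming a framework-independent handler that picks the response body on the server. The `Article` type, the `build_response` function, and the `deny_status` parameter are hypothetical names for illustration, and how `has_valid_subscription` is established (session, token, and so on) is left out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Article:
    teaser: str
    full_text: str


def build_response(
    article: Article,
    has_valid_subscription: bool,
    deny_status: Optional[int] = None,
) -> Tuple[int, str]:
    """Choose status and body on the server, before any bytes are sent.

    The full text is returned only on the authenticated branch, so an
    unauthenticated client (crawlers included) never receives it.
    """
    if deny_status is not None and deny_status not in (401, 402, 403):
        raise ValueError("deny_status must be 401, 402, or 403")
    if has_valid_subscription:
        return 200, article.full_text
    if deny_status is not None:
        # Hard denial: send a status and no article body at all.
        return deny_status, ""
    # Default: a crawlable teaser with nothing hidden in the markup.
    return 200, article.teaser


article = Article(teaser="Intro paragraph.", full_text="Intro paragraph. The rest.")
subscriber_view = build_response(article, has_valid_subscription=True)
anonymous_view = build_response(article, has_valid_subscription=False)
denied_view = build_response(article, has_valid_subscription=False, deny_status=402)
```

The design point is that the teaser and the full text are separate bodies, so there is nothing for a client-side overlay to hide.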

Soft walls, hard walls, and robots.txt

A soft, metered wall typically serves the full content and counts views client-side, so a crawler that ignores scripts and cookies bypasses the meter entirely. A hard wall returns a teaser or a denial status (401 Unauthorized, 402 Payment Required, or 403 Forbidden) and never sends the protected body without credentials.

robots.txt sits on top of both. Disallowing an AI token asks compliant crawlers not to fetch protected paths at all, but it is a request, not enforcement. Pair robots.txt with server-side gating so a non-compliant client still cannot read the full text.

Client-side paywalls do not stop a crawler reading the raw HTML
Server-side gating (teaser or 401/402/403) is what actually protects content
robots.txt asks compliant crawlers to stay out; it does not enforce

Balancing visibility and protection

Many publishers want teaser or summary pages crawlable for AI visibility while keeping full articles gated. That is achievable: allow crawlers on the public teaser URL, return only the teaser body to unauthenticated requests, and keep the full text behind authentication.

Decide this per AI token. You might allow a search-oriented crawler on teasers while disallowing training crawlers entirely, depending on whether you want the content represented in models or merely discoverable.
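A per-token policy could look like the sketch below, which uses Python's standard `urllib.robotparser` to show how a compliant crawler would read the rules. The tokens `ExampleTrainingBot` and `ExampleSearchBot` and the `/teasers/` and `/articles/` paths are hypothetical; substitute the tokens each vendor publishes and your own URL layout.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: the training crawler is kept out entirely, the
# search crawler may read teasers but not full articles.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: ExampleSearchBot
Allow: /teasers/
Disallow: /articles/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(token: str, path: str) -> bool:
    """What a compliant crawler identifying as `token` would conclude."""
    return parser.can_fetch(token, "https://example.com" + path)


training_on_teaser = may_fetch("ExampleTrainingBot", "/teasers/launch")
search_on_teaser = may_fetch("ExampleSearchBot", "/teasers/launch")
search_on_article = may_fetch("ExampleSearchBot", "/articles/launch")
```

This only models the decision of a crawler that honors robots.txt. A client that ignores the file is not affected by it, which is why the full articles still need server-side gating.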

How it appears in analytics and logs

If an AI token fetches a paywalled URL and your server returned the full HTML, the crawler ingested the full text regardless of any client-side overlay. If it received only a teaser or a 401/402/403, the protected body stayed out of reach.
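As a rough illustration of that reading of the logs, the sketch below classifies access-log lines for gated paths. Everything specific in it is an assumption: the combined-log-style line format, the hypothetical tokens in `AI_TOKENS`, the `/articles/` prefix, and the `TEASER_MAX_BYTES` threshold used as a heuristic for telling a teaser from a full body.

```python
import re

# Assumed combined-log-style line; adapt the pattern to your server's format.
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

AI_TOKENS = ("ExampleTrainingBot", "ExampleSearchBot")  # hypothetical tokens
GATED_PREFIX = "/articles/"   # hypothetical path of subscriber-only content
TEASER_MAX_BYTES = 4000       # hypothetical upper bound for a teaser response


def classify(line: str):
    """Return (token, path, status, verdict) for AI fetches of gated paths,
    or None for lines that do not parse or are not relevant."""
    m = LOG_LINE.search(line)
    if not m:
        return None
    agent = m["agent"].lower()
    token = next((t for t in AI_TOKENS if t.lower() in agent), None)
    if token is None or not m["path"].startswith(GATED_PREFIX):
        return None
    status = int(m["status"])
    size = 0 if m["size"] == "-" else int(m["size"])
    if status in (401, 402, 403):
        verdict = "blocked"
    elif status == 200 and size > TEASER_MAX_BYTES:
        verdict = "possible full-text exposure"
    else:
        verdict = "teaser or other"
    return token, m["path"], status, verdict


exposed = classify(
    '203.0.113.7 - - [24/Jun/2026:10:00:00 +0000] "GET /articles/launch HTTP/1.1" '
    '200 48213 "-" "Mozilla/5.0 (compatible; ExampleTrainingBot/1.0)"'
)
blocked = classify(
    '203.0.113.7 - - [24/Jun/2026:10:00:05 +0000] "GET /articles/launch HTTP/1.1" '
    '403 0 "-" "Mozilla/5.0 (compatible; ExampleSearchBot/1.0)"'
)
```

Response size is only a proxy for what was in the body, and user-agent strings can be spoofed, so treat a "possible full-text exposure" verdict as a prompt to inspect the actual response rather than as proof.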

Diagnostic use case

Decide how AI crawlers should treat subscriber-only content: keep teaser pages crawlable for visibility while ensuring full articles are server-gated so crawlers cannot ingest them.

What WebmasterID can help detect

On the bot-intelligence surface, WebmasterID records which AI tokens fetched which URLs and the status returned, so you can confirm whether crawlers are reaching teaser pages or protected bodies.

Common mistakes

Assuming a client-side paywall hides full text from crawlers — it does not.
Relying on robots.txt alone to protect subscriber content instead of server gating.
Serving full article HTML to crawlers while showing readers a paywall overlay.
Using one policy for all AI tokens when training and search crawlers warrant different rules.

Privacy and accuracy notes

Paywall handling for crawlers concerns content access policy, not visitor identity. No subscriber data is exposed to a crawler, and detection keys on the crawler token, not on any human session.

Frequently asked questions

Can robots.txt protect my paywalled articles?

Only from compliant crawlers, and only as a request. A robots.txt Disallow does not stop a non-compliant client from fetching the page. Real protection requires server-side gating that withholds the full body without valid credentials.

Sources and verification notes

MDN, 402 Payment Required: status semantics relevant to hard-gated paid content.
Google, robots.txt specification: robots.txt is a crawl request, not an access-control mechanism.

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.