
AI crawler honeypots and traps

An AI crawler honeypot is a deliberately planted resource — a hidden link, a disallowed path, or an endlessly generated 'tar-pit' page — used to detect or slow crawlers that ignore robots.txt. Tools such as Nepenthes popularised the tar-pit approach. This entry explains the techniques, what they can prove, and why they are a detection aid rather than enforcement.

Partially verified

How honeypots work

The simplest honeypot is a link hidden from humans (for example via CSS) and listed as Disallow in robots.txt. A compliant crawler skips it; a non-compliant one fetches it, outing itself. A more aggressive variant is a tar pit — a page that generates endless low-value links to waste a misbehaving crawler's budget. The open-source Nepenthes project is a documented example of the tar-pit pattern.
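As a minimal sketch of the compliance side of this setup, Python's standard-library urllib.robotparser can confirm that a trap path is disallowed for every user agent while ordinary pages stay fetchable. The path /trap-7f3a/, the domain example.com and the agent name ExampleBot are hypothetical placeholders, not names taken from any real deployment.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical trap path: something unguessable that no visible page links to.
HONEYPOT_PATH = "/trap-7f3a/"

# The robots.txt rule that tells every compliant crawler to stay out of the trap.
ROBOTS_TXT = f"""\
User-agent: *
Disallow: {HONEYPOT_PATH}
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A compliant crawler would run an equivalent check and skip the trap URL.
trap_allowed = is_allowed(
    ROBOTS_TXT, "ExampleBot", "https://example.com" + HONEYPOT_PATH + "index.html"
)
home_allowed = is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/")
```

Running a check like this before deploying the trap helps make sure the Disallow rule actually covers the hidden link, so that a later hit really does indicate a client that ignored robots.txt.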

The goal is detection and friction, not access control. A honeypot tells you something fetched a path it should not have; that is evidence of non-compliance you can act on.
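The tar-pit variant described above can be illustrated with a small, stateless link generator. This is only a sketch of the general pattern and makes no claim about how Nepenthes itself is implemented; the prefix /trap-7f3a/ and the constant LINKS_PER_PAGE are assumptions chosen for the example.

```python
import hashlib

# Keep this small: every generated page costs your own server resources too.
LINKS_PER_PAGE = 5


def tarpit_links(path: str, prefix: str = "/trap-7f3a/") -> list[str]:
    """Derive child links deterministically from the requested path.

    Hashing the path means nothing is stored server-side, and the same
    URL always yields the same links.
    """
    links = []
    for i in range(LINKS_PER_PAGE):
        digest = hashlib.sha256(f"{path}#{i}".encode()).hexdigest()[:12]
        links.append(f"{prefix}{digest}/")
    return links


def tarpit_page(path: str) -> str:
    """Render a minimal HTML page whose links all lead deeper into the trap."""
    items = "\n".join(
        f'<li><a href="{href}">{href}</a></li>' for href in tarpit_links(path)
    )
    return f"<html><body><ul>\n{items}\n</ul></body></html>"
```

Because every generated link stays under the disallowed prefix, a crawler that honours robots.txt never enters the maze; only a client that already ignored the rule keeps following links.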

Limits and cautions

Honeypots prove behaviour, not identity. A fetch of a disallowed path shows the client ignored robots.txt, but user-agent tokens are trivially copied, so the hit alone cannot tell you which vendor was really behind it; verify the source (for example by IP) before naming anyone. Tar pits also consume your own server resources and can entangle benign tools, so use them deliberately.
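One common way to do the verification step is forward-confirmed reverse DNS: resolve the client IP to a hostname, check that the hostname belongs to a domain the vendor documents, then resolve that hostname back and confirm it returns the same IP. The sketch below assumes the vendor publishes such hostnames; the suffix .crawler.example.com is a made-up placeholder, and the resolver functions are injectable so the logic can be exercised without network access.

```python
import socket
from typing import Callable, Sequence

# Placeholder suffix for illustration; take real values from the vendor's own documentation.
TRUSTED_SUFFIXES = (".crawler.example.com",)


def reverse_lookup(ip: str) -> str:
    """PTR lookup: IP address to hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> list[str]:
    """Forward lookup: hostname to IPv4 addresses (gethostbyname_ex is IPv4-only)."""
    return socket.gethostbyname_ex(hostname)[2]


def is_verified(
    ip: str,
    suffixes: Sequence[str] = TRUSTED_SUFFIXES,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], list[str]] = forward_lookup,
) -> bool:
    """Return True only if the IP's hostname is trusted and resolves back to the IP."""
    try:
        hostname = reverse(ip)
        if not hostname.lower().rstrip(".").endswith(tuple(suffixes)):
            return False
        return ip in forward(hostname)
    except OSError:
        # Covers socket.herror and socket.gaierror: treat lookup failure as unverified.
        return False
```

A result of False does not prove impersonation; it only means the hit cannot be attributed to that vendor by this method. Some vendors publish IP ranges instead of hostnames, in which case a range check would replace the DNS round trip.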

Never present a honeypot as a guarantee against scraping. Determined non-compliant clients can detect and avoid traps, and aggressive tar-pitting may breach your host's acceptable-use terms. Treat honeypots as one detection signal among several, not a wall.

Hidden, robots-disallowed link that compliant crawlers skip
Tar-pit pages (e.g. Nepenthes) generate endless links to waste budget
Proves non-compliant behaviour, not vendor identity — verify before naming

How it appears in analytics and logs

A request for a hidden or robots-disallowed honeypot URL is a strong signal the client did not honour your rules — a useful flag for non-compliance. It identifies behaviour, not necessarily a specific named vendor.
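A rough sketch of how such hits could be pulled from an access log follows. It assumes the common or combined log format and a hypothetical trap prefix /trap-7f3a/; the sample lines are synthetic test data written for this example, not real traffic.

```python
import re
from collections import Counter

HONEYPOT_PREFIX = "/trap-7f3a/"  # hypothetical trap prefix

# Matches common log format, with the combined-format referer and agent optional.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def honeypot_hits(lines, prefix=HONEYPOT_PREFIX):
    """Count trap requests per (client IP, claimed user agent) pair."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path").startswith(prefix):
            hits[(match.group("ip"), match.group("agent") or "-")] += 1
    return hits


# Synthetic sample lines, for illustration only.
SAMPLE = [
    '203.0.113.7 - - [24/Jun/2026:10:00:00 +0000] "GET /trap-7f3a/a1/ HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '203.0.113.7 - - [24/Jun/2026:10:00:01 +0000] "GET /trap-7f3a/b2/ HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '198.51.100.9 - - [24/Jun/2026:10:00:02 +0000] "GET /about HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
]
hits = honeypot_hits(SAMPLE)
```

The user agent stored in each key is only what the client claimed to be, so the counts flag non-compliant behaviour per IP rather than confirming any vendor.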

Diagnostic use case

Detect crawlers that ignore robots.txt by observing whether they fetch a disallowed honeypot path, and understand the limits before relying on traps.

What WebmasterID can help detect

WebmasterID can surface hits to known honeypot paths as bot events, helping you spot crawlers that ignored robots.txt without manually grepping logs for the trap URL.

Common mistakes

Naming a specific vendor from a honeypot hit without IP verification.
Assuming a tar pit stops scraping rather than slowing some clients.
Letting honeypot links be visible to humans and pollute analytics.

Privacy and accuracy notes

Honeypots observe crawler behaviour, not human users. Any link a real person would never see should not affect human analytics. WebmasterID records honeypot hits as bot events only.


Sources and verification notes

Nepenthes: AI tar-pit project. Documented example of a tar-pit honeypot for non-compliant crawlers.
Google: robots.txt specification. Defines Disallow, the basis for compliant-vs-not detection.

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.