AI crawler honeypots and traps
An AI crawler honeypot is a deliberately planted resource — a hidden link, a disallowed path, or an endlessly generated 'tar-pit' page — used to detect or slow crawlers that ignore robots.txt. Tools such as Nepenthes popularised the tar-pit approach. This entry explains the techniques, what they can prove, and why they are a detection aid rather than enforcement.
How honeypots work
The simplest honeypot is a link hidden from humans (for example via CSS) whose path is listed under a Disallow rule in robots.txt. A compliant crawler skips it; a non-compliant one fetches it and so reveals that it ignored the rule. A more aggressive variant is a tar pit: a page that generates endless low-value links to waste a misbehaving crawler's budget. The open-source Nepenthes project is a documented example of the tar-pit pattern.
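As a rough illustration of the hidden-link pattern, the sketch below pairs a robots.txt rule with a hidden link and uses Python's standard urllib.robotparser to show what a compliant crawler would conclude. The trap path, the link markup, and the ExampleBot token are hypothetical placeholders chosen for this example, not values from any real deployment.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical trap path, invented for this example.
TRAP_PATH = "/trap-7f3a/"

# robots.txt content that tells every crawler to stay out of the trap.
ROBOTS_TXT = f"""\
User-agent: *
Disallow: {TRAP_PATH}
"""

# A link humans should never see or focus, but which is present in the raw HTML.
HIDDEN_LINK = (
    f'<a href="{TRAP_PATH}" rel="nofollow" aria-hidden="true" '
    'tabindex="-1" style="display:none">archive</a>'
)


def compliant_crawler_may_fetch(path: str, user_agent: str = "ExampleBot") -> bool:
    """Return what a robots.txt-honouring crawler would decide for this path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    print(compliant_crawler_may_fetch(TRAP_PATH + "page"))
    print(compliant_crawler_may_fetch("/about"))
```

Any client that requests a URL under the trap path has therefore done something a rule-following crawler would not, which is the whole signal the honeypot provides.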
The goal is detection and friction, not access control. A honeypot tells you something fetched a path it should not have; that is evidence of non-compliance you can act on.
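Detection can be as simple as scanning access logs for the trap path. The sketch below assumes logs in the common Combined Log Format and reuses a hypothetical trap prefix; the sample lines use reserved documentation IP addresses and made-up user agents, so adjust the pattern to whatever your server actually writes.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Hypothetical trap prefix, invented for this example.
TRAP_PREFIX = "/trap-7f3a/"

# Combined Log Format:
# ip ident user [time] "METHOD path protocol" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


class TrapHit(NamedTuple):
    ip: str
    time: str
    path: str
    agent: str


def find_trap_hits(lines: Iterable[str], trap_prefix: str = TRAP_PREFIX) -> Iterator[TrapHit]:
    """Yield one TrapHit per log line whose request path is under the trap prefix."""
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match["path"].startswith(trap_prefix):
            yield TrapHit(match["ip"], match["time"], match["path"], match["agent"])


# Illustrative log lines only; not real traffic.
SAMPLE_LOG = [
    '203.0.113.9 - - [24/Jun/2026:10:00:00 +0000] "GET /trap-7f3a/a HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '198.51.100.4 - - [24/Jun/2026:10:00:01 +0000] "GET /about HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for hit in find_trap_hits(SAMPLE_LOG):
        print(hit)
```

The user agent recorded in a hit is whatever the client chose to send, so treat it as a claim to be checked rather than an identification.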
Limits and cautions
Honeypots prove behaviour, not identity. A fetch of a disallowed path shows the client ignored robots.txt, but user-agent tokens are easily copied, so the token alone cannot tell you which vendor it really was. Combine the hit with verification before naming anyone. Tar pits also consume your own server resources and can entangle benign tools, so use them deliberately.
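One common verification technique is forward-confirmed reverse DNS: resolve the client IP to a hostname, check that the hostname falls under a domain the operator documents, then resolve that hostname back and confirm it includes the original IP. The sketch below is a minimal version under the assumption that the operator in question supports this method; some operators publish IP ranges instead, and the suffix shown is a hypothetical placeholder. The resolver functions are injectable so the logic can be exercised without network access.

```python
import socket
from typing import Callable, Sequence


def forward_confirmed_reverse_dns(
    ip: str,
    allowed_suffixes: Sequence[str],
    reverse: Callable = socket.gethostbyaddr,
    forward: Callable = socket.gethostbyname_ex,
) -> bool:
    """Return True only if reverse and forward DNS agree and the host is in an allowed domain."""
    try:
        hostname = reverse(ip)[0].rstrip(".").lower()
    except OSError:
        return False  # no reverse record, or lookup failed

    suffixes = [s.strip(".").lower() for s in allowed_suffixes]
    if not any(hostname == s or hostname.endswith("." + s) for s in suffixes):
        return False  # hostname is not under a domain the operator documents

    try:
        # gethostbyname_ex returns (hostname, aliases, ipv4_addresses); IPv4 only.
        addresses = forward(hostname)[2]
    except OSError:
        return False
    return ip in addresses


if __name__ == "__main__":
    # "bot.example.com" is a hypothetical suffix; use the operator's documented domain.
    print(forward_confirmed_reverse_dns("203.0.113.9", ["bot.example.com"]))
```

A failed check does not prove a forged identity by itself, because DNS lookups can also fail for benign reasons; it only means the vendor claim remains unverified.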
Never present a honeypot as a guarantee against scraping. Determined non-compliant clients can detect and avoid traps, and aggressive tar-pitting may breach your host's acceptable-use terms. Treat honeypots as one detection signal among several, not a wall.
- Hidden, robots-disallowed link that compliant crawlers skip
- Tar-pit pages (e.g. Nepenthes) generate endless links to waste budget
- Proves non-compliant behaviour, not vendor identity — verify before naming
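The tar-pit idea can be sketched as a page generator that derives its links from the requested path, so nothing needs to be stored and every page leads to more pages. This is an illustrative sketch only, not how Nepenthes is implemented; the trap prefix, link count, and depth cap are hypothetical values, and the cap is there because a tar pit also spends your own server resources.

```python
import hashlib
from html import escape
from typing import List

# All three values are hypothetical choices for this example.
TRAP_PREFIX = "/trap-7f3a/"
LINKS_PER_PAGE = 5
MAX_DEPTH = 20  # stop generating links past this depth to bound our own cost


def child_paths(path: str, count: int = LINKS_PER_PAGE) -> List[str]:
    """Derive child URLs deterministically from the parent path."""
    base = path.rstrip("/")
    children = []
    for i in range(count):
        seed = f"{path}:{i}".encode("utf-8")
        token = hashlib.sha256(seed).hexdigest()[:10]
        children.append(f"{base}/{token}/")
    return children


def trap_depth(path: str) -> int:
    """Count path segments below the trap prefix."""
    return len([seg for seg in path[len(TRAP_PREFIX):].split("/") if seg])


def render_tarpit_page(path: str) -> str:
    """Return a low-value HTML page whose links all stay inside the trap."""
    if not path.startswith(TRAP_PREFIX):
        raise ValueError("path is outside the trap prefix")
    links = child_paths(path) if trap_depth(path) < MAX_DEPTH else []
    items = "\n".join(
        f'<li><a href="{escape(href)}">{escape(href)}</a></li>' for href in links
    )
    return f"<html><body><ul>\n{items}\n</ul></body></html>"


if __name__ == "__main__":
    print(render_tarpit_page(TRAP_PREFIX))
```

Because every generated link stays under the disallowed prefix, a compliant crawler never enters the trap, while a client already inside it only finds more of the same.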
How it appears in analytics and logs
A request for a hidden or robots-disallowed honeypot URL is a strong signal the client did not honour your rules — a useful flag for non-compliance. It identifies behaviour, not necessarily a specific named vendor.
Diagnostic use case
Detect crawlers that ignore robots.txt by observing whether they fetch a disallowed honeypot path, and understand the limits before relying on traps.
What WebmasterID can help detect
WebmasterID can surface hits to known honeypot paths as bot events, helping you spot crawlers that ignored robots.txt without manually grepping logs for the trap URL.
Common mistakes
- Naming a specific vendor from a honeypot hit without IP verification.
- Assuming a tar pit stops scraping rather than slowing some clients.
- Letting honeypot links be visible to humans and pollute analytics.
Privacy and accuracy notes
Honeypots observe crawler behaviour, not human users. Any link a real person would never see should not affect human analytics. WebmasterID records honeypot hits as bot events only.
Related pages
- Undeclared AI scrapers and how they appear
Some AI scrapers do not declare a recognisable token. They appear with generic user agents, browser-like strings, or forged identities. They cannot be identified by a clean token, so the honest approach is to describe the pattern, verify what you can, and categorise conservatively.
- Do AI crawlers obey robots.txt?
Operators of major declared AI crawlers such as GPTBot and ClaudeBot document that they honour robots.txt, and Google documents Google-Extended as a robots.txt product token rather than a separate crawler. Compliance is voluntary and varies across operators. robots.txt is a crawl request defined by a shared standard, not an access-control mechanism, so a non-compliant or undeclared scraper can ignore it. Enforcement requires server-side controls.
- Detecting AI crawlers without a user agent
Not every AI crawler declares a clean token — some send a blank, generic, or browser-like user agent. You cannot identify those by token alone. This entry describes the behavioural and network signals that flag likely automated AI fetching, while being explicit that behaviour suggests a class, not a named vendor, and that you must never invent identity.
- Website observability
Surface unusual fetch patterns, including hits to trap paths.
Sources and verification notes
- Nepenthes (AI tar-pit project): documented example of a tar-pit honeypot for non-compliant crawlers.
- Google (robots.txt specification): defines Disallow, the basis for compliant-vs-not detection.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.