How to block the Censys scanner
Censys runs internet-wide scans that catalog hosts and services for security research. Because it operates at the host/port level rather than fetching pages the way a polite web crawler does, robots.txt is largely ineffective against it. This page explains what Censys does and why firewall-level controls, not robots.txt, are the right response.
What Censys is
Censys performs internet-wide scanning to catalog hosts, open ports, certificates, and services for security and research purposes. Unlike a search-engine crawler that requests pages and reads robots.txt, a host scanner probes addresses and services directly.
Where Censys makes HTTP requests, it may send a self-identifying user agent. But because the goal is host discovery rather than content crawling, robots.txt — a content-crawling convention — does not meaningfully govern its behavior.
- Purpose: internet-wide host and service scanning for research
- Operates at host/port level, not as a polite page crawler
- robots.txt is a content convention scanners largely ignore
Why robots.txt is the wrong tool
You can add a Disallow for any self-identifying scanner token, and a compliant HTTP fetcher may honour it on the content-crawling side:
User-agent: CensysInspect
Disallow: /
But this does nothing about host-level scanning. To limit exposure, use enforcement: firewall rules, an IP allowlist for sensitive services, and a WAF. Censys also documents an opt-out process for its scanning, which is more effective than robots.txt for this kind of activity.
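As a minimal sketch, the Disallow rule above can be checked with Python's standard-library robots.txt parser. This only shows what a compliant HTTP fetcher would conclude from the file; it says nothing about host-level or port-level probing, which never consults robots.txt. The `ExampleBot` agent name and the `allowed` helper are hypothetical, used only for contrast.

```python
from urllib.robotparser import RobotFileParser

# The rule from this page, one directive per line.
ROBOTS_TXT = """\
User-agent: CensysInspect
Disallow: /
"""


def allowed(user_agent: str, path: str) -> bool:
    """Return True if a robots.txt-compliant fetcher may request `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


# A compliant fetcher using the CensysInspect token is asked to stay out.
censys_allowed = allowed("CensysInspect", "/")

# "ExampleBot" is a hypothetical agent with no matching rule, so it is allowed.
other_allowed = allowed("ExampleBot", "/")

print(censys_allowed, other_allowed)
```

The result is advisory only: the parser reports what the file requests, and nothing in it prevents a connection to an open port.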
How it appears in analytics and logs
Requests or connections attributed to Censys are internet-wide scanning, not human visits and not polite page crawling. They reflect external host discovery; treat them as automated scanning, not audience.
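One way to keep such requests out of audience figures is to tag log lines by user agent. The sketch below assumes an access log in the common "combined" format, where the user agent is the last quoted field, and matches on the CensysInspect token. The sample log lines and the `classify` helper are illustrative assumptions, not real traffic or an official user-agent reference.

```python
import re

# Self-identifying scanner tokens to look for (lowercase). Extend as needed.
SCANNER_TOKENS = ("censysinspect",)

# Assumes the user agent is the last double-quoted field on the line.
UA_PATTERN = re.compile(r'"(?P<ua>[^"]*)"\s*$')


def classify(log_line: str) -> str:
    """Label a log line as 'scanner' or 'other' based on its user agent."""
    match = UA_PATTERN.search(log_line)
    user_agent = match.group("ua").lower() if match else ""
    if any(token in user_agent for token in SCANNER_TOKENS):
        return "scanner"
    return "other"


# Illustrative sample lines (addresses are from documentation ranges).
scanner_line = (
    '192.0.2.1 - - [24/Jun/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (compatible; CensysInspect/1.1)"'
)
browser_line = (
    '198.51.100.7 - - [24/Jun/2026:10:00:05 +0000] "GET /pricing HTTP/1.1" 200 2048 '
    '"-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"'
)

print(classify(scanner_line), classify(browser_line))
```

This only catches requests that identify themselves and reach the web server; connections to other ports never appear in an HTTP access log.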
Diagnostic use case
Understand why a robots.txt Disallow does little against an internet-wide host scanner, and use firewall/edge controls to limit exposure instead.
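To illustrate what enforcement means in practice, here is a rough sketch of an IP allowlist check for a sensitive service, using Python's standard `ipaddress` module. The networks listed are placeholder documentation ranges and the `is_allowed` helper is hypothetical; a real deployment would normally apply this logic in the firewall or WAF rather than in application code.

```python
import ipaddress

# Placeholder allowlist using reserved documentation ranges.
# These are NOT real office, VPN, or Censys ranges; substitute your own.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_allowed(client_ip: str) -> bool:
    """Return True only if the client address falls inside an allowed network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ALLOWED_NETWORKS)


print(is_allowed("192.0.2.10"))    # inside the first placeholder range
print(is_allowed("198.51.100.7"))  # outside every listed range
```

An allowlist denies by default, so a scanner is refused regardless of its user agent or whether it reads robots.txt.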
What WebmasterID can help detect
WebmasterID classifies self-identifying scanner requests that reach your application as automated, separate from human analytics, so HTTP-level scans are visible in bot traffic.
Common mistakes
- Expecting robots.txt to stop host/port scanning — it governs content crawling only.
- Listing internal service paths in robots.txt, which advertises them to scanners.
- Skipping firewall/WAF controls that actually limit scanner exposure.
Privacy and accuracy notes
Identifying scanner traffic relies on request characteristics and any self-identifying user agent, not visitor identity. Edge enforcement may act on connection metadata operationally; that is access control, not visitor profiling.
Related pages
- robots.txt vs a firewall/WAF
robots.txt and a firewall/WAF solve different problems: robots.txt politely asks compliant crawlers what to skip, while a firewall or WAF actually blocks requests at the network or edge layer. This page contrasts the two, explains when each is appropriate, and warns against using robots.txt for jobs only enforcement can do.
- How to block the Internet Archive crawler
The Internet Archive operates crawlers (historically using the ia_archiver token, and more recently archive.org_bot) that capture public pages for the Wayback Machine. This page explains how the crawler identifies itself, the robots.txt rule to disallow it, and the important caveat that the Archive's robots.txt handling has changed over time.
- robots.txt for API endpoints
JSON APIs are sometimes added to robots.txt to keep crawlers out, but robots.txt only requests compliance from polite crawlers and does nothing to authenticate or hide an endpoint. This page covers when disallowing /api is reasonable, what it does not do, and why access control belongs at the application layer.
- Bot vs human
Separate automated scanners from human visitors.
Sources and verification notes
- Censys: about our scanning and opt-out. Censys documents internet-wide scanning and an opt-out process; robots.txt does not govern host scanning.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.