Robots & crawl control

How to block the Censys scanner

Censys runs internet-wide scans that catalog hosts and services for security research. Because it operates at the host/port level rather than fetching pages the way a polite web crawler does, robots.txt is largely ineffective against it. This page explains what Censys does and why firewall-level controls, not robots.txt, are the right response.

Partially verified

What Censys is

Censys performs internet-wide scanning to catalog hosts, open ports, certificates, and services for security and research purposes. Unlike a search-engine crawler that requests pages and reads robots.txt, a host scanner probes addresses and services directly.

Where Censys makes HTTP requests, it may send a self-identifying user agent. But because the goal is host discovery rather than content crawling, robots.txt — a content-crawling convention — does not meaningfully govern its behavior.

Purpose: internet-wide host and service scanning for research
Operates at host/port level, not as a polite page crawler
robots.txt is a content convention scanners largely ignore

Why robots.txt is the wrong tool

You can add a Disallow for any self-identifying scanner token, and a compliant HTTP fetcher may honour it on the content-crawling side:

User-agent: CensysInspect
Disallow: /

But this does nothing about host-level scanning. To limit exposure, use enforcement: firewall rules, an IP allowlist for sensitive services, and a WAF. Censys also documents an opt-out process for its scanning, which is more effective than robots.txt for this kind of activity.
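As a rough illustration of the IP allowlist idea, the following Python sketch checks whether a connecting address falls inside a set of permitted networks. The function name `is_allowed` and the network ranges are hypothetical placeholders (the ranges are reserved documentation blocks, not Censys addresses), and in practice this kind of enforcement usually lives in a firewall or WAF rule rather than in application code.

```python
import ipaddress

# Hypothetical allowlist: reserved documentation ranges stand in for
# the real office or VPN networks you would permit.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_allowed(client_ip: str) -> bool:
    """Return True if client_ip belongs to one of the allowlisted networks."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        # Unparseable addresses are rejected rather than trusted.
        return False
    return any(addr in net for net in ALLOWED_NETWORKS)
```

A default-deny check like this applies to every unlisted source, so it limits exposure of a sensitive service regardless of which scanner is probing it.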

How it appears in analytics and logs

Requests or connections attributed to Censys come from internet-wide scanning, not from human visits or polite page crawling. They reflect external host discovery; treat them as automated scanning, not audience.
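One way to keep such requests out of audience figures is to separate them when processing access logs. The sketch below assumes a combined-style log format in which the user agent is the final quoted field, and it matches on the `CensysInspect` token used in the robots.txt example; both are assumptions to verify against your own logs and Censys's documentation. The helper names are hypothetical, and scans that do not self-identify will not be caught this way.

```python
import re

# Token taken from the robots.txt example on this page (assumption:
# self-identifying Censys HTTP requests include it in the user agent).
SCANNER_TOKEN = "censysinspect"

# Assumes the user agent is the last double-quoted field on the line.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def is_scanner_line(log_line: str) -> bool:
    """Return True if the line's user agent contains the scanner token."""
    match = UA_PATTERN.search(log_line)
    if not match:
        return False
    return SCANNER_TOKEN in match.group(1).lower()


def split_traffic(lines):
    """Split log lines into (scanner_lines, other_lines)."""
    scanner, other = [], []
    for line in lines:
        (scanner if is_scanner_line(line) else other).append(line)
    return scanner, other
```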

Diagnostic use case

Understand why a robots.txt Disallow does little against an internet-wide host scanner, and use firewall/edge controls to limit exposure instead.

What WebmasterID can help detect

WebmasterID classifies self-identifying scanner requests that reach your application as automated, separate from human analytics, so HTTP-level scans are visible in bot traffic.

Common mistakes

Expecting robots.txt to stop host/port scanning — it governs content crawling only.
Listing internal service paths in robots.txt, which advertises them to scanners.
Skipping firewall/WAF controls that actually limit scanner exposure.
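To check for the second mistake, a small script can flag Disallow entries that look like they reveal internal services. This is a minimal sketch: the keyword list and the function name `revealing_disallows` are hypothetical, and a keyword match is only a prompt for manual review, not proof that a path is sensitive.

```python
# Hypothetical hints; adjust to the naming used in your own environment.
SENSITIVE_HINTS = ("admin", "internal", "staging", "backup")


def revealing_disallows(robots_txt: str) -> list:
    """Return Disallow paths whose text contains a sensitive-looking hint."""
    flagged = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "disallow":
            continue
        path = value.strip()
        if any(hint in path.lower() for hint in SENSITIVE_HINTS):
            flagged.append(path)
    return flagged
```

Paths flagged this way are better protected by access control at the firewall or application layer than by a robots.txt entry that anyone can read.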

Privacy and accuracy notes

Identifying scanner traffic relies on request characteristics and any self-identifying user agent, not visitor identity. Edge enforcement may act on connection metadata operationally; that is access control, not visitor profiling.


Sources and verification notes

Censys — about our scanning and opt-out
Censys documents internet-wide scanning and an opt-out process; robots.txt does not govern host scanning.

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.