Security scanners vs search crawlers

Security scanners (Censys, Shodan, BinaryEdge, Qualys and similar) probe hosts, ports, and application surface to assess exposure and find vulnerabilities. Search crawlers (Googlebot, Bingbot) fetch and index content to rank it. Confusing the two leads to wrong robots.txt decisions and misread logs: robots.txt governs content crawling, not port scanning, and scan traffic should never be counted as audience.

Verified against primary sources

What this means

A search crawler exists to discover and index content so it can be ranked and shown in search results. A security scanner exists to map exposure: which hosts answer, which ports are open, and whether an application has exploitable weaknesses. They share the trait of being automated, but their goals and their request patterns are different.

Search crawlers follow links and fetch real pages. Scanners often request unusual paths, probe ports, and send many parameter variations that have nothing to do with content.

Why robots.txt is the wrong control for scanning

robots.txt is a convention that compliant content crawlers honour when deciding which URLs to fetch. Internet scanners and vulnerability tools are not content crawlers and generally do not treat robots.txt as a scanning boundary, because they are characterising infrastructure, not building a search index.

To limit scanning, use the provider's documented opt-out where one exists, network-level controls such as firewall rules and rate limiting, and good external hygiene such as closing unused ports and services. robots.txt does not help here. To control what search crawlers fetch, use robots.txt; to control what they index, use meta-robots directives.
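The advisory nature of robots.txt can be seen with a minimal sketch using Python's standard urllib.robotparser. The robots.txt content, the example.com URLs, and the may_crawl helper are assumptions made up for illustration. The point is that the check runs inside a compliant crawler that chooses to consult the file; nothing in it can stop a scanner that never reads robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; the paths are assumptions.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(user_agent: str, url: str) -> bool:
    """Return the decision a compliant crawler would make for itself."""
    return parser.can_fetch(user_agent, url)


# A compliant crawler asks before fetching and respects the answer.
googlebot_private = may_crawl("Googlebot", "https://example.com/private/report.html")
googlebot_public = may_crawl("Googlebot", "https://example.com/articles/intro.html")
other_drafts = may_crawl("SomeOtherBot", "https://example.com/drafts/post.html")

print(googlebot_private, googlebot_public, other_drafts)
```

A port scanner or vulnerability tool performs no such check, which is why the rules above have no bearing on it.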

Search crawler: fetches content to index and rank it
Security scanner: probes ports/parameters to assess exposure
robots.txt governs content crawling, not scanning

How it appears in analytics and logs

If logs show systematic probing of ports, parameters, or many URL variations, that is scanning, not indexing. Search crawlers fetch real content paths to index them. Misclassifying scanning as crawling produces wrong robots.txt rules and wrong analytics.
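As a rough sketch of that distinction, the heuristic below labels one client's requests by pattern. Everything in it is an assumption for illustration: the set of known content paths, the user-agent tokens, the threshold, and the sample request targets. User-agent strings can be spoofed, so a label from a heuristic like this is a starting point for review rather than a verdict.

```python
from urllib.parse import urlsplit

# Assumption: paths the site actually serves. A real deployment might load
# these from a sitemap or route table.
KNOWN_CONTENT_PATHS = {"/", "/articles/intro.html", "/about.html"}

# Assumption: user-agent substrings for the search crawlers named in the text.
SEARCH_CRAWLER_TOKENS = ("googlebot", "bingbot")

# Assumption: arbitrary cut-off for "many parameter variations".
MAX_QUERY_VARIANTS = 5


def classify_client(user_agent: str, targets: list[str]) -> str:
    """Label one client's request targets as scan, crawl, or unclassified."""
    paths = [urlsplit(t).path for t in targets]
    unknown = [p for p in paths if p not in KNOWN_CONTENT_PATHS]
    query_variants = {urlsplit(t).query for t in targets if urlsplit(t).query}

    # Mostly non-content paths, or many distinct query strings: scan-like.
    if len(unknown) > len(paths) / 2 or len(query_variants) > MAX_QUERY_VARIANTS:
        return "probable_scan"
    # Real content paths plus a known crawler token: crawl-like.
    if any(tok in user_agent.lower() for tok in SEARCH_CRAWLER_TOKENS):
        return "probable_search_crawl"
    return "unclassified"


scan_label = classify_client(
    "ExampleScanner/1.0",
    ["/.env", "/wp-login.php", "/admin/config.php", "/"],
)
crawl_label = classify_client(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    ["/", "/articles/intro.html"],
)
print(scan_label, crawl_label)
```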

Diagnostic use case

Decide the right response to a given automated probe by classifying it as a security scan or a search crawl, since the controls and the meaning differ.

What WebmasterID can help detect

WebmasterID separates security-scan probes from search-crawl coverage server-side, so each is read correctly and neither inflates human analytics.

Common mistakes

Adding robots.txt rules to stop port scanning. Scanners do not consult robots.txt, so the rules have no effect on them.
Reading scanner probes as search indexing that affects rankings.
Counting scan traffic as human or as crawl coverage.
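To avoid the last mistake, reporting can keep one counter per traffic class instead of a single total. The sketch below assumes events have already been labelled; the labels, paths, and the summarize helper are hypothetical and do not describe how any particular product stores its data.

```python
from collections import Counter

# Hypothetical, already-classified events: (kind, path).
events = [
    ("human", "/"),
    ("search_crawl", "/"),
    ("security_scan", "/.env"),
    ("human", "/about.html"),
    ("security_scan", "/wp-login.php"),
]


def summarize(classified_events: list[tuple[str, str]]) -> dict[str, int]:
    """Report each traffic class separately so none inflates another."""
    counts = Counter(kind for kind, _path in classified_events)
    return {
        "human_pageviews": counts["human"],
        "crawl_fetches": counts["search_crawl"],
        "scan_probes": counts["security_scan"],
    }


summary = summarize(events)
print(summary)
```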

Privacy and accuracy notes

Both scanners and crawlers are identified by user-agent and behaviour only. No visitor identity is involved; WebmasterID records automated probes as bot events, never as human profiles.

Sources and verification notes

Google Search Central, robots.txt introduction: robots.txt governs content crawling by compliant crawlers; it is not an access-control or anti-scanning mechanism.

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.