robots.txt & crawl-control reference: directives, blocks & policy

A reference to robots.txt and crawler-control mechanisms. Each page gives safe, copy-pasteable examples, explains what robots.txt can and cannot enforce, and covers the related controls (meta robots, X-Robots-Tag, crawl-delay, sitemap directives, AI crawler policy, and llms.txt) without overstating their legal or enforcement weight.

125 robots topics documented · part of the Web Crawler & Traffic Intelligence Encyclopedia.

robots.txt basics: what it does and what it cannot do
robots.txt is a plain-text file at your site root that tells compliant crawlers which paths they may request. This page covers the directives, how user-agent groups are matched, and the limits that trip people up: robots.txt is advisory, it does not hide pages from search, and it is not a security boundary.
How to block GPTBot in robots.txt
If you do not want OpenAI's training crawler fetching your site, you can disallow GPTBot in robots.txt. This page gives the exact rule, clarifies that it does not affect ChatGPT-User or OAI-SearchBot, and is honest about the limits of robots-based blocking.
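As a rough illustration of the rule this entry describes, the sketch below checks a GPTBot group with Python's standard-library `urllib.robotparser`. The `example.com` URL is a placeholder, and the parser only approximates how a real crawler matches tokens, so treat the result as a sanity check rather than proof of crawler behaviour.

```python
from urllib.robotparser import RobotFileParser

# Group that asks OpenAI's training crawler to stay off the whole site.
GPTBOT_RULES = """\
User-agent: GPTBot
Disallow: /
"""

def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

URL = "https://example.com/article"  # placeholder URL
for agent in ("GPTBot", "ChatGPT-User", "OAI-SearchBot"):
    # Only the GPTBot token is named, so the other two are unaffected.
    print(agent, allowed(GPTBOT_RULES, agent, URL))
```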
How to allow all bots in robots.txt
If you want every compliant crawler to be free to fetch your whole site, robots.txt makes that the default. This page shows the explicit allow-all group, explains why an empty or absent file is also open, and clarifies that allowing crawling is not the same as forcing indexing.
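A minimal sketch of the two open configurations mentioned here, an explicit allow-all group and an empty file, checked with `urllib.robotparser`. `ExampleBot` and the URL are hypothetical placeholders.

```python
from urllib.robotparser import RobotFileParser

# An empty Disallow value means "nothing is disallowed".
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

EMPTY_FILE = ""  # no rules at all

def may_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

TARGET = "https://example.com/any/path"  # placeholder URL
print(may_fetch(ALLOW_ALL, "ExampleBot", TARGET))
print(may_fetch(EMPTY_FILE, "ExampleBot", TARGET))
```

Both checks report that crawling is permitted. Neither says anything about whether a page will be indexed.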
How to block all bots in robots.txt
A single robots.txt group can ask every compliant crawler to stay off your whole site. This page gives the exact rule and is blunt about the caveats: robots.txt is advisory rather than enforced, blocking search crawlers can remove you from results, and it is not a security boundary.
How to block ClaudeBot in robots.txt
If you do not want Anthropic's ClaudeBot fetching your site, you can disallow it in robots.txt. This page gives the exact rule, clarifies that it does not affect the real-time Claude-User fetch, and is honest that robots.txt is honoured by compliant crawlers rather than enforced.
How to opt out of Google AI with Google-Extended
Google-Extended is a robots.txt user-agent token Google provides so site owners can opt out of having their content used for certain Google AI products. Crucially, it is a standalone control: disallowing Google-Extended does not affect Googlebot crawling or your appearance in Google Search.
How to block CCBot (Common Crawl)
CCBot is the crawler operated by Common Crawl, a non-profit that publishes a large open web-crawl dataset reused by many downstream projects, including some AI training pipelines. This page gives the robots.txt rule to disallow CCBot and explains why blocking it affects that dataset specifically.
How to block AhrefsBot in robots.txt
AhrefsBot is the crawler Ahrefs uses to build its backlink and SEO index. This page gives the robots.txt rule to disallow it and notes that Ahrefs documents support for both robots.txt rules and the crawl-delay directive, so you can slow rather than fully block it.
How to block SemrushBot in robots.txt
SemrushBot is the crawler Semrush uses to build its SEO datasets. Semrush documents several specialised sub-bots under related tokens, so this page covers the base disallow rule and explains why you may need to target multiple tokens to cover the activity you care about.
The crawl-delay directive in robots.txt
Crawl-delay is a non-standard robots.txt directive that asks a crawler to wait between requests. Support is uneven: Google does not use it and points to Search Console instead, while Bing and Yandex have historically honoured it. This page explains the directive and the safer alternatives.
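To make the directive concrete, here is a small sketch that reads a Crawl-delay value with `urllib.robotparser`. The 10-second value and the `ExampleBot` name are arbitrary illustrations. Whether a given crawler honours the value is up to that crawler.

```python
from urllib.robotparser import RobotFileParser

CRAWL_DELAY_RULES = """\
User-agent: bingbot
Crawl-delay: 10

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(CRAWL_DELAY_RULES.splitlines())

# Delay, in seconds, requested of the named group.
bing_delay = parser.crawl_delay("bingbot")
# Any other agent falls back to the * group, which sets no delay here.
other_delay = parser.crawl_delay("ExampleBot")

print(bing_delay, other_delay)
```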
The Sitemap directive in robots.txt
The Sitemap directive points crawlers at your XML sitemap. It uses an absolute URL, can appear multiple times to list several sitemaps, and works independently of your allow/disallow rules — it is a discovery hint, not a crawl-permission rule.
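The sketch below shows one way to read Sitemap lines with `urllib.robotparser` (the `site_maps()` method needs Python 3.8 or later). The URLs are placeholders. It also illustrates that the sitemap list is collected independently of the allow/disallow rules.

```python
from urllib.robotparser import RobotFileParser

SITEMAP_RULES = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml
"""

parser = RobotFileParser()
parser.parse(SITEMAP_RULES.splitlines())

# Sitemap lines are absolute URLs and may repeat; they belong to no group.
sitemaps = parser.site_maps()
# The Disallow rule still applies regardless of the Sitemap lines.
admin_blocked = not parser.can_fetch("ExampleBot", "https://example.com/admin/")

print(sitemaps, admin_blocked)
```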
User-agent groups and matching in robots.txt
robots.txt rules are organised into user-agent groups. A crawler does not combine every group — it selects the single most specific group whose token matches its name, falling back to the * group only when no named group matches. Understanding this prevents rules that never apply.
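The group-selection behaviour described above can be sketched with `urllib.robotparser`. `ExampleBot` and `OtherBot` are hypothetical crawler names. The point is that a crawler with its own group does not inherit rules from the `*` group.

```python
from urllib.robotparser import RobotFileParser

GROUP_RULES = """\
User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(GROUP_RULES.splitlines())

def check(agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` on the placeholder host."""
    return parser.can_fetch(agent, "https://example.com" + path)

# ExampleBot follows only its own group, so /private/ is open to it.
print("ExampleBot /private/", check("ExampleBot", "/private/"))
print("ExampleBot /drafts/ ", check("ExampleBot", "/drafts/"))
# OtherBot has no named group and falls back to the * group.
print("OtherBot   /private/", check("OtherBot", "/private/"))
print("OtherBot   /drafts/ ", check("OtherBot", "/drafts/"))
```

If the intent was to keep ExampleBot out of `/private/` as well, that rule would have to be repeated inside the ExampleBot group.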
robots.txt vs the meta robots tag
robots.txt and the meta robots tag solve different problems. robots.txt asks crawlers not to fetch a path; the meta robots tag, embedded in a page's HTML, tells search engines whether to index it. The classic mistake is using Disallow to remove a page from search — which can backfire.
robots.txt vs the X-Robots-Tag header
X-Robots-Tag carries the same indexing directives as the meta robots tag, but in the HTTP response header instead of the HTML body. That makes it the way to apply noindex or nofollow to non-HTML resources like PDFs and images, where a meta tag has nowhere to live.
The noindex meta tag
The noindex value of the meta robots tag tells search engines to keep a page out of their index. The catch trips people up constantly: for noindex to work, the crawler must be able to fetch the page — so you must not block the same URL in robots.txt.
Writing an AI crawler policy for robots.txt
An AI crawler policy is a deliberate decision about which AI-related tokens you allow and which you disallow in robots.txt. This page offers a structured way to make and document those choices, while staying realistic: robots.txt is a request to compliant crawlers, not a legal or technical guarantee.
llms.txt basics
llms.txt is a community-proposed convention for a plain-text file that helps large language models find and read your most relevant content. It complements robots.txt rather than replacing it, and like robots.txt it is a hint that cooperating tools may follow — not an enforced control.
Wildcards and path matching in robots.txt
Although the original protocol used simple prefix matching, major crawlers support two wildcards in path rules: * matches any sequence of characters, and $ anchors the end of the URL. This page covers how they behave, useful patterns, and the mistakes that make a rule too broad.
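A hand-written sketch of the two wildcards follows, assuming the semantics stated above: `*` matches any run of characters and a trailing `$` anchors the end of the URL. The function name `rule_matches` is hypothetical, and real crawlers may differ in edge cases such as percent-encoding.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path.

    `*` matches any run of characters; a trailing `$` anchors the end.
    Without `$`, the pattern behaves as a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces, then join them with ".*" where each "*" was.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    # re.match anchors at the start of the path.
    return re.match(regex, path) is not None

print(rule_matches("/*.pdf$", "/files/report.pdf"))
print(rule_matches("/*.pdf$", "/files/report.pdf?download=1"))
print(rule_matches("/search", "/search/results"))
```

Dropping the `$` from `/*.pdf$` would also match the query-string variant, which is one common way a rule ends up broader than intended.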
How to block PerplexityBot in robots.txt
If you do not want Perplexity's indexing crawler fetching your site, you can disallow PerplexityBot in robots.txt. This page gives the exact rule, clarifies it does not cover Perplexity-User (the real-time, user-triggered fetch), and stays honest about the limits of robots-based blocking.
How to block Bytespider in robots.txt
Bytespider is a web crawler affiliated with ByteDance. This page gives the robots.txt rule to disallow its token and is honest that, because Bytespider's documentation and robots.txt compliance are less clearly published than for major crawlers, the rule should be treated as a request rather than a guarantee.
How to block Amazonbot in robots.txt
Amazonbot is Amazon's web crawler. This page gives the robots.txt rule to disallow it, notes that Amazon documents Amazonbot's robots.txt compliance and a way to verify its requests, and keeps the usual caveat that robots.txt is a request to compliant crawlers, not enforcement.
How to opt out of Apple AI with Applebot-Extended
Applebot-Extended is a robots.txt token Apple provides so site owners can opt out of having content used to train Apple's generative AI models. It is a standalone control: disallowing Applebot-Extended does not stop Applebot, which keeps crawling for Apple search features and Siri.
How to block Meta-ExternalAgent in robots.txt
Meta-ExternalAgent is the token Meta uses for its crawler supporting AI products. This page gives the robots.txt rule to disallow it and explains that the related Meta-ExternalFetcher token covers a different fetch behaviour and must be targeted separately.
How to control Bingbot in robots.txt
Bingbot is Microsoft's search crawler. You can target it in robots.txt with the bingbot token, but fully disallowing it typically removes your pages from Bing search over time. For load concerns, Bing offers crawl-control settings in Bing Webmaster Tools rather than relying on a blanket block.
How to control YandexBot in robots.txt
YandexBot is the crawler for Yandex, a major search engine in Russia and nearby markets. You can target it in robots.txt with the YandexBot token. Yandex documents its robots.txt handling, has historically honoured crawl-delay, and provides additional crawl controls in Yandex.Webmaster.
How to control Baiduspider in robots.txt
Baiduspider is the crawler for Baidu, the dominant search engine in China. You can target it with the Baiduspider token in robots.txt. Blocking it removes you from Baidu over time, which chiefly matters for sites serving Chinese-language or China-based audiences.
How to block AI2Bot in robots.txt
AI2Bot is a crawler associated with the Allen Institute for AI (AI2), which produces open AI research and datasets. This page gives the robots.txt rule to disallow its token and stays cautious where public documentation is limited, marking unverified specifics rather than guessing.
Allow only specific bots, block the rest
Sometimes you want only a few named crawlers to access your site and everyone else kept out. Because each crawler obeys only its single most specific matching group, you build this by giving the allowed crawlers their own permissive groups and putting a blanket Disallow in the * group — with important caveats.
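One possible shape for such a file is sketched below and checked with `urllib.robotparser`. The choice of Googlebot and bingbot as the allowed crawlers is only an example, and `SomeOtherBot` is a hypothetical name. As the entry notes, only compliant crawlers will act on it.

```python
from urllib.robotparser import RobotFileParser

ALLOWLIST_RULES = """\
User-agent: Googlebot
User-agent: bingbot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ALLOWLIST_RULES.splitlines())

PAGE = "https://example.com/page"  # placeholder URL

def permitted(agent: str) -> bool:
    """Return True if `agent` may fetch the placeholder page."""
    return parser.can_fetch(agent, PAGE)

# The two named crawlers share one permissive group; everyone else
# falls through to the blanket Disallow in the * group.
for agent in ("Googlebot", "bingbot", "SomeOtherBot"):
    print(agent, permitted(agent))
```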
robots.txt for staging sites
Teams often try to keep a staging or pre-production site private with a robots.txt Disallow. That is the wrong tool: robots.txt is public and advisory, and a blocked staging URL linked anywhere can still surface in search. The right answer is authentication, with noindex as a secondary signal.
robots.txt size limits and parsing
robots.txt files are not unlimited. Google documents a maximum parsed size of 500 KiB and ignores anything beyond it, which can silently drop rules at the bottom of a bloated file. This page covers the size limit and how parsing precedence — most specific rule wins — interacts with it.
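A small sketch for spotting rules that would fall past the 500 KiB limit mentioned above. The helper name `ignored_lines` is hypothetical. It reports every line that lies wholly or partly beyond the cut-off, on the assumption that a truncated line cannot be relied on.

```python
GOOGLE_LIMIT_BYTES = 500 * 1024  # 500 KiB, the limit Google documents

def ignored_lines(content: bytes, limit: int = GOOGLE_LIMIT_BYTES) -> list:
    """Return the lines that fall wholly or partly beyond `limit` bytes."""
    if len(content) <= limit:
        return []
    # Back up to the start of the line that straddles the cut-off.
    cut = content.rfind(b"\n", 0, limit) + 1
    return content[cut:].decode("utf-8", errors="replace").splitlines()

# A deliberately tiny limit keeps the demonstration readable.
sample = b"User-agent: *\nDisallow: /a\nDisallow: /b\n"
print(ignored_lines(sample, limit=20))
```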
robots.txt path matching and case sensitivity
robots.txt path rules are compared against the URL path, and that comparison is case-sensitive: /Page and /page are different. This page covers how Google matches paths, why case and encoding matter, and how trailing characters and wildcards change the rule that applies.
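To illustrate case sensitivity and trailing characters, the sketch below uses `urllib.robotparser`, whose plain prefix comparison is case-sensitive. The paths, host, and `ExampleBot` name are placeholders.

```python
from urllib.robotparser import RobotFileParser

CASE_RULES = """\
User-agent: *
Disallow: /Page
"""

parser = RobotFileParser()
parser.parse(CASE_RULES.splitlines())

def blocked(path: str) -> bool:
    """Return True if the placeholder agent may not fetch `path`."""
    return not parser.can_fetch("ExampleBot", "https://example.com" + path)

print(blocked("/Page"))         # same case as the rule
print(blocked("/page"))         # different case, so a different path
print(blocked("/Pages/about"))  # the rule is a prefix, so this matches too
```

Adding a trailing slash to the rule (`/Page/`) would limit it to that directory instead of every path that begins with `/Page`.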
Multiple user-agent groups and precedence
A robots.txt file usually has several user-agent groups. A crawler does not combine them: it selects the one most specific group whose token matches its name, per RFC 9309. This page explains how that precedence works, how multiple User-agent lines share one group, and the merging rules that surprise people.
How to test your robots.txt
A robots.txt rule is only useful if it does what you think. This page covers how to test it — checking the live file, using Google Search Console's robots.txt report and URL Inspection, and confirming in your own logs that the intended crawlers are or are not fetching the affected URLs.
X-Robots-Tag header examples
X-Robots-Tag carries indexing directives in the HTTP response header instead of the HTML body, which makes it the way to apply noindex or nofollow to PDFs, images, and other non-HTML files. This page gives concrete header examples and notes how server config applies them in bulk.
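As one illustration, a minimal WSGI application can attach the header to PDF responses. This is a sketch, not a recommended deployment: the `app` function and its `.pdf` suffix check are hypothetical, the response body is left empty, and in practice the header is more often set in web-server or CDN configuration.

```python
def app(environ, start_response):
    """Tiny WSGI app that marks PDF responses as noindex, nofollow."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "application/octet-stream")]
    if path.lower().endswith(".pdf"):
        # Same directive vocabulary as the meta robots tag,
        # carried in the HTTP response header instead.
        headers.append(("X-Robots-Tag", "noindex, nofollow"))
    start_response("200 OK", headers)
    return [b""]

def headers_for(path: str) -> dict:
    """Call the app directly and return the response headers it sets."""
    captured = {}
    def start_response(status, response_headers):
        captured.update(response_headers)
    app({"PATH_INFO": path}, start_response)
    return captured

print(headers_for("/docs/report.pdf"))
print(headers_for("/docs/index.html"))
```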
Canonical vs noindex: which to use
rel=canonical and noindex are often confused. Canonical tells search engines which of several similar URLs to treat as the primary, consolidating signals onto it. noindex removes a page from the index entirely. This page explains when each is right and why combining them on one URL sends conflicting signals.
Meta robots directives reference
The robots meta tag and X-Robots-Tag header share a vocabulary of indexing directives. This page is a reference for the common ones — noindex, nofollow, noarchive, nosnippet, and the max-snippet family — explaining what each does and how to combine them.
Using robots.txt to protect crawl budget
On large sites, crawlers spend a finite amount of effort — often called crawl budget — and can waste it on low-value or near-duplicate URLs. robots.txt can steer them away from those paths so they reach your important pages more often. This matters mostly for big sites; small sites rarely need it.
ai.txt, TDM reservation, and llms.txt
Beyond robots.txt, several conventions aim to express AI and machine-use preferences: ai.txt proposals, text-and-data-mining (TDM) reservation signals tied to EU copyright law, and llms.txt. Adoption and legal weight vary and are still settling, so this page describes the intent without overclaiming enforcement.
robots.txt common mistakes
Most robots.txt problems come from a handful of recurring mistakes. This page collects the big ones — blocking the CSS and JS crawlers need to render, trying to deindex with Disallow, advertising secret paths, and treating an advisory file as enforcement — with the correct approach for each.
robots.txt and sitemap coordination
robots.txt and your XML sitemap work together: the Sitemap directive advertises your sitemap to crawlers, and Search Console submission gives Google a direct feed. The key is consistency — do not list URLs in a sitemap that your robots.txt disallows, or you send crawlers contradictory instructions.
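The consistency check described above can be sketched as follows: parse the sitemap, then report every listed URL that robots.txt disallows. The function name `sitemap_conflicts`, the sample data, and the choice of Googlebot as the agent are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_conflicts(robots_txt: str, sitemap_xml: str, agent: str = "Googlebot") -> list:
    """Return sitemap URLs that `agent` is not allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]
    return [url for url in urls if not parser.can_fetch(agent, url)]

ROBOTS = """\
User-agent: *
Disallow: /cart/
"""

SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/cart/checkout</loc></url>
</urlset>
"""

print(sitemap_conflicts(ROBOTS, SITEMAP))
```

Any URL the function returns should either be removed from the sitemap or unblocked in robots.txt.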
How to block PetalBot in robots.txt
PetalBot is the crawler operated by Huawei to power Petal Search, the search service used on Huawei devices. This page gives the robots.txt rule to disallow the PetalBot token, explains how it identifies itself, and notes that Huawei documents PetalBot's robots.txt support on its crawler help pages.
How to block DataForSeoBot in robots.txt
DataForSeoBot is the crawler operated by DataForSEO to gather data for its SEO and SERP data APIs. This page gives the robots.txt rule to disallow the DataForSeoBot token and notes that DataForSEO publishes crawler documentation describing how it respects robots.txt.
How to block Diffbot in robots.txt
Diffbot operates a crawler that extracts structured data from web pages to build its Knowledge Graph and power data-extraction APIs. This page gives the robots.txt rule to disallow the Diffbot token and notes that Diffbot documents its crawler and robots.txt support.
How to block ImagesiftBot in robots.txt
ImagesiftBot is the crawler associated with ImageSift, a project that indexes images found on the public web. This page gives the robots.txt rule to disallow the ImagesiftBot token and notes that ImageSift documents the crawler and robots.txt support.
How to block BLEXBot in robots.txt
BLEXBot is the crawler operated by WebMeUp to build a backlink index used by SEO tools. This page gives the robots.txt rule to disallow the BLEXBot token and notes that WebMeUp documents the crawler and its robots.txt support.
How to block omgilibot in robots.txt
omgilibot is the crawler historically associated with Omgili and the Webz.io web-data project, which collects public web content for datasets resold to third parties, including AI uses. This page gives the robots.txt rule to disallow the omgilibot token and flags where documentation is limited.
How to block cohere-ai in robots.txt
cohere-ai is the robots.txt token associated with Cohere's web fetching for its AI products. This page gives the rule to disallow the cohere-ai token, explains where it fits in an AI-crawler policy, and stays cautious where Cohere's public documentation is limited.
How to block MJ12bot in robots.txt
MJ12bot is the distributed crawler operated by Majestic-12 to build the Majestic backlink index. This page gives the robots.txt rule to disallow the MJ12bot token and notes that Majestic documents both a disallow and a Crawl-delay option for slowing it down.
How to block DotBot in robots.txt
DotBot is the crawler operated by Moz to build the link index behind its SEO tools. This page gives the robots.txt rule to disallow the DotBot token and notes that Moz documents the crawler and a Crawl-delay option for throttling it.
How to block Applebot in robots.txt
Applebot is Apple's crawler, used to power features such as Siri and Spotlight Suggestions. This page gives the robots.txt rule to disallow the Applebot token and explains that it is separate from Applebot-Extended, the token that governs AI-training use.
How to block Yahoo Slurp in robots.txt
Slurp is Yahoo's historical web-crawler token. This page gives the robots.txt rule to disallow Slurp and explains that, because Yahoo Search has long been powered substantially by Bing, blocking Slurp may not change how Yahoo presents your site — verify what is actually crawling you first.
How to block DuckDuckBot in robots.txt
DuckDuckBot is DuckDuckGo's own crawler token, used for parts of its service such as Instant Answers. This page gives the robots.txt rule to disallow DuckDuckBot and explains that DuckDuckGo's main web results draw substantially on Bing, so blocking DuckDuckBot has limited effect on core results.
How to block coccocbot in robots.txt
coccocbot is the crawler operated by Cốc Cốc, a search engine and browser popular in Vietnam. This page gives the robots.txt rule to disallow the coccocbot token and explains that it matters mainly if your audience includes Vietnamese-market users.
How to block rogerbot in robots.txt
rogerbot is the crawler Moz uses for site-audit crawling in Moz Pro campaigns. This page gives the robots.txt rule to disallow the rogerbot token and explains that it is separate from DotBot, Moz's link-index crawler, so blocking one does not affect the other.
max-snippet and preview directives explained
max-snippet, max-image-preview, and max-video-preview are Google robots directives that cap how much of your content appears in result-page previews. This page explains the values each accepts, where to set them, and how they differ from blocking indexing.
The data-nosnippet attribute explained
data-nosnippet is a Google-supported HTML attribute that marks portions of a page so they are not used in search snippets. This page explains how to apply it, which elements support it, and how it differs from the page-level nosnippet directive.
The nosnippet robots directive explained
nosnippet is a Google robots directive that tells Google not to show any text snippet or video preview for a page in search results. This page explains where to set it, what it affects, and how it relates to the finer-grained max-snippet and data-nosnippet controls.
The noarchive robots directive explained
noarchive is a robots directive that asks search engines not to offer a cached copy of a page. This page explains where to set it, which engines historically honoured it, and why its practical relevance changed after Google retired its cache link.
The max-image-preview robots directive explained
max-image-preview is a Google robots directive that bounds how large an image preview may appear for your pages in search results. This page explains its three values, where to set it, and why it matters for visual content and Discover-style surfaces.
The unavailable_after robots directive explained
unavailable_after is a Google robots directive that tells Google to stop showing a page in search results after a given date and time. This page explains the date format, where to set it, and how it differs from noindex and from removing the page.
The Host directive in robots.txt explained
Host was a non-standard robots.txt directive, used mainly by Yandex, to indicate a site's preferred mirror or hostname. This page explains what it did, why it is not part of the robots.txt standard, and what to use instead for hostname canonicalisation today.
The Clean-param directive in robots.txt explained
Clean-param is a Yandex-specific robots.txt directive that lists URL query parameters Yandex should ignore when crawling, helping consolidate duplicate URLs. This page explains its syntax, what it does, and why Google relies on different mechanisms.
How robots.txt works across subdomains
robots.txt applies per host, so each subdomain needs its own file. This page explains how the robots.txt scope is defined by scheme, host, and port, why a root-domain file does not govern subdomains, and how to manage policy across many hostnames.
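Because scope is defined by scheme, host, and port, the robots.txt that governs a URL can be derived from the URL itself. The helper name `robots_url` is hypothetical; the URLs are placeholders.

```python
from urllib.parse import urlsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs `page_url`."""
    parts = urlsplit(page_url)
    # netloc carries the host and any explicit port.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"

for url in (
    "https://example.com/products/1",
    "https://blog.example.com/post/1?x=2",
    "http://example.com:8080/status",
):
    print(url, "->", robots_url(url))
```

Each of the three URLs maps to a different file, which is why a rule on the root domain does not carry over to `blog.example.com`.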
How to block Sogou Spider
Sogou Spider is the web crawler for Sogou, a Chinese search engine. This page shows how to disallow it in robots.txt using its documented user-agent tokens, explains what blocking does and does not affect, and how to confirm the rule is honoured.
How to block SeznamBot
SeznamBot is the crawler for Seznam.cz, a major Czech search engine. This page shows how to disallow SeznamBot in robots.txt, when blocking makes sense, and the visibility trade-off for sites that serve a Czech audience.
How to block 360Spider
360Spider is the crawler for so.com, the search engine operated by Qihoo 360 in China. This page shows how to disallow 360Spider in robots.txt, explains what a block affects, and shows how to check that the crawler honours it.
How to block Mail.RU_Bot
Mail.RU_Bot is the crawler for Mail.ru's search service. This page shows how to disallow it in robots.txt, when blocking is sensible, and the trade-off for sites that get traffic from the Russian Mail.ru search market.
How to block serpstatbot
serpstatbot is an SEO crawler that gathers backlink and ranking data for the Serpstat platform. This page shows how to disallow it in robots.txt, how to slow rather than fully block it, and how to confirm the rule works.
How to block the SISTRIX crawler
SISTRIX runs a crawler to gather SEO visibility and ranking data for its platform. This page shows how to disallow the SISTRIX crawler in robots.txt, how to throttle it instead of blocking, and how to confirm the directive is honoured.
How to block Barkrowler
Barkrowler is the crawler that feeds Babbar's link-graph and SEO datasets. This page shows how to disallow Barkrowler in robots.txt, how to slow it with crawl-delay, and how to confirm it has stopped.
How to handle Pingdom's bot in robots.txt
Pingdom is an uptime and performance monitoring service whose checks fetch your pages on a schedule. This page explains why robots.txt is not the right tool to stop monitoring requests, how Pingdom identifies itself, and how to exclude it cleanly from analytics.
How to handle UptimeRobot in robots.txt
UptimeRobot is an uptime monitoring service that pings configured URLs on an interval. This page explains why robots.txt is not the right way to stop its checks, how UptimeRobot identifies itself, and how to exclude it from analytics cleanly.
How to block GoogleOther
GoogleOther is a generic crawler Google uses for internal research and development fetches, separate from Googlebot. This page shows how to disallow GoogleOther in robots.txt while leaving Search indexing by Googlebot intact.
How to block Bedrockbot
Bedrockbot is the crawler associated with Amazon Bedrock for fetching web content. This page shows how to disallow Bedrockbot in robots.txt, how it differs from Amazonbot, and how to confirm the rule is honoured.
AdSense and Mediapartners-Google in robots.txt
Mediapartners-Google is the crawler Google AdSense uses to read your pages so it can serve relevant ads. This page explains how it interacts with robots.txt, why blocking it can hurt ad targeting, and the exact rule if you have a reason to disallow it.
How to block the Screaming Frog SEO Spider
Screaming Frog SEO Spider is a desktop crawling tool used for SEO audits. This page shows how its default user agent can be addressed in robots.txt, why a configurable tool may bypass it, and when blocking actually makes sense.
How to block the Seekport Crawler
The Seekport Crawler is the bot for Seekport, a search engine. This page shows how to disallow it in robots.txt, what blocking it costs in Seekport visibility, and how to confirm the rule is honoured.
robots.txt for images
robots.txt can control how image crawlers like Googlebot-Image fetch your images. This page explains how to allow or disallow image crawling, the trade-off with Google Images visibility, and why blocking images for search is different from blocking pages.
robots.txt for PDFs and non-HTML files
PDFs and other non-HTML files can rank in search. This page explains why X-Robots-Tag noindex (not robots.txt Disallow) is the right way to keep a PDF out of the index, and when blocking the file directory is appropriate.
robots.txt and URL query parameters
Query-string URLs (?sort=, ?utm_source=, ?sessionid=) can multiply crawlable URLs. This page explains how robots.txt wildcards match parameters, when blocking helps, and why canonical or noindex is often better than a Disallow for duplicates.
How Allow and Disallow precedence works
When robots.txt has both Allow and Disallow rules that match a URL, the outcome depends on rule precedence. This page explains Google's most-specific-match rule, how length decides conflicts, and the tie-break when rules are equally specific.
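A simplified sketch of the longest-match rule with the tie-break going to Allow. It handles plain prefixes only (no wildcards), and the function name `is_allowed` and the sample rules are hypothetical.

```python
def is_allowed(rules, path: str) -> bool:
    """Decide access by the longest matching rule; ties go to Allow.

    `rules` is a list of (kind, prefix) pairs, where kind is
    "allow" or "disallow". No matching rule means allowed.
    """
    best_length = -1
    allowed = True
    for kind, prefix in rules:
        # An empty prefix restricts nothing, so skip it.
        if not prefix or not path.startswith(prefix):
            continue
        length = len(prefix)
        longer = length > best_length
        tie_for_allow = length == best_length and kind == "allow"
        if longer or tie_for_allow:
            best_length = length
            allowed = kind == "allow"
    return allowed

SHOP_RULES = [("disallow", "/shop/"), ("allow", "/shop/sale/")]
print(is_allowed(SHOP_RULES, "/shop/sale/item"))  # longer Allow wins
print(is_allowed(SHOP_RULES, "/shop/cart"))       # only Disallow matches
print(is_allowed(SHOP_RULES, "/about"))           # nothing matches
```

Note that rule order in the file does not matter under this scheme; only the length of the matching path does.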
How crawlers cache robots.txt
Crawlers do not re-fetch robots.txt on every request — they cache it. This page explains Google's caching window, why your edits take time to take effect, and how caching interacts with HTTP cache headers and fetch failures.
How crawlers handle a redirected robots.txt
When /robots.txt returns a 3xx redirect, crawlers must decide whether to follow it. This page explains how Google follows robots.txt redirects, the hop limit, and why redirecting the file (especially cross-host) can lead to unexpected crawl behavior.
What crawlers do when robots.txt returns 404 or 5xx
The HTTP status of /robots.txt changes crawl behavior. This page explains why a 404 means crawl everything, why a persistent 5xx can pause crawling, and how Google's handling shifts when a server error lasts a long time.
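The status handling can be summarised as a small decision function. This sketch follows the broad categories in RFC 9309 and uses hypothetical names (`robots_policy` and its return labels). Individual crawlers differ in the details, such as how long a server error must persist before behaviour changes, which the sketch does not model.

```python
def robots_policy(status: int) -> str:
    """Map the HTTP status of /robots.txt to a coarse crawl policy."""
    if 200 <= status < 300:
        return "parse"            # use the rules in the file
    if 300 <= status < 400:
        return "follow-redirect"  # fetch the redirect target instead
    if 400 <= status < 500:
        return "allow-all"        # file unavailable: no restrictions
    if 500 <= status < 600:
        return "disallow-all"     # server error: assume everything is off limits
    raise ValueError(f"unexpected HTTP status: {status}")

for code in (200, 301, 404, 503):
    print(code, robots_policy(code))
```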
robots.txt for international and multilingual sites
International sites split content by country or language using ccTLDs, subdomains, or subfolders. This page explains how robots.txt scope applies to each model and why blocking localized URLs can break hreflang and regional indexing.
Serving robots.txt behind a CDN
A CDN sits between crawlers and your origin, so it shapes how robots.txt is delivered. This page explains edge caching of robots.txt, ensuring each hostname serves the right file, and avoiding stale rules from aggressive caching.
Monitoring robots.txt for changes and errors
robots.txt is a single file that can accidentally block an entire site. This page explains why monitoring it matters, which failure modes to watch (Disallow: /, 404, 5xx, unexpected diffs), and how crawl-behavior signals confirm a problem.
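A minimal monitoring sketch, assuming you keep the previously deployed copy of the file for comparison. It flags any change and separately flags a site-wide block. The function names, the finding labels, and `ExampleBot` are hypothetical.

```python
import hashlib
from urllib.robotparser import RobotFileParser

def fingerprint(robots_txt: str) -> str:
    """Return a stable hash used to detect unexpected diffs."""
    return hashlib.sha256(robots_txt.encode("utf-8")).hexdigest()

def blocks_whole_site(robots_txt: str, agent: str = "ExampleBot") -> bool:
    """Return True if `agent` may not fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "https://example.com/")

def audit(previous: str, current: str) -> list:
    """Compare two versions of robots.txt and list the findings."""
    findings = []
    if fingerprint(previous) != fingerprint(current):
        findings.append("changed")
    if blocks_whole_site(current):
        findings.append("site-wide disallow")
    return findings

OLD = "User-agent: *\nDisallow: /tmp/\n"
NEW = "User-agent: *\nDisallow: /\n"
print(audit(OLD, NEW))
```

Status-code failures (404, 5xx) need a separate check on the live fetch, since this sketch only inspects file contents.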
How to block ZoominfoBot
ZoominfoBot is the crawler associated with ZoomInfo, a business-data platform that compiles company and contact information from public web pages. This page shows how the crawler identifies itself, the robots.txt token to target, and why a Disallow is a request rather than enforcement against a non-compliant fetcher.
How to block the Internet Archive crawler
The Internet Archive operates crawlers (historically using the ia_archiver token, and more recently archive.org_bot) that capture public pages for the Wayback Machine. This page explains how the crawler identifies itself, the robots.txt rule to disallow it, and the important caveat that the Archive's robots.txt handling has changed over time.
robots.txt and JavaScript/CSS files
Google renders pages with a headless browser before indexing, so it must fetch the JavaScript and CSS your page depends on. Disallowing those resources in robots.txt can prevent proper rendering and harm how the page is understood. This page explains why render-critical resources should stay crawlable.
robots.txt for API endpoints
JSON APIs are sometimes added to robots.txt to keep crawlers out, but robots.txt only requests compliance from polite crawlers and does nothing to authenticate or hide an endpoint. This page covers when disallowing /api is reasonable, what it does not do, and why access control belongs at the application layer.
robots.txt vs a firewall/WAF
robots.txt and a firewall/WAF solve different problems: robots.txt politely asks compliant crawlers what to skip, while a firewall or WAF actually blocks requests at the network or edge layer. This page contrasts the two, explains when each is appropriate, and warns against using robots.txt for jobs only enforcement can do.
robots.txt generators and pitfalls
robots.txt generators turn a few form choices into a ready file, which is convenient but error-prone: they can emit an accidental Disallow: /, miswrite path patterns, or use directives a target crawler ignores. This page explains common generator pitfalls and the validation steps to run before publishing the output.
How to block BUbiNG
BUbiNG is an open-source, high-throughput web crawler developed for research and large-scale web data collection. Because anyone can run an instance, its behavior depends on the operator. This page shows the robots.txt token to target and why a Disallow only steers compliant deployments.
How to block magpie-crawler
magpie-crawler is a web crawler associated with Brandwatch's social and web monitoring platform, which gathers public content for brand and media analysis. This page shows the robots.txt token to target, what the crawler does, and why a Disallow steers only compliant fetchers.
How to block the OnPage.org / Ryte crawler
OnPage.org was an SEO site-audit platform that later became Ryte; its crawler fetches your pages to analyse technical SEO. This page shows the robots.txt token to target, notes the relationship to Ryte's crawler, and explains why a Disallow steers only compliant fetchers.
How to block cocolyzebot
cocolyzebot is the crawler operated by Cocolyze, an SEO analytics platform that fetches pages to analyse rankings, backlinks, and on-page factors. This page shows the robots.txt token to target and why a Disallow steers only compliant fetchers.
How to block SEOkicks
SEOkicks is a backlink-index service whose crawler fetches public pages to build a link database. This page shows the robots.txt token to target, explains what the crawler collects, and why a Disallow steers only compliant fetchers.
How to block the Ryte crawler (botLogen)
Ryte is a technical-SEO platform (the rebrand of OnPage.org) whose crawler fetches pages to evaluate site quality and crawlability. This page shows the robots.txt token to target, notes the historical OnPage relationship, and explains why a Disallow steers only compliant fetchers.
How to block Pandalytics
Pandalytics is a web crawler associated with marketing and SEO analytics tooling that fetches public pages to support its reports. This page shows the robots.txt token to target, what the crawler collects, and why a Disallow steers only compliant fetchers.
How to block the Censys scanner
Censys runs internet-wide scanning that catalogs hosts and services for security research. Because it operates at the host/port level rather than fetching pages as a polite web crawler, robots.txt is largely ineffective. This page explains what Censys does and why firewall-level controls, not robots.txt, are the right response.
How to block Exabot (Exalead)
Exabot is the web crawler historically associated with Exalead, a search engine. Its crawler fetches public pages to build a search index. This page shows the robots.txt token to target, notes the crawler's search-engine origin, and explains why a Disallow steers only compliant fetchers.
How to block the Oncrawl crawler
Oncrawl is a technical-SEO platform whose crawler fetches your pages to analyse crawlability, structure, and on-page factors. This page shows the robots.txt token to target and why a Disallow steers only compliant fetchers. Note that an authorised Oncrawl audit you commission usually needs the crawler allowed.
How to block the JetOctopus crawler
JetOctopus is a technical-SEO crawler and log analyser that fetches your pages for site audits. This page shows the robots.txt token to target and why a Disallow steers only compliant fetchers. As with other audit tools, an audit you commission usually needs the crawler allowed.
How to block the CognitiveSEO crawler
CognitiveSEO is an SEO and backlink-analysis platform whose crawler fetches public pages to support its reports. This page shows the robots.txt token to target, what the crawler collects, and why a Disallow steers only compliant fetchers.
robots.txt and AMP pages
AMP pages depend on the AMP runtime, cached resources, and a crawlable canonical relationship. Disallowing AMP paths or required resources in robots.txt can break validation, caching, or discovery. This page explains which AMP-related resources must stay crawlable and how robots.txt interacts with AMP.
robots.txt and page rendering
Google indexes the rendered version of a page, which its Web Rendering Service produces in a second pass after the initial fetch. robots.txt rules that block render-critical resources cause the renderer to skip them, producing an incomplete rendered DOM. This page explains the rendering pipeline and how robots.txt interacts with it.
robots.txt and evergreen Googlebot
Googlebot runs an "evergreen" rendering engine — a regularly updated Chromium — so it can execute modern JavaScript and CSS. This raises the stakes for robots.txt: an evergreen renderer that supports your framework still cannot use resources you disallow. This page explains the implications for robots.txt on JavaScript-heavy sites.
How to block the SimilarWeb crawler
SimilarWeb operates a crawler that gathers public web data for its market-intelligence and traffic-estimation products. It is a declared crawler with a documented robots.txt token, so operators who do not want their pages crawled for competitive-analytics datasets can disallow it. This page shows the token to target and the rule to use.
How to block Archive-It
Archive-It is the Internet Archive's subscription web-archiving service, used by libraries and institutions to capture and preserve websites. Its crawls are performed by the Internet Archive's crawler infrastructure, which uses a documented robots.txt token. This page explains how to ask Archive-It crawls to stay out and the caveats around archival capture.
How to block Brandwatch
Brandwatch operates crawlers that collect public web and social content for its consumer-intelligence and social-listening products. It is a declared crawler with a documented robots.txt token. Operators who do not want their pages collected into brand-monitoring datasets can disallow it; this page shows the token and rule.
robots.txt and infinite crawl spaces
An infinite crawl space is a part of a site that generates an unbounded number of low-value URLs — next-month calendar links, every combination of faceted filters, or session identifiers appended to paths. Crawlers can get stuck fetching them, wasting crawl budget. This page explains how to spot infinite spaces and fence them off with robots.txt.
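One rough way to spot a candidate infinite space is to count how many distinct query strings appear under each path in your crawl logs. The sketch below assumes you already have a list of requested URLs; the function name `suspect_paths` and the threshold of 100 are illustrative choices, not a standard.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def suspect_paths(urls, threshold=100):
    """Return paths requested with at least `threshold` distinct query strings."""
    variants = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        # Group by path and remember every distinct query string seen for it.
        variants[parts.path].add(parts.query)
    return sorted(path for path, queries in variants.items()
                  if len(queries) >= threshold)
```

A path flagged this way is only a candidate: review it before adding a Disallow, since a legitimate search or listing page can also have many query variants.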
Multiple Sitemap directives in robots.txt
robots.txt can carry more than one Sitemap directive, and each is a full absolute URL pointing at a sitemap or sitemap index file. This is the standard way to advertise multiple sitemaps — by section, by media type, or by language — to every crawler at once. This page covers the syntax, ordering, and the sitemap-index alternative.
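As a quick check that several Sitemap lines are read the way you expect, Python's standard-library parser can list them. This is a minimal sketch assuming Python 3.8 or later (where `site_maps()` is available); the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Placeholder file: two Sitemap lines alongside an allow-all group.
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-images.xml
"""


def list_sitemaps(robots_txt: str):
    """Return the Sitemap URLs found in the file, or None if there are none."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps()
```

Sitemap lines are independent of user-agent groups, so they are picked up wherever they appear in the file.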
robots.txt comments and encoding
robots.txt supports comments with the hash character and is parsed as a UTF-8 plain-text file. Getting the encoding wrong — a stray byte order mark, a non-UTF-8 charset, or comments placed where a directive is expected — can cause crawlers to misread or ignore rules. This page covers comment syntax and the encoding requirements that keep a file valid.
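To illustrate the comment and encoding points, here is a small sketch that normalises raw robots.txt bytes before inspection. It assumes the file is meant to be UTF-8; `clean_lines` is a hypothetical helper, not part of any library.

```python
def clean_lines(raw: bytes) -> list[str]:
    """Decode robots.txt bytes and return directive lines without comments."""
    # "utf-8-sig" decodes UTF-8 and drops a leading byte order mark if present.
    text = raw.decode("utf-8-sig")
    lines = []
    for line in text.splitlines():
        # Everything from the first '#' to the end of the line is a comment.
        line = line.split("#", 1)[0].strip()
        if line:
            lines.append(line)
    return lines
```

A file in another charset raises `UnicodeDecodeError` here, which is a useful signal in itself that the encoding needs fixing.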
How to block the Netcraft crawler
Netcraft runs crawlers for its internet-survey, technology-detection and anti-phishing services. They are declared crawlers with a documented robots.txt token. Operators who do not want their site sampled into Netcraft's surveys can disallow the crawler, with the caveat that security-related scanning may not be governed by robots.txt at all.
How to block Majestic's MJ12bot
MJ12bot is the distributed crawler that feeds Majestic's backlink index, one of the large independent link-graph datasets used in SEO. It is a declared crawler with a documented robots.txt token and supports Crawl-delay. This page shows how to disallow it or slow it down, and links to the crawler reference.
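A Crawl-delay group for MJ12bot can be sanity-checked with Python's standard-library parser. This is a sketch only: the delay of 5 seconds is an arbitrary example value, and how the crawler interprets it is up to the crawler.

```python
from urllib.robotparser import RobotFileParser

# Example group: the value 5 is illustrative, not a recommendation.
ROBOTS_TXT = """\
User-agent: MJ12bot
Crawl-delay: 5
"""


def delay_for(robots_txt: str, agent: str):
    """Return the Crawl-delay that applies to `agent`, or None if unset."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(agent)
```

To block the crawler outright instead of slowing it, the group would carry `Disallow: /` in place of the Crawl-delay line.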
How to block the BinaryEdge scanner
BinaryEdge runs internet-wide scans that catalogue exposed services and web properties for its attack-surface and threat-intelligence datasets. Where it crawls web content with a declared token, robots.txt can ask it to stop; but much internet-wide scanning operates below the HTTP-courtesy layer, so a firewall rule is usually the real control. This page covers both.
How to block the Qualys web scanner
Qualys runs web-application and vulnerability scanners used by security teams to assess sites. When a Qualys crawler fetches content with a declared token, robots.txt can ask it to stop — but a scan you own is configured inside Qualys, so the right control depends on whether the scan is yours or a third party's. This page covers both cases.
How to block the Gigablast crawler
Gigablast was an independent search engine whose crawler, GigaBot, fetched public pages to build its index. The service is no longer operating as it once did, but the token can still appear in logs from residual or impersonating clients. This page shows the robots.txt rule and how to interpret leftover GigaBot activity.
How to block the SafeDNS crawler
SafeDNS operates a crawler that fetches public pages to categorise sites for its DNS-based content-filtering service. It is a declared crawler with a documented robots.txt token. This page shows how to disallow it, with the caveat that site categorisation can also draw on sources other than a live crawl.
NOODP and NOYDIR — legacy robots meta values
NOODP and NOYDIR were robots meta values that told search engines not to use the Open Directory Project (DMOZ) or the Yahoo Directory description and title for a page in results. Both directories are long gone and the directives are obsolete. This page explains what they did and why you can safely remove them from legacy templates.
robots.txt empty Disallow means allow
In robots.txt an empty Disallow value — Disallow: with nothing after it — means there is nothing to disallow, so the crawler may fetch everything. It is the opposite of Disallow: / which blocks the whole site. Confusing the two is a classic, high-impact mistake. This page explains the rule and the safest way to express allow-all.
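The difference between the two forms can be checked with Python's standard-library parser. In this minimal sketch, `ExampleBot` and the test path are placeholders, and `may_fetch` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

ALLOW_ALL = "User-agent: *\nDisallow:"    # empty value: nothing is disallowed
BLOCK_ALL = "User-agent: *\nDisallow: /"  # single slash: whole site disallowed


def may_fetch(robots_txt: str, path: str, agent: str = "ExampleBot") -> bool:
    """Report whether `agent` may request `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)
```

Calling `may_fetch(ALLOW_ALL, "/any/page.html")` and `may_fetch(BLOCK_ALL, "/any/page.html")` shows the two files giving opposite answers for the same path.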
Why uptime monitors should fetch robots.txt
A broken or accidentally restrictive robots.txt can quietly stop search engines from crawling your whole site. Treating the file as a monitored asset — checking that it returns 200, is reachable, and has not flipped to a site-wide Disallow — turns a silent catastrophe into an alert. This page covers what to monitor and the signals that matter.
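The checks described above could be sketched as a pure function that a monitor calls after fetching the file. This is a simplified illustration: `robots_alerts` is a hypothetical name, the alert strings are made up, and the parsing ignores Allow overrides and wildcard patterns.

```python
def robots_alerts(status: int, body: str) -> list[str]:
    """Return alert messages for a fetched robots.txt response."""
    if status != 200:
        return [f"unexpected status {status}"]
    alerts = []
    agents, in_rules = [], False
    for raw in body.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and "*" in agents:
                alerts.append("site-wide Disallow for all crawlers")
    return alerts
```

Keeping the fetch separate from the evaluation makes the rule easy to test against saved copies of the file.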
Trailing slashes in robots.txt paths
In robots.txt, whether a path ends in a slash changes what it matches. Disallow: /dir/ blocks the directory and everything under it; Disallow: /dir (no slash) is a prefix that also matches /directory and /dir.html. Misreading the trailing slash is a frequent cause of rules that block too much or too little. This page makes the distinction concrete.
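The distinction comes down to plain prefix matching, which a few lines of Python can make concrete. This sketch deliberately ignores wildcards, Allow rules, and percent-encoding; `is_blocked` is a hypothetical helper.

```python
def is_blocked(rule_path: str, url_path: str) -> bool:
    """Return True if a Disallow rule with `rule_path` matches `url_path`."""
    # A basic robots.txt rule matches any URL path that starts with it.
    return url_path.startswith(rule_path)


# The paths named above, checked against both forms of the rule.
for candidate in ("/dir/page.html", "/directory", "/dir.html"):
    print(candidate, is_blocked("/dir", candidate), is_blocked("/dir/", candidate))
```

Only the form without the trailing slash matches `/directory` and `/dir.html`, which is why it tends to block more than intended.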
noindex in robots.txt is unsupported
Some operators once added a noindex line to robots.txt to keep pages out of search. It was never part of the standard, and Google announced that from September 2019 it would stop honouring the unsupported noindex rule in robots.txt. This page explains why the directive does nothing in robots.txt and which mechanisms actually remove a page from the index.
