Scrapy crawler user agent

Scrapy is a popular Python web-scraping and crawling framework. Out of the box its requests carry a default user agent containing the Scrapy token and version, but operators frequently override it, so a Scrapy token is a strong signal while its absence does not rule scraping out. It is automation, not a human visit.

Verified against primary sources

What this means

Scrapy is an open-source framework for building spiders that crawl and extract data from websites. By default, each request sends a user agent that contains the Scrapy token together with a version and a self-identifying project URL.

That default identifies the framework clearly. However, Scrapy exposes a USER_AGENT setting, and scraping projects routinely change it — often to a browser-like string — so many Scrapy crawls do not carry the token at all.
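As a minimal sketch of the override described above, a project's settings.py might contain a line like the following. The bot name and URL are hypothetical placeholders; a browser-like string would go in the same setting.

```python
# settings.py of a hypothetical Scrapy project (values are illustrative).

# Replaces the default user agent, so requests no longer carry the
# Scrapy token. "examplebot" and its URL are made-up placeholders.
USER_AGENT = "examplebot/1.0 (+https://example.com/bot)"
```

Once a line like this is in place, matching on the Scrapy token alone will not catch the crawl.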

How Scrapy identifies itself

Out of the box, Scrapy's user agent has the form Scrapy/VERSION (+https://scrapy.org): the Scrapy token, a version number, and a link to the project site. Match on the Scrapy token substring rather than a fixed version, because the version changes with every release. Scrapy's settings documentation describes both the default and the USER_AGENT override.

Because the user agent is trivially changed, treat the Scrapy token as a strong but optional signal. Absence of the token does not mean a request is not Scrapy.

Default user agent contains the Scrapy token plus version and project URL
USER_AGENT setting is commonly overridden
Token present = strong signal; token absent ≠ not Scrapy
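The matching rule above can be sketched in a few lines of Python. The regular expression and the function name are assumptions made for illustration, not an official grammar for the user agent.

```python
import re

# Case-insensitive match on the "Scrapy" token with an optional version.
# The pattern is an illustrative assumption, not an official grammar.
SCRAPY_TOKEN = re.compile(r"\bScrapy(?:/(?P<version>[\w.]+))?", re.IGNORECASE)


def scrapy_token(user_agent):
    """Return (token_present, version_or_None) for a user-agent string."""
    match = SCRAPY_TOKEN.search(user_agent or "")
    if match is None:
        # No token: the request may still come from Scrapy with an
        # overridden user agent, so this is "unknown", not "not Scrapy".
        return False, None
    return True, match.group("version")
```

A result of (False, None) only says the default token is missing; it says nothing about whether the request is automated.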

Catching overridden Scrapy crawls

When the user agent is changed, identify Scrapy-style crawling behaviourally: a consistent request cadence, broad URL enumeration, a robots.txt that is never requested or is requested only once at the start of the crawl, missing browser headers, and no requests for page assets such as CSS, images, or JavaScript. These patterns matter more than the user-agent string once the default is replaced.
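A rough sketch of how some of these behavioural signals could be computed from request logs follows. The input shape, the asset suffix list, and the thresholds are all assumptions chosen for illustration; header checks are left out because they need more than a (client, time, path) record.

```python
from collections import defaultdict
from statistics import pstdev

# Suffixes treated as page assets (an illustrative, incomplete list).
ASSET_SUFFIXES = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico")


def crawl_features(requests):
    """Summarise requests given as (client, unix_timestamp, path) tuples."""
    by_client = defaultdict(list)
    for client, ts, path in requests:
        by_client[client].append((ts, path))

    features = {}
    for client, rows in by_client.items():
        rows.sort()
        times = [ts for ts, _ in rows]
        # Drop query strings before comparing paths and suffixes.
        paths = [p.split("?", 1)[0] for _, p in rows]
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        features[client] = {
            "requests": len(rows),
            "distinct_paths": len(set(paths)),
            "robots_fetches": paths.count("/robots.txt"),
            "asset_requests": sum(p.lower().endswith(ASSET_SUFFIXES) for p in paths),
            # A small spread between gaps suggests a fixed request cadence.
            "gap_stdev": pstdev(gaps) if len(gaps) > 1 else None,
        }
    return features


def looks_automated(f, min_requests=20, max_gap_stdev=0.5):
    """Hypothetical rule: many requests, no assets, near-constant cadence."""
    return (
        f["requests"] >= min_requests
        and f["asset_requests"] == 0
        and f["gap_stdev"] is not None
        and f["gap_stdev"] <= max_gap_stdev
    )
```

The thresholds here are placeholders; real traffic would need them tuned against known human and bot sessions.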

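For framework traffic that respects robots.txt, Disallow rules keep spiders out of specific paths. Crawl-delay can be added as well, but it is a non-standard directive and not every crawler honours it. For traffic that ignores robots.txt, behavioural detection and rate limiting are the practical controls.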
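To check what a compliant crawler would read from such rules, Python's standard urllib.robotparser can evaluate a robots.txt body. The rules and URLs below are invented for illustration, and whether a given crawler applies them is up to its own configuration.

```python
from urllib.robotparser import RobotFileParser

# Example rules (illustrative): keep the Scrapy token out of /private/
# and ask for a five-second delay; everyone else is unrestricted.
ROBOTS_TXT = """\
User-agent: Scrapy
Disallow: /private/
Crawl-delay: 5

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

blocked = parser.can_fetch("Scrapy", "https://example.com/private/report")
allowed = parser.can_fetch("Scrapy", "https://example.com/public/page")
delay = parser.crawl_delay("Scrapy")

print(blocked, allowed, delay)
```

This only shows how the rules are interpreted; it does not enforce anything against a crawler that never reads the file.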

How it appears in analytics and logs

A request whose user agent contains the Scrapy token is a Scrapy spider with its default UA. Because the setting is commonly overridden, Scrapy traffic often arrives under a custom or browser-like user agent, so the token is sufficient but not necessary evidence.
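As a sketch of how this looks in practice, the snippet below tallies access-log lines by whether the user agent carries the token. It assumes the common Combined Log Format; the regular expression and the category names are illustrative choices, not part of any tool mentioned here.

```python
import re
from collections import Counter

# Combined Log Format (assumed); the last quoted field is the user agent.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"$'
)


def tally_scrapy(lines):
    """Count log lines by whether the user agent contains the Scrapy token."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            counts["unparsed"] += 1
        elif "scrapy" in match.group("ua").lower():
            counts["scrapy_token"] += 1
        else:
            # May still be Scrapy with an overridden user agent.
            counts["no_token"] += 1
    return counts
```

The "no_token" bucket is where overridden Scrapy crawls end up, so it should be treated as unclassified rather than human.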

Diagnostic use case

Spot Scrapy-based scraping in logs, understand why the default token is often missing, and decide how to treat framework-driven crawling.

What WebmasterID can help detect

WebmasterID classifies the default Scrapy token server-side as automation and surfaces it on the bot-intelligence view, while noting that overridden user agents need behavioural signals to catch.

Common mistakes

Assuming all Scrapy traffic carries the Scrapy token — the UA is often overridden.
Counting Scrapy spider requests as human visits.
Relying on robots.txt alone against scrapers configured to ignore it.

Privacy and accuracy notes

Scrapy detection uses only the user agent and request shape. No human identity is involved — it is a script. WebmasterID records it as a bot event, separate from human analytics.

Frequently asked questions

Does Scrapy obey robots.txt?: Scrapy has a ROBOTSTXT_OBEY setting that, when enabled, makes spiders respect robots.txt. It can be disabled by the operator, so compliance depends on configuration, not the framework alone.
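For reference, the setting named in this answer is a single boolean in a project's settings.py. The fragment below is a minimal sketch of an operator opting in.

```python
# settings.py of a hypothetical Scrapy project.

# When True, the spider fetches robots.txt and skips disallowed URLs.
# An operator can set this to False, which is why compliance cannot be assumed.
ROBOTSTXT_OBEY = True
```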

Sources and verification notes

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.