Scrapy crawler user agent
Scrapy is a popular Python web-scraping and crawling framework. Out of the box its requests carry a default user agent containing the Scrapy token and version, but operators frequently override it, so a Scrapy token is a strong signal while its absence does not rule scraping out. It is automation, not a human visit.
What this means
Scrapy is an open-source framework for building spiders that crawl and extract data from websites. By default, each request sends a user agent that contains the Scrapy token together with a version and a self-identifying project URL.
That default identifies the framework clearly. However, Scrapy exposes a USER_AGENT setting, and scraping projects routinely change it, often to a browser-like string, so many Scrapy crawls do not carry the token at all.
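As a rough illustration of how small that change is, the override is a single assignment in a project's settings module. USER_AGENT is the setting named above; the browser-like value below is invented for this example and is not taken from any real project.

```python
# settings.py of a hypothetical Scrapy project.
# USER_AGENT is the Scrapy setting discussed above; the value is an
# invented browser-like string used only for illustration.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36"
)
```

With a value like this in place, nothing in the header mentions Scrapy any more.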
How Scrapy identifies itself
The default user agent has the form Scrapy/VERSION (+https://scrapy.org): the Scrapy token, a version number, and a link to the project site. Match on the Scrapy token substring rather than a fixed version, because the version changes with each release. The Scrapy documentation describes both the default and the USER_AGENT override.
Because the user agent is trivially changed, treat the Scrapy token as a strong but optional signal. Absence of the token does not mean a request is not Scrapy.
- Default user agent contains the Scrapy token plus version and project URL
- USER_AGENT setting is commonly overridden
- Token present = strong signal; token absent ≠ not Scrapy
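The matching rule above can be sketched as a small helper. The function name has_scrapy_token and the sample strings are assumptions made for this example; the check is a plain case-insensitive substring test, so it ignores the version number.

```python
def has_scrapy_token(user_agent):
    """Return True when the User-Agent header contains the Scrapy token.

    A False result only means the token is absent; it does not prove the
    request came from something other than Scrapy.
    """
    if not user_agent:
        return False
    # Substring match, so any version number after the token is ignored.
    return "scrapy" in user_agent.lower()


# Illustrative inputs; the version number is an arbitrary example.
default_style = "Scrapy/2.11.0 (+https://scrapy.org)"
browser_style = "Mozilla/5.0 (X11; Linux x86_64) Firefox/121.0"

token_found = has_scrapy_token(default_style)
token_missing = has_scrapy_token(browser_style)
```

A hit can be classified as automation straight away, while a miss should fall through to other checks rather than being treated as human.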
Catching overridden Scrapy crawls
When the user agent is changed, identify Scrapy-style crawling behaviourally: a consistent request cadence, broad URL enumeration, a robots.txt that is never requested or is fetched only once, missing browser headers, and no asset or JavaScript loading. Once the default string is replaced, these patterns matter more than the user agent itself.
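A minimal sketch of how those behavioural signals might be computed for one client's requests follows. Everything in it is an assumption for illustration: the LoggedRequest record, the thresholds, the list of asset suffixes, and the two headers treated as typical of browsers. It is not how any particular product scores traffic, and the signals are only suggestive when several fire together.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev

# Assumed values for this sketch; tune them against real traffic.
ASSET_SUFFIXES = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico", ".woff2")
BROWSER_HEADERS = ("accept-language", "sec-fetch-mode")


@dataclass
class LoggedRequest:
    """Hypothetical log record for a single request from one client."""
    timestamp: float  # seconds
    path: str
    headers: dict = field(default_factory=dict)


def crawl_signals(requests):
    """Return one boolean per behavioural signal for a single client."""
    reqs = sorted(requests, key=lambda r: r.timestamp)
    gaps = [b.timestamp - a.timestamp for a, b in zip(reqs, reqs[1:])]
    paths = [r.path.split("?", 1)[0].lower() for r in reqs]
    unique_paths = set(paths)

    def has_browser_header(req):
        names = {name.lower() for name in req.headers}
        return any(h in names for h in BROWSER_HEADERS)

    return {
        # Gaps between requests vary little relative to their mean.
        "steady_cadence": len(gaps) >= 5 and mean(gaps) > 0
        and pstdev(gaps) / mean(gaps) < 0.2,
        # Many distinct URLs with almost no repeats.
        "broad_enumeration": len(unique_paths) >= 20
        and len(unique_paths) / len(paths) > 0.9,
        # robots.txt never requested, or requested a single time.
        # Weak on its own: ordinary browsers do not fetch it either.
        "robots_at_most_once": paths.count("/robots.txt") <= 1,
        # No request carries any of the assumed browser headers.
        "missing_browser_headers": bool(reqs)
        and not any(has_browser_header(r) for r in reqs),
        # No stylesheet, script, image, or font requests.
        "no_asset_loading": bool(paths)
        and not any(p.endswith(ASSET_SUFFIXES) for p in paths),
    }


# Invented sample data: a spider-like client and a browser-like client.
spider_like = [
    LoggedRequest(1000.0 + 2.0 * i, f"/item/{i}", {"User-Agent": "Mozilla/5.0"})
    for i in range(30)
]
browser_like = [
    LoggedRequest(t, p, {"User-Agent": "Mozilla/5.0", "Accept-Language": "en-GB"})
    for t, p in [
        (0.0, "/"), (0.1, "/static/app.css"), (0.2, "/static/app.js"),
        (5.0, "/pricing"), (30.0, "/docs"), (30.1, "/static/docs.js"),
        (95.0, "/contact"),
    ]
]

spider_score = sum(crawl_signals(spider_like).values())
browser_score = sum(crawl_signals(browser_like).values())
```

Summing the booleans gives a crude score. In this invented data the spider-like client trips every signal, while the browser-like client trips only the weak robots.txt one.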
For framework traffic that respects robots.txt, you can set crawl-delay or disallow paths; for traffic that ignores it, behavioural detection and rate limiting are the practical controls.
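To see how a compliant client would read rules aimed at the Scrapy token, the standard-library urllib.robotparser can evaluate a sample file. The robots.txt content, the /private/ path, the example.com URLs, and the version number are assumptions for this sketch. Whether a given crawler actually honours Crawl-delay depends on that crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a group aimed at the Scrapy token.
ROBOTS_TXT = """\
User-agent: Scrapy
Crawl-delay: 10
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Default-style user agent; the version number is an arbitrary example.
default_ua = "Scrapy/2.11.0 (+https://scrapy.org)"

private_allowed = parser.can_fetch(default_ua, "https://example.com/private/report")
public_allowed = parser.can_fetch(default_ua, "https://example.com/products")
requested_delay = parser.crawl_delay(default_ua)
```

A group keyed on the Scrapy token only applies while the crawler still sends that token. An overridden user agent falls outside it, which is why rate limiting and behavioural detection remain necessary.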
How it appears in analytics and logs
A request whose user agent contains the Scrapy token is a Scrapy spider with its default UA. Because the setting is commonly overridden, Scrapy traffic often arrives under a custom or browser-like user agent, so the token is sufficient but not necessary evidence.
Diagnostic use case
Spot Scrapy-based scraping in logs, understand why the default token is often missing, and decide how to treat framework-driven crawling.
What WebmasterID can help detect
WebmasterID classifies the default Scrapy token server-side as automation and surfaces it on the bot-intelligence view, while noting that overridden user agents need behavioural signals to catch.
Common mistakes
- Assuming all Scrapy traffic carries the Scrapy token; the UA is often overridden.
- Counting Scrapy spider requests as human visits.
- Relying on robots.txt alone against scrapers configured to ignore it.
Privacy and accuracy notes
Scrapy detection uses only the user agent and request shape. No human identity is involved, because the client is a script. WebmasterID records it as a bot event, separate from human analytics.
Frequently asked questions
- Does Scrapy obey robots.txt?
- Scrapy has a ROBOTSTXT_OBEY setting that, when enabled, makes spiders respect robots.txt. It can be disabled by the operator, so compliance depends on configuration, not the framework alone.
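As a small illustration, the operator's choice comes down to one line in the project's settings module. ROBOTSTXT_OBEY is the setting named in the answer above; the surrounding file is a hypothetical project.

```python
# settings.py of a hypothetical Scrapy project.
# True asks spiders to respect robots.txt; an operator can set it to False.
ROBOTSTXT_OBEY = True
```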
Related pages
- python-requests user agent
The popular Python requests library sends a default user agent in the form python-requests/x.y. Seeing it means a Python script made the request — for an integration, a scraper, a webhook, or your own code. It is honest automation, not a browser, though the default can be overridden. This page covers the pattern.
- Playwright and Puppeteer user agents
Playwright and Puppeteer are browser-automation libraries that drive real Chromium, Firefox, or WebKit instances. Because they use the actual browser engine, their default user agent matches a normal browser — sometimes with a HeadlessChrome marker — so the user agent alone rarely reveals them. Detection relies on automation signals, not the string.
- SEO crawler user agents
SEO platforms run their own crawlers to build backlink indexes and audit data. Bots such as AhrefsBot, SemrushBot, and DotBot identify themselves with a documented token and honour robots.txt. They are not search-engine indexers. This page explains the family and how to recognise and control it.
- Bot vs human
Separate framework-driven crawling from real human visits.
Sources and verification notes
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.