Search engine bot reference: crawlers behind search & SEO tools
A reference to the traditional search-engine crawlers and third-party SEO crawlers that index the web. Each page covers the crawler's category and purpose, how to verify it (since user agents can be spoofed), documented robots.txt behavior, and the diagnostic angle for operators.
134 search bots documented · part of the Web Crawler & Traffic Intelligence Encyclopedia.
- Googlebot Smartphone — Google's mobile-first crawler
Googlebot Smartphone is the mobile user-agent variant of Googlebot and, under mobile-first indexing, Google's primary crawler for most sites. It uses the Googlebot robots.txt token and can be verified through reverse DNS and Google's published crawler IP ranges.
- Bingbot — Microsoft Bing's web crawler
Bingbot is the crawler Microsoft Bing uses to discover and index web pages. It uses the bingbot robots.txt token and can be verified through Bing's reverse-DNS method and published IP ranges. Bing also powers results for other surfaces, so Bingbot coverage has reach beyond Bing.com.
- Googlebot Desktop — Google's secondary crawler
Googlebot Desktop is the desktop user-agent variant of Googlebot. Under mobile-first indexing it is secondary to Googlebot Smartphone for most sites. It shares the Googlebot robots.txt token and is verified the same way: reverse DNS into googlebot.com or google.com, or matching Google's published crawler IP ranges.
- Googlebot-Image — Google's image crawler
Googlebot-Image is the Google crawler that fetches image files for Google Images. It uses the Googlebot-Image robots.txt user-agent token and is documented among Google's common crawlers, so operators can target image crawling separately from page crawling.
- Googlebot-News — Google News crawling
Googlebot-News is the user agent that governs crawling for Google News. Google documents it as relying on Googlebot for actual crawling, with the Googlebot-News token letting publishers control Google News inclusion separately from general Search.
- Googlebot-Video — Google's video crawler
Googlebot-Video is the Google crawler that fetches video content for Google's video features. It uses the Googlebot-Video robots.txt user-agent token and is documented among Google's common crawlers, letting operators control video crawling separately from page crawling.
- AdsBot-Google — Google Ads landing-page checker
AdsBot-Google is the Google crawler that checks the quality of web pages used as Google Ads landing pages. Google documents that AdsBot crawlers may ignore the global robots.txt wildcard group, so to control them you target the AdsBot-Google token explicitly.
- BingPreview — Bing page snapshot fetcher
BingPreview is a user agent associated with Bing's fetching of page snapshots and previews. It is part of Bing's crawling activity alongside Bingbot. Specific current behaviour is best confirmed in Bing Webmaster documentation, so this entry is marked partially verified.
- DuckDuckBot — DuckDuckGo's crawler
DuckDuckBot is the crawler operated by DuckDuckGo. DuckDuckGo also draws on third-party indexes for some results, so DuckDuckBot is one part of how its results are built. DuckDuckGo documents the crawler and publishes IP addresses operators can use to verify it.
- YandexBot — Yandex's web crawler
YandexBot is the main crawler for Yandex, a search engine with a strong presence in Russian-language search. It uses the YandexBot robots.txt token and can be verified through reverse DNS, where the IP should resolve into a Yandex domain, confirmed by a matching forward lookup.
- Baiduspider — Baidu's web crawler
Baiduspider is the main crawler for Baidu, a leading search engine for Chinese-language search. It uses the Baiduspider robots.txt token. Baidu's documentation is primarily in Chinese and verification options are more limited than Google's, so treat verification with care.
- Applebot — Apple's web crawler
Applebot is the web crawler operated by Apple, used to power features such as Siri and Spotlight Suggestions. It uses the Applebot robots.txt token and supports reverse-DNS verification. It is distinct from Applebot-Extended, a separate token governing use of crawled content for Apple's generative models.
- SeznamBot — Seznam's web crawler
SeznamBot is the crawler for Seznam.cz, a long-established Czech search engine and web portal. It uses the SeznamBot robots.txt token. Some details are documented primarily in Czech, so this entry is marked partially verified pending confirmation in Seznam's own documentation.
- Sogou Spider — Sogou's web crawler
Sogou Spider is the crawler for Sogou, a Chinese search engine. Its user agent contains the Sogou identifier. English-language documentation is limited and verification options are not well published, so this entry is marked partially verified.
- PetalBot — Huawei Petal Search crawler
PetalBot is the crawler operated for Huawei's Petal Search. It uses the PetalBot robots.txt token and is documented as honouring robots.txt. Petal Search is associated with Huawei devices and services, so PetalBot crawling supports that search experience.
- AhrefsBot — Ahrefs SEO crawler
AhrefsBot is the crawler operated by Ahrefs to build its SEO and backlink index. It is a third-party crawler, not a search engine, so it does not affect Google or Bing rankings directly. It uses the AhrefsBot robots.txt token and is documented as respecting robots.txt and crawl-delay.
- SemrushBot — Semrush SEO crawler
SemrushBot is the crawler operated by Semrush to gather data for its SEO toolset. It is a third-party crawler, not a search engine, so it does not affect search rankings directly. It uses the SemrushBot robots.txt token and is documented as respecting robots.txt.
- MJ12bot — Majestic's web crawler
MJ12bot is the crawler behind Majestic's backlink index. It is notable for being distributed — run across many independent operators — rather than a single central crawl. It uses the MJ12bot robots.txt token and is documented as honouring robots.txt.
- DotBot — Moz's web crawler
DotBot is the crawler operated by Moz to build its link index, which powers Moz's link-analysis tools. It is a third-party crawler, not a search engine. It uses the DotBot robots.txt token and is documented as honouring robots.txt.
- Screaming Frog SEO Spider
Screaming Frog SEO Spider is a desktop application that site owners and SEO professionals run themselves to audit a site. It is not a public, continuously operating crawler like Googlebot; its user agent is user-controlled and its crawling is initiated by whoever runs the tool.
- Naver Yeti — South Korea search crawler
Yeti is the web crawler operated by Naver, the search and content portal that holds a leading share of search in South Korea. Its robots.txt token is Yeti. Naver provides webmaster tooling and documentation, much of it in Korean, so some specifics are marked partially verified.
- Yahoo Slurp — Yahoo's web crawler
Slurp is the web crawler historically operated by Yahoo. Modern Yahoo Search blends results from search partners rather than relying solely on its own index, so Slurp's role is narrower than it once was. Its robots.txt token is Slurp; current scope is best confirmed in Yahoo's documentation, so this entry is partially verified.
- CocCocBot — Vietnam search crawler
CocCocBot is the crawler for Coc Coc, a browser and search engine focused on the Vietnamese market with features tuned for Vietnamese-language text. Its robots.txt token is coccocbot. Documentation is largely in Vietnamese, so some specifics are marked partially verified.
- 360Spider — Qihoo 360 crawler
360Spider is the crawler associated with Qihoo 360's search service in China. Its user agent contains the 360Spider identifier. English-language documentation and verification methods are limited, so this entry is marked partially verified.
- MojeekBot — independent index crawler
MojeekBot is the crawler for Mojeek, a search engine notable for building and operating its own independent web index rather than reselling another engine's results. Its robots.txt token is MojeekBot. Mojeek documents the crawler, though some specifics are marked partially verified.
- Brave Search crawler
Brave Search operates its own independent web index, and uses a dedicated crawler to gather pages for it. Brave documents the crawler and its robots.txt token. The exact token and verification specifics should be confirmed in Brave's documentation, so this entry is marked partially verified.
- Google-InspectionTool — Search Console tester
Google-InspectionTool is the user agent Google uses for its Search testing tools, including URL Inspection in Search Console and the Rich Results Test. Google documents it as a separate token from Googlebot, so the requests you see are on-demand tests, not routine indexing crawls.
- Google-Safety — abuse/malware crawler
Google-Safety is the user agent Google uses for abuse-related crawling, such as malware detection tied to specific products. Google documents that, because of its safety purpose, it is not bound by robots.txt rules in the way ordinary crawlers are.
- Storebot-Google — shopping crawler
Storebot-Google is the Google crawler associated with shopping experiences, used to crawl product and shopping-related pages. Google documents it among its crawlers with its own user-agent token, so operators can recognise shopping-related crawling distinctly from general Googlebot indexing.
- Feedfetcher-Google — feed fetcher
Feedfetcher-Google is the user agent Google uses to fetch RSS and Atom feeds for Google products. Google documents that Feedfetcher is not used for indexing, and that because feed fetches are user-requested subscriptions, it is handled differently from indexing crawlers.
- AdIdxBot — Microsoft Ads crawler
AdIdxBot is the crawler used by Microsoft Advertising (Bing Ads) to crawl ads and the websites they link to. Microsoft documents it as distinct from bingbot, the organic search crawler, with its own user-agent token, so ad-related crawling can be recognised separately.
- BLEXBot — WebMeUp backlink crawler
BLEXBot is a crawler associated with WebMeUp's SEO backlink tooling. It is a third-party crawler that builds a backlink index, not a search engine, so it does not affect search rankings directly. Its robots.txt token is BLEXBot; some specifics are marked partially verified.
- DataForSeoBot — SEO data crawler
DataForSeoBot is the crawler operated by DataForSEO to gather data exposed through its SEO data APIs. It is a third-party crawler, not a search engine, so it does not affect rankings directly. Its robots.txt token is DataForSeoBot; DataForSEO documents it, and some specifics are partially verified.
- Seekport Crawler
Seekport Crawler is associated with the Seekport search project. Its user agent is reported to contain a SeekportBot identifier. Public documentation is sparse and some specifics cannot be confidently sourced, so this entry describes the identification pattern and leaves unconfirmed details unverified.
- How to verify Googlebot
The Googlebot user agent is widely spoofed, so a request claiming to be Googlebot should be verified, not trusted. Google documents two methods: a reverse-DNS check that resolves into googlebot.com or google.com and is confirmed by a matching forward lookup, or a match of the source IP against Google's published crawler IP ranges.
- How to verify Bingbot
The Bingbot user agent is commonly spoofed, so a request claiming to be Bingbot should be verified rather than trusted. Microsoft documents a reverse-DNS method: the source IP should resolve into search.msn.com, confirmed by a forward lookup back to the same IP. Bing also publishes IP information for verification.
- Search crawlers vs SEO crawlers
Search-engine crawlers like Googlebot and Bingbot build the indexes that determine search visibility. Third-party SEO crawlers like AhrefsBot and SemrushBot feed analysis tools and do not affect rankings directly. Distinguishing them matters for crawl-budget reasoning and for deciding what to allow or limit.
- Managing third-party SEO crawler load
Third-party SEO crawlers such as AhrefsBot and SemrushBot can generate significant request volume without contributing to search visibility. You can manage their load by targeting their tokens in robots.txt, using crawl-delay where the crawler supports it, and blocking those that bring no value to you.
- Regional search engines overview
In several markets a regional search engine leads or holds a significant share alongside Google: Yandex in Russian-language search, Baidu in China, Naver in South Korea, Seznam in the Czech Republic, and Coc Coc in Vietnam. Recognising their crawlers matters because being indexed by them is how you reach those audiences.
- Crawler IP verification methods
Because user-agent strings are trivially copied, the reliable way to confirm a crawler is to check its source. The two documented methods are reverse DNS with a forward-confirm step, and matching the source IP against the engine's published IP ranges. Together they defend against spoofed crawler traffic.
- How Googlebot renders JavaScript
Googlebot does not just read raw HTML — it also renders pages using the Web Rendering Service (WRS) so JavaScript-generated content can be indexed. Google documents rendering as a stage that can happen after the initial crawl, which is why content that depends on client-side JavaScript may be processed with a delay.
- Fake search-bot traffic
Because search-engine crawlers are widely allowed, abusive clients copy the Googlebot or Bingbot user-agent string to slip past rules meant for real crawlers. This fake search-bot traffic is identified by verifying the source: genuine crawlers pass reverse-DNS and published-IP checks; spoofed ones do not.
- rogerbot — Moz's site-audit crawler
rogerbot is the crawler operated by Moz to power Moz Pro Campaigns, site crawl, and link-related features. It is an SEO tool crawler, not a search engine, so its visits do not affect search rankings. Moz documents the rogerbot token and publishes guidance for operators who want to identify or restrict it in robots.txt.
- SISTRIX crawler — SISTRIXCrawler bot
The SISTRIX crawler fetches pages to build data for the SISTRIX SEO toolbox, including its visibility and on-page analyses. It is a third-party SEO tool crawler operated by a Germany-based company, not a search engine. SISTRIX documents the crawler and provides guidance for operators who want to identify or restrict it.
- Barkrowler — Babbar's web crawler
Barkrowler is the web crawler operated by Babbar (eXensa) to build the link graph and authority metrics behind the Babbar SEO platform. It is a third-party SEO/link-data crawler, not a search engine. Babbar documents Barkrowler and provides robots.txt guidance for operators who want to identify or restrict it.
- SerpstatBot — Serpstat's web crawler
SerpstatBot is the crawler operated by Serpstat to collect backlink and SEO data for its platform. It is a third-party SEO tool crawler, not a search engine. Serpstat documents SerpstatBot and publishes robots.txt and crawl-rate guidance for operators who want to identify or restrict it.
- Qwant search crawler — Qwantify
Qwant is a privacy-focused search engine based in France that operates its own crawler to index the web for European search results. Its crawler self-identifies with a Qwant token (historically Qwantify). It is a genuine search-engine indexer, so allowing it can help your pages appear in Qwant results.
- Mail.ru search crawler — Mail.RU_Bot
Mail.ru is a major Russian internet portal and search provider, and it operates its own crawler to index the web for its search results. The crawler self-identifies with a Mail.RU_Bot token. It is a genuine regional search-engine indexer, so allowing it can help your pages appear in Mail.ru search.
- Oncrawl bot — OnCrawl technical-SEO crawler
Oncrawl is a technical-SEO and log-analysis platform whose crawler fetches pages to build site-structure and on-page audits for its subscribers. It is a third-party SEO tool crawler, not a search engine. Oncrawl documents the crawler and provides robots.txt guidance for operators who want to identify or restrict it.
- JetOctopus crawler — technical-SEO auditor
JetOctopus is a technical-SEO platform whose crawler audits large sites for structure, indexability, and on-page issues, alongside log-file analysis. It is a third-party SEO tool crawler, not a search engine. JetOctopus documents the crawler and supports robots.txt and crawl-rate controls.
- Sitebulb crawler — desktop/cloud SEO auditor
Sitebulb is a desktop and cloud SEO auditing tool whose crawler fetches pages to map site structure, internal links, and on-page issues. It is a third-party SEO tool crawler, not a search engine. Sitebulb documents its user agent and supports robots.txt handling and a configurable crawl identity.
- Lumar (DeepCrawl) — enterprise SEO crawler
Lumar (formerly DeepCrawl) is an enterprise technical-SEO platform whose crawler audits large sites for indexability, structure, and on-page health. It is a third-party SEO tool crawler, not a search engine. Lumar documents its crawler and supports robots.txt and crawl-rate controls for operators.
- Botify crawler — enterprise SEO platform
Botify is an enterprise SEO platform whose crawler fetches pages to build crawl, indexability, and content analyses for large sites, often combined with log-file analysis. It is a third-party SEO tool crawler, not a search engine. Botify documents its crawler and supports robots.txt and crawl-rate controls.
- ContentKing crawler — real-time SEO monitor
ContentKing, now part of Conductor, is a real-time SEO monitoring tool whose crawler continuously checks pages for changes in content, indexability, and on-page health. It is a third-party SEO tool crawler, not a search engine. ContentKing documents its crawler and supports robots.txt handling.
- CognitiveSEO crawler — backlink/SEO data bot
CognitiveSEO is an SEO and backlink-analysis platform whose crawler fetches pages to build link and on-page datasets for its subscribers. It is a third-party SEO tool crawler, not a search engine. This entry describes the documented pattern; confirm its current token in CognitiveSEO's materials before relying on it.
- Linkdex bot — SEO platform crawler
Linkdex was an enterprise SEO platform whose crawler fetched pages to build link and on-page datasets. It was a third-party SEO tool crawler, not a search engine. Because Linkdex has been absorbed into other products over time, this entry describes the documented pattern and is marked partially verified.
- Pingdom bot — uptime/performance monitor
Pingdom (SolarWinds) is an uptime and performance monitoring service that fetches your pages on a schedule to check availability and speed. Its requests are automated monitoring, not search indexing or human visits. Pingdom documents its checks and the identifiers operators can use to recognise them.
- UptimeRobot bot — uptime monitoring checks
UptimeRobot is an uptime monitoring service that requests your URLs at intervals to confirm they respond. Its requests are automated availability checks, not search indexing or human visits. UptimeRobot identifies its monitor in the user agent so operators can recognise its checks.
- Goo search crawler — Japanese portal
Goo is a long-running Japanese web portal that offers search alongside dictionaries, news, and other services. Its search has historically relied on partner indexes, and a Goo-identified crawler may appear in logs when it fetches pages. This entry describes the documented pattern and is marked partially verified.
- Daum crawler — Korean portal (Kakao)
Daum is a major South Korean web portal, now operated under Kakao, offering search alongside news, mail, and community services. It runs a search crawler that self-identifies with a Daum token. It is a regional search-engine indexer, important for reaching Korean-market audiences.
- Rambler crawler — Russian portal
Rambler is a long-running Russian web portal offering news, mail, and search. Its search has historically been powered by partner indexes rather than solely its own crawl. A Rambler-identified crawler may appear in logs when it fetches pages; this entry describes the documented pattern and is marked partially verified.
- Startpage — proxied Google results, no own crawler
Startpage is a privacy-focused search engine based in the Netherlands that returns results sourced from a partner index (historically Google) rather than operating its own large-scale web crawler. As a result, you generally do not see a Startpage indexing bot in your logs; your visibility there depends on the upstream index.
- Yep crawler — YepBot (Ahrefs search)
Yep is a web search engine built by Ahrefs, and it is served by a crawler that self-identifies as YepBot. It is a genuine search-engine indexer, distinct from AhrefsBot (which powers Ahrefs' SEO/backlink datasets). Allowing YepBot can help your pages appear in Yep search results.
- Monitoring bots vs search crawlers
Monitoring bots (uptime and performance checkers such as Pingdom and UptimeRobot) fetch your pages on a schedule to confirm availability, not to index them. They differ from search crawlers, which build a search index, and from SEO crawlers, which gather competitive data. Telling them apart keeps synthetic checks out of human analytics.
- Seobility crawler — SECrawler audit bot
Seobility is a German SEO audit tool whose crawler fetches pages to check on-page issues, internal links, and technical health for its subscribers. It is a third-party SEO tool crawler, not a search engine. Seobility documents its crawler and provides robots.txt guidance for operators who want to identify or restrict it.
- APIs-Google fetcher
APIs-Google is a Google fetcher user agent used when Google products send push notifications or other API-driven requests to a developer's server — for example a PubSubHubbub (WebSub) delivery. It is not the search crawler and is not used to build the search index. Google documents it in the list of Google crawlers and fetchers, and it is verifiable against Google's published crawler IP ranges.
- Mediapartners-Google (AdSense crawler)
Mediapartners-Google is the crawler Google AdSense uses to fetch and analyse the content of pages that show AdSense ads, so it can choose relevant ads. It is documented in Google's crawler list as a special-case crawler tied to the AdSense product, distinct from the Googlebot search crawler, and it has its own robots.txt token.
- Google-Read-Aloud fetcher
Google-Read-Aloud is a user-triggered Google fetcher that retrieves a page so Google can convert its text to speech and read it aloud to a user. It is documented among Google's user-triggered fetchers, is not the search crawler, and because it acts on a user's request it generally ignores robots.txt the way other user-triggered fetchers do.
- Google Site Verifier fetch
Google-Site-Verification is the fetcher Google uses to confirm a site-ownership verification token — for example retrieving an HTML verification file or checking a meta tag — when you verify a property in Search Console or other Google services. It is documented among Google's special-case fetchers and is unrelated to ongoing search indexing.
- The Search Console Crawl Stats report
The Crawl Stats report is a Google Search Console feature that summarises Googlebot's crawling of your site over the last 90 days — total crawl requests, total download size, average response time, and breakdowns by response code, file type, crawl purpose (discovery vs refresh), and Googlebot type. It is the primary first-party place to understand how Google crawls a property.
- Verifying Applebot
Applebot is the crawler behind Apple features such as Siri and Spotlight Suggestions. Because any client can copy its user-agent string, verifying a request claiming to be Applebot means confirming the source rather than trusting the token. Apple documents Applebot and identification guidance, and the standard reverse-DNS-then-forward-DNS technique applies.
- GoogleProducer fetcher
GoogleProducer is a user agent that appears among Google's crawlers and fetchers and is used by Google product infrastructure rather than the search index. It is documented by Google as one of the special-case Google agents. Some specifics of when it fetches are not exhaustively published, so it is recorded here as partially verified to avoid inventing behaviour.
- DuplexWeb-Google fetcher
DuplexWeb-Google is a Google user agent linked to Duplex on the web features. It is documented among Google's crawlers and fetchers as a special-case agent rather than the search crawler. Because Duplex on the web has limited public footprint, exact current behaviour is not fully documented, so this entry is partially verified and avoids inventing specifics.
- Googlebot crawl frequency
Googlebot's crawl frequency is governed by two forces Google describes as crawl capacity limit and crawl demand. Capacity reflects how much your server can handle without slowing down; demand reflects how interesting and fresh Google judges your URLs to be. Google removed the manual crawl-rate setting, so the rate is mostly automatic and responds to your site's health and value.
- Crawl budget for large sites
Crawl budget is the practical limit on how many URLs Googlebot will crawl on your site in a given period, set by crawl capacity and crawl demand. Google says most sites do not need to worry about it, but very large sites (hundreds of thousands of URLs) or sites with many auto-generated URLs should manage it so Google spends crawling on valuable pages, not duplicates and dead ends.
- Crawler traps and how to avoid them
A crawler trap (or spider trap) is a structure that produces an effectively unlimited number of low-value URLs, such as an infinite calendar, faceted-filter combinations, or session IDs in URLs. Traps waste crawl budget, can dilute indexing signals, and make logs noisy. They are recognised in Google's crawl-budget guidance and are fixable with URL hygiene.
- SEOkicks crawler (SEOkicks-Robot)
SEOkicks-Robot is the crawler operated by SEOkicks, a backlink-analysis service. Like other link-index crawlers, it fetches pages to discover and record hyperlinks for its backlink database rather than to serve a public search engine. The token and self-identifying URL are observable in logs; some operational specifics are not exhaustively published, so this entry is partially verified.
- Ryte crawler (BotLogen)
Ryte is a website-quality and SEO auditing platform, and its crawler has been observed identifying with the BotLogen token. It fetches a customer's pages to run technical and content audits. As a tool crawler it does not serve a consumer search engine. Token and self-identifying URL are observable; some specifics are not exhaustively published, so this entry is partially verified.
- ZoominfoBot crawler
ZoominfoBot is the crawler associated with ZoomInfo, a company that builds B2B contact and company datasets. It fetches publicly available business-related web content to support that data product rather than to power a consumer search engine. The self-identifying token is observable in logs; ZoomInfo's published crawler specifics are limited, so this entry is partially verified.
- BUbiNG research crawler
BUbiNG is an open-source distributed web crawler developed by the Laboratory for Web Algorithmics (LAW) at the University of Milan. It is designed for high-throughput crawling for research and dataset building, not to power a consumer search engine. Because anyone can run the open-source software, a BUbiNG user agent indicates the crawler software, not a single operator.
- Magpie-crawler (Brandwatch)
Magpie-crawler is a crawler that has been associated with Brandwatch's Magpie data-collection infrastructure for social and web monitoring. It fetches publicly available pages to support media monitoring and analytics rather than a consumer search engine. The self-identifying token is observable; published specifics are limited, so this entry is partially verified.
- Pandalytics crawler (Domsignal)
Pandalytics is a crawler that has been observed identifying with the Pandalytics token, associated with Domsignal's SEO and website-analysis tooling. It fetches pages to support that analysis rather than to serve a consumer search engine. The self-identifying token is observable; published specifics are limited, so this entry is partially verified.
- Cocolyze crawler (CocolyzeBot)
CocolyzeBot is the crawler associated with Cocolyze, an SEO platform offering rank tracking, site audits, and backlink analysis. It fetches pages to support those tools rather than to power a consumer search engine. The self-identifying token is observable in logs; published crawler specifics are limited, so this entry is partially verified.
- OnPage.org / Ryte audit crawler heritage
OnPage.org was a German SEO and website-quality platform that rebranded to Ryte. Its audit crawler fetched a customer's site to run technical and content checks. Older log entries may reference OnPage-era tokens. This entry documents the lineage and audit-crawler behaviour; exact historic UA strings are not asserted, so it is partially verified.
- Baiduspider-image (Baidu image crawler)
Baiduspider-image is the image-oriented variant of Baiduspider, the crawler for Baidu, China's leading search engine. It fetches images to support Baidu image search, separate from the general-web Baiduspider. Baidu documents Baiduspider and its variants; some specifics are documented in Chinese and not exhaustively in English, so this entry is partially verified.
- Sogou web spider vs image spider
Sogou, a major Chinese search engine, operates more than one crawler variant, including a general web spider and an image-oriented spider. Separating them helps operators read Sogou crawl activity by purpose and set policy accordingly. Sogou documents its spiders primarily in Chinese; some specifics are not exhaustively published in English, so this entry is partially verified.
- Onet search crawler (Poland)
Onet is one of Poland's largest web portals, with search and content surfaces. Its crawler fetches public pages to support those surfaces. As a regional player it matters mainly for sites targeting the Polish market. Onet's crawler documentation is limited and primarily in Polish, so the self-identifying token is the reliable signal and this entry is partially verified.
- Feedfetcher vs APIs-Google
Feedfetcher-Google and APIs-Google are both special-case Google fetchers that are easy to confuse with each other and with Googlebot. Feedfetcher pulls RSS/Atom feeds for Google products that subscribe to them; APIs-Google delivers API-triggered messages such as push notifications. Neither is the search crawler. Google documents both in its crawlers and fetchers list.
- User-triggered fetchers vs crawlers
Google groups its automated agents into common crawlers, special-case crawlers, and user-triggered fetchers. User-triggered fetchers act because a person asked for something now — like reading a page aloud or fetching a preview — and are treated differently from indexing crawlers, including how they relate to robots.txt. Understanding the distinction prevents wrong robots.txt and analytics decisions.
- ia_archiver and the Internet Archive crawler
ia_archiver is a long-standing user-agent token associated with crawling for the Internet Archive's Wayback Machine and related collections. The Internet Archive operates archival crawlers that fetch public pages to preserve snapshots over time. The token has historic ties to the Alexa crawler that fed early Archive collections, so log entries may show ia_archiver or archive.org-related agents depending on the crawl source.
- archive.org_bot — Internet Archive web crawler
archive.org_bot is a user agent associated with Internet Archive crawling that fetches public web pages for preservation in collections such as the Wayback Machine. It is an archival agent, distinct from search-engine indexing crawlers, and identifies itself via an archive.org URL in its user-agent string. Operators see it when their public pages are captured for long-term snapshots.
- Wayback Machine Save Page Now fetcher
Save Page Now is the Internet Archive feature that captures a specific URL on demand when a person requests a snapshot through the Wayback Machine. Unlike background archival crawling, this fetch happens because someone asked for it right now, making it a user-triggered archival fetch. It appears in logs as an archive.org-identifying request tied to a save request rather than a scheduled crawl.
- Censys and Shodan scanning crawlers
Censys and Shodan are internet-wide scanning services that map reachable hosts, open ports, and exposed services for security research and asset discovery. They are not search-engine crawlers indexing your content for ranking; they probe infrastructure. Their requests appear in logs as scanning activity from their published scanner identities, and they offer opt-out mechanisms for operators.
- BinaryEdge scanning crawler
BinaryEdge is an internet-scanning service that collects data on reachable hosts, open ports, and exposed services for security and threat-intelligence use. Like Censys and Shodan, it probes infrastructure rather than indexing your pages for search ranking. Its scanning appears in logs as automated probes, and the service provides information for operators who want to identify or exclude it.
- Qualys web application scanner
Qualys operates security scanning that assesses web applications and infrastructure for vulnerabilities and misconfigurations. Some Qualys scanning is authorised by the site owner (an internal security assessment); some is part of broader internet measurement. It is a security tool, not a search crawler, and its probes appear in logs as scanning rather than content fetching for ranking.
- Security scanners vs search crawlers
Security scanners (Censys, Shodan, BinaryEdge, Qualys and similar) probe hosts, ports, and application surface to assess exposure and find vulnerabilities. Search crawlers (Googlebot, Bingbot) fetch and index content to rank it. Confusing the two leads to wrong robots.txt decisions and misread logs: robots.txt governs content crawling, not port scanning, and scan traffic should never be counted as audience.
- Lighthouse and PageSpeed Insights fetchers
Lighthouse is Google's open-source page-quality auditing tool, and PageSpeed Insights is the hosted service that runs Lighthouse audits and reports field and lab performance data. Both fetch a page on demand to measure it, not to index it for search. Their fetches are user-triggered performance audits and appear in logs as a single page load with related resource requests, not a crawl.
- GTmetrix and WebPageTest fetchers
GTmetrix and WebPageTest are web-performance testing tools that load a page from controlled test agents to measure load time, rendering, and resource behaviour. They fetch on demand to benchmark a URL, not to crawl or index a site. Their fetches appear in logs as a full page load plus resources from test infrastructure, often from specific test locations the user selects.
- Web-performance fetchers overview
Web-performance tools — Lighthouse, PageSpeed Insights, GTmetrix, WebPageTest and similar — load a page on demand to measure speed, rendering, and resource behaviour. They are neither search crawlers nor human visitors: they are user-triggered measurement automation. Reading them correctly keeps performance audits out of audience metrics and out of search-crawl coverage.
- Brandwatch and social-monitoring crawlers
Brandwatch is a social-listening and consumer-intelligence platform that gathers public web and social content to track brand mentions, sentiment, and trends. It and similar tools crawl or fetch public pages to feed mention analysis, not to index content for search ranking. Their fetches appear in logs as monitoring traffic from the platform's infrastructure.
- Meltwater media-monitoring crawler
Meltwater is a media-monitoring and PR-intelligence platform that gathers public news and web content to track coverage, mentions, and sentiment for its customers. It fetches public pages to feed monitoring, not to index them for search ranking. Its activity appears in logs as monitoring traffic from the platform's infrastructure.
- Talkwalker social-analytics crawler
Talkwalker is a social-analytics and consumer-intelligence platform that gathers public web and social content to measure mentions, sentiment, and trends. Its fetches collect public content for monitoring, not for search ranking. Activity appears in logs as monitoring traffic from the platform's infrastructure, with much data also sourced via APIs and partnerships.
- Social-listening crawlers overview
Social-listening and media-monitoring platforms collect public web and social content to track brand mentions, sentiment, and trends for their customers. They are monitoring tools, not search crawlers: they analyse public conversation rather than indexing your pages to rank them. Much of their data also comes from platform APIs and licensed feeds, not only direct crawling.
- ExaBot (Exalead) crawler
ExaBot is the crawler associated with Exalead, a French-origin web search engine that built its own index. ExaBot fetched public pages to populate Exalead's search results. Exalead's consumer web search has long since wound down, so ExaBot is largely a legacy token: you may still see it in historic logs or from residual crawling, identified by the ExaBot user-agent.
- Gigablast crawler (GigaBot)
Gigablast was an independent search engine, known for running its own web index and open-sourcing parts of its technology. Its crawler (associated with the GigaBot identity) fetched public pages to build that index. Gigablast's public search has wound down, so its crawler is largely a legacy token seen in historic logs rather than an active mainstream engine.
- Cliqz search crawler
Cliqz was a German privacy-focused search engine and browser project that built its own search index rather than relying on the major engines. Its crawler fetched public pages for that index. The Cliqz project was discontinued, so its crawler is a legacy token: you may see it in historic logs, associated with the Cliqz identity.
- Superfeedr feed fetcher
Superfeedr is a feed-handling service that polls and processes RSS/Atom feeds and pushes updates to subscribers, historically supporting real-time feed delivery via PubSubHubbub/WebSub. It fetches your feed URLs to detect new items, not your pages for search ranking. Its activity appears in logs as repeated feed fetches from Superfeedr infrastructure.
- Feedbin feed reader fetcher
Feedbin is a hosted RSS/Atom feed reader that polls feeds on behalf of its subscribers so they can read new items. It fetches your feed URLs to detect updates, not your pages for search ranking. Its activity appears in logs as repeated feed fetches from Feedbin infrastructure, with a self-identifying client.
- Archival crawlers overview
Archival crawlers, led by the Internet Archive's Wayback Machine crawling, fetch public pages to preserve point-in-time snapshots for research, journalism, and the historical record. They are not search crawlers: they capture how a page looked rather than ranking it. Understanding the difference keeps robots.txt and analytics decisions sensible, since archiving and indexing serve different goals.
- SafeDNS content-classification crawler
SafeDNS is a DNS-based web-filtering service that classifies sites into content categories so its customers can allow or block them. To build and maintain that categorisation, it fetches public pages to analyse their content. This is classification for filtering, not search indexing, and appears in logs as fetches from SafeDNS infrastructure.
- Project Honey Pot and http:BL
Project Honey Pot is a community effort that uses honeypot pages to catch email harvesters, comment spammers, and other malicious bots, and exposes its findings through the http:BL (HTTP blacklist) service. It is not a search crawler: it identifies bad bots so operators can recognise them. Understanding it helps separate abusive automation from legitimate search and SEO crawling.
- Archive-It crawler (Internet Archive)
Archive-It is a subscription web-archiving service run by the Internet Archive, used by libraries, universities, and institutions to capture and preserve selected websites on a schedule. Its crawler fetches the public pages an institution has chosen to archive, building curated collections rather than indexing the whole web for search. It appears in logs as archival fetches associated with the Internet Archive.
- SimilarWeb crawler
SimilarWeb is a digital-intelligence company whose crawler fetches publicly accessible web pages as one input to its market-research, traffic-estimation, and competitive-analytics products. It is a data-collection crawler, not a search engine: it gathers signals about websites rather than building a public search index. SimilarWeb publishes a self-identifying crawler user-agent and a page describing the bot so operators can recognise and control it.
- Netcraft survey crawler
Netcraft is a security and internet-research company known for its long-running Web Server Survey, which measures the software, hosting, and configuration of public web servers across the internet. Its crawler fetches public endpoints to record server signals rather than to index page content for search. It appears in logs as periodic survey probes associated with Netcraft's research and anti-phishing operations.
- Semantic Scholar academic crawler
Semantic Scholar is a free academic search engine and research corpus built by the Allen Institute for AI (AI2). Its crawler fetches scholarly pages, papers, and metadata to index research literature and power citation-aware academic search. Unlike a general web crawler, it focuses on academic and publisher content, and AI2 publishes documentation and an open API around the corpus it builds.
- Idealo price-comparison crawler
Idealo is a major European price-comparison platform, particularly in Germany, that aggregates product offers so shoppers can compare prices across retailers. Its crawler fetches retailer and merchant product pages to read prices, availability, and product details for that comparison. It is a shopping/commerce crawler, not a search engine, and operates alongside merchant feeds rather than building a general web index.
- SE Ranking and Mangools crawlers
SE Ranking and Mangools are SEO software platforms whose crawlers fetch web pages, backlinks, and on-page signals to power site audits, rank tracking, and keyword research for their subscribers. They are SEO data crawlers, not search engines: they build private datasets for marketing tools rather than a public index. Both publish self-identifying crawler user-agents so operators can recognise and control them in robots.txt.
- SSL Labs / Qualys SSL scanner
SSL Labs is a free TLS/SSL assessment service from Qualys that probes a server's HTTPS configuration — protocols, ciphers, certificate chain, and known vulnerabilities — and produces a letter-grade report. It runs on demand when someone tests a hostname, connecting to the public HTTPS endpoint rather than crawling page content. It appears in logs as TLS handshakes and probes against port 443, not as content indexing.
- SpyFu crawler
SpyFu is a competitive-research platform focused on SEO and paid-search (PPC) intelligence, letting marketers see competitors' keywords, ranking history, and ad data. Its crawler and data collection fetch search and web signals to build those datasets for subscribers. It is an SEO/PPC data tool, not a search engine, and operates to populate private research dashboards rather than a public index.
- Ubersuggest crawler
Ubersuggest is an SEO platform, associated with Neil Patel, offering keyword research, content ideas, site audits, and backlink analysis. Its crawler fetches web pages and link data to populate those features for users. It is an SEO data crawler rather than a search engine, building datasets for its dashboards instead of a public consumer index.
- Sitechecker crawler
Sitechecker is an SEO platform that crawls a website to run technical audits, monitor on-page health, and track changes over time. Its crawler fetches a site's pages to check status codes, metadata, links, and other on-page signals for its subscribers. It is an SEO audit crawler, typically invoked on a user's own verified site, rather than a search engine building a public index.
- CORE academic aggregator crawler
CORE is one of the world's largest aggregators of open-access research papers, harvesting content from institutional and subject repositories to provide a unified scholarly search and dataset. Its crawler and harvesters fetch open-access papers and metadata from repositories rather than indexing the general web. It appears in logs as scholarly harvesting, typically against repository and publisher endpoints.
- OpenAlex crawler
OpenAlex, run by the non-profit OurResearch, is a free and open catalogue of the global research system — papers, authors, institutions, venues, and concepts — offered as data and an API. Its crawler and harvesters gather scholarly metadata and links to build an open scientific knowledge graph. It is a research-metadata aggregator rather than a general web search engine.
- Crossref crawler
Crossref is a non-profit DOI registration agency that links scholarly publications through persistent identifiers and rich metadata. Its services fetch publisher landing pages and content to support DOI registration, metadata deposit, similarity checking, and link resolution. It is scholarly-infrastructure crawling for the academic citation ecosystem, not general web search indexing.
- PriceRunner crawler
PriceRunner is a price-comparison platform, strong across the Nordics and Europe, that helps shoppers compare prices and offers across retailers. Its crawler fetches retailer product pages to read prices, availability, and product details, complementing structured merchant feeds. It is a commerce/shopping crawler rather than a search engine, gathering offer data for comparison rather than a public web index.
- Shopzilla crawler
Shopzilla is a long-running shopping-comparison brand (part of the Connexity/Kit network of comparison and retail-media services) that aggregates product offers and prices for shoppers. Its data collection combines merchant feeds with crawling of retailer product pages to read prices and availability. It is a commerce/shopping crawler rather than a search engine building a public web index.
- Netsparker / Invicti scanner
Netsparker, rebranded as Invicti, is a commercial dynamic application security testing (DAST) scanner. It crawls a target web application and actively tests inputs for vulnerabilities such as injection and misconfiguration, producing a security report. It is a security-testing tool meant to be run against your own sites with authorisation, not a search engine, and its traffic looks like aggressive crawling plus probe requests.
- axe accessibility scanner
axe is an open-source accessibility-testing engine from Deque Systems, embedded in browser extensions, CI pipelines, and tools that evaluate pages against WCAG accessibility rules. When run in an automated or hosted mode it fetches and renders a page to analyse its accessibility, rather than indexing content for search. It typically runs on your own pages, on demand, as part of accessibility QA.
- WAVE accessibility crawler
WAVE is a web accessibility evaluation tool from WebAIM that analyses pages for accessibility and WCAG issues, available as a hosted checker, browser extension, and API. Its hosted and API modes fetch a page to evaluate its accessibility and report errors and alerts. It is an accessibility-evaluation tool, typically run on pages you want to test, rather than a search engine indexing content.
- NewsBlur feed fetcher
NewsBlur is a personal RSS/Atom feed reader that polls subscribed feeds on a schedule and fetches new posts for its users. Its fetcher requests your feed (and sometimes the linked page) to deliver fresh items to people who subscribed in NewsBlur. It is a feed-reader fetcher, not a search engine, and its traffic scales with how many NewsBlur users follow your feed.
- Tiny Tiny RSS fetcher
Tiny Tiny RSS (TT-RSS) is a free, self-hosted RSS/Atom feed reader that individuals run on their own servers to follow feeds privately. Each TT-RSS instance polls the feeds its owner subscribes to and fetches new posts. Because it is self-hosted, requests originate from many independent installations rather than one central service, and it appears in logs as feed polling, not search indexing.
- Web intelligence and traffic crawlers — overview
Web-intelligence and traffic crawlers fetch public pages to build market-research, traffic-estimation, and internet-measurement datasets rather than to power consumer search. This overview explains how to recognise them, why they are distinct from search and SEO crawlers, and how to set policy. They build private analytics or research datasets, so their crawling reflects measurement coverage rather than audience.
- Academic and research crawlers — overview
Academic and research crawlers fetch scholarly papers and metadata to build research search engines, open catalogues, and citation infrastructure. This overview covers how Semantic Scholar, CORE, OpenAlex, and Crossref differ from general web crawlers, why much of their work is metadata harvesting via standard protocols, and how to set policy. For sites hosting research, they generally increase scholarly discoverability.
- Price-comparison and shopping crawlers — overview
Price-comparison and shopping crawlers fetch retailer product pages to read prices, availability, and product details for comparison platforms. This overview explains how Idealo, PriceRunner, and Shopzilla operate, why they combine crawling with structured merchant feeds, and how retailers should set policy. They build offer-comparison datasets, not a general search index, so their crawling reflects offer-refresh cadence.
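Several entries above stress keeping archival, feed, and special-case fetchers separate from search crawlers when reading logs. A minimal sketch of that idea follows, assuming simple case-insensitive substring matching on the user-agent string. The tokens are ones named in this hub; the category labels, the function names, and the matching strategy are illustrative assumptions, not any vendor's documented method. Because user agents can be spoofed, this only labels what a request claims to be and does not verify it.

```python
from collections import Counter

# Illustrative token-to-category map. Tokens are those named in this hub;
# the category labels are assumptions made for this sketch.
# More specific tokens are listed first so they win over broader ones.
TOKEN_CATEGORIES = [
    ("Feedfetcher-Google", "feed fetcher"),
    ("APIs-Google", "special-case fetcher"),
    ("Googlebot", "search crawler"),
    ("bingbot", "search crawler"),
    ("archive.org_bot", "archival crawler"),
    ("ia_archiver", "archival crawler"),
    ("ExaBot", "legacy search crawler"),
]


def classify_user_agent(user_agent: str) -> str:
    """Return the claimed category for a user-agent string.

    Matching is a case-insensitive substring test, so the result reflects
    only what the request says about itself, not a verified identity.
    """
    ua = user_agent.lower()
    for token, category in TOKEN_CATEGORIES:
        if token.lower() in ua:
            return category
    return "unclassified"


def summarize(user_agents):
    """Count how many user-agent strings fall into each claimed category."""
    return Counter(classify_user_agent(ua) for ua in user_agents)


if __name__ == "__main__":
    # Hypothetical sample strings, for illustration only.
    sample = [
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Mozilla/5.0 (compatible; archive.org_bot +http://archive.org/details/archive.org_bot)",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    ]
    for category, count in summarize(sample).items():
        print(f"{category}: {count}")
```

A request labelled "search crawler" here would still need the reverse-DNS or published-IP-range check described in the Googlebot and Bingbot entries before it is trusted, and anything labelled "unclassified" is not necessarily human traffic.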