BUbiNG research crawler
BUbiNG is an open-source distributed web crawler developed by the Laboratory for Web Algorithmics (LAW) at the University of Milan. It is designed for high-throughput crawling for research and dataset building, not to power a consumer search engine. Because anyone can run the open-source software, a BUbiNG user agent indicates the crawler software, not a single operator.
What this means
BUbiNG is an open-source, distributed, high-throughput web crawler from the LAW group at the University of Milan, the same group behind UbiCrawler and the WebGraph graph-compression framework. It is used to crawl large portions of the web for research and to build datasets, not to serve a public search engine.
Because the software is open source, a BUbiNG user agent tells you the crawler software in use, not a single fixed operator. Different deployments may be run by different researchers or organisations.
How it identifies itself
The crawler sends a user-agent string containing the BUbiNG token, typically followed by a self-identifying URL or contact set by whoever runs it. Match on the BUbiNG token, but remember that the contact/operator portion is configurable by the deployer.
Do not assume a fixed IP range or operator; verify each deployment by its own self-identifying details and by its behaviour.
- Open-source crawler from LAW, University of Milan
- High-throughput, research/dataset focused
- Operator varies by deployment; not a search product
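The matching advice above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name, the "(+URL)" layout of the contact portion, and the sample strings in the usage note are assumptions, since each deployer chooses what follows the token.

```python
import re

# Match on the software token only, case-insensitively. Everything after
# the token is deployer-configurable and is not relied on for classification.
BUBING_TOKEN = re.compile(r"\bBUbiNG\b", re.IGNORECASE)

# Assumed layout for the self-identifying portion: a URL in parentheses,
# optionally prefixed with "+". Real deployments may format this differently.
CONTACT_URL = re.compile(r"\(\+?(https?://[^)\s]+)\)")


def classify_user_agent(user_agent: str) -> dict:
    """Report whether the BUbiNG token is present and, if so, any contact URL."""
    is_bubing = bool(BUBING_TOKEN.search(user_agent))
    contact_url = None
    if is_bubing:
        match = CONTACT_URL.search(user_agent)
        if match:
            contact_url = match.group(1)
    return {"is_bubing": is_bubing, "contact_url": contact_url}
```

For example, calling classify_user_agent on the hypothetical string "BUbiNG (+http://crawler.example.org/about)" reports the token as present and returns the URL as the contact. The contact is informational only: it identifies one deployment, not BUbiNG traffic in general.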
How it appears in analytics and logs
A request carrying a BUbiNG user agent means someone is running the BUbiNG crawler, often for academic or dataset purposes. The identity behind it is not fixed, so treat it as research bot traffic and verify behaviour rather than assuming a specific organisation.
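One way to see that the identity is not fixed is to group BUbiNG requests by their full user-agent string, so that separate deployments show up as separate rows. The sketch below assumes access logs in the common "combined" format, where the user agent is the last quoted field on each line; the function name is illustrative.

```python
import re
from collections import Counter

# Assumption: combined log format, user agent is the final quoted field.
LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def bubing_user_agents(log_lines):
    """Count requests per distinct user-agent string containing the BUbiNG token."""
    counts = Counter()
    for line in log_lines:
        match = LAST_QUOTED_FIELD.search(line)
        if match and "bubing" in match.group(1).lower():
            # Key on the full string: the deployer-set portion is what
            # distinguishes one deployment from another.
            counts[match.group(1)] += 1
    return counts
```

More than one key in the result suggests more than one deployment, each of which may warrant its own look at request rate and paths fetched.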
Diagnostic use case
Recognise BUbiNG-identified requests as research/dataset crawling, understand that the operator is whoever deployed the open-source crawler, and set policy per deployment accordingly.
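If the policy you settle on is expressed in robots.txt, it can be checked before publishing with Python's standard urllib.robotparser. The rules and paths below are hypothetical, and the sketch assumes the deployment in question honours robots.txt and matches on the BUbiNG token; confirm that against observed behaviour.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: keep BUbiNG out of /private/, allow everything else.
ROBOTS_TXT = """\
User-agent: BUbiNG
Disallow: /private/

User-agent: *
Disallow:
"""


def bubing_may_fetch(url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given rules would let a BUbiNG-token agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("BUbiNG", url)
```

This only tests what the rules say. Whether a particular deployment obeys them still has to be verified from its requests.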
What WebmasterID can help detect
WebmasterID classifies BUbiNG as a research crawler distinct from search engines, so its requests are visible separately and excluded from human analytics, even though the operator behind it can vary.
Common mistakes
- Assuming all BUbiNG traffic comes from one organisation.
- Treating a research crawler as a search-engine indexer.
- Inventing a fixed IP range for an open-source crawler anyone can run.
Privacy and accuracy notes
BUbiNG is identified by its user-agent token only. It is crawler software, not a person; WebmasterID records it as a bot event with no visitor profile attached.
Related pages
- Magpie-crawler (Brandwatch)
Magpie-crawler is a crawler that has been associated with Brandwatch's Magpie data-collection infrastructure for social and web monitoring. It fetches publicly available pages to support media monitoring and analytics rather than a consumer search engine. The self-identifying token is observable; published specifics are limited, so this entry is partially verified.
- BLEXBot — WebMeUp backlink crawler
BLEXBot is a crawler associated with WebMeUp/SEO backlink tooling. It is a third-party crawler that builds a backlink index, not a search engine, so it does not affect search rankings directly. Its robots.txt token is BLEXBot; some specifics are marked partially verified.
- Managing third-party SEO crawler load
Third-party SEO crawlers such as AhrefsBot and SemrushBot can generate significant request volume without contributing to search visibility. You can manage their load by targeting their tokens in robots.txt, using crawl-delay where the crawler supports it, and blocking those that bring no value to you.
- Web crawlers
How research and dataset crawlers are categorised.
Sources and verification notes
- LAW: BUbiNG crawler. Open-source crawler; operator varies, so per-deployment specifics are not fixed.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.