How to block BUbiNG
BUbiNG is an open-source, high-throughput web crawler developed for research and large-scale web data collection. Because anyone can run an instance, its behavior depends on the operator. This page shows the robots.txt token to target and why a Disallow only steers compliant deployments.
What BUbiNG is
BUbiNG is an open-source distributed crawler from the Laboratory for Web Algorithmics, designed for high-throughput web data collection in research contexts. Because it is software anyone can deploy, the crawler you see is run by whoever set up that instance, and politeness settings are operator-controlled.
Match on the documented BUbiNG user-agent token rather than a version string. Since deployments vary, a robots.txt rule reaches only the instances configured to honour it.
- robots.txt token: BUbiNG (self-identifying user agent)
- Open-source research crawler; run by many operators
- Politeness and volume depend on the deployment
robots.txt rule
To ask BUbiNG to stay off your site:
User-agent: BUbiNG
Disallow: /
BUbiNG deployments that respect robots.txt will back off. A deployment that ignores robots.txt, or runs with politeness disabled, is not stopped by this rule. If crawl load continues, escalate to edge-level rate limiting or a WAF.
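Before relying on the rule, it can help to check that a standards-following parser reads it as intended. The sketch below uses Python's standard-library urllib.robotparser; the user-agent strings and the example.com URL are hypothetical placeholders, and the result shows only how a compliant parser interprets the file, not what any particular BUbiNG deployment will do.

```python
from urllib.robotparser import RobotFileParser

# The rule from this page, inlined so the check runs without network access.
ROBOTS_TXT = """\
User-agent: BUbiNG
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant parser would let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical user-agent strings, used for illustration only.
bubing_allowed = is_allowed(
    ROBOTS_TXT, "BUbiNG (+http://example.org/crawler-info)", "https://example.com/page"
)
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/page")

print("BUbiNG allowed:", bubing_allowed)
print("Other bot allowed:", other_allowed)
```

The group applies only to agents matching the BUbiNG token, so other crawlers remain unaffected by this rule.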
How it appears in analytics and logs
Requests carrying the BUbiNG token are research-crawler events, not human visits. Because BUbiNG is run by many operators, volume and politeness vary by deployment; treat the hits as bot traffic.
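As a rough illustration of separating these hits from other traffic, the sketch below counts requests whose user agent contains the BUbiNG token. It assumes access-log lines in the common "combined" format, where the user agent is the last quoted field; the sample lines, IP addresses, and user-agent strings are invented for the example.

```python
import re
from collections import Counter

# Hypothetical sample lines in the "combined" access-log format.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2025:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "BUbiNG (+http://example.org/bot)"',
    '198.51.100.7 - - [01/Jan/2025:00:00:02 +0000] "GET /about HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
    '203.0.113.5 - - [01/Jan/2025:00:00:03 +0000] "GET /robots.txt HTTP/1.1" 200 40 "-" "BUbiNG (+http://example.org/bot)"',
]

# In the combined format the user agent is the last quoted field on the line.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')
# Match the token case-insensitively and ignore any version or URL suffix.
TOKEN = re.compile(r"bubing", re.IGNORECASE)


def split_traffic(lines):
    """Count log lines whose user agent carries the BUbiNG token versus all others."""
    counts = Counter()
    for line in lines:
        match = UA_FIELD.search(line)
        user_agent = match.group(1) if match else ""
        counts["bubing" if TOKEN.search(user_agent) else "other"] += 1
    return counts


counts = split_traffic(SAMPLE_LOG)
print(counts["bubing"], "BUbiNG requests,", counts["other"], "other requests")
```

Matching on the token alone is deliberate: a user-agent header can be spoofed, so this identifies self-declared BUbiNG traffic rather than proving who operates the crawler.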
Diagnostic use case
Reduce crawl load from BUbiNG-based research crawlers on your public pages and confirm the disallow targeted the right token.
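One way to run this check by hand is to look for clients that send the BUbiNG token yet still request paths other than /robots.txt after the Disallow rule is live. The sketch below assumes requests have already been parsed into (client IP, path, user agent) tuples, that all of them were logged after the rule was published, and that any cached copy of the old robots.txt has expired; the data is invented for illustration.

```python
from collections import defaultdict

# Hypothetical pre-parsed requests: (client_ip, path, user_agent).
REQUESTS = [
    ("203.0.113.5", "/robots.txt", "BUbiNG (+http://example.org/bot)"),
    ("203.0.113.9", "/robots.txt", "BUbiNG (+http://example.org/bot)"),
    ("203.0.113.9", "/products", "BUbiNG (+http://example.org/bot)"),
    ("198.51.100.7", "/products", "Mozilla/5.0 (X11; Linux x86_64)"),
]


def non_compliant_clients(requests, token="bubing"):
    """Map each client IP sending the token to the disallowed paths it still requested."""
    violations = defaultdict(list)
    for client_ip, path, user_agent in requests:
        # With "Disallow: /", only robots.txt itself is expected to be fetched.
        if token in user_agent.lower() and path != "/robots.txt":
            violations[client_ip].append(path)
    return dict(violations)


flagged = non_compliant_clients(REQUESTS)
for client_ip, paths in flagged.items():
    print(client_ip, "still requested:", ", ".join(paths))
```

Clients that appear in the result are candidates for edge-level rate limiting or a WAF rule, since robots.txt alone has not changed their behaviour.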
What WebmasterID can help detect
WebmasterID classifies BUbiNG server-side as a crawler and shows whether it keeps reaching your pages after a robots.txt rule, helping you tell compliant deployments from ones that ignore it.
Common mistakes
- Assuming one rule stops every BUbiNG instance regardless of its configuration.
- Expecting robots.txt to enforce a block rather than request compliance.
- Counting research-crawler hits as human sessions.
Privacy and accuracy notes
Blocking BUbiNG uses only the request user-agent token. No visitor identity is involved, and WebmasterID records the crawl as a bot event separate from human analytics.
Related pages
- BUbiNG research crawler
BUbiNG is an open-source distributed web crawler developed by the Laboratory for Web Algorithmics (LAW) at the University of Milan. It is designed for high-throughput crawling for research and dataset building, not to power a consumer search engine. Because anyone can run the open-source software, a BUbiNG user agent indicates the crawler software, not a single operator.
- robots.txt vs a firewall/WAF
robots.txt and a firewall/WAF solve different problems: robots.txt politely asks compliant crawlers what to skip, while a firewall or WAF actually blocks requests at the network or edge layer. This page contrasts the two, explains when each is appropriate, and warns against using robots.txt for jobs only enforcement can do.
- robots.txt basics: what it does and what it cannot do
robots.txt is a plain-text file at your site root that tells compliant crawlers which paths they may request. This page covers the directives, how user-agent groups are matched, and the limits that trip people up: robots.txt is advisory, it does not hide pages from search, and it is not a security boundary.
- Bot intelligence
Categorise research and open-source crawlers in your traffic.
Sources and verification notes
- BUbiNG (Laboratory for Web Algorithmics): BUbiNG open-source crawler project page; token matched on the self-identifying user agent.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.