Semantic Scholar academic crawler

Semantic Scholar is a free academic search engine and research corpus built by the Allen Institute for AI (AI2). Its crawler fetches scholarly pages, papers, and metadata to index research literature and power citation-aware academic search. Unlike a general web crawler, it focuses on academic and publisher content, and AI2 publishes documentation and an open API around the corpus it builds.

What this means

Semantic Scholar indexes scientific and academic literature, layering citation context and AI-assisted features over a large research corpus. Its crawler gathers papers and scholarly pages to keep that corpus current.

If your site hosts research, preprints, or publisher content, Semantic Scholar's crawler may fetch it to include in academic search results. This is scholarly indexing, distinct from general web search and from corpora gathered to train foundation models.

How it identifies itself

Semantic Scholar's crawler sends a self-identifying user-agent that references Semantic Scholar and AI2. Match on the stable Semantic Scholar identity rather than on a version string, which can change between releases. AI2 also documents an open Academic Graph API, so much access to the corpus happens through the API rather than through crawling.

As with any crawler, the user-agent is a claim and can be copied; corroborate with behaviour where authenticity matters.

Operator: Allen Institute for AI (AI2)
Scope: academic papers, scholarly pages, and metadata
Open Semantic Scholar Academic Graph API also available
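
The advice to match on a stable identity rather than a version string can be sketched as below. This is a minimal illustration, not AI2's documented format: `ExampleScholarBot` is a hypothetical placeholder token, and `matches_crawler` is a hypothetical helper name. Substitute the token AI2 documents for its crawler.

```python
import re

# Hypothetical placeholder: replace with the user-agent token AI2 documents.
CRAWLER_TOKEN = "ExampleScholarBot"


def matches_crawler(user_agent: str, token: str = CRAWLER_TOKEN) -> bool:
    """Return True if the user-agent names the token, whatever version follows it."""
    if not user_agent:
        return False
    # Require the token to stand alone (not embedded in a longer word),
    # and ignore case so "examplescholarbot/2.0" still matches.
    pattern = re.compile(
        r"(?<![A-Za-z0-9])" + re.escape(token) + r"(?![A-Za-z0-9])",
        re.IGNORECASE,
    )
    return pattern.search(user_agent) is not None


if __name__ == "__main__":
    print(matches_crawler("Mozilla/5.0 (compatible; ExampleScholarBot/1.2)"))
    print(matches_crawler("Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"))
```

Because the version suffix is never part of the pattern, a crawler release that changes only its version number keeps matching. A match still only shows what the request claims to be, since the header can be copied.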

robots.txt considerations

To express a crawl preference for Semantic Scholar, target its documented user-agent token in robots.txt. Because Semantic Scholar also offers an open API, blocking the crawler does not necessarily remove already-collected open metadata.

robots.txt is a voluntary signal: compliant crawlers honour it, but it is not an access control. Before blocking, consider that scholarly indexing can increase the discoverability of legitimate research hosted on your site.
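
One way to check a rule before publishing it is to parse a draft robots.txt with Python's standard `urllib.robotparser` and ask what a given user-agent may fetch. The sketch below is illustrative only: `ExampleScholarBot` is a hypothetical placeholder for the documented token, and the `/drafts/` path and `example.org` URLs are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical placeholder: replace with the user-agent token AI2 documents.
CRAWLER_TOKEN = "ExampleScholarBot"

# Draft rules: keep the academic crawler out of /drafts/, allow everything else.
ROBOTS_TXT = """\
User-agent: ExampleScholarBot
Disallow: /drafts/

User-agent: *
Disallow:
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text in memory, without fetching anything."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    # Blocked path for the named crawler.
    print(parser.can_fetch(CRAWLER_TOKEN, "https://example.org/drafts/paper.pdf"))
    # Any other path stays open to it.
    print(parser.can_fetch(CRAWLER_TOKEN, "https://example.org/papers/paper.pdf"))
```

This only shows how a compliant parser would read the rules. It does not stop a client that ignores robots.txt.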

How it appears in analytics and logs

A Semantic Scholar request in your logs means its academic search crawler fetched your page, typically because the page hosts scholarly content or links to research. Treat it as search-indexing bot traffic scoped to academic literature: it is neither a human visit nor general web indexing.
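
To keep such requests out of human counts, access-log lines can be tallied by user-agent. The sketch below assumes the common "combined" log format, where the user-agent is the last double-quoted field. `ExampleScholarBot` is a hypothetical placeholder token and `tally_hits` a hypothetical helper, not part of any product.

```python
import re
from collections import Counter

# Hypothetical placeholder: replace with the user-agent token AI2 documents.
CRAWLER_TOKEN = "ExampleScholarBot"

# In the combined log format the user-agent is the final double-quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def tally_hits(log_lines, token=CRAWLER_TOKEN):
    """Count lines by whether their user-agent names the academic crawler."""
    counts = Counter()
    needle = token.lower()
    for line in log_lines:
        match = UA_FIELD.search(line)
        if match is None:
            counts["unparsed"] += 1          # no trailing quoted field found
        elif needle in match.group(1).lower():
            counts["academic_crawler"] += 1  # claims to be the crawler
        else:
            counts["other"] += 1             # humans and any other bots
    return counts


if __name__ == "__main__":
    sample = [
        '203.0.113.5 - - [24/Jun/2026:10:00:00 +0000] "GET /papers/1.pdf HTTP/1.1" '
        '200 5120 "-" "Mozilla/5.0 (compatible; ExampleScholarBot/1.0)"',
        '198.51.100.7 - - [24/Jun/2026:10:00:05 +0000] "GET /papers/1.pdf HTTP/1.1" '
        '200 5120 "https://example.org/" "Mozilla/5.0 (X11; Linux x86_64)"',
    ]
    print(dict(tally_hits(sample)))
```

The "other" bucket is deliberately not labelled "human": it still contains every other bot, so it is an upper bound on human readers rather than a count of them.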

Diagnostic use case

Recognise Semantic Scholar's academic crawler in logs, distinguish scholarly indexing from general web search and from AI-training crawlers, and read it as research-literature coverage.

What WebmasterID can help detect

WebmasterID classifies the Semantic Scholar crawler server-side as an academic search bot and surfaces its activity on the bot-intelligence surface, so scholarly-index coverage stays separate from human analytics.

Common mistakes

Confusing scholarly academic indexing with general web search indexing.
Assuming Semantic Scholar is an AI-training crawler rather than an academic search engine.
Counting academic crawl hits as human research-paper readers.

Privacy and accuracy notes

Identification uses only the request user-agent. No visitor identity is involved. WebmasterID records the crawl as a bot event, separate from human analytics, and never attaches it to a profile.

Sources and verification notes

Semantic Scholar: about and API (Allen Institute for AI). Academic search corpus and open Academic Graph API documented.

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.