Semantic Scholar academic crawler

Semantic Scholar is a free academic search engine and research corpus built by the Allen Institute for AI (AI2). Its crawler fetches scholarly pages, papers, and metadata to index research literature and power citation-aware academic search. Unlike a general web crawler, it focuses on academic and publisher content, and AI2 publishes documentation and an open API around the corpus it builds.

Verified against primary sources

What this means

Semantic Scholar indexes scientific and academic literature, layering citation context and AI-assisted features over a large research corpus. Its crawler gathers papers and scholarly pages to keep that corpus current.

If your site hosts research, preprints, or publisher content, Semantic Scholar's crawler may fetch it to include in academic search results. This is scholarly indexing, distinct from general web search and from corpora gathered to train foundation models.

How it identifies itself

Requests from the Semantic Scholar crawler carry a self-identifying user-agent string that references Semantic Scholar and AI2. Match on the stable Semantic Scholar identity rather than on a version string, which can change between releases. AI2 also documents an open Academic Graph API, so much access to the corpus happens through the API rather than through crawling.

As with any crawler, the user-agent is a claim and can be copied; corroborate with behaviour where authenticity matters.
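
A minimal sketch of matching on the stable identity rather than a version string. The token `SemanticScholarBot` is an assumption that this page does not confirm, so check AI2's crawler documentation for the current value. The function name is hypothetical and the header values are illustrative, not captured traffic.

```python
# Assumed token: verify against AI2's crawler documentation before use.
ASSUMED_TOKEN = "SemanticScholarBot"


def claims_semantic_scholar(user_agent: str, token: str = ASSUMED_TOKEN) -> bool:
    """Return True if the user-agent string claims to be the Semantic Scholar crawler.

    The match is a case-insensitive substring test on the stable token, so a
    changed version number or surrounding text does not break it. A True
    result is only a claim: any client can copy the header.
    """
    return token.lower() in user_agent.lower()


# Illustrative header values, not captured traffic.
print(claims_semantic_scholar("Mozilla/5.0 (compatible) SemanticScholarBot"))
print(claims_semantic_scholar("Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"))
```

Because the test looks only at the header, it answers "what does this request claim to be", not "who sent it". Where authenticity matters, pair it with behavioural checks.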

robots.txt considerations

To express a crawl preference for Semantic Scholar, target its documented user-agent token in robots.txt. Because Semantic Scholar also offers an open API, blocking the crawler does not necessarily remove already-collected open metadata.

robots.txt is honoured by compliant crawlers, but it is a stated preference and not an access control. Before blocking, consider that scholarly indexing can increase the discoverability of legitimate research hosted on your site.
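
One way to check a crawl preference before publishing it is to run the rules through Python's standard-library `urllib.robotparser`. This is a sketch under stated assumptions: the token `SemanticScholarBot` is assumed rather than confirmed by this page, and the `/drafts/` policy, the `example.org` URLs, and the `is_allowed` helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed token: verify against AI2's crawler documentation before use.
ASSUMED_TOKEN = "SemanticScholarBot"

# Hypothetical policy: keep the crawler out of /drafts/, allow everything else.
ROBOTS_TXT = f"""\
User-agent: {ASSUMED_TOKEN}
Disallow: /drafts/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Evaluate a robots.txt body as the standard-library parser interprets it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, ASSUMED_TOKEN, "https://example.org/papers/a.pdf"))
print(is_allowed(ROBOTS_TXT, ASSUMED_TOKEN, "https://example.org/drafts/a.pdf"))
```

The check only shows how a compliant parser reads the file. It says nothing about clients that ignore robots.txt, and nothing about metadata that was already collected.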

How it appears in analytics and logs

A Semantic Scholar request means its academic search crawler reached your page, typically because the page hosts scholarly content or links to research. It is search-indexing bot traffic scoped to academic literature, not a human visit and not part of a general web index.

Diagnostic use case

Recognise Semantic Scholar's academic crawler in logs, distinguish scholarly indexing from general web search and from AI-training crawlers, and read it as research-literature coverage.
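
The distinction above can be sketched as a small log tally. The sketch assumes access logs in the combined log format, where the user-agent is the last double-quoted field. The token `SemanticScholarBot` is an assumption, `Googlebot` and `GPTBot` stand in as illustrative examples of general-search and AI-training crawlers, and the category names and sample lines are hypothetical.

```python
import re
from collections import Counter

# Lowercased substring -> category. "semanticscholarbot" is an assumed token;
# the other two are illustrative examples of the categories named in the text.
CATEGORIES = {
    "semanticscholarbot": "academic-search",
    "googlebot": "general-search",
    "gptbot": "ai-training",
}

# In the combined log format the user-agent is the last double-quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def classify(line: str) -> str:
    """Return the category for one log line, or "other" if no token matches."""
    match = UA_FIELD.search(line)
    user_agent = match.group(1).lower() if match else ""
    for token, category in CATEGORIES.items():
        if token in user_agent:
            return category
    return "other"


def tally(lines) -> Counter:
    """Count log lines per category."""
    return Counter(classify(line) for line in lines)


# Hypothetical sample lines, not captured traffic.
PREFIX = '203.0.113.5 - - [01/Jan/2026:00:00:01 +0000] "GET /paper/1 HTTP/1.1" 200 512 "-" '
SAMPLE = [
    PREFIX + '"Mozilla/5.0 (compatible) SemanticScholarBot"',
    PREFIX + '"Mozilla/5.0 (compatible; Googlebot/2.1)"',
    PREFIX + '"Mozilla/5.0 (compatible; GPTBot/1.0)"',
    PREFIX + '"Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"',
]

for category, count in sorted(tally(SAMPLE).items()):
    print(category, count)
```

Keeping the categories in one mapping makes it easy to report academic-index coverage separately from other bot traffic and from human visits.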

What WebmasterID can help detect

WebmasterID classifies the Semantic Scholar crawler server-side as an academic search bot and reports its activity on the bot-intelligence surface, so scholarly-index coverage stays separate from human analytics.

Common mistakes

Privacy and accuracy notes

Identification uses only the request user-agent. No visitor identity is involved. WebmasterID records the crawl as a bot event, separate from human analytics, and never attaches it to a profile.

Sources and verification notes

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.