Semantic Scholar academic crawler
Semantic Scholar is a free academic search engine and research corpus built by the Allen Institute for AI (AI2). Its crawler fetches scholarly pages, papers, and metadata to index research literature and power citation-aware academic search. Unlike a general web crawler, it focuses on academic and publisher content, and AI2 publishes documentation and an open API around the corpus it builds.
What this means
Semantic Scholar indexes scientific and academic literature, layering citation context and AI-assisted features over a large research corpus. Its crawler gathers papers and scholarly pages to keep that corpus current.
If your site hosts research, preprints, or publisher content, Semantic Scholar's crawler may fetch it to include in academic search results. This is scholarly indexing, distinct from general web search and from corpora gathered to train foundation models.
How it identifies itself
Requests from the Semantic Scholar crawler carry a self-identifying user-agent that references Semantic Scholar and AI2. Match on the stable Semantic Scholar identity rather than on a version string, which can change between releases. AI2 also documents an open Academic Graph API, so much access to the corpus happens through the API rather than through crawling.
As with any crawler, the user-agent is a claim and can be copied; corroborate with behaviour where authenticity matters.
- Operator: Allen Institute for AI (AI2)
- Scope: academic papers, scholarly pages, and metadata
- API: open Semantic Scholar Academic Graph API also available
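A minimal sketch of matching on the stable identity rather than a version string. The token `SemanticScholarBot` and the helper name `is_semantic_scholar_claim` are assumptions used for illustration; check the token against AI2's current crawler documentation before relying on it. A match only confirms what the request claims to be, not that the claim is genuine.

```python
import re

# Assumption: "SemanticScholarBot" is the stable identity token.
# Verify it against AI2's crawler documentation; it is not taken from this page.
SEMANTIC_SCHOLAR_TOKEN = re.compile(r"semanticscholarbot", re.IGNORECASE)


def is_semantic_scholar_claim(user_agent: str) -> bool:
    """Return True if the user-agent claims to be the Semantic Scholar crawler.

    The match is case-insensitive and ignores any version suffix, so a
    version bump in the user-agent does not break detection.
    """
    return bool(user_agent) and SEMANTIC_SCHOLAR_TOKEN.search(user_agent) is not None
```

Because the check looks only for the identity token, a user-agent such as `semanticscholarbot/2.1` and one without any version both classify the same way.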
robots.txt considerations
To express a crawl preference for Semantic Scholar, target its documented user-agent token in robots.txt. Because Semantic Scholar also offers an open API, blocking the crawler does not necessarily remove already-collected open metadata.
robots.txt is honoured by compliant crawlers and is not an access control. Before blocking, weigh that scholarly indexing can increase the discoverability of legitimate research hosted on your site.
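One way to check a crawl preference before publishing it is to parse the draft rules with Python's standard `urllib.robotparser`. This is a sketch under assumptions: the token `SemanticScholarBot`, the `/drafts/` path, and the `example.org` URLs are hypothetical placeholders, and the helper name `can_fetch` is illustrative. It shows how a compliant parser would read the rules, not how Semantic Scholar's own crawler is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: keep the crawler out of /drafts/ only.
# "SemanticScholarBot" is an assumed token; confirm it in AI2's documentation.
ROBOTS_TXT = """\
User-agent: SemanticScholarBot
Disallow: /drafts/

User-agent: *
Disallow:
"""


def can_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Return whether a compliant crawler named `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

With this draft, `can_fetch(ROBOTS_TXT, "SemanticScholarBot", "https://example.org/drafts/paper.pdf")` is disallowed while published paper URLs stay crawlable. The rule expresses a preference only: it does not block non-compliant clients and does not retract metadata already available through the open API.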
How it appears in analytics and logs
A Semantic Scholar request means its academic search crawler reached your page, typically because it hosts scholarly content or links to research. It is search-indexing bot traffic scoped to academic literature, not a human visit and not a general web index.
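To keep these hits out of human counts when reading raw logs, a small tally can separate them by user-agent. The sketch below assumes access logs in Combined Log Format and the assumed token `SemanticScholarBot`; the function name `tally` and the category labels are illustrative. Non-matching requests are labelled `other` rather than `human`, since they may include other bots.

```python
import re
from collections import Counter

# Combined Log Format ends with: "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"\s*$'
)

# Assumed identity token; verify against AI2's crawler documentation.
BOT_TOKEN = "semanticscholarbot"


def tally(lines):
    """Count log lines by category, based on the claimed user-agent only."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            counts["unparsed"] += 1
        elif BOT_TOKEN in match.group("agent").lower():
            counts["academic_search_bot"] += 1
        else:
            counts["other"] += 1
    return counts
```

Calling `tally(open("access.log"))` on a log file (the file name is a placeholder) returns a `Counter` whose `academic_search_bot` entry can be reported separately from page-view metrics.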
Diagnostic use case
Recognise Semantic Scholar's academic crawler in logs, distinguish scholarly indexing from general web search and from AI-training crawlers, and interpret its hits as research-literature coverage rather than readership.
What WebmasterID can help detect
WebmasterID classifies the Semantic Scholar crawler server-side as an academic search bot and reports its activity on the bot-intelligence surface, so scholarly-index coverage stays separate from human analytics.
Common mistakes
- Confusing scholarly academic indexing with general web search indexing.
- Assuming Semantic Scholar is an AI-training crawler rather than an academic search engine.
- Counting academic crawl hits as human research-paper readers.
Privacy and accuracy notes
Identification uses only the request user-agent. No visitor identity is involved. WebmasterID records the crawl as a bot event, separate from human analytics, and never attaches it to a profile.
Related pages
- Academic and research crawlers — overview
Academic and research crawlers fetch scholarly papers and metadata to build research search engines, open catalogues, and citation infrastructure. This overview covers how Semantic Scholar, CORE, OpenAlex, and Crossref differ from general web crawlers, why much of their work is metadata harvesting via standard protocols, and how to set policy. For sites hosting research, they generally increase scholarly discoverability.
- OpenAlex crawler
OpenAlex, run by the non-profit OurResearch, is a free and open catalogue of the global research system — papers, authors, institutions, venues, and concepts — offered as data and an API. Its crawler and harvesters gather scholarly metadata and links to build an open scientific knowledge graph. It is a research-metadata aggregator rather than a general web search engine.
- Crossref crawler
Crossref is a non-profit DOI registration agency that links scholarly publications through persistent identifiers and rich metadata. Its services fetch publisher landing pages and content to support DOI registration, metadata deposit, similarity checking, and link resolution. It is scholarly-infrastructure crawling for the academic citation ecosystem, not general web search indexing.
- Web crawlers
How academic and search crawlers are detected and categorised.
Sources and verification notes
- Semantic Scholar — about and API (Allen Institute for AI)
Academic search corpus and open Academic Graph API documented.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.