CORE academic aggregator crawler
CORE is one of the world's largest aggregators of open-access research papers, harvesting content from institutional and subject repositories to provide a unified scholarly search and dataset. Its crawler and harvesters fetch open-access papers and metadata from repositories rather than indexing the general web. In server logs, its requests show up as scholarly harvesting, typically against repository and publisher endpoints.
What this means
CORE aggregates open-access research from thousands of repositories and journals worldwide, offering search, a dataset, and an API over harvested scholarly content. It is run as a not-for-profit scholarly service.
If you operate an institutional repository or open-access journal, CORE may harvest your papers and metadata to include them in its aggregated index. This is scholarly harvesting, distinct from general web search crawling.
How it identifies itself
CORE harvesting carries a CORE-identifying user-agent and frequently uses standard repository protocols such as OAI-PMH for metadata, alongside HTTP fetches for full text. Match on the CORE identity and the harvesting context rather than an exact version string.
As with any crawler, the user-agent is a claim and can be copied. Corroborate with behaviour where authenticity matters.
- Operator: CORE (open-access research aggregator)
- Scope: open-access papers and metadata from repositories
- Often uses OAI-PMH metadata harvesting plus full-text fetch
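The matching advice above can be sketched in Python. This is a minimal illustration, not CORE's or WebmasterID's actual logic: the token pattern `core`, the label strings, and the function names are assumptions made for this example, so substitute the user-agent token CORE currently documents. The only protocol fact relied on is that OAI-PMH requests carry a `verb` parameter with one of six defined values.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Assumption: the harvester's user-agent contains "CORE" as a standalone token.
# The pattern ignores case and any version suffix such as "/1.0".
ASSUMED_CORE_TOKEN = re.compile(r"\bcore\b", re.IGNORECASE)

# The six request verbs defined by OAI-PMH 2.0.
OAI_PMH_VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def is_oai_pmh_request(url: str) -> bool:
    """Return True if the URL's query string carries a valid OAI-PMH verb."""
    query = parse_qs(urlsplit(url).query)
    return any(verb in OAI_PMH_VERBS for verb in query.get("verb", []))


def classify_request(user_agent: str, url: str) -> str:
    """Label a request using the claimed identity plus harvesting context."""
    if not ASSUMED_CORE_TOKEN.search(user_agent):
        return "other"
    if is_oai_pmh_request(url):
        return "core-metadata-harvest"
    return "core-http-fetch"
```

Because the user-agent is only a claim, a label such as `core-metadata-harvest` means "claims to be CORE and behaves like a harvester", not "verified CORE".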
robots.txt considerations
To express a crawl preference for CORE, target its documented user-agent token in robots.txt. Note that repository metadata is often exposed deliberately via OAI-PMH for harvesting, so blocking the HTTP crawler may not stop metadata aggregation.
robots.txt is honoured by compliant crawlers and is not an access control. For open-access content, harvesting generally increases discoverability of your research.
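As a hedged sketch, a robots.txt rule can be checked before deployment with Python's standard `urllib.robotparser`. The user-agent token `CORE`, the `/private/` path, and the host `repo.example.org` are placeholders assumed for this example. Confirm the real token against CORE's own documentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: keep the (assumed) CORE token out of /private/,
# and leave every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: CORE
Disallow: /private/

User-agent: *
Disallow:
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a compliant crawler with this token may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed("CORE", "https://repo.example.org/private/draft.pdf"))
    print(allowed("CORE", "https://repo.example.org/papers/1.pdf"))
```

The check only shows how a compliant parser reads the file. It does not stop a non-compliant client, and it says nothing about metadata exposed through a separate OAI-PMH endpoint.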
How it appears in analytics and logs
A CORE request means an open-access aggregator harvested scholarly content or metadata from your repository. It is academic-harvesting bot traffic, not a human visit and not a general web-search crawl; it reflects inclusion in a unified research index.
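To keep harvesting hits out of reader counts, a log pass along these lines may help. It is a rough sketch that assumes Combined Log Format lines and the same assumed `core` user-agent token. The bucket names are invented for this example, and "other" is not a synonym for "human", since it can still contain other bots.

```python
import re
from collections import Counter

# Assumption: CORE's user-agent contains "CORE" as a standalone token.
ASSUMED_CORE_TOKEN = re.compile(r"\bcore\b", re.IGNORECASE)

# Tail of a Combined Log Format line: "request" status bytes "referer" "agent"
LOG_TAIL = re.compile(
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def tally(lines):
    """Count log lines per bucket: core_harvest, other, or unparsed."""
    counts = Counter()
    for line in lines:
        match = LOG_TAIL.search(line.strip())
        if match is None:
            counts["unparsed"] += 1
        elif ASSUMED_CORE_TOKEN.search(match.group("agent")):
            counts["core_harvest"] += 1
        else:
            counts["other"] += 1
    return counts
```

Only the `other` bucket should feed further human-versus-bot analysis. The `core_harvest` bucket is best reported separately as aggregator activity.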
Diagnostic use case
Recognise CORE's open-access harvesting in repository logs, distinguish scholarly aggregation from general web search, and read it as research-content collection.
What WebmasterID can help detect
WebmasterID classifies CORE's harvester server-side as an academic-aggregation bot and reports it in the bot-intelligence view, so scholarly harvesting stays separate from human analytics.
Common mistakes
- Confusing scholarly open-access harvesting with general web search indexing.
- Blocking the HTTP crawler while still exposing metadata via OAI-PMH.
- Counting harvesting hits as human research readers in analytics.
Privacy and accuracy notes
Identification uses only the request user-agent and harvesting context. No visitor identity is involved. WebmasterID records the fetch as a bot event, separate from human analytics, and never attaches it to a profile.
Related pages
- Academic and research crawlers — overview
Academic and research crawlers fetch scholarly papers and metadata to build research search engines, open catalogues, and citation infrastructure. This overview covers how Semantic Scholar, CORE, OpenAlex, and Crossref differ from general web crawlers, why much of their work is metadata harvesting via standard protocols, and how to set policy. For sites hosting research, they generally increase scholarly discoverability.
- OpenAlex crawler
OpenAlex, run by the non-profit OurResearch, is a free and open catalogue of the global research system — papers, authors, institutions, venues, and concepts — offered as data and an API. Its crawler and harvesters gather scholarly metadata and links to build an open scientific knowledge graph. It is a research-metadata aggregator rather than a general web search engine.
- Semantic Scholar academic crawler
Semantic Scholar is a free academic search engine and research corpus built by the Allen Institute for AI (AI2). Its crawler fetches scholarly pages, papers, and metadata to index research literature and power citation-aware academic search. Unlike a general web crawler, it focuses on academic and publisher content, and AI2 publishes documentation and an open API around the corpus it builds.
- Web crawlers
How academic aggregators and search crawlers are categorised.
Sources and verification notes
- CORE (open-access research aggregator)
Aggregates open-access papers from repositories; harvesting documented.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.