Search bots

CORE academic aggregator crawler

CORE is one of the world's largest aggregators of open-access research papers. It harvests content from institutional and subject repositories to provide unified scholarly search and a dataset. Its crawler and harvesters fetch open-access papers and metadata from repositories rather than indexing the general web. In logs it appears as scholarly harvesting, typically against repository and publisher endpoints.

Verified against primary sources

What this means

CORE aggregates open-access research from thousands of repositories and journals worldwide, offering search, a dataset, and an API over harvested scholarly content. It is run as a not-for-profit scholarly service.

If you operate an institutional repository or open-access journal, CORE may harvest your papers and metadata to include them in its aggregated index. This is scholarly harvesting, distinct from general web search crawling.

How it identifies itself

CORE harvesting carries a CORE-identifying user-agent and frequently uses standard repository protocols such as OAI-PMH for metadata, alongside HTTP fetches for full text. Match on the CORE identity and the harvesting context rather than an exact version string.

As with any crawler, the user-agent is a claim and can be copied. Corroborate with behaviour where authenticity matters.

Operator: CORE (open-access research aggregator)
Scope: open-access papers and metadata from repositories
Protocols: often OAI-PMH metadata harvesting plus full-text fetch
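
A minimal sketch of matching on identity plus harvesting context, rather than on an exact version string. The `CORE` token and the sample user-agent in the usage note are assumptions for illustration, not CORE's documented string; check CORE's own documentation before relying on them. The six verbs are the request types defined by OAI-PMH 2.0. The function names are hypothetical.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Assumption: the product token "CORE" appears in the user-agent,
# optionally followed by a version. Not taken from CORE's documentation.
CORE_UA_PATTERN = re.compile(r"\bCORE\b(?:/[\w.]+)?")

# The six request verbs defined by OAI-PMH 2.0.
OAI_PMH_VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def is_oai_pmh_request(url: str) -> bool:
    """Return True if the URL's query string carries an OAI-PMH verb."""
    query = parse_qs(urlsplit(url).query)
    return any(verb in OAI_PMH_VERBS for verb in query.get("verb", []))


def classify_request(user_agent: str, url: str) -> str:
    """Label a request by claimed identity and harvesting context.

    The user-agent is only a claim, so the label means
    "claims to be CORE", not "verified as CORE".
    """
    claims_core = bool(CORE_UA_PATTERN.search(user_agent))
    if claims_core and is_oai_pmh_request(url):
        return "core-metadata-harvest"
    if claims_core:
        return "core-fetch"
    return "other"
```

For example, `classify_request("Mozilla/5.0 (compatible; CORE/0.0; hypothetical)", "/oai?verb=ListRecords&metadataPrefix=oai_dc")` returns `"core-metadata-harvest"`, while the same user-agent requesting a PDF path returns `"core-fetch"`. Because the pattern ignores the version, a version bump does not break the match.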

robots.txt considerations

To express a crawl preference for CORE, target its documented user-agent token in robots.txt. Note that repository metadata is often exposed deliberately via OAI-PMH for harvesting, so blocking the HTTP crawler may not stop metadata aggregation.

robots.txt is honoured by compliant crawlers and is not an access control. For open-access content, harvesting generally increases discoverability of your research.
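
One way to check how a rule group would be read by a compliant crawler is Python's standard `urllib.robotparser`. The sketch below assumes a user-agent token of `CORE` and a hypothetical repository at `repo.example.org`; substitute the token CORE documents. It only tests the robots.txt rules and says nothing about OAI-PMH access, which is governed separately.

```python
from urllib.robotparser import RobotFileParser

# Assumption: "CORE" stands in for the documented user-agent token.
ROBOTS_TXT = """\
User-agent: CORE
Disallow: /private/

User-agent: *
Disallow:
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URLs on a hypothetical repository host.
    print(allowed(ROBOTS_TXT, "CORE", "https://repo.example.org/private/draft.pdf"))
    print(allowed(ROBOTS_TXT, "CORE", "https://repo.example.org/papers/1.pdf"))
```

With these rules, the `/private/` path is disallowed for the assumed `CORE` token and the `/papers/` path is allowed; other crawlers fall through to the wildcard group, which permits everything.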

How it appears in analytics and logs

A CORE request means an open-access aggregator harvested scholarly content or metadata from your repository. It is academic-harvesting bot traffic, not a human visit and not a general web-search crawl; it reflects inclusion in a unified research index.

Diagnostic use case

Recognise CORE's open-access harvesting in repository logs, distinguish scholarly aggregation from general web search, and read it as research-content collection.

What WebmasterID can help detect

WebmasterID classifies CORE's harvester server-side as an academic-aggregation bot and reports it on the bot-intelligence surface, so scholarly harvesting stays separate from human analytics.

Common mistakes

Confusing scholarly open-access harvesting with general web search indexing.
Blocking the HTTP crawler while still exposing metadata via OAI-PMH.
Counting harvesting hits as human research readers in analytics.
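
To illustrate the last point, here is a small sketch that tallies harvesting hits separately from everything else so they do not inflate reader counts. It is not how WebmasterID classifies traffic; the `CORE` token, the record shape, and the sample user-agents are all assumptions.

```python
import re
from collections import Counter

# Assumption: "CORE" stands in for the documented user-agent token.
HYPOTHETICAL_CORE_TOKEN = re.compile(r"\bCORE\b")


def split_hits(records):
    """Count (user_agent, path) records as harvester hits or other hits."""
    counts = Counter()
    for user_agent, _path in records:
        if HYPOTHETICAL_CORE_TOKEN.search(user_agent):
            counts["academic_harvester"] += 1
        else:
            counts["other"] += 1
    return counts


# Made-up sample records for illustration only.
SAMPLE = [
    ("Mozilla/5.0 (compatible; CORE/0.0; hypothetical)", "/oai"),
    ("Mozilla/5.0 (compatible; CORE/0.0; hypothetical)", "/papers/1.pdf"),
    ("Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0", "/papers/1.pdf"),
]
```

Here `split_hits(SAMPLE)` counts two harvester hits and one other hit, so only the last record would be a candidate for human readership figures.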

Privacy and accuracy notes

Identification uses only the request user-agent and harvesting context. No visitor identity is involved. WebmasterID records the fetch as a bot event, separate from human analytics, and never attaches it to a profile.


Sources and verification notes

CORE (open-access research aggregator): aggregates open-access papers from repositories; harvesting documented.

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.