Backfill and reprocessing
When a pipeline misses data or processes it with a bug, backfilling re-runs it over the affected window to correct the record. Done carelessly, a backfill appends rows on top of existing ones and double-counts, or it overwrites good data with a still-buggy transform. This page explains how to reprocess a window safely so corrections fix the gap instead of creating a new one.
Two ways a backfill goes wrong
Backfills fail in two main ways. The first is duplication: if the job appends results without first removing the prior output for that window, the corrected rows stack on top of the originals and the period double-counts. The second is overwriting with bad logic: the job replaces good data using a transform that still contains the bug, so the stored figures end up less accurate than they were before the backfill.
Either way, the historical figure changes, so any cache or report built on the old value no longer matches the stored data.
- Append-without-delete double-counts the window
- Overwriting with a still-buggy transform worsens data
- Changed history breaks downstream caches and reports
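The duplication failure can be reproduced in a few lines. This is a minimal sketch with made-up data: the table layout, the dates, and the helper names `append_backfill` and `daily_totals` are assumptions for illustration, not part of any specific pipeline.

```python
from collections import defaultdict

def append_backfill(table, corrected_rows):
    """Naive backfill: appends corrected rows without removing the prior output."""
    return table + corrected_rows

def daily_totals(table):
    """Sum the count column per day."""
    totals = defaultdict(int)
    for day, count in table:
        totals[day] += count
    return dict(totals)

# Hypothetical data: the original load for 2024-03-02 undercounted (80 instead of 100).
table = [("2024-03-01", 120), ("2024-03-02", 80)]
corrected = [("2024-03-02", 100)]

after = append_backfill(table, corrected)
print(daily_totals(after))  # {'2024-03-01': 120, '2024-03-02': 180}
```

The corrected value was 100, but the report now shows 180 because the original 80 was never removed.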
Reprocessing safely
Make the load idempotent: scope the backfill to a bounded window and replace that partition atomically, either by deleting and inserting within a single transaction or by writing to a new partition and swapping it in, so that re-running yields the same result, not more rows. Validate the corrected window against an independent source before publishing. Communicate that a historical figure changed, and re-run any downstream jobs that consumed the old version.
Idempotency keys and partition-replace are the mechanics that make this repeatable.
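One way to sketch the delete-then-insert variant is with Python's built-in `sqlite3` module. The table name `daily_events`, its columns, and the helper `replace_window` are assumptions for illustration; a production warehouse would use its own partition-replace or swap mechanism.

```python
import sqlite3

def replace_window(conn, start_day, end_day, rows):
    """Replace all rows with start_day <= day < end_day in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "DELETE FROM daily_events WHERE day >= ? AND day < ?",
            (start_day, end_day),
        )
        conn.executemany(
            "INSERT INTO daily_events (day, page, hits) VALUES (?, ?, ?)",
            rows,
        )

# Hypothetical table and data, held in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_events (day TEXT, page TEXT, hits INTEGER)")
conn.executemany(
    "INSERT INTO daily_events VALUES (?, ?, ?)",
    [("2024-03-01", "/home", 120), ("2024-03-02", "/home", 80)],
)
conn.commit()

corrected = [("2024-03-02", "/home", 100)]
replace_window(conn, "2024-03-02", "2024-03-03", corrected)
replace_window(conn, "2024-03-02", "2024-03-03", corrected)  # re-run: same result

total = conn.execute(
    "SELECT SUM(hits) FROM daily_events WHERE day = '2024-03-02'"
).fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM daily_events").fetchone()[0]
print(total, row_count)  # 100 2
```

Because the delete and the insert share one transaction, a failure partway through leaves the original rows in place, and running the backfill twice produces the same table as running it once.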
How it appears in analytics and logs
Totals that jump for a past period after a maintenance run usually mean a backfill appended instead of replacing, double-counting that window.
Diagnostic use case
Correct a historical data gap or bug by reprocessing the affected window without double-counting or overwriting good data.
What WebmasterID can help detect
WebmasterID's source events give a fixed reference to validate a backfilled window against the original totals.
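A validation step of this kind might look like the sketch below. It does not use any WebmasterID API: the two dictionaries stand in for per-day totals from the backfilled table and from an independent reference, and the function name `validate_window` and the 1% tolerance are assumptions.

```python
def validate_window(backfilled, reference, tolerance=0.01):
    """Return days whose backfilled total differs from the reference
    by more than the relative tolerance, as {day: (backfilled, reference)}."""
    mismatches = {}
    for day in sorted(set(backfilled) | set(reference)):
        got = backfilled.get(day, 0)
        want = reference.get(day, 0)
        allowed = tolerance * max(want, 1)  # avoid a zero allowance on empty days
        if abs(got - want) > allowed:
            mismatches[day] = (got, want)
    return mismatches

# Hypothetical totals: the backfill double-counted 2024-03-02.
reference = {"2024-03-01": 120, "2024-03-02": 100}
backfilled = {"2024-03-01": 120, "2024-03-02": 180}

print(validate_window(backfilled, reference))  # {'2024-03-02': (180, 100)}
```

An empty result means the window agrees with the reference within tolerance; anything else is a reason to hold the backfill back from publication.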
Common mistakes
- Appending backfilled rows without removing the prior output.
- Reprocessing with a transform that still contains the bug.
- Changing history without re-running downstream consumers.
Privacy and accuracy notes
Reprocessing must respect deletions and retention from the original window. This page is educational, not legal advice.
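In practice this can mean filtering replayed events before they are reloaded. The following is a minimal sketch under assumed names: the event fields `user_id` and `day`, the set of deleted user IDs, and the retention cutoff date are all hypothetical, and real deletion and retention rules depend on the system and its policies.

```python
from datetime import date

def eligible_for_reprocessing(events, deleted_user_ids, retention_cutoff):
    """Keep only events that may still be stored: not deleted, not past retention."""
    return [
        e for e in events
        if e["user_id"] not in deleted_user_ids and e["day"] >= retention_cutoff
    ]

# Hypothetical raw events replayed from an archive.
raw_events = [
    {"user_id": "u1", "day": date(2024, 3, 2)},
    {"user_id": "u2", "day": date(2024, 3, 2)},   # u2 requested deletion
    {"user_id": "u3", "day": date(2023, 1, 15)},  # older than the retention cutoff
]

kept = eligible_for_reprocessing(raw_events, {"u2"}, date(2023, 6, 1))
print([e["user_id"] for e in kept])  # ['u1']
```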
Related pages
- Idempotency and dedup keys
Distributed pipelines deliver at least once, so the same event can arrive twice from retries, replays, or backfills. An idempotency key — a stable, unique identifier per event — lets the pipeline recognize a repeat and keep exactly one copy, so re-processing does not inflate counts. This page explains idempotency and de-duplication keys and how to choose one that survives the whole pipeline.
- ETL and pipeline failures
Analytics often flows through an ETL/ELT pipeline that extracts events, transforms them, and loads them into reporting tables. A failure at any stage — a timed-out extract, a transform exception, a half-written load — leaves data partial or stale, and if the failure is silent it reads as a genuine traffic dip. This page explains ETL failure modes and how to tell a pipeline gap from a real one.
- Late data reprocessing
Reports for recent periods are provisional. As offline conversions upload, late hits arrive, modeling recalculates, and identity stitches resolve, the platform reprocesses and the numbers move. GA4 and similar tools have processing windows during which figures are not final. This page explains why recent data is unstable and when it can be trusted as settled.
- Website observability
Validate a backfilled window before publishing.
Sources and verification notes
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.