Data quality

Backfill and reprocessing

When a pipeline misses data or processes it with a bug, backfilling re-runs it over the affected window to correct the record. Done carelessly, a backfill appends rows on top of existing ones and double-counts, or it overwrites good data with a still-buggy transform. This page explains how to reprocess a window safely so corrections fix the gap instead of creating a new one.

Partially verified

Two ways a backfill goes wrong

A backfill can go wrong in two ways. The first is duplication: if the job appends results without first removing the prior output for that window, the corrected rows stack on top of the originals and the period is double-counted. The second is overwriting with bad logic: the job replaces good data with the output of a transform that still has the bug, so the record ends up less accurate than it was before.

Either way the historical figure changes, which invalidates any cache or report built on the old value.

Append-without-delete double-counts the window
Overwriting with a still-buggy transform worsens data
Changed history breaks downstream caches and reports

Reprocessing safely

Make the load idempotent: scope the backfill to a bounded window and replace that partition atomically, either by deleting and inserting within a single transaction or by writing to a new partition and swapping it in, so that re-running yields the same result rather than more rows. Validate the corrected window against an independent source before publishing. Communicate that a historical figure changed, and re-run any downstream jobs that consumed the old version.

Idempotency keys and partition-replace are the mechanics that make this repeatable.
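The partition-replace pattern can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a production loader: the table name daily_events, its columns, and the replace_window helper are hypothetical, and a warehouse would typically use its own partition-overwrite or swap mechanism. The point it shows is that the delete and the insert share one transaction, so a second run leaves the same rows instead of adding more.

```python
import sqlite3


def replace_window(conn, day, rows):
    """Replace every row for one day inside a single transaction.

    `rows` is a list of (event_id, amount) tuples. The table and column
    names are hypothetical and exist only for this sketch.
    """
    # The connection context manager commits on success and rolls back
    # if any statement raises, so the window is never left half-replaced.
    with conn:
        conn.execute("DELETE FROM daily_events WHERE event_day = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_events (event_day, event_id, amount) "
            "VALUES (?, ?, ?)",
            [(day, event_id, amount) for event_id, amount in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_events (event_day TEXT, event_id TEXT, amount INTEGER)"
)

corrected = [("e1", 10), ("e2", 15)]
replace_window(conn, "2024-01-01", corrected)
replace_window(conn, "2024-01-01", corrected)  # re-running adds no rows

count, total = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM daily_events WHERE event_day = ?",
    ("2024-01-01",),
).fetchone()
```

Because the window is bounded to a single day here, rows for other days are untouched no matter how many times the backfill runs.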

How it appears in analytics and logs

Totals that jump for a past period after a maintenance run usually mean a backfill appended instead of replacing, double-counting that window.
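One way to confirm that a jump comes from appended rows is to look for keys that occur more than once in the window. The sketch below assumes each event carries a unique identifier; the (day, event_id, amount) row shape and the duplicated_keys helper are hypothetical and would need adapting to the real schema.

```python
from collections import Counter


def duplicated_keys(rows):
    """Return {(day, event_id): occurrences} for keys seen more than once.

    Assumes (day, event_id) should be unique in a correctly loaded window.
    """
    counts = Counter((day, event_id) for day, event_id, _amount in rows)
    return {key: n for key, n in counts.items() if n > 1}


rows = [
    ("2024-01-01", "e1", 10),
    ("2024-01-01", "e2", 15),
    ("2024-01-01", "e1", 10),  # appended again by a careless backfill
    ("2024-01-02", "e3", 7),
]
dupes = duplicated_keys(rows)
```

An empty result does not prove the window is correct, only that this particular form of duplication is absent.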

Diagnostic use case

Correct a historical data gap or bug by reprocessing the affected window without double-counting or overwriting good data.

What WebmasterID can help detect

WebmasterID's source events give a fixed reference to validate a backfilled window against the original totals.
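A validation step of this kind could look like the sketch below. It does not use any WebmasterID API: the two dictionaries stand in for per-day counts that would be obtained from the reference source and from the backfilled table, and window_mismatches and its tolerance parameter are hypothetical names chosen for illustration.

```python
def window_mismatches(reference, backfilled, tolerance=0):
    """Compare per-day counts and return {day: (expected, actual)}.

    A day missing from either side is treated as a count of zero.
    """
    mismatches = {}
    for day in sorted(set(reference) | set(backfilled)):
        expected = reference.get(day, 0)
        actual = backfilled.get(day, 0)
        if abs(actual - expected) > tolerance:
            mismatches[day] = (expected, actual)
    return mismatches


# Placeholder counts for illustration only.
reference = {"2024-01-01": 120, "2024-01-02": 98}
backfilled = {"2024-01-01": 240, "2024-01-02": 98}
problems = window_mismatches(reference, backfilled)
```

Running the check before publishing keeps a doubled or still-buggy window from reaching downstream consumers.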

Common mistakes

Appending backfilled rows without removing the prior output.
Reprocessing with a transform that still contains the bug.
Changing history without re-running downstream consumers.

Privacy and accuracy notes

Reprocessing must respect deletions and retention from the original window. This page is educational, not legal advice.


Sources and verification notes

Google Cloud — Idempotent and replayable pipelines

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.