Data quality

Watermarking late data

Streaming analytics groups events into time windows, but some events arrive late because they were buffered offline or delayed in transit. A watermark is the pipeline's estimate of how far event time has progressed, used to decide when a window is complete enough to emit. Set it too tight and late events are dropped; set it too loose and results lag. This page explains watermarking and the trade-off it imposes for late-arriving data.

Partially verified

What a watermark is

In stream processing, results are computed over windows of event time, for example one window per hour. Because events can arrive late, the system needs to know when an hour is 'done' enough to emit. A watermark is a moving estimate of event-time progress: when it passes the end of a window, the window is considered complete and its result is produced. An event is late when it arrives after the watermark has already passed the end of its window.

The watermark is a heuristic about lateness, not a guarantee that no later event exists.

Windows are defined in event time
The watermark estimates how far event time has advanced
Events that arrive after the watermark has passed their window are 'late'
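
The three points above can be sketched in a few lines. This is a minimal illustration, not how any particular streaming engine implements watermarks: it assumes one-hour tumbling windows, event times given as integer seconds, and a watermark that trails the largest event time seen so far by a fixed delay. The names WINDOW_SECONDS, MAX_DELAY_SECONDS, window_start and process are hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 3600     # assumption: one-hour tumbling windows
MAX_DELAY_SECONDS = 600   # assumption: watermark trails the newest event by 10 minutes


def window_start(event_time: int) -> int:
    """Return the start of the tumbling window that contains event_time."""
    return event_time - (event_time % WINDOW_SECONDS)


def process(event_times):
    """Consume event times (seconds) in arrival order.

    Returns (emitted, late):
      emitted - list of (window_start, count) in the order windows were emitted
      late    - event times that arrived after the watermark passed their window
    """
    counts = defaultdict(int)
    emitted = []
    late = []
    watermark = float("-inf")

    for t in event_times:
        start = window_start(t)
        # The window is already considered complete: this event is late.
        if start + WINDOW_SECONDS <= watermark:
            late.append(t)
            continue

        counts[start] += 1
        # The watermark only moves forward, trailing the newest event time.
        watermark = max(watermark, t - MAX_DELAY_SECONDS)

        # Emit every open window whose end the watermark has now passed.
        for s in sorted(counts):
            if s + WINDOW_SECONDS <= watermark:
                emitted.append((s, counts.pop(s)))

    return emitted, late
```

Calling process([100, 3500, 3700, 3400, 4300, 200, 7900]) emits the first hour once the event at 4300 pushes the watermark to 3700, so the event stamped 200 that arrives afterwards is reported as late instead of being counted.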

The latency trade-off

A conservative watermark waits longer, capturing more late events but delaying results; an aggressive one emits sooner but risks dropping late events or having to amend results it has already emitted. Many systems add an allowed-lateness grace period that re-emits a window when qualifying late events arrive, at the cost of mutable results. Choose the trade-off to match how late your sources realistically are and how fresh consumers need the numbers.
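
The allowed-lateness behaviour can be sketched as follows. This is a simplified model under stated assumptions, not the semantics of a specific engine: windows are one-hour tumbling windows, the watermark trails the newest event time by a fixed delay, and window state is kept until the watermark passes the window end plus the grace period. All names and constants are hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 3600            # assumption: one-hour tumbling windows
MAX_DELAY_SECONDS = 600          # assumption: watermark trails newest event by 10 minutes
ALLOWED_LATENESS_SECONDS = 1800  # assumption: 30-minute grace period after a window fires


def process_with_lateness(event_times):
    """Consume event times (seconds) in arrival order.

    Returns (emissions, dropped):
      emissions - list of (window_start, count, kind), kind is "initial" or "amended"
      dropped   - event times that arrived after the grace period expired
    """
    counts = defaultdict(int)
    fired = set()          # windows that have already produced a first result
    emissions = []
    dropped = []
    watermark = float("-inf")

    for t in event_times:
        start = t - (t % WINDOW_SECONDS)
        end = start + WINDOW_SECONDS

        # Past the grace period: the window's state is gone, so drop the event.
        if end + ALLOWED_LATENESS_SECONDS <= watermark:
            dropped.append(t)
            continue

        counts[start] += 1
        if start in fired:
            # Late but within the grace period: re-emit the corrected total.
            emissions.append((start, counts[start], "amended"))

        watermark = max(watermark, t - MAX_DELAY_SECONDS)

        for s in sorted(counts):
            s_end = s + WINDOW_SECONDS
            if s not in fired and s_end <= watermark:
                fired.add(s)
                emissions.append((s, counts[s], "initial"))
            if s_end + ALLOWED_LATENESS_SECONDS <= watermark:
                # Grace period over: retire the window's state.
                del counts[s]
                fired.discard(s)

    return emissions, dropped
```

With the input [100, 3500, 4300, 200, 6100, 300, 7900], the first hour is emitted with a count of 2, amended to 3 when the event stamped 200 arrives inside the grace period, and the event stamped 300 is dropped because it arrives after the grace period has expired. Consumers of such a stream need to treat a window's first result as provisional.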

This is the streaming counterpart to late-data reprocessing in batch pipelines.

How it appears in analytics and logs

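A window whose total keeps changing after it 'closed' usually points to a generous allowed-lateness setting; late events that never appear in any window usually point to a watermark or grace period that is tighter than the sources' real lateness.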

Diagnostic use case

Decide when a time window's results are final by using a watermark that trades waiting for late events against reporting latency.

What WebmasterID can help detect

WebmasterID's event timestamps let late hits be placed in the window in which they actually occurred rather than the window in which they happened to arrive.
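
The difference between the two placements can be shown with a small sketch. The record layout here is an assumption made for illustration: the field names event_ts and received_ts are hypothetical and do not describe WebmasterID's actual data format.

```python
from collections import Counter
from datetime import datetime, timezone


def hour_bucket(ts: datetime) -> str:
    """Label the UTC hour that a timestamp falls into."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:00Z")


def count_by(hits, field):
    """Count hits per hour, bucketing on the given timestamp field."""
    return Counter(hour_bucket(hit[field]) for hit in hits)


UTC = timezone.utc

# Hypothetical hits: the second one was buffered offline and arrived 21 minutes late.
hits = [
    {"event_ts": datetime(2026, 1, 1, 9, 58, tzinfo=UTC),
     "received_ts": datetime(2026, 1, 1, 9, 58, 5, tzinfo=UTC)},
    {"event_ts": datetime(2026, 1, 1, 9, 59, tzinfo=UTC),
     "received_ts": datetime(2026, 1, 1, 10, 20, tzinfo=UTC)},
    {"event_ts": datetime(2026, 1, 1, 10, 5, tzinfo=UTC),
     "received_ts": datetime(2026, 1, 1, 10, 5, 1, tzinfo=UTC)},
]

by_event = count_by(hits, "event_ts")       # windows by when the hit happened
by_arrival = count_by(hits, "received_ts")  # windows by when the hit was received
```

Bucketing on the event timestamp credits the buffered hit to the 09:00 hour, while bucketing on arrival shifts it into the 10:00 hour and understates the earlier window.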

Common mistakes

Setting a watermark tighter than real lateness, dropping events.
Treating a window total as final before the watermark passes.
Ignoring allowed lateness, so amended results surprise consumers.
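
One way to guard against the first mistake is to measure how late events really are before choosing a watermark delay. The sketch below is a rough aid, assuming you can compute a per-event delay as arrival time minus event time in seconds. Arrival delay is only a proxy for how far an event trails the watermark, and the function names and sample values are hypothetical.

```python
def delay_percentile(delays_seconds, percent: int) -> float:
    """Nearest-rank percentile of observed delays (arrival minus event time)."""
    if not delays_seconds:
        raise ValueError("need at least one observed delay")
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    ordered = sorted(delays_seconds)
    # Nearest-rank: ceil(percent * n / 100), done in integer arithmetic.
    rank = (percent * len(ordered) + 99) // 100
    return ordered[rank - 1]


def share_later_than(delays_seconds, watermark_delay: float) -> float:
    """Fraction of observed events delayed by more than watermark_delay."""
    if not delays_seconds:
        raise ValueError("need at least one observed delay")
    return sum(d > watermark_delay for d in delays_seconds) / len(delays_seconds)


# Hypothetical sample of observed delays, in seconds.
observed = [1, 2, 2, 3, 5, 8, 30, 45, 120, 900]

candidate_delay = delay_percentile(observed, 90)
late_share = share_later_than(observed, candidate_delay)
```

In this sample the 90th-percentile delay is 120 seconds, and a watermark delay of that size would still leave one event in ten arriving late. Whether that share is acceptable depends on how fresh consumers need the numbers and whether an allowed-lateness period catches the remainder.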

Privacy and accuracy notes

Watermarking uses event timing, not visitor identity. This page is educational, not legal advice.


Sources and verification notes

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.