Analytics data-quality reference: trust your numbers

A reference on analytics data quality. Each page explains a source of error or a diagnostic (sampling, cross-tool discrepancies, bot and spam filtering, double-counting, time-zone drift, deduplication) and how to reason about and reduce it.

123 data-quality topics documented · part of the Web Crawler & Traffic Intelligence Encyclopedia.

Analytics sampling: when reports estimate
Sampling is when an analytics tool computes a report from a fraction of the data and extrapolates. It keeps big queries fast, but it adds estimation error — worst for small segments and rare events, where a few sampled sessions get scaled into a confident-looking number. Knowing when a report is sampled is the first defence.
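The small-segment problem can be illustrated with a rough sketch. It assumes each session is kept independently with probability `sampling_rate`, which is a simplification and not a description of any particular tool's sampling scheme; the function name is hypothetical.

```python
import math

def scaled_estimate(sampled_count: int, sampling_rate: float) -> tuple:
    """Return (estimate, relative standard error) for a count scaled up from a sample.

    Assumes simple independent sampling at `sampling_rate` (an assumption
    made for illustration only).
    """
    if sampled_count <= 0 or not 0 < sampling_rate <= 1:
        raise ValueError("need a positive sampled count and a rate in (0, 1]")
    estimate = sampled_count / sampling_rate
    # Binomial approximation: the relative error depends on how many
    # sessions were actually sampled, not on the scaled-up total.
    relative_error = math.sqrt((1 - sampling_rate) / sampled_count)
    return estimate, relative_error

# Same 10% sampling rate, very different reliability.
rare_estimate, rare_error = scaled_estimate(4, 0.1)
common_estimate, common_error = scaled_estimate(4000, 0.1)
print(round(rare_error, 3), round(common_error, 3))
```

Under these assumptions, four sampled sessions scale to an estimate of about 40 with a relative error near 47%, while 4,000 sampled sessions carry an error under 2%.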
Bot traffic in analytics: filtering it out
Bots — crawlers, scrapers, monitors, scanners — generate requests that, unfiltered, inflate pageviews and distort every metric. Client-side analytics often misses bots (many do not run JavaScript) or miscounts the ones that do. Server-side classification at ingest is the reliable way to keep bot traffic out of human reports.
Why two analytics tools disagree
It is normal for two analytics tools to report different numbers for the same site. The differences are structural, not bugs: each tool defines a session differently, filters bots differently, samples or does not, attributes on different windows, and fires its tag at a different moment. This page explains the recurring causes and how to reconcile them.
Referral spam and ghost traffic
Referral spam and ghost traffic are fake hits crafted to appear in your reports. Crawler spam loads pages to leave a referrer in your logs; ghost spam sends hits straight to a measurement endpoint without ever visiting your site. Both add phantom sessions with no engagement. This page explains the mechanics and the filtering that removes them.
Ad blockers and analytics gaps
Content blockers and privacy extensions block requests to known analytics and tracking domains, so a share of visitors never fire the tag. The effect is a systematic undercount in client-side analytics that varies by audience and browser. This page explains how blocking works, why the gap is uneven, and how first-party server-side measurement reduces it.
ITP and browser tracking prevention
Intelligent Tracking Prevention (ITP) in Safari/WebKit, and equivalent protections in other browsers, limit how long cookies set by scripts survive and restrict cross-site tracking. The result: returning visitors look new, attribution windows shorten, and cohort retention is understated. This page explains the mechanisms and their effect on analytics.
Filtering internal traffic
Visits from your own team, contractors, and office networks inflate engagement on a small site and pollute conversion tests. Analytics tools let you define and filter internal traffic, usually by IP range or a tagging rule. This page covers how internal-traffic filtering works, why developer and QA traffic matters most, and the common mistakes that leave it on.
An analytics data-validation checklist
Before you act on a report, validate the data that produced it. This checklist walks the recurring failure points — duplicate tags, unfiltered bots, internal traffic, wrong time zone, broken events, sampling — and gives a concrete check for each. Run it after any tracking change and periodically, so a metric you trust is a metric you have verified.
Double-counting pageviews
Double-counting happens when a single page load fires the analytics tag more than once. Two snippets on the page, a tag added in both the site and a tag manager, or an SPA that fires a virtual pageview on top of the full-load one all do it. The result inflates pageviews and distorts engagement and bounce metrics. This page covers detection and the fixes.
Time-zone mismatches in reporting
Every analytics property reports against a configured time zone, and that zone decides which calendar day each hit belongs to. A wrong zone shifts your daily curve; two tools on different zones never match day-to-day; and daylight-saving changes create a missing or doubled hour. This page explains how the reporting time zone shapes data and the artefacts to expect.
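A minimal sketch of how the reporting zone assigns a hit to a calendar day. It uses fixed UTC offsets for simplicity, so it ignores daylight saving; the function name and the sample timestamp are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def reporting_day(hit_utc: datetime, utc_offset_hours: int) -> str:
    """Calendar day a hit is bucketed into for a property at a fixed UTC offset."""
    zone = timezone(timedelta(hours=utc_offset_hours))
    return hit_utc.astimezone(zone).date().isoformat()

# One hit, recorded late in the evening UTC.
hit = datetime(2024, 3, 1, 23, 30, tzinfo=timezone.utc)
print(reporting_day(hit, 0))  # 2024-03-01 for a property set to UTC
print(reporting_day(hit, 2))  # 2024-03-02 for a property two hours ahead
```

The same hit lands on different days in the two properties, which is why daily totals from tools on different zones do not line up.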
Self-referrals and lost attribution
A self-referral is when your own site shows up as a referring source in your reports. It usually means a session was broken and a new one started attributed to your domain, often when a visitor crosses subdomains or returns from a payment provider. Self-referrals fragment sessions and steal credit from the real source. This page explains the causes and the fix.
Validating event tracking
Custom events power conversions, funnels, and product analytics — and they break quietly. A renamed CSS selector, a refactor, or a tag-manager edit can stop an event firing or change its parameters without any error. This page covers validating events: confirming they fire on the right action, exactly once, with the expected name and parameter values.
New vs returning misclassification
New-vs-returning depends on recognising the same visitor across visits, which relies on a stored identifier. When that identifier is missing (cleared cookies, tracking prevention, a different device or browser, or declined consent), a returning visitor is recorded as new. The result overstates 'new' visitors and understates loyalty. This page explains the failure modes.
Consent, modelling, and data gaps
Where consent is required before analytics runs, declined or pending consent means no data is collected for those visitors — a real gap, not lost interest. Some tools fill the gap with modelled estimates rather than measured counts. This page explains how consent shapes collection, what modelling is, and how to read a dataset that mixes measured and modelled data. Educational, not legal advice.
Direct traffic as a catch-all bucket
Direct traffic is often misread as 'people who typed the URL'. In practice it is a catch-all for any session with no usable referrer or campaign: untagged links, stripped referrers, app and messaging clicks, and redirects that lose data. When other attribution fails, direct swells. This page explains what really lands in the direct bucket and how to shrink it.
Language spam and keyword spam
Language spam and keyword spam place messages (promotions, slogans, even instructions) into fields like browser language or a site-search term. The values are forged, sent by bots or crafted hits to be read by whoever opens the report. They are not real visitor attributes. This page explains how the injection works and how to recognise and filter it.
Developer and QA traffic in reports
Development, staging, preview, and automated test traffic can all reach a production analytics property if the same measurement ID is reused or environments are not separated. The hits look like engaged users but represent your own pipeline. This page explains how dev and QA traffic leaks into reports and the configuration that keeps environments cleanly apart.
Late-arriving and offline hits
Not every hit arrives when it happens. A device offline queues events and sends them on reconnect; processing pipelines add delay; and tools backfill recent data. The effect is that today's and yesterday's numbers are provisional and keep rising as late hits land. This page explains why fresh reports change under you and how to read them.
High cardinality and the (other) row
Every analytics tool has limits on how many distinct values a dimension can hold in a report. When a high-cardinality dimension — like full URLs or custom IDs — exceeds the limit, the overflow is bundled into an aggregate (other) row. Detail you expected vanishes into it, and totals look complete while breakdowns are not. This page explains the cause and the workarounds.
Session fragmentation and inflation
A session is meant to represent one continuous visit, but several rules can split one journey into many. A timeout during a pause, a campaign parameter mid-visit, crossing midnight, or a self-referral each starts a fresh session. The result inflates session counts and shrinks per-session metrics. This page explains the fragmentation rules and how to read counts affected by them.
URL parameters splitting page reports
When URLs carry query parameters — campaign tags, ad-click IDs, session tokens, sort and filter state — analytics often treats each variant as a different page. One article scatters across dozens of rows, no single line shows its true total, and cardinality balloons. This page explains how URL parameter noise fragments page reports and how normalising paths fixes it.
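Path normalisation can be sketched as below. The list of parameters to drop is a hypothetical example and would need to match your own site; parameters that change the content, such as pagination, are kept.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical noise parameters; adjust to the parameters your site actually uses.
NOISE_PARAMS = {"gclid", "fbclid", "sessionid", "sort"}

def normalise_url(url: str) -> str:
    """Drop campaign, click-ID, and state parameters so variants collapse to one page."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NOISE_PARAMS and not key.lower().startswith("utm_")
    ]
    # The fragment is discarded as well, since it does not identify a separate page here.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

print(normalise_url("https://example.com/article?utm_source=news&gclid=abc&page=2#top"))
```

An allowlist of parameters to keep is often safer than a blocklist, because new tracking parameters appear without notice.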
Hostname leakage across properties
Your measurement ID is visible in page source, so anyone can paste it on another site and have that traffic report into your property. Staging copies, scraped clones, and proxies do this too. The leaked hits inflate and pollute your data with another domain's traffic. This page explains hostname/property leakage and the valid-hostname filtering that contains it.
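A valid-hostname filter can be sketched as an allowlist check at processing time. The hostnames and the function name are hypothetical, and the sketch assumes each hit carries its full page URL.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: the hostnames that legitimately serve your tag.
VALID_HOSTNAMES = {"example.com", "www.example.com"}

def is_valid_hit(page_location: str) -> bool:
    """Keep a hit only if its page URL is on one of your own hostnames."""
    host = (urlsplit(page_location).hostname or "").lower()
    return host in VALID_HOSTNAMES

print(is_valid_hit("https://www.example.com/pricing"))
print(is_valid_hit("https://scraped-clone.test/pricing"))
```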
Cross-domain tracking issues
A single user crossing from one domain to another (site to a separate checkout or booking host) should stay one user and one session. Without cross-domain tracking, the second domain starts a fresh session and often a self-referral, double-counting users and breaking attribution. This page explains how the GA4 linker passes the client ID and the common reasons it does not arrive.
Subdomain tracking issues
Subdomains under the same registrable domain (blog.example.com, shop.example.com) typically share a first-party cookie set on the parent domain, so a user stays continuous. Problems arise when the cookie domain is scoped too narrowly, when subdomains use separate properties, or when one subdomain appears as a referrer to another. This page distinguishes subdomain handling from true cross-domain tracking.
(not set) and Unassigned values
GA4 shows `(not set)` when no value was collected for a dimension at the time data was recorded, and `Unassigned` when traffic could not be matched to any defined channel group. These are not errors so much as honest placeholders — but each has distinct, documented causes worth diagnosing rather than ignoring. This page separates the placeholders and what produces them.
Data thresholding in GA4
Data thresholding is a GA4 privacy mechanism: when a report could let someone infer the identity of individual users from low-volume rows (especially with Google Signals or demographics enabled), GA4 hides some data. The result is missing rows and report totals that do not reconcile. This page explains when thresholding applies and how to recognize it.
Duplicate transactions in ecommerce data
Duplicate transactions occur when one purchase is counted more than once — usually because the order-confirmation page is reloaded, bookmarked, or shared, or because a retry resends the same event. GA4 deduplicates ecommerce purchases on `transaction_id`, so an absent or unstable ID is the root cause. This page covers detection and the deduplication key.
Missing currency or value on events
GA4 monetary events such as `purchase` need both a `value` and a `currency` field, and currency must be a valid ISO 4217 code. If currency is missing or invalid, GA4 may not credit the revenue; if value is missing, the event records but contributes nothing to monetary metrics. This page explains the requirement and the silent failure modes.
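A pre-send check for these two fields might look like the sketch below. It only verifies that the currency has the shape of an ISO 4217 code (three uppercase letters), not that the code is actually assigned; the function name is hypothetical.

```python
import re

CURRENCY_SHAPE = re.compile(r"[A-Z]{3}")

def monetary_problems(params: dict) -> list:
    """List the problems that would stop a monetary event contributing revenue."""
    problems = []
    value = params.get("value")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        problems.append("value missing or not a number")
    currency = params.get("currency")
    if not isinstance(currency, str) or not CURRENCY_SHAPE.fullmatch(currency):
        problems.append("currency missing or not a three-letter uppercase code")
    return problems

print(monetary_problems({"value": 19.99, "currency": "EUR"}))
print(monetary_problems({"value": "1.234,56", "currency": "eur"}))
```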
Daylight-saving time anomalies
When a region enters or leaves daylight saving time, one local hour vanishes (spring forward) or repeats (fall back). Reports bucketed by local wall-clock time then show a missing hour or a doubled hour, and the affected day has 23 or 25 hours. This page explains the artifact and why UTC-based analysis avoids it.
Server time vs client time
An event's timestamp can come from the client (the browser's clock at the moment of the action) or the server (when the collector received the hit). The two differ because of clock skew, network delay, and offline buffering, and the choice affects ordering, attribution windows, and which day an event lands on. This page contrasts the two clocks.
Data import errors in GA4
GA4 data import merges external files (cost data, item metadata, offline events, user attributes) with collected data by matching on a key. When the key, the column names, the date format, or the schema do not match exactly, rows fail to import or join to nothing — leaving partial or absent enriched data with no obvious error in reports. This page covers the join model and its failure points.
Referrer exclusion list mistakes
GA4's unwanted-referrals (referrer exclusion) configuration tells analytics not to treat certain domains — payment gateways, SSO providers, your own domains — as new traffic sources. Get it wrong and you either fragment sessions (under-listing a gateway) or erase real referrers (over-listing). This page explains the mechanism and the two-sided error.
Datacenter traffic filtering
A large share of non-human traffic originates from datacenter and cloud-hosting IP ranges — automation, scrapers, and monitoring that may not declare themselves as bots. Filtering on known datacenter ranges removes a class of noise that user-agent rules miss, but ranges change and some legitimate users (VPNs, corporate proxies) also live there. This page covers the technique and its limits.
IP filtering pitfalls
Filtering out internal or unwanted traffic by IP address is intuitive but fragile: residential IPs are dynamic, mobile and shared networks sit behind carrier-grade NAT, IPv6 prefixes differ from IPv4 rules, and privacy relays mask the real address. As a result IP filters silently stop matching or match the wrong people. This page details the pitfalls of IP-based filtering.
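The IPv6 pitfall is easy to reproduce with Python's standard `ipaddress` module. The range below is a documentation-reserved prefix standing in for a hypothetical office network.

```python
import ipaddress

# Hypothetical internal range (a documentation-reserved IPv4 prefix).
INTERNAL_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]

def is_internal(ip: str) -> bool:
    """Return True if the address falls inside a configured internal range."""
    address = ipaddress.ip_address(ip)
    # Membership tests across IP versions are always False, so an IPv4-only
    # rule list never matches a colleague who connects over IPv6.
    return any(address in network for network in INTERNAL_NETWORKS)

print(is_internal("203.0.113.7"))   # office machine over IPv4
print(is_internal("2001:db8::1"))   # an IPv6 address, not covered by the IPv4 rule
```

Here the IPv4 address matches the rule, while an IPv6 address never can, and the filter gives no error to signal the miss.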
Ads vs analytics discrepancies
It is normal for Google Ads and GA4 to report different conversion and click numbers for the same campaign. They use different attribution models, count conversions at different times (Ads at click time, GA4 at conversion time), define a click versus a session differently, and apply different windows and de-duplication. This page enumerates the documented reasons the two tools diverge.
Tag Manager misconfiguration
Google Tag Manager (GTM) sits between your site and analytics, so a misconfigured container quietly distorts every downstream metric. Typical faults include triggers that fire on the wrong pages, tags that fire twice, dataLayer values pushed after the tag reads them, and changes left in Preview but never published. This page catalogs the misconfiguration classes and how to verify a container.
Tag firing order and timing
The order in which tags and scripts execute determines whether an event has the data it needs. If the analytics tag fires before the consent decision, before the dataLayer is populated, or after the user has already navigated away, the resulting event is dropped or stripped of context. This page explains firing-order and timing faults and the sequencing controls that fix them.
Consent-driven data loss
Under consent frameworks, visitors who decline analytics cookies cause measurement to be blocked or sent in a cookieless, anonymized form. The lost data is not random — it skews toward privacy-conscious users and certain regions — so totals understate reality in a structured way. This page distinguishes consent-driven loss from ad-blocking and explains the modeling response, as education rather than legal advice.
Dark traffic in analytics
Dark traffic (or dark social) is genuine human traffic whose source is lost, so it falls into the Direct bucket. It comes from links opened inside apps and messaging clients, email programs, documents, and secure-to-insecure transitions that strip the Referer header. The result is an inflated Direct channel that hides real acquisition. This page explains the mechanisms that erase the referrer.
Hit and event collection limits
Analytics collection is bounded: GA4 limits the number of distinct event names, the parameters per event, the length of names and values, and the size of each request. Exceed a limit and the surplus is truncated or dropped — usually without a visible error — so reports are quietly incomplete. This page summarizes the documented limits that cause silent data loss.
Modeled vs observed data
Modern analytics reports mix two kinds of figures: observed data measured directly, and modeled data — statistical estimates that fill gaps left by declined consent, cookie loss, and unmeasured sessions. Modeled conversions and behavioral modeling are estimates, can change as models update, and should not be treated as exact counts. This page distinguishes the two and explains how to interpret blended numbers.
PII leakage in URLs and reports
When URLs carry personal data — an email in a query string, a name in a path, a reset token after a redirect — analytics ingests that PII into page-path and page-location dimensions. Google Analytics policy prohibits sending PII, and once collected it is hard to remove. This page explains how leakage happens and how to redact before data is sent, as education rather than legal advice.
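Redaction before sending can be sketched as follows. This covers only email addresses in query-string values, using a deliberately loose pattern; names, tokens, and PII in the path would need their own rules. The function name and the `REDACTED` marker are hypothetical.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Loose email pattern for illustration; not a full address validator.
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def redact_url(url: str) -> str:
    """Replace query values that contain an email address before the URL is sent."""
    parts = urlsplit(url)
    query = [
        (key, "REDACTED" if EMAIL.search(value) else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), parts.fragment))

print(redact_url("https://example.com/welcome?email=jane%40example.com&plan=pro"))
```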
Currency and locale mismatches
Revenue breaks when monetary events mix currencies or send locale-formatted strings. A value like "1.234,56" (European format) or "$1,234.56" is not a number GA4 can sum, and reporting many currencies without per-event ISO codes makes totals meaningless. GA4 converts to a property base currency only when each event carries a valid currency. This page covers currency and locale formatting faults.
Sampling thresholds and cardinality interplay
Three GA4 mechanisms quietly limit what a report shows: sampling (when a query exceeds the event quota), data thresholds (privacy suppression of small groups), and cardinality limits (high-cardinality dimensions collapsing into an 'other' row). They have different triggers and effects, but in complex explorations they compound — so a report can be sampled, thresholded, and capped at once. This page untangles how they interact.
BigQuery vs UI discrepancies
When GA4's BigQuery export and the reporting interface show different totals, it is usually not a bug. The UI applies sampling, data thresholds, (other) aggregation, and behavioral/conversion modeling on top of the raw event stream; BigQuery exports the unmodeled, unsampled events. Knowing which transformations the UI adds explains most gaps.
Single-page-app tracking gaps
In a single-page application, the browser loads the document once and the framework swaps views via the History API without a new document load. Analytics that depends on the load event therefore records only the first screen. This page explains the gaps (missing virtual pageviews, stale page paths, and broken referrer chains) and how SPA-aware tracking closes them.
Auto-tagging vs UTM conflicts
Google Ads auto-tagging appends a gclid to ad landing URLs; manual tagging adds utm_ parameters. When a link carries both, the two systems describe the same click differently and can disagree. By default GA4 gives precedence to gclid-based auto-tagging, so hand-set utm_source/medium on Ads links may not appear as expected. This page explains the precedence and how to avoid the conflict.
Transaction ID deduplication
Ecommerce purchases are deduplicated on transaction_id. If a confirmation page reloads or a user refreshes, the same purchase event can fire twice; GA4 collapses repeated transaction_ids so revenue is not double-counted. The flip side: a missing transaction_id, or one reused across different orders, breaks dedup and corrupts revenue. This page explains the mechanism and its failure modes.
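The idea of keyed deduplication, and why a missing key defeats it, can be shown with a small sketch. This illustrates the principle only and is not GA4's internal implementation; the event dictionaries and function name are hypothetical.

```python
def dedupe_purchases(events: list) -> list:
    """Keep the first purchase per transaction_id; events without an ID cannot be matched."""
    seen = set()
    unique = []
    for event in events:
        transaction_id = event.get("transaction_id")
        if not transaction_id:
            # No usable key, so a repeat of this event is indistinguishable from a new order.
            unique.append(event)
            continue
        if transaction_id in seen:
            continue
        seen.add(transaction_id)
        unique.append(event)
    return unique

events = [
    {"transaction_id": "T1", "value": 50},
    {"transaction_id": "T1", "value": 50},  # confirmation page reloaded
    {"transaction_id": "T2", "value": 20},
    {"value": 10},                          # ID missing
    {"value": 10},                          # same order resent, still counted
]
deduped = dedupe_purchases(events)
print(len(deduped), sum(event["value"] for event in deduped))
```

The reloaded `T1` purchase collapses to one, but the order without an ID is counted twice.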
GA4 session redefinition gaps
GA4 redefined the session. Instead of Universal Analytics' rules that broke a session at midnight and on each new campaign source, GA4 starts a session with a session_start event, keeps it alive within a timeout window (default 30 minutes), and does not split it on a new campaign or at midnight. Teams migrating from UA see session counts and per-session metrics shift because of this redefinition.
AMP analytics gaps
Accelerated Mobile Pages restrict JavaScript and require the amp-analytics component instead of standard tags. Because AMP pages are frequently served from a cache (e.g. a Google AMP cache) on a different origin, the client identifier and referrer differ from the canonical page, so the same user can look like two users and a session can split when they move from AMP to canonical. This page explains the gaps and the linking that mitigates them.
Engaged session edge cases
GA4's engaged session is the basis for engagement rate and its inverse, bounce rate. A session counts as engaged if it lasts longer than the engagement-time threshold (default 10 seconds), records a key event, or has at least two pageviews/screenviews. The edge cases (fast single-view satisfactions, a changed threshold, background time) quietly move engagement and bounce numbers. This page documents them.
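The three conditions can be written as a single predicate. This is a sketch of the rule as stated above, with hypothetical parameter names, not GA4's own code.

```python
def is_engaged(duration_seconds: float, key_events: int, views: int,
               threshold_seconds: float = 10.0) -> bool:
    """A session is engaged if any one of the three conditions holds."""
    return (
        duration_seconds > threshold_seconds
        or key_events >= 1
        or views >= 2
    )

print(is_engaged(4, 0, 1))                         # quick single-view visit
print(is_engaged(12, 0, 1))                        # passes the default threshold
print(is_engaged(12, 0, 1, threshold_seconds=30))  # same visit, raised threshold
```

The last two calls show how changing the threshold reclassifies an identical session and moves both engagement rate and bounce rate.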
Key event counting changes
GA4 renamed 'conversions' to 'key events' and added a counting-method choice: count a key event once per event, or once per session. The same traffic yields different totals under the two methods, and the rename plus the Ads-side split (conversions stay an Ads concept) confuse reconciliation. This page explains the counting methods and why totals move when they change.
Data modeling accuracy in GA4
When consent or identity data is missing, GA4 can estimate the unobserved portion using behavioral modeling (for users/sessions under consent mode) and conversion modeling (for unattributed conversions). These figures are modeled estimates, not counted events, and only appear when data volume meets Google's eligibility thresholds. This page explains what modeling does and the limits on its accuracy.
GCLID stripping and loss
The gclid is the Google Ads click identifier appended by auto-tagging. If anything between the ad and the page removes the parameter — a redirect that drops query strings, a CMS canonical rewrite, a link shortener, or a privacy tool — the landing page never sees the gclid and the click cannot be attributed. This page explains where gclid loss happens and how to detect it.
UTM overwrite issues
Campaign attribution depends on which UTM values are present when a session begins. Two patterns cause trouble: a second campaigned link mid-journey can overwrite the first, and internal links that accidentally carry UTM parameters can reset attribution to an internal source. This page explains how UTM values get overwritten and how to keep internal links clean.
Iframe tracking issues
An iframe is a nested browsing context with its own document, origin, and storage partition. Analytics running inside an iframe reports the iframe's URL (not the parent page), sees the parent as the referrer, and, under storage partitioning, cannot share cookies with the top-level site. This produces orphaned pageviews, self-referrals, and broken identity. This page explains the constraints.
Redirect and referrer loss
The referrer tells analytics where a visit came from, but it is fragile. A redirect hop can replace the original referrer with the redirector's URL, and Referrer-Policy or HTTPS-to-HTTP downgrades can suppress it entirely. When the referrer is empty, the visit falls into direct; when it is the redirector's domain, it can look like a self-referral. This page explains referrer loss in transit.
Timestamp skew and clock drift
Hit timestamps can come from the client device, whose clock may be set wrong or drift. An incorrect client clock produces events stamped minutes, hours, or even days off, which corrupts session boundaries, event ordering, and time-based reports. Platforms that adjust against server-received time mitigate this. This page explains clock drift and how analytics pipelines correct for it.
API export limits
Programmatic exports through the GA4 Data API are bounded: a single response returns up to a fixed number of rows, and each query is limited in how many dimensions and metrics it may combine. Pulls that ignore these limits truncate without obviously failing, producing partial datasets that look complete. This page explains the row and field caps and the pagination that avoids silent truncation.
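A pagination loop that avoids silent truncation might look like the sketch below. `fetch_page(offset, limit)` is a hypothetical wrapper around whatever client you use, assumed to return the page's rows plus the total matching row count; it is not a real library call.

```python
from typing import Callable

def fetch_all_rows(fetch_page: Callable[[int, int], dict], page_size: int = 10000) -> list:
    """Request pages until the reported total is reached.

    `fetch_page` is assumed to return {"rows": [...], "row_count": total}.
    """
    rows = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        rows.extend(page["rows"])
        offset += len(page["rows"])
        # Stop on an empty page too, so a wrong row_count cannot loop forever.
        if not page["rows"] or offset >= page["row_count"]:
            return rows

# Stand-in data source used only to exercise the loop.
_data = list(range(25))

def fake_fetch_page(offset: int, limit: int) -> dict:
    return {"rows": _data[offset:offset + limit], "row_count": len(_data)}

all_rows = fetch_all_rows(fake_fetch_page, page_size=10)
print(len(all_rows))
```

Comparing the number of rows collected with the reported total is a cheap completeness check to log on every export.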
Quota and throttling
GA4's Data API meters usage with token buckets per property, charging more tokens for larger and more complex queries. Concurrent requests are also capped. Pipelines that fan out too many or too-expensive queries exhaust the quota and get throttled with quota errors, so an export can fail partway and leave a gap. This page explains the quota model and how to stay under it.
Schema drift in event data
Schema drift is the gradual, uncoordinated change of event names, parameter keys, value types, or enumerations in an analytics stream. A renamed event, a parameter that switches from string to number, or a new value an enum did not expect can break joins, drop rows from filters, or quietly corrupt aggregates. This page explains how drift arises in event pipelines and how to guard against it.
Late data reprocessing
Reports for recent periods are provisional. As offline conversions upload, late hits arrive, modeling recalculates, and identity stitches resolve, the platform reprocesses and the numbers move. GA4 and similar tools have processing windows during which figures are not final. This page explains why recent data is unstable and when it can be trusted as settled.
Server-side deduplication
Server-side tagging and the Measurement Protocol let the server emit events alongside the browser. If a conversion fires from both the client tag and the server without coordination, it is counted twice. Deduplication on a shared event identifier prevents this, mirroring how ad platforms dedupe browser and server signals. This page explains the dual-send problem and the id-based dedup that solves it.
Sampling in explorations
GA4's Explore module samples differently from standard reports. When an exploration's query exceeds an event-count quota for the date range, GA4 analyses a representative subset and scales the results, flagging the sampling level. Deep, wide, or long-range explorations are most exposed. This page explains when Explorations sample and how to read the sampling indicator.
Partial data and freshness
Data freshness is how recently the data behind a report was processed. The current day and the most recent hours are partial: not every event has arrived or been processed, so totals are understated and shapes incomplete. GA4 exposes freshness expectations and shows real-time data separately. This page explains partial-data pitfalls and how to read freshness.
Cross-account data leakage
Cross-account data leakage is when events meant for one property land in another. It happens when a measurement ID is copied to the wrong site, a shared GTM container loads on multiple unrelated domains, or a tag template references the wrong destination. The result is inflated, contaminated data in the receiving property and missing data in the intended one. This page explains the causes and the hostname checks that catch it.
GA4 vs Search Console discrepancies
GA4 and Google Search Console measure adjacent but different events, so comparing their totals directly always shows a gap. Search Console counts clicks and impressions from search results; GA4 counts sessions and users that load your tag. Different time zones, filtering, de-duplication, and the moment of measurement all widen the difference. This page explains why the two never reconcile exactly and how to read each correctly.
GA4 vs Google Ads conversion gaps
GA4 and Google Ads frequently report different conversion numbers for the same campaigns. The causes are structural: Ads credits conversions to the click date and can count multiple per click, while GA4 attributes on its own model and counts on the conversion date. Add different attribution windows, modelling, and de-duplication and the totals diverge. This page explains the differences and how to compare them sensibly.
Search Console data gaps and limits
Search Console is a powerful but bounded dataset. It omits rare queries to protect privacy, caps the number of rows you can export, and reports recent days incompletely while data finalises. As a result query-level totals do not sum to the property total, and the latest days look low. This page explains the structural gaps in Search Console data so you read it without over-reaching.
Looker Studio discrepancies
A Looker Studio dashboard can show different figures from the GA4 property it draws on. The causes sit in the reporting layer: the connector may trigger sampling, default date ranges and filters differ, blended data sources fan out rows on joins, and cached results lag the source. This page explains why a dashboard and its source disagree and how to make a report trustworthy.
BigQuery export schema changes
The GA4 BigQuery export has a documented but evolving schema. Google adds fields, changes nested structures, and the intraday and daily tables do not always carry identical columns. A query written against an old shape can break or silently miss new data. This page explains how the export schema changes, where the risks are, and how to write queries that survive evolution.
Intraday vs daily export tables
The GA4 BigQuery export writes provisional intraday tables during the day and a finalised daily table afterwards. Intraday data is incomplete and can be reprocessed, and once the daily table lands the intraday one is removed. Querying both, or trusting intraday as final, causes double counts and shifting numbers. This page explains the two table types and how to query them correctly.
Streaming export gaps in BigQuery
GA4 offers two BigQuery export modes — continuous streaming and a once-daily batch — and they do not always agree. Streaming optimises for freshness and can omit or delay some events that the daily export later includes after fuller processing. Comparing the two for the same day reveals a gap. This page explains why streaming and daily exports differ and which to trust for a given purpose.
Event timestamp vs collection time
Every event carries more than one notion of time: when it happened on the client, when it was sent, and when the server received and processed it. These diverge with offline queuing, clock skew, and processing lag. Reports built on one clock will not match reports built on another. This page explains event timestamp versus collection time and which to use for which question.
User deletion and report effects
Honouring deletion requests and data-retention limits removes user-level data from analytics. Aggregate reports built on standard processing are largely unaffected, but user-scoped explorations, audiences, and the raw export can shrink as records are removed. Understanding what deletion touches prevents misreading a privacy action as a data fault. This page explains deletion's report effects. Educational, not legal advice.
Channel grouping rule changes
Default channel groupings are sets of rules that map sources and mediums to channels like Organic, Paid, and Referral. When a platform revises those rules — adding a channel, retiring one, or changing how a source is classified — traffic moves between channels and historical trends appear to jump. This page explains how channel-rule changes reshape reports and how to read a channel trend across a definition change.
Custom channel grouping pitfalls
Building a custom channel grouping gives you control, but it also introduces failure modes: rules that overlap so order decides classification, sessions that match nothing and fall to 'Unassigned', and changes that may not apply to history. This page covers the pitfalls of custom channel groupings and how to build one that classifies traffic cleanly and consistently.
Unwanted referrals and exclusions
Unwanted referrals are domains you do not want treated as a traffic source — payment gateways, single sign-on providers, and your own properties. Left unmanaged, a return from them starts a new, self-referred session and steals credit from the real source. A referral-exclusion list tells analytics to ignore those domains. This page explains unwanted referrals and how to configure the exclusion correctly.
Social vs referral misclassification
Traffic from social platforms should appear in a social channel, but it often lands in Referral instead. The cause is classification: analytics recognises social by matching the referrer against a known list, and app clients, short-link domains, and new platforms may not match. The result understates social and inflates referral. This page explains the misclassification and how to correct channel attribution.
Referral vs organic misattribution
Organic search should be credited to a search channel, but visits sometimes land in Referral instead. It happens when a search engine is not on the recognised-search list, when a search result passes a non-standard referrer, or when redirects strip the search context. The effect undercounts organic and inflates referral. This page explains referral-versus-organic misattribution and how to correct it.
Measurement ID mix-ups
A measurement ID is the address a tag sends data to. Wire up the wrong one and a site reports into another property, splits its traffic across two IDs, or sends nothing useful at all. Mix-ups arise from copy-paste, multiple environments, and migrations. This page explains the failure modes of measurement-ID mistakes and how hostname and real-time checks surface them quickly.
Multiple tags on one page
When more than one analytics tag loads on the same page, hits get duplicated, events fire twice, and tags can race or overwrite each other's configuration. It usually stems from a snippet hard-coded in the template and also added via a tag manager, or two tag-manager containers. This page explains how multiple tags on one page distort data and how to detect and consolidate them.
Session timeout customization
A session ends after a period of inactivity, and that timeout is configurable. Lengthen it and long pauses no longer split a visit into two sessions; shorten it and they do. Either change moves session counts, sessions-per-user, and engagement, and it makes your data diverge from any tool on a different timeout. This page explains how customising the session timeout reshapes metrics.
Campaign timeout window effects
A campaign or acquisition timeout controls how long the source that brought a visitor keeps getting credit for their later sessions. When that window expires before the visitor returns, the next session is no longer attributed to the original source and often falls to Direct. Changing the window moves attribution between channels. This page explains the campaign timeout and its effect on source reports.
Data-collection region restrictions
Where analytics may collect, and at what granularity, can vary by region. Regulatory requirements, regional data settings, and features like restricting fine-grained location and device data mean visitors from some regions are measured less completely than others. The result is uneven coverage and granularity across geographies, not a uniform dataset. This page explains regional collection restrictions. Educational, not legal advice.
Geo and IP location mismatch
Analytics infers a visitor's location from their IP address, and that inference is approximate. VPNs and proxies relocate visitors, mobile carrier routing can place a user far from where they are, and IP databases are imprecise at city level. The result is location data that is directional, not exact. This page explains why IP-derived location and a visitor's actual location diverge and how to read location reports with appropriate caution.
Locale and number formatting noise
Numbers carry locale: a comma is a thousands separator in one place and a decimal point in another, and currency and date formats differ everywhere. When values are imported, parsed, or merged across locales without normalising, amounts are misread — a price becomes a thousand times larger or a decimal collapses. This page explains locale and number-formatting noise and how to normalise to avoid it.
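As a minimal sketch of that normalisation, the snippet below parses an amount using separators declared for the source locale instead of guessing them from the string. The SEPARATORS table and the parse_amount name are hypothetical, and only two locales are shown for illustration.

```python
from decimal import Decimal

# Hypothetical separator table. A real pipeline would take the locale from
# the source system's declared settings, not infer it from the value.
SEPARATORS = {
    "en_US": {"thousands": ",", "decimal": "."},
    "de_DE": {"thousands": ".", "decimal": ","},
}


def parse_amount(text: str, locale_code: str) -> Decimal:
    """Convert a locale-formatted amount into a Decimal with '.' as decimal point."""
    seps = SEPARATORS[locale_code]
    # Drop the thousands separator first, then standardise the decimal mark.
    cleaned = text.strip().replace(seps["thousands"], "")
    cleaned = cleaned.replace(seps["decimal"], ".")
    return Decimal(cleaned)


# The same string means different amounts depending on the declared locale.
us_value = parse_amount("1,234", "en_US")   # one thousand two hundred thirty-four
de_value = parse_amount("1,234", "de_DE")   # one point two three four
```

Parsing "1,234" under the wrong locale changes the value by a factor of a thousand, which is the misread described above.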
Attribution window mismatch across tools
Attribution look-back windows define how far back a tool searches for the touchpoints that earn conversion credit. When two tools use different window lengths or models, the same conversion is credited differently — or to a touchpoint one tool can see and the other cannot. This page explains how attribution-window mismatches across tools produce diverging conversion and channel numbers, and how to compare fairly.
Measurement Protocol spam
The GA4 Measurement Protocol lets servers send events over HTTP. Because the measurement ID is visible in page source, attackers can craft requests that inject fabricated events, hostnames, or referrers into a property. The api_secret raises the bar but is a shared key, not per-user proof. This page explains how Measurement Protocol spam enters GA4 and how to recognize and contain it.
Tracking plan governance
A tracking plan documents every event, its parameters, types, and meaning, so collection stays consistent as teams ship. Governance is the process around it: who can add events, how changes are reviewed, and how naming is enforced. Without it, the same action gets logged three ways and reports quietly diverge. This page describes how tracking-plan governance keeps an analytics schema trustworthy.
GTM server container issues
A server-side Google Tag Manager container receives requests at an endpoint you host, transforms them with client and tag templates, and forwards to vendors. The extra hop adds failure modes: the transport URL must point at the server container, a client must claim each incoming request, and the server must be reachable. This page covers the common server-container issues that quietly drop or duplicate data.
Safari 7-day cookie cap
Apple's Intelligent Tracking Prevention caps the lifetime of cookies set via document.cookie in JavaScript at seven days, and at 24 hours when the visit arrives from a domain classified as a tracker through a link decorated with query parameters or a fragment. First-party analytics cookies set this way expire early, so returning visitors look new and attribution windows shorten. This page explains the cap and its effect on data quality.
GA4 vs Ads conversion timing
GA4 and Google Ads can report the same conversion in different time buckets because they anchor it to different moments. GA4 attributes a conversion to the day the event happened; Google Ads, for its conversion columns, can credit it to the day of the click that led to it. Over a long enough date range the totals tend to converge, but day by day they diverge. This page explains conversion-timing differences between the two tools.
History-change double counting
Single-page apps signal navigation through the History API, and GTM's History Change trigger listens for it. Some routers call pushState and then emit their own navigation event, or fire popstate alongside a manual push, so the trigger fires more than once per route change. Each firing sends a virtual pageview, inflating counts. This page explains history-change double counting and how to make the trigger fire once.
BigQuery streaming vs daily tables
The GA4 BigQuery export produces two table types: events_intraday_YYYYMMDD streaming tables that fill through the day, and events_YYYYMMDD daily tables finalized afterward. A row can appear in intraday and then again in the consolidated daily table. Querying both, or assuming intraday is complete, distorts counts. This page distinguishes streaming and daily export tables and how to query them safely.
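One way to avoid reading the same day twice is to pick a single table per date, preferring the finalized daily table when it exists. The sketch below works on a plain list of table names; the tables_to_query helper is hypothetical, and fetching the table list from BigQuery is left out.

```python
def tables_to_query(table_names):
    """Return one table per date: the daily table if present, else intraday."""
    daily = {}
    intraday = {}
    for name in table_names:
        # Check the longer prefix first, since "events_" also matches intraday names.
        if name.startswith("events_intraday_"):
            intraday[name[len("events_intraday_"):]] = name
        elif name.startswith("events_"):
            daily[name[len("events_"):]] = name
    chosen = dict(intraday)
    chosen.update(daily)  # a daily table replaces the intraday one for the same date
    return [chosen[date] for date in sorted(chosen)]


selected = tables_to_query([
    "events_20240101",
    "events_intraday_20240101",
    "events_intraday_20240102",
])
```

Here the first date resolves to its daily table and the second date, which has no daily table yet, falls back to intraday and should be treated as incomplete.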
BigQuery intraday table schema
GA4's events_intraday_ streaming table is not always schema-identical to the finalized events_ daily table. Some fields populated after processing — certain attribution, session, or derived columns — may be empty or absent intraday. Queries written against the daily schema can fail or return nulls against intraday. This page explains intraday schema differences so real-time queries do not silently lose fields.
Event parameter unnesting in BigQuery
In the GA4 BigQuery export, event_params is an array of key/value records where the value lives in one of string_value, int_value, float_value, or double_value. Reading a parameter requires UNNEST plus a key filter, and doing it carelessly multiplies rows so event counts inflate. This page explains how to unnest event parameters correctly and why the wrong join over-counts.
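The row multiplication can be illustrated without BigQuery. The sketch below models events as Python dictionaries shaped like the export's event_params array (the sample data and function names are made up) and contrasts a flatten-everything join with a per-event key lookup, which is the in-memory analogue of a scalar subquery over UNNEST.

```python
# Made-up sample rows shaped like the export's nested event_params field.
events = [
    {"event_name": "page_view", "event_params": [
        {"key": "page_location", "value": {"string_value": "https://example.com/"}},
        {"key": "ga_session_id", "value": {"int_value": 111}},
    ]},
    {"event_name": "page_view", "event_params": [
        {"key": "page_location", "value": {"string_value": "https://example.com/a"}},
        {"key": "ga_session_id", "value": {"int_value": 111}},
        {"key": "engagement_time_msec", "value": {"int_value": 0}},
    ]},
]

VALUE_FIELDS = ("string_value", "int_value", "float_value", "double_value")


def cross_join_rows(rows):
    """Analogue of joining events to UNNEST(event_params): one row per parameter."""
    return [(e["event_name"], p["key"]) for e in rows for p in e["event_params"]]


def param_value(event, key):
    """Analogue of a scalar lookup: at most one value per event for the given key."""
    for param in event["event_params"]:
        if param["key"] == key:
            for field in VALUE_FIELDS:
                # Compare against None so that a legitimate 0 is not skipped.
                if param["value"].get(field) is not None:
                    return param["value"][field]
    return None


flattened_count = len(cross_join_rows(events))  # counts parameters, not events
event_count = len(events)                       # the true number of events
```

Counting rows after the flatten gives the number of parameters, not the number of events, which is how a careless join inflates event counts.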
BigQuery cost and quota limits
BigQuery on-demand pricing bills by bytes scanned, and the service enforces quotas on concurrent and daily activity. GA4 export tables are date-sharded, so a query that ignores the date suffix scans every day and runs up cost; quota limits can reject jobs at peak. This page explains how cost and quotas affect GA4 export work and how to keep scans and jobs bounded.
BigQuery user_id vs pseudo_id
In the GA4 BigQuery export, user_pseudo_id is the device/instance identifier and user_id is the optional ID you set for logged-in users. They count different things: pseudo_id resets when storage clears, while user_id can unify a person across devices. Treating them interchangeably miscounts users. This page explains the two identifiers and how each affects user counts in the export.
Measurement Protocol api_secret handling
GA4's Measurement Protocol requires an api_secret paired with a measurement ID to accept events. The secret is a shared property-level key, not a per-user credential, so wherever it leaks — client bundles, public repos, logs — anyone can send events as you. This page explains how to handle the api_secret and why its exposure, not its existence, is the data-quality risk.
Fake event protection
Fabricated events reach analytics through the Measurement Protocol, replayed beacons, or scripted bots. Because collection endpoints accept well-formed requests by default, defense relies on validation: allow-listing hostnames, checking event shape, and flagging implausible patterns. This page describes layered protections that keep fake events out of trusted totals without claiming any single control is foolproof.
Server-side event validation
Server-side collection gives one place to validate every event before it is stored or forwarded. Checks fall into shape (does it match the tracking plan), type (are values the right kind), and plausibility (is the sequence possible). Rejecting or quarantining failures keeps malformed and fabricated data out of downstream tables. This page describes how server-side event validation gates an analytics pipeline.
Duplicate pageviews in SPAs
Single-page apps often send a pageview on initial load and another from a route-change listener, and on the very first view both can fire for the same URL. Strict-mode double renders, mounting effects that run twice, and duplicate listeners add more. The result is inflated pageviews concentrated on entry pages. This page explains duplicate SPA pageviews and how to fire exactly one per route.
Hash and hashbang routing gaps
Some single-page apps route via the URL fragment (#/path or the older #! hashbang). The fragment after # is never sent to the server, and changing it does not trigger a full page load, so server logs and pageview scripts that rely on full navigations miss these route changes. The result is undercounted views and a landing URL stuck at the entry path. This page explains hash-routing measurement gaps.
First-party cookie expiry
A first-party analytics cookie carries a client identifier whose lifetime depends on its configured expiry, the browser's caps, and the user clearing storage. When it expires, the same person starts as a new visitor and prior-visit linkage is lost. Configured lifetimes are upper bounds browsers can shorten. This page explains the forces that govern first-party cookie expiry and their effect on counts.
Data contracts for events
A data contract is a versioned, machine-checkable agreement between the team that emits an event and the teams that consume it, fixing field names, types, semantics, and compatibility rules. Unlike a wiki page, it is enforced in CI or at the boundary, so a producer cannot ship a breaking change unnoticed. This page explains data contracts and how they protect analytics from upstream drift.
Event schema enforcement
A schema only protects data quality if something enforces it. Enforcement validates each event against the declared schema and decides what happens on failure — reject, drop a field, or quarantine. It can run in CI against instrumentation, in the SDK, or at the collection boundary. This page explains where to enforce an event schema and the trade-offs of each point.
Semantic versioning for events
Semantic versioning gives event schemas a shared vocabulary for change: a major bump means a breaking change, minor means a backward-compatible addition, patch means a fix that does not alter shape. Tagging events and their schemas this way tells consumers whether an upgrade is safe to ignore or requires work. This page maps semver to event-schema changes and how it coordinates producers and consumers.
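A rough sketch of that mapping, assuming a schema is represented as a dictionary of field name to type name and that every added field is optional (both are simplifications, and required_bump is a hypothetical helper):

```python
def required_bump(old_schema, new_schema):
    """Classify a schema change as 'major', 'minor', or 'patch'."""
    removed = old_schema.keys() - new_schema.keys()
    retyped = {
        field
        for field in old_schema.keys() & new_schema.keys()
        if old_schema[field] != new_schema[field]
    }
    if removed or retyped:
        return "major"  # existing consumers can break
    if new_schema.keys() - old_schema.keys():
        return "minor"  # additive change, assumed optional
    return "patch"      # shape unchanged


v1 = {"order_id": "string", "value": "float"}
bump_added = required_bump(v1, {"order_id": "string", "value": "float", "coupon": "string"})
bump_retyped = required_bump(v1, {"order_id": "int", "value": "float"})
bump_same = required_bump(v1, dict(v1))
```

A check like this could run in CI to flag a change whose declared version bump is smaller than the shape change requires.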
Conversion count vs event count
A key event (conversion) in GA4 is derived from an event, but the conversion total need not equal the raw count of that event. The counting method (every event versus once per session, historically labelled 'each time' versus 'one per session') and de-duplication separate the two numbers. This page explains why conversions and the events behind them diverge and how to read each.
View vs engaged-view conversions
An engaged-view conversion credits a conversion to a video ad the user watched for a qualifying duration without clicking, distinct from a click-through conversion. The two answer different questions, and summing them or comparing a click-only tool to an engaged-view-inclusive one overstates credit. This page explains view versus engaged-view conversions and how to avoid double-counting them.
Currency conversion timing
When events arrive in different currencies, analytics converts each to the property's reporting currency using an exchange rate tied to a date. Which date — event day, processing day, prior-day rate — determines the converted total, so the same orders can sum to different revenue depending on timing. This page explains how currency-conversion timing affects revenue figures and reconciliation.
ETL and pipeline failures
Analytics often flows through an ETL/ELT pipeline that extracts events, transforms them, and loads them into reporting tables. A failure at any stage — a timed-out extract, a transform exception, a half-written load — leaves data partial or stale, and if the failure is silent it reads as a genuine traffic dip. This page explains ETL failure modes and how to tell a pipeline gap from a real one.
Backfill and reprocessing
When a pipeline misses data or processes it with a bug, backfilling re-runs it over the affected window to correct the record. Done carelessly, a backfill appends rows on top of existing ones and double-counts, or it overwrites good data with a still-buggy transform. This page explains how to reprocess a window safely so corrections fix the gap instead of creating a new one.
Idempotency and dedup keys
Distributed pipelines typically guarantee at-least-once delivery, so the same event can arrive twice from retries, replays, or backfills. An idempotency key, a stable and unique identifier per event, lets the pipeline recognize a repeat and keep exactly one copy, so re-processing does not inflate counts. This page explains idempotency and de-duplication keys and how to choose one that survives the whole pipeline.
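A minimal in-memory sketch of key-based de-duplication follows. The event_id field name is an assumption; a production pipeline would keep the seen-key set in durable storage, not in process memory.

```python
def deduplicate(events, key_field="event_id"):
    """Keep the first event seen for each idempotency key, preserving order."""
    seen = set()
    unique = []
    for event in events:
        key = event[key_field]
        if key in seen:
            continue  # a retry, replay, or backfill of an event already kept
        seen.add(key)
        unique.append(event)
    return unique


delivered = [
    {"event_id": "a1", "name": "purchase"},
    {"event_id": "b2", "name": "page_view"},
    {"event_id": "a1", "name": "purchase"},  # redelivered after a retry
]
kept = deduplicate(delivered)
```

Running the function a second time over its own output returns the same list, which is the property that makes reprocessing safe.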
Schema evolution for events
Event schemas must change as products evolve, but a careless change breaks every consumer reading the old shape. Schema evolution is the discipline of changing shape compatibly: additive changes that old readers ignore, backward compatibility so new code reads old data, forward compatibility so old code tolerates new data. This page explains compatible schema evolution for analytics events.
Event naming collisions
When two teams independently use the same event name for different actions — or reuse a platform's reserved name — the analytics tool merges them into one stream. A 'submit' event that means a newsletter signup in one place and a checkout in another becomes an uninterpretable blend. This page explains event naming collisions, including reserved-name clashes, and how namespacing prevents them.
PII redaction in the pipeline
Personal data leaks into analytics through URLs, free-text parameters, and over-eager instrumentation. Redacting it at the collection boundary — before storage — is more reliable than deleting it later. Techniques include allow-listing permitted fields, scrubbing known patterns (emails, tokens), and stripping query parameters. This page explains pipeline-level PII redaction and why the boundary is the right place for it.
Consent state in the pipeline
Whether an event may be processed for analytics or ads depends on the visitor's consent at collection time. If that consent state is not captured on the event and carried through every pipeline stage, downstream jobs cannot honor it — they may store or forward data the user declined. This page explains propagating consent state through a pipeline so processing matches what was granted.
Dead-letter queue for events
When an event cannot be processed — it fails validation, throws in a transform, or repeatedly errors — a dead-letter queue (DLQ) holds it instead of discarding it. The DLQ preserves the data for inspection and replay, and its depth is a live signal that something upstream broke. This page explains how a dead-letter queue protects analytics completeness and surfaces failures.
Event ordering guarantees
Events that happen in one order can arrive in another: parallel transport, retries, and varied network paths reorder them. Analyses that assume arrival order — funnels, first-touch, session sequencing — then draw wrong conclusions. This page explains why ordering is not guaranteed in distributed collection and how event timestamps and partition keys let you reconstruct true order.
Watermarking late data
Streaming analytics groups events into time windows, but some events arrive late — buffered offline, delayed in transit. A watermark is the pipeline's estimate of how far event time has progressed, used to decide when a window is complete enough to emit. Set it too tight and late events are dropped; too loose and results lag. This page explains watermarking and its trade-off for late-arriving data.
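The trade-off can be sketched with a simple heuristic watermark: the largest event time seen so far minus an allowed lateness. This is one common approach, not the only one, and the split_late function and the integer timestamps below are illustrative assumptions.

```python
def split_late(event_times, allowed_lateness):
    """Split events, given in arrival order, into on-time and late lists."""
    max_seen = None
    on_time, late = [], []
    for event_time in event_times:
        if max_seen is None or event_time > max_seen:
            max_seen = event_time
        # The watermark trails the newest event time by the allowed lateness.
        watermark = max_seen - allowed_lateness
        if event_time < watermark:
            late.append(event_time)
        else:
            on_time.append(event_time)
    return on_time, late


arrivals = [100, 105, 90, 110, 104]              # event times in arrival order
tight_on_time, tight_late = split_late(arrivals, allowed_lateness=5)
loose_on_time, loose_late = split_late(arrivals, allowed_lateness=30)
```

With the tighter setting two events fall behind the watermark; with the looser one none do, at the cost of holding windows open longer.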
Data freshness SLAs
A data freshness SLA states the maximum acceptable lag between when an event happens and when it is queryable — for example, dashboards no more than an hour behind. Measuring freshness and alerting when it slips turns silent staleness into a known, bounded condition. This page explains freshness SLAs and how to monitor data age so decisions are not made on stale numbers.
Monitoring event volume anomalies
The fastest signal that instrumentation broke is usually event volume: a deploy that drops a tag halves an event count overnight; an injection spike doubles it. Monitoring volume per event type against its recent norm catches these before anyone reads a wrong report. This page explains anomaly monitoring on event volume and how to separate breakage from genuine change.