ContextFlow

Adobe-Livestream-style real-time clickstream pipeline — synthetic web events at 1k+ QPS, side-input geo enrichment, BigQuery Storage Write API, partitioned + clustered for cheap dashboard scans.

Pub/Sub Apache Beam Cloud Dataflow BigQuery (Storage Write API) Looker Studio CloudFront + S3 API Gateway + Lambda Workload Identity Federation Power BI (optional)
- 1,100 QPS sustained
- ~10 s end-to-end latency
- 8 cols (6 raw + 2 enriched)
- DAY partition + 2-key cluster
- 3 clouds, 1 deploy

Try it — publish a live event

Click the button to publish a synthetic clickstream event to clickstream-raw-topic on GCP Pub/Sub. The streaming Dataflow job picks it up, enriches it via side input, and writes it to contextflow.clickstream_events within ~10 s.

How it works — 4-step data flow

1. Ingest

Google's Streaming_Data_Generator Flex template emits CLICKSTREAM events into Pub/Sub at 1.1k QPS. The button above publishes the same shape via Lambda.
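The Lambda behind the button can be sketched roughly as below. The topic name comes from this page; the project id, handler shape, and field values are illustrative, and the sketch assumes `google-cloud-pubsub` plus Workload Identity Federation credentials are available in the Lambda environment:

```python
import json
import random
import time

# Placeholder project id — the topic name matches the diagram above.
TOPIC = "projects/YOUR_PROJECT/topics/clickstream-raw-topic"

def make_event() -> dict:
    """Build one synthetic clickstream event in the 6-column raw shape."""
    return {
        "user_id": f"u-{random.randint(1000, 9999)}",
        "session_id": f"s-{random.randint(100000, 999999)}",
        "page_url": "/products/42",
        "event_name": random.choice(["page_view", "add_to_cart", "purchase"]),
        "ip_address": f"{random.randint(1, 223)}.0.2.15",
        "event_timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }

def handler(event, context):
    """Lambda entry point: publish one synthetic event to Pub/Sub."""
    from google.cloud import pubsub_v1  # lazy import keeps cold starts cheap
    publisher = pubsub_v1.PublisherClient()
    payload = json.dumps(make_event()).encode("utf-8")
    publisher.publish(TOPIC, payload).result()
    return {"statusCode": 200, "body": "published"}
```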

2. Parse

ParseClickstreamFn normalises camelCase ↔ snake_case and coerces event_timestamp to a Beam Timestamp (Storage Write API rejects raw strings).
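A minimal standard-library sketch of those two normalisation steps (the helper names are mine, not the pipeline's; the real DoFn wraps the numeric result in a Beam Timestamp):

```python
import re
from datetime import datetime

def camel_to_snake(name: str) -> str:
    """eventTimestamp -> event_timestamp; already-snake keys pass through."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

def coerce_timestamp(value: str) -> float:
    """Parse an RFC 3339 string into unix seconds, the numeric form a
    Beam Timestamp wraps (the Storage Write API rejects raw strings)."""
    return datetime.fromisoformat(value.replace("Z", "+00:00")).timestamp()

def normalise(event: dict) -> dict:
    out = {camel_to_snake(k): v for k, v in event.items()}
    out["event_timestamp"] = coerce_timestamp(out["event_timestamp"])
    return out
```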

3. Enrich

geo_lookup.csv is loaded once from GCS and broadcast as pvalue.AsDict(...). Every event picks up country + city via first-octet match on ip_address — zero per-event RPCs.
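The first-octet join can be sketched like this; the octet values and cities below are made up, and `pvalue.AsDict` simply hands the enrichment DoFn this same mapping as a plain dict on every worker:

```python
# Illustrative geo table: first octet of ip_address -> (country, city).
# In the pipeline this mapping is built once from geo_lookup.csv on GCS
# and broadcast via beam.pvalue.AsDict, so no per-event RPCs are needed.
GEO = {
    "81": ("United Kingdom", "London"),
    "133": ("Japan", "Tokyo"),
    "203": ("Australia", "Sydney"),
}

def enrich(event: dict, geo: dict) -> dict:
    """Attach country + city by matching the first octet of ip_address."""
    first_octet = event["ip_address"].split(".")[0]
    country, city = geo.get(first_octet, ("Unknown", "Unknown"))
    return {**event, "country": country, "city": city}
```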

4. Land

WriteToBigQuery(method=STORAGE_WRITE_API, with_auto_sharding=True, triggering_frequency=10) streams into a DAY-partitioned, (event_name, country)-clustered table, flushing every 10 s.
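Spelled out as the full Beam sink (a sketch: the project id is a placeholder, the schema string is assembled from the schema table on this page, and parameter spellings should be checked against your Beam version):

```python
import apache_beam as beam

write = beam.io.WriteToBigQuery(
    "YOUR_PROJECT:contextflow.clickstream_events",
    schema=("user_id:STRING,session_id:STRING,page_url:STRING,"
            "event_name:STRING,ip_address:STRING,event_timestamp:TIMESTAMP,"
            "country:STRING,city:STRING"),
    method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,
    with_auto_sharding=True,   # let the service choose stream parallelism
    triggering_frequency=10,   # flush every 10 seconds
    additional_bq_parameters={
        # Applied only when the sink creates the table.
        "timePartitioning": {"type": "DAY", "field": "event_timestamp"},
        "clustering": {"fields": ["event_name", "country"]},
    },
)
```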

Architecture

Three clouds, one repo. GCP runs the pipeline + warehouse. AWS hosts this page + the public test trigger. Looker Studio embeds the live report below.

```mermaid
flowchart LR
    subgraph SRC["Event sources"]
        SDG["Dataflow
Streaming_Data_Generator
1.1k QPS"]
        WEB["cfa.vinhnx.ca
(this page)"]
    end
    subgraph AWS["AWS — public test trigger"]
        CF["CloudFront + S3
static frontend"]
        APIGW["API Gateway
POST /publish"]
        L["Lambda publish_event
(WIF to GCP)"]
    end
    subgraph GCP["Google Cloud"]
        PS[("Pub/Sub
clickstream-raw-topic")]
        SUB[("Subscription
clickstream-raw-sub")]
        GCS[("GCS
enrichment/geo_lookup.csv")]
        subgraph DF["Apache Beam on Cloud Dataflow"]
            direction TB
            PARSE["ParseClickstreamFn"]
            GEOSI[("Side Input
pvalue.AsDict")]
            ENRICH["EnrichGeoFn"]
            WRITE["WriteToBigQuery
STORAGE_WRITE_API"]
        end
        BQ[("BigQuery
contextflow.clickstream_events
DAY-partitioned · clustered")]
        LKR["Looker Studio
events per city"]
    end
    SDG --> PS
    WEB --> CF
    WEB -- "POST /publish" --> APIGW --> L --> PS
    PS --> SUB --> PARSE --> ENRICH --> WRITE --> BQ
    GCS --> GEOSI -.->|broadcast| ENRICH
    BQ -.->|DirectQuery| LKR
    classDef gcp fill:#E8F0FE,stroke:#1A73E8,color:#0B57D0;
    classDef aws fill:#FFF4E5,stroke:#FF9900,color:#7A3E00;
    class PS,SUB,GCS,PARSE,GEOSI,ENRICH,WRITE,BQ,LKR gcp;
    class CF,APIGW,L aws;
```

Event schema

8 columns total: 6 original from the publisher + 2 enriched by the side input.

| # | Column | Type | Source | Notes |
|---|--------|------|--------|-------|
| 1 | user_id | STRING | original | e.g. u-1234 |
| 2 | session_id | STRING | original | |
| 3 | page_url | STRING | original | |
| 4 | event_name | STRING | original | cluster key |
| 5 | ip_address | STRING | original | enrichment lookup key |
| 6 | event_timestamp | TIMESTAMP | original | partition column (DAY) |
| 7 | country | STRING | enriched | side-input first-octet match |
| 8 | city | STRING | enriched | same lookup as country |

Live — events per city

Live Looker Studio report bound to the BigQuery view contextflow.v_events_per_city (last 7 days, pre-aggregated per (date, country, city, event_name)). Re-queries on every page load.

If the embed shows "refused to connect", your browser is blocking third-party cookies for lookerstudio.google.com. Use the "open in new tab" link above.

Design highlights

Repo layout

Links