ContextFlow
Adobe-Livestream-style real-time clickstream pipeline — synthetic web events at 1k+ QPS, side-input geo enrichment, BigQuery Storage Write API, partitioned + clustered for cheap dashboard scans.
Try it — publish a live event
Click the button to publish a synthetic clickstream event to
clickstream-raw-topic on GCP Pub/Sub. The streaming
Dataflow job picks it up, enriches it via side input, and writes it to
contextflow.clickstream_events within ~10 s.
How it works — 4-step data flow
Ingest
Google's Streaming_Data_Generator Flex template emits
CLICKSTREAM events into Pub/Sub at 1.1k QPS. The button above
publishes the same shape via Lambda.
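What the Lambda (or any client) publishes can be sketched as below — the payload mirrors the 6 original columns in the schema table, while the commented-out Pub/Sub calls, project ID, and page URL are placeholders, not the repo's actual code:

```python
import json
import random
import uuid
from datetime import datetime, timezone

def make_event() -> dict:
    """Build a synthetic clickstream event with the 6 original fields."""
    return {
        "user_id": f"u-{random.randint(1000, 9999)}",
        "session_id": str(uuid.uuid4()),
        "page_url": "/products/42",
        "event_name": random.choice(["page_view", "add_to_cart", "purchase"]),
        "ip_address": f"{random.randint(1, 223)}.0.2.15",
        "event_timestamp": datetime.now(timezone.utc).isoformat(),
    }

# Actually publishing needs google-cloud-pubsub plus credentials (sketch only):
#   from google.cloud import pubsub_v1
#   publisher = pubsub_v1.PublisherClient()
#   topic = publisher.topic_path("my-project", "clickstream-raw-topic")
#   publisher.publish(topic, json.dumps(make_event()).encode("utf-8"))
```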
Parse
ParseClickstreamFn normalises camelCase ↔ snake_case
and coerces event_timestamp to a Beam
Timestamp (Storage Write API rejects raw strings).
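The two transforms inside ParseClickstreamFn can be sketched without Beam — the helper names below are illustrative, not the pipeline's actual ones, and the real DoFn would wrap the final value in a Beam Timestamp:

```python
import re
from datetime import datetime

def camel_to_snake(name: str) -> str:
    """eventTimestamp -> event_timestamp."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

def coerce_timestamp(value: str) -> datetime:
    """Parse an ISO-8601 string into a timezone-aware datetime.

    The Storage Write API rejects raw strings for TIMESTAMP columns,
    so the string must be parsed before the write stage.
    """
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def parse_event(raw: dict) -> dict:
    """Normalise keys and coerce the timestamp field."""
    out = {camel_to_snake(k): v for k, v in raw.items()}
    out["event_timestamp"] = coerce_timestamp(out["event_timestamp"])
    return out
```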
Enrich
geo_lookup.csv is loaded once from GCS and broadcast
as pvalue.AsDict(...). Every event picks up
country + city via first-octet match on
ip_address — zero per-event RPCs.
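A minimal sketch of the lookup, assuming a CSV keyed by first octet (the sample rows below are invented, and in the real pipeline the dict is broadcast to workers via pvalue.AsDict rather than held locally):

```python
import csv
import io

# Stand-in for enrichment/geo_lookup.csv in GCS (columns assumed).
GEO_CSV = """first_octet,country,city
203,Australia,Sydney
81,Japan,Tokyo
"""

def load_geo_map(csv_text: str) -> dict:
    """Build the {first_octet: (country, city)} dict — loaded once, never per event."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["first_octet"]: (row["country"], row["city"]) for row in reader}

def enrich(event: dict, geo: dict) -> dict:
    """First-octet match on ip_address; a dict lookup, no network call."""
    octet = event["ip_address"].split(".")[0]
    country, city = geo.get(octet, ("unknown", "unknown"))
    return {**event, "country": country, "city": city}
```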
Land
WriteToBigQuery(method=STORAGE_WRITE_API,
with_auto_sharding=True, triggering_frequency=10) streams
into a DAY-partitioned, (event_name, country)-clustered
table.
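The sink configuration might look like the fragment below — a sketch, not the repo's code: the project ID and schema are elided, and the partitioning/clustering parameters only apply if the pipeline creates the table itself (the repo's Terraform may provision it instead):

```python
import apache_beam as beam

write = beam.io.WriteToBigQuery(
    table="PROJECT:contextflow.clickstream_events",
    method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,
    with_auto_sharding=True,   # let Dataflow pick the stream count
    triggering_frequency=10,   # seconds between flushes to BigQuery
    additional_bq_parameters={
        "timePartitioning": {"type": "DAY", "field": "event_timestamp"},
        "clustering": {"fields": ["event_name", "country"]},
    },
)
```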
Architecture
Three clouds, one repo. GCP runs the pipeline + warehouse. AWS hosts this page + the public test trigger. Looker Studio embeds the live report below.
```mermaid
flowchart LR
  subgraph SRC["Event sources"]
    SDG["Dataflow<br/>Streaming_Data_Generator<br/>1.1k QPS"]
    WEB["cfa.vinhnx.ca<br/>(this page)"]
  end
  subgraph AWS["AWS — public test trigger"]
    CF["CloudFront + S3<br/>static frontend"]
    APIGW["API Gateway<br/>POST /publish"]
    L["Lambda publish_event<br/>(WIF to GCP)"]
  end
  subgraph GCP["Google Cloud"]
    PS[("Pub/Sub<br/>clickstream-raw-topic")]
    SUB[("Subscription<br/>clickstream-raw-sub")]
    GCS[("GCS<br/>enrichment/geo_lookup.csv")]
    subgraph DF["Apache Beam on Cloud Dataflow"]
      direction TB
      PARSE["ParseClickstreamFn"]
      GEOSI[("Side Input<br/>pvalue.AsDict")]
      ENRICH["EnrichGeoFn"]
      WRITE["WriteToBigQuery<br/>STORAGE_WRITE_API"]
    end
    BQ[("BigQuery<br/>contextflow.clickstream_events<br/>DAY-partitioned · clustered")]
    LKR["Looker Studio<br/>events per city"]
  end
  SDG --> PS
  WEB --> CF
  WEB -- "POST /publish" --> APIGW --> L --> PS
  PS --> SUB --> PARSE --> ENRICH --> WRITE --> BQ
  GCS --> GEOSI -.->|broadcast| ENRICH
  BQ -.->|DirectQuery| LKR
  classDef gcp fill:#E8F0FE,stroke:#1A73E8,color:#0B57D0;
  classDef aws fill:#FFF4E5,stroke:#FF9900,color:#7A3E00;
  class PS,SUB,GCS,PARSE,GEOSI,ENRICH,WRITE,BQ,LKR gcp;
  class CF,APIGW,L aws;
```
Event schema
8 columns total: 6 original from the publisher + 2 enriched by the side input.
| # | Column | Type | Source | Notes |
|---|---|---|---|---|
| 1 | user_id | STRING | original | e.g. u-1234 |
| 2 | session_id | STRING | original | |
| 3 | page_url | STRING | original | |
| 4 | event_name | STRING | original | cluster key |
| 5 | ip_address | STRING | original | enrichment lookup key |
| 6 | event_timestamp | TIMESTAMP | original | partition column (DAY) |
| 7 | country | STRING | enriched | side-input first-octet match |
| 8 | city | STRING | enriched | same lookup as country |
Live — events per city
Live Looker Studio report bound to the BigQuery view
contextflow.v_events_per_city (last 7 days, pre-aggregated
per (date, country, city, event_name)). Re-queries on
every page load.
Design highlights

- Side-input enrichment — the geo CSV is loaded once at construction and broadcast as pvalue.AsDict(beam.Create(items)). Zero per-event network calls; refreshing requires a drain + restart.
- Storage Write API + auto-sharding — highest sustained throughput and lowest cost for streaming Dataflow → BQ at 1k+ QPS, vs the legacy STREAMING_INSERTS.
- Partition routing — DAY partitions on event_timestamp + clustering on (event_name, country) let dashboards prune to cluster blocks, not full partitions.
- Streaming detach — submit calls p.run() and exits immediately on DataflowRunner; otherwise CI hangs forever waiting for a never-finishing streaming job.
- Budget kill-switch — a Cloud Function detaches billing if the project crosses $20/day. Defense in depth for a personal demo.
Repo layout
- pipeline/ — Beam DAG + custom options
- terraform/ — Pub/Sub, BQ, GCS, IAM, budget kill-switch
- aws/ — CloudFront + S3 + API Gateway + Lambda (this page)
- bi/ — Looker Studio runbook
- sql/ — partition monitor + validation queries
- .github/workflows/ — CI tests + 3-cloud deploy pipelines
Links
- Source: github.com/nxv-can/cfa
- Spec: SPECS.md
- Author: vinhnx.ca