
Building a Service Health Observability Framework on Google Cloud

Introduction

Google Cloud’s Personalized Service Health (PSH) provides a tailored, real-time view of the operational status of the Google Cloud services and locations relevant to your specific projects. Merely having access to this data in the GCP Console, however, is insufficient for modern Site Reliability Engineering (SRE) teams, who require automated, immediate, and actionable alerts to maintain strict Service Level Agreements (SLAs).

In this deep dive, we outline a robust, event-driven Service Health Observability Framework built natively on GCP. By aggregating PSH events across your entire organization, querying them centrally in BigQuery, and deploying a stateless Python monitoring application on Cloud Run Jobs, you can drive threaded, deduplicated incident updates directly into corporate communication platforms such as Google Chat.

The Challenge: Organization-Wide Blind Spots

In large-scale enterprise environments running thousands of loosely coupled microservices across dozens of disconnected GCP projects, a single localized regional outage (e.g., a Cloud SQL disruption in us-central1) can trigger cascading downstream failures.

Without an aggregated observability methodology, incident response teams typically fall into one of two traps: relying solely on delayed, generic emails, or suffering "alert fatigue" from fragmented, repetitive notifications hitting ChatOps pipelines without any stateful context.

Common Pitfall: Do not rely on reactive, decentralized email alerts configured on a per-project basis. Attempting to manage hundreds of bespoke notification channels leads to misconfigurations, missed critical escalations, and a total lack of cross-organizational analytics capability.

The Solution: An Event-Driven Observability Framework

Our strategy rejects localized monitoring in favor of a strictly centralized approach: capture everything via an organization-level log sink, persist it in BigQuery for historical SLA analysis, and run intelligent parsing code against the data to construct actionable, human-readable alerts.

graph LR
  subgraph Source["📡 Event Sources"]
    direction TB
    P1(["Project A"])
    P2(["Project B"])
    P3(["Project N"])
  end
  subgraph Ingestion["📥 Centralized Ingestion"]
    direction TB
    CL{{"Cloud Logging"}}
    LS{{"Org-Level Log Sink"}}
    CL --> LS
  end
  subgraph Storage["🗄️ Persistence"]
    direction TB
    BQ[("BigQuery")]
    GCS[("Cloud Storage")]
  end
  subgraph Compute["⚙️ Processing Engine (Cloud Run)"]
    direction TB
    SCHED["Cloud Scheduler"]
    subgraph Pipeline["Internal Pipeline"]
      direction TB
      CHK["Checkpointing"]
      EP["Event Processor"]
      IT["Incident Tracker"]
      GA["Alert Dispatcher"]
      TB["State Manager"]
      BS["Sync Engine"]
    end
    SCHED -- "Cron Trigger" --> CHK
    CHK -- "Time Window" --> EP
    EP -- "Scenario A/B/C" --> IT
    IT --> GA
    GA --> TB
    TB --> BS
  end
  subgraph Alert["🔔 Alerting"]
    direction TB
    GCHAT["Google Chat"]
    PD["PagerDuty"]
    EMAIL["Email / SMTP"]
    SN["ServiceNow"]
  end
  subgraph Analytics["📊 Analytics"]
    direction TB
    PBI["Power BI"]
    DASH["SLA Dashboards"]
  end
  P1 -- "PSH Events" --> CL
  P2 -- "PSH Events" --> CL
  P3 -- "PSH Events" --> CL
  LS -- "Export" --> BQ
  EP -- "Query" --> BQ
  BS -- "MERGE Sync" --> BQ
  BS -- "State Backup" --> GCS
  GA -- "Cards V2" --> GCHAT
  GA -- "On-Call" --> PD
  GA -- "Notifications" --> EMAIL
  IT -- "Auto-Incident" --> SN
  BQ -- "SQL Analytics" --> PBI
  BQ -- "Metrics" --> DASH
  style Source fill:#1a1a2e,stroke:#64ffda,stroke-width:2px,color:#ccd6f6
  style Ingestion fill:#16213e,stroke:#64ffda,stroke-width:2px,color:#ccd6f6
  style Storage fill:#0f3460,stroke:#64ffda,stroke-width:2px,color:#ccd6f6
  style Compute fill:#1a1a2e,stroke:#e94560,stroke-width:2px,color:#ccd6f6
  style Pipeline fill:#12122a,stroke:#ff6b81,stroke-width:1px,stroke-dasharray: 3 3,color:#ccd6f6
  style Alert fill:#16213e,stroke:#e94560,stroke-width:2px,color:#ccd6f6
  style Analytics fill:#0f3460,stroke:#00b4d8,stroke-width:2px,color:#ccd6f6
Figure 1: PSH Observability Framework — End-to-end event-driven architecture from source ingestion to alerting and analytics.

Step 1: Aggregating Events at the Organization Level

To achieve a unified view, centralize logs from all child projects into a single organization-level sink that exports directly to BigQuery.

  1. Using the gcloud CLI, create an organization-wide log sink that exports to your centralized monitoring project (the organization is specified with the --organization flag, and the destination is the positional argument):
gcloud logging sinks create organization-health-sink \
  bigquery.googleapis.com/projects/YOUR_CENTRAL_PROJECT_ID/datasets/organization_health_logs \
  --organization=YOUR_ORGANIZATION_ID \
  --include-children \
  --log-filter='resource.type="project" AND (protoPayload.serviceName="servicehealth.googleapis.com" OR textPayload:"Google Cloud incident" OR jsonPayload.type="com.google.cloud.servicehealth")'

Note: Ensure the Service Account automatically created for this log sink (its writerIdentity) is granted the roles/bigquery.dataEditor IAM role on the destination dataset in your central project.
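As a sketch, the writer identity can be looked up and granted write access at the project level (a narrower dataset-level grant is also possible); the sink and project names are the placeholders used above:

```shell
# Look up the sink's auto-created writer identity
# (returned as "serviceAccount:...", ready to use as an IAM member).
WRITER=$(gcloud logging sinks describe organization-health-sink \
  --organization=YOUR_ORGANIZATION_ID \
  --format='value(writerIdentity)')

# Grant it BigQuery write access in the central monitoring project.
gcloud projects add-iam-policy-binding YOUR_CENTRAL_PROJECT_ID \
  --member="$WRITER" \
  --role="roles/bigquery.dataEditor"
```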

Step 2: Deriving Critical Infrastructure Insights

Once routed to BigQuery, PSH records transform from transient alerts into queryable analytical assets. Our framework runs a SQL query that extracts only the events within the current execution window (bounded by ISO 8601 timestamps), discarding irrelevant logs:

WITH FilteredLogs AS (
  SELECT
    insertId, receiveTimestamp, timestamp,
    jsonPayload_v1_eventlog.endtime, jsonPayload_v1_eventlog.impactedProducts,
    jsonPayload_v1_eventlog.description, jsonPayload_v1_eventlog.starttime,
    jsonPayload_v1_eventlog.updatetime, jsonPayload_v1_eventlog.detailedState,
    jsonPayload_v1_eventlog.state, jsonPayload_v1_eventlog.title,
    resource.labels.event_id, resource.type, logName,
    labels.servicehealth_googleapis_com_updated_fields
  FROM `YOUR_PROJECT.YOUR_DATASET.servicehealth_googleapis_com_activity`
  WHERE timestamp >= TIMESTAMP('{start_time_iso}')
    AND timestamp <= TIMESTAMP('{end_time_iso}')
),
GroupedByState AS (
  SELECT event_id, state,
    ARRAY_AGG(STRUCT(insertId, receiveTimestamp, timestamp, endtime, /* ... */) ORDER BY updatetime ASC) AS state_logs
  FROM FilteredLogs GROUP BY event_id, state
)
SELECT event_id,
  ARRAY_AGG(STRUCT(state, state_logs) ORDER BY CASE WHEN state = 'ACTIVE' THEN 1 WHEN state = 'CLOSED' THEN 2 ELSE 3 END ASC) AS grouped_events
FROM GroupedByState GROUP BY event_id;

Notice how this query uses ARRAY_AGG to group logs first by event_id and then by state. Because Service Health emits multiple logs for the same event over its lifetime, this grouping means our Python engine only needs to evaluate the most actionable (most recent) subset.

Step 3: The Monitoring Engine — How It Works Internally

The heart of this framework is a deterministic, stateless monitoring engine deployed as a Cloud Run Job and triggered on a fixed schedule by Cloud Scheduler. Every execution cycle follows a strict internal pipeline of six discrete processors, each handling a single responsibility. Here is the complete data flow from invocation to output:

Checkpointing & Time Windowing

The first stage retrieves the last successful execution timestamp from a dedicated BigQuery checkpoints table. This timestamp creates a precise time window — the engine will only query for Service Health events that arrived after the last checkpoint and before the current execution time. If no prior checkpoint exists (first-ever run), the engine defaults to a configurable lookback period. This guarantees zero missed events and zero duplicate processing across runs.
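A minimal sketch of the windowing logic (the checkpoint itself would be read from the BigQuery checkpoints table; the 24-hour first-run lookback is an assumed default):

```python
from datetime import datetime, timedelta, timezone

# Assumed default: how far back to look when no checkpoint exists yet.
DEFAULT_LOOKBACK = timedelta(hours=24)

def compute_window(last_checkpoint, now=None):
    """Derive the [start, end] query window for one execution cycle.

    If no prior checkpoint exists (first-ever run), fall back to a fixed
    lookback so the engine still scans a bounded window.
    """
    now = now or datetime.now(timezone.utc)
    start = last_checkpoint or (now - DEFAULT_LOOKBACK)
    return start.isoformat(), now.isoformat()
```

The returned ISO 8601 strings are substituted into the BigQuery query's `{start_time_iso}` / `{end_time_iso}` placeholders; on success, the end timestamp becomes the next run's checkpoint.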

Event Processing & Intelligent Classification

Within the established time window, the engine queries BigQuery for raw Personalized Service Health logs and transforms the deeply nested response (events grouped by ID and state) into a flat, actionable format. It always prioritizes the CLOSED state group when it exists, ensuring incidents are resolved properly even if overlapping active/closed logs arrive simultaneously.
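A sketch of the flattening rule, assuming the grouped shape returned by the SQL above (the state / state_logs field names mirror the query; the CLOSED-first preference is the behavior described here):

```python
def flatten_incident(grouped_events):
    """Collapse an incident's state groups into its single most actionable log.

    CLOSED takes priority over ACTIVE so a resolution is never missed when
    both state groups arrive in the same window. Within a group, logs are
    ordered by updatetime ASC, so the last element is the freshest.
    """
    by_state = {g["state"]: g["state_logs"] for g in grouped_events}
    logs = by_state.get("CLOSED") or by_state.get("ACTIVE") or []
    return logs[-1] if logs else None
```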

Each flattened event then passes through a rigorous classification filter that examines two dimensions: whether the event has been previously tracked, and whether any fields have been modified since the last notification. This produces three deterministic scenarios:

  • Scenario A — New Incident: The event has never been seen before. The engine dispatches a full "Incident Created" alert with complete incident details, impact data, and direct-link action buttons.
  • Scenario B — Duplicate / No Change: The event is already tracked and nothing has changed. Suppressed silently to prevent alert fatigue.
  • Scenario C — Incident Update: The event is tracked but one or more fields have mutated. The engine dispatches a concise "Incident Updated" delta alert showing only the fields that changed, not the entire payload.
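The three scenarios reduce to a small pure function; here is a sketch, assuming the tracking state maps each event ID to the update timestamp of the last alert sent:

```python
def classify_event(event_id, update_time, tracked):
    """Return the dispatch scenario for one flattened event."""
    if event_id not in tracked:
        return "A"  # new incident: full "Incident Created" alert
    if tracked[event_id] == update_time:
        return "B"  # duplicate, no change: suppress silently
    return "C"      # tracked but mutated: concise delta alert
```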

Incident Tracking & Persistence

Every incident is persisted as a standalone JSON record keyed by its unique event ID. The engine computes a union of all impacted project IDs across both active and closed state logs, providing a comprehensive impact footprint. As events transition from active to resolved, incident records are automatically migrated between active and closed storage directories.
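The impact-footprint union can be sketched as follows (impactedProjects is an assumed field name on each log record; the real log schema may differ):

```python
def impact_footprint(grouped_events):
    """Union impacted project IDs across all state groups of one incident."""
    projects = set()
    for group in grouped_events:
        for log in group["state_logs"]:
            projects.update(log.get("impactedProjects", []))
    return sorted(projects)
```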

Multi-Channel Alert Dispatch

The alert dispatcher translates raw incident data into richly formatted Google Chat Cards V2 payloads featuring visual severity markers (🚨 New, 🔄 Updated, ✅ Resolved), dynamic impact summaries, collapsible 3-column grids for large-scale outages, and interactive buttons linking directly to the GCP Console. All messages within the same incident are automatically threaded using the GCP event ID, keeping conversations organized.
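A condensed sketch of such a payload (the function and field contents are illustrative; the cardsV2 envelope follows the Chat message shape, and real payloads carry many more widgets than shown here):

```python
# Visual severity markers matching the framework's alert types.
SEVERITY_ICONS = {"NEW": "🚨", "UPDATED": "🔄", "RESOLVED": "✅"}

def build_chat_card(event_id, title, status, console_url):
    """Build a minimal Google Chat Cards V2 message, threaded by event ID."""
    return {
        "cardsV2": [{
            "cardId": event_id,
            "card": {
                "header": {"title": f"{SEVERITY_ICONS[status]} {title}"},
                "sections": [{
                    "widgets": [{
                        "buttonList": {"buttons": [{
                            "text": "Open in GCP Console",
                            "onClick": {"openLink": {"url": console_url}},
                        }]},
                    }],
                }],
            },
        }],
        # Reusing the PSH event ID as the thread key keeps every message
        # for one incident in the same Chat thread.
        "thread": {"threadKey": event_id},
    }
```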

Beyond Google Chat, the same normalized payload simultaneously fans out to PagerDuty for on-call escalation, Email/SMTP for notification distribution, and ServiceNow for automatic ITSM incident creation. This ensures consistent, high-signal alerts across every operational channel from a single processing point.

State Management & Idempotent Sync

After all events are processed, the engine bulk-persists the current tracking state (mapping every event ID to its last known update time) to a dedicated BigQuery tracking table. This gives the next execution cycle full contextual awareness of every previously seen event.

The final stage performs an idempotent sync: each incident record is hashed (SHA-256), and a MERGE operation against BigQuery ensures that only rows whose content has actually mutated are written. This eliminates redundant writes and keeps BigQuery costs minimal. The working data is then bulk-uploaded back to Cloud Storage and all ephemeral local state is cleaned up — leaving the engine truly stateless and ready for the next invocation.
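The hashing step can be sketched as below; the accompanying MERGE statement is illustrative only (table and column names are assumptions, not the framework's actual schema):

```python
import hashlib
import json

def content_hash(incident):
    """Stable SHA-256 over the canonical JSON form of an incident record.

    Sorting keys and stripping whitespace makes the hash independent of
    dict ordering, so only genuine content changes alter it.
    """
    canonical = json.dumps(incident, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical MERGE: only rows whose content hash changed are written.
MERGE_SQL = """
MERGE `{project}.{dataset}.incidents` T
USING UNNEST(@rows) S
ON T.event_id = S.event_id
WHEN MATCHED AND T.content_hash != S.content_hash THEN
  UPDATE SET record = S.record, content_hash = S.content_hash
WHEN NOT MATCHED THEN
  INSERT (event_id, record, content_hash)
  VALUES (S.event_id, S.record, S.content_hash)
"""
```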

Step 4: Business Intelligence & Executive Dashboards

Operational alerting is only half the equation. By persisting all Organization-level Service Health logs permanently in our centralized BigQuery dataset, we unlock powerful analytical capabilities. Power BI connects directly to BigQuery via SQL queries, powering real-time executive dashboards that track historical GCP stability, analyze SLA performance across regions, and correlate infrastructure incidents with business impact over time. This transforms raw incident telemetry into strategic decision-making data for leadership.

Key Takeaways

  • Stateless Execution: The entire engine runs ephemerally on Cloud Run Jobs. All state is decoupled into BigQuery (tracking, checkpoints, incidents) and Cloud Storage (incident files), with transient local directories used only during execution.
  • Idempotent by Design: SHA-256 content hashing ensures the sync engine only writes to BigQuery when incident data has genuinely changed, eliminating redundant operations and keeping costs predictable.
  • Intelligent Deduplication: The three-scenario classification filter (New / Duplicate / Update) prevents alert fatigue by suppressing unchanged events and dispatching concise delta-only updates when fields mutate.
  • Enterprise-Grade Alerting: Formatted Google Chat Cards V2 with visual severity cues, conversation threading, PagerDuty on-call routing, email notifications, and ServiceNow auto-ticketing — all driven from a single processing pipeline.
  • Analytics-Ready: Every incident persisted in BigQuery is immediately queryable by Power BI and other BI tools for SLA dashboards, trend analysis, and executive reporting.

Conclusion

This Service Health Observability Framework transforms enterprise incident management from reactive, siloed scrambling into a fully automated, centralized operation. By combining Cloud Logging aggregation, intelligent event classification, idempotent state management, and multi-channel alerting — all running statelessly on Cloud Run — we achieve guaranteed, real-time awareness of every Google Cloud Service Health event across the entire organization. The result is a system where incidents are detected, classified, and communicated within minutes, not hours, giving operations teams and leadership the confidence that nothing slips through the cracks.

Further Reading