
Building a Service Health Observability Framework on Google Cloud

Introduction

Google Cloud’s Personalized Service Health (PSH) provides a tailored, real-time view of the operational status of the Google Cloud services and locations relevant to your specific projects. Merely having access to this data in the GCP Console, however, is insufficient for modern Site Reliability Engineering (SRE) teams, who require automated, immediate, and actionable alerts to maintain strict Service Level Agreements (SLAs).

In this deep dive, we outline a robust, event-driven Service Health Observability Framework built natively on GCP. By aggregating PSH events across your entire organization, querying them centrally in BigQuery, and deploying a stateless Python monitoring application on Cloud Run Jobs, you can drive threaded, deduplicated incident updates directly into corporate communication platforms such as Google Chat.

The Challenge: Organization-Wide Blind Spots

In large-scale enterprise environments running thousands of loosely coupled microservices across dozens of disconnected GCP projects, a single localized regional outage (e.g., a Cloud SQL disruption in us-central1) can trigger cascading downstream failures.

Without an aggregated observability methodology, incident response teams typically fall into one of two traps: relying solely on delayed, generic emails, or suffering "alert fatigue" from fragmented, repetitive notifications hitting ChatOps pipelines without any stateful context.

Common Pitfall: Do not rely on reactive, decentralized email alerts configured on a per-project basis. Attempting to manage hundreds of bespoke notification channels leads to misconfigurations, missed critical escalations, and a total lack of cross-organizational analytics capability.

The Solution: An Event-Driven Observability Framework

Our strategy rejects localized monitoring in favor of a strictly centralized approach: capture everything via an organization-level log sink, persist it in BigQuery for historical SLA analysis, and run intelligent parsing code against the data to construct actionable, human-readable alerts.

graph LR
  subgraph Source["📡 Event Sources"]
    direction TB
    P1(["Project A"])
    P2(["Project B"])
    P3(["Project N"])
  end
  subgraph Ingestion["📥 Centralized Ingestion"]
    direction TB
    CL{{"Cloud Logging"}}
    LS{{"Org-Level Log Sink"}}
    CL --> LS
  end
  subgraph Storage["🗄️ Persistence"]
    direction TB
    BQ[("BigQuery")]
    GCS[("Cloud Storage")]
  end
  subgraph Compute["⚙️ Processing Engine (Cloud Run)"]
    direction TB
    SCHED["Cloud Scheduler"]
    subgraph Pipeline["Internal Pipeline"]
      direction TB
      CHK["Checkpointing"]
      EP["Event Processor"]
      IT["Incident Tracker"]
      GA["Alert Dispatcher"]
      TB["State Manager"]
      BS["Sync Engine"]
    end
    SCHED -- "Cron Trigger" --> CHK
    CHK -- "Time Window" --> EP
    EP -- "Scenario A/B/C" --> IT
    IT --> GA
    GA --> TB
    TB --> BS
  end
  subgraph Alert["🔔 Alerting"]
    direction TB
    GCHAT["Google Chat"]
    PD["PagerDuty"]
    EMAIL["Email / SMTP"]
    SN["ServiceNow"]
  end
  subgraph Analytics["📊 Analytics"]
    direction TB
    PBI["Power BI"]
    DASH["SLA Dashboards"]
  end
  P1 -- "PSH Events" --> CL
  P2 -- "PSH Events" --> CL
  P3 -- "PSH Events" --> CL
  LS -- "Export" --> BQ
  EP -- "Query" --> BQ
  BS -- "MERGE Sync" --> BQ
  BS -- "State Backup" --> GCS
  GA -- "Cards V2" --> GCHAT
  GA -- "On-Call" --> PD
  GA -- "Notifications" --> EMAIL
  IT -- "Auto-Incident" --> SN
  BQ -- "SQL Analytics" --> PBI
  BQ -- "Metrics" --> DASH
  style Source fill:#1a1a2e,stroke:#64ffda,stroke-width:2px,color:#ccd6f6
  style Ingestion fill:#16213e,stroke:#64ffda,stroke-width:2px,color:#ccd6f6
  style Storage fill:#0f3460,stroke:#64ffda,stroke-width:2px,color:#ccd6f6
  style Compute fill:#1a1a2e,stroke:#e94560,stroke-width:2px,color:#ccd6f6
  style Pipeline fill:#12122a,stroke:#ff6b81,stroke-width:1px,stroke-dasharray: 3 3,color:#ccd6f6
  style Alert fill:#16213e,stroke:#e94560,stroke-width:2px,color:#ccd6f6
  style Analytics fill:#0f3460,stroke:#00b4d8,stroke-width:2px,color:#ccd6f6
Figure 1: PSH Observability Framework — End-to-end event-driven architecture from source ingestion to alerting and analytics.

Step 1: Aggregating Events at the Organization Level

To achieve a unified view, centralize logs from all child projects into a single organization-level sink that exports directly to BigQuery.

  1. Using the gcloud CLI, create an organization-wide log sink that exports to your centralized monitoring project (the organization is specified with the --organization flag, and the destination is the positional argument):
gcloud logging sinks create organization-health-sink \
  bigquery.googleapis.com/projects/YOUR_CENTRAL_PROJECT_ID/datasets/organization_health_logs \
  --organization=YOUR_ORGANIZATION_ID \
  --include-children \
  --log-filter='resource.type="project" AND (protoPayload.serviceName="servicehealth.googleapis.com" OR textPayload:"Google Cloud incident" OR jsonPayload.type="com.google.cloud.servicehealth")'

Note: Ensure the Service Account automatically created for this log sink (its writerIdentity) is granted the roles/bigquery.dataEditor IAM role on the destination dataset in your central project.
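As a sketch, the writer identity can be looked up and granted write access at the project level (a narrower dataset-level grant is also possible); the sink and project names are the placeholders used above:

```shell
# Look up the sink's auto-created writer identity
# (returned as "serviceAccount:...", ready to use as an IAM member).
WRITER=$(gcloud logging sinks describe organization-health-sink \
  --organization=YOUR_ORGANIZATION_ID \
  --format='value(writerIdentity)')

# Grant it BigQuery write access in the central monitoring project.
gcloud projects add-iam-policy-binding YOUR_CENTRAL_PROJECT_ID \
  --member="$WRITER" \
  --role="roles/bigquery.dataEditor"
```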

Step 2: Deriving Critical Infrastructure Insights

Once routed to BigQuery, PSH records transform from transient alerts into queryable analytical assets. Our framework runs a SQL query that extracts only the events within the current execution window (bounded by ISO 8601 timestamps), discarding irrelevant logs:

WITH FilteredLogs AS (
  SELECT
    insertId, receiveTimestamp, timestamp,
    jsonPayload_v1_eventlog.endtime, jsonPayload_v1_eventlog.impactedProducts,
    jsonPayload_v1_eventlog.description, jsonPayload_v1_eventlog.starttime,
    jsonPayload_v1_eventlog.updatetime, jsonPayload_v1_eventlog.detailedState,
    jsonPayload_v1_eventlog.state, jsonPayload_v1_eventlog.title,
    resource.labels.event_id, resource.type, logName,
    labels.servicehealth_googleapis_com_updated_fields
  FROM `YOUR_PROJECT.YOUR_DATASET.servicehealth_googleapis_com_activity`
  WHERE timestamp >= TIMESTAMP('{start_time_iso}')
    AND timestamp <= TIMESTAMP('{end_time_iso}')
),
GroupedByState AS (
  SELECT event_id, state,
    ARRAY_AGG(STRUCT(insertId, receiveTimestamp, timestamp, endtime, /* ... */) ORDER BY updatetime ASC) AS state_logs
  FROM FilteredLogs GROUP BY event_id, state
)
SELECT event_id,
  ARRAY_AGG(STRUCT(state, state_logs) ORDER BY CASE WHEN state = 'ACTIVE' THEN 1 WHEN state = 'CLOSED' THEN 2 ELSE 3 END ASC) AS grouped_events
FROM GroupedByState GROUP BY event_id;

Notice how this query uses ARRAY_AGG to group logs first by event_id and then by state. Because Service Health emits multiple logs for the same event over its lifetime, this grouping means our Python engine only needs to evaluate the most actionable (most recent) subset.

Step 3: The Monitoring Engine — How It Works Internally

The heart of this framework is a deterministic, stateless monitoring engine deployed as a Cloud Run Job and triggered on a fixed schedule by Cloud Scheduler. Every execution cycle follows a strict internal pipeline of six discrete processors, each handling a single responsibility. Here is the complete data flow from invocation to output:

Checkpointing & Time Windowing

The first stage retrieves the last successful execution timestamp from a dedicated BigQuery checkpoints table. This timestamp creates a precise time window — the engine will only query for Service Health events that arrived after the last checkpoint and before the current execution time. If no prior checkpoint exists (first-ever run), the engine defaults to a configurable lookback period. This guarantees zero missed events and zero duplicate processing across runs.
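A minimal sketch of the windowing logic (the checkpoint itself would be read from the BigQuery checkpoints table; the 24-hour first-run lookback is an assumed default):

```python
from datetime import datetime, timedelta, timezone

# Assumed default: how far back to look when no checkpoint exists yet.
DEFAULT_LOOKBACK = timedelta(hours=24)

def compute_window(last_checkpoint, now=None):
    """Derive the [start, end] query window for one execution cycle.

    If no prior checkpoint exists (first-ever run), fall back to a fixed
    lookback so the engine still scans a bounded window.
    """
    now = now or datetime.now(timezone.utc)
    start = last_checkpoint or (now - DEFAULT_LOOKBACK)
    return start.isoformat(), now.isoformat()
```

The returned ISO 8601 strings are substituted into the BigQuery query's `{start_time_iso}` / `{end_time_iso}` placeholders; on success, the end timestamp becomes the next run's checkpoint.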

Event Processing & Intelligent Classification

Within the established time window, the engine queries BigQuery for raw Personalized Service Health logs and transforms the deeply nested response (events grouped by ID and state) into a flat, actionable format. It always prioritizes the CLOSED state group when it exists, ensuring incidents are resolved properly even if overlapping active/closed logs arrive simultaneously.
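A sketch of the flattening rule, assuming the grouped shape returned by the SQL above (the state / state_logs field names mirror the query; the CLOSED-first preference is the behavior described here):

```python
def flatten_incident(grouped_events):
    """Collapse an incident's state groups into its single most actionable log.

    CLOSED takes priority over ACTIVE so a resolution is never missed when
    both state groups arrive in the same window. Within a group, logs are
    ordered by updatetime ASC, so the last element is the freshest.
    """
    by_state = {g["state"]: g["state_logs"] for g in grouped_events}
    logs = by_state.get("CLOSED") or by_state.get("ACTIVE") or []
    return logs[-1] if logs else None
```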

Each flattened event then passes through a rigorous classification filter that examines two dimensions: whether the event has been previously tracked, and whether any fields have been modified since the last notification. This produces three deterministic scenarios:

  • Scenario A — New Incident: The event has never been seen before. The engine dispatches a full "Incident Created" alert with complete incident details, impact data, and direct-link action buttons.
  • Scenario B — Duplicate / No Change: The event is already tracked and nothing has changed. Suppressed silently to prevent alert fatigue.
  • Scenario C — Incident Update: The event is tracked but one or more fields have mutated. The engine dispatches a concise "Incident Updated" delta alert showing only the fields that changed, not the entire payload.
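The three scenarios reduce to a small pure function; here is a sketch, assuming the tracking state maps each event ID to the update timestamp of the last alert sent:

```python
def classify_event(event_id, update_time, tracked):
    """Return the dispatch scenario for one flattened event."""
    if event_id not in tracked:
        return "A"  # new incident: full "Incident Created" alert
    if tracked[event_id] == update_time:
        return "B"  # duplicate, no change: suppress silently
    return "C"      # tracked but mutated: concise delta alert
```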

Incident Tracking & Persistence

Every incident is persisted as a standalone JSON record keyed by its unique event ID. The engine computes a union of all impacted project IDs across both active and closed state logs, providing a comprehensive impact footprint. As events transition from active to resolved, incident records are automatically migrated between active and closed storage directories.
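The impact-footprint union can be sketched as follows (impactedProjects is an assumed field name on each log record; the real log schema may differ):

```python
def impact_footprint(grouped_events):
    """Union impacted project IDs across all state groups of one incident."""
    projects = set()
    for group in grouped_events:
        for log in group["state_logs"]:
            projects.update(log.get("impactedProjects", []))
    return sorted(projects)
```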

Multi-Channel Alert Dispatch

The alert dispatcher translates raw incident data into richly formatted Google Chat Cards V2 payloads featuring visual severity markers (🚨 New, 🔄 Updated, ✅ Resolved), dynamic impact summaries, collapsible 3-column grids for large-scale outages, and interactive buttons linking directly to the GCP Console. All messages within the same incident are automatically threaded using the GCP event ID, keeping conversations organized.
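A condensed sketch of such a payload (the function and field contents are illustrative; the cardsV2 envelope follows the Chat message shape, and real payloads carry many more widgets than shown here):

```python
# Visual severity markers matching the framework's alert types.
SEVERITY_ICONS = {"NEW": "🚨", "UPDATED": "🔄", "RESOLVED": "✅"}

def build_chat_card(event_id, title, status, console_url):
    """Build a minimal Google Chat Cards V2 message, threaded by event ID."""
    return {
        "cardsV2": [{
            "cardId": event_id,
            "card": {
                "header": {"title": f"{SEVERITY_ICONS[status]} {title}"},
                "sections": [{
                    "widgets": [{
                        "buttonList": {"buttons": [{
                            "text": "Open in GCP Console",
                            "onClick": {"openLink": {"url": console_url}},
                        }]},
                    }],
                }],
            },
        }],
        # Reusing the PSH event ID as the thread key keeps every message
        # for one incident in the same Chat thread.
        "thread": {"threadKey": event_id},
    }
```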

Beyond Google Chat, the same normalized payload simultaneously fans out to PagerDuty for on-call escalation, Email/SMTP for notification distribution, and ServiceNow for automatic ITSM incident creation. This ensures consistent, high-signal alerts across every operational channel from a single processing point.

State Management & Idempotent Sync

After all events are processed, the engine bulk-persists the current tracking state (mapping every event ID to its last known update time) to a dedicated BigQuery tracking table. This gives the next execution cycle full contextual awareness of every previously seen event.

The final stage performs an idempotent sync: each incident record is hashed (SHA-256), and a MERGE operation against BigQuery ensures that only rows whose content has actually mutated are written. This eliminates redundant writes and keeps BigQuery costs minimal. The working data is then bulk-uploaded back to Cloud Storage and all ephemeral local state is cleaned up — leaving the engine truly stateless and ready for the next invocation.
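The hashing step can be sketched as below; the accompanying MERGE statement is illustrative only (table and column names are assumptions, not the framework's actual schema):

```python
import hashlib
import json

def content_hash(incident):
    """Stable SHA-256 over the canonical JSON form of an incident record.

    Sorting keys and stripping whitespace makes the hash independent of
    dict ordering, so only genuine content changes alter it.
    """
    canonical = json.dumps(incident, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical MERGE: only rows whose content hash changed are written.
MERGE_SQL = """
MERGE `{project}.{dataset}.incidents` T
USING UNNEST(@rows) S
ON T.event_id = S.event_id
WHEN MATCHED AND T.content_hash != S.content_hash THEN
  UPDATE SET record = S.record, content_hash = S.content_hash
WHEN NOT MATCHED THEN
  INSERT (event_id, record, content_hash)
  VALUES (S.event_id, S.record, S.content_hash)
"""
```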

Step 4: Business Intelligence & Executive Dashboards

Operational alerting is only half the equation. By persisting all Organization-level Service Health logs permanently in our centralized BigQuery dataset, we unlock powerful analytical capabilities. Power BI connects directly to BigQuery via SQL queries, powering real-time executive dashboards that track historical GCP stability, analyze SLA performance across regions, and correlate infrastructure incidents with business impact over time. This transforms raw incident telemetry into strategic decision-making data for leadership.

Key Takeaways

  • Stateless Execution: The entire engine runs ephemerally on Cloud Run Jobs. All state is decoupled into BigQuery (tracking, checkpoints, incidents) and Cloud Storage (incident files), with transient local directories used only during execution.
  • Idempotent by Design: SHA-256 content hashing ensures the sync engine only writes to BigQuery when incident data has genuinely changed, eliminating redundant operations and keeping costs predictable.
  • Intelligent Deduplication: The three-scenario classification filter (New / Duplicate / Update) prevents alert fatigue by suppressing unchanged events and dispatching concise delta-only updates when fields mutate.
  • Enterprise-Grade Alerting: Formatted Google Chat Cards V2 with visual severity cues, conversation threading, PagerDuty on-call routing, email notifications, and ServiceNow auto-ticketing — all driven from a single processing pipeline.
  • Analytics-Ready: Every incident persisted in BigQuery is immediately queryable by Power BI and other BI tools for SLA dashboards, trend analysis, and executive reporting.

Conclusion

This Service Health Observability Framework transforms enterprise incident management from reactive, siloed scrambling into a fully automated, centralized operation. By combining Cloud Logging aggregation, intelligent event classification, idempotent state management, and multi-channel alerting — all running statelessly on Cloud Run — we achieve guaranteed, real-time awareness of every Google Cloud Service Health event across the entire organization. The result is a system where incidents are detected, classified, and communicated within minutes, not hours, giving operations teams and leadership the confidence that nothing slips through the cracks.

Further Reading