Observability Case Study

AWS + Datadog Observability Modernization

Unified telemetry patterns for mixed-runtime workloads on AWS. This walkthrough demonstrates how distributed traces, logs, and metrics are correlated in Datadog across ECS and Lambda services built with PHP and TypeScript.

Cloud Scope: AWS ECS + Lambda

Telemetry: Datadog APM, logs, metrics

Runtime Mix: PHP + TypeScript services

Primary Goal: Faster incident triage

How this was implemented in practice

The work focused on making observability first-class across mixed runtimes instead of treating monitoring as an afterthought. Instrumentation was introduced at the application layer, trace context was propagated across sync and async boundaries, and incident workflows were connected directly to alerting channels so responders could move from detection to root cause quickly.

Trace explorer snapshot

Service          Runtime              P95 latency   Error rate   Current bottleneck
Ingress API      TypeScript (ECS)     180ms         0.2%         Gateway auth check
Pricing Service  PHP (ECS)            320ms         0.4%         DB query fan-out
Event Worker     TypeScript (Lambda)  140ms         0.1%         Cold start spikes
Billing API      PHP (ECS)            410ms         0.5%         Third-party HTTP retries

Datadog package implementation in ECS apps

For ECS services, Datadog tracing was integrated inside the PHP and TypeScript application layer so each request path emitted spans with consistent service names, environment tags, and version metadata. This made cross-service traces searchable by deployment version and reduced ambiguity during incident triage.
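As a sketch of this pattern in the TypeScript services, the unified service tags can be resolved from Datadog's standard environment variables before the tracer is initialized. DD_SERVICE, DD_ENV, and DD_VERSION are Datadog's unified service tagging variables; the resolveTags helper and its fallback values are illustrative, not the production code:

```typescript
// Resolve Datadog unified service tags from the task environment.
// DD_SERVICE / DD_ENV / DD_VERSION are Datadog's standard tagging variables;
// the helper and its fallback values are a sketch.
type Env = Record<string, string | undefined>;

function resolveTags(env: Env): { service: string; env: string; version: string } {
  return {
    service: env.DD_SERVICE ?? "unknown-service",
    env: env.DD_ENV ?? "dev",
    version: env.DD_VERSION ?? "0.0.0",
  };
}

// With the real dd-trace package, initialization would then look like:
//   import tracer from "dd-trace";
//   tracer.init(resolveTags(process.env));
```

Resolving the version tag at startup is what makes traces searchable by deployment, since every span emitted by the task carries the same version value.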

Lambda instrumentation for traces and spans

Lambda handlers were instrumented with tracing libraries to capture invocation-level spans and downstream dependencies (queues, APIs, storage). Correlation IDs were passed through event payloads so async fan-out chains stayed connected inside Datadog APM views instead of fragmenting into isolated events.
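The async correlation described above can be sketched as follows; the envelope shape and field names are illustrative assumptions, not a Datadog requirement:

```typescript
import { randomUUID } from "crypto";

// Envelope carrying a correlation ID alongside the business payload so the
// ID survives queue hops between producers and Lambda consumers.
interface Envelope<T> {
  correlationId: string;
  payload: T;
}

// Producer side: attach an existing ID (to continue a chain) or mint one.
function withCorrelation<T>(payload: T, correlationId?: string): Envelope<T> {
  return { correlationId: correlationId ?? randomUUID(), payload };
}

// Consumer side: a Lambda handler reads the same ID back and tags its spans
// and logs with it, so Datadog can stitch the async fan-out together.
function handleEvent<T>(event: Envelope<T>): { correlationId: string; payload: T } {
  return { correlationId: event.correlationId, payload: event.payload };
}
```

The key point is that the same ID flows through every hop of the fan-out; the consumer then sets it as a span tag so the chain stays connected in APM views rather than fragmenting.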

OpenTelemetry and logging consolidation

OpenTelemetry conventions were used to normalize span attributes and log fields, then logs were consolidated so traces, metrics, and log lines could be pivoted from one screen. Standardized keys (service, env, tenant, operation, request ID) improved search quality for production debugging and post-incident review.
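A minimal sketch of a log line carrying the standardized keys (the field names come from the list above; the logger itself is illustrative):

```typescript
// Emit one JSON log line carrying the standardized correlation fields so
// traces, metrics, and logs can be pivoted on shared keys.
interface LogContext {
  service: string;
  env: string;
  tenant?: string;
  operation?: string;
  requestId?: string;
}

function logLine(ctx: LogContext, level: "info" | "warn" | "error", message: string): string {
  // Spread the context first so level/message cannot be clobbered by it.
  return JSON.stringify({ ...ctx, level, message });
}
```

Because every service emits the same key names, a search like service:pricing-service with a given request ID returns the matching log lines and their correlated traces from one screen.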

Error tracing and Slack-driven incident response

Error monitors were tuned around actionable thresholds (error classes, latency regressions, and retry storms) and routed into Slack with context links to traces and logs. On-call responders could jump directly to failing spans, confirm blast radius, and coordinate mitigation without manually reconstructing the timeline.
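As an illustrative sketch, a monitor payload of the kind created through the Datadog monitors API could look like the object below. The metric name, threshold, and Slack channel are assumptions, not the production values; in Datadog, an @slack-&lt;channel&gt; handle in the monitor message routes the notification to Slack:

```typescript
// Sketch of a Datadog "query alert" monitor routed to Slack.
// Metric name, threshold, and @slack- handle are illustrative.
const billingErrorMonitor = {
  name: "billing-api error rate above threshold",
  type: "query alert",
  query:
    "sum(last_5m):sum:trace.web.request.errors{service:billing-api,env:prod}.as_count() > 25",
  message: [
    "{{#is_alert}}Error spike on billing-api: check third-party retry storms.{{/is_alert}}",
    "@slack-incident-response",
  ].join("\n"),
  tags: ["service:billing-api", "env:prod", "severity:high"],
};
```

Scoping the query by service and env tags is what lets the Slack notification deep-link straight to the failing spans for that deployment.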

Implementation notes

- Unified trace IDs and service naming conventions across PHP-FPM ECS tasks and TypeScript Lambda handlers.
- Added environment and deployment tags to make release regressions visible in APM views.
- Standardized alert routes by severity to separate SLO drift from transient noise.

Snapshot summary

Across all four services, aggregate p95 latency is 263ms.
