AWS ECS + Lambda
Observability Case Study
Unified telemetry patterns for mixed-runtime workloads on AWS. This walkthrough demonstrates how distributed traces, logs, and metrics are correlated in Datadog across ECS and Lambda services built with PHP and TypeScript.
- Platform: AWS ECS + Lambda
- Telemetry: Datadog APM, logs, metrics
- Stack: PHP + TypeScript services
- Outcome: Faster incident triage
The work focused on making observability first-class across mixed runtimes instead of treating monitoring as an afterthought. Instrumentation was introduced at the application layer, trace context was propagated across sync and async boundaries, and incident workflows were connected directly to alerting channels so responders could move from detection to root cause quickly.
| Service | Runtime | P95 latency | Error rate | Current bottleneck |
| --- | --- | --- | --- | --- |
| Ingress API | TypeScript ECS | 180ms | 0.2% | Gateway auth check |
| Pricing Service | PHP ECS | 320ms | 0.4% | DB query fan-out |
| Event Worker | TypeScript Lambda | 140ms | 0.1% | Cold start spikes |
| Billing API | PHP ECS | 410ms | 0.5% | Third-party HTTP retries |
For the ECS services, Datadog tracing was integrated at the application layer in both PHP and TypeScript, so each request path emitted spans with consistent service names, environment tags, and version metadata. This made cross-service traces searchable by deployment version and reduced ambiguity during incident triage.
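As a minimal sketch of that bootstrap for one of the TypeScript ECS services (the `ingress-api` name and the fallback values are illustrative, not the project's actual configuration):

```typescript
// tracer.ts — minimal dd-trace bootstrap for a TypeScript ECS service.
// Service name, env, and version shown here are illustrative; in practice
// they can also come from DD_SERVICE / DD_ENV / DD_VERSION set in the ECS
// task definition.
import tracer from 'dd-trace';

tracer.init({
  service: 'ingress-api',                      // consistent service name
  env: process.env.DD_ENV ?? 'prod',           // environment tag
  version: process.env.DD_VERSION ?? '1.0.0',  // deployment version tag
  logInjection: true,                          // stamp trace IDs into logs
});

export default tracer;
```

Because dd-trace patches libraries as they are loaded, a module like this has to be imported before the HTTP framework and client libraries it is meant to instrument.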
Lambda handlers were instrumented with tracing libraries to capture invocation-level spans and downstream dependencies (queues, APIs, storage). Correlation IDs were passed through event payloads so async fan-out chains stayed connected inside Datadog APM views instead of fragmenting into isolated events.
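A sketch of that handler pattern in TypeScript, assuming the datadog-lambda-js wrapper and a hypothetical `correlationId` field in each message body:

```typescript
// handler.ts — sketch of a traced Lambda handler that keeps async fan-out
// chains connected. The `correlationId` field is a hypothetical payload
// convention, not the case study's actual schema.
import { datadog } from 'datadog-lambda-js';
import tracer from 'dd-trace';
import type { SQSEvent } from 'aws-lambda';

async function processEvents(event: SQSEvent): Promise<void> {
  for (const record of event.Records) {
    const body = JSON.parse(record.body);
    // Tag the active span so Datadog can connect the producer's trace
    // with this consumer's spans instead of showing isolated events.
    tracer.scope().active()?.setTag('correlation_id', body.correlationId);
    // ...downstream calls (queues, APIs, storage) emit child spans here.
  }
}

// The datadog() wrapper creates the invocation-level span and flushes
// telemetry before the execution environment is frozen.
export const handler = datadog(processEvents);
```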
OpenTelemetry conventions were used to normalize span attributes and log fields, and log pipelines were consolidated so responders could pivot between traces, metrics, and log lines from a single view. Standardized keys (service, env, tenant, operation, request ID) improved search quality for production debugging and post-incident review.
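A sketch of what such a log helper might look like on the TypeScript side; the field names mirror the standardized keys above, but the exact JSON shape is an assumption:

```typescript
// log.ts — sketch of a structured log helper emitting the standardized
// keys described above. The JSON shape and snake_case field names are
// assumptions; only the key list itself comes from this write-up.
import tracer from 'dd-trace';

interface LogContext {
  tenant?: string;
  operation?: string;
  requestId?: string;
}

export function logEvent(message: string, ctx: LogContext = {}): void {
  const span = tracer.scope().active();
  console.log(
    JSON.stringify({
      message,
      service: process.env.DD_SERVICE,
      env: process.env.DD_ENV,
      tenant: ctx.tenant,
      operation: ctx.operation,
      request_id: ctx.requestId,
      // Trace/span IDs let Datadog pivot from this log line to the trace.
      dd: span
        ? {
            trace_id: span.context().toTraceId(),
            span_id: span.context().toSpanId(),
          }
        : undefined,
    }),
  );
}
```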
Error monitors were tuned around actionable thresholds (error classes, latency regressions, and retry storms) and routed into Slack with context links to traces and logs. On-call responders could jump directly to failing spans, confirm blast radius, and coordinate mitigation without manually reconstructing the timeline.
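As an illustration, a monitor of that kind could be provisioned through Datadog's TypeScript API client roughly as follows; the metric names, thresholds, and Slack handle are placeholders rather than the production values:

```typescript
// monitors.ts — sketch of provisioning an error-rate monitor with Slack
// routing via Datadog's TypeScript API client. Metric names, thresholds,
// and the @slack- handle are placeholders, not the production values.
import { client, v1 } from '@datadog/datadog-api-client';

async function createErrorRateMonitor(): Promise<void> {
  const configuration = client.createConfiguration(); // reads DD_API_KEY / DD_APP_KEY
  const monitorsApi = new v1.MonitorsApi(configuration);

  await monitorsApi.createMonitor({
    body: {
      name: 'pricing-service error rate',
      type: 'query alert',
      // Alert when the 10-minute error rate crosses 1%; warn at 0.5%.
      query:
        'sum(last_10m):sum:trace.web.request.errors{service:pricing-service}.as_count() ' +
        '/ sum:trace.web.request.hits{service:pricing-service}.as_count() > 0.01',
      message:
        'Error rate regression on pricing-service. Trace and log links are ' +
        'attached to this event. @slack-oncall-observability',
      options: {
        thresholds: { critical: 0.01, warning: 0.005 },
        notifyNoData: false,
      },
    },
  });
}

createErrorRateMonitor().catch(console.error);
```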
Across all four active services, the aggregate p95 latency was 263ms.