· nervico-team · cloud-architecture · 11 min read
Monitoring and Observability on AWS: CloudWatch, X-Ray, and Beyond
Practical observability guide for AWS: how to configure CloudWatch, X-Ray, and complementary tools to detect problems before your users do, with real costs and proven patterns.
If your team finds out about a production problem because a customer calls, your observability has failed. It is not a tooling issue. It is a discipline issue: what you measure, how you measure it, and what you do when the numbers change.
Monitoring is knowing that the server is at 95% CPU. Observability is understanding why it is at 95% CPU, which service is causing it, which request triggered it, and how many users are affected. The difference between the two is the difference between knowing there is a fire and knowing where it is, what caused it, and how to put it out.
This article covers the observability tools available on AWS, how to configure them correctly, and the patterns that separate teams who react to incidents from those who prevent them.
The Three Pillars of Observability
Observability stands on three types of data:
Metrics
Numerical values that measure system state over time. CPU, memory, latency, error rate, requests per second. Metrics are storage-efficient and allow you to detect trends and anomalies.
Example: Your API’s P99 latency went from 200ms to 1,500ms in the last 15 minutes.
Logs
Textual records of events occurring in the system. Each log has a timestamp, a severity level, and a message. Logs are the detail: they tell you exactly what happened, in what order, and with what data.
Example: 2025-12-08T14:23:45Z ERROR [PaymentService] Timeout connecting to Stripe API after 30s. OrderId: 12345, UserId: 67890
Traces
Records of a request’s journey through multiple services. A trace shows that the user’s request entered through API Gateway, was processed by Lambda A, which called Lambda B, which queried DynamoDB, which responded in 5ms, but Lambda B took 3 seconds to process the response.
Example: A trace showing that 85% of a request’s latency occurs in a single Lambda function making an inefficient database query.
CloudWatch: The Foundation
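CloudWatch is AWS’s native monitoring service. It is not the most powerful tool on the market, but it has a decisive advantage: it receives the standard metrics of AWS services automatically. For those metrics there are no agents to install, no exporters to configure, and no monitoring infrastructure to maintain. The main exception is OS-level data on EC2, such as memory and disk usage, which requires the CloudWatch agent.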
CloudWatch Metrics
All AWS services publish metrics to CloudWatch automatically:
- EC2: CPUUtilization, NetworkIn, NetworkOut, DiskReadOps.
- Lambda: Invocations, Duration, Errors, Throttles, ConcurrentExecutions.
- RDS: CPUUtilization, DatabaseConnections, FreeStorageSpace, ReadLatency, WriteLatency.
- ALB: RequestCount, TargetResponseTime, HTTPCode_Target_5XX_Count.
- DynamoDB: ConsumedReadCapacityUnits, ConsumedWriteCapacityUnits, ThrottledRequests.
- SQS: ApproximateNumberOfMessagesVisible, ApproximateAgeOfOldestMessage.
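These basic metrics are free. EC2 publishes them at 5-minute resolution by default; enabling Detailed Monitoring gives 1-minute resolution for an additional charge (around $3.50 per instance per month; AWS bills it per metric, so check current pricing for your region).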
Custom metrics: You can publish your own metrics with PutMetricData. Each custom metric costs $0.30 per month. Typical metrics you should publish:
- User registrations per hour.
- Order processing time.
- Real-time conversion rate.
- Pending jobs queue depth.
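As a minimal sketch of publishing one of these custom metrics with PutMetricData, assuming boto3 is installed and credentials are configured. The namespace `MyApp/Business`, the metric name `PendingJobs`, and the `Service` dimension are hypothetical names chosen for illustration:

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit="Count", dimensions=None):
    """Build one entry for the MetricData list of PutMetricData."""
    return {
        "MetricName": name,
        "Value": float(value),
        "Unit": unit,
        "Timestamp": datetime.now(timezone.utc),
        "Dimensions": [
            {"Name": key, "Value": val}
            for key, val in sorted((dimensions or {}).items())
        ],
    }


def publish(namespace, datums, client=None):
    """Send the datums to CloudWatch. boto3 is imported only when needed."""
    if client is None:
        import boto3  # assumption: boto3 installed, region and credentials set
        client = boto3.client("cloudwatch")
    client.put_metric_data(Namespace=namespace, MetricData=datums)


# Hypothetical metric: depth of the pending jobs queue for one worker service.
datum = build_metric_datum("PendingJobs", 42, dimensions={"Service": "order-worker"})
# publish("MyApp/Business", [datum])  # uncomment to send to a real account
```

Each distinct combination of metric name and dimensions counts as a separate custom metric for billing, so it is worth keeping dimension values low-cardinality.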
CloudWatch Alarms
Alarms are the most important CloudWatch component. An alarm monitors a metric and executes actions when it crosses a threshold.
Critical alarm configurations:
Error rate alarm:
Metric: ALB HTTPCode_Target_5XX_Count
Statistic: Sum
Period: 5 minutes
Threshold: > 10 5XX errors in 5 minutes
Action: SNS Notification -> Email + Slack
Latency alarm:
Metric: ALB TargetResponseTime
Statistic: p99
Period: 5 minutes
Threshold: > 2 seconds
Action: SNS Notification -> PagerDuty
Database alarm:
Metric: RDS FreeStorageSpace
Period: 15 minutes
Threshold: < 5 GB
Action: SNS Notification -> Ops team email
Recommendation: Do not create alarms for everything. Create alarms only for metrics that require immediate human action. An alarm that nobody responds to is worse than no alarm, because it trains the team to ignore notifications.
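The latency alarm above could be expressed roughly as follows, as parameters for the CloudWatch `put_metric_alarm` call in boto3. This is a sketch under assumptions: the alarm name, the load balancer dimension value, and the SNS topic ARN are placeholders, and `TreatMissingData` is a choice this example makes rather than something the configuration above specifies:

```python
def latency_alarm_params(load_balancer, sns_topic_arn):
    """Parameters for a p99 latency alarm on an ALB (values from the text)."""
    return {
        "AlarmName": "alb-p99-latency-high",  # hypothetical name
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "TargetResponseTime",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "ExtendedStatistic": "p99",  # percentiles use ExtendedStatistic
        "Period": 300,  # 5 minutes, in seconds
        "EvaluationPeriods": 1,
        "Threshold": 2.0,  # seconds
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # assumption: no traffic is not an incident
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(params, client=None):
    """Create or update the alarm. boto3 is imported only when needed."""
    if client is None:
        import boto3  # assumption: boto3 installed, region and credentials set
        client = boto3.client("cloudwatch")
    client.put_metric_alarm(**params)


params = latency_alarm_params(
    "app/my-alb/0123456789abcdef",  # placeholder load balancer identifier
    "arn:aws:sns:us-east-1:123456789012:oncall",  # placeholder topic ARN
)
# create_alarm(params)  # uncomment to create the alarm in a real account
```

Keeping the parameters in a plain function makes it easy to review alarm definitions in code review before they reach production.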
CloudWatch Logs
CloudWatch Logs stores and allows you to query logs from all your services. The structure is:
- Log Group: Groups logs by service or application. Example: /aws/lambda/my-function, /ecs/my-service.
- Log Stream: Groups logs within the group, usually by instance or invocation.
- Log Events: The individual records.
Essential configuration:
Retention: By default, logs are stored indefinitely. This accumulates cost without generating value. Configure retention policies:
- Production logs: 30-90 days in CloudWatch, export to S3 for long-term retention.
- Development/staging logs: 7-14 days.
- Audit logs (CloudTrail): 1 year minimum, depending on compliance requirements.
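The retention policy above can be applied programmatically. The sketch below assumes boto3 and uses the upper end of each range from the list; the environment names and the mapping itself are illustrative choices, not an AWS convention:

```python
# Retention in days per environment, taken from the ranges in the list above.
RETENTION_DAYS = {
    "production": 90,
    "staging": 14,
    "development": 7,
    "audit": 365,
}


def retention_for(environment):
    """Return the retention period for an environment, in days."""
    return RETENTION_DAYS[environment]


def apply_retention(log_group, environment, client=None):
    """Set the retention policy on one log group."""
    if client is None:
        import boto3  # assumption: boto3 installed, region and credentials set
        client = boto3.client("logs")
    client.put_retention_policy(
        logGroupName=log_group,
        retentionInDays=retention_for(environment),
    )


# apply_retention("/aws/lambda/my-function", "production")
```

CloudWatch Logs only accepts specific retention values (7, 14, 30, 60, 90, and 365 are among them), so arbitrary day counts will be rejected by the API.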
Structured format: Emit logs in JSON format, not plain text. A log like:
{
"timestamp": "2025-12-08T14:23:45Z",
"level": "ERROR",
"service": "payment-service",
"orderId": "12345",
"userId": "67890",
"message": "Timeout connecting to Stripe API",
"duration_ms": 30000
}
allows filtering and searching by any field. A plain text log requires regex to extract the same information.
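One way to emit logs in that shape is a small JSON formatter for Python's standard `logging` module. This is a minimal sketch: the `fields` key passed through `extra` is a convention invented for this example, not part of the logging API:

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, timezone.utc
            ).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Extra structured fields, passed as extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


logger = logging.getLogger("payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("payment-service"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Timeout connecting to Stripe API",
    extra={"fields": {"orderId": "12345", "userId": "67890", "duration_ms": 30000}},
)
```

Because every field is a top-level JSON key, Logs Insights can filter on `orderId` or `duration_ms` directly without parsing the message.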
CloudWatch Logs Insights
Logs Insights is CloudWatch’s query engine. It lets you search for patterns across millions of records in seconds. The syntax is AWS-proprietary (not SQL):
fields @timestamp, @message
| filter @message like /ERROR/
| filter service = "payment-service"
| stats count() as error_count by bin(5m)
| sort error_count desc
This query shows the number of errors per 5-minute interval in the payment service.
Cost: $0.0076 per GB scanned. If your logs occupy 100 GB and you run a query that scans 10 GB, it costs $0.076. Significantly cheaper than maintaining an Elasticsearch cluster.
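The same query can also be run from code, which is useful for scheduled reports. The sketch below assumes a boto3 `logs` client is passed in. Queries are asynchronous, so it polls until the query reaches a terminal status; the `max_polls` limit is a safeguard added for this example:

```python
import time

QUERY = """
fields @timestamp, @message
| filter @message like /ERROR/
| filter service = "payment-service"
| stats count() as error_count by bin(5m)
| sort error_count desc
"""

TERMINAL_STATUSES = ("Complete", "Failed", "Cancelled", "Timeout")


def run_query(client, log_group, query, start, end, poll_seconds=1.0, max_polls=60):
    """Start a Logs Insights query and wait for it to finish.

    start and end are epoch timestamps in seconds.
    """
    query_id = client.start_query(
        logGroupName=log_group,
        startTime=start,
        endTime=end,
        queryString=query,
    )["queryId"]
    for _ in range(max_polls):
        response = client.get_query_results(queryId=query_id)
        if response["status"] in TERMINAL_STATUSES:
            return response
        time.sleep(poll_seconds)
    raise TimeoutError(f"query {query_id} did not finish in time")


# Usage, assuming boto3 and credentials are available:
# import boto3
# now = int(time.time())
# result = run_query(boto3.client("logs"), "/ecs/my-service", QUERY, now - 3600, now)
```

Narrowing the time range is the simplest way to reduce the data scanned, and therefore the cost of each query.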
CloudWatch Dashboards
Dashboards are visual panels that group metrics, logs, and alarms. Each dashboard costs $3 per month.
Minimum recommended dashboard:
- Overall health panel: 5XX error rate, P99 latency, requests per second.
- Database panel: Active connections, read/write latency, free storage.
- Lambda panel (if using serverless): Invocations, errors, throttles, duration.
- Cost panel: Daily spend vs budget, monthly trend.
AWS X-Ray: Distributed Tracing
What X-Ray Solves
When a request crosses multiple services (API Gateway, Lambda, DynamoDB, SQS, another Lambda), identifying the bottleneck without traces is nearly impossible. X-Ray records the complete journey of each request and shows the time each service consumed.
How It Works
- The X-Ray SDK instruments your code. In Node.js, for example, it wraps AWS SDK clients and HTTP libraries to automatically capture calls.
- The X-Ray daemon sends trace segments to the X-Ray service. On Lambda, the daemon is built in. On EC2 you need to install it, and on ECS/Fargate you typically run it as a sidecar container.
- X-Ray correlates segments using a unique Trace ID that propagates between services through HTTP headers (the X-Amzn-Trace-Id header).
Instrumentation in Lambda
For Lambda, instrumentation is straightforward. Enable “Active Tracing” in the function configuration:
aws lambda update-function-configuration \
--function-name my-function \
--tracing-config Mode=Active
This automatically captures:
- Total invocation time.
- Cold start (if it occurs).
- Calls to AWS services (DynamoDB, S3, SQS) if using the instrumented SDK.
To capture external HTTP calls (to third-party APIs), you need to instrument the HTTP client.
Service Map
X-Ray’s Service Map automatically generates a diagram of your architecture based on traces. It shows each service, the connections between them, the latency, and the error rate of each connection.
This diagram is generated automatically. You do not need to draw or maintain it. If your team does not have an up-to-date architecture diagram, X-Ray’s Service Map is a solid starting point.
Sampling
X-Ray does not capture all requests by default. It uses a sampling algorithm that captures the first request per second and 5% of the rest. This reduces costs without losing significant visibility.
You can customize sampling by route, service, or any attribute. For critical endpoints (payments, authentication), set sampling to 100%.
Cost: $5 per million traces recorded, $0.50 per million traces retrieved. For an application with 10 million requests per month and 5% sampling, the cost is approximately $2.50 per month.
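The cost estimate above is simple arithmetic and easy to rerun for your own traffic. The sketch below uses the prices quoted in this article and ignores both the free tier and the one-request-per-second reservoir, so it is an approximation:

```python
PRICE_PER_MILLION_RECORDED = 5.0  # USD, figure quoted in this article


def xray_monthly_cost(requests_per_month, sampling_rate,
                      price_per_million=PRICE_PER_MILLION_RECORDED):
    """Approximate monthly cost of recorded traces, in USD."""
    traces = requests_per_month * sampling_rate
    return traces / 1_000_000 * price_per_million


default_cost = xray_monthly_cost(10_000_000, 0.05)  # about 2.50 USD
full_cost = xray_monthly_cost(10_000_000, 1.0)      # about 50 USD
print(f"5% sampling: ${default_cost:.2f}, 100% sampling: ${full_cost:.2f}")
```

The 20x gap between the two figures is why 100% sampling is usually reserved for a small set of critical routes rather than enabled globally.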
Beyond AWS-Native Tools
CloudWatch vs Datadog vs Grafana Cloud
AWS-native tools cover basic functionality. But for teams with more than 10-15 services, third-party alternatives offer significant advantages:
| Feature | CloudWatch | Datadog | Grafana Cloud |
|---|---|---|---|
| Native AWS metrics | Automatic | Via integration | Via integration |
| Dashboards | Functional | Excellent | Excellent |
| Alerting | Basic | Advanced | Advanced |
| Metric-log correlation | Limited | Integrated | Integrated |
| APM / Traces | X-Ray (separate) | Integrated | Tempo |
| Cost for 20 services | ~$50-100 | ~$300-500 | ~$200-400 |
Recommendation:
- Teams of 1-5: CloudWatch + X-Ray. No significant additional cost.
- Teams of 5-15: Evaluate Grafana Cloud (open source core, lower cost than Datadog).
- Teams of 15+: Datadog offers the most complete experience but at the highest cost.
Prometheus + Grafana on AWS
If you prefer open source tools, AWS offers managed services:
- Amazon Managed Service for Prometheus (AMP): Managed Prometheus. You ingest metrics, AWS stores and scales them. Cost: $0.003 per 10,000 samples ingested + $0.03 per million queries.
- Amazon Managed Grafana (AMG): Managed Grafana with native integration with AMP, CloudWatch, X-Ray, and more. Cost: from $9 per editor per month.
This combination is especially relevant for teams using EKS, where Prometheus is the de facto standard for Kubernetes monitoring.
Observability Patterns That Work
Pattern 1: SLOs Before Metrics
Do not monitor CPU. Monitor user experience. Define SLOs (Service Level Objectives):
- Availability: 99.9% of requests succeed (roughly 43 minutes of allowed downtime in a 30-day month).
- Latency: P99 under 500ms.
- Error rate: Less than 0.1% of requests returning 5XX.
Create alarms based on error budgets: if in the last 30 days you have consumed more than 50% of your error budget, investigate before it runs out.
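To make the error budget concrete, here is a small sketch of the arithmetic. The function names and the sample request counts are illustrative; the 50% threshold comes from the rule above:

```python
def allowed_downtime_minutes(slo, window_days=30):
    """Downtime allowed by an availability SLO over the window, in minutes."""
    return (1 - slo) * window_days * 24 * 60


def budget_consumed(total_requests, failed_requests, slo):
    """Fraction of the error budget used so far (1.0 means exhausted)."""
    allowed_failures = total_requests * (1 - slo)
    return failed_requests / allowed_failures


downtime = allowed_downtime_minutes(0.999)         # about 43.2 minutes
consumed = budget_consumed(1_000_000, 600, 0.999)  # about 0.6, i.e. 60%

if consumed > 0.5:
    print(f"{consumed:.0%} of the error budget used: investigate now")
```

With one million requests and a 99.9% target, the budget is 1,000 failed requests, so 600 failures already puts the service past the investigation threshold.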
Pattern 2: Alert on Symptoms, Not Causes
Bad: Alert when CPU exceeds 80%. Good: Alert when P99 latency exceeds 500ms.
CPU at 80% may be normal if your service is processing an expected batch. High latency always affects users. Alert on what matters; investigate causes afterward.
Pattern 3: Structured Logs with Correlation IDs
Every request entering the system receives a unique ID (correlation ID or request ID). That ID propagates to all services participating in the processing. When something fails, you search by correlation ID and get the complete trace of the request.
Minimal implementation:
- API Gateway generates a request ID (or uses the client’s X-Request-ID).
- Lambda receives the ID in headers and includes it in all logs.
- If Lambda calls another service (SQS, another Lambda), it propagates the ID.
- Logs Insights allows searching by correlation ID to reconstruct the complete flow.
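The steps above can be sketched as a Lambda handler behind an API Gateway proxy integration. The field name `correlationId` and the helper names are assumptions made for this example; the fallback order (client header, then the API Gateway request ID, then a fresh UUID) is one reasonable choice among several:

```python
import json
import uuid


def correlation_id_from(event):
    """Prefer the client's X-Request-ID, then API Gateway's ID, else a new one."""
    headers = {k.lower(): v for k, v in (event.get("headers") or {}).items()}
    return (
        headers.get("x-request-id")
        or (event.get("requestContext") or {}).get("requestId")
        or str(uuid.uuid4())
    )


def log(correlation_id, level, message, **fields):
    """Print one structured log line that carries the correlation ID."""
    line = json.dumps(
        {"level": level, "correlationId": correlation_id, "message": message, **fields}
    )
    print(line)
    return line


def sqs_attributes(correlation_id):
    """Message attributes that propagate the ID to an SQS consumer."""
    return {"correlationId": {"DataType": "String", "StringValue": correlation_id}}


def handler(event, context):
    cid = correlation_id_from(event)
    log(cid, "INFO", "order received")
    # A downstream send would pass MessageAttributes=sqs_attributes(cid).
    return {
        "statusCode": 200,
        "headers": {"X-Request-ID": cid},
        "body": json.dumps({"ok": True}),
    }
```

With logs in this shape, a Logs Insights filter on `correlationId` returns every line written for a single request across all services that propagate the ID.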
Pattern 4: Canary Monitoring
A canary is a synthetic test that simulates real user behavior. CloudWatch Synthetics lets you create canaries that:
- Make HTTP requests to your endpoints every N minutes.
- Verify response codes and content.
- Measure latency from multiple regions.
- Alert when they fail.
Cost: Each canary costs $0.0012 per run. A canary running every 5 minutes (8,640 runs per month) costs approximately $10 per month.
Canaries detect problems before users because they make requests continuously, even when there is no real traffic.
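The canary cost figure above follows directly from the run frequency. A short sketch of the calculation, using the per-run price quoted in this article and a 30-day month:

```python
PRICE_PER_RUN = 0.0012  # USD, figure quoted in this article


def canary_runs_per_month(interval_minutes, days=30):
    """Number of runs in a month for a canary on a fixed interval."""
    return days * 24 * 60 // interval_minutes


def canary_monthly_cost(interval_minutes, price_per_run=PRICE_PER_RUN, days=30):
    """Approximate monthly cost of one canary, in USD."""
    return canary_runs_per_month(interval_minutes, days) * price_per_run


runs = canary_runs_per_month(5)          # 8640 runs
every_5_min = canary_monthly_cost(5)     # about 10.37 USD
every_minute = canary_monthly_cost(1)    # about 51.84 USD
print(f"{runs} runs: ${every_5_min:.2f}/month, every minute: ${every_minute:.2f}/month")
```

Dropping the interval from five minutes to one multiplies the cost by five, so a 1-minute canary is best reserved for the endpoints where detection time matters most.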
Pattern 5: Anomaly Detection
CloudWatch Anomaly Detection uses machine learning to establish normal behavior patterns for your metrics and alerts when a metric deviates significantly from the expected pattern.
It is especially useful for metrics with seasonal patterns (more traffic Monday through Friday, peaks at 10 AM, valleys at 3 AM). A fixed threshold does not work well with these metrics because what is normal at 10 AM is anomalous at 3 AM.
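An anomaly detection alarm is defined with a metric plus an `ANOMALY_DETECTION_BAND` expression instead of a fixed threshold. The sketch below builds the parameters for boto3's `put_metric_alarm`; the alarm name, the choice of `RequestCount`, the band width of 2, and the three evaluation periods are assumptions for illustration:

```python
def anomaly_alarm_params(load_balancer, sns_topic_arn, band_width=2):
    """Alarm that fires when ALB request count leaves its expected band."""
    return {
        "AlarmName": "alb-request-count-anomaly",  # hypothetical name
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        "ThresholdMetricId": "band",  # compare m1 against the band below
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/ApplicationELB",
                        "MetricName": "RequestCount",
                        "Dimensions": [
                            {"Name": "LoadBalancer", "Value": load_balancer}
                        ],
                    },
                    "Period": 300,
                    "Stat": "Sum",
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "Expected range",
                "ReturnData": True,
            },
        ],
    }


params = anomaly_alarm_params(
    "app/my-alb/0123456789abcdef",  # placeholder load balancer identifier
    "arn:aws:sns:us-east-1:123456789012:oncall",  # placeholder topic ARN
)
# boto3.client("cloudwatch").put_metric_alarm(**params)  # requires boto3
```

A wider band (a larger second argument) tolerates more deviation before alarming, which is the main knob for tuning out false positives.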
Incident Management: From Alert to Resolution
On-Call Structure
Observability without a response process is useless. Alerts need to reach the right person at the right time:
- Level 1 (automated): Alerts that resolve themselves. Auto-scaling responding to load spikes, Lambda retries, circuit breakers opening and closing.
- Level 2 (on-call team): Alerts requiring human intervention. An on-call engineer investigates and resolves. Weekly or biweekly rotation.
- Level 3 (escalation): Incidents the on-call team cannot resolve. Escalated to the service-owning team or tech lead.
Incident management tools: PagerDuty, Opsgenie (part of Atlassian), or AWS Incident Manager (included in Systems Manager). PagerDuty is the market standard, with native integration with CloudWatch, Datadog, and most monitoring tools.
Post-Mortems
Every significant incident should produce a written post-mortem. The minimum format:
- Summary: What happened, how long it lasted, how many users were affected.
- Timeline: Detailed chronology from the first alert to resolution.
- Root cause: Not “the server went down,” but “the Security Group was manually modified, removing application access to the database.”
- Corrective actions: What changes will be implemented to prevent recurrence. Each action with an owner and a date.
Post-mortems must be blameless. The goal is to improve the system, not to blame individuals.
Automated Runbooks
CloudWatch can execute automatic actions when an alarm fires. Through Systems Manager Automation, you can create runbooks that:
- Restart an ECS service when the error rate exceeds a threshold.
- Increase DynamoDB capacity when throttling is detected.
- Revoke a compromised Access Key when GuardDuty detects suspicious activity.
- Create a snapshot of an EC2 instance before applying an emergency change.
Automated runbooks reduce mean time to resolution (MTTR) from hours to minutes for the most common incidents.
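As an illustration of the first runbook, here is a sketch that uses a Lambda function subscribed to the alarm's SNS topic rather than a Systems Manager Automation document, which is a different mechanism from the one described above but shows the same idea. The cluster and service names are placeholders, and the ECS client is passed in so the logic can be tested without AWS access:

```python
import json


def alarm_is_firing(sns_event):
    """True when the SNS-delivered CloudWatch alarm message is in ALARM state."""
    message = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    return message.get("NewStateValue") == "ALARM"


def restart_if_alarming(sns_event, ecs_client, cluster, service):
    """Force a new deployment of the ECS service when the alarm fires."""
    if not alarm_is_firing(sns_event):
        return False
    ecs_client.update_service(
        cluster=cluster,
        service=service,
        forceNewDeployment=True,  # replaces running tasks with fresh ones
    )
    return True


# In a Lambda handler, assuming boto3 is available in the runtime:
# def handler(event, context):
#     import boto3
#     restart_if_alarming(event, boto3.client("ecs"), "my-cluster", "my-service")
```

An automatic restart hides the symptom, so it is worth logging every automated action and reviewing how often it fires.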
Observability Costs: How Much Should You Spend
General Rule
A reasonable observability budget is 3-5% of your total AWS bill. If you spend $1,000 per month on infrastructure, $30 to $50 per month on observability is reasonable.
Typical Breakdown for a Startup with 10 Services
| Component | Cost/Month |
|---|---|
| CloudWatch Metrics | ~$15 |
| CloudWatch Logs (50 GB) | ~$25 |
| CloudWatch Alarms (20) | ~$2 |
| CloudWatch Dashboard (1) | ~$3 |
| X-Ray (5% sampling) | ~$3 |
| Synthetics (2 canaries) | ~$20 |
| Total | ~$68 |
How to Reduce Log Costs
Logs are the most expensive observability component. Strategies to control cost:
- Filter before sending: Do not send DEBUG logs to production. Set the log level to INFO or WARN.
- Compression: CloudWatch compresses automatically, but make sure you are not sending unnecessary data.
- Appropriate retention: 30 days is usually sufficient in CloudWatch. For long-term retention, export to S3 (10x cheaper).
- Log sampling: For high-frequency access logs, consider emitting only a percentage.
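Log sampling can be implemented at the source with a filter on the standard `logging` module. The sketch below keeps every WARNING and above and only a fraction of lower-severity records; the 10% rate and the logger name are arbitrary examples:

```python
import logging
import random


class SamplingFilter(logging.Filter):
    """Keep all warnings and errors, and a random sample of everything else."""

    def __init__(self, rate, rng=None):
        super().__init__()
        self.rate = rate  # fraction of low-severity records to keep, 0.0 to 1.0
        self.rng = rng or random.Random()

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True
        return self.rng.random() < self.rate


access_logger = logging.getLogger("access")  # hypothetical logger name
access_logger.addFilter(SamplingFilter(rate=0.1))  # keep about 10% of INFO lines
```

Sampling should never apply to errors, since those are the records you will need in full during an incident.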
Conclusion
Observability is not a project. It is a discipline built incrementally. Start with CloudWatch and X-Ray, which are already there. Configure the alarms that matter. Structure your logs. Propagate correlation IDs. And when your system’s complexity justifies it, evaluate third-party tools.
The goal is not more dashboards. It is detecting problems before your users, understanding causes quickly, and resolving them without panic.
If you need help designing your observability strategy on AWS, NERVICO works with technical teams to implement monitoring that works. Request a free audit and we will evaluate the current state of your observability.