Troubleshooting of Microservices on Amazon ECS

Scope

This knowledge bank focuses on the real-world challenges encountered while deploying Dockerized microservices on AWS ECS (Elastic Container Service) using EC2 launch type, with infrastructure managed via Terraform and automated deployments through GitHub Actions. Instead of explaining ECS fundamentals, this write-up provides a concise summary of critical deployment issues and their practical resolutions. It is ideal for DevOps engineers, cloud developers, or anyone deploying production-ready services on ECS who may encounter similar issues.

It also showcases how combining ECS with Infrastructure as Code (IaC) and CI/CD can lead to streamlined, repeatable, and resilient deployment pipelines.

Common Errors and Their Resolutions in ECS Microservices Deployment

1. Health Check Failed Due to Missing /health Route

Problem:

ECS task failed health checks repeatedly and kept restarting.
ALB marked targets as unhealthy.

Root Cause:

The /health route required by the ALB health check was missing in the application code.

Resolution:

Added a basic /health endpoint in each microservice.
Updated task definitions to point ALB health checks to the correct path (/health).
Verified ALB target groups received a 200 response from this route using curl and browser tests.
Deployed updated container images and confirmed task health stabilized.

2. Containers Frequently Draining and Restarting Due to Low Stop Timeout

Problem:

Tasks were draining and restarting too often, especially under high load.

Root Cause:

The default stopTimeout (graceful shutdown period) was too low, causing ECS to kill containers before they could close connections or clean up.

Resolution:

Increased stopTimeout in task definition to 600 seconds.
Modified application code to catch SIGTERM and shut down cleanly.
Tested graceful termination using ECS task stop actions and confirmed logs were properly flushed.

3. ECS Services Not Deploying New Docker Images Automatically

Problem:

Pushed new Docker images to ECR, but ECS services were still running older versions.

Root Cause:

ECS does not automatically redeploy tasks unless the task definition is updated or deployment is triggered manually.

Resolution:

Enabled force new deployment in the ECS service via the AWS Console and also automated it.
In the GitHub Actions workflow, added:

aws ecs update-service --cluster <cluster-name> --
service<service-name> --force-new-deployment

Ensured that Terraform outputs the latest task definition ARN to prevent reusing stale definitions.

4. ALB DNS Not Responding to Login and Booking Services

Problem:

The ALB DNS was not routing requests correctly to /login or /booking paths.

Root Cause:

Misconfigured ALB listener rules or missing target group associations for specific paths.

Resolution:

Verified and updated ALB listener rules in Terraform using path-based routing.
Confirmed correct target group mappings and added missing routes.
Used ALB logs and browser developer tools to trace broken routing.
After corrections, traffic was routed successfully to the intended microservices.

5. Reserved Too Much CPU for ECS Task

Problem:

ECS tasks reserved more CPU than the EC2 instance could handle.
Resulted in insufficient CPU for other tasks and an unhealthy task state.

Root Cause:

Overestimated CPU needs per container.
Task definition reserved 1024 CPU units per task on a t3.medium instance with only 2 vCPUs (2048 units).

Resolution:

Reduced CPU reservation to 256 units per task.
Monitored CloudWatch metrics to understand realistic CPU usage.
Scheduled tasks on instances with better resource planning, maintaining a healthy task lifecycle.

Conclusion

Deploying microservices on ECS offers scalability and automation, but several real-time issues may occur during implementation. These errors highlight the importance of:

Defining proper health check routes in the application code and ECS configuration.
Handling graceful shutdowns with adequate stop timeouts.
Automating deployments using force update mechanisms.
Correct configuration of ALB listener rules and target group associations.
Aligning task CPU reservations with actual instance capacities.
Integrating Terraform and GitHub Actions streamlines infrastructure provisioning and continuous deployment, reducing manual errors and improving consistency.

By identifying and addressing these challenges early, teams can ensure smoother deployments, better resource utilization, and higher application availability in production environments.

This knowledge bank serves as a practical troubleshooting reference for DevOps engineers deploying containerized workloads on AWS ECS using EC2, Terraform, and CI/CD tools such as GitHub Actions.

Scope

Common Errors and Their Resolutions in ECS Microservices Deployment

Conclusion

Related Posts