Maintained 99% uptime SLA across 11 global production deployments (USA, UK, Canada, Australia, Hong Kong). 25+ services on Azure with strict regional isolation requirements.
- Built Python monitor for JMS consumer queues catching silent failures before user impact. Adopted as standard team tooling.
- Built zero-downtime restart automation: verified healthy instance availability before touching degraded one. Safety check first, action second.
- Full Terraform infrastructure across 11 regions. Jenkins and Ansible CI/CD pipelines. Prometheus, Grafana, ELK, Azure Monitor observability stack.