When a banking platform handling live financial transactions generates thousands of automated alerts, most of them false alarms, the engineers responsible for keeping it running face a different problem than the one they were trained for. Ajay Devineni decided to solve this problem using machine learning.
Devineni, a Site Reliability Engineer at NCR Corporation (now Candescent) in Atlanta, supports cloud platforms for multiple credit union banking applications. These systems run 24/7 and process real customer transactions, where even brief disruptions can have serious consequences.
“Managing incidents manually just doesn’t scale anymore,” Devineni said. “There’s too much data coming in, and during an outage you don’t have time to go through everything step by step.”
Instead of reacting after problems occur, Devineni built AI systems that predict and prevent failures before they reach customers. His work includes an ML-driven alert system that filters PagerDuty alerts before they reach engineers, reducing actionable alerts by 34% and after-hours pages by 41% and raising engineer wellbeing scores from 5.8 to 7.4 out of 10. He also developed a causal analysis system that uses distributed tracing and OpenTelemetry data to trace how failures spread across microservices, helping teams identify root causes more quickly. In addition, he created a predictive monitoring system for database migrations, enabling a zero-downtime transition of a multi-billion-row banking database to Amazon Aurora with no data loss and a 41% improvement in query performance after migration.
Separately, he automated certificate lifecycle management across hundreds of services, eliminating certificate-related outages for more than a year. These systems have been applied across production banking platforms supporting real-time financial transactions.
Devineni follows a disciplined “shadow validation” approach required in SOC 2-regulated environments: every AI system first runs in parallel with human decisions for weeks, proving its accuracy before any automation is enabled. “We don’t switch on automation immediately,” he explained. “We let the system prove itself in real conditions. Once we’re confident, then we take the next step.”
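The shadow validation idea can be illustrated with a minimal sketch. It is not Devineni's implementation: the record fields, the "page"/"suppress" labels, and the thresholds are all hypothetical, chosen only to show how a model's decisions might be compared against human decisions before automation is enabled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ShadowRecord:
    """One alert seen during the shadow period (all fields are illustrative)."""
    alert_id: str
    model_decision: str  # what the model would have done: "page" or "suppress"
    human_decision: str  # what the on-call engineer actually did


def agreement_rate(records: List[ShadowRecord]) -> float:
    """Fraction of alerts where the model matched the human decision."""
    if not records:
        return 0.0
    matches = sum(1 for r in records if r.model_decision == r.human_decision)
    return matches / len(records)


def missed_pages(records: List[ShadowRecord]) -> List[ShadowRecord]:
    """Alerts a human escalated that the model would have suppressed."""
    return [
        r for r in records
        if r.human_decision == "page" and r.model_decision == "suppress"
    ]


def ready_for_automation(
    records: List[ShadowRecord],
    min_records: int = 1000,      # assumed minimum sample size
    min_agreement: float = 0.95,  # assumed agreement threshold
) -> bool:
    """Gate automation on sample size, agreement, and zero missed pages."""
    return (
        len(records) >= min_records
        and agreement_rate(records) >= min_agreement
        and not missed_pages(records)
    )
```

In this sketch the model never acts on its own during the shadow period; its decisions are only logged next to the human ones, and the gate stays closed if it would have suppressed even one alert that an engineer chose to escalate.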
He has also focused on integrating data from across networks, applications, and infrastructure to give teams a clearer picture of problems when traditional tools fall short. More recently, he has started using AI tools that assist engineers directly by reading system configurations, suggesting fixes, and generating scripts. Tasks that once took hours can now be completed much faster.
Looking ahead, he sees the next evolution as systems that not only predict problems but also safely remediate them in defined scenarios while keeping human judgment in the loop for anything involving customer funds or regulatory controls.
With systems now operating across live banking infrastructure, his work reflects a practical model for how AI-driven reliability engineering is being applied at scale in regulated financial environments.
About the Professional:
Ajay Devineni is a technology professional specializing in cloud computing, telecommunications infrastructure, and digital financial systems, with deep expertise in Site Reliability Engineering (SRE), DevOps automation, and cloud-native platforms. His work spans large-scale enterprise environments, including telecommunications systems at AT&T and digital-first banking platforms hosted on Amazon Web Services. Ajay is particularly recognized for improving system reliability, reducing downtime, and building intelligent automation for incident response in mission-critical environments. With a unique blend of cloud engineering, observability, and AI-driven infrastructure automation, he focuses on creating resilient, scalable systems for industries where uptime and security are essential.