
After over 15 years of working alongside engineering teams to make operations more efficient, I’ve come to appreciate a simple truth: understanding infrastructure alone is never enough. True operational insight comes from understanding the software that runs on top of it. As someone passionate about software architecture, yet deeply rooted in infrastructure, I’ve always believed in bridging that gap.
Today, I lead the Cloud Operations practice at DNX Solutions, running 24×7 operations across nearly 50 customers. My team doesn’t manage applications — we manage infrastructure. But that doesn’t mean we can ignore the application. When our customers face issues, the lines between infrastructure and code blur. This is a story about one of those moments, and how generative AI (GenAI) helped us go beyond our traditional tools and methods.
The Problem: Just a Platform Slowdown?
A customer approached us with a critical issue. After onboarding a significant volume of new users, parts of their web platform began to lag, with noticeable latency during navigation. To many, this might seem like a classic case of infrastructure under strain. So we checked the usual suspects:
- Aurora RDS (with multiple read replicas) — healthy.
- ElastiCache and Elasticsearch — metrics all within normal ranges.
- Node.js backend — showing no spikes in CPU or memory.
No bottlenecks. No errors. No alerts. For the customer, these slowdowns translated to frustrated users and potential revenue loss. For us, it was a wake-up call that the old way of looking at infrastructure wasn’t enough.
This was the first lesson: infrastructure observability alone is no longer enough. To solve this issue, we needed to peer inside the engine that powers the user experience — the application itself!
The Pivot: Application Performance Monitoring (APM) to the Rescue
We introduced an APM solution to instrument the application. Thankfully, modern APM agents are smart and non-invasive. Within days, we had traces, performance data, and transaction breakdowns. And that’s when the magic happened.
We could now see issues that weren’t visible from infrastructure metrics alone, yet were directly impacting the end-user experience:
- API calls that were taking 20-30 seconds.
- Repeated database queries within a single function.
- Cache lookups that inexplicably dragged.
- Elasticsearch requests that consistently took far too long to process.
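A modern APM agent automates this kind of instrumentation, but the underlying idea can be sketched in a few lines: wrap each downstream call with a timer so slow spans stand out. This is a hypothetical illustration, not the customer's code; `timed`, `fakeDbQuery`, and the 50 ms delay are invented for the example.

```typescript
// Hypothetical sketch of what an APM agent automates: timing each
// downstream call so slow spans become visible in the telemetry.

async function timed<T>(
  label: string,
  fn: () => Promise<T>
): Promise<{ result: T; ms: number }> {
  const start = Date.now();
  const result = await fn();
  const ms = Date.now() - start;
  console.log(`${label}: ${ms} ms`); // a real agent would ship this as a trace span
  return { result, ms };
}

// Stand-in for a slow dependency (e.g. a 50 ms database query).
function fakeDbQuery(): Promise<string> {
  return new Promise((resolve) => setTimeout(() => resolve("rows"), 50));
}

async function demo(): Promise<number> {
  const { ms } = await timed("db.query", fakeDbQuery);
  return ms;
}
```

Real agents do this transparently for every HTTP request, database call, and cache lookup, which is why instrumentation took only days rather than a code rewrite.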
But here’s the challenge: I’m not the software engineer who wrote this code. How could I help them troubleshoot or fix the problem?
The Breakthrough: Leveraging GenAI in Day-to-Day Operations
That’s when we turned to one of our newest tools (cline) — a GenAI assistant built into our development environment that runs locally and works through the command line. It analyses code and performance issues in context — without sending any data externally. That meant we could get detailed, AI-powered debugging support quickly and securely, without pulling in extra teams or risking data exposure.
These are a few of the prompts that I used:
- To reduce database load in notifications API:
“The /notification function is hitting the database 20 times, which is causing performance issues. analyze the code for this function and suggest ways to reduce the number of database calls, aiming for a single or minimal number of queries”
The prompt above, used in “Plan” mode, narrowed down the chunks of code where those calls were being made and suggested improvements to both the queries and the database indexes.
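The fix for this kind of problem is usually to collapse an N+1 access pattern (one query per item) into a single batched query. The sketch below is hypothetical: the in-memory “database”, `enrichNotifications`, and the sample data are invented to show the shape of the change, not the customer's actual code.

```typescript
// Hypothetical N+1 fix: fetch all the users a page of notifications needs
// in ONE query, instead of one query per notification.

interface Notification { id: number; userId: number; }
interface User { id: number; name: string; }

const users: User[] = [
  { id: 1, name: "Ana" },
  { id: 2, name: "Bruno" },
];
const notifications: Notification[] = [
  { id: 10, userId: 1 },
  { id: 11, userId: 2 },
  { id: 12, userId: 1 },
];

let queryCount = 0;
function queryUsersByIds(ids: number[]): User[] {
  queryCount++; // stands in for one round trip to the database
  return users.filter((u) => ids.includes(u.id));
}

// Before: the handler queried the database inside a loop (N round trips).
// After: collect the distinct ids up front and fetch them in one query.
function enrichNotifications(items: Notification[]): string[] {
  const ids = [...new Set(items.map((n) => n.userId))];
  const byId = new Map(
    queryUsersByIds(ids).map((u) => [u.id, u] as [number, User])
  );
  return items.map((n) => byId.get(n.userId)?.name ?? "unknown");
}

const names = enrichNotifications(notifications);
// queryCount stays at 1 no matter how many notifications we enrich.
```

In SQL terms, the same idea is a single `WHERE id IN (...)` or a `JOIN`, paired with an index on the joined column.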
- To speed up Redis calls in task management:
“The /tasks function is experiencing Redis calls exceeding 5 seconds in our production environment. Help diagnose what might be causing this slowdown and suggest ways to improve its performance”
This prompt helped identify inefficient Redis calls scattered across the codebase — saving hours of manual investigation and speeding up issue resolution.
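Scattered cache calls typically get fixed by batching: many sequential GETs (one network round trip each) become a single MGET. The sketch below is hypothetical; `StubRedis` and the key names stand in for a real Redis client such as ioredis, with a counter simulating the round trips.

```typescript
// Hypothetical sketch of the Redis batching fix. StubRedis simulates
// network round trips; a real app would use a client like ioredis.

class StubRedis {
  roundTrips = 0;
  private store = new Map<string, string>([
    ["task:1", "build"],
    ["task:2", "deploy"],
    ["task:3", "test"],
  ]);
  async get(key: string): Promise<string | null> {
    this.roundTrips++; // each GET is its own round trip
    return this.store.get(key) ?? null;
  }
  async mget(keys: string[]): Promise<(string | null)[]> {
    this.roundTrips++; // one round trip for ALL keys
    return keys.map((k) => this.store.get(k) ?? null);
  }
}

// Before: N round trips for N keys, latency adds up per call.
async function fetchTasksOneByOne(redis: StubRedis, keys: string[]) {
  const out: (string | null)[] = [];
  for (const k of keys) out.push(await redis.get(k));
  return out;
}

// After: one round trip regardless of key count.
async function fetchTasksBatched(redis: StubRedis, keys: string[]) {
  return redis.mget(keys);
}
```

When each round trip carries even a few milliseconds of network latency, collapsing dozens of GETs into one MGET (or a pipeline) is often the difference between a snappy page and a multi-second stall.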
- To optimise Elasticsearch integration:
“Our application is experiencing intermittent latency and increased resource usage when interacting with Elasticsearch. Specifically, individual Elasticsearch requests sometimes take longer than expected. How can you optimise this?”
Finally, this prompt helped us draft a plan to improve the Elasticsearch integration, detailing exactly which code needed to change.
Here’s what the AI helped us uncover — and why it mattered to the business:
- Slow SQL Queries: The AI flagged expensive queries missing proper indexes and recommended changes to the database schema and to how the application accessed the data, cutting execution time. This reduced page load times and helped improve customer retention.
- Heavy Redis Payloads: Some Redis cache entries were oversized and encrypted, leading to delays. Trimming payload sizes and reviewing what genuinely needed encryption improved performance, delivering faster response times and reducing the risk of timeout errors.
- Inefficient Use of Elasticsearch: Instead of reusing existing connections, the app created a new one for every search — wasting resources and slowing things down. Fixing this saved compute resources and helped lower cloud spend.
- Code Hotspots Identified: It pinpointed specific functions that were slowing the app and explained exactly why, helping us focus our optimisation efforts where they matter most. This improved developer efficiency and sped up performance tuning.
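The Elasticsearch connection-reuse fix can be sketched as a module-level shared client. `SearchClient` below is a hypothetical stand-in for a real Elasticsearch client (which maintains its own connection pool); the point is simply to construct it once at startup rather than once per request.

```typescript
// Hypothetical sketch of the connection-reuse fix: one shared client
// for the process, instead of a new client (and new connections) per search.

class SearchClient {
  static instancesCreated = 0;
  constructor() {
    SearchClient.instancesCreated++; // each instance = new connection setup cost
  }
  async search(query: string): Promise<string[]> {
    return [`result for ${query}`];
  }
}

// Created once at module load; every request handler reuses it.
const sharedClient = new SearchClient();

async function searchTasks(query: string): Promise<string[]> {
  return sharedClient.search(query); // reuses the pooled connections
}
```

Creating a client per request was paying TLS handshake and connection-setup costs on every search; sharing one client eliminates that overhead, which is where the compute savings and lower cloud spend came from.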
Even more impressively, it generated a full report — human-readable, code-referenced, and actionable without requiring me to deeply understand the business logic or the codebase.
Operational Lessons Learned
This experience taught me a few critical lessons:
- Observability Must Evolve: Infrastructure metrics are no longer enough. We must monitor what the user actually experiences.
- GenAI Empowers, Not Replaces: AI helped me — a seasoned infrastructure professional — operate at a software-debugging level without stepping out of my comfort zone.
- Bridge the Gap: The future of operations isn’t infrastructure or applications. It’s both. GenAI makes it possible to operate across those layers.
- Speed and Clarity: What used to take days of back-and-forth between teams can now be diagnosed in hours — sometimes minutes — with the right AI tools.
The Human Side of AI Operations
GenAI is helping me write this blog post. It’s also helping my team deliver faster incident resolution, clearer root cause analysis, and more confident support — all while keeping humans at the center.
I don’t believe AI will replace us. But I do believe it will change the way we work. We’re not eliminating jobs — we’re transforming them. Teams like mine won’t become obsolete. We’ll become augmented.
Projects that once took months will take weeks. Diagnosing obscure bugs will feel like asking a colleague rather than assembling a task force. And engineers will finally get to focus on the parts of their job that actually require creativity.
Your Job Is Safe — If You Adapt
I’m not afraid of GenAI. I’m excited by it. If you work in infrastructure, operations, or DevOps and aren’t yet embracing AI in your workflows, you’re already behind.
Human-centric operations in the GenAI era mean faster delivery, better visibility, and fewer escalations. And most importantly, they mean keeping humans — not dashboards — at the heart of every solution.
Let’s stop fearing the tools of tomorrow. Let’s use them to deliver better outcomes today.
Want to See How GenAI Can Power Your Ops Team?
Whether it’s solving hidden performance issues or reducing resolution times, AI isn’t just a future tool — it’s working today.
Talk to us about integrating GenAI into your operations.