We’ve spent fifteen years automating infrastructure and implementing AIOps platforms: alert correlation, root cause analysis, automated remediation. And it’s worked. Incidents are down. Alert noise is reduced. Detection is faster.

But engineers are still just as busy.

That’s because incident response is only half of what infrastructure engineers actually do. The other half — service requests, ad-hoc troubleshooting, architecture consultations, endless research — has received almost no attention from the industry. It’s invisible work: the Slack message that becomes a three-hour debugging session, the disk resize that takes thirty minutes, the runbook nobody can find. It accumulates silently, and it consumes as much time as incidents do.

Solving that missing half is where the real gains are. And when AI handles that execution work, it frees up capacity to go even deeper on automation and root cause prevention — creating a compounding effect that could reduce overall engineering workload by 75% or more, not just 50%.

At Vervint, we’re building something different: an Agentic AIOps Engineer. An AI system that handles the full spectrum of infrastructure engineering work, from “why did this alert fire?” to “can you help me troubleshoot this performance issue?” This is a critical component of our Human AI Partnership (HAIP) operating model.

The Misconception of What Work Is for Engineers

Incident response represents about 50% of engineering time. Research shows engineers spend equal time on service requests, consultations, and ad-hoc troubleshooting that never generates formal tickets.

What engineers do beyond incidents:

Service requests pile up: “Can you resize my disk?” “Add this user to this group.” “Why is this query running slow?” Each request seems small: fifteen minutes here, thirty minutes there. They accumulate into hours of context switching that fractures deep work time.

That “quick question” in Slack becomes a three-hour debugging session across four engineers, digging through logs, testing hypotheses, researching vendor documentation, and discovering a configuration drift from three weeks ago.

Engineers Google constantly: error messages, command syntax, configuration examples, vendor documentation, Stack Overflow threads. We search past tickets for similar issues. We dig through Confluence for that runbook someone wrote eighteen months ago. Information retrieval is its own full-time job.

Context switching compounds everything. Each interruption costs 15-30 minutes of cognitive reset time, even for simple requests. An engineer handling ten interruptions per day loses 3-4 hours just recovering focus.

Studies show L2 engineers spend 40-60% of time on non-incident work. Gartner reports 70% of infrastructure engineering time goes to “keeping the lights on” versus strategic work. Traditional automation addresses only the repeatable 30% of work; the remaining 70% requires contextual judgment that typical automation cannot provide.

Enter the Agentic AIOps Engineer

We’re building an AI system that handles the complete spectrum of infrastructure engineering: incidents, requests, consultations, and troubleshooting. This embodies the core principle of our Human AI Partnership model: AI as an amplifier of human expertise, not a replacement for human judgment.

The technical architecture uses Claude via API, RAG-based knowledge management for institutional memory, Model Context Protocol (MCP) infrastructure abstraction for safe access to systems and monitoring, and multi-agent coordination for specialized domain expertise. Integration points include Jira Service Desk for ticket workflows, plus a web interface and Microsoft Teams for conversational access.

The strategy uses progressive autonomy: starting with read-only recommendations, advancing to supervised execution where AI proposes and implements with approval, and reaching autonomous action with comprehensive guardrails over the next two years.

Here’s what this means in practice. An engineer receives a ticket: “Application logs filling disk on PROD-APP-07.” The Agentic AIOps Engineer:

  • Searches past ticket resolutions and runbooks in the knowledge base
  • Queries current disk usage and log growth patterns via MCP infrastructure proxies
  • Researches log rotation best practices for that application
  • Proposes a solution with confidence scoring based on similar past successes
  • Executes the fix after approval
  • Documents the resolution in Confluence and updates the ticket

Time elapsed: 3-5 minutes versus 45 minutes of manual engineering. Quality: consistent with documented best practices, including preventive measures against future occurrences.
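To make the shape of this workflow concrete, here is a minimal Python sketch of the orchestration loop. Everything in it is illustrative: the function names (`search_knowledge_base`, `propose_fix`, `resolve_ticket`), the confidence heuristic, and the stubbed knowledge-base results are hypothetical placeholders, not our actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    summary: str
    confidence: float  # 0.0-1.0, derived from similar past outcomes
    actions: list

def search_knowledge_base(ticket: str) -> list:
    # Hypothetical RAG lookup: return past resolutions similar to this ticket.
    # Stubbed here with canned results for illustration.
    return [{"fix": "rotate logs", "succeeded": True},
            {"fix": "rotate logs", "succeeded": True},
            {"fix": "expand disk", "succeeded": False}]

def propose_fix(ticket: str) -> Proposal:
    past = search_knowledge_base(ticket)
    # Confidence = share of similar past cases that resolved successfully.
    confidence = sum(p["succeeded"] for p in past) / len(past)
    return Proposal(summary="Fix logrotate config and force rotation",
                    confidence=confidence,
                    actions=["update logrotate config", "force log rotation"])

def resolve_ticket(ticket: str, approved_by_engineer: bool) -> str:
    proposal = propose_fix(ticket)
    if not approved_by_engineer:
        return "awaiting approval"
    for action in proposal.actions:
        pass  # in the real system, each action routes through an MCP proxy
    return f"resolved (confidence {proposal.confidence:.0%})"

print(resolve_ticket("Application logs filling disk on PROD-APP-07", True))
```

The key structural point is the approval gate sitting between proposal and execution: the model never acts on its own reasoning alone.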

This is HAIP in action: human judgment guiding AI capability to produce outcomes neither could achieve alone.

Why We’re Building This

The Industry Inflection Point

The infrastructure management industry stands at a transformation point. Agentic AI represents the next evolution beyond traditional automation and AIOps. With current technology, it’s possible to achieve a 60-80% reduction in L1/L2 engineering workload when done right. By 2028, industry analysts predict autonomous infrastructure management will become table stakes for competitive IT services firms.

This shift changes what infrastructure engineering means. Engineers move from reactive firefighting to proactive architecture and optimization. The focus becomes strategic work: capacity planning, security hardening, infrastructure evolution, and business-aligned technology decisions. Operations achieve 24/7 consistent quality with no 2 AM fatigue errors and no knowledge gaps when experts are unavailable. Institutional knowledge becomes preserved in queryable AI systems rather than lost when engineers leave organizations. Quality improves because the AI agent always follows the process.

What It Means for Our Clients

For our clients, this means tangible benefits:

  • Lower cost for services. Agentic engineers can offload a significant amount of work from human engineers. Less labor required means lower cost.
  • Increased quality. Agentic engineers stay within policy-defined rules more reliably than human engineers, ensuring the right process is followed every time. Additionally, agentic engineers are constantly measured: we know how successful their actions are and can keep tuning that success.
  • More in-depth guidance and oversight. Agentic engineers scale far more cheaply than human engineers, making it affordable to run processes at a depth that previously wasn’t. For example, when a human engineer patches systems, the process might be to execute the patching and watch monitoring to confirm the systems come back online. It would not check system logs, test applications, or run vulnerability scans to verify the patch, because there isn’t enough staff for that depth. An agentic engineer can be scaled to handle all of these at a fraction of the cost.

How We’re Building It: Philosophy and Framework

The Agentic AIOps Engineer represents a practical implementation of our broader Human AI Partnership strategy. HAIP recognizes that AI is not a replacement for human expertise—it’s an amplifier. Like any amplifier, it makes whatever you feed it louder. Feed it genius, get genius faster. Feed it garbage, get garbage at scale. The critical intersection where human judgment guides AI capability produces outcomes neither could achieve alone.

Our Architectural Principles

Six core principles guide every architectural decision we make:

  • Safety. Ensure agentic processes cannot take unexpected actions or cause unintended consequences. LLM outputs are probabilistic by nature and can never be fully deterministic; we can get close, but never to 100%. We need a policy framework that wraps around an agentic engineer to ensure it never takes unilateral, unauthorized action, and we need processes that monitor and validate what agentic engineers are doing.
  • Data Isolation and Privacy. We must architect systems so that client data is fully isolated, that we hold only the data we need, and that it remains private. Agents must run in the context of a single client, with an identity that can access only that client’s data and systems, and with a partitioning scheme within RAG that makes reference material from outside that client’s context unavailable.
  • Avoiding vendor lock-in. Platforms like ServiceNow offer compelling end-to-end solutions, but create a permanent dependency. Once committed, you inherit their pricing power, lose the ability to adopt better technologies as they emerge, and face competitive disadvantage if the vendor’s innovation stagnates.
  • Allowing for innovation. Vervint’s culture thrives on engineering creativity and entrepreneurial problem-solving. Rigid enterprise platforms constrain this creativity and risk losing the engineers who thrive on innovation, along with the competitive edge they provide.
  • Avoiding perfect tool syndrome. Every vendor promises comprehensive solutions that “do everything.” Once key leaders or architects set their sights on a tool, it can become almost a religion for the organization, blinding us to the reality of its capabilities and to alternative paths.
  • Focus on the real work. The infrastructure industry has spent fifteen years optimizing incident automation while ignoring the other 50% of engineering work. Even perfect incident resolution doesn’t address “Can you resize my disk?” or “Why is this query slow?” or “How should we architect this new service?” We’re designing agentic workflows for the complete engineering spectrum, not just alert response. This differentiates us: while competitors optimize alerts, we optimize entire engineering capacity.
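The safety principle above implies a hard, deterministic layer outside the model: whatever the LLM proposes, a policy check decides whether the action runs, escalates, or is refused. A minimal sketch of that wrapper, with hypothetical action names and an illustrative allow-list (not our production policy set):

```python
# Illustrative policy wrapper: every action the agent proposes must pass
# an allow-list check before anything executes. The action names and the
# membership of each set are hypothetical examples.
ALLOWED_ACTIONS = {"read_metrics", "read_logs", "rotate_logs", "resize_disk"}
REQUIRES_APPROVAL = {"resize_disk"}  # higher-risk actions need human sign-off

def authorize(action: str, human_approved: bool = False) -> str:
    if action not in ALLOWED_ACTIONS:
        return "denied"      # never executes, regardless of what the model says
    if action in REQUIRES_APPROVAL and not human_approved:
        return "escalate"    # routed to an engineer for review
    return "allowed"
```

Because the check is plain code rather than model output, it stays deterministic even though the LLM behind it is not; an action like `delete_vm` that never appears in the allow-list is simply unexecutable.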

Framework Implementation

Three frameworks shape our implementation approach:

Progressive Autonomy Stages guide our two-year journey from assisted recommendations to autonomous operations:

  • Stage 1 (Months 1-6): Foundation Building. We construct the knowledge pipeline, implementing RAG databases populated with past ticket resolutions, root cause analyses, architecture documentation, and runbooks. We build MCP proxy abstractions providing AI-safe access to infrastructure across VMware, AWS, Oracle, Azure, and others. We integrate research capabilities allowing the AI to search vendor documentation, Stack Overflow, and other external knowledge sources.
  • Stage 2 (Months 6-12): Supervised Execution. The AI proposes solutions and, with explicit approval, executes them. “I’ve identified the disk space issue on PROD-APP-07. The application logs haven’t rotated in six days due to a misconfigured logrotate job. I can fix this by updating the configuration and forcing rotation. May I proceed?” The engineer reviews the proposal, validates the reasoning, and approves execution. Every action requires human oversight.
  • Stage 3 (Months 12-18): Autonomous L1/L2. For routine, low-risk operations, the AI executes within defined policy boundaries. Disk space management, user provisioning, password resets, routine configuration adjustments—work that matches established patterns with high confidence. Exceptions escalate to humans. Complex scenarios escalate to humans. Anything outside defined safety parameters escalates to humans.
  • Stage 4 (Months 18+): Advanced L3 Scenarios. Ongoing expansion into complex troubleshooting, capacity planning, and architecture consulting. This stage has no endpoint—it represents continuous capability expansion as the system proves reliability in more sophisticated scenarios. Strategic decisions remain human-owned.
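The stages above reduce to a simple gating rule: the current stage plus the risk of the work determines whether the system may only recommend, must seek approval, may act autonomously, or must escalate. A sketch of that rule (the risk tiers and thresholds here are illustrative, not our production values):

```python
# Stage-based autonomy gating, following the four stages described above.
# Risk tiers ("low", "medium", "high") and the mapping are illustrative.
def execution_mode(stage: int, risk: str) -> str:
    if stage == 1:
        return "recommend_only"    # Stage 1: read-only recommendations
    if stage == 2:
        return "supervised"        # Stage 2: executes only with explicit approval
    if risk == "low":
        return "autonomous"        # Stage 3+: routine, low-risk work runs itself
    if stage >= 4 and risk == "medium":
        return "autonomous"        # Stage 4: scope expands as reliability is proven
    return "escalate_to_human"     # everything else goes to an engineer
```

Note the asymmetry: autonomy is granted per risk tier, never wholesale, so even a mature Stage 4 system still escalates high-risk work.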

Multi-Agent Coordination provides specialized expertise and failure isolation:

Rather than building one monolithic AI trying to handle all infrastructure domains, we’re architecting specialized agents:

  • Technology agents. Agents that specialize in specific technologies: an agent for Azure, an agent for VMware, and so on. Think of these like having a storage engineer, a network engineer, and so forth.
  • Coordinator agent. The agent that creates the plan of attack and coordinates the other agents.
  • Process agents. Agents that focus on ensuring required processes are being followed, like change management, incident management, etc.

Think of these agents as a cross-functional team working to solve every problem in a coordinated way. Multi-agent approaches also shrink the context window each agent must handle, which is critical to quality: LLMs tend to get “dumber” as the data in the context window grows.
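The coordination pattern can be shown in a few lines: the coordinator decomposes a request into steps and routes each step to a specialist, so no single agent carries the whole context. In this toy sketch the specialists are stub functions and the plan is hard-coded; in the real system both would be driven by LLM calls. All agent names and plan steps are illustrative.

```python
# Toy coordinator-style dispatch. Each specialist sees only its own task,
# never the full conversation, which keeps per-agent context small.
SPECIALISTS = {
    "azure": lambda task: f"[azure agent] {task}",
    "vmware": lambda task: f"[vmware agent] {task}",
    "change_mgmt": lambda task: f"[change agent] {task}",
}

def coordinate(request: str) -> list:
    # A real coordinator would plan with an LLM; this plan is hard-coded
    # for illustration, wrapping the technical step in change management.
    plan = [("change_mgmt", "open change record"),
            ("vmware", "resize datastore"),
            ("change_mgmt", "close change record")]
    return [SPECIALISTS[agent](task) for agent, task in plan]
```

Note how the process agent brackets the technology agent’s work: following the required process is itself a step in the plan, not an afterthought.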

AEGIS Security Framework ensures autonomous operations remain secure and compliant:

The AEGIS (Agentic AI Guardrails for Information Security) framework defines six domains for securing autonomous AI systems, which we’re embedding throughout our architecture: Governance, Identity, Data security, Application security, Threat management, and Zero Trust Architecture. The framework ensures AI processes are secure, stay in their lane, and safeguard client data.

These frameworks work together. Progressive autonomy prevents rushing to autonomous operations before the system proves reliable. Multi-agent coordination provides specialization and safety. AEGIS ensures security throughout.

The Journey Ahead

The future isn’t about AI replacing engineers. It’s about engineers amplified by AI, delivering outcomes neither could achieve alone. It’s about humans providing the domain expertise, contextual judgment, and strategic thinking that AI cannot replicate, while AI provides the tireless consistency, institutional memory, and systematic execution that humans cannot sustain.

Follow this series as we build, measure, fail, iterate, and transform how infrastructure engineering works.

Ready to get started?

Let’s talk