Product

Human Expert Guidance Meets Agentic AI: The Architecture for Scalable Autonomous Operations

How organizations are building, testing, and sharing reusable AI automation assets — agents, skills, runbooks, and approval policies — to autonomously resolve 80% of common operational tasks while keeping humans in control of the remaining 20%.

Steve Tran

Tags: vibeops, agentic-ai, autonomous-ops, skills-framework, multi-agent-systems, enterprise-security, aiops



01 — From "AI That Alerts" to "AI That Acts"

For the past decade, IT operations has been trapped in a paradox: we built increasingly sophisticated monitoring systems that could detect problems faster than ever, but still required humans to investigate, decide, and act. The result? Alert fatigue at an industrial scale — teams drowning in 500+ alerts per day, with 90% being noise.

The industry evolved through predictable stages:

The Evolution of IT Operations

2010–2018 (Detect only): Manual Monitoring. Threshold-based alerts, static dashboards, human-driven investigation and response.
2018–2023 (Detect + Recommend): AIOps 1.0. ML anomaly detection, event correlation, noise reduction — but still human-executed.
2023–Now (Detect + Act): Agentic AIOps. AI agents autonomously investigate, decide, and act within governed boundaries.

But here's what most platforms get wrong: they treat automation as a binary switch. Either a human does it, or the machine does it. The reality is far more nuanced. The future isn't about replacing human expertise — it's about encoding human expertise into shareable, testable, governed automation assets that AI agents can execute autonomously across an entire organization.

The goal isn't zero-touch operations. The goal is autonomous execution of the 80% of tasks that are common, well-understood, and low-risk — while escalating the 20% that genuinely need human judgment with full context already assembled.

The VibeOps Principle

This is the architecture of VibeOps: closed-loop, self-healing operations where human experts guide agentic AI through shareable automation primitives — and where every resolution makes the entire system smarter.


02 — Six Building Blocks of Scalable Autonomous Operations

Scalable automation isn't about writing scripts. It's about creating composable, shareable, governed primitives that any team can develop in a workspace, test in isolation, and deploy across the organization. Each primitive encodes a different dimension of operational expertise:

Platform Building Blocks

/create-agent: Specialist AI personas with defined skills, permissions, and sandbox boundaries.
/create-knowledge: Vectorized, RAG-indexed knowledge bases from Confluence, Notion, Slack, incidents.
/create-skill: Modular capability definitions with triggers, prompts, guardrails, output schemas.
/create-topo: Live topology maps auto-discovered from infrastructure. Dependency graphs, blast radius.
/create-task: Scheduled autonomous operations — cron-based, event-triggered, or chained via skills.
/create-incident: Automated incident orchestration — severity classification, RCA, war room creation.

The critical insight: each of these primitives is a unit of shareable expertise. When a senior SRE creates a skill for diagnosing Redis connection pool exhaustion, that skill doesn't stay locked in their head or their team's Notion page. It becomes an organization-wide capability that any agent can invoke, any team can benefit from, and the system continuously improves through usage.

The Anatomy of a Skill

Skills are the atomic unit of agentic behavior. Unlike traditional runbooks (static documents that humans read and execute) or scripts (rigid code that breaks when context changes), a skill is a living, contextual, self-evaluating capability definition:

SKILL.md: SSL Certificate Expiration Check (v2.4, L3)

Trigger Patterns: "ssl expiring", "certificate renewal", "TLS check", "cert audit"

Required Tools: aws-cli, acm-api, route53, slack-webhook

Connections: aws-prod, aws-staging, cloudflare

Prompt Template:
  system: certificate auditor
  context: {topo + incidents}
  task: Scan → Assess → Act

Guardrails:
  Read-only scans
  Alert via Slack
  Auto-renew requires approval
  No certificate deletion

Output Schema:
  domain: string
  expiry_date: datetime
  days_remaining: int
  risk_level: enum
  issuer: string
  action_taken: string

Chain Triggers:
  IF days_remaining < 14 → create-jira-ticket
  IF days_remaining < 7 → auto-renew + notify
  IF renewal_failed → escalate-to-human

Learning Loop: Scoring → Feedback → Knowledge → Graduation

This skill can be developed by one team, tested in their sandbox, and once approved through the governance layer, shared across the entire organization. Every team gets SSL monitoring without writing a single line of code.
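To make the chain-trigger logic tangible, here is a minimal sketch of how the thresholds in the skill card above could dispatch to follow-up actions. The function name and the returned action labels are illustrative assumptions, not the platform's actual API:

```python
def route_certificate(days_remaining: int, renewal_failed: bool = False) -> str:
    """Mirror the skill's chain triggers: pick the follow-up action
    for a certificate based on how close it is to expiry."""
    if renewal_failed:
        return "escalate-to-human"      # a failed auto-renewal always goes to a person
    if days_remaining < 7:
        return "auto-renew+notify"      # urgent: renew automatically and notify the team
    if days_remaining < 14:
        return "create-jira-ticket"     # approaching expiry: open a tracking ticket
    return "no-action"                  # healthy certificate, nothing to do

# A certificate expiring in 10 days gets a ticket, not an automatic renewal.
action = route_certificate(10)  # → "create-jira-ticket"
```

Note the ordering: the tighter threshold is checked first, so a certificate 5 days from expiry triggers renewal rather than merely a ticket.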


03 — Develop, Test, Share, Automate

The path from human expertise to autonomous execution follows a governed, four-stage lifecycle. This isn't "deploy and pray" — it's a structured process where every automation asset is validated before it touches production, and continuously monitored after deployment.

Automation Lifecycle

1. Develop: Build agents, skills, and knowledge with SDK and visual editor.
2. Test: Simulate scenarios in sandboxed environments with eval suites.
3. Share: Publish to marketplace. Version, share, and fork across teams.
4. Automate: Deploy with graduated autonomy. Monitor and continuously improve.

Stage 1: Develop in Workspaces

Within each Organization, teams create Workspaces that serve as isolated development environments segmented by project, department, or environment. Each Workspace maintains its own dedicated knowledge base, skill configurations, agent permissions, connection credentials, and scheduled tasks. A DevOps team's Workspace is completely separate from the Security team's — different knowledge, different connections, different permission boundaries.

The development process itself is conversational. Users talk to @Anna (the AI orchestrator) in natural language: "Create a skill that checks database backup status across all production RDS instances every morning." @Anna generates the complete SKILL.md with trigger patterns, required tools, prompt templates, guardrails, output schemas, and chain triggers. What previously required weeks of engineering now takes minutes of conversation.

Stage 2: Test in Sandboxes

Every AI agent operation runs in an isolated, ephemeral Sandbox — created on demand, destroyed immediately after execution. CloudThinker's proprietary sandbox runtime provides isolated microVMs for compute isolation, per-tenant VPC for network isolation, and ephemeral per-session storage that's destroyed after execution. No data persists. No cross-sandbox access is possible. A full audit trail is captured for every operation.

This isn't just "testing." It's risk-free production simulation. Skills execute against real (or mirrored) infrastructure with their actual tool chains, but within boundaries that guarantee zero blast radius. The evaluation pipeline scores each execution using LLM-as-judge accuracy assessment combined with human feedback, building confidence metrics that feed into the approval workflow.
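As a sketch of how such an evaluation pipeline might aggregate results, consider blending per-scenario judge scores with human feedback into a single confidence value. The field names and the 0.9 review threshold are assumptions for illustration, not CloudThinker's actual scoring scheme:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    scenario: str
    judge_score: float    # LLM-as-judge accuracy score in [0, 1]
    human_approved: bool  # human feedback on the sandbox run

def confidence(results: list[EvalResult]) -> float:
    """Blend judge scores with human feedback into one confidence value.
    A human rejection zeroes out that run's contribution entirely."""
    if not results:
        return 0.0
    effective = [r.judge_score if r.human_approved else 0.0 for r in results]
    return sum(effective) / len(effective)

runs = [
    EvalResult("expired-cert", 0.98, True),
    EvalResult("renewal-race", 0.92, True),
    EvalResult("wildcard-cert", 0.85, False),  # judge liked it; a human did not
]
# Below the threshold, the skill stays in the sandbox for more iterations.
ready_for_review = confidence(runs) >= 0.9
```

The design choice worth noting: human rejection vetoes the judge score rather than averaging with it, so a skill cannot pass on LLM enthusiasm alone.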

Stage 3: Share via Governed Approval

Once a skill, agent, or runbook passes testing, it enters the enterprise governance layer — RBAC policies, approval workflows, credential scoping, audit trails, and metering. This is where the "human-guided" part of autonomous operations becomes critical. A senior engineer doesn't just build a skill and push it to production. They submit it through an approval workflow where security reviews credential scope, compliance verifies guardrails, and team leads validate business logic.

The sharing model operates at three tiers: Built-in Skills maintained by the platform (Code Review, Incident Response, HelpDesk, Cloud Ops, Security, FinOps), Partner Skills built by managed service providers and shared across their client base, and Enterprise Custom Skills built internally for domain-specific needs like banking compliance or healthcare regulations.

Stage 4: Automate with Graduated Autonomy

Shared skills don't immediately run with full autonomy. They enter a graduated autonomy model that builds trust incrementally:

Graduated Autonomy Levels

L1 (Notify): AI detects and alerts. Human investigates and acts.
L2 (Suggest): AI investigates and recommends. Human approves before execution.
L3 (Act + Approve): AI executes with approval gates for high-risk. Low-risk auto-resolves.
L4 (Autonomous): Full closed-loop execution. Self-healing with continuous monitoring.

The 80% target for autonomous resolution isn't arbitrary — it maps precisely to the L3 and L4 levels applied to well-understood, low-risk, high-frequency tasks. Password resets, SSL certificate renewals, disk space alerts, cost anomaly notifications, routine backup validations — these are the tasks that consume 80% of operations time and carry minimal risk when automated properly. The remaining 20% — novel incidents, architectural decisions, security breaches, compliance investigations — get escalated to humans with full context already assembled by the AI, reducing even the human-handled portion's MTTR dramatically.
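One way to picture the gating logic is a small decision function. The level names follow the table above; the risk tiers and returned action labels are illustrative assumptions:

```python
def decide(level: str, risk: str) -> str:
    """Map an autonomy level and a task's risk tier to what the agent may do.
    L1 only notifies; L2 recommends and waits; L3 auto-resolves low-risk work
    but gates high-risk actions behind approval; L4 runs the full closed loop."""
    if level == "L1":
        return "notify-human"
    if level == "L2":
        return "recommend-await-approval"
    if level == "L3":
        return "auto-resolve" if risk == "low" else "execute-after-approval"
    if level == "L4":
        return "auto-resolve"
    raise ValueError(f"unknown autonomy level: {level}")

# A routine disk-space cleanup at L3 resolves itself; a security-group
# change classified high-risk still waits for a human gate.
routine = decide("L3", "low")
sensitive = decide("L3", "high")
```

The 80% figure then falls out of the task mix: if most operational volume is low-risk and runs at L3 or L4, it auto-resolves; everything else routes through a human gate.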


04 — Three-Tier Isolation: Organizations, Workspaces, Sandboxes

Autonomous AI agents executing operations across your infrastructure is only viable if the isolation model is bulletproof. The architecture implements a three-tier isolation hierarchy that ensures complete tenant, team, and execution-level separation — critical for banking, healthcare, and any enterprise with compliance requirements.

Security Isolation Model

Organization (Tenant Boundary): SSO, RBAC, Billing, Audit, Compliance, User Mgmt
  Workspace: Production Ops (Knowledge Base, Skill Config, Agent Perms, Credentials)
    Sandbox A | Sandbox B
  Workspace: IT Support (Knowledge Base, Skill Config, Agent Perms, Credentials)
    Sandbox A | Sandbox B

No data crosses organization boundaries. No data persists in sandboxes. No cross-sandbox access is possible. Every execution produces an immutable audit trail. The technology stack — isolated microVMs, kernel-level syscall filtering, per-tenant VPC, customer-managed encryption keys — provides defense-in-depth that satisfies SOC 2 Type II, banking regulatory compliance, and data residency controls.

This isolation model is what makes "share across organization" safe. When a DevOps team shares a skill with the Security team, the skill definition is shared — but the credentials, execution context, and data access remain scoped to each team's Workspace. The Security team's instance of the skill connects to their tools with their permissions. The same skill blueprint, completely isolated execution.
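The "same blueprint, isolated execution" invariant can be sketched in a few lines. The credential store shape and the workspace names are hypothetical; the point is only that the skill name is shared while the secrets never are:

```python
# Hypothetical per-workspace credential store: the same skill name resolves
# to different, isolated secrets depending on which workspace invokes it.
CRED_STORE = {
    ("devops", "ssl-cert-check"):   {"aws_profile": "devops-prod"},
    ("security", "ssl-cert-check"): {"aws_profile": "security-audit"},
}

def resolve_credentials(workspace: str, skill: str) -> dict:
    """Look up credentials scoped to the calling workspace.
    A workspace can never see another workspace's secrets."""
    key = (workspace, skill)
    if key not in CRED_STORE:
        raise PermissionError(f"{workspace} has no grant for {skill}")
    return CRED_STORE[key]

# Same skill blueprint, two fully isolated execution contexts.
devops_creds = resolve_credentials("devops", "ssl-cert-check")
security_creds = resolve_credentials("security", "ssl-cert-check")
```

A workspace without a grant gets a hard `PermissionError` rather than a fallback, which is the behavior the isolation model requires.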


05 — From Expert Knowledge to Autonomous Resolution

Let's trace the complete lifecycle of how a senior SRE's expertise becomes an organization-wide autonomous capability:

Scenario: Redis Connection Pool Exhaustion → Auto-Remediation
1. Senior SRE identifies a pattern: Redis connection pool exhaustion has caused 12 incidents in the past quarter.
2. Develop the skill: /create-skill "Detect Redis connection pool exhaustion by monitoring active connections vs. max pool size..."
3. @Anna generates SKILL.md with trigger patterns, tool requirements (redis-cli, aws-cloudwatch, k8s-api), guardrails...
4. Test in sandbox: The SRE runs the skill against a mirrored staging environment. Result: 98% accuracy across 50 simulated scenarios.
5. Submit for approval: Security reviews tool access. DevOps lead validates remediation logic. Compliance confirms audit trail.
6. Share across organization: Approved skill published to Skills Library with RBAC tags.
7. Graduated deployment: Starts at L2 for first 2 weeks. After 15 successful resolutions, promoted to L3. After 90 days, eligible for L4.
8. Continuous learning: Every execution feeds back into Knowledge Graph. Next incident is prevented, not just resolved.

One senior SRE's expertise. Twelve incidents per quarter eliminated. Organization-wide capability. Continuous improvement. This is the flywheel effect of human-guided agentic AI.
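The promotion rules quoted in step 7 could be encoded as a simple policy check. The thresholds (15 successful resolutions, 90 days) come from the scenario above; the function shape is an assumption:

```python
def next_level(current: str, successes: int, days_deployed: int) -> str:
    """Promote a skill one autonomy level at a time once it has earned trust:
    L2 -> L3 after 15 successful resolutions, L3 -> L4 after 90 days deployed."""
    if current == "L2" and successes >= 15:
        return "L3"
    if current == "L3" and days_deployed >= 90:
        return "L4"
    return current  # not enough evidence yet; hold the current level

# Two weeks in with 14 resolutions: still L2. One more success promotes it.
level = next_level("L2", successes=14, days_deployed=14)
```

Promotion is deliberately one step at a time: a skill cannot jump from L2 straight to L4, no matter how well it performs.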


06 — @Anna: One Conversation, Unlimited Capabilities

The orchestration layer is what makes all of these primitives work together as a coherent system. @Anna serves as the single intelligent entry point — users talk to her in natural language, and she automatically classifies intent, selects the right skills, and delegates to specialist agents when needed.

Execution Pipeline

01 Intent Detection
02 Guard-In
03 Skill Selection
04 Context Build
05 LLM Routing
06 Sandbox Boot
07 Tool Execution
08 Chain / Sub-Agent
09 Guard-Out
10 Event Log
11 Eval & Trace
12 Memory + Deliver

Every user request — whether it's a natural language question, a slash command, or a scheduled task trigger — flows through this standardized 12-stage pipeline. The pipeline ensures every operation is secure (Guard-In + Guard-Out), auditable (Event Log), evaluated (LLM-as-judge tracing), and continuously improving (Memory + Deliver). There are no shortcuts, no backdoors, no unlogged executions.
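The pipeline's key property, that every request passes through the same ordered stages with no way to skip the guards or the log, can be sketched as a fixed traversal. The stage functions here are stand-ins that only record the walk, not the platform's internals:

```python
STAGES = [
    "intent-detection", "guard-in", "skill-selection", "context-build",
    "llm-routing", "sandbox-boot", "tool-execution", "chain-subagent",
    "guard-out", "event-log", "eval-trace", "memory-deliver",
]

def run_pipeline(request: str) -> list[str]:
    """Walk a request through every stage in order. In the real system each
    stage does work (guards can reject, the event log is immutable); here we
    only record the traversal to show that no stage can be bypassed."""
    trace = []
    for stage in STAGES:
        trace.append(f"{stage}:{request}")
    return trace

trace = run_pipeline("renew cert for api.example.com")
```

Because the stage list is a single fixed sequence, Guard-In always runs before any tool executes and Event Log always runs after, which is the "no shortcuts" guarantee in miniature.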

The Guardrails Engine (Layer 9 in the platform stack) operates as an independent safety agent — it doesn't answer to the orchestrator or the executing agent. It performs PII detection, schema enforcement, injection defense, and output validation on every pipeline execution. This separation of concerns is critical: the agent that executes the action is never the agent that validates the action.


07 — Why Every Automation Makes the Entire System Smarter

The most powerful aspect of this architecture isn't any individual component — it's the compounding flywheel that emerges when all the primitives work together through a shared Knowledge Graph.

Platform Flywheel

More skills → more capabilities → more use cases → more teams → smarter AI → higher value → (and around again)

Consider what happens over time: every resolved incident generates a new Knowledge Graph entry. Every skill execution produces evaluation data that improves future executions. Every new team that onboards their runbooks via /create-knowledge enriches the contextual understanding for all agents. Every topology mapping via /create-topo gives the system deeper infrastructure awareness. The AI doesn't just execute — it accumulates operational wisdom.

The Closed-Loop Intelligence Architecture

Traditional platforms operate as open loops: detect, alert, human acts, done. The VibeOps architecture is a closed loop where four modules continuously feed data to each other:

| Dimension | Traditional AIOps | Agentic VibeOps |
| --- | --- | --- |
| Response Model | Alert → Human Investigates → Manual Fix | Detect → Agent Investigates → Auto-Resolve |
| Knowledge | Static runbooks (read by humans) | Dynamic Knowledge Graph + RAG (invoked by agents) |
| Learning | Manual rule updates | Continuous self-improvement per execution |
| Cross-Domain | Siloed per tool | Multi-agent collaboration via orchestrator |
| Speed (MTTR) | 4+ hours average | < 15 minutes (70% reduction) |
| Scale Model | Linear with headcount | Exponential with AI agents |

08 — The 80% Autonomous Target

The 80% autonomous target isn't about replacing humans — it reflects a precise categorization of operational work by risk profile and repeatability. The architecture achieves this through graduated autonomy applied systematically across operational domains:

IT HelpDesk auto-resolution: 70%
DevOps auto-handling: 85%
HR onboarding automation: 65%
Avg resolution time: 47s

Automate (L3–L4): Password resets, VPN troubleshooting, software installation, access provisioning, SSL renewals, disk space management, cost anomaly alerts, routine backup validation, deployment rollbacks for known failure patterns, certificate rotation, DNS changes, security group modifications within approved templates.

Human-guided (L1–L2): Novel incident types, architectural decisions, security breach investigation, compliance audit interpretation, budget allocation, vendor selection, cross-team escalation policies, infrastructure migration planning, regulatory change assessment.

The key architectural decision is that even the human-handled 20% benefits from AI assistance. When an incident escalates to a human, the agent has already assembled full diagnostic context — the topology showing blast radius, the Knowledge Graph entries for similar past incidents, the timeline of related changes, and a proposed remediation plan with confidence scoring. The human doesn't start from zero; they start from a complete briefing.
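A sketch of what "full context already assembled" might look like as a data structure handed to the on-call human. All field names and sample values are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class EscalationBriefing:
    """Everything the agent assembles before paging a human."""
    incident_id: str
    blast_radius: list[str]       # services inside the failure's dependency cone
    similar_incidents: list[str]  # Knowledge Graph hits for comparable past events
    recent_changes: list[str]     # timeline of related deploys and config changes
    proposed_remediation: str     # the agent's best remediation plan
    confidence: float             # how sure the agent is about that plan

    def summary(self) -> str:
        return (f"{self.incident_id}: {len(self.blast_radius)} services affected, "
                f"{len(self.similar_incidents)} similar past incidents, "
                f"proposed fix '{self.proposed_remediation}' "
                f"(confidence {self.confidence:.0%})")

briefing = EscalationBriefing(
    incident_id="INC-4821",
    blast_radius=["checkout", "payments"],
    similar_incidents=["INC-3310", "INC-2977"],
    recent_changes=["payments v2.14 deploy at 09:12"],
    proposed_remediation="rollback payments to v2.13",
    confidence=0.82,
)
```

The human's first view is `briefing.summary()` rather than a raw alert, which is the "complete briefing" starting point described above.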


09 — From Reactive Chaos to Self-Healing Operations

Reaching 80% autonomous resolution doesn't happen on day one. The implementation follows a 12-month, four-phase maturity journey designed to build organizational trust incrementally — with ROI visible from Month 2.

Implementation Journey

Months 1–3 (Foundation): Unified data layer, event correlation, Knowledge Graph setup — 80% alert noise eliminated.
Months 3–6 (Agentic Core): Incident Commander agent, automated RCA, human-in-the-loop — MTTR reduced by 60%.
Months 6–9 (VibeOps Scale): Multi-agent orchestration, predictive prevention — 70% autonomous resolution.
Months 9–12 (Self-Healing): Closed-loop remediation, self-improving runbooks — near-zero-touch operations.

Each phase expands both the number of teams using the platform and the autonomy level of deployed skills. The pattern is consistent: start with L1–L2 for new domains, prove accuracy, build confidence, graduate to L3–L4. The organization never takes more risk than the data supports.


10 — The Human Expert Doesn't Disappear — They Scale

The fundamental misconception about autonomous AI operations is that it replaces human expertise. The reality is the opposite: it scales human expertise to an organizational level. One senior SRE's diagnostic methodology doesn't help only when they're on-call — it helps every team, every shift, every incident, continuously improving.

The architecture we've described — shareable automation primitives, workspace-isolated development, sandbox testing, RBAC-governed sharing, graduated autonomy, three-tier security isolation, closed-loop learning — isn't a theoretical framework. It's the mechanical reality of how human knowledge becomes autonomous capability at enterprise scale.

The organizations that get this right won't just operate faster. They'll create a compounding intelligence advantage where every incident resolved, every skill created, and every knowledge base enriched makes the entire system measurably smarter. That's not automation. That's evolution.

Enterprises that skip directly to Agentic VibeOps gain 2–3 years of competitive advantage over those that evolve incrementally through traditional AIOps.


Ready to Build Scalable Autonomous Operations?

Stop drowning in alerts. Start encoding your team's expertise into shareable, governed automation assets that resolve 80% of operational tasks autonomously.

Start your free trial or book a demo to see human-guided agentic AI in action.