Authors: Bryan Barton and Kassandra Svoboda – Precisely Platform Engineering
This post is part of an ongoing series exploring how the Platform Engineering team at Precisely uses AI agents to scale platform operations.
How a Documentation-First Approach Makes Automation More Accessible
Key Takeaways
- Automating platform processes used to require senior-level institutional knowledge. The right documentation-first AI agent framework changes that prerequisite — if you can document a process accurately, you can encode it as an agent.
- Documentation is not the preamble to building an agent. It is the work. An agent enforces exactly what has been made explicit — no more, no less.
- The reusable layer is the reference document, not the skill. Update it once and every agent that consults it picks up the change.
- Centralized safety constraints mean any engineer can contribute without needing to independently know every guardrail.
Not long ago, if you wanted to automate a platform process at Precisely, you needed to be a senior engineer. Not because the work was formally restricted — but because doing it safely required accumulated context about what could go wrong, what the conventions were, and where the guardrails needed to be. That knowledge lived in people, not documentation.
That has changed. Any engineer on the platform team can now take a process they run repeatedly, document it, and turn it into an AI agent that benefits the whole team. Here’s how a documentation-first approach made that possible at Precisely — and why it changes who can contribute to platform engineering automation.
Why Is Senior Engineering Knowledge a Bottleneck, and How Do AI Agents Fix It?
In our previous post, we showed how AI-assisted workflows gave engineers a more complete picture of infrastructure changes before implementation begins — and how that completeness directly reduced risk. The core insight was simple: for high-consequence platform changes, failures are almost always a failure of information, not a failure of effort.
Every recurring platform engineering process has the same underlying challenge. The process is well understood in principle. But the knowledge of how to do it right — the edge cases, the gate checks, the things that will silently fail if you miss them — is usually concentrated in the senior engineers who have done it before.
That creates a familiar bottleneck. Senior platform engineers aren’t gatekeeping — they’re simply the only ones who carry enough context to automate safely. Everyone else either waits, attempts it with partial information, or doesn’t attempt it at all.
Why Writing the Runbook First Is Key to Building Reliable AI Agents
One intermediate-level engineer, who had run onboardings enough times to know where the friction was and where things went wrong, looked at the service onboarding process and decided it could be done better.
The first thing they did was not write code. They wrote a runbook. That’s the core of a documentation-first approach to AI agent development: the document isn’t prep work. It is the work.
That choice is the reason the resulting agent is reliable. An agent can only enforce what has been made explicit — so before anything could be automated, everything relevant had to be written down:
- What infrastructure needs to exist before a service can run?
- What conditions must be met before a deployment can be promoted?
- What are the common failure scenarios?
Here is one section of that runbook, the staging promotion checklist that must pass before any change reaches production:
Stg Promotion Checklist
All must pass before promoting to prd:
- All pods in Running state with no restart loops
- Readiness and liveness probes passing
- Integration tests pass against stg endpoints
- Observability monitors show healthy state (no active alerts)
- SLO burn rate within acceptable bounds
- Service team signed off on stg validation
- Platform engineering notified of upcoming prd deployment
Every item represents a failure mode someone had encountered before. External Secrets misconfigurations, a well-known silent failure mode in this operator, had previously been caught only during manual review. Writing the check down as an explicit gate means the AI agent catches them automatically, for every engineer who uses it.
The staging gate caught a misconfigured secret on the first onboarding the agent ran. Without the gate, that secret would have reached production. That was not clever engineering in the agent. It was a checklist item that existed because someone wrote it down.
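To make the idea of an explicit gate concrete, here is a minimal Python sketch of the staging checklist above expressed as data plus one rule: promote only if everything passes. It is an illustration, not the agent's actual implementation. The `GateCheck` type, the snapshot keys, and the function names are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GateCheck:
    """One line of the staging promotion checklist."""
    name: str
    passed: Callable[[dict], bool]  # receives a snapshot of staging state


# Hypothetical snapshot keys. A real agent would gather these values from
# the cluster, the test runner, and the observability stack.
STG_PROMOTION_CHECKS = [
    GateCheck("All pods Running, no restart loops",
              lambda s: s["pods_running"] and not s["restart_loops"]),
    GateCheck("Readiness and liveness probes passing",
              lambda s: s["probes_ok"]),
    GateCheck("Integration tests pass against stg",
              lambda s: s["integration_tests_ok"]),
    GateCheck("No active alerts",
              lambda s: s["active_alerts"] == 0),
    GateCheck("SLO burn rate within bounds",
              lambda s: s["slo_burn_rate"] <= s["slo_burn_limit"]),
    GateCheck("Service team signed off",
              lambda s: s["service_team_signoff"]),
    GateCheck("Platform engineering notified",
              lambda s: s["platform_notified"]),
]


def failed_checks(snapshot: dict, checks=STG_PROMOTION_CHECKS) -> list[str]:
    """Return the names of every check that did not pass."""
    return [c.name for c in checks if not c.passed(snapshot)]


def can_promote(snapshot: dict) -> bool:
    """Promotion to prd is allowed only when every check passes."""
    return not failed_checks(snapshot)


# Example snapshot in which every gate passes.
healthy = {
    "pods_running": True, "restart_loops": False, "probes_ok": True,
    "integration_tests_ok": True, "active_alerts": 0,
    "slo_burn_rate": 0.4, "slo_burn_limit": 1.0,
    "service_team_signoff": True, "platform_notified": True,
}
```

Because the checks live in a list rather than in control flow, adding a newly discovered failure mode is a one-line change, which mirrors adding a line to the runbook.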
How We Built an AI Agent Framework Any Engineer Can Contribute To
What made this accessible to an engineer at any level was not just writing a good runbook. It was having a documentation-first AI agent framework that turned a good runbook into a safe, working agent — without requiring the contributor to independently know every guardrail or convention.
The platform agent kit is a repository of agent definitions, skills, and reference documents built on top of GitHub Copilot’s VS Code agent customization framework. It is a set of markdown and YAML files that, once installed, make a suite of named agents available in the VS Code chat panel.
The structure is deliberately simple:
- A skill is a markdown document that teaches the AI assistant one job. One skill, one responsibility.
- An agent is a short YAML configuration that gives the skill a user-facing name, declares which AI model it runs on, and lists the tools it is permitted to use — terminal access for cluster operations, and MCP integrations for ticketing, source control, and observability.
- References are the shared knowledge layer: Terraform patterns, GitOps manifest structures, input checklists. They contain no opinion about context — just how to do something.
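One way to picture how these three pieces relate is as a small data model. The Python sketch below is only a mental model. The kit itself is markdown and YAML, and every field name, file path, model name, and tool label here is a hypothetical stand-in rather than the kit's real schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Reference:
    """Shared how-to knowledge with no opinion about context."""
    path: str


@dataclass(frozen=True)
class Skill:
    """A markdown document that teaches the assistant one job."""
    path: str
    responsibility: str
    references: tuple[Reference, ...] = ()


@dataclass(frozen=True)
class Agent:
    """A user-facing name, a model, and an explicit tool allowlist."""
    name: str
    model: str
    skill: Skill
    tools: frozenset[str] = field(default_factory=frozenset)

    def may_use(self, tool: str) -> bool:
        """A tool is available only if the agent definition lists it."""
        return tool in self.tools


# Hypothetical example values, not taken from the real repository.
cluster_session = Reference("references/cluster-session.md")
onboarding_skill = Skill(
    path="skills/service-onboarding.md",
    responsibility="Coordinate service onboarding",
    references=(cluster_session,),
)
onboarding = Agent(
    name="Service Onboarding",
    model="example-model",
    skill=onboarding_skill,
    tools=frozenset({"terminal", "mcp:ticketing", "mcp:source-control"}),
)
```

The point of the shape is that each layer holds only what it owns: the agent holds permissions, the skill holds one responsibility, and the reference holds the how-to.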
The framework ships with non-negotiable safety constraints every new skill inherits by reference. Every skill that touches infrastructure references a shared set of rules before suggesting a command — ensuring that changes go through the CI pipeline after merge rather than running locally. A new skill author does not rebuild these constraints. They reference them.
This is what changes the prerequisite for contribution to platform engineering automation. The engineer building the onboarding agent needed to understand the process well enough to document it. They did not need to independently know every safety constraint or infrastructure convention. The kit carried that. They contributed the knowledge of the process. The framework handled the rest.
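A rough sketch of what inheriting a guardrail by reference could look like in code: one shared rule set, consulted before any command is suggested, that routes infrastructure-changing commands to the CI pipeline. The rule list and function names are assumptions for illustration. The prefix match is deliberately simplistic (it would miss a command with flags placed before the subcommand), so treat it as a shape, not a policy engine.

```python
import shlex

# Hypothetical shared rule set: command prefixes that change infrastructure
# and therefore must run in the CI pipeline after merge, never locally.
PIPELINE_ONLY_PREFIXES = (
    ("terraform", "apply"),
    ("terraform", "destroy"),
    ("kubectl", "apply"),
    ("kubectl", "delete"),
)


def violates_shared_rules(command: str) -> bool:
    """True if the command starts with a pipeline-only prefix."""
    tokens = tuple(shlex.split(command))
    return any(tokens[:len(p)] == p for p in PIPELINE_ONLY_PREFIXES)


def suggest(command: str) -> str:
    """Every skill calls this instead of re-implementing the guardrail."""
    if violates_shared_rules(command):
        return f"BLOCKED: '{command}' must go through the CI pipeline after merge"
    return command
```

A new skill author only ever calls `suggest`. Tightening the rules means editing `PIPELINE_ONLY_PREFIXES` in one place, and every skill picks up the stricter behavior.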
One Skill, One Job: How Modular AI Agent Design Scales Platform Operations
The result is a single onboarding AI agent whose skill acts as a coordinator. Rather than containing all the knowledge of every infrastructure operation itself, it reads separate reference documents at the moment each operation is needed. The coordinator has no opinion about how to do any individual operation. The references have no opinion about when or why. Each does one job.
That decomposition is the principle the kit is built on. A single reference for opening a cluster session, for example, is shared across onboarding, platform debugging, pod triage, and cluster auditing. Each skill loads it when it needs it. Update it once and every skill that reads it picks up the change automatically.
The reference document is the runbook for that operation, which means the AI agent architecture enforces the documentation-first principle structurally, not just culturally. Any engineer who updates a reference improves every platform engineering workflow that depends on it at once.
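The "update it once" property follows from reading the reference at the point of use rather than copying it into each skill. The Python sketch below shows that mechanism under simple assumptions: the `ReferenceStore` class, the `cluster-session.md` file name, and the skill names are hypothetical, and a temporary directory stands in for the repository.

```python
import tempfile
from pathlib import Path


class ReferenceStore:
    """Reads reference documents at the moment of use, never caching them."""

    def __init__(self, root: Path):
        self.root = root

    def load(self, name: str) -> str:
        return (self.root / name).read_text(encoding="utf-8")


def run_skill(skill_name: str, store: ReferenceStore) -> str:
    """Each skill decides when to open a session; the reference says how."""
    steps = store.load("cluster-session.md")
    return f"[{skill_name}] {steps}"


# A temporary directory stands in for the kit's references folder.
root = Path(tempfile.mkdtemp())
(root / "cluster-session.md").write_text("v1: open session", encoding="utf-8")
store = ReferenceStore(root)
skills = ["onboarding", "platform-debugging", "pod-triage", "cluster-auditing"]

before = [run_skill(s, store) for s in skills]

# One edit to the shared reference...
(root / "cluster-session.md").write_text(
    "v2: open session with audit logging", encoding="utf-8"
)

# ...and every skill that reads it sees the new version on its next run.
after = [run_skill(s, store) for s in skills]
```

No skill was touched between the two runs. The only change was to the shared document, which is the property that makes the reference, not the skill, the reusable layer.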
Better Documentation. Better Agents.
Our first post was about improving the quality of context before a change executes.
This work is the next step: encoding that context into something any engineer can contribute to and benefit from — not as a one-time effort, but as a compounding one. Each skill added makes the next contribution easier and the platform more capable.
The unsolved problem is the feedback loop running in the other direction. When an agent surfaces a failure mode that is not yet in the runbook, the engineer who diagnosed it has to make a deliberate choice to document what they found. Under pressure, that choice often does not get made. The knowledge stays in a thread, and the next engineer hits the same failure.
The key next step is building an agent that not only catches the failure but also proposes the fix to the runbook. That version would make documentation a genuinely live artifact: something that grows when the AI agent encounters something it was not taught, rather than something that quietly decays as the gap between what is written and what is true widens.
