Back to Playbooks
MCPDevOpsPlatform EngineeringAI AgentsDevSecOps

MCP DevOps Governance Rollout Kit

A practical adoption playbook for platform teams (before agent-driven infra changes go sideways)

A practical rollout kit for adopting MCP servers in DevOps safely: server inventory scoring, transport policy (stdio vs streamable HTTP), identity and approval models, SIEM logging schema, pilot use cases by risk tier, and incident runbooks for agent-driven infrastructure mistakes.

22 min read
Verified against MCP / Microsoft / HashiCorp / AWS sources on Feb 25, 2026

Download the Playbook

Use the standalone version for architecture review meetings, governance boards, and internal enablement sessions. The PDF is generated from the Markdown source and both were updated on February 25, 2026.

What This Rollout Kit Gives You

A server inventory and risk-tiering framework for MCP adoption decisions

A transport policy matrix for stdio vs streamable HTTP by action tier

Identity and approval patterns for safe infrastructure operations

A structured SIEM logging schema for MCP tool invocations

Pilot use-case sequencing from read-only to controlled infra changes

Incident runbooks and due-diligence checklists for production readiness

Non-Negotiables

Tier 3 and Tier 4 actions require human approval and a documented execution authority (CI/control plane or break-glass path).

Every MCP server has an owner, version pin, and rollback / kill-switch procedure.

Tool scope is intentionally limited for pilot phases; no “enable everything” defaults.

Auth and authorization are defined before expanding tool coverage.

MCP actions are logged with actor, tool, target, credentials context, approval status, and result.

Production credentials are short-lived and tightly scoped; avoid long-lived static secrets in local configs.

Start Here: Inventory and Risk Tiering

Do not start with prompts. Start with a server inventory. The biggest governance failures happen when teams expose tools before they understand the action surface and credential scope.

MCP server inventory worksheetmarkdown
# MCP server inventory worksheet (fill this before rollout)

| Server | Owner | Hosting | Transport | AuthN/AuthZ | Tool Scope | Risk Tier | Prod Enabled | Logging | Version Pin | Notes |
|---|---|---|---|---|---|---|---|---|---|---|
| terraform-mcp | Platform IaC | local (pilot) | stdio | local cloud creds + role | plan/apply/workspace | Tier 2/3 | No | local -> SIEM | 0.x pinned | apply blocked in prod |
| azure-mcp-server | Platform Cloud | internal VM/App | streamable-http | Entra app + RBAC | inventory/deploy/read ops | Tier 1/2 | Non-prod | central logs | 1.x pinned | per-subscription scopes |
| azure-devops-mcp | DevEx | local | stdio | PAT / Entra (scoped) | Boards/Repos/Pipelines/Wiki | Tier 1/2 | N/A | local -> SIEM | pinned | tool domains scoped |
| billing-mcp | FinOps | internal service | streamable-http | IAM role + read-only | cost/budgets/anomalies | Tier 1 | Yes | central logs | pinned | read-only only |

Risk tiers:
- Tier 1: read-only / no state mutation
- Tier 2: reversible writes (tickets, PR drafts, comments, staging actions)
- Tier 3: infra mutations / deploys / database or cluster changes
- Tier 4: destructive or high-blast-radius operations (delete, destroy, prod failover)

Transport Policy by Action Tier

Stdio: High-trust local workflows

Use stdio when minimizing network exposure matters most, especially for pilot-stage infra operations. Pair it with local logging export, short-lived credentials, and strict policy that production mutations run through CI/control planes.

Streamable HTTP: Shared services

Use streamable HTTP for centrally hosted MCP services only after auth, authorization scopes, network controls, and structured audit logging are in place. Remote hosting simplifies standardization and upgrades, but increases systemic blast radius if misconfigured.

Transport policy matrix (example)markdown
# Transport policy matrix (example)

| Action Tier | Default Transport | Allowed Remote? | Extra Controls Required |
|---|---|---|---|
| Tier 1 (read-only) | stdio or streamable-http | Yes | auth, logging, rate limits |
| Tier 2 (reversible write) | stdio preferred | Yes, after review | authz scopes, owner routing, audit logs |
| Tier 3 (infra mutation) | stdio for pilot; CI executor for prod | Limited | human approval, change record, least privilege, rollback path |
| Tier 4 (destructive/prod critical) | no direct agent execution by default | Exceptional only | break-glass workflow, multi-party approval, recorded session |

Rule of thumb:
- Use stdio when reducing network exposure and keeping credentials local matters most.
- Use streamable HTTP when shared services, central policy, and auditability matter most.
- Never let transport choice be the only control. Transport is one layer, not the policy.

Client Configuration Patterns (Conceptual)

Local pilot config

Use read-only defaults and minimal tool scopes first. Local config is convenient, but becomes a governance problem if every engineer has different scopes and versions.

Local MCP client config (conceptual)json
{
  "mcpServers": {
    "terraform": {
      "command": "terraform-mcp-server",
      "args": ["serve"],
      "env": {
        "TF_MCP_READ_ONLY": "true",
        "TF_MCP_LOG_LEVEL": "info"
      }
    },
    "azure-devops": {
      "command": "azdo-mcp",
      "args": ["serve"],
      "env": {
        "AZDO_ORG_URL": "https://dev.azure.com/your-org",
        "AZDO_TOOL_DOMAINS": "repos,boards,pipelines"
      }
    }
  }
}

# Conceptual local pilot config
# - Prefer read-only defaults first
# - Scope domains/tools per server
# - Export logs to a local file/collector consumed by your SIEM

Remote service config

Centralized servers enable stronger standardization and easier auditing, but only if token handling and authorization boundaries are explicit and tested.

Remote MCP client config (conceptual)json
{
  "mcpServers": {
    "platform-azure": {
      "transport": {
        "type": "streamable-http",
        "url": "https://mcp.platform.example.com/azure/mcp"
      },
      "headers": {
        "Authorization": "Bearer <short-lived-token>"
      }
    },
    "billing": {
      "transport": {
        "type": "streamable-http",
        "url": "https://mcp.platform.example.com/billing/mcp"
      }
    }
  }
}

# Conceptual remote config
# - Use enterprise auth (e.g., Entra/IAM)
# - Prefer short-lived tokens over static secrets
# - Gate Tier 3/4 actions on server and executor side

Identity and Authorization Design

AuthN first

Decide how agents and humans authenticate to the server before exposing any write-capable tools. Microsoft guidance now includes built-in auth and delegated/on-behalf-of patterns in Azure Functions MCP scenarios.

AuthZ by domain

Scope tools by domain (repos, boards, pipelines, subscriptions, namespaces). Avoid broad “platform-admin” roles that make every tool call a privileged action.

Short-lived creds

Prefer short-lived, federated credentials and managed identities. Static long-lived secrets in local MCP configs turn experimentation into a latent incident.

Action Governance and Approval Gates

Tier 1 (Read-only)

Allow broadly after auth + logging are verified.

Use rate limiting to protect backing systems.

Alert on unusual query spikes or broad enumerations.

Tier 2 (Reversible writes)

Require owner scope checks and issue/change references.

Prefer writing to systems with native audit trails (tickets, PRs, comments).

Keep rollback / undo actions documented and tested.

Tier 3 (Infra mutations)

Human approval mandatory.

Execution should happen in CI/control plane, not ad hoc local shells, for production targets.

Require plan/dry-run artifacts and explicit target confirmation before apply.

Tier 4 (Destructive)

Deny by default.

Enable only via break-glass runbook with multi-party approval.

Require recorded steps, incident ticket, and post-incident review.

This policy file is conceptual. Implement the same logic in your control plane, CI workflows, or policy engine. The key is consistent enforcement, not the syntax.

Approval policy pattern (conceptual)yaml
# Approval policy pattern (conceptual)

policies:
  - name: tier1-read-only
    match:
      server: ["billing", "inventory", "docs", "metrics"]
      actionTier: [1]
    effect: allow
    requirements:
      - log_to_siem

  - name: tier2-reversible-writes
    match:
      actionTier: [2]
    effect: allow_with_conditions
    requirements:
      - log_to_siem
      - owner_scope_check
      - change_ticket_or_issue_link

  - name: tier3-infra-mutation
    match:
      actionTier: [3]
    effect: require_human_approval
    requirements:
      - approved_change_request
      - executor_is_ci_or_control_plane
      - plan_or_dry_run_artifact
      - rollback_path_documented
      - log_to_siem

  - name: tier4-destructive
    match:
      actionTier: [4]
    effect: deny_by_default
    exceptions:
      - break_glass_runbook
      - multi_party_approval
      - recorded_session

Logging and SIEM Integration

Minimum viable telemetry

Human identity and agent/client identity

MCP server name/version and transport

Tool name, target system, and risk tier

Credentials/principal context (without leaking secrets)

Approval status and change reference

Result status, duration, and artifact references

Alerting ideas

Tier 3/4 action attempted without approval metadata

MCP server version drift outside approved list

Repeated failed tool invocations against protected targets

Sudden spike in broad inventory or secret-adjacent queries

Local-only servers executing production-scope actions

Structured SIEM event schema (conceptual)json
{
  "timestamp": "2026-02-25T14:21:30Z",
  "eventType": "mcp.tool_invocation",
  "traceId": "01HS...",
  "sessionId": "agent-session-abc123",
  "actor": {
    "type": "human+agent",
    "humanId": "jane.doe@example.com",
    "agentClient": "ide-agent",
    "agentModel": "vendor/model"
  },
  "server": {
    "name": "terraform-mcp",
    "version": "0.9.2",
    "host": "laptop-123",
    "transport": "stdio"
  },
  "tool": {
    "name": "terraform.plan",
    "riskTier": 3,
    "target": "live/staging/services/payments-api"
  },
  "auth": {
    "principal": "arn:aws:iam::123456789012:role/staging-terraform-plan",
    "authType": "federated-short-lived"
  },
  "approval": {
    "required": true,
    "status": "approved",
    "approver": "platform-oncall@example.com",
    "changeRef": "CHG-4821"
  },
  "result": {
    "status": "success",
    "durationMs": 8430,
    "artifactRef": "s3://.../tfplan-4821"
  }
}

# Minimum goal: enough context to answer
# who ran what, against which system, with which credentials, under which approval

Pilot Use Cases by Risk Tier

Tier 1

Read-only cloud and platform inventory

List resources by tag/owner

Query cost/budget status

Retrieve pipeline status

Fetch cluster health summaries

Tier 2

Reversible change assistance

Create incident tickets

Draft PRs for config changes

Rerun failed non-prod pipelines

Annotate alerts with runbook context

Tier 3

Infra mutation with human gate

Terraform plan and CI-mediated apply

Non-prod deploys via approved pipeline

Kubernetes rollout restart in staging

Tier 4

Destructive / high blast radius operations

terraform destroy

Delete production resources

Failover actions

Database schema/data destructive changes

Incident Runbook (Wrong Target / Unsafe Action)

Contain First, Then Recover

MCP incidents combine platform operations risk with agent behavior risk. Treat them like infrastructure incidents with an added evidence-preservation requirement for prompts, tool scopes, and approvals.

Incident response runbookmarkdown
# Runbook: Agent initiated wrong-target or unsafe infrastructure action

1. Contain
- Disable the affected MCP server or revoke its credentials immediately
- Freeze related CI/CD applies/deployments for the impacted domain
- Announce incident channel and single incident commander

2. Preserve evidence
- Collect MCP request/response logs, tool invocation logs, and approvals
- Capture server version, config, and tool scope at incident time
- Preserve cloud/platform audit logs and CI run artifacts

3. Assess impact
- What changed? (resources, configs, tickets, pipelines, secrets references)
- Which environment was affected? (dev/staging/prod)
- Is this reversible with standard workflow, or do you need break-glass?

4. Recover safely
- Prefer forward-fix via reviewed change in CI/control plane
- Use provider-native emergency actions only if incident severity requires it
- Document every manual step and reconcile back to IaC/state after stabilization

5. Prevent recurrence
- Tighten tool scope, credentials, transport policy, or approval requirements
- Add wrong-target checks (env confirmation, resource tags, prod hard-stop)
- Update runbooks and operator training

Vendor / Server Due Diligence Checklist

Use This Before Enabling Any Server for Production Workflows

MCP server due diligence checklistmarkdown
# MCP server due diligence checklist (before production use)

Identity & access
- Does the server support strong authentication for your environment?
- Can you scope authorization by tool/domain/resource?
- Can credentials be short-lived and rotated automatically?

Transport & networking
- Which transports are supported (stdio, streamable HTTP)?
- Is there hardening guidance for remote hosting?
- Can you restrict inbound access (private network, firewall, mTLS, gateway)?

Safety & operations
- Are tool actions clearly documented, including destructive operations?
- Is there dry-run/plan support for risky actions?
- Are logs structured and exportable?
- Are versions pin-able with release notes/changelogs?
- Is failure behavior documented (timeouts/retries/idempotency)?

Governance
- Can you limit exposed tools to a subset for pilot?
- Can you map actions to risk tiers and approval policy?
- Do you have a rollback/kill-switch procedure for this server?

Rollout Phases (Suggested)

Phase 0: Inventory and Policy Baseline

Inventory every MCP server and classify tools by risk tier before rollout.

Define transport policy by tier (stdio vs streamable HTTP).

Define mandatory log fields and where they land (SIEM, retention, alerting).

Decide what is CI/control-plane-only vs what agents may execute directly.

Phase 1: Read-Only Pilot

Start with Tier 1 servers: inventory, docs, metrics, cost read-only workflows.

Validate auth, latency, usefulness, and audit completeness.

Train platform engineers on prompt boundaries and target verification habits.

Phase 2: Reversible Writes with Approval

Enable Tier 2 actions (ticket updates, PR drafts, pipeline reruns in non-prod).

Require owner scoping and change references.

Run tabletop scenarios for wrong target, stale context, and over-broad tool use.

Phase 3: Controlled Infra Mutations

Allow Tier 3 actions only through human approval and CI/control-plane execution.

Require plan/dry-run artifacts and explicit target confirmation.

Track success, rollback rate, and near-misses before broader enablement.

Phase 4: Production Expansion and Continuous Review

Review scope monthly and remove tools that add risk without measurable value.

Pin versions and upgrade on a scheduled cadence with test environments.

Treat MCP governance as part of platform engineering, not a one-time security review.

Weekly Governance Review Template

Measure Safety and Value Together

Teams often measure only speed gains. That is how unsafe patterns get normalized. Review safety, audit completeness, and rollback rates in the same meeting as time saved and adoption growth.

Weekly rollout scorecardmarkdown
# MCP rollout scorecard (weekly)

## Adoption
- Active users this week:
- Active MCP servers:
- Top 5 workflows by volume:

## Safety
- % of tool invocations logged with complete fields:
- Tier 3/4 actions without approval (target = 0):
- Wrong-target or near-miss events:
- Rollbacks triggered:

## Quality
- Useful outcome rate (agent action reduced manual work):
- False positive / hallucinated action proposals:
- Mean time to approved change (with MCP vs baseline):

## Cost / Operations
- MCP server uptime and error rate:
- Token/model spend (if applicable):
- Support tickets caused by MCP tooling:

## Decisions
- Promote to next phase? (yes/no)
- Which server/tool scopes expand next week?
- Which scopes get frozen or rolled back?

Sources

Companion Deep Dive

Read the companion article for the strategic framing, corrected claims from the original hot take, and the argument for treating MCP governance as a platform engineering capability rather than a one-off AI experiment.

Read the Blog Deep Dive