If you’ve run AI agents against real infrastructure — Azure deployments, Terraform plans, network provisioning — you’ve hit the wall. Your agent fires off a long-running CLI command, and the entire main loop goes dark. No responses, no status updates, no way to cancel. Just silence until that network call decides to come back. Or doesn’t.

In my experience, this is the single biggest reliability problem in production AI agent workflows, and it’s why I’ve been pushing hard on the Agent Client Protocol (ACP) inside OpenClaw. ACP doesn’t magically make slow operations fast. What it does is isolate them so they can’t take your agent down with them.

Here’s how it works and how to set it up.

What ACP Actually Is

The Agent Client Protocol is a standard for making coding agents interoperable across different clients — IDEs, editors, orchestrators, whatever. At its core, it standardizes communication between a host application and coding agents using JSON-RPC over stdio. Local agents get launched as subprocesses and talk back through that pipe.
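
To make the transport concrete, here is a minimal sketch of a host launching a child process and exchanging one JSON-RPC 2.0 message over stdin/stdout. It assumes newline-delimited framing. The child is a toy echo stand-in rather than a real ACP agent, and the method name demo/echo and the helper call_over_stdio are hypothetical; the real method names come from the ACP spec.

```python
import json
import subprocess
import sys

# Toy stand-in for an agent: reads one JSON-RPC request per line on stdin and
# answers on stdout. A real ACP agent implements the methods defined by the spec.
CHILD_SOURCE = r'''
import json, sys
for line in sys.stdin:
    req = json.loads(line)
    resp = {"jsonrpc": "2.0", "id": req["id"], "result": {"echo": req["params"]}}
    sys.stdout.write(json.dumps(resp) + "\n")
    sys.stdout.flush()
'''


def call_over_stdio(method, params, request_id=1):
    """Launch the child, send one request, read one response, then shut down."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    request = {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}
    proc.stdin.write(json.dumps(request) + "\n")
    proc.stdin.flush()
    response = json.loads(proc.stdout.readline())
    proc.stdin.close()   # EOF ends the child's read loop
    proc.wait(timeout=5)
    proc.stdout.close()
    return response


if __name__ == "__main__":
    # "demo/echo" is a made-up method used only for this illustration.
    print(call_over_stdio("demo/echo", {"text": "hello"}))
```

The point of the shape is that the agent is an ordinary subprocess: the host owns the pipe and the process handle, so it can stop reading, close the pipe, or kill the child at any time.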

That’s the spec-level description. Here’s what matters for us: ACP gives OpenClaw a way to run external coding harnesses — Claude Code, Codex, Gemini CLI — as supervised child processes instead of doing everything inline in the main agent loop.

OpenClaw uses ACP in two directions:

  1. ACP sessions inside OpenClaw — Run external harnesses via the acpx backend plugin. OpenClaw supervises the process, manages its lifecycle, and can kill it if things go sideways.
  2. The OpenClaw ACP bridge (openclaw acp) — A CLI command that speaks ACP over stdio so IDEs can forward prompts to an OpenClaw Gateway session.

For this article, we care about the first one.

The Core Problem: Your Agent Loop Is Single-Threaded

Here’s the failure mode I see constantly in production. The OpenClaw main agent turn directly invokes a long-running Azure networking call — say, provisioning a VNet peering or waiting on a Terraform apply. That agent loop is serialized. One operation at a time. While it’s waiting on that Azure API to respond, the entire agent is unresponsive.

No chat responses. No ability to check on other work. No cancel button. If that API call hangs for 10 minutes, your agent hangs for 10 minutes. If it hangs forever — and Azure networking calls absolutely can — your agent hangs forever.

The temptation is to just increase timeouts. That’s not a solution; it’s a prayer with a deadline.

How ACP Fixes This

When people say “ACP prevents OpenClaw from hanging,” what they mean is: run the risky, slow, failure-prone work in an external harness process that OpenClaw supervises, rather than doing it inline.

The acpx backend spawns work into supervised child processes. The runtime handles transport, queueing, cancellation, and reconnection. Even if Claude Code gets stuck waiting on an Azure call that will never return, that stuck work lives in a different process. OpenClaw’s main loop stays responsive. You can:

  • Cancel the stuck session
  • Close it and let the TTL reaper clean up
  • Spawn a new session to do something else
  • Keep chatting with the agent while background work runs

This is process isolation applied to AI agent orchestration. It’s not novel computer science; it’s the same reason web servers hand blocking work to worker processes instead of running it on the loop that accepts requests. But it’s surprisingly absent from most agent frameworks.
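
The idea can be sketched in a few lines of Python. This is not how acpx is implemented; it only illustrates the principle that a supervisor holding a process handle can check on or kill stuck work without ever blocking on it. The class name SupervisedTask and its methods are hypothetical.

```python
import subprocess
import sys


class SupervisedTask:
    """Runs a command in a child process so the caller never blocks on it."""

    def __init__(self, argv):
        self.proc = subprocess.Popen(argv)

    def status(self):
        code = self.proc.poll()   # non-blocking check
        return "running" if code is None else f"exited({code})"

    def cancel(self, grace_seconds=2.0):
        """Ask the child to stop, then force-kill it if it ignores the request."""
        if self.proc.poll() is None:
            self.proc.terminate()
            try:
                self.proc.wait(timeout=grace_seconds)
            except subprocess.TimeoutExpired:
                self.proc.kill()
                self.proc.wait()
        return self.status()


if __name__ == "__main__":
    # Stand-in for a call that never returns, such as a hung network wait.
    stuck = SupervisedTask([sys.executable, "-c", "import time; time.sleep(3600)"])
    print(stuck.status())   # the supervisor is free to do other work here
    print(stuck.cancel())   # the stuck work dies; the supervisor does not
```

Had the same sleep run inline, the caller would have been blocked for the full hour with no way to interrupt it.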

Setting It Up

Prerequisites

Claude Code needs to be installed independently. On Windows, use the PowerShell install script; on Linux/macOS, use the curl-based install script. The first run will prompt you for authentication, so get that sorted before you try to wire it into ACP.

Install the acpx Plugin

openclaw plugins install acpx
openclaw config set plugins.entries.acpx.enabled true

Verify everything is wired up:

/acp doctor

This runs a backend health check. If it fails, you’re usually looking at a missing Claude Code install or an auth issue.

Configure ACP

The key config fields:

openclaw config set acp.enabled true
openclaw config set acp.backend "acpx"
openclaw config set acp.defaultAgent "claude-code"
openclaw config set acp.maxConcurrentSessions 8

You can also set acp.allowedAgents to restrict which harnesses are available, and runtime.ttlMinutes to auto-reap sessions that outlive their usefulness.

Handle Permissions (This Will Bite You)

ACP sessions run non-interactively — there’s no TTY. This means the default permission model (approve reads, fail on writes) will throw AcpRuntimeError the moment the harness tries to write a file or execute a command.

You need to configure permissionMode and nonInteractivePermissions to match your security posture. For infrastructure automation, you’ll typically need write and exec permissions enabled. Be deliberate about this — ACP sessions run on the host runtime, not in a sandbox.

Day-to-Day Usage

Slash Commands

The operator workflow is straightforward:

  • /acp spawn — Start a new ACP session
  • /acp status — Check running sessions
  • /acp timeout <seconds> — Set session timeout
  • /acp steer — Send additional instructions to a running session
  • /acp cancel — Kill a stuck session
  • /acp close — Clean shutdown

Programmatic Spawning

For automated workflows, use sessions_spawn with runtime: "acp". The agentId parameter picks which harness to use, and mode can be either run (one-shot, execute and return) or session (persistent, keeps the harness alive for follow-up work).

Thread binding works too — Discord threads and Telegram forum topics can be tied to specific ACP sessions, so your team gets isolated conversation contexts per task.

Non-Blocking Patterns

ACP sessions are one tool in a broader toolkit. Here are the four patterns I use for keeping agent workflows responsive:

Option A: Exec Backgrounding. The exec tool supports a timeout parameter (default 1800 seconds) and automatic backgrounding via yieldMs. Fire off a command, let it background after a few seconds, and poll for results. Simple, effective for known-duration operations.
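
As a rough illustration of the yield-then-poll idea (not OpenClaw’s exec tool itself), the sketch below waits briefly for a command and hands back a handle if it is still running. The helper names run_with_yield and poll are hypothetical, and the hard timeout is left out for brevity.

```python
import subprocess
import sys


def run_with_yield(argv, yield_seconds):
    """Wait briefly for a command; if it is still running, hand back a handle."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, text=True)
    try:
        out, _ = proc.communicate(timeout=yield_seconds)
    except subprocess.TimeoutExpired:
        return {"state": "backgrounded", "proc": proc}
    return {"state": "done", "code": proc.returncode, "output": out}


def poll(handle, wait_seconds=0.1):
    """Check a backgrounded command; returns None while it is still running."""
    proc = handle["proc"]
    try:
        # Retrying communicate() after a timeout does not lose earlier output.
        out, _ = proc.communicate(timeout=wait_seconds)
    except subprocess.TimeoutExpired:
        return None
    return {"state": "done", "code": proc.returncode, "output": out}


if __name__ == "__main__":
    slow_cmd = [sys.executable, "-c", "import time; time.sleep(1); print('finished')"]
    handle = run_with_yield(slow_cmd, yield_seconds=0.2)
    print(handle["state"])
    result = None
    while result is None:    # a real caller would do other work between polls
        result = poll(handle)
    print(result["output"].strip())
```

A production version would also enforce the hard timeout and kill the process once it is exceeded, so a backgrounded command cannot linger indefinitely.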

Option B: ACP Sessions. For complex multi-step work that might involve tool use, file operations, and decision-making — not just a single shell command — ACP sessions give you full harness isolation. The harness can think, act, and get stuck without affecting the main loop.

Option C: Cron Reconciliation. Set up OpenClaw cron jobs to periodically check infrastructure state. Instead of waiting for a deployment to complete, schedule a reconcile check every few minutes. The agent wakes up, checks status, and either acts or goes back to sleep.
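
The check that runs on each tick can be a small, stateless comparison. Here is a minimal sketch, assuming resource states are plain strings; the state names ("provisioning", "updating", "connected") and the reconcile function are hypothetical and would map onto whatever your infrastructure actually reports.

```python
def reconcile(desired, observed):
    """Compare desired and observed resource states and return actions to take."""
    actions = []
    for name, want in desired.items():
        have = observed.get(name, "missing")
        if have == want:
            continue                         # converged, nothing to do
        if have in ("provisioning", "updating"):
            actions.append((name, "wait"))   # still in flight; check next tick
        else:
            actions.append((name, "apply"))  # drifted or missing; act now
    return actions


if __name__ == "__main__":
    desired = {"vnet-peering": "connected", "subnet": "ready"}
    observed = {"vnet-peering": "provisioning"}
    print(reconcile(desired, observed))
```

Because every tick starts from observed state rather than from a held-open call, a missed or slow tick costs a few minutes of latency instead of a hung agent.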

Option D: Webhooks. For event-driven workflows, use external cron or infrastructure callbacks hitting OpenClaw’s webhook endpoint (/hooks/agent). When your Terraform apply completes or your Azure deployment finishes, fire a webhook to wake the agent. No polling, no hanging.

In practice, I combine all four: ACP sessions for the heavy lifting, exec backgrounding for quick commands, cron for reconciliation, and webhooks for state-change notifications.

Troubleshooting

Permission Failures

If you see AcpRuntimeError immediately after spawning a session, it’s almost always the non-interactive permission defaults. Check your permissionMode and nonInteractivePermissions settings.

Zombie Sessions

Sessions that complete their work but don’t properly close will eat your concurrency slots. Monitor with /acp status and set runtime.ttlMinutes to auto-reap stale sessions. If you’re hitting acp.maxConcurrentSessions limits, check for zombies first.
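
To show what a TTL policy amounts to, here is a minimal sketch of the selection step. It is an illustration of the semantics, not OpenClaw’s reaper; the Session type, its fields, and stale_sessions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Session:
    session_id: str
    last_activity: float   # seconds since the epoch


def stale_sessions(sessions, now, ttl_minutes):
    """Return the IDs of sessions idle for longer than the TTL."""
    cutoff = now - ttl_minutes * 60
    return [s.session_id for s in sessions if s.last_activity < cutoff]


if __name__ == "__main__":
    sessions = [Session("deploy-vnet", 0.0), Session("plan-review", 3500.0)]
    # With a 30-minute TTL evaluated at t=3600s, only the first session is stale.
    print(stale_sessions(sessions, now=3600.0, ttl_minutes=30))
```

Basing staleness on last activity rather than start time keeps a long but healthy session from being reaped mid-task.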

Distinguishing Hang Types

Not all hangs are the same, and the fix depends on which kind you’re dealing with:

  1. Main loop blocked on shell — The agent itself is stuck waiting on an inline exec call. This is the problem ACP solves. Move the work to a session.
  2. ACP session stalls after completion — The harness finished but the session didn’t close cleanly. Use /acp close or let TTL handle it.
  3. Harness genuinely hung on network — Claude Code itself is stuck waiting on an external API. Use /acp cancel to kill the session, then investigate the underlying network issue.

Claude Code Install and Auth

This catches people more often than you’d think. Claude Code is a real dependency — if it’s not installed, not authenticated, or the auth token expired, ACP sessions will fail at spawn time. Run /acp doctor first. Always.

Security Considerations

ACP sessions run on the host runtime, not in a sandbox. This is by design — they need access to your actual infrastructure tooling. But it means you need to think about security boundaries.

For webhooks, keep endpoints on loopback or behind a Tailnet/proxy. Use dedicated tokens for webhook auth, and restrict agent and session routing so external callers can’t spawn arbitrary work.
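
If you put your own proxy in front of the webhook endpoint, the token check itself is worth getting right. The sketch below assumes a Bearer-style Authorization header, which is an assumption for illustration and not a statement about how OpenClaw authenticates hooks; is_authorized is a hypothetical helper.

```python
import hmac


def is_authorized(header_value, expected_token):
    """Check an Authorization header against a dedicated webhook token."""
    prefix = "Bearer "
    if not header_value or not header_value.startswith(prefix):
        return False
    presented = header_value[len(prefix):]
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(presented.encode(), expected_token.encode())


if __name__ == "__main__":
    print(is_authorized("Bearer s3cret", "s3cret"))
    print(is_authorized("Bearer guess", "s3cret"))
```

Use a token that exists only for this webhook, so rotating it after a leak doesn’t disturb anything else.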

The Bottom Line

Agent hangs aren’t a minor inconvenience — they’re a reliability problem that makes AI automation untrustworthy for production infrastructure work. ACP gives you process isolation, lifecycle control, and timeout management for the operations most likely to hang.

It’s not complicated to set up. It’s not magic. It’s good engineering practice applied to agent orchestration: don’t let one slow operation take down your whole system.

If you’re building AI agent workflows for infrastructure automation and running into reliability issues, let’s talk. This is exactly the kind of production hardening work we do at Big Hat Group.