AI harness engineering

Guardrails and memory for your AI coding agent.

Your AI coding agent will skip the test, leak the secret, and forget yesterday. That's not a prompting issue - it's a harness problem. goat-flow is the opinionated harness for teams shipping with Claude Code, Codex, Gemini CLI, and Copilot CLI - not just demoing them.

$ npx @blundergoat/goat-flow@latest dashboard

Terminal output showing goat-flow audit results: 17 of 17 build checks passing, harness scores for Claude Code (94%), Codex (91%), Gemini CLI (87%), and Copilot CLI (85%), plus five-concern coverage for context, constraints, verification, recovery, and feedback loop.

Why harness engineering?

Agents need better control systems.

Files it can read. Commands it can run. Rules it must obey. Memory it keeps across sessions. That's the harness - and it matters more than which model you pick. goat-flow gives you one, opinionated, out of the box.

Supports Claude Code, Codex, Gemini CLI, and Copilot CLI

The system

Four pieces. One harness.

Audit tells you what's missing. Skills give the agent workflows. Hooks stop dangerous actions. The learning loop remembers what happened.

01 / Audit

Pass/fail checks, no wiggle room

Validates every file, skill, and hook the agent needs. Either it's installed or it isn't. Scores each agent's harness completeness across the five concerns.

goat-flow audit --harness
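To make the pass/fail idea concrete, here is a minimal sketch of an existence check. It is not goat-flow's implementation: the function names `audit_paths` and `summarize` are hypothetical, and treating the directories named on this page as the checklist is an assumption.

```python
from pathlib import Path

# Assumed checklist: these directory names appear on this page, but using them
# as the audit's required paths is an illustration, not the real rule set.
REQUIRED_PATHS = (
    ".goat-flow/hooks",
    ".goat-flow/lessons",
    ".goat-flow/footguns",
    ".goat-flow/decisions",
    ".goat-flow/logs/sessions",
)


def audit_paths(root, required=REQUIRED_PATHS):
    """Return a pass/fail result for each required path under root."""
    root = Path(root)
    return {rel: (root / rel).exists() for rel in required}


def summarize(results):
    """Format results in the 'N of M checks passing' style."""
    passed = sum(results.values())
    return f"{passed} of {len(results)} checks passing"
```

Each check is a plain boolean, which is what keeps the result free of wiggle room: a path exists or it does not.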

02 / Skills

Structured slash commands

Seven workflows with defined phases, named artefacts, and stopping points. Debug, plan, review, critique, security, QA - plus a dispatcher that routes your intent to the right skill.

/goat, /goat-debug, /goat-plan...

03 / Hooks

Safety nets that can't be skipped

Pre-action guards block dangerous commands before they run. Post-action guards catch silent breakage after. deny-dangerous ships by default, blocking destructive filesystem commands, all git push, secret exfiltration, and risky subshells.

.goat-flow/hooks/

04 / Learning loop

Persistent memory across sessions

Footguns, lessons, decisions, session logs. Every mistake becomes next session's context. The compounding bet: every session that hits a problem makes the next one harder to trip up.

.goat-flow/lessons, /footguns, /decisions

Under the hood

The execution loop

Every agent action follows four steps. Each one prevents a specific failure mode that free-running agents reliably hit.

READ

Load the files first

Pull in the actual code before reasoning about it.

Prevents fabrication - inventing APIs that don't exist.

SCOPE

Declare what changes

List files that will be touched, and files that won't.

Prevents surprise blast radius - changing files nobody agreed to.

ACT

Make the change

Edit only within the declared scope. Nothing else.

Prevents drift - refactoring that seemed related while the agent was in there.

VERIFY

Prove it works

Run linters, re-read changed files, confirm nothing drifted.

Prevents silent breakage - passing the task but breaking the build.
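The SCOPE and VERIFY steps can be pictured as a simple comparison: the files that changed against the files that were declared. The sketch below is an assumption about how such a check could look, not how goat-flow does it. The function name `out_of_scope` and the glob-style scope patterns are hypothetical.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_files, declared_scope):
    """Return changed files that match none of the declared scope patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in declared_scope)
    ]


# Hypothetical example: the agent declared the auth module and its test,
# then also edited a billing file.
declared = ["src/auth/*.py", "tests/test_auth.py"]
changed = ["src/auth/login.py", "src/billing/invoice.py"]
print(out_of_scope(changed, declared))  # ['src/billing/invoice.py']
```

An empty result means the change stayed inside the declared blast radius. Anything else is drift worth flagging before the task is marked done.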

Seven skills

Workflows, not suggestions.

Free-form prompting is how agents get lost. Skills are structured slash commands with defined phases and clear stopping points. Use /goat as the default entry point and it routes to the right one.

/goat-debug (Debug): Diagnose bugs without accidentally rewriting the codebase end to end.
/goat-plan (Plan): Plan features, refactors, and milestones. Scales from hotfix to system change.
/goat-review (Review): Review diffs and verify what shouldn't be there, not just what should.
/goat-critique (Critique): Surface blind spots from multiple angles before shipping.
/goat-security (Security): Threat model, dependency audit, and compliance checks.
/goat-qa (QA): Generate test plans with automated, AI-verified, and manual steps.

Hooks

Block dangerous actions before they run.

A system prompt is a suggestion. A hardcoded boundary is a rule. Hooks enforce boundaries at a layer the model cannot talk its way past.

Ships with sensible defaults

deny-dangerous catches the patterns agents hit most often when they go off-script: destructive filesystem commands, all git push, secret file reads, subshell escapes, and database truncation.

Extend with your own

Drop linters, format-on-save, custom validators, or project-specific rules into the hooks directory. They register automatically and run in parallel with the defaults.

deny-dangerous (pre-action)
✗ rm -rf: destructive
✗ git push: all pushes blocked
✗ cat .env: secret read
✗ curl | sh: exfiltration
✗ eval, bash -c: subshell escape
✗ DROP TABLE: data loss
✗ > file: truncation
✗ $(...): recursive sub
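A pre-action guard of this kind can be sketched as a list of patterns checked before a command runs. The real deny-dangerous rules are not shown on this page, so everything here is an assumption: the regular expressions, the `DENY_PATTERNS` table, and the `check_command` function are hypothetical. Only the block reasons are taken from the list above.

```python
import re

# Illustrative patterns only. They cover a subset of the listed cases and are
# deliberately simple; a production guard would need far more careful parsing.
DENY_PATTERNS = [
    (re.compile(r"\brm\s+-[a-zA-Z]*(r[a-zA-Z]*f|f[a-zA-Z]*r)"), "destructive"),
    (re.compile(r"\bgit\s+push\b"), "all pushes blocked"),
    (re.compile(r"\bcat\s+\S*\.env\b"), "secret read"),
    (re.compile(r"\bcurl\b.*\|\s*(sh|bash)\b"), "exfiltration"),
    (re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE), "data loss"),
]


def check_command(command):
    """Return the reason a command is blocked, or None if no pattern matches."""
    for pattern, reason in DENY_PATTERNS:
        if pattern.search(command):
            return reason
    return None


print(check_command("git push origin main"))  # all pushes blocked
print(check_command("git status"))            # None
```

The point of the design is that the decision is made by deterministic code outside the model, so no amount of persuasive reasoning in the prompt changes the outcome.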

Learning loop

The harness gets smarter every session.

Agents repeat mistakes for two reasons: nothing remembers the mistake, and nothing stops it from happening again. The learning loop fixes both.

Footguns

Architectural traps captured with semantic-anchor evidence. Stops the agent from hitting the same code landmine twice.

.goat-flow/footguns/

Lessons

Behavioural mistakes the agent made - logged so the same error pattern is recognised and avoided next time.

.goat-flow/lessons/

Decisions

Architecture Decision Records. Captures why a choice was made so future agents don't quietly reverse it.

.goat-flow/decisions/

Session logs

End-of-session summaries provide continuity between work sessions - across agents, across days, across context compactions.

.goat-flow/logs/sessions/
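As a rough picture of "every mistake becomes next session's context", the sketch below writes a lesson to disk and reads all lessons back. The file naming, the file contents, and the functions `record_lesson` and `load_lessons` are all assumptions made for illustration. Only the `.goat-flow/lessons/` directory comes from this page.

```python
import re
from datetime import date
from pathlib import Path


def record_lesson(root, title, what_happened, rule, today=None):
    """Write one lesson as a small text file and return its path."""
    today = today or date.today()
    # Build a filesystem-friendly slug from the title (hypothetical convention).
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    lessons_dir = Path(root) / ".goat-flow" / "lessons"
    lessons_dir.mkdir(parents=True, exist_ok=True)
    path = lessons_dir / f"{today.isoformat()}-{slug}.md"
    path.write_text(
        f"{title}\n\nWhat happened: {what_happened}\nRule for next time: {rule}\n",
        encoding="utf-8",
    )
    return path


def load_lessons(root):
    """Return the text of every recorded lesson, sorted by filename."""
    lessons_dir = Path(root) / ".goat-flow" / "lessons"
    return [p.read_text(encoding="utf-8") for p in sorted(lessons_dir.glob("*.md"))]
```

A session would call something like `load_lessons` at startup and `record_lesson` whenever a mistake is caught, so the record outlives any single context window.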

The framework

The five concerns of AI harness engineering.

These five concerns are the common ground across the public harness engineering literature. goat-flow scores every installed harness against them.

Context Give the agent a map, not a 1,000-page manual. Concise instructions, the right files, progress notes across sessions.
Constraints Deterministic rules that steer before the agent acts. Linters, deny-hooks, permissions, required sections.
Verification Structural checks the agent runs to prove its own work. Tests, typecheck, post-action hooks, back-pressure.
Recovery Session durability and restart paths. Checkpoint and resume, compaction handlers, milestone checkboxes, loop detection.
Feedback loop Capture every mistake as persistent context so the next session doesn't repeat it. Footguns, lessons, decisions, logs.

Sources: Mitchell Hashimoto, Birgitta Böckeler (martinfowler.com), Anthropic engineering, and HumanLayer. goat-flow synthesises these into a working system with strong defaults.
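One way to think about scoring against the five concerns is as a pass rate per concern. How goat-flow actually weights or computes its scores is not described here, so the sketch below is an assumption: `score_by_concern` is a hypothetical function, and each check is assumed to count equally.

```python
from collections import defaultdict

CONCERNS = ("context", "constraints", "verification", "recovery", "feedback loop")


def score_by_concern(checks):
    """Take (concern, passed) pairs and return the percent passed per concern.

    Concerns with no checks score None rather than 0, so "not measured"
    is not confused with "failing".
    """
    totals = defaultdict(lambda: [0, 0])  # concern -> [passed, total]
    for concern, passed in checks:
        totals[concern][0] += int(passed)
        totals[concern][1] += 1
    scores = {}
    for concern in CONCERNS:
        passed, total = totals[concern]
        scores[concern] = round(100 * passed / total) if total else None
    return scores
```

Reporting per concern rather than as one number shows where a harness is thin, for example strong constraints but no recovery path.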

Get started

From zero to passing audit in two commands.

Set up on any project, verify the harness, then start running skills through your agent of choice.

1. npx @blundergoat/goat-flow@latest dashboard
2. npx @blundergoat/goat-flow@latest audit --harness

Supports Claude Code, Codex, Gemini CLI, and Copilot CLI. Read the CLI docs →