QA for Claude Code: Reviewing Skills, MCPs & Agent Output
If you ship Claude Code or Cowork implementations — skills, MCP servers, sub-agents, plugins — to a team or a client, you run into a problem traditional software QA doesn't have: the behavior isn't fully deterministic, and it drifts. A skill that worked beautifully last month can quietly degrade when a model, a prompt, or a connector changes. So "does it still work?" isn't a one-time test — it's an ongoing question. This page covers how to QA agent output in practice.
Why QA-ing agent output is different
- Non-deterministic output. The same prompt can produce different paths. Pass/fail unit tests don't capture "was this a good answer?"
- It drifts over time. Skill versions change, MCP servers update, and the underlying model evolves. A green checkmark today doesn't guarantee a green checkmark tomorrow.
- The interesting failures are mid-chain. Agent problems usually aren't a single bad call — they're a wrong turn three steps in, in a multi-step chain you can only see if you can review the whole session.
- You often have to prove it to someone else. A client or stakeholder wants evidence the deliverable works, not your word for it.
This is why QA for Claude Code is less about test suites and more about reviewable sessions plus a quality signal.
A practical QA loop for Claude Code & Cowork
- Capture every session as a structured, replayable record — not just final output, but each turn, tool call, MCP invocation, and skill run.
- Review the session, not just the result. Step through what the agent did so you can see where quality came from or broke down.
- Rate and annotate. Mark turns or sessions good/wrong, leave a reason, and tag the ones worth revisiting. This builds a feedback record over time.
- Track skills, MCPs, and versions. Know which skill version and which connector produced a given result, so you can spot regressions when something changes.
- Prove it still works. Keep evidence you can show a client: here's a recent session, here's exactly what the skill did, here's that it's still healthy.
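As a rough illustration of the loop above, here is a minimal sketch of what a captured, rateable session record could look like. Every name in it (`Session`, `Turn`, `ToolCall`, `rate_turn`, the skill name, and the version hash) is hypothetical and chosen for the example; it is not Argus's or Claude Code's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToolCall:
    name: str        # tool or MCP tool name (hypothetical example values below)
    request: dict    # arguments sent to the tool
    response: dict   # what the tool returned


@dataclass
class Turn:
    index: int
    prompt: str
    response: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    rating: Optional[str] = None   # "good" or "wrong"; None means unreviewed
    reason: str = ""


@dataclass
class Session:
    session_id: str
    skill_versions: dict[str, str]  # skill name -> version hash that ran
    turns: list[Turn] = field(default_factory=list)
    tags: set[str] = field(default_factory=set)

    def rate_turn(self, index: int, rating: str, reason: str = "") -> None:
        """Attach a good/wrong rating and a reason to one turn."""
        if rating not in ("good", "wrong"):
            raise ValueError("rating must be 'good' or 'wrong'")
        turn = self.turns[index]
        turn.rating = rating
        turn.reason = reason

    def wrong_turns(self) -> list[Turn]:
        """Return the turns a reviewer marked as wrong."""
        return [t for t in self.turns if t.rating == "wrong"]


# Usage: capture one turn, then rate it and tag the session for follow-up.
session = Session(session_id="s-001", skill_versions={"invoice-summary": "a1b2c3"})
session.turns.append(
    Turn(
        index=0,
        prompt="Summarize invoice 17",
        response="Invoice 17 is from Acme.",
        tool_calls=[ToolCall("billing.get_invoice", {"id": 17}, {"total": 120})],
    )
)
session.rate_turn(0, "wrong", reason="Summary omitted the total")
session.tags.add("revisit")
```

Keeping the skill version on the session, next to the ratings, is what lets a later regression be traced back to a specific version change.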
How Argus supports Claude Code QA
Argus is built for exactly this loop. A lightweight plugin captures each Claude Code or Cowork session into your own workspace, where you can:
- Replay sessions turn by turn, with tool and MCP calls resolved to real names and attached to the right step.
- Rate and annotate quality directly on a session or a specific turn.
- Catalog skills, MCP servers, and plugins that ran — and track versions, so a regression in a skill or connector is visible instead of silent.
- Keep everything private and controlled — encrypted, workspace-scoped, visible only to invited reviewers, with redaction and per-session opt-out.
For agencies and forward-deployed engineers, that turns "trust me, it works" into a replayable, rateable record you can stand behind — and revisit the next time something upstream changes.
FAQ
What does QA mean for Claude Code or Claude Cowork sessions?
Reviewing what the agent actually produced, turn by turn, against the implicit standard the user expected. Did the skill fire? Did it produce the right shape? Did the user accept it on the first turn or did they have to re-ask? QA at this layer is qualitative — it's the work no automated metric captures.
How do I QA a skill across many Claude Cowork sessions at once?
Read across them. Pick a single skill, pull every session where it fired, and look at the first-turn acceptance rate over the last 30 days. If it's drifting, compare the skill's version hashes from before and during the decline to see which change lines up with it. Argus does this for you; without a tool it's a manual SQL session.
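For a sense of what the manual version involves, here is a minimal sketch that groups one skill's sessions by version hash and computes the first-turn acceptance rate over a 30-day window. The record fields (`skill`, `skill_version`, `started_at`, `first_turn_accepted`) are assumptions made for the example, not an Argus export format, and "accepted" is taken to mean the user did not have to re-ask.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def first_turn_acceptance(sessions, skill, now, window_days=30):
    """Return {version_hash: (accepted, total, rate)} for one skill."""
    cutoff = now - timedelta(days=window_days)
    counts = defaultdict(lambda: [0, 0])  # version -> [accepted, total]
    for s in sessions:
        # Skip other skills and sessions older than the window.
        if s["skill"] != skill or s["started_at"] < cutoff:
            continue
        bucket = counts[s["skill_version"]]
        bucket[1] += 1
        if s["first_turn_accepted"]:
            bucket[0] += 1
    return {v: (a, t, a / t) for v, (a, t) in counts.items()}


# Hypothetical sample data: two versions of the same skill.
now = datetime(2025, 6, 30)
sessions = [
    {"skill": "invoice-summary", "skill_version": "a1b2c3",
     "started_at": datetime(2025, 6, 5), "first_turn_accepted": True},
    {"skill": "invoice-summary", "skill_version": "a1b2c3",
     "started_at": datetime(2025, 6, 8), "first_turn_accepted": True},
    {"skill": "invoice-summary", "skill_version": "d4e5f6",
     "started_at": datetime(2025, 6, 20), "first_turn_accepted": False},
    {"skill": "invoice-summary", "skill_version": "d4e5f6",
     "started_at": datetime(2025, 6, 22), "first_turn_accepted": True},
    {"skill": "invoice-summary", "skill_version": "a1b2c3",
     "started_at": datetime(2025, 4, 1), "first_turn_accepted": False},
]

rates = first_turn_acceptance(sessions, "invoice-summary", now)
for version, (accepted, total, rate) in sorted(rates.items()):
    print(f"{version}: {accepted}/{total} accepted ({rate:.0%})")
```

A drop that coincides with a new version hash points at the skill change itself; a drop within a single version suggests the model or a connector changed instead.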
Can I review what an MCP server actually returned during a session?
Yes, if you have session capture. The raw MCP request and response are visible in Argus's replay view, redacted according to your workspace's privacy settings. Without capture you typically only have the OpenTelemetry (OTel) trace, which records that a call happened but usually not its content.
What's the difference between rating a turn and rating a session?
Turn-level QA marks one specific assistant response as good or wrong. Session-level QA is a summary judgement of the whole interaction. Both feed the per-skill rollups in Argus — turn ratings are the precise signal, session ratings the rough overall read.
Does QA for AI sessions need a human in the loop?
For now, yes. LLM-as-judge can flag obvious failures but misses the "was the user happy" signal that's the whole point of QA. The right design is: human ratings as ground truth, automated checks as a cheap pre-filter that prioritizes which sessions a human should look at.
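One way to picture that pre-filter is a small scoring function that orders unrated sessions for human review. This is only a sketch: the session fields (`tool_errors`, `user_reasks`, `skill_fired`, `judge_flagged`, `human_rating`) and the weights are assumptions chosen for illustration, not a recommended calibration or an Argus feature.

```python
def review_priority(session):
    """Score a session; a higher score means a human should look sooner."""
    score = 0
    if session["tool_errors"] > 0:
        score += 3  # a failed tool or MCP call is a strong signal
    if session["user_reasks"] > 0:
        score += 2  # the user had to repeat or rephrase the request
    if not session["skill_fired"]:
        score += 2  # the expected skill never ran
    if session["judge_flagged"]:
        score += 1  # LLM-as-judge flag, treated as weak evidence only
    return score


def review_queue(sessions, limit=10):
    """Return unrated sessions ordered by descending priority."""
    unrated = [s for s in sessions if s.get("human_rating") is None]
    ranked = sorted(unrated, key=review_priority, reverse=True)
    return ranked[:limit]


# Hypothetical sessions: "c" is already rated, so it is left out of the queue.
sessions_to_triage = [
    {"id": "a", "tool_errors": 0, "user_reasks": 0, "skill_fired": True,
     "judge_flagged": False, "human_rating": None},
    {"id": "b", "tool_errors": 1, "user_reasks": 1, "skill_fired": True,
     "judge_flagged": False, "human_rating": None},
    {"id": "c", "tool_errors": 2, "user_reasks": 0, "skill_fired": True,
     "judge_flagged": True, "human_rating": "wrong"},
    {"id": "d", "tool_errors": 0, "user_reasks": 0, "skill_fired": False,
     "judge_flagged": True, "human_rating": None},
]

queue = review_queue(sessions_to_triage)
print([s["id"] for s in queue])
```

The automated score only decides the order of the queue; the rating that counts as ground truth still comes from the human reviewer.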