
## What we kept, dropped, and invented

| From | Pattern | Our adoption |
| --- | --- | --- |
| Superpowers | TDD iron law | Validator HARD RULE: behavioural assertions need a failing-then-passing test |
| gstack | Boil the lake | Added to debugger / ui-qa / curator / product / architect prompts |
| claudecode-orchestrator | Quality through truth | Validator EVIDENCE RULE: quote source output to claim PASS |
| claudecode-orchestrator | Service smoke-test | `bin/service-smoke-test.sh` + `smokeTest.onDone`/`onFeaturePass` |
| Hermes | Named checkpoints | `CHECKPOINT <name>` decision verb + `triggerOn:` post-planner auto-fire |
| Composio | Git worktree per agent | `branchIsolation.useWorktrees` + `worktree_*` fns |
| Conductor | Two-mode parallelism | `parallelWorkers.mode: "lane"\|"competition"` |
| Agent Swarm | Per-agent IDENTITY.md | `~/autonomous-harness/identity/<role>.md`, cross-mission, append-only |
| MOLTRON | Self-evolving learnings | Worker TRICK: convention promoted by curator |
| Ralph Loop | 4-layer memory | L0 runtime / L1 raw / L2 summary / L3 MEMORY / L4 identity |
| GSD | Per-phase orchestrators with state-to-disk | Fresh context per role per iteration |
| Agentwise | Real-time dashboard | Next.js `/harness` UI |
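
Several of these adoptions reduce to small shell helpers. As one illustration, the worktree-per-agent row can be sketched like this (the function bodies, branch naming, and paths below are our guesses, not the harness's actual `worktree_*` implementations):

```shell
# Sketch of worktree-per-agent isolation: one worktree on its own branch
# per (role, feature) pair, so parallel agents never share a checkout.

worktree_create() {  # worktree_create <repo> <role> <fid> -> prints worktree dir
  local repo="$1" role="$2" fid="$3"
  local dir="$repo/.worktrees/$role-$fid"
  git -C "$repo" worktree add -b "agent/$role/$fid" "$dir" >/dev/null 2>&1
  echo "$dir"
}

worktree_remove() {  # tear down once the agent's branch is merged or abandoned
  local repo="$1" dir="$2"
  git -C "$repo" worktree remove --force "$dir"
}
```

Each agent then runs with `cd` into its own directory, and branch cleanup is a plain `git worktree remove`.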
| From | Pattern | Why rejected |
| --- | --- | --- |
| OpenSwarm | Linear/Notion task source | User explicit: no external task source |
| OpenSwarm | LanceDB vector memory | Overkill at our scale |
| Claude-Swarm | Tmux-based agent messaging | File-based is simpler |
| Agentwise | Discord/Slack control | UI-coupled |
| MOLTRON | Workers rewrite their own prompts | Audit / determinism preferred |
| AutoGPT | Infinite loop, no decision verbs | Drift catastrophe |
| LangChain | Heavyweight Python runtime | Bash + claude won |
| All “skill packs” | Human as orchestrator | We needed an autonomous loop |
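
The tmux rejection is easiest to see in code: file-based messaging is just a directory of timestamped files, inspectable with `ls` and replayable after a crash. The mailbox layout and message schema below are illustrative, not the harness's actual format:

```shell
# Sketch of file-based agent messaging: a mailbox is a directory,
# a message is a timestamped JSON file. No escaping is attempted here;
# a real implementation would serialize properly.

msg_send() {  # msg_send <inbox_dir> <from> <text>
  local inbox="$1" from="$2" text="$3"
  mkdir -p "$inbox"
  printf '{"from":"%s","text":"%s"}\n' "$from" "$text" \
    > "$inbox/$(date +%s%N)-$from.json"
}

msg_drain() {  # print all pending messages oldest-first, then delete them
  local inbox="$1" f
  for f in "$inbox"/*.json; do
    [ -e "$f" ] || continue
    cat "$f" && rm "$f"
  done
}
```

Lexicographic glob order over the timestamp prefix gives oldest-first delivery for free, which is exactly the property a tmux pane cannot offer once it scrolls away.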

## What we invented (or at least: didn’t find elsewhere)

| Pattern | Why we needed it | Where it lives |
| --- | --- | --- |
| Decisions timeline + ghost rate | Detect orchestrator parser regressions | `/api/harness/:slug/decisions` + dashboard panel |
| Cost-per-feature attribution | Per-session cost tracking is too coarse | Filename tag pattern `<ts>-<role>-<fid>.jsonl` + `/usage` byFeature |
| Plan-reviewer as autonomous gate | gstack’s `/plan-ceo-review` is human-triggered; ours fires automatically | `prompts/plan-reviewer.md` + `run.sh` gate |
| Autonomous Product role | MOLTRON evolves capabilities; we wanted scope expansion | `prompts/product.md` + GOAL.md vs SPEC.md diff + proposals CRUD |
| Proposals CRUD with bulk accept/reject | Operator triage workflow | `.harness/proposals/*.md` + dashboard 🪄 tab |
| Mission cost cap with SIGSTOP auto-pause (later removed) | Soft warn before hard cap | Removed at user request after a $135 mission |
| Per-feature timeline aggregating snapshots + agent runs + debug + PR | Forensics replacement for log scrolling | `/api/harness/:slug/features/:id/timeline` |
| Health endpoint with 9 composite checks | One-glance system health | `/api/harness/:slug/health` |
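
The cost-per-feature row can be made concrete. Assuming each session transcript lands on disk as `<ts>-<role>-<fid>.jsonl` and each line carries a `cost_usd` field (the field name is our assumption, not the harness's real schema), attribution is a pure filename-plus-line scan:

```shell
# Sketch of filename-tag cost attribution: the feature id is the trailing
# tag of each transcript filename, so per-feature cost needs no database.

cost_by_feature() {  # cost_by_feature <logs_dir> -> "<fid> <total>" per feature
  local f fid
  for f in "$1"/*.jsonl; do
    [ -e "$f" ] || continue
    fid=$(basename "$f" .jsonl | awk -F- '{print $NF}')   # trailing -<fid> tag
    # sum the cost_usd field across this session's JSONL lines
    awk -v fid="$fid" -F'"cost_usd":' \
      'NF>1 {split($2,a,/[,}]/); sum+=a[1]} END {printf "%s %.2f\n", fid, sum}' "$f"
  done | awk '{sum[$1]+=$2} END {for (k in sum) printf "%s %.2f\n", k, sum[k]}'
}
```

Because the role is also in the filename, the same scan keyed on the middle tag yields cost per role instead.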

We bet that:

  1. A small bash supervisor with no agent-runtime opinion is the right unit of generality.
  2. Stable per-role markdown prompts give us prompt-cache-dominance (99.94% hit rate).
  3. Files-on-disk are the right state model — any role can re-derive from them.
  4. A curator role doing memory compaction prevents unbounded raw.md growth.
  5. Observability beats autonomy — we’d rather see the orchestrator’s parse-ghost rate than have a more autonomous orchestrator we can’t observe.
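
As a sketch of bet 5, a parse-ghost rate can be computed from the decision log alone: the fraction of decision lines whose verb the parser would not recognize. The verb list and log format here are illustrative, not the orchestrator's actual grammar:

```shell
# Sketch of the parse-ghost rate: count decision lines whose verb falls
# outside the known set; those are the "ghosts" the parser silently drops.

ghost_rate() {  # ghost_rate <decisions_log> -> prints ratio, e.g. 0.0333
  awk '
    /^DECISION / {
      total++
      # illustrative verb whitelist; anything else is a ghost
      if ($2 !~ /^(CHECKPOINT|PASS|FAIL|RETRY|DONE)$/) ghosts++
    }
    END { printf "%.4f\n", (total ? ghosts / total : 0) }
  ' "$1"
}
```

A rising ratio after a prompt or parser change is the regression signal the decisions-timeline panel surfaces.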

So far the bet is paying off: a 99.94% cache-hit rate, 22/37 features green at iteration 5 of the live mission, $135 per mission, and an effective cost of $0.05 per role invocation.