Everything I Learned About Harness Engineering and AI Factories in San Francisco (April 2026)
I spent the last week of March 2026 in San Francisco talking to CTOs, CPOs, and engineering leaders from companies of every size about how they actually build with AI agents today. I met solo founders of pre-Series A startups, attended Y Combinator DevTool Day on March 27 and All Things Web on March 31, sat down with our advisors, and had dozens of conversations with founders and tool builders working at the frontier.
This document is what I brought back. It is a field report: what I learned, what I think matters, and where the industry seems to be heading. It is also the reference document my team and I will use to structure how we adopt these practices ourselves.
The audience is startup founders, CTOs, CPOs, and senior engineers and product managers who are already past the "what is an LLM" stage and want to know what actually works in production. San Francisco is not the whole market, but it is often a leading indicator, and right now the signal is strong.
The terms below are overloaded, so I use them narrowly:
- Model / LLM: The base intelligence layer: tokens in, tokens out. On its own it does not remember sessions, read your repo, run commands, or verify its work. An LLM is one specific technology for building such models.
- Harness: Everything around the model: instructions, context, tools, runtime, permissions, review loops, verification.
- Agent: A harnessed loop that can decide, act, observe, and continue until done or blocked.
- Vibe coding: A low-structure accept-and-iterate workflow. Useful for exploration and prototypes. Weak for correctness, repeatable delivery, and regulated workflows.
- AI factory: The org-level system that repeatedly turns intent into shipped work: issue framing, execution, review, deployment, telemetry, feedback. Partly engineering, partly product operations. AI Factory enables Vibe Coding at Scale.
1. What's Happening and What It Means — Tech and Product Hot Takes from the Bay Area
This section is intentionally opinionated. These are not consensus statements. They are recurring arguments, observed shifts, and directional predictions heard across both conferences and in every conversation I had that week.
Productivity x10 since December 2025
This was a common framing, but it should not be presented as an audited universal benchmark.
The charitable and defensible version is:
- The comparison several aggressive teams make is against December 2025 workflows, not against the pre-AI era.
- In one quarter, models improved, harnesses improved, and orchestration improved at the same time.
- The operating ceiling for one engineer with good agents feels materially different than it did a few months earlier.
Treat "10x" as a directional claim from fast adopters, not as settled measurement science.
Startups that don't adopt will die
This is rhetorical, but the underlying claim is serious.
What the statement is really pointing at:
- The compounding advantage is not only code generation speed.
- It is shorter build-review-ship-learn loops.
- Teams that delay adoption entirely are not just slower at implementation; they are slower at learning.
The real decision is not "AI or no AI." The real decision is how much of the delivery loop remains human-led, and which work becomes agent-native now.
The rise of the "Builder"
The distinction between UI designer, UX researcher, product owner, and developer is collapsing. The recurring claim is that a new profile is emerging: the Builder, someone who owns the problem end-to-end and uses agents to cover the skills they lack.
- A PM with no frontend experience ships a working UI change.
- A designer pushes code, not just mockups.
- A founder prototypes a full feature before involving the team.
The threshold for producing a first-pass pull request dropped so sharply that role boundaries stopped being the constraint. What matters now is not your job title but whether you can judge the output: does this diff belong in the product, is it correct, and is it coherent with everything else?
The bottleneck is moving to product strategy
When implementation gets cheaper, bad strategy gets more expensive.
The reason is simple:
- Slow implementation used to absorb weak decisions.
- Fast implementation removes that buffer.
- Teams can now ship low-quality strategy much faster than before.
This is why product quality now depends more on prioritization discipline, not less.
The startup lifecycle is compressing
Agent-driven development compresses the time between:
- hypothesis
- first product
- early traction
- version-two confusion
You reach "the first vision is basically built, now what?" much faster.
That creates a new failure mode:
- the company has engineering leverage
- but it does not yet have strategic clarity for what to do with it
The result is feature volume without product direction.
The IDE is dead
Also rhetorical.
The stronger version is:
- The center of gravity is moving from the editor to the agent console.
- Editors still matter.
- But for multi-step work, the critical surface is now orchestration, visibility, review, status, and control over parallel sessions.
The terminal wins whenever the work looks more like operating a system than typing code line by line.
There is no excuse not to run 24 hours a day
This follows directly from the previous point. If the compounding advantage is loop speed, then leaving agents idle overnight is a deliberate choice to slow that loop.
The argument is not about developer working hours. It is about asset utilization. Agents are infrastructure. Leaving them idle from 7pm to 9am is the equivalent of shutting down your CI pipeline every evening and restarting it in the morning.
The technical capability is no longer in question. Rakuten engineers ran Claude Code autonomously for seven hours on a 12.5-million-line codebase, achieving 99.9% accuracy. OpenAI published a Codex stress test that ran for 25 hours uninterrupted. These are logged runs, not demos.
What the strongest teams described:
- Engineers push work at end of day. Agents pick up test writing, code review, refactoring, and security scans overnight.
- By morning, the codebase has been tested, reviewed, and flagged. The engineer's first task is triage, not implementation.
- Nothing merges without human approval. The overnight cycle produces candidates, not commits.
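The overnight pattern above can be sketched as a small batch loop. This is a minimal illustration, not a real harness: the queue contents are made up, and the runner is a stand-in for whatever actually drives Claude Code or Codex headlessly. Nothing in it merges anything.

```python
# Hypothetical overnight batch: drain a queue of agent tasks, each
# producing a candidate branch for morning triage. The runner is a
# stand-in; a real one would invoke an agent CLI per task.
OVERNIGHT_QUEUE = [
    "write tests for billing",
    "security scan auth module",
    "refactor date utils",
]

def run_overnight(queue, run_task):
    """run_task(task) -> candidate branch name. Returns the triage list."""
    return [run_task(task) for task in queue]

# Stand-in runner that just derives a branch name from the task.
candidates = run_overnight(
    OVERNIGHT_QUEUE, lambda t: "agent/" + t.replace(" ", "-")
)
print(candidates[0])  # agent/write-tests-for-billing
```

The point of the shape is the return value: the morning output is a list of candidates to triage, never a list of merges.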
Do we need fewer PMs or more?
This is still the wrong framing. Three product people for fifteen engineers is more than enough, possibly too many. The old ratio of one PM per five to seven engineers assumed the PM was the translation layer between business intent and technical execution. When agents eliminate most of that translation cost, the PM's value shifts entirely upstream.
What changes is not mainly the headcount math. It is the job shape.
Work that shrinks:
- detailed ticket translation
- backlog grooming as a communication bridge
- implementation-level handholding
Work that grows:
- market understanding
- synthesis of customer signal
- prioritization under much faster engineering throughput
- deciding what not to build
The PM role moves upstream. Less project management. More judgment.
Tasks for me or for the agent?
| Usually better delegated to agents | Usually still human-led |
|---|---|
| Correctness sweeps | Where to start |
| Testing | Architecture |
| Error handling | Design direction and consistency |
| Debugging after reproduction | Abstraction boundaries |
| Boilerplate | Data model and API shape |
| Translation | Refactoring intent |
| Thoroughness | Product judgment |
| Repetitive implementation | Priority tradeoffs |
The practical question is not "can the model do this?" It is "what is the cost of a silent mistake here, and how cheaply can I detect it?"
Model choice: Claude 4.6 vs GPT-5.4? You should use both
| Claude Opus 4.6 | GPT-5.4 in Codex |
|---|---|
| Better first-pass writing tone | Better implementation reliability |
| Better exploratory docs and explanation | Better verification, testing and final passes |
| Strong for frontend and UI taste | Strong for correctness-sensitive backend work |
| Strong for interactive computer use | Strong for long, tool-heavy execution in Codex |
This is a heuristic, not a law. The real point is to stop treating model choice as a religion and start treating it as task routing.
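Treating model choice as task routing can be as simple as a lookup table. A minimal sketch; the model identifiers and task categories below are illustrative assumptions, not official API names, and the mapping just mirrors the heuristic table above:

```python
# Hypothetical task-to-model router following the heuristic table above.
# Model names and categories are illustrative, not an official API.
ROUTING = {
    "frontend": "claude-opus-4.6",
    "docs": "claude-opus-4.6",
    "exploration": "claude-opus-4.6",
    "backend": "gpt-5.4-codex",
    "testing": "gpt-5.4-codex",
    "verification": "gpt-5.4-codex",
}

def pick_model(task_category: str) -> str:
    """Return the default model for a category, with a generalist fallback."""
    return ROUTING.get(task_category, "claude-opus-4.6")

print(pick_model("backend"))  # gpt-5.4-codex
```

In practice the table would live in your harness config and be tuned from revert and rework data, not hardcoded.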
The strongest proof point: on March 30, 2026, OpenAI open-sourced codex-plugin-cc: an official plugin that lets you invoke Codex directly from Claude Code. OpenAI shipping a plugin inside a competitor's tool confirms the moat is the harness, not the model. They'd rather have Codex running inside Claude Code (collecting API charges per review) than have users not use Codex at all. The ecosystem is converging on interoperability, not lock-in.
The category is still moving fast. Overbuilding orchestration too early is an easy way to create your own internal product to maintain.
2. Harness Engineering Pillars
Harness engineering is not "writing a better prompt." It is the design of the system around the model so output quality depends less on raw model brilliance and more on structure.
Minimal AI Factory Architecture
If you strip the category down to its minimum useful shape, an AI factory has seven layers:
- Intent capture: Product request, bug, support signal, roadmap item, or internal need.
- Spec or issue framing: A bounded instruction with constraints, acceptance criteria, and links to context.
- Context and instruction layer: Repo guidance, scoped rules, skills, docs, APIs, and environment facts.
- Execution layer: One or more agents editing code, calling tools, and running commands.
- Verification layer: Tests, static analysis, review agents, CI, and human sign-off.
- Isolation and permission layer: Worktrees, sandboxes, runtime isolation, secret boundaries, and approval flows.
- Feedback layer: Production telemetry, customer signal, review outcomes, and repeated failures fed back into rules, prompts, or process.
If one of these layers is weak, the whole system regresses:
- No issue framing: fast implementation of vague intent.
- No context discipline: expensive wandering.
- No verification: vibe coding at scale.
- No isolation: parallelism without control.
- No feedback loop: repeated mistakes with better marketing.
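One lightweight way to use this list is a periodic self-audit. A minimal sketch, assuming a 1-5 self-assessment score per layer (the scores shown are invented for illustration):

```python
# Self-audit sketch over the seven factory layers listed above.
# Scores are a 1-5 self-assessment; the example team is invented.
LAYERS = [
    "intent_capture", "issue_framing", "context_layer", "execution",
    "verification", "isolation", "feedback",
]

def weakest_layers(scores: dict[str, int], threshold: int = 3) -> list[str]:
    """Return the layers scored below threshold, in pipeline order."""
    return [layer for layer in LAYERS if scores.get(layer, 0) < threshold]

team = {"intent_capture": 4, "issue_framing": 2, "context_layer": 4,
        "execution": 5, "verification": 1, "isolation": 3, "feedback": 2}
print(weakest_layers(team))  # ['issue_framing', 'verification', 'feedback']
```

Since the whole system regresses to its weakest layer, the output is effectively a prioritized to-do list.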
Instructions, rules, plugins and skills
The important instruction artifacts are:
| Artifact | Primary use | Notes |
|---|---|---|
| `AGENTS.md` | Shared project instructions across agent tools, auto-imported by Codex. | Standard format used by all providers but Anthropic |
| `CLAUDE.md` | Same as `AGENTS.md`, auto-imported by Claude. | Can symlink `AGENTS.md` |
| `SKILL.md` | Narrow, on-demand workflow or capability | Use for reusable task methods, not global policy |
| `.cursor/rules/*.md` | Cursor-specific structured rules | Useful when you need metadata or path scoping |
Plugin vs. Skill:
A skill is a single `SKILL.md` file invoked via slash command (`/deploy`). A plugin is a directory with a `.claude-plugin/plugin.json` manifest that bundles multiple skills, hooks, agents, and MCP configs into a distributable package (`/plugin-name:command`). Use skills for personal workflows. Use plugins when sharing across teams.
ℹ️ Avoiding duplication between Claude Code and Codex: If you use both tools on the same repo, pick one source of truth:
- Symlink (simplest): `ln -sf AGENTS.md CLAUDE.md`. Both filenames point to the same content. Zero drift.
- Reference: Put `@AGENTS.md` inside your CLAUDE.md. Claude Code reads the referenced file inline. Add Claude-specific instructions below.
- Pointer: Keep all shared instructions in AGENTS.md. Make CLAUDE.md a one-liner such as `READ AGENTS.md FIRST`, with overrides below it.
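If you adopt the symlink option, drift is structurally impossible, but someone will eventually replace the symlink with a regular file. A small guard for CI or pre-commit can catch that; a minimal sketch, assuming the two files live at the repo root:

```python
# Sketch: verify CLAUDE.md is still a symlink to AGENTS.md, so the two
# instruction files can never drift apart. Intended as a CI/pre-commit check.
from pathlib import Path

def check_single_source(repo_root: str = ".") -> bool:
    """True only if CLAUDE.md is a symlink resolving to AGENTS.md."""
    claude = Path(repo_root) / "CLAUDE.md"
    return claude.is_symlink() and claude.resolve().name == "AGENTS.md"
```

A missing CLAUDE.md, or one that was turned into a plain file, returns False and can fail the build.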
Concrete architecture: multi-tool project
my-project/
├── AGENTS.md # Source of truth (shared instructions)
├── CLAUDE.md -> AGENTS.md # Symlink for Claude Code
├── .claude/
│ ├── CLAUDE.md # Claude-specific overrides (optional)
│ ├── rules/
│ │ ├── testing.md # "Always run pytest before committing"
│ │ └── frontend.md # "Use Tailwind, no inline styles"
│ └── skills/
│ ├── deploy/
│ │ └── SKILL.md # /deploy: push to prod workflow
│ └── review/
│ └── SKILL.md # /review: pre-landing PR checks
├── .cursor/
│ └── rules/
│ ├── base.md # Cursor-specific conventions
│ └── api.md # Path-gated to src/api/**
└── src/
└── api/
└── AGENTS.md # Directory-scoped: "All endpoints need auth"
What happens at session start:
- Claude Code loads: `CLAUDE.md` (-> `AGENTS.md` via symlink) + `.claude/CLAUDE.md` + `.claude/rules/*.md` + skill names from `.claude/skills/`. When you type `/deploy`, the full `deploy/SKILL.md` loads into context.
- Codex loads: `AGENTS.md` at root. When working in `src/api/`, it also loads `src/api/AGENTS.md`. The `.claude/` directory is ignored.
- Cursor loads: `.cursor/rules/*.md` + `AGENTS.md` at root. The `.claude/` directory is ignored.
Keep root context lean
The best recent corrective on context-file enthusiasm came from ETH Zurich: detailed repository context often increases cost and can reduce task success when it adds unnecessary requirements.
| Use the root file for | Do not use the root file for |
|---|---|
| Build, test, and lint commands | Generic clean-code slogans |
| Dangerous areas and non-obvious constraints | Style rules your formatter already enforces |
| Generated-code boundaries | README duplication |
| Migration or deployment cautions | Long architecture tutorials the agent can read elsewhere |
| Review and verification expectations | |
What matters in practice:
- Keep one shared source of truth for durable project instructions.
- Put tool-specific behavior only where it belongs.
- Put local or path-specific constraints in narrower scopes, not in the root file.
- Prefer on-demand skills for workflows that are occasionally needed, not always needed.
Verification beats advice
The rule of thumb is simple: if an error class recurs, stop describing it and start preventing it.
| Failure mode | Better fix |
|---|---|
| Agent stops too early | Explicit build-verify-fix loop |
| Agent forgets tests | Pre-completion verification hook plus CI |
| Agent edits the wrong area | Scoped instructions and path-specific rules |
| Agent repeats the same bug class | Linter, static rule, or regression test |
| Agent misses architectural context | Better issue framing and smaller task boundaries |
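The first row of the table, the explicit build-verify-fix loop, can be sketched as a gate the harness runs before the agent may declare a task done. The commands below are illustrative assumptions (pytest and ruff as stand-ins); substitute your project's actual build, test, and lint steps:

```python
# Sketch of a pre-completion verification gate. DEFAULT_CHECKS is an
# assumption; swap in your project's real build/test/lint commands.
import subprocess

DEFAULT_CHECKS = [
    ["python", "-m", "pytest", "-q"],  # tests (assumed tooling)
    ["ruff", "check", "."],            # lint (assumed tooling)
]

def failing_checks(checks=DEFAULT_CHECKS) -> list[str]:
    """Run each check command; return the ones that exited nonzero."""
    failures = []
    for cmd in checks:
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode != 0:
            failures.append(" ".join(cmd))
    return failures
```

An agent wrapper would loop: run `failing_checks()`, feed the failures back into the session, retry, and only report done when the list is empty.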
Example: LangChain published one of the clearest public examples of this pattern in February 2026: their coding agent moved from 52.8% to 66.5% on Terminal Bench 2.0 by changing the harness, not the model.
Review loops and context drift
Over time, agent-generated code drifts:
- Conventions soften
- Dead code accumulates
- Review comments repeat
- Context files become stale
Useful mitigations:
- Automated review on every meaningful PR
- A second model for high-stakes review when possible
- Periodic cleanup of root instruction files
- Tracing and postmortems on agent failures
- Converting recurring review comments into deterministic checks
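The last mitigation deserves an example. A recurring review comment such as "no bare print() in library code" can become a deterministic check that runs on every PR; a minimal sketch, where the pattern and the .py-only scope are illustrative assumptions:

```python
# Sketch: a recurring review comment turned into a deterministic CI rule.
# The banned pattern ("no bare print() calls") is an illustrative example.
import re
from pathlib import Path

def find_violations(root: str, pattern: str = r"^\s*print\(") -> list[str]:
    """Return file:line locations matching the banned pattern in .py files."""
    rule = re.compile(pattern)
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if rule.search(line):
                hits.append(f"{path.name}:{lineno}")
    return hits
```

Unlike a prompt instruction, this produces the same result on every run, which is exactly the property you want for a rule that was born from a repeated mistake.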
Example: coding standards in AGENTS.md
# Global Coding Standards
1. **YAGNI**: Don't build it until you need it
2. **DRY**: Extract patterns after second duplication, not before
3. **Fail Fast**: Explicit errors beat silent failures
4. **Simple First**: Write the obvious solution, optimize only if needed
5. **Delete Aggressively**: Less code = fewer bugs
6. **Semantic Naming**: Always name variables, parameters, and API endpoints with verbose, self-documenting names that optimize for comprehension by both humans and LLMs, not brevity (e.g., `wait_until_obs_is_saved=true` vs `wait=true`)
Source: All Things Web @ WorkOS, 31st of March 2026
3. Engineering and Product Playbook for Founders and Teams
As mentioned in the hot takes, adopting harness engineering quickly is a matter of life or death for companies of any size. As Y Combinator framed it, the trend comes from the top: the founders, specifically those owning the technical and product roles, summarized as the CTO and CPO in the rest of this document. With that framing, the CTO controls how fast the org can ship. The CPO controls whether what ships is worth shipping. When agents make the CTO side 10x faster, every CPO mistake compounds 10x faster too.
First 30 days
Don't standardize on day one. Run agents on real work for two weeks and log every revert, rework, and rejection. Then build guardrails around the failure modes you actually saw, not hypothetical ones.
- CTO: pick one harness (Claude Code or Codex, not both), add a minimal instruction file, require CI + automated review on all agent PRs, set a per-session cost alert.
- CPO: rewrite issue templates around intent and success criteria (agents execute literally), define an explicit "do not build" list for the quarter, pull customer signal into written artifacts.
- Together: review merged agent-assisted PRs weekly. Update process from real failures, not theory.
Autonomy tiers
Not all PRs need the same scrutiny. Start everything at full review. Promote downward only with evidence.
| Tier | Examples | Required before merge |
|---|---|---|
| Full autonomy | Typo fixes, test additions, dependency bumps, boilerplate | CI + automated review |
| Light review | Feature work within established patterns, bug fixes with clear repro | CI + automated review + human skim (< 5 min) |
| Full review | New endpoints, data model changes, auth/payment flows | CI + automated review + thorough human review |
| Human-led | Schema migrations, infra changes, security-critical paths | Human writes or co-writes. Agent assists. |
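A first pass at tier assignment can be mechanical, keyed on issue labels. This is an illustrative sketch, not a product feature: the label sets are assumptions you would tune from your own revert history, and the ordering encodes "mixed changes get the strictest applicable tier":

```python
# Illustrative tier classifier for the table above, keyed on issue labels.
# Label vocabularies are assumptions; tune them from real revert data.
HUMAN_LED = {"migration", "infra", "security"}
FULL_REVIEW = {"endpoint", "data-model", "auth", "payments"}
FULL_AUTONOMY = {"typo", "tests", "deps", "boilerplate"}

def autonomy_tier(labels: set[str]) -> str:
    # Check from most to least scrutiny, so a change touching both a typo
    # and an auth flow still gets the stricter tier.
    if labels & HUMAN_LED:
        return "human-led"
    if labels & FULL_REVIEW:
        return "full review"
    if labels & FULL_AUTONOMY:
        return "full autonomy"
    return "light review"  # default for in-pattern feature work

print(autonomy_tier({"typo", "auth"}))  # full review
```

The default branch matters: anything unrecognized lands in light review, never in full autonomy, which matches the "promote downward only with evidence" rule.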
Cadence
- Weekly: review agent-authored regressions. Convert the top recurring mistake into a deterministic rule. Check whether issues were specific enough for agents to act without churn.
- Monthly: reclassify work across autonomy tiers. Remove dead rules and stale instructions. Audit feature velocity vs. feature impact: are we shipping noise?
- Quarterly: revisit the stack, permission model, cost structure, and PM staffing ratio.
Metrics
- Lead time from issue to merged PR
- Agent autonomy rate (% of tasks without human intervention)
- Reopen and rollback rate on agent-authored changes
- Wasted work rate (features reverted or unused within 30 days)
- Issue clarity (% of issues agents can act on without clarification)
- Monthly agent API cost per engineer
- Cycle time from customer signal to shipped outcome
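Two of these metrics are straightforward to compute from exported PR records. A minimal sketch; the record fields (`author`, `human_edited`) and the timestamp format are hypothetical, so map them onto whatever your tracker and Git host actually export:

```python
# Sketch: lead time and agent autonomy rate from exported PR records.
# Field names and timestamp format are hypothetical placeholders.
from datetime import datetime

def lead_time_hours(issue_created: str, pr_merged: str) -> float:
    """Issue-to-merge lead time; timestamps as 'YYYY-MM-DDTHH:MM:SS'."""
    fmt = "%Y-%m-%dT%H:%M:%S"
    delta = datetime.strptime(pr_merged, fmt) - datetime.strptime(issue_created, fmt)
    return delta.total_seconds() / 3600

def agent_autonomy_rate(prs: list[dict]) -> float:
    """Percent of agent-authored PRs merged without human edits."""
    agent_prs = [p for p in prs if p["author"] == "agent"]
    if not agent_prs:
        return 0.0
    clean = sum(1 for p in agent_prs if not p["human_edited"])
    return 100.0 * clean / len(agent_prs)
```

The useful part is the trend, not the absolute number: both metrics should move in your favor as guardrails convert recurring failures into deterministic checks.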
4. Agent Factory Tooling
The point is not to install everything below. The point is to identify the bottleneck you actually have.
The winning stack pattern
This is the stack pattern I would describe as convergent, not mandatory:
| Layer | Standard choice | Why it keeps showing up |
|---|---|---|
| Source of truth | GitHub | Claude Code authors ~4% of all public commits (~135K/day). Every agent tool produces PRs against GitHub repos. The entire agent factory pattern assumes Git and GitHub as the substrate. |
| Planning | Linear | Declared "issue tracking is dead" (March 2026). Coding agents installed in 75% of enterprise workspaces. Deeplinks send issue context directly into Claude Code, Cursor, or Copilot as prefilled prompts. Agent work volume up 5x in three months. |
| Trigger and coordination | Slack | Non-engineers describe a problem or request in Slack; an MCP integration routes it to an agent that opens a PR. The barrier drops from "file a ticket" to "describe it in a message." |
| Thinking and notes | Obsidian | Local markdown files that agents can read via MCP. Where intent gets structured before it becomes an issue or a prompt. |
| Runtime | Cloudflare Agents | Agents SDK, Durable Objects for state, Workflows for long-running tasks. Workers AI runs frontier models on-platform with 77% cost reduction on 7B token/day workloads vs. external API calls. |
| Observability | Sentry | Error tracking plus LLM-specific monitoring: agent runs, tool calls, token usage, conversation replay. Also maintains Claude Code agent skills (iterate-pr, code review): sits on both sides of the workflow. |
| Business signal | HubSpot | Customer feedback, support tickets, and sales conversations flow into the planning layer, giving agents business context for what to build next. |
Terminal & orchestration
| Tool | Bottleneck it solves | Why it matters |
|---|---|---|
| cmux / repo | 5+ agent sessions with no status visibility: constant tab-switching | macOS-native terminal with GPU-accelerated rendering (libghostty), per-agent green/yellow/red status indicators, git branch + PR status per workspace. Works with Claude Code, Codex, Gemini CLI. |
| Superset / repo | Parallel agents stepping on each other's files and git state | Git worktree isolation per agent. Each agent gets its own sandbox with no shared mutable state. Launched March 2026. |
| Conductor | Running agents sequentially: throughput capped at 1x | Orchestration layer from gstack. Runs multiple Claude Code sessions in parallel, each in its own isolated workspace. Garry Tan regularly runs 10-15 parallel sprints. |
| Claude Manager | Losing track of which Claude session is running, waiting, or finished | Rust TUI that organizes sessions by project/task hierarchy. Live status indicators, diff preview without attaching, worktree lifecycle management. First published March 2026. |
Spec & planning
| Tool | Bottleneck it solves | Why it matters |
|---|---|---|
| OpenSpec | Agents coding before the problem is well-defined: expensive iterations on work that doesn't match intent | Three-phase state machine (proposal, apply, archive). Agent must produce a ~250-line spec before writing code. Supports Claude Code, Cursor, Copilot, and 20+ tools. 27K+ stars, YC-backed. |
Quality & review
| Tool | Bottleneck it solves | Why it matters |
|---|---|---|
| Codex plugin for Claude Code | Want a second opinion from a different model without leaving Claude Code | OpenAI's official plugin (open-sourced March 30, 2026). Adds /codex:review and /codex:adversarial-review. Uses the same harness as Codex itself. Runs in background using your ChatGPT subscription. |
| CodeRabbit | PR reviews are slow (waiting for humans) or shallow (humans skim large diffs) | Always-on AI review on every PR. 13M+ PRs reviewed, 2-3M connected repos, 75M defects found. GitHub/GitLab/Azure DevOps/Bitbucket. Free tier available, SOC 2 Type II. |
| Taskless | Agent keeps making the same class of mistake: you fix it once but nothing prevents it from reappearing | Converts code review corrections into deterministic syntax-tree rules (tree-sitter). Tag @taskless on a PR or file an issue; it creates a pass/fail rule that runs on every PR, in every IDE, on every run. Same result every time: not AI opinions, not prompt engineering. 25+ languages, zero instrumentation. |
| Sentry iterate-pr | Manual PR-fix-CI loops: developer re-runs checks, reads logs, applies fix, resubmits | Encodes the fix-CI-resubmit loop as a reusable skill. Agent detects failures, applies fixes, and re-runs checks without human intervention. Good reference for encoding any mechanical review iteration as a skill. |
| gstack | No structured review/QA patterns beyond basic linting | Pattern library, not a package: role-based review, directory freezes, visual QA, pre-landing checks. Steal the patterns that match your failure mode, ignore the rest. |
Context & memory
| Tool | Bottleneck it solves | Why it matters |
|---|---|---|
| Claude-Mem | Sessions are stateless: everything the agent learned is lost when the session ends | Auto-captures session activity, compresses it with AI (agent-sdk), injects relevant context into future sessions. Adds dynamic, session-derived memory on top of static CLAUDE.md files. 44K+ stars. |
Runtime isolation
| Tool | Bottleneck it solves | Why it matters |
|---|---|---|
| Coasts / repo | Two agents both running `localhost:3000`: port collisions block parallel testing | Each worktree gets its own containerized runtime with dynamic port assignment. Agnostic to AI providers. Single config file. |
| Docker-in-Docker / Docker Sandboxes | Need N isolated full-stack copies (app, database, workers) per agent | Docker Compose with per-agent port mappings. Docker Desktop 4.60+ supports Sandboxes in dedicated microVMs with network isolation. Heavier than Coasts but gives full stack isolation. |
Other tools worth watching
Not all of these belong in a default stack. They are still worth tracking because they attack real bottlenecks.
| Tool | What it does | Why it's interesting |
|---|---|---|
| Ghost | Instant, ephemeral Postgres databases: agents spin them up like git branches. MCP/CLI only, no UI. | Standard SQL, no proprietary SDK. 100 hrs/month free. Pairs with Memory Engine, TigerFS, and Ox (sandboxed execution), all Postgres-native. |
| fp | CLI-first, local-first issue tracking for Claude Code. `/fp-plan`, `/fp-execute`, `/fp-review`. | Local code review interface that sends inline comments back to the agent. No external service required. Mac desktop app. |
| GitButler | Parallel branches in a single working directory via virtual branching: no worktree directories. | Assign file changes to different branches visually. All branches start from the same state, guaranteed to merge cleanly. Lighter than worktree-based isolation. |
| FinalRun | Vision-based mobile testing on real iOS/Android devices. Test cases written in plain English. | 76.7% on Android World Benchmark (116 tasks): ahead of DeepSeek, Alibaba, ByteDance agents. ~99% flaky-free. 2-person startup. |
| SuperBuilder | Mac-native command center for Claude Code with per-message cost tracking, rate-limit queuing, and Branch Battle. | Free, BYOK. Tracks cost per thread/project, queues tasks through rate limits, compares two approaches side by side. |
| AgentsMesh | Remote AgentPods for running multiple coding agents (Claude Code, Codex, Gemini CLI, Aider, OpenCode). | Self-hosted runners, gRPC + mTLS control plane, Kanban with ticket-to-pod binding. One dev built 965K lines in 52 days using it. |
| Ghostgres | Experimental Postgres fork from Timescale: "there are no dumb queries, only dumb databases." | Early-stage (32 stars), but Timescale's broader push includes pgai (embeddings + NL-to-SQL in Postgres) and Ox (agent sandbox TUI). |
5. References
- Y Combinator DevTool Day (March 27): https://www.ycombinator.com/
- All Things Web @ WorkOS (March 31): https://allthingsweb.dev/2026-03-31-all-things-web-workos
- Anthropic: Manage Claude's memory: https://docs.anthropic.com/en/docs/claude-code/memory
- Anthropic: Claude Opus 4.6: https://www.anthropic.com/news/claude-opus-4-6
- Anthropic: Claude Sonnet 4.6: https://www.anthropic.com/news/claude-sonnet-4-6
- OpenAI: Introducing Codex: https://openai.com/index/introducing-codex/
- OpenAI: Introducing GPT-5.4: https://openai.com/index/introducing-gpt-5-4/
- AGENTS.md standard: https://agents.md/
- Cursor docs on rules and AGENTS.md: https://docs.cursor.com/en/context
- ETH Zurich / SRI Lab: Evaluating AGENTS.md: https://www.sri.inf.ethz.ch/publications/gloaguen2026agentsmd
- LangChain: Improving Deep Agents with harness engineering: https://blog.langchain.com/improving-deep-agents-with-harness-engineering/
- Linear changelog, deeplinks to coding tools: https://linear.app/changelog/2026-02-26-deeplink-to-ai-coding-tools
- Cloudflare Agents docs: https://developers.cloudflare.com/agents/