Interactive tour // v0.1

Pick the lowest rung that gets the job done.

Six tiers of automation, animated by tiny pixel agents. Click a rung to watch how the work flows. The further you climb, the more autonomy you hand over — and the more eval surface you take on.

CAST User · Agent · Tool (DB) · Reviewer · Reply

Six rungs from "if-then" to autonomous agents.

Click a rung. Watch the pixel agents act it out. Every step up adds power and adds failure modes — the rule is to start at the bottom and only climb when the data forces you to.

STAGE_01 / RUNG_01.SCN LIVE
Script: User → Script → Output
Single LLM call: User → LLM
LLM + tool: User → LLM → Tool · CRM
Human-in-the-loop: User → LLM · drafts → Human · approves → Send
Agent loop: User → Agent · plans → Search → Calc → Done
Multi-agent: User → Orchestrator → Researcher → Writer → Brief
USE WHEN Rules are stable, inputs are structured

One picks the steps. The other decides them.

Same task: "respond to a customer asking about their order." On the left, you wrote the steps and the LLM fills in the blanks. On the right, the agent is given a goal + tools and figures the path out itself. Watch them go.

LLM Workflow

FIXED PATH · 1 STEP
TASK "Tag this support email with category, urgency, and language."
User
LLM
  • You define: read email → assign tags
  • One round trip. Predictable cost.
  • Works when path is knowable.
  • Fails when input is unbounded.
Real example · Notion AI summarize. One prompt: "summarize this page in 3 bullets." No tools, no loops. Round-trip cost: ~$0.01. Ships in a day.
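
In code, the fixed path is one function around one model call. A minimal sketch, assuming a hypothetical complete() helper standing in for whichever LLM API you use:

    import json

    def complete(prompt: str) -> str:
        """Hypothetical one-shot wrapper around your LLM provider's API."""
        raise NotImplementedError("wire this to your provider")

    # Fixed path: you wrote the steps; the model only fills in the blanks.
    def tag_email(email_body: str) -> dict:
        prompt = (
            "Tag this support email. Reply with JSON only, keys: "
            '"category", "urgency", "language".\n\n' + email_body
        )
        return json.loads(complete(prompt))  # one round trip, predictable cost

The whole failure surface is the prompt and the JSON parse, which is why this rung ships in a day.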

Agent Workflow

OPEN PATH · N STEPS
TASK "Resolve this customer's refund request end-to-end. Look up the order, check policy, decide eligibility, draft the reply, escalate if > $500."
User
Agent
CRM
Orders
Refund
Reply
  • You define: goal + toolbox + when to stop
  • N round trips. Variable cost.
  • Works when path is unknowable.
  • Fails open without strict eval coverage.
Real example · Cursor "fix this bug" agent. Given a failing test, it reads files, edits code, re-runs the test, iterates until green. Path unknowable in advance — can't write it as steps.
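
The agent version is a loop, not a line. A sketch of the shape only; the planner, the tool stubs, and MAX_TURNS below are illustrative assumptions, not a real framework:

    from dataclasses import dataclass, field

    @dataclass
    class Action:
        name: str                        # a tool name, or "done" / "escalate"
        args: dict = field(default_factory=dict)
        output: str = ""

    def choose_next_action(history) -> Action:
        """Hypothetical LLM planner: one round trip picks the next step."""
        raise NotImplementedError("wire this to your model")

    def hand_to_human(history) -> str:
        return "escalated with context: " + repr(history[-1])

    TOOLS = {                            # stub toolbox; replace with real calls
        "lookup_order": lambda **a: {"id": a.get("id"), "amount": 120},
        "check_policy": lambda **a: {"eligible": True},
        "draft_reply":  lambda **a: "Hi, your refund is approved.",
    }
    MAX_TURNS = 10                       # hard stop: open-ended, not unbounded

    def run_agent(goal: str) -> str:
        history = [goal]
        for _ in range(MAX_TURNS):
            action = choose_next_action(history)        # one LLM round trip
            if action.name == "escalate":               # e.g. refund > $500
                return hand_to_human(history)
            if action.name == "done":
                return action.output
            result = TOOLS[action.name](**action.args)  # execute chosen tool
            history.append((action.name, result))       # N trips, variable cost
        return hand_to_human(history)                   # fail closed, not open

The design choice that matters: both stop conditions hand off to a human. An agent that hits its turn limit should fail closed, not ship an unverified answer.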
Rule of thumb: start at LLM. Climb to agent only when you can prove an LLM can't get there.

Same workflow, very different error tolerance.

Customer support is the canonical case. Internal users forgive a wrong answer; external customers don't. Toggle the audience and watch the workflow's checkpoints rearrange themselves.

STAGE_03 / INTERNAL.SCN LIVE
Employee → Agent → Knowledge Base → Reviewer → Reply
Audience
Internal Employee
Error Tolerance
High
A wrong answer is annoying. The user can DM you for a correction.
Recommended Rung
L3 · Agentic loop
Let the agent search, reason, and answer end-to-end. Log everything.
Mandatory Guardrails
Logging + weekly review
Cheap to ship, cheap to fix. Iterate from real failure logs.
Real-world example
Notion HR helpdesk
An employee asks "how many vacation days do I have left?" The agent queries the HRIS, replies in Slack. A wrong answer is a quick correction in DM. Worth automating end-to-end.
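
The toggle can be made concrete as a guardrail table keyed by audience. The field names and the external-customer row are illustrative assumptions, not a real schema:

    # Same workflow, different checkpoints: toggling the audience swaps the
    # recommended rung and the mandatory guardrails.
    GUARDRAILS = {
        "internal_employee": {
            "error_tolerance": "high",
            "rung": "agentic loop",           # answer end-to-end
            "review": None,                   # wrong answer = quick DM correction
            "logging": "everything",
            "failure_review": "weekly",
        },
        "external_customer": {
            "error_tolerance": "low",
            "rung": "human-in-the-loop",      # draft, then approve
            "review": "approve_before_send",  # fail closed in front of customers
            "logging": "everything",
            "failure_review": "daily",
        },
    }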

You don't ship and walk away. You close the loop.

Every automation that survives contact with reality has the same four-stage rhythm. Agents move around the ring. Failures from the field become tomorrow's eval cases.

STAGE_04 / FEEDBACK_LOOP.SCN LIVE
Discover
Validate
Implement
Review
The loop is the product. Failures get logged, fed back into the eval set, and the system learns next cycle.
STAGE 01

Discover

Map the workflow. Score each task by volume × time × error cost. Pick the highest-leverage 1–2.

Example AP team mapped 6 invoice-processing tasks. Picked extraction first: 200 invoices/week × 8 min each ≈ 27 hrs/week of ops time. Structured input, low error cost: a safe early pick. Skipped GL coding (judgment-heavy, big error cost) for v2.
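
The arithmetic behind that pick, as a toy script; the volumes and the 1–5 error-cost scale are illustrative, not a canonical formula:

    # Discover: rank tasks by time sunk (volume × minutes), then gate on
    # error cost: automate the big, forgiving tasks first.
    tasks = [
        # (task, items/week, minutes each, error cost: 1 low .. 5 high)
        ("invoice extraction", 200, 8, 2),
        ("GL coding",          200, 3, 5),
    ]
    for name, volume, minutes, error_cost in tasks:
        hours_per_week = volume * minutes / 60   # 200 × 8 / 60 ≈ 26.7 h/week
        print(f"{name}: {hours_per_week:.1f} h/week, error cost {error_cost}/5")
    # Extraction: big time sink, low error cost → v1.
    # GL coding: judgment-heavy, error cost 5 → v2.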
STAGE 02

Validate

Hand-build v0 in Claude. Run 20–50 real examples. Write the eval rubric BEFORE building.

Example Built v0 in Claude.ai with 50 real invoices. Rubric: extract date, amount, vendor, line items. v0 hit 92% on clean PDFs, 60% on scans. Decision: ship clean-PDF flow, queue scans for v2.
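
That rubric can be an executable check before anything is built. The field names mirror the example above; exact-match scoring with equal weights is an assumption:

    # Validate: score model output against hand-labeled answers, per the rubric.
    REQUIRED_FIELDS = ("date", "amount", "vendor", "line_items")

    def score_one(predicted: dict, expected: dict) -> float:
        """Fraction of rubric fields extracted exactly right."""
        hits = sum(predicted.get(f) == expected.get(f) for f in REQUIRED_FIELDS)
        return hits / len(REQUIRED_FIELDS)

    def run_eval(cases) -> float:
        """cases: [(model_output, labeled_answer), ...] from 20-50 real invoices."""
        return sum(score_one(p, e) for p, e in cases) / len(cases)

Segment the report by input type: the 92% clean-PDF vs 60% scan split is exactly the decision data this surfaces.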
STAGE 03

Implement

Smallest scope that ships value. Ship behind a flag or HITL. Log everything: success rate, latency, cost, override rate.

Example Shipped behind a flag for the AP team. HITL for 2 weeks — every extraction reviewed before posting to NetSuite. Logged: success rate, latency, cost, override rate, per-vendor accuracy.
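
"Log everything" can be as small as one flat record per request; every Review-stage metric falls out of this file later. Field names are assumptions:

    import json, time

    def log_run(inp: str, output: str, success: bool, latency_s: float,
                cost_usd: float, overridden: bool, path: str = "runs.jsonl"):
        """Append one record per request. `output` is the final, post-review
        version; `overridden` marks whether the HITL reviewer changed it."""
        record = {"ts": time.time(), "input": inp, "output": output,
                  "success": success, "latency_s": latency_s,
                  "cost_usd": cost_usd, "overridden": overridden}
        with open(path, "a") as f:
            f.write(json.dumps(record) + "\n")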
STAGE 04

Review

Weekly: read failures, add to eval set, iterate prompt + tools, re-run evals. Loosen autonomy only when metrics earn it.

Example Override rate started at 12%. Weekly review found 3 failure patterns: handwritten amounts, multi-currency, new vendor names. Added 50 cases to eval set. Tightened prompt + added a vendor lookup tool. Override fell to 1.5% by week 6 — then we removed HITL for known vendors.
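
The weekly review then closes the loop mechanically: compute the override rate from the logs and promote every overridden run into the eval set. A sketch against the log format above (same assumed fields and paths):

    import json

    def weekly_review(log_path="runs.jsonl", eval_path="eval_set.jsonl") -> float:
        with open(log_path) as f:
            runs = [json.loads(line) for line in f]
        if not runs:
            return 0.0
        overrides = [r for r in runs if r["overridden"]]
        with open(eval_path, "a") as f:
            for r in overrides:            # yesterday's failure, tomorrow's test
                f.write(json.dumps({"input": r["input"],
                                    "expected": r["output"]}) + "\n")
        return len(overrides) / len(runs)  # loosen autonomy only as this falls

That 12% → 1.5% curve in the example is this return value, tracked week over week.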

The LLM is the computer. Your job is to direct it.

Karpathy's framing: programming has shifted from writing code to writing prompts. The context window is your RAM. The model is the CPU. Your specs are the program. Here's what changes for PMs — and what to actually do about it.

Three eras, one direction of travel

Each era keeps the previous one — you still write code, you still train models. But the leading edge of leverage moves up the stack.

Software 1.0

1950s · ONGOING
Coder → Code → CPU

Explicit rules. The human writes every branch. Predictable, debuggable, brittle when the world doesn't fit the rules.

Programmer · Human
Program · Code
Computer · CPU

Software 2.0

2012 · ONGOING
Data + Data + Data → NN · Weights → Predict

Neural networks. You curate datasets and let gradient descent find the program for you. The "code" is now millions of weights.

Programmer · Datasets
Program · Weights
Computer · GPU

Software 3.0

2023 · NOW
Spec → CONTEXT.RAM → Output

The LLM is the computer. You program it in English. The context window is RAM. Spec quality, not syntax, is the bottleneck.

Programmer · PM / Director
Program · Prompt + Spec
Computer · LLM

Vibe coding vs agentic engineering

Two different speeds. Both are useful. Knowing which one you're in is the PM job.

Vibe Coding

SPEED · PROTOTYPE · ANYONE
WHEN You want to test if an idea is even worth building. Cost of being wrong is low.
  • Describe the vibe → model ships v0 in minutes
  • Anyone can do it. No framework knowledge needed
  • Great for prototypes, demos, side projects
  • Don't ship to paying customers without a rewrite
Real example · Karpathy's MenuGen. He built an OCR pipeline for restaurant menus. Months later, one prompt to Gemini replaced the entire stack — overlaying generated images directly onto the menu pixels. The original code became dead weight.

Agentic Engineering

QUALITY · PRODUCTION · DISCIPLINE
WHEN The output ships. Brand, security, money, or compliance is on the line.
  • Coordinate stochastic agents under a quality bar
  • Evals, observability, version control, code review
  • Output is reviewed, tested, gradually trusted
  • The new "10x engineer" is fluent at this
The new 10x. Engineers who master agentic tools (multi-agent orchestration, eval harnesses, prompt versioning) are seeing productivity gains far past the historical 10x ceiling. The floor moved up; so did the ceiling.
Rule: vibe-code to learn, agentic-engineer to ship. Don't confuse the two.

Jagged intelligence — same model, wildly uneven

Models peak where output is verifiable (math, code) because that's where RL has been pointed. They cliff in places that look "obvious" to humans. Treat capability as terrain, not altitude.

^_^ PEAKS · Refactor 100k LOC · Math proofs · SQL queries
o_o MIDDLE · Long-form writing · Tool selection
x_x CLIFFS · Spatial reasoning · Common-sense logic

Don't blindly trust the output. A model that just refactored a 100k-line codebase may also fail to decide whether to walk or drive 50 metres to a car wash. Stay in the loop. Eval the domain, not the model.

"An agent tried to link user accounts using emails from different services rather than a persistent user ID — a classic intern-level mistake that needed human engineering judgment to catch."

Your role: Director, not Doer

As intelligence gets cheaper, the premium moves to taste, judgment, and oversight. The PMs who win in Software 3.0 are the ones who can write a sharp spec and review agent output ruthlessly.

You're not writing tickets.
You're directing agents.

Outsource the thinking — data crunching, code synthesis, draft generation. Don't outsource the understanding. Know your business, know your users, know what "good" looks like. The agents fill in the API details. You decide what's worth building and whether it works.

Taste  · what's worth building
Judgment  · when output is wrong
Oversight  · evals, guardrails, escalation

Three things to change this quarter

Concrete moves Karpathy calls out. Steal them.

Refactor hiring

Stop testing syntax puzzles. Hand candidates a real-scale spec — "build a secure social-media clone" — and watch how they decompose the problem and orchestrate agents to ship it.

SIGNAL Agentic fluency · spec quality · review reflex

Focus on the spec

Your role is Director. Spend your time on engineering design, architecture decisions, and detailed specs. Let the agents fill in API details, boilerplate, and tests.

YOUR OUTPUT Tight specs · clear interfaces · explicit constraints

Invest in understanding

You can outsource the thinking — data crunching, code synthesis. You can't outsource the understanding. Go deep on the business, the users, and the technical fundamentals you're directing.

DON'T OUTSOURCE Why · what good looks like · the goal