You're Absolutely Right (and Other Lies My AI Told Me)
Abstract
Your agent agrees with you. Every time. It removes tests “for clarity.” It changes `floor` to `round` because that’s “more accurate.” It says “you’re absolutely right” — and then ships the bug you were about to talk yourself out of. The failure has two roots. The agent can’t see the constraints that would let it push back (your company’s rules aren’t in any training set, and they never will be). And the agent has no rule empowering pushback when the operator is wrong (RLHF trained it to agree by default). Bigger models won’t fix this. The two largest labs in the world keep redirecting you to write a rules file — for the half they can fix and the half they can’t. This talk treats context as an engineering problem: packaged, versioned, distributed, evaluated. Three primitives — skills (reusable workflows), rules (always-loaded invariants and pushback gates), scripts (deterministic transformations) — composed into context artifacts the agent installs like dependencies. The chapter on rules-vs-scripts-vs-skills is a decision frame, not a taxonomy: when a rule beats a prompt, when a script beats a rule, when a hook beats a script. The chapter on evaluation kills the vibes-eval antipattern and gives you skill / plugin / project tiers that map onto unit / integration / system tests you already know how to write. And then — the meta reveal. The talk itself is a plugin. Every prescription on stage is a real rule in jbaruch/coding-policy on the Tessl registry. The medium is the message.
Resources
The Receipts (Cold Open & Inoculation)
- Anthropic Claude Code Issue #3382 — “you’re absolutely right” — closed as “completed” August 2025, no fix commit, recommended workaround: write a rules file
- Sycophancy in GPT-4o: What happened and what we’re doing about it — OpenAI — April 2025, five-day rollback after the model praised a “shit on a stick” business idea and validated a user who’d stopped his meds
- Expanding on what we missed with sycophancy — OpenAI — follow-up postmortem: short-term thumbs-up training drowned out the anti-sycophancy reward signal
Public Catastrophes (Slide 15 — context starvation in production)
- Replit AI deletes Jason Lemkin’s production database — The Register — July 2025, “told it not to ELEVEN TIMES IN ALL CAPS”, agent admitted “catastrophic error in judgment”, 1,206 executives + 1,196 companies wiped
- Replit CEO: What really happened when AI agent wiped Lemkin’s database — Fast Company
- Cursor + Claude Opus 4.6 wipes PocketOS production database AND backups in 9 seconds — The Register — April 2026, agent’s own postmortem opened with “NEVER FUCKING GUESS!” while describing itself guessing
- ‘I violated every principle I was given’ — Cursor agent deletes PocketOS database — Fast Company
- Google Gemini CLI deletes user’s files, confesses “gross incompetence” — Slashdot — July 2025, “I have failed you completely and catastrophically. My gross incompetence…”
The Plugins
- jbaruch/coding-policy on Tessl Registry — the talk itself, packaged. Every prescriptive claim in chapters 3, 4, and 5 is a real rule in this plugin. Sixteen rules, skills with delegated scripts, eval scenarios, versioned, peer-reviewed.
- Install: `tessl install jbaruch/coding-policy`
- jbaruch/kotlin-tutor on Tessl Registry — the running example throughout the talk: a teaching plugin for idiomatic Kotlin. Skill `kotlinify-tests`, rules K-1..K-6 (`prefer-val`, `nullable-question-mark`, `use-data-class`, `kotest-over-junit`, `prefer-stdlib-scope`, `extension-over-util`), script `verify-no-junit-assertions`.
- Install: `tessl install jbaruch/kotlin-tutor`
Context Engineering
- Tessl — Agent Enablement Platform — versioned, distributed plugins for AI agents
- Tessl Registry — the package manager for agent skills
- Agent Skills Standard — the modern de facto standard for giving agents instructions (yes skills, no prompts)
- Anthropic Agent Skills announcement
- Model Context Protocol (MCP) — plumbing for tools and context
- Context Engineering — Tobi Lütke — “the term I much prefer over prompt engineering”
Tessl Blog — Context Engineering
- The Context Development Lifecycle (CDLC): Better Context for AI Coding Agents — context as an engineering artifact: generate, distribute, test, observe
- The Context Flywheel: Why the Best AI Coding Teams Will Win on Context — better context produces better signals produces better context; context doesn’t commoditize
- Context Maturity for AI Coding Teams — three dimensions maturing together (Agents & Tools / Context / People & Organization)
- CI/CD for Context in Agentic Coding: Same Pipeline, Different Rules — evals are to context what tests are to code
- Context-Bench: Benchmarking AI’s Context Engineering Proficiency — how efficiently a model manages memory, revisits prior context, and what it costs
- Making Claude Good at Go using Context Engineering with Tessl — applied example
Tessl Blog — Skills
- Announcing Skills on Tessl: the package manager for agent skills — skills as software with a lifecycle (versioned, tested, reusable, composable)
- What Are Agent Skills? (And Why You’ll Never Want to Push Code Without One Again)
- My Coding Agent Needed a Package Manager for Its Own Brain (And I Gave It One Using a Skills Registry)
- Do Agent Skills Actually Help? A Controlled Experiment — the lift-not-attainment proof
- Anthropic, OpenAI, or Cursor model for your agent skills? 7 learnings from running 880 evals (including Opus 4.7) — skills lift every configuration tested
- Best Agent Skills for AI Code Review: 8 Evaluated Skills
- Evaluating Kimi 2.5 vs Kimi 2.6: What happens to agent skills when the model gets smarter? — skills still matter as models get better
Tessl Blog — Evals
- Your AGENTS.md file isn’t the problem. Your lack of AI Agent Evaluations is. — unvalidated context is useless and often harmful
- If agents use your tool, you need evals — for maintainers
- Three Context Eval Methodologies at Tessl — Skill Review, Task and Repo Evals — the three eval surfaces (maps to the talk’s skill / plugin / project tiers)
- Introducing Task Evals: Measure Whether Your Skills Actually Work — baseline vs with-skill, the lift methodology
- Improving your skills with Tessl evals — `tessl skill lint`, `tessl skill review`, `tessl skill eval`
- Evaluate skill quality using scenarios — Tessl Docs
- How to Evaluate AI Agents: An Introduction to Harbor — Terminal-Bench heritage; containerized agent eval
- Anthropic brings evals to skill-creator. Here’s why that’s a big deal — Create / Eval / Improve / Benchmark
- Evaluating context compression in AI agents — structured state beats compressed text
Prompt Caching (Q&A — cost objection)
Agents Referenced
- Claude Code — the agent used in all demos
- Cursor
- OpenAI Codex
- Aider
- Gemini CLI
Vendor Plugin Systems (Slide 31 — “Claude plugins, Codex plugins, equivalents from every vendor”)
- Plugins for Claude Code and Cowork — Anthropic
- anthropics/claude-plugins-official — official Claude Code plugin directory
- Agent Skills — OpenAI Codex — Codex’s skill packaging mechanism (2% context-window cap for the initial skills list)
- AGENTS.md — Codex custom instructions — the rules-file convention Codex reads on session start
Conventional Commits & Commitlint (DEMO 02, R-10)
- Conventional Commits specification — the format R-10 prescribes
- commitlint — the deterministic gate that rejects the agent’s “fix stuff” commit and forces a retry
- Semantic Versioning — the versioning row in the package-management-generalized table
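To make the deterministic gate concrete, here is a minimal commit-msg hook sketch, assuming commitlint and @commitlint/config-conventional are installed via npm; the invocation is standard commitlint usage, but the wiring shown here is illustrative, not DEMO 02’s actual setup.

```sh
#!/usr/bin/env sh
# .git/hooks/commit-msg: the gate that bounces "fix stuff" and forces a retry (R-10).
# git passes the path of the commit-message file as $1; commitlint validates it
# against the Conventional Commits rules in @commitlint/config-conventional.
npx --no-install commitlint --edit "$1"
```

A local hook can still be skipped with `git commit --no-verify`, which is exactly why principle 4-2 pushes the same check into CI or server-side enforcement.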
The Three Primitives
| Primitive | What it is | When to reach for it |
|---|---|---|
| Skill | Labeled procedure the agent invokes by name — lazily loaded, versioned, packaged | Reusable workflow that requires reasoning |
| Rule | Always-loaded text constraint — invariant or pushback gate | Always-true / never-do behavioral constraint |
| Script | Deterministic transformation — same input, same output | Computation or check that must not be left to agent judgment |
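To ground the script row: a minimal sketch of what a deterministic check in the spirit of kotlin-tutor’s `verify-no-junit-assertions` could look like. The real script isn’t reproduced in these materials, so the grep pattern and paths below are assumptions.

```sh
#!/usr/bin/env bash
# Illustrative "script" primitive: same input, same output, no agent judgment.
# A hypothetical take on verify-no-junit-assertions, not the plugin's actual code.
set -euo pipefail
# Flag any Kotlin test file that still imports JUnit's assertion API
# (rule K-4, kotest-over-junit, wants Kotest matchers instead).
matches=$(grep -rl 'import org\.junit\.jupiter\.api\.Assertions' src/test/kotlin || true)
if [ -n "$matches" ]; then
  echo "JUnit assertions found; rule K-4 (kotest-over-junit) says use Kotest matchers:"
  echo "$matches"
  exit 1
fi
echo "OK: no JUnit assertions under src/test/kotlin"
```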
Two Kinds of Rules
| Flavor | Shape | Example |
|---|---|---|
| Invariant | “Never X / Always Y” | K-1 `prefer-val` — properties default to immutable |
| Pushback gate | “Before doing X, stop and ask” | R-7 — ask before removing any guard, feature flag, or rate-limiter |
Pushback gates belong on work that’s hard to reverse. Cheap reversible work flows freely. Over-scope them and your day turns into “yes proceed, yes proceed, yes proceed.”
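A concrete sketch of the two flavors as rules-file entries; the wording below is illustrative, not copied from the coding-policy plugin.

```text
# K-1 prefer-val (invariant)
Declare properties with `val` unless mutation is required.
Never switch `val` to `var` just to silence a compiler error.

# R-7 guard-removal (pushback gate)
Before removing or weakening any guard, feature flag, or rate-limiter:
stop, state what the guard protects and why removal is safe,
and wait for explicit operator confirmation.
```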
The Eval Hierarchy
| Tier | Maps to | What it catches |
|---|---|---|
| Skill eval | Unit test | Does one skill produce the expected shape on its scenarios? |
| Plugin eval | Integration test | Do the skills + rules + scripts inside one plugin compose without contradicting each other? |
| Project eval | System test | With multiple plugins installed, do the rules conflict across plugins? |
Watch out for bleeding (criterion value appearing verbatim in the task description) and leaking (criteria referencing tile-internal implementation details). Always run a baseline — attainment without lift is a vanity metric.
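The baseline discipline is mechanical enough to script. A minimal sketch, assuming a hypothetical harness `run_eval` that runs the same fixed tasks and prints one PASS/FAIL line per scenario; the harness name and flags are invented for illustration (in practice this is `tessl skill eval`’s job).

```sh
#!/usr/bin/env bash
# Lift, not attainment: run identical fixed tasks with and without the skill.
# run_eval and its flags are hypothetical stand-ins for your eval harness.
baseline=$(./run_eval --skills=off | grep -c '^PASS' || true)
withskill=$(./run_eval --skills=on | grep -c '^PASS' || true)
echo "baseline: $baseline  with-skill: $withskill  lift: $((withskill - baseline))"
```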
Operating Principles
- 3-1 — Package your context. Skills, rules, scripts. Versioned, tested, distributed.
- 4-1 — When in doubt, assume it's not a judgment call. Default toward the deterministic primitives.
- 4-2 (hook principle) — When the agent has motive to skip the script, move the script outside the agent’s control (pre-commit hook, CI gate, server-side enforcement).
- 4-3 (the wisdom prayer) — Grant me the rules for what’s always true, the scripts for what mustn’t be judged, the skills for what must be — and the wisdom to know which is which.
- 5-1 — You’re either measuring or you’re waving. Stop waving.
- 6-1 — If your agent keeps agreeing with you even when you’re wrong — the problem isn’t the agent.
Five Monday Actions
- Write a skill — pick one procedure you keep rewriting in chat
- Add a rule — one invariant or one pushback gate
- Write an eval — fixed task, deterministic criteria, comparable across runs
- Package it — `tile.json`, version, CHANGELOG (manifest sketch below)
- Share it with one teammate — registries exist for a reason
If you do all five by next Friday, you’ll have replaced about 30% of your team’s vibes-based evaluation with structure.
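For the packaging step, a hypothetical manifest sketch; the field names are illustrative guesses, not Tessl’s actual `tile.json` schema. Check the registry docs for the real format.

```json
{
  "_note": "illustrative fields only, not the real tile.json schema",
  "name": "jbaruch/kotlin-tutor",
  "version": "0.1.0",
  "description": "Teaching plugin for idiomatic Kotlin",
  "skills": ["skills/kotlinify-tests"],
  "rules": ["rules/K-1-prefer-val.md"],
  "scripts": ["scripts/verify-no-junit-assertions.sh"]
}
```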
Speaker
- Baruch Sadogursky — @jbaruch — Context Sommelier (self-certified)