You're Absolutely Right (and Other Lies My AI Told Me)
Abstract
Your agent agrees with you. Every time. It removes tests “for clarity.” It changes `floor` to `round` because that’s “more accurate.” It says “you’re absolutely right” — and then ships the bug you were about to talk yourself out of. The failure has two roots. The agent can’t see the constraints that would let it push back (your company’s rules aren’t in any training set, and they never will be). And the agent has no rule empowering pushback when the operator is wrong (RLHF trained it to agree by default). Bigger models won’t fix this. The two largest labs in the world keep redirecting you to write a rules file — for the half they can fix and the half they can’t. This talk treats context as an engineering problem: packaged, versioned, distributed, evaluated. Three primitives — skills (reusable workflows), rules (always-loaded invariants and pushback gates), scripts (deterministic transformations) — composed into context artifacts the agent installs like dependencies. The chapter on rules-vs-scripts-vs-skills is a decision frame, not a taxonomy: when a rule beats a prompt, when a script beats a rule, when a hook beats a script. The chapter on evaluation kills the vibes-eval antipattern and gives you skill / plugin / project tiers that map onto unit / integration / system tests you already know how to write. And then — the meta reveal. The talk itself is a plugin. Every prescription on stage is a real rule in jbaruch/coding-policy on the Tessl registry. The medium is the message.
Resources
The Receipts (Cold Open & Inoculation)
- Anthropic Claude Code Issue #3382 — “you’re absolutely right” — closed as “completed” August 2025, no fix commit, recommended workaround: write a rules file
- Sycophancy in GPT-4o: What happened and what we’re doing about it — OpenAI — April 2025, five-day rollback after the model praised a “shit on a stick” business idea and validated a user who’d stopped his meds
- Expanding on what we missed with sycophancy — OpenAI — follow-up postmortem: short-term thumbs-up training drowned out the anti-sycophancy reward signal
Public Catastrophes (Slide 15 — context starvation in production)
- Replit AI deletes Jason Lemkin’s production database — The Register — July 2025, “told it not to ELEVEN TIMES IN ALL CAPS”, agent admitted “catastrophic error in judgment”, 1,206 executives + 1,196 companies wiped
- Replit CEO: What really happened when AI agent wiped Lemkin’s database — Fast Company
- Cursor + Claude Opus 4.6 wipes PocketOS production database AND backups in 9 seconds — The Register — April 2026, agent’s own postmortem opened with “NEVER FUCKING GUESS!” while describing itself guessing
- ‘I violated every principle I was given’ — Cursor agent deletes PocketOS database — Fast Company
- Google Gemini CLI deletes user’s files, confesses “gross incompetence” — Slashdot — July 2025, “I have failed you completely and catastrophically. My gross incompetence…”
The Plugins
- jbaruch/coding-policy on Tessl Registry — the talk itself, packaged. Every prescriptive claim in chapters 3, 4, and 5 is a real rule in this plugin. Sixteen rules, skills with delegated scripts, eval scenarios, versioned, peer-reviewed.
- Install: `tessl install jbaruch/coding-policy`
- jbaruch/kotlin-tutor on Tessl Registry — the running example throughout the talk: a teaching plugin for idiomatic Kotlin. Skill `kotlinify-tests`, rules K-1..K-6 (`prefer-val`, `nullable-question-mark`, `use-data-class`, `kotest-over-junit`, `prefer-stdlib-scope`, `extension-over-util`), script `verify-no-junit-assertions`.
- Install: `tessl install jbaruch/kotlin-tutor`
Context Engineering
- Tessl — Agent Enablement Platform — versioned, distributed plugins for AI agents
- Tessl Registry — the package manager for agent skills
- Agent Skills Standard — the modern de facto standard for giving agents instructions (yes skills, no prompts)
- Anthropic Agent Skills announcement
- Model Context Protocol (MCP) — plumbing for tools and context
- Context Engineering — Tobi Lütke — “the term I much prefer over prompt engineering”
Tessl Blog — Context Engineering
- The Context Development Lifecycle (CDLC): Better Context for AI Coding Agents — context as an engineering artifact: generate, distribute, test, observe
- The Context Flywheel: Why the Best AI Coding Teams Will Win on Context — better context produces better signals produces better context; context doesn’t commoditize
- Context Maturity for AI Coding Teams — three dimensions maturing together (Agents & Tools / Context / People & Organization)
- CI/CD for Context in Agentic Coding: Same Pipeline, Different Rules — evals are to context what tests are to code
- Context-Bench: Benchmarking AI’s Context Engineering Proficiency — how efficiently a model manages memory, revisits prior context, and what it costs
- Making Claude Good at Go using Context Engineering with Tessl — applied example
Tessl Blog — Skills
- Announcing Skills on Tessl: the package manager for agent skills — skills as software with a lifecycle (versioned, tested, reusable, composable)
- What Are Agent Skills? (And Why You’ll Never Want to Push Code Without One Again)
- My Coding Agent Needed a Package Manager for Its Own Brain (And I Gave It One Using a Skills Registry)
- Do Agent Skills Actually Help? A Controlled Experiment — the lift-not-attainment proof
- Anthropic, OpenAI, or Cursor model for your agent skills? 7 learnings from running 880 evals (including Opus 4.7) — skills lift every configuration tested
- Best Agent Skills for AI Code Review: 8 Evaluated Skills
- Evaluating Kimi 2.5 vs Kimi 2.6: What happens to agent skills when the model gets smarter? — skills still matter as models get better
Tessl Blog — Evals
- Your AGENTS.md file isn’t the problem. Your lack of AI Agent Evaluations is. — unvalidated context is useless and often harmful
- If agents use your tool, you need evals — for maintainers
- Three Context Eval Methodologies at Tessl — Skill Review, Task and Repo Evals — the three eval surfaces (maps to the talk’s skill / plugin / project tiers)
- Introducing Task Evals: Measure Whether Your Skills Actually Work — baseline vs with-skill, the lift methodology
- Improving your skills with Tessl evals — `tessl skill lint`, `tessl skill review`, `tessl skill eval`
- Evaluate skill quality using scenarios — Tessl Docs
- How to Evaluate AI Agents: An Introduction to Harbor — Terminal-Bench heritage; containerized agent eval
- Anthropic brings evals to skill-creator. Here’s why that’s a big deal — Create / Eval / Improve / Benchmark
- Evaluating context compression in AI agents — structured state beats compressed text
Prompt Caching (Q&A — cost objection)
Agents Referenced
- Claude Code — the agent used in all demos
- Cursor
- OpenAI Codex
- Aider
- Gemini CLI
Vendor Plugin Systems (Slide 31 — “Claude plugins, Codex plugins, equivalents from every vendor”)
- Plugins for Claude Code and Cowork — Anthropic
- anthropics/claude-plugins-official — official Claude Code plugin directory
- Agent Skills — OpenAI Codex — Codex’s skill packaging mechanism (2% context-window cap for the initial skills list)
- AGENTS.md — Codex custom instructions — the rules-file convention Codex reads on session start
Conventional Commits & Commitlint (DEMO 02, R-10)
- Conventional Commits specification — the format R-10 prescribes
- commitlint — the deterministic gate that rejects the agent’s “fix stuff” commit and forces a retry
- Semantic Versioning — the versioning row in the package-management-generalized table
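To make the deterministic gate concrete, here is a minimal commit-msg hook sketch, assuming commitlint and @commitlint/config-conventional are installed via npm; the invocation is standard commitlint usage, but the wiring shown here is illustrative, not DEMO 02’s actual setup.

```sh
#!/usr/bin/env sh
# .git/hooks/commit-msg: the gate that bounces "fix stuff" and forces a retry (R-10).
# git passes the path of the commit-message file as $1; commitlint validates it
# against the Conventional Commits rules in @commitlint/config-conventional.
npx --no-install commitlint --edit "$1"
```

A local hook can still be skipped with `git commit --no-verify`, which is exactly why principle 4-2 pushes the same check into CI or server-side enforcement.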
The Three Primitives
| Primitive | What it is | When to reach for it |
|---|---|---|
| Skill | Labeled procedure the agent invokes by name — lazily loaded, versioned, packaged | Reusable workflow that requires reasoning |
| Rule | Always-loaded text constraint — invariant or pushback gate | Always-true / never-do behavioral constraint |
| Script | Deterministic transformation — same input, same output | Computation or check that must not be left to agent judgment |
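To ground the script row: a minimal sketch of what a deterministic check in the spirit of kotlin-tutor’s `verify-no-junit-assertions` could look like. The real script isn’t reproduced in these materials, so the grep pattern and paths below are assumptions.

```sh
#!/usr/bin/env bash
# Illustrative "script" primitive: same input, same output, no agent judgment.
# A hypothetical take on verify-no-junit-assertions, not the plugin's actual code.
set -euo pipefail
# Flag any Kotlin test file that still imports JUnit's assertion API
# (rule K-4, kotest-over-junit, wants Kotest matchers instead).
matches=$(grep -rl 'import org\.junit\.jupiter\.api\.Assertions' src/test/kotlin || true)
if [ -n "$matches" ]; then
  echo "JUnit assertions found; rule K-4 (kotest-over-junit) says use Kotest matchers:"
  echo "$matches"
  exit 1
fi
echo "OK: no JUnit assertions under src/test/kotlin"
```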
Two Kinds of Rules
| Flavor | Shape | Example |
|---|---|---|
| Invariant | “Never X / Always Y” | K-1 `prefer-val` — properties default to immutable |
| Pushback gate | “Before doing X, stop and ask” | R-7 — ask before removing any guard, feature flag, or rate-limiter |
Pushback gates belong on work that’s hard to reverse. Cheap reversible work flows freely. Over-scope them and your day turns into “yes proceed, yes proceed, yes proceed.”
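A concrete sketch of the two flavors as rules-file entries; the wording below is illustrative, not copied from the coding-policy plugin.

```text
# K-1 prefer-val (invariant)
Declare properties with `val` unless mutation is required.
Never switch `val` to `var` just to silence a compiler error.

# R-7 guard-removal (pushback gate)
Before removing or weakening any guard, feature flag, or rate-limiter:
stop, state what the guard protects and why removal is safe,
and wait for explicit operator confirmation.
```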
The Eval Hierarchy
| Tier | Maps to | What it catches |
|---|---|---|
| Skill eval | Unit test | Does one skill produce the expected shape on its scenarios? |
| Plugin eval | Integration test | Do the skills + rules + scripts inside one plugin compose without contradicting each other? |
| Project eval | System test | With multiple plugins installed, do the rules conflict across plugins? |
Watch out for bleeding (criterion value appearing verbatim in the task description) and leaking (criteria referencing tile-internal implementation details). Always run a baseline — attainment without lift is a vanity metric.
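The baseline discipline is mechanical enough to script. A minimal sketch, assuming a hypothetical harness `run_eval` that runs the same fixed tasks and prints one PASS/FAIL line per scenario; the harness name and flags are invented for illustration (in practice this is `tessl skill eval`’s job).

```sh
#!/usr/bin/env bash
# Lift, not attainment: run identical fixed tasks with and without the skill.
# run_eval and its flags are hypothetical stand-ins for your eval harness.
baseline=$(./run_eval --skills=off | grep -c '^PASS' || true)
withskill=$(./run_eval --skills=on | grep -c '^PASS' || true)
echo "baseline: $baseline  with-skill: $withskill  lift: $((withskill - baseline))"
```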
Operating Principles
- 3-1 — Package your context. Skills, rules, scripts. Versioned, tested, distributed.
- 4-1 — When in doubt, assume it's not a judgment call. Default toward the deterministic primitives.
- 4-2 (hook principle) — When the agent has motive to skip the script, move the script outside the agent’s control (pre-commit hook, CI gate, server-side enforcement).
- 4-3 (the wisdom prayer) — Grant me the rules for what’s always true, the scripts for what mustn’t be judged, the skills for what must be — and the wisdom to know which is which.
- 5-1 — You’re either measuring or you’re waving. Stop waving.
- 6-1 — If your agent keeps agreeing with you even when you’re wrong — the problem isn’t the agent.
Five Monday Actions
- Write a skill — pick one procedure you keep rewriting in chat
- Add a rule — one invariant or one pushback gate
- Write an eval — fixed task, deterministic criteria, comparable across runs
- Package it — `tile.json`, version, CHANGELOG (manifest sketch below)
- Share it with one teammate — registries exist for a reason
If you do all five by next Friday, you’ll have replaced about 30% of your team’s vibes-based evaluation with structure.
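For the packaging step, a hypothetical manifest sketch; the field names are illustrative guesses, not Tessl’s actual `tile.json` schema. Check the registry docs for the real format.

```json
{
  "_note": "illustrative fields only, not the real tile.json schema",
  "name": "jbaruch/kotlin-tutor",
  "version": "0.1.0",
  "description": "Teaching plugin for idiomatic Kotlin",
  "skills": ["skills/kotlinify-tests"],
  "rules": ["rules/K-1-prefer-val.md"],
  "scripts": ["scripts/verify-no-junit-assertions.sh"]
}
```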
Speaker
- Baruch Sadogursky — @jbaruch — Context Sommelier (self-certified)