You're Absolutely Right (and Other Lies My AI Told Me)

GeeCON 2026 Video Coming Soon
A presentation at GeeCON in May 2026 in Kraków, Poland by Baruch Sadogursky

Abstract

Your agent agrees with you. Every time. It removes tests “for clarity.” It changes floor to round because that’s “more accurate.” It says “you’re absolutely right” — and then ships the bug you were about to talk yourself out of. The failure has two missing pieces. The agent can’t see the constraints that would let it push back (your company’s rules aren’t in any training set, and they never will be). And the agent has no rule empowering pushback when the operator is wrong (RLHF trained it to agree by default). Bigger models won’t fix this. The two largest labs in the world keep redirecting you to write a rules file — for the half they can fix and the half they can’t. This talk treats context as an engineering problem: packaged, versioned, distributed, evaluated. Three primitives — skills (reusable workflows), rules (always-loaded invariants and pushback gates), scripts (deterministic transformations) — composed into context artifacts the agent installs like dependencies. The chapter on rules-vs-scripts-vs-skills is a decision frame, not a taxonomy: when a rule beats a prompt, when a script beats a rule, when a hook beats a script. The chapter on evaluation kills the vibes-eval antipattern and gives you skill / plugin / project tiers that map onto unit / integration / system tests you already know how to write. And then — the meta reveal. The talk itself is a plugin. Every prescription on stage is a real rule in jbaruch/coding-policy on the Tessl registry. The medium is the message.

Resources

The Receipts (Cold Open & Inoculation)

Public Catastrophes (Slide 15 — context starvation in production)

The Plugins

  • jbaruch/coding-policy on Tessl Registrythe talk itself, packaged. Every prescriptive claim in chapters 3, 4, and 5 is a real rule in this plugin. Sixteen rules, skills with delegated scripts, eval scenarios, versioned, peer-reviewed.
    • Install: tessl install jbaruch/coding-policy
  • jbaruch/kotlin-tutor on Tessl Registry — the running example throughout the talk: a teaching plugin for idiomatic Kotlin. Skill kotlinify-tests, rules K-1..K-6 (prefer-val, nullable-question-mark, use-data-class, kotest-over-junit, prefer-stdlib-scope, extension-over-util), script verify-no-junit-assertions.
    • Install: tessl install jbaruch/kotlin-tutor

Context Engineering

Tessl Blog — Context Engineering

Tessl Blog — Skills

Tessl Blog — Evals

Prompt Caching (Q&A — cost objection)

Agents Referenced

Vendor Plugin Systems (Slide 31 — “Claude plugins, Codex plugins, equivalents from every vendor”)

Conventional Commits & Commitlint (DEMO 02, R-10)

The Three Primitives

Primitive What it is When to reach for it
Skill Labeled procedure the agent invokes by name — lazily loaded, versioned, packaged Reusable workflow that requires reasoning
Rule Always-loaded text constraint — invariant or pushback gate Always-true / never-do behavioral constraint
Script Deterministic transformation — same input, same output Computation or check that must not be left to agent judgment

Two Kinds of Rules

Flavor Shape Example
Invariant “Never X / Always Y” K-1 prefer-val — properties default to immutable
Pushback gate “Before doing X, stop and ask” R-7 — ask before removing any guard, feature flag, or rate-limiter

Pushback gates belong on work that’s hard to reverse. Cheap reversible work flows freely. Over-scope them and your day turns into “yes proceed, yes proceed, yes proceed.”

The Eval Hierarchy

Tier Maps to What it catches
Skill eval Unit test Does one skill produce the expected shape on its scenarios?
Plugin eval Integration test Do the skills + rules + scripts inside one plugin compose without contradicting each other?
Project eval System test With multiple plugins installed, do the rules conflict across plugins?

Watch out for bleeding (criterion value appearing verbatim in the task description) and leaking (criteria referencing tile-internal implementation details). Always run a baseline — attainment without lift is a vanity metric.

Operating Principles

  • 3-1 — Package your context. Skills, rules, scripts. Versioned, tested, distributed.
  • 4-1 — When in doubt, assume not judgment. Default toward the deterministic primitives.
  • 4-2 (hook principle) — When the agent has motive to skip the script, move the script outside the agent’s control (pre-commit hook, CI gate, server-side enforcement).
  • 4-3 (the wisdom prayer)Grant me the rules for what’s always true, the scripts for what mustn’t be judged, the skills for what must be — and the wisdom to know which is which.
  • 5-1 — You’re either measuring or you’re waving. Stop waving.
  • 6-1 — If your agent keeps agreeing with you even when you’re wrong — the problem isn’t the agent.

Three Monday Actions

  1. Write a skill — pick one procedure you keep rewriting in chat
  2. Add a rule — one invariant or one pushback gate
  3. Write an eval — fixed task, deterministic criteria, comparable across runs
  4. Package ittile.json, version, CHANGELOG
  5. Share it with one teammate — registries exist for a reason

If you do all five by next Friday, you’ll have replaced about 30% of your team’s vibes-based evaluation with structure.

Speaker