
$12,000 a year for an AI that didn't know our code

Ben Schroth · April 20, 2026 · 10 min read


Lygos was paying CodeRabbit over $1,000 a month. We also evaluated Greptile and Codex. Three real tools with real engineering behind them, all failing the same test: they don't know our code, and they don't give us the controls to teach them.

The reviews were coherent. They were also shallow in the ways that matter. The bot didn't know our loan state machine has invariants worth flagging. Didn't know we use Decimal.js for financial precision and that raw JS arithmetic on a money value should be a hair-on-fire moment. Had no idea which React Native screens are frozen by design and which are fair game to refactor. Every review read like a first-year intern with Google and no access to us.

That was the part that cost $12,000 a year. The harder part was control.

Three control problems

Flagging. Feedback. Memory.

You can't tell CodeRabbit to always flag changes to your DLC signing flow and never comment on import ordering. The dashboards let you toggle broad categories. They don't let you say "flag this pattern, always; never flag that pattern, ever." The pattern you care about is the one nobody on their side has written a knob for.

You can't audit what Greptile learned from your team's accept/dismiss signal. Feedback disappears into their product roadmap. Maybe your pattern gets added to their model in six months. Probably it doesn't.

You can't version-control Codex's understanding of your state machine. Institutional memory (the decision two years ago to use Objection.js instead of Prisma, the reason the onboarding flow is frozen, the convention that financial values never touch raw JS arithmetic) lives in their database. You rent access.

That's a bad trade at $12,000 a year. It's a worse trade at any price if you care about the reviewer getting better over time.

Why Hermes

We already run Hermes for other agent work. It's a framework we own, our code, our infra, no vendor in the middle. It has a webhook platform, a skill system, and a GitHub App mode that mints scoped installation tokens per event. All the primitives were already there. The only question was whether we could wire them into a code reviewer that earned its opinions.

The answer was yes, and it took a weekend. What follows is the full setup. Copy it, adapt it, swap in your own conventions. The system is ours to shape because we built it. That's the whole point.

Architecture

Four moving parts.

  1. A GitHub App that acts as the bot identity
  2. A Hermes gateway with the webhook platform enabled
  3. Webhook subscriptions that map GitHub events to agent runs
  4. Skills that tell the agent what to actually do

Webhooks from GitHub hit the gateway. The gateway fans them out to subscriptions. Each subscription loads a stack of skills and dispatches an agent run. The agent reviews the PR, posts comments under the bot's identity, exits. That's the whole story.

Part 1: register the GitHub App

GitHub Apps are the right abstraction here. Not a personal access token. Not a machine user. An App gives the bot its own identity with a [bot] suffix, its own avatar, scoped permissions per install, short-lived credentials, and first-class Check Run support.

Go to https://github.com/organizations/YOURORG/settings/apps/new for org-owned, or https://github.com/settings/apps/new for personal.

The form asks for too much stuff. Most of it is for OAuth user-login flows you don't need. Skip anything labeled "callback URL," "user authorization," "device flow," or "setup URL." Those are for Apps that log your users in. You're building a bot that acts on its own behalf.

What actually matters:

  • Webhook URL: point it at https://YOUR-HERMES-HOST/webhooks/app/YOUR-APP-SLUG. The /app/ segment matters. That's the fan-out endpoint.
  • Webhook secret: openssl rand -hex 32, save it both in GitHub and in your Hermes config.
  • Permissions: Contents read, Pull requests read+write, Issues read+write, Checks read+write, Metadata read. Nothing else for now.
  • Events: pull_request, pull_request_review, issue_comment, check_run. Skip everything else.
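Hermes validates that webhook secret for you, but it helps to know what the check actually is: GitHub computes an HMAC-SHA256 of the raw request body with your secret and sends it in the X-Hub-Signature-256 header as `sha256=<hexdigest>`. A minimal sketch of the verification, using only the standard library:

```python
import hashlib
import hmac

def verify_github_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Check GitHub's X-Hub-Signature-256 header against the raw request body."""
    expected = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # compare_digest does a constant-time comparison to avoid leaking timing info
    return hmac.compare_digest(expected, signature_header)
```

The important operational detail: the HMAC is computed over the raw bytes of the body, so anything that re-serializes the JSON before verification will break the check.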

After creation, scroll to "Private keys" and generate one. You get a .pem file. Save it to ~/.hermes/secrets/yourapp.pem with chmod 600.

Install the App on the repos you care about. Write down the App ID (top of the settings page) and the Installation ID (tail of the redirect URL after install). You need both.

Part 2: wire up the gateway

Enable the webhook platform in ~/.hermes/config.yaml:

platforms:
  webhook:
    enabled: true
    extra:
      host: "0.0.0.0"
      port: 8644
      github_apps:
        yourapp:
          app_id: 1234567
          private_key_path: ~/.hermes/secrets/yourapp.pem
          webhook_secret: ${GITHUB_APP_WEBHOOK_SECRET}

Restart the gateway. The github_apps block loads on boot. It isn't hot-reloaded like the subscription file.

Verify the App is reachable:

hermes github-app list
hermes github-app installations yourapp
hermes github-app token yourapp   # prints a fresh installation token

If those three commands work, your JWT signing is correct, the private key is loadable, and the Installation ID is resolvable. That's the auth pipeline proven end to end before you touch webhooks.
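What those commands exercise is GitHub's App auth handshake: sign a short-lived JWT with the App's RS256 private key, then exchange it at POST /app/installations/{installation_id}/access_tokens for an installation token. A sketch of the claims that JWT carries (the signing and the exchange call are omitted; the exact backdating offset is an assumption, but GitHub does cap App JWT lifetimes at ten minutes):

```python
def app_jwt_claims(app_id: int, now: int) -> dict:
    """Claims for a GitHub App JWT. The gateway signs these with the App's
    private key, then exchanges the signed JWT for an installation token via
    POST /app/installations/{installation_id}/access_tokens."""
    return {
        "iat": now - 60,       # backdated to absorb clock drift between hosts
        "exp": now + 9 * 60,   # GitHub rejects App JWTs valid longer than 10 minutes
        "iss": app_id,         # the App ID from the top of the settings page
    }
```

If `hermes github-app token` fails, it is almost always one of these three claims (wrong App ID, clock skew, expired window) or an unreadable private key file.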

Part 3: build the three-layer skill stack

This is the part that matters. Skills are how you teach the agent what to do and what you care about.

Three layers. A mechanics skill that handles GitHub plumbing. A per-repo context skill that holds your conventions. A feedback skill that patches the context skill based on reactions.

The mechanics skill

Create ~/.hermes/skills/github/github-app-review/SKILL.md. This one is repo-agnostic. It tells the agent how to structure a review: one inline comment per finding via POST /repos/{owner}/{repo}/pulls/{pr}/comments, one top-level summary returned as the agent's final response, never use gh pr review (it breaks on self-PRs and creates PENDING review states nobody sees).

The output contract is non-negotiable and goes at the top of the skill:

## Required output
 
1. Check out the PR locally, read the full diff plus surrounding
   context in each changed file.
2. Apply the review checklist: correctness, security, code quality,
   testing, performance, documentation.
3. Post one inline review comment on GitHub for each finding.
   Use severity emojis: 🔴 critical, ⚠️ warning, 💡 suggestion.
4. Return a top-level summary as your final response.
   The webhook will deliver it as a PR comment.
5. Do NOT use gh pr review --approve / --request-changes / --comment.

That last rule matters. Early versions of our setup called gh pr review and lost to GitHub's restriction on reviewing your own PRs (our bot opens some automated PRs). The POST /pulls/{pr}/comments path works regardless of authorship.
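The REST call behind that rule, sketched as a request builder. The fields are the ones GitHub's pull request review comments endpoint actually requires (body, commit_id, path, and a line/side pair); the function itself is illustrative, not part of the skill:

```python
def build_inline_comment(owner: str, repo: str, pr: int, *, body: str,
                         commit_id: str, path: str, line: int) -> tuple[str, dict]:
    """Build the URL and JSON payload for
    POST /repos/{owner}/{repo}/pulls/{pr}/comments.
    Unlike `gh pr review`, this path works regardless of PR authorship."""
    url = f"https://api.github.com/repos/{owner}/{repo}/pulls/{pr}/comments"
    payload = {
        "body": body,
        "commit_id": commit_id,  # the head SHA the comment anchors to
        "path": path,            # file path relative to the repo root
        "line": line,            # line number in the diff
        "side": "RIGHT",         # comment on the new version, not the old
    }
    return url, payload
```

Each finding becomes one such POST, authenticated with the installation token the gateway mints per event.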

The per-repo context skill

Create ~/.hermes/skills/github-repos/YOUR-REPO-review/SKILL.md. This is where your opinions live.

Frontmatter:

---
name: your-repo-review
description: Review context for YOUR-ORG/YOUR-REPO
version: 1.0.0
metadata:
  hermes:
    tags: [github, code-review]
    related_skills: [github-app-review, github-app-review-feedback]
---

Then five sections:

  1. Repo overview: stack, purpose, criticality. Is this production code handling money, or a marketing site?
  2. Things to flag, organized by domain. For our backend: DLC lifecycle invariants, loan state transitions, financial precision with Decimal.js, Fordefi signing patterns, auth scoping on organizationId. For our mobile app: protected screens, Redux slice conventions, Fordefi QR flow, React Native performance patterns.
  3. Things NOT to flag: the stylistic calls you've already made. We use Biome not ESLint, so don't flag ESLint-style complaints. We use const arrays not TS enums on purpose, so don't flag that. Every codebase has these. Write them down.
  4. Reinforce list: empty stub. Patterns the team has 👍'd. Auto-maintained by the feedback skill.
  5. Suppress list: empty stub. Patterns the team has 👎'd. Auto-maintained by the feedback skill.

The reinforce/suppress stubs are load-bearing. That's where the feedback loop writes.

The feedback skill

Create ~/.hermes/skills/github/github-app-review-feedback/SKILL.md. Fires after a PR merges. Reads the bot's inline comments, checks reactions, patches the per-repo skill.

The critical part is the reconciliation logic. For each inline comment the bot posted, look at reactions:

  • ๐Ÿ‘ โ†’ extract the pattern, append to reinforce list in the per-repo skill
  • ๐Ÿ‘Ž โ†’ extract the pattern, append to suppress list
  • No reaction โ†’ ignore (neutral signal)

Also write to a JSONL ledger at ~/.hermes/review-feedback/YOUR-REPO.jsonl so you have persistent history beyond what fits in the skill file.
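The reconciliation is a fold over (comment, reactions) pairs. A sketch, assuming a hypothetical comment shape of `{"pattern": ..., "+1": n, "-1": n}` and giving 👍 priority when a comment somehow collects both:

```python
import json

def reconcile(comments: list[dict]) -> tuple[list[str], list[str], list[str]]:
    """Sort the bot's inline comments into reinforce/suppress buckets by team
    reaction; no reaction is a neutral signal and is skipped entirely.
    Returns (reinforce, suppress, ledger_lines) where ledger_lines are the
    JSONL records to append to the per-repo feedback ledger."""
    reinforce, suppress, ledger = [], [], []
    for c in comments:
        if c.get("+1", 0) > 0:
            verdict = "reinforce"
            reinforce.append(c["pattern"])
        elif c.get("-1", 0) > 0:
            verdict = "suppress"
            suppress.append(c["pattern"])
        else:
            continue  # neutral: leave no trace in skill or ledger
        ledger.append(json.dumps({"pattern": c["pattern"], "verdict": verdict}))
    return reinforce, suppress, ledger
```

The skill file gets the distilled reinforce/suppress lists; the JSONL ledger keeps every individual judgment, so you can audit or rebuild the lists later.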

The skill's output is a terse ack comment on the merged PR: "Logged 3 reactions. 1 reinforce, 2 suppress. Next review will adapt." That goes on the closed PR so the team sees the loop closed.

Part 4: subscribe the webhooks

Each subscription is one event shape. Don't try to combine opened and synchronize and closed in a single route. The filter engine does scalar equality, not list membership. Three subscriptions per repo is correct.

Review on PR opened:

hermes webhook subscribe yourrepo-review-opened \
  --events pull_request \
  --github-app yourapp \
  --filter "action=opened" \
  --filter "repository.full_name=yourorg/yourrepo" \
  --filter "pull_request.base.ref=dev" \
  --skills github-app-review,yourrepo-review \
  --deliver github_comment \
  --auto-approve

Review on each push to the PR:

Same but --filter "action=synchronize". The mechanics skill knows to diff only the before..after commit range from the synchronize payload so you don't re-flag unchanged code.

Feedback on merge:

hermes webhook subscribe yourrepo-feedback \
  --events pull_request \
  --github-app yourapp \
  --filter "action=closed" \
  --filter "pull_request.merged=true" \
  --filter "repository.full_name=yourorg/yourrepo" \
  --filter "pull_request.base.ref=dev" \
  --skills github-app-review-feedback,yourrepo-review \
  --deliver github_comment \
  --auto-approve

Two things to notice.

--skills takes a comma-separated list. Both skills load into the agent's context in order. The mechanics skill first, the per-repo skill second. The agent stacks them.

--auto-approve bypasses the dangerous-command approval gate for this subscription's unattended runs. Webhook runs have no user to prompt, so without this flag the agent posts an "approval required" message as its response instead of the actual review.

Filter semantics are dot-notation into the event payload. pull_request.base.ref=dev means the agent only fires on PRs targeting your dev branch. pull_request.merged=true means only the merge path, not close-without-merge. Every filter you add is a cost-control lever. The filter engine gates the event before any tokens are spent.
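The semantics fit in a few lines. A sketch of a scalar-equality matcher (an illustration of the behavior described above, not Hermes's actual implementation; string coercion on the compare is an assumption):

```python
def filter_matches(payload: dict, expr: str) -> bool:
    """Evaluate one `path.to.field=value` filter against an event payload.
    Scalar equality only: a missing key or non-dict intermediate fails the match."""
    path, _, expected = expr.partition("=")
    node = payload
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    # compare as lowercased strings so `pull_request.merged=true` matches boolean True
    return str(node).lower() == expected.lower()
```

Walking the dot path and bailing on the first miss is exactly why filters are cheap: a non-matching event is rejected before any model is invoked.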

Part 5: Check Runs for live status

When a webhook fires, the gateway creates a GitHub Check Run on the PR at in_progress before the agent starts. When the agent finishes, the check transitions to success, action_required, neutral, or failure based on what the agent said.

This is what the team sees on the PR: a live status in the Checks tab that reads "Hermes review in progress" and flips to a result when the agent completes. Same mental model as CI. No guessing whether the bot is running.

Conclusion mapping:

  • Summary contains "Verdict: Changes Requested" or 🔴 or "Critical" → action_required
  • Summary contains "LGTM" or "no issues found" → success
  • Everything else → neutral
  • Agent errored or returned nothing → failure

You get this for free when you set --github-app. No extra config.

Part 6: beyond code review

Once the pipeline exists, the agent isn't just for code review. Webhook-event-to-skill-to-delivery is a general shape.

A merge from dev to master triggers a release checklist agent instead of a code review agent. Same webhook mechanics, different skill. You write the skill, you describe the flow.

A push to any branch can trigger a cron-scheduled security scan against the week's diffs. A PR opened with a specific label can trigger a deeper-than-usual architecture review (--filter "pull_request.labels.*.name=needs-arch-review" plus a dedicated skill). A merge to master can kick off a production deployment checklist that verifies migrations, secrets hygiene, and Sentry breadcrumb coverage before the image builds.

All of these are minutes of work once the infrastructure exists. The infrastructure is the same infrastructure you set up to replace CodeRabbit.

The weekend breakdown

  • Register the App: 15 minutes
  • Wire up the gateway: 30 minutes
  • Write the mechanics skill: 1 hour
  • Write the first per-repo context skill: 2 hours if you have conventions docs to pull from. Most of the time is organizing what you already know.
  • Write the feedback skill: 1 hour
  • Subscribe the webhooks: 15 minutes
  • Test on a live PR and iterate: 30 minutes

A full Saturday. Half a Sunday onboarding the next teammate.

$12,000 a year was paying for a reviewer that didn't know our code. The weekend paid for a reviewer that does, and one that gets sharper every time the team pushes a thumb up or down on what it said.

Your reviewer should know your code. Most engineering orgs are stuck with reviewers that don't. The time investment is smaller than you think. The control you get on the other side is the whole point.
