Guide: production-ready open-source AI agents, 2026
Production-ready is a property of your repo, not a property of the framework you picked.
Every guide on production-ready open-source AI agents in 2026 ranks LangGraph, CrewAI, AutoGen, Pydantic AI, and AutoGPT by GitHub stars and monthly downloads. None of those counts answers the only question a CTO actually has on Monday morning: is the agent we are running in production today actually production-ready, or are we one bus accident away from a stuck pilot? The answer is in your repo, in 7 files with exact paths, measurable in 5 minutes by a shell script that has no dependencies beyond git and grep. This guide is the rubric, the inspection script, and the read on what each finding actually means for the on-call rotation.
Framework-neutral. The same rubric passes on Pydantic AI, LangGraph, custom orchestration, and a multi-model DAG.
Monetizy.ai / Upstate Remedial / OpenLaw / PriceFox / OpenArt
What every other guide is grading
Open any of the dozen guides on this topic that landed in the last quarter and you will see the same shape. A table of frameworks: LangGraph, CrewAI, AutoGen, Pydantic AI, OpenAI Agents SDK, AG2, AutoGPT. A column for GitHub stars. A column for monthly downloads. A vague checklist of production traits in prose: state persistence, OTEL support, human-in-the-loop, structured outputs. A closing recommendation that boils down to LangGraph plus LangSmith if you can stomach a 4 to 8 week ramp, CrewAI if you cannot.
That round-up is useful as a reading list. It does not answer the question your board is going to ask in the next QBR. The question is not whether LangGraph supports checkpointing. The question is whether the deployment that your team actually shipped has a rubric file, a case set, a CI gate, a runbook, a failure playbook, and an architecture doc with at least one human commit on each in the last 30 days. A framework cannot ship those files for you. An engineer does. And the framework comparison post does not measure that engineer's output, because that is your repo, not the framework's.
The 95 percent of pilots that stall in 2026 are not stalling on the wrong import. They are stalling on missing artifacts. The MIT NANDA stat your boardroom quotes is real. The cause is not framework selection. The cause is that production-ready was never defined in a way that returns a yes or a no on the actual deployment.
Two ways to grade an open-source agent in 2026
You read three framework comparison posts on production ready open source AI agents. They each rank LangGraph, CrewAI, AutoGen, Pydantic AI, and AutoGPT by GitHub stars, monthly downloads, and a vague checklist of production traits. You pick the framework with the highest score. Six months later your team has a stuck pilot, and you cannot tell whether the framework was wrong or whether the deployment was incomplete. The reviewer at procurement asks you to define production ready in a sentence, and you cannot.
- Star count and monthly downloads
- Vague production checklist in prose
- No path to a yes or no on your repo
- Procurement still cannot define production-ready
Anchor fact: the seven files plus three git-log invariants
This is the entire rubric. Seven artifacts in your repo, three things that have to be true about the git history of those artifacts, and zero appeals to authority. Each file has a name, a path, and a thing it does that no other file does. If a file is missing, you have a finding. If a file is present but not maintained, you also have a finding. The fourth column on a star-count comparison cannot tell you any of this.
Anchor fact
7 files. 3 invariants. 5 minutes from clone to verdict.
- rubric.yaml at the root names model_primary, model_fallback, and the thresholds your week-0 scoping call agreed to. One line per release.
- eval/cases.yaml is the case set drawn from your real traffic plus an adversarial set the engineer wrote. The only benchmark that predicts incidents on your workload.
- .github/workflows/eval.yml is the CI gate that runs the rubric against the case set on every PR and posts a scorecard.
- agents/contracts.py holds the pydantic input and output models for every tool the agent calls. Schema regressions show up as red CI, not 3am pages.
- ops/runbook.md is what an on-call engineer reads at 3am Saturday. Not a Notion doc. A file in your repo, with a git history, that the rotation links to.
- ops/failure_playbook.md names the primary, the fallback, the fallback-of-fallback, and the trigger conditions. Updated in the same PR every time model_primary changes.
- ARCHITECTURE.md at the root is a page or two of plain prose: what the agent decides, what it does not decide, where retrieval lives, where deterministic checks override the LLM. The thing a new hire reads on day one without having to ask anyone.
The three invariants on top of those seven files: each file has a human commit author in the last 30 days, model_primary appears exactly once in rubric.yaml, and no vendor-attached runtime strings appear in pyproject.toml, package.json, or go.mod.
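Each invariant reduces to a single command against the checkout. Minimal sketches, assuming the paths above and treating any author email that does not look like a bot account as human:

```bash
# 1. At least one human commit on the artifact in the last 30 days
#    (the bot filter is a heuristic; extend it with your own CI accounts).
git log --since='30 days ago' --pretty=format:'%ae' -- rubric.yaml |
  grep -v -E 'dependabot|renovate|\[bot\]' | grep -q . && echo "maintained"

# 2. model_primary appears exactly once in rubric.yaml.
[ "$(grep -c '^model_primary:' rubric.yaml)" -eq 1 ] && echo "one-line model swap"

# 3. No vendor-attached runtime strings in any dependency manifest that exists.
#    Empty output is a pass.
for m in pyproject.toml package.json go.mod; do
  [ -f "$m" ] && grep -E -n 'platform_license_key|vendor_runtime|lock_in_token' "$m"
done
```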
One card per file, one job per card
A view of the seven artifacts as a single grid. The first card is the one most stuck pilots are missing first. The last is the one teams most often skip and regret three releases later.
rubric.yaml
The one file that defines what correct means for your workload. model_primary, model_fallback, rubric_min_score, ragas thresholds, max_per_case_regression. A new release is a one-line edit. If this file does not exist on main, your agent is not production-ready, regardless of whose framework you used.
eval/cases.yaml
The 80 to 200 cases your week-2 gate was scored against, drawn from your real traffic plus an adversarial set the engineer wrote with your product lead. It is the only benchmark that actually predicts incidents on your workload. A vendor leaderboard cannot replace it.
.github/workflows/eval.yml
The CI job that runs the rubric on every PR that touches agents/, eval/, rubric.yaml, or ops/failure_playbook.md. Posts a scorecard comment to the PR thread. Fails the build on regression. Without this file, your eval harness is a manual spreadsheet, not a gate.
agents/contracts.py
The pydantic input and output models for every tool the agent calls. Type-checked on every PR. The single point where a model-output schema regression turns into a red CI light instead of a 3am incident on Saturday.
ops/runbook.md
What an on-call engineer reads at 3am when the agent is paging. Links to dashboards, the name and rate limit of every model the agent calls, the one-line restart command, and the canary rollback procedure. Not generated by ChatGPT. Written by the engineer who shipped the system.
ops/failure_playbook.md
Primary model, fallback model, fallback-of-fallback, and the trigger conditions for each. Updated in the same PR every time model_primary changes. The discipline that keeps the on-call story true after a model swap.
ARCHITECTURE.md
Why this stack and not the others. A page or two of plain prose: what the agent decides, what the agent does not decide, where retrieval lives, where deterministic checks override the LLM, and which decisions were costed against alternatives. The thing a new hire reads on day one and does not have to ask the original engineer about.
rubric.yaml in full, the file the inspection asserts on
This is the exact shape we ship in week 6 across every named production agent. The inspection script reads it at step 5: model_primary must appear exactly once, the thresholds must be present, and the ownership block must name a real engineer with a real email, because the audit trail starts there.
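A minimal sketch of that shape, written as a heredoc so a starter file lands at the repo root in one command. rubric_min_score 0.82 and faithfulness 0.78 are the defaults named later on this page; the model names, the answer_relevancy line, the max_per_case_regression value, and the owner block are placeholders, not prescriptions:

```bash
cat > rubric.yaml <<'YAML'
model_primary: provider/frontier-model-2026-01   # exactly one line; a release swap edits only this
model_fallback: provider/previous-stable
rubric_min_score: 0.82            # acceptance threshold agreed at the week-0 scoping call
ragas:
  faithfulness: 0.78
  answer_relevancy: 0.75
max_per_case_regression: 0.05     # largest per-case drop a PR may introduce and still merge
owner:
  engineer: Jane Doe              # a real engineer with a real email; the audit trail starts here
  email: jane.doe@example.com
YAML
```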
inspect-production-ready.sh, the 5-minute job
Sixty lines of bash, no dependencies beyond git and grep. Run it from the root of your agent repo on the same checkout your CI uses. Exits 0 on a pass, 1 on any finding. Your CTO can run it during a coffee, your procurement reviewer can run it during a security review, and the script is part of the leave-behind on every PIAS engagement.
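The script that ships with the leave-behind is the source of truth; the condensed sketch below has the same shape, assuming the seven paths above, the vendor strings from step 4, and a simple bot filter standing in for "human author":

```bash
#!/usr/bin/env bash
# inspect-production-ready.sh -- sketch of the 7-file, 3-invariant inspection.
# Run from the repo root, on the same checkout CI uses. Exits 0 on a pass, 1 on any finding.
set -u

FILES=(
  "rubric.yaml"
  "eval/cases.yaml"
  ".github/workflows/eval.yml"
  "agents/contracts.py"
  "ops/runbook.md"
  "ops/failure_playbook.md"
  "ARCHITECTURE.md"
)

# Vendor-attached runtime strings; extend with the named SKUs you care about.
VENDOR='platform_license_key|vendor_runtime|lock_in_token|hosted-orchestration-sdk'

findings=0
finding() { echo "FINDING: $1"; findings=$((findings + 1)); }

# Step 2: every required path exists.
for f in "${FILES[@]}"; do
  [ -f "$f" ] && echo "OK  present     $f" || finding "missing file: $f"
done

# Step 3: every present file has at least one human commit in the last 30 days.
for f in "${FILES[@]}"; do
  [ -f "$f" ] || continue
  if git log --since='30 days ago' --pretty=format:'%ae' -- "$f" |
       grep -v -E 'dependabot|renovate|\[bot\]' | grep -q .; then
    echo "OK  maintained  $f"
  else
    finding "no human commit in the last 30 days: $f"
  fi
done

# Step 4: no vendor-attached runtime strings in any dependency manifest that exists.
for m in pyproject.toml package.json go.mod; do
  [ -f "$m" ] || continue
  grep -E -q "$VENDOR" "$m" && finding "vendor runtime string in $m" || echo "OK  portable    $m"
done

# Step 5: model_primary appears exactly once in rubric.yaml.
if [ -f rubric.yaml ]; then
  n=$(grep -c '^model_primary:' rubric.yaml)
  [ "$n" -eq 1 ] && echo "OK  one-line    model_primary" \
                 || finding "model_primary appears $n time(s) in rubric.yaml"
fi

# Step 6: verdict.
if [ "$findings" -eq 0 ]; then
  echo "VERDICT: PASS"
else
  echo "VERDICT: $findings finding(s) -- not production-ready"
  exit 1
fi
```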
What a pass looks like on a shipped agent
A literal run against the Monetizy.ai-shape repo at the end of a 6-week engagement. Seven files present, every file with at least one human commit in the last 30 days, zero vendor-runtime strings in the manifest, exactly one model_primary line.
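Against a repo in that state, the sketch above prints one OK line per check and exits 0. An illustrative transcript of the shape a pass takes:

```
$ ./inspect-production-ready.sh
OK  present     rubric.yaml
OK  present     eval/cases.yaml
OK  present     .github/workflows/eval.yml
OK  present     agents/contracts.py
OK  present     ops/runbook.md
OK  present     ops/failure_playbook.md
OK  present     ARCHITECTURE.md
OK  maintained  rubric.yaml
OK  maintained  eval/cases.yaml
OK  maintained  .github/workflows/eval.yml
OK  maintained  agents/contracts.py
OK  maintained  ops/runbook.md
OK  maintained  ops/failure_playbook.md
OK  maintained  ARCHITECTURE.md
OK  portable    pyproject.toml
OK  one-line    model_primary
VERDICT: PASS
$ echo $?
0
```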
What a stuck pilot actually looks like
A run against a representative pre-engagement repo we triaged earlier this year. The findings are not academic. Each one names the next PR the team needs to write. The framework on this checkout is fine; the deployment is incomplete.
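Against a repo in that state, the same sketch names each gap as a finding and exits 1. An illustrative transcript, with the gaps chosen to match the pattern this page calls most common: a missing rubric.yaml, a missing eval.yml, a missing failure playbook, and a frozen ARCHITECTURE.md:

```
$ ./inspect-production-ready.sh
FINDING: missing file: rubric.yaml
OK  present     eval/cases.yaml
FINDING: missing file: .github/workflows/eval.yml
OK  present     agents/contracts.py
OK  present     ops/runbook.md
FINDING: missing file: ops/failure_playbook.md
OK  present     ARCHITECTURE.md
OK  maintained  eval/cases.yaml
OK  maintained  agents/contracts.py
OK  maintained  ops/runbook.md
FINDING: no human commit in the last 30 days: ARCHITECTURE.md
OK  portable    pyproject.toml
VERDICT: 4 finding(s) -- not production-ready
$ echo $?
1
```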
Six steps, in the order the script runs them
The inspection runs sequentially because each step gates the next. There is no point checking the git history of a file that does not exist, and no point grepping for vendor lock-in if the manifest itself is missing.
Step 1: clone the repo and cd in
Use the same checkout your CI runs against. Not a fork, not last week's tag. The inspection reads what is on main right now. If main is not the source of truth in your repo, that is the first finding before you run anything.
Step 2: ls every required path, fail loudly if any are missing
Seven paths. Each one is either present or absent, no in between, no partial credit. A missing rubric.yaml is the most common finding on a stuck pilot, followed by a missing .github/workflows/eval.yml. Both are silent in framework comparison posts because neither is a framework feature.
Step 3: git log the last 30 days for each file, require a human author
git log --since='30 days ago' --pretty=format:'%ae' on each of the seven paths. If any file shows zero human commits in the last 30 days, the file exists but is not maintained. The most subtle failure: ARCHITECTURE.md was committed once at kickoff and has not changed since, even though the model has been swapped twice.
Step 4: grep the dependency manifest for vendor-attached runtime strings
platform_license_key, vendor_runtime, lock_in_token, hosted-orchestration-sdk, and the named SKUs of platforms that make the agent un-runnable on your own keys. If any of those land in pyproject.toml, package.json, or go.mod, your agent is not portable, no matter what the README says.
Step 5: assert model_primary is a one-line edit in rubric.yaml
grep -c '^model_primary:' rubric.yaml must equal 1. A repo where the model name is hard-coded in three Python files and one YAML is a repo where switching to the next provider release is a four-file change, not a one-line one. Most stuck pilots fail this exact line.
Step 6: print the verdict
Pass requires all seven files present, all seven with at least one human commit in the last 30 days, zero vendor-runtime strings in the manifest, and exactly one model_primary line. Six out of seven is not production-ready. It is one bus accident away from being a stuck pilot.
How the inspection wires together
The rubric is the hub. Upstream, the seven files in your repo. Downstream, the three invariants the script enforces, and the verdict that goes back to procurement and the on-call rotation. Nothing vendor-hosted in the diagram, on purpose.
7 files → rubric → 3 invariants → verdict
Repo lens vs framework lens, row by row
Left: what the framework comparison posts you might already have read produce. Right: what the 7-file rubric measures and produces. The rows are not a caricature. They are the difference between a deliverable a CTO can hand to procurement on Tuesday and a reading list a CTO has to read three times before committing to a verdict.
| Feature | Framework comparison post | 7-file repo rubric |
|---|---|---|
| What is being graded | A framework as a category: LangGraph vs CrewAI vs AutoGen by GitHub stars and monthly downloads | The artifacts in your repo: 7 files, their git history, their dependency manifest signatures |
| Time to a verdict | A reading list of 10 framework comparisons, each with a slightly different rubric | 5 minutes, one shell script, on a checkout you already have |
| What a yes means | The framework supports state persistence, has SDKs for OTEL, and is used at Cisco or Uber | Your workload has a rubric, a case set, a CI gate, a runbook, a failure playbook, an architecture doc, and at least one human commit on each in the last 30 days |
| What a no means | An abstract concern: your team should consider adopting structured outputs and human-in-the-loop controls | A specific finding with a file path: rubric.yaml is missing, eval.yml is empty, ARCHITECTURE.md was last touched in February |
| Vendor lock-in surface | Acknowledged in a paragraph, not measured; the post is sponsored by one of the platforms in the comparison anyway | Detected by grepping pyproject.toml / package.json / go.mod for named runtime strings; pass / fail |
| Translates to the on-call rotation | Production checklist mentions observability and error handling, never specifies what file to open or who wrote it | ops/runbook.md is the artifact your on-call engineer reads at 3am Saturday; the inspection asserts it exists and is current |
| Audit trail for procurement | A vendor security packet, refreshed quarterly, scoped to the platform itself | git log on each file, with author email and commit dates, exportable as a CSV in 30 seconds |
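That last row is a one-liner in practice. A sketch of the 30-second export, assuming the seven rubric paths; commit subjects that contain commas would need quoting before the file opens cleanly in a spreadsheet:

```bash
# One CSV row per commit per artifact: file, author email, author date, subject.
for f in rubric.yaml eval/cases.yaml .github/workflows/eval.yml agents/contracts.py \
         ops/runbook.md ops/failure_playbook.md ARCHITECTURE.md; do
  git log --pretty=format:"$f,%ae,%as,%s" -- "$f"
  echo   # git's format output has no trailing newline; add one per file
done > audit_trail.csv
```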
“Seven files, three invariants, five minutes. Same rubric across Pydantic AI on Monetizy, LangGraph + Bedrock on Upstate Remedial, custom orchestration on OpenLaw, an automated pipeline on PriceFox, and a multi-model DAG on OpenArt. Production-ready is not a property of the framework. It is what an engineer leaves behind.”
PIAS rubric, day-42 leave-behind across 5 named production agents
Receipts
File counts and engagement-level facts, not invented benchmarks. Per-client production metrics are on /wins.
The 7-file rule is the discipline that keeps the audit trail readable and the on-call rotation honest. A repo that passes on six of the seven files is not production-ready in any sense that procurement will accept, and 0 USD in platform licenses is the right cost to pay to verify that on a checkout you already own.
Want a senior engineer to run the 7-file inspection on your repo and write the missing files?
60-minute scoping call with the engineer who would own the build. You leave with the inspection script run against your repo, a written one-pager with the missing files named, and a fixed weekly rate to land each one.
Production-ready, by the 7-file rubric, answered
Why grade the repo and not the framework?
Because the framework is the easy choice and the deployment is the hard one. LangGraph, CrewAI, AutoGen, and Pydantic AI are all capable of getting to production on most workloads. The 95 percent of pilots that stall are stalling on missing artifacts, not on the wrong import. A team can wire LangGraph correctly and still not have a rubric, a case set, a CI gate, a runbook, or a failure playbook in their repo. That team has a framework. They do not have a production-ready agent. The 7-file rubric distinguishes those two states; a star count does not.
What are the 7 files exactly, with the paths I should grep for?
rubric.yaml at the repo root, eval/cases.yaml, .github/workflows/eval.yml, agents/contracts.py (or wherever your tool input/output pydantic models live), ops/runbook.md, ops/failure_playbook.md, and ARCHITECTURE.md at the repo root. The paths are not sacred but the existence of each artifact is. If your team uses different conventions (eval/test_cases.json, .gitlab-ci/eval.yml, docs/architecture/00-overview.md), edit the script. The discipline is the same: name the file, commit it, link it from the on-call rotation.
What are the 3 git-log invariants and why those three?
First, every file has at least one commit by a human author email in the last 30 days. A frozen architecture doc that has not changed since kickoff is a signal that the system has drifted past what the doc describes. Second, model_primary appears exactly once in rubric.yaml, so swapping providers is a one-line PR. Third, no vendor-attached runtime strings appear in pyproject.toml, package.json, or go.mod. Those three together rule out the failure modes most stuck pilots share: an outdated mental model, a hard-coded model name, and a license-attached runtime that cannot be moved.
Does this rubric prefer LangGraph, CrewAI, or any specific framework?
No, intentionally. PIAS has shipped agents on Pydantic AI (Monetizy.ai), LangGraph + Bedrock (Upstate Remedial), Anthropic with custom orchestration (OpenLaw), and a custom DAG with multi-model inference (OpenArt). All four pass the 7-file rubric on their respective repos. The framework choice is downstream of the workload. The artifacts in the repo are upstream of every framework choice. Treating production-ready as a property of the framework gets the question backwards.
What is wrong with grading on GitHub stars or monthly downloads?
Stars and downloads measure adoption among hobbyists and pilots, not survival rate to production. A framework can have 24,800 stars and a 4-to-8-week median time-to-production for the teams that adopt it, which is not a range a CTO can plan against without a rubric on her own deployment. The metric you want is whether the seven files land in your repo, with human commits, on the timeline you accepted at scoping. Stars are downstream of marketing budget. Files are downstream of an engineer.
Where do the thresholds in rubric.yaml come from?
From the week-0 scoping call with your product lead, written into a one-page memo. rubric_min_score 0.82 and ragas faithfulness 0.78 are the defaults we ship with because they correspond to acceptance levels that have held up across five named production agents on different stacks. Your numbers may differ. The discipline is that the threshold is named, written down, and signed off before the engineer touches the agent. Picking it after the eval is what makes a project unfalsifiable.
Can I use this rubric on an agent that already runs in production but predates a PIAS engagement?
Yes, and a chunk of our engagements start there. We run the 5-minute inspection on day zero and treat each finding as a backlog item. Most pre-existing agents pass on agents/contracts.py and ARCHITECTURE.md and fail on rubric.yaml, eval/cases.yaml, and the failure playbook. A typical first PR in a remediation engagement adds rubric.yaml plus a starter eval/cases.yaml drawn from the last 30 days of real traffic. Week 2 is when the CI gate actually starts blocking regressions.
How does this relate to OWASP, SOC 2, or the EU AI Act?
The 7-file rubric is the technical-substance layer underneath each of those compliance frames. SOC 2 wants change-management evidence; git log on rubric.yaml and ops/failure_playbook.md is exactly that. OWASP agentic AI risks live in eval/cases.yaml as adversarial rows and in agents/contracts.py as schema validation. The EU AI Act asks you to document risk management; ARCHITECTURE.md plus the rubric thresholds is the human-readable version of that document. Compliance frameworks describe what evidence to produce. The rubric describes where in the repo the evidence lives.
What if I run the script and only 5 of 7 files pass?
You have a stuck pilot, even if traffic is hitting it. The two missing files name the next two PRs. In the wild, the most common pair to be missing is rubric.yaml plus .github/workflows/eval.yml, which together mean every framework upgrade or model swap is unscored. The second most common pair is ops/runbook.md plus ops/failure_playbook.md, which together mean the on-call rotation is reading a Notion doc that was written by a contractor who left. Both gaps are addressable in a single engagement week if you have a senior engineer in the repo.
Why is the inspection script 60 lines of bash and not a SaaS?
Because the rubric is supposed to live in your repo, run on your CI, and be readable by your engineers. A SaaS that scores production readiness is itself a vendor-attached runtime, which the rubric is designed to flag. Sixty lines of bash does not need a license, does not change between releases, and runs on the same checkout your CI runs on. We open-source it on every engagement. The discipline is the rubric, not the tool that runs it.
Adjacent guides
More on the leave-behind that defines production-ready
Evaluating new LLM releases for production agents (April 2026)
When a new LLM release lands, qualify it with a 3-file PR against the same rubric your week-2 gate used. The mechanism the 7-file rubric makes possible.
AI agents in production: the 6-week contract rubric
First PR in 7 days, week-2 prototype gate, week-6 leave-behind. The contract that produces the seven files this page inspects.
Multi-agent orchestration frameworks, scored on portability
Seven named frameworks scored on whether they survive the engineer leaving. The framework half of the production-ready question.