Guide / open-weight model deployment / April 2026
Deploying a new Hugging Face model is a four-file pull request, not a model swap.
When Qwen3.6-27B, Gemma-4-21B-REAP, MiniMax-M2.7, Tencent Hy3-preview, or Multiverse's LittleLamb family lands on Hugging Face, the bottleneck is not vLLM versus TGI. It is the eval harness that scores the candidate on your case set, the deploy artifact that pins the container digest plus the GPU SKU plus the quantization, and the runbook that tells your on-call which path survived. This guide is the four-artifact PR that makes a self-hosted weights swap auditable, comparable to the closed-API model you are running today, and rollback-safe.
Direct answer (verified 2026-04-29)
How to deploy a new Hugging Face model to production.
Open a pull request that changes four files, no more, no less:
- rubric.yaml: pin model_primary to the hf:// URI plus weight_sha256 plus the HF revision.
- deploy/inference.yaml: pin the inference container image digest, tensor parallel size, quantization, GPU SKU, and KV-cache fraction.
- .github/workflows/eval.yml: already on main from the week-6 leave-behind, runs the case rubric and ragas against the same eval/cases.yaml the closed-API incumbent was scored on.
- ops/failure_playbook.md: on a pass, the closed-API incumbent demotes to fallback, in the same PR.
The hard work is the eval harness, not the inference server. vLLM, TGI, TensorRT-LLM, and SGLang are all viable in April 2026. Source on the open-source state, including which models are dominating downloads: Hugging Face, State of Open Source, Spring 2026.
Same case set, ephemeral inference pod stood up from deploy/inference.yaml, scorecard posted to the PR. April 2026 candidates are scored on the same rubric the closed-API incumbent was accepted under in week 2.
Candidates and infrastructure named in real qualification PRs during April 2026:
Monetizy.ai / Upstate Remedial Management / OpenLaw / PriceFox / OpenArt
What every other guide on this gets wrong
The pages that currently come up for this question split into three groups. Group one benchmarks vLLM against Text Generation Inference against TensorRT-LLM. Group two walks through Hugging Face Inference Endpoints clicks. Group three wraps a Transformers pipeline in FastAPI and calls it a deployment guide. Each group is useful for what it covers. None of them tell you what a CTO actually has to decide on release day, which is whether the new open-weight model is better than the closed-API model their team is already running, on the workload the team actually ships.
That decision is not an inference-server question. It is an eval question, and the answer lives in a file most teams have not written: eval/cases.yaml with the failure modes that matter for your product, plus a CI job that scores both the open-weight candidate and the closed-API incumbent against the same rows. Without that file, picking between Qwen3.6-27B and Claude 4.7 Opus comes down to whoever has the loudest opinion in the room.
The point of this page is the four artifacts that turn the decision into a pull request. The first three exist on a closed-API qualification too. The fourth, deploy/inference.yaml, is the open-weight specific one, and it is the one every other guide skips.
Release day: how an HF launch lands in two engagement shapes
Qwen3.6-27B lands on Hugging Face on a Tuesday. The platform team picks a vLLM container, runs a notebook against ten prompts, says it looks good. Security opens a ticket about the new license file. Procurement has to log a new GPU spend line. Finance asks whether anyone has measured cost-per-token against the existing closed-API spend. The rubric the agent was accepted on lives in a Slack thread someone has to find. Six weeks of meetings later, the model is still not in production, and a newer release just shipped. The same four marks show up in every version of this story:
- Notebook on a laptop, not CI
- Container the eval ran is not the container production runs
- No comparison against the closed-API incumbent on the same cases
- Verdict arrives weeks after the release, sometimes never
Anchor fact: the four files a qualifying PR touches
On the closed-API path, a qualification PR touches three files: rubric.yaml, .github/workflows/eval.yml (already on main), and ops/failure_playbook.md. On the Hugging Face path, the PR touches four. The new file is deploy/inference.yaml, and pinning it correctly is the difference between a reproducible deploy and a quiet drift incident two months later.
Four files. One scorecard. One decision.
- rubric.yaml: flips model_primary from a closed-API string to hf://<author>/<repo>, adds weight_sha256, and records the HF revision. Two lines change, not one.
- deploy/inference.yaml: the open-weight specific artifact. Pins the container image digest, tensor parallel size, quantization scheme, GPU SKU, and KV-cache fraction. The CI runner uses this file to stand up an ephemeral inference pod for the eval step.
- .github/workflows/eval.yml: not edited by this PR. Triggered by it. Already on main from the week-6 leave-behind. Reads deploy/inference.yaml, brings up the pod, scores the candidate against eval/cases.yaml, and tears the pod down.
- ops/failure_playbook.md: updated in the same PR on a pass. Closed-API incumbent demotes to fallback. Rate limits and timeouts preserved. Scorecard URL recorded. The on-call file still names a working route at 3am.
File 1: rubric.yaml, with the two lines a Hugging Face PR changes
On a closed-API qualification, only model_primary moves. On an open-weight qualification, the weight hash is part of the identity, because tags on Hugging Face can be force-pushed and a silent re-upload at the same revision is a different artifact.
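A minimal sketch of the identity block after the change. The hash, revision, and threshold values are illustrative placeholders, not a real release:

```yaml
# rubric.yaml -- only the model identity block moves on this PR.
model_primary: hf://Qwen/Qwen3.6-27B   # was: claude-4-7-opus
weight_sha256: "3f9a...d41c"           # sha256 of the weight shards (placeholder)
hf_revision: "a1b2c3d"                 # commit on the HF repo, never a tag

model_fallback: claude-4-7-opus        # incumbent, demoted on a pass
thresholds:                            # unchanged by this PR
  rubric_min: 0.85
  per_case_regression_max: 3
```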
File 2: deploy/inference.yaml, the artifact every other guide skips
This is the file that does not exist on the closed-API path. It pins the parts of the deploy that are easy to drift on: the container image digest, the tensor parallel size, the quantization scheme, and the GPU SKU. The CI runner uses it to stand up the same pod the eval will score, and production reuses the same digest. If the file is not pinned, the container the eval scored is not the container production runs.
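A sketch of the shape, with placeholder digest and sizing values; the field names mirror the prose above but are not a schema the harness mandates:

```yaml
# deploy/inference.yaml -- everything the eval pod and the production pod share.
image: vllm/vllm-openai@sha256:9e2f...   # container digest, never a tag
model_uri: hf://Qwen/Qwen3.6-27B
weight_sha256: "3f9a...d41c"             # must match rubric.yaml
gpu_sku: H200
tensor_parallel: 2
quantization: awq                        # nvfp4 | awq | gptq | none
max_num_seqs: 128
kv_cache_fraction: 0.90
startup_timeout_s: 600                   # cold load gets its own clock
healthcheck: /v1/models                  # ready gate before the rubric step
```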
File 3: eval.yml, with the new bring-up step
The workflow on main is unchanged from the closed-API path. The only addition on a Hugging Face PR is the bring-up step, which reads deploy/inference.yaml, spins up the ephemeral pod on a GPU runner, exports the endpoint URL, and tears the pod down at the end. The case rubric, ragas, and per-case regression check are reused as is.
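A sketch of the bring-up job, assuming a self-hosted GPU runner with docker and yq available; step names, the port, and the parsing one-liner are illustrative, and the serving flags vLLM needs are elided:

```yaml
# .github/workflows/eval.yml -- the only addition on the HF path.
bring-up:
  runs-on: [self-hosted, gpu]
  steps:
    - uses: actions/checkout@v4
    - name: Stand up the ephemeral pod from the pinned digest
      run: |
        IMAGE="$(yq '.image' deploy/inference.yaml)"
        docker run -d --name candidate --gpus all -p 8000:8000 "$IMAGE"
    - name: Gate on the healthcheck so cold load is not scored as latency
      run: until curl -sf http://localhost:8000/v1/models; do sleep 5; done
      timeout-minutes: 10                # startup_timeout_s, not inference latency
    - name: Point the shared rubric steps at the pod
      run: echo "MODEL_ENDPOINT=http://localhost:8000/v1" >> "$GITHUB_ENV"
    # case rubric, ragas, per-case regression, and teardown run unchanged.
```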
File 4: ops/failure_playbook.md, in the same PR on a pass
The trick on the open-weight path is that the closed-API model still wants to live somewhere. Demoting it to fallback in the same PR keeps the on-call runbook honest. On a real production failure, the playbook should route to the most reliable thing, even if that is the more expensive closed API. We deliberately keep the fallback more expensive than the primary, so the cost incentive aligns with the reliability incentive instead of fighting it.
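A sketch of the fallback stanza the playbook records on a pass, assuming it embeds a YAML routing block; the field names and placeholder values are ours, not a format the playbook mandates:

```yaml
# Recorded in ops/failure_playbook.md after the merge. Illustrative shape.
primary: hf://Qwen/Qwen3.6-27B           # self-hosted, cheaper per token
fallback:
  model: claude-4-7-opus                  # incumbent, demoted in this PR
  rate_limit_rpm: 300                     # preserved from its time as primary
  timeout_s: 60                           # preserved
scorecard_url: "<PR scorecard link>"      # placeholder, filled in by the PR
qualified_on: 2026-04-29
on_call: "@platform-oncall"
```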
What a real qualification PR looks like end-to-end
The checklist below is transcribed from a literal Qwen3.6-27B qualification session on a self-hosted H200 runner. Cold-load on a 27B model is minutes, not seconds, but the eval harness gives the bring-up step its own timeout so the cold load does not pollute the inference latency in the scorecard.
Seven steps, plain order
The qualification run as it is written on the engineer's checklist. This is the order we teach your team during the week-6 handoff so the next Hugging Face release does not require us in the room.
Step 0: confirm the HF repo card actually says what you need
Open the model card on huggingface.co. Confirm three things: license is one your legal already cleared (Apache-2.0, MIT, or a vetted Gemma or Qwen license), parameter count and minimum GPU memory match a SKU you can already provision, and the tokenizer is one your orchestration code reads. If any of those are unclear, do not branch yet. A scoping note is faster than a reverted PR.
Step 1: open a branch named model/hf-<author>-<repo>-<short-sha>
Branch off main. The branch name carries the author (for example Qwen, google, MiniMaxAI), the repo, and the short HF revision SHA. The eval harness tags the scorecard with the branch name, so the artifact in the PR thread already names the exact weights revision a future engineer needs to reproduce the decision.
Step 2: pin the candidate in rubric.yaml
Change model_primary from a closed-API string (claude-4-7-opus, gpt-4.1) to a hf:// URI plus weight_sha256 plus revision. Two lines change, not one, because for self-hosted weights the hash is part of the identity. Commit message includes the HF release date and a link to the model card so the audit trail names the source.
Step 3: write deploy/inference.yaml in the same branch
This file does not exist on the closed-API path. It pins the container image digest (vllm/vllm-openai@sha256:..., or your internal mirror), tensor_parallel size, quantization scheme (NVFP4 for Gemma-4-31B-NVFP4-turbo, AWQ for many Qwen variants, or none for full-precision Hy3-preview), the GPU SKU your runner can schedule on (H100, H200, MI300X, GB200), and max_num_seqs plus KV-cache fraction. The CI runner needs this file to spin up an ephemeral inference pod against which eval.yml scores the candidate.
Step 4: push the branch, let GitHub Actions run eval.yml
The workflow already exists on main from the week-6 leave-behind. The new step on this branch is an inference-pod step that reads deploy/inference.yaml, brings up the container on a GPU runner, and points the eval target URL at it. Then it runs the case rubric, ragas, and the per-case regression check against the closed-API incumbent. The scorecard posts to the PR thread.
Step 5: read the scorecard, decide
Three outcomes. (a) Open-weight candidate clears every threshold and no case regressed by more than 3 points: merge, demote the closed-API incumbent to fallback, ship. (b) Candidate clears thresholds but regresses one case: add it to the per-case allowlist with a justification and request a second reviewer. (c) Any threshold fails: PR cannot merge, the incumbent keeps serving traffic, the branch stays as a record. None of the three requires a meeting.
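For outcome (b), a sketch of what the allowlist entry might look like in rubric.yaml; the case id and field names are illustrative:

```yaml
# rubric.yaml -- per-case regression allowlist. Illustrative shape.
regression_allowlist:
  - case_id: refund-policy-edge-07       # hypothetical case name
    delta: -4                            # points regressed vs. the incumbent
    justification: >
      Candidate refuses a borderline refund the incumbent approved;
      product lead signed off, see the PR discussion.
    second_reviewer: required
```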
Step 6: on a pass, update ops/failure_playbook.md in the same PR
Demote the closed-API incumbent to fallback. Preserve its rate-limit and timeout settings, record the scorecard URL and date, name the engineer on call. The fallback path should still cost more per token than the new self-hosted primary, on purpose: a real production failure should automatically route to the most reliable thing in the playbook, even if it is the more expensive one.
Wiring: HF release, your CI runner, your scorecard
The hub is rubric.yaml. Upstream, an HF release plus the agent traffic that fed eval/cases.yaml. Downstream, the four files the PR writes plus the scorecard. Nothing vendor-hosted in between, on purpose.
HF release → rubric.yaml → scorecard on your PR
Four files vs. the common HF deployment playbook
Left: what most teams default to when an April 2026 HF release lands and there is no eval harness in the repo. Right: what the PIAS leave-behind commits to on day 42. The rows are not a caricature; they are the difference between merging on Tuesday afternoon and scheduling a follow-up call for next quarter.
| Feature | Common HF deployment playbook | PIAS leave-behind |
|---|---|---|
| What rubric.yaml pins | Most guides pin the model name and tag only. A re-upload at main passes through unnoticed and breaks reproducibility three releases later. | model_primary plus weight_sha256 plus revision (commit on the HF repo). The weight hash is load bearing: a silent re-upload at the same tag is a different artifact and must reopen the PR. |
| What the deploy artifact pins | A docker run command in a README or a Helm chart that pulls latest. The container the eval scored is not the container production runs. | deploy/inference.yaml: container image digest (sha256), tensor_parallel size, quantization (NVFP4 or AWQ or none), GPU SKU (H100, H200, MI300X, GB200), max_num_seqs, KV-cache memory fraction. |
| Where the eval harness lives | Either a notebook the engineer ran once on a laptop, or a vendor evaluation platform that scored against its own benchmark set. Neither survives the next release. | On main as .github/workflows/eval.yml. It runs the same case rubric used for the closed-API incumbent, so the new open-weight candidate is scored on the same eval/cases.yaml that accepted the agent in week 2. |
| What happens to the closed-API incumbent on a pass | Disabled in code, sometimes commented out, sometimes left dual-routing without a documented switch. The fallback URL the on-call expects no longer matches the file. | Demoted to fallback inside the same PR. Rate limits, timeouts, and secret-manager keys are preserved. The on-call runbook still names a working route at 3am on Saturday. |
| What happens on a regression | Manual rollback on the gateway or the GPU pod. Post-mortem doc drafted later. Unclear which cases regressed because the eval was a one-off. | Build fails on rubric delta or per-case regression. PR cannot merge. The incumbent keeps serving. The branch stays on origin as the record that the release was evaluated and did not pass on this workload. |
| Where the inference engine lives | Inference engine names appear in agent code paths. Swapping vLLM for TGI is a feature PR with a model change bolted on, and is not auditable as a model qualification. | Behind an internal HTTP route the rubric does not care about. vLLM, TGI, TensorRT-LLM, or SGLang are interchangeable. The PR only changes the deploy artifact, not the orchestration code. |
| Licensing cost per qualification | Per-seat evaluation platform fee, sometimes per-model-tested fee, plus an inference-endpoint minimum on the managed route. | Zero in platform license. Cost is CI minutes plus GPU hours for the eval slice plus the candidate's HF storage egress. |
Two arguments people lose when they skip the rubric
The first argument is “our open-weight model is cheaper”. Sometimes true on token cost, often false on total cost when you include GPU idle time, cold-load amortization, and the platform engineer time the closed-API path does not need. The rubric does not have an opinion. It scores the candidate on quality, surfaces the cost-per-1k-tokens at a representative concurrency, and lets the engineering and finance leads read the same number. Most of the “Hugging Face is cheaper” arguments evaporate when the GPU is idle six hours a day. Most of them survive when the inference profile is steady-state.
The second argument is “we want IP ownership of the weights”. Real, but the IP that matters most for an enterprise agent is not the weights. It is the case set in eval/cases.yaml, the rubric thresholds in rubric.yaml, and the failure playbook in ops/failure_playbook.md. Those three artifacts are what make any future swap, open or closed, cheap and reversible. The weights are replaceable. The rubric is not. We hand both over by day 42.
“Four files, one scorecard, one decision. The Hugging Face path adds deploy/inference.yaml to the closed-API rubric, and the same eval.yml scores both. By day 42 the team can qualify any future open-weight or closed-API release in a single afternoon, on the same case set, without us in the room.”
PIAS rubric, applied to April 2026 candidates including Qwen3.6-27B, Gemma-4 variants, MiniMax-M2.7, Tencent Hy3-preview, and Multiverse LittleLamb
Receipts
File counts and engagement-level facts, not invented benchmarks. Per-client production metrics are on /wins.
The four-file rule is what keeps the audit trail readable three releases later. A qualification PR that edits more than four files is a feature PR with a model swap inside, and we split it. The discipline is dull. It is what makes zero USD in platform license cost actually translate into a process the team can repeat without us.
Want every Hugging Face release to be a four-file PR on your repo?
60-minute scoping call with the senior engineer who would own the build. You leave with a one-page week-0 memo: the rubric, the case set shape, the deploy artifact format, the GPU SKU and quantization choices, and the weekly rate.
Deploying new Hugging Face models, answered
Is this guide about Hugging Face Inference Endpoints or self-hosted vLLM?
Self-hosted, by default, because the differentiator is auditability and IP ownership. Inference Endpoints is a fine path when you genuinely do not want to run GPUs, but the qualification PR shape is the same: rubric.yaml pins the endpoint URI, deploy/inference.yaml pins the endpoint configuration (instance type, replicas, scale-to-zero behavior), and the eval harness scores the candidate on your case set. The harness does not care which side of the wire the GPU is on. It cares that the artifact running in CI is the artifact running in production, which is why the digest pin is load bearing on either path.
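On the managed path, the same file pins the endpoint configuration instead of the container. A hedged sketch, with field names that are ours, not Hugging Face's API:

```yaml
# deploy/inference.yaml -- Inference Endpoints variant. Illustrative field names.
endpoint:
  provider: hf-inference-endpoints
  instance_type: nvidia-h100             # SKU the endpoint schedules on
  replicas: { min: 1, max: 4 }
  scale_to_zero: false                   # cold-load vs. idle-cost tradeoff
model_uri: hf://Qwen/Qwen3.6-27B
weight_sha256: "3f9a...d41c"             # still load bearing on the managed path
```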
vLLM, TGI, TensorRT-LLM, SGLang. Which inference engine should we use?
All four are production grade in April 2026 and the pages that rank for engine choice will tell you each one wins a different benchmark. The honest answer for an enterprise workload is: pick whichever your platform team can already debug at 3am on Saturday, pin its image digest in deploy/inference.yaml, and let the eval harness score the candidate end-to-end. We have shipped on vLLM most often because the OpenAI-compatible HTTP surface keeps agents/*.py untouched on a swap. TGI is the right call when you want HF's batching plus structured-output features and your team already runs HF-native tooling. TensorRT-LLM is the right call when latency-per-token on NVIDIA is the binding constraint. SGLang is the right call for tool-call-heavy agents on smaller open-weight models. The qualification PR is the same on all four.
Why does the deploy artifact pin a container digest instead of a tag?
Because tags move and digests do not. vLLM's stable line ships a new patch every few weeks, and a tag like vllm/vllm-openai:0.17 may point at a different image next week than it did when your eval scored the candidate. If the production pod pulls a different binary than the CI pod tested, the scorecard is a lie. Pinning the sha256 digest of the image, alongside the sha256 of the weights, makes the deploy artifact reproducible. It is the smallest change that prevents the most expensive class of incident on self-hosted inference.
How do you actually compare an open-weight candidate to a closed-API incumbent on the same case set?
The case set is a list of input fixtures plus expected outcomes plus a rubric definition, all in eval/cases.yaml. The MODEL_ENDPOINT environment variable is what the rubric runner uses to send each case to whichever model is being scored. On the closed-API path, MODEL_ENDPOINT is the provider URL plus the model id (claude-4-7-opus, gpt-4.1). On the Hugging Face path, MODEL_ENDPOINT is the URL of the ephemeral inference pod that eval.yml just stood up from deploy/inference.yaml. Same cases, same scoring code, same thresholds, different endpoints. The per-case regression step then compares against main, where main is the incumbent. That comparison is the artifact you cannot get from a vendor evaluation platform.
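A sketch of one case row, assuming the shape this answer describes; the ids, fixture path, and rubric keys are illustrative:

```yaml
# eval/cases.yaml -- one row. The same rows score both candidate and incumbent;
# only MODEL_ENDPOINT changes between runs. Shape is illustrative.
cases:
  - id: invoice-dispute-14               # hypothetical case name
    input: fixtures/invoice_dispute_14.json
    expected:
      tool_call: open_dispute            # agent must reach for the right tool
      must_not_contain: ["I am unable"]
    rubric:
      faithfulness_min: 0.85             # scored by ragas
      score_min: 4                       # 1-5 case rubric
```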
What about quantization? Should we run NVFP4, AWQ, GPTQ, or full precision?
Quantization belongs in deploy/inference.yaml, not in agents/*.py. The point of putting it in the deploy artifact is that the eval harness scores the quantized variant the production pod will actually serve. It is common for a 27B candidate to clear the rubric at AWQ on H200 but fail at full precision on a smaller GPU footprint, and the qualification PR should reflect the variant you intend to ship. For Gemma-4-31B-NVFP4-turbo specifically, the NVFP4 weights are the artifact, so the rubric is scored against NVFP4 and that is what production runs. The harness does not let you score one variant and ship another.
Cold-load on a 27B+ model is minutes, not seconds. Does the eval harness handle that?
Yes, by giving the bring-up step its own timeout (startup_timeout_s in deploy/inference.yaml, default 10 minutes) and a healthcheck on /v1/models. The runner waits until the endpoint reports ready before the rubric step begins, so cold-load time does not get attributed to inference latency in the scorecard. We track cold-load separately and surface it in the scorecard, because if cold-load is over your SLO for a fail-over, the on-call playbook needs a warm-pool note in ops/failure_playbook.md.
What if the new HF release changes the tokenizer or chat template?
Then it is not a one-PR qualification, it is a feature PR with a model swap inside. Tokenizer changes and chat-template changes break orchestration assumptions inside agents/*.py: tool-call schemas, structured-output mode, system-prompt boundaries. We split: a feature PR adapts the orchestration, lands on main, becomes the new baseline. Then a separate four-file qualification PR runs against that new baseline. Mixing the two is how regressions hide. The discipline is dull, but it is what makes the audit trail readable three releases later.
Which April 2026 Hugging Face releases is this rubric actually being run on?
The April 2026 candidates we have seen named on trending lists or model cards include Qwen3.6-27B, Gemma-4-21B-REAP, Gemma-4-31B-NVFP4-turbo, MiniMax-M2.7, Tencent's Hy3-preview (open-weight preview posted on April 23 2026), and Multiverse Computing's LittleLamb family (announced on April 28 2026). The rubric does not care about the release schedule. It cares that the candidate clears your case set on the variant you intend to ship. Which it might. Or might not.
What if we have a production agent already and no rubric file?
That is most engagements. Week 0 of a PIAS engagement is reading your existing agent and the failure modes that are not yet under any test. Week 1 is the first PR that writes eval/cases.yaml from your last 30 days of real traffic plus an adversarial set the engineer captures with your product lead. Week 2 is the gate against that case set. From week 6 forward, every new closed-API release is a three-file PR and every new Hugging Face release is the four-file PR on this page. Both PR shapes share .github/workflows/eval.yml and eval/cases.yaml. The leave-behind is what makes them possible.
Adjacent guides
More on the same eval rubric
Evaluating new LLM releases for production agents (April 2026)
The three-file PR that qualifies a closed-API release on the same eval/cases.yaml. This page is the sibling on the open-weight path.
AI agents in production: the 6-week contract rubric
First PR in 7 days, week-2 prototype gate, week-6 leave-behind. The rubric that produces the files this page deploys against.
Agent eval set, model swap, trust
Why the case set in eval/cases.yaml is the IP that matters most on a model swap, open or closed.