Sitemap
Every page on PIAS, grouped by section. 67 pages total.
About (1)
Alternatives (4)
Best (2)
- agents/graph.py
  Orchestration definition: LangGraph, Pydantic AI, or a custom DAG. Picked on the scoping call with written rationale checked in next to the file.
- Model layer
  After the engineer leaves, can you swap the model? Anthropic to OpenAI to Bedrock to Vertex without rewriting the agent? Or is the engagement structurally tied to one vendor
Contact (1)
FAQ (1)
How It Works (1)
Sitemap (1)
Guides (55)
- 1. policy.yaml
  The single source of truth for what every tool can do. Scope, allowed args, rate limits, confirm-token requirements, blast radius. Read by the gateway at runtime AND by the CI replay test on every PR. (2026-05-11)
- Forward Deployed Engineer (FDE) (2026-05-09)
- T+0. The trigger.
  Vendor announces model_primary deprecation in 90 days, raises per-million-token pricing 2.4x, gets acquihired by a hyperscaler with a different SLA, or the customer (2026-05-09)
- Calendar day 0: scoping call
  One-pager with the production behavior to ship, the rubric axes, and the staging environment the engineer will push to. No statement of work without a one-pager. (2026-05-08)
- Forward Deployed Engineer, Applied AI (2026-05-08)
- What is a forward deployed engineer (the role, the three flavors, and the four-file test) (2026-05-08)
- Per-case floor lives next to the case
  Stored in eval/cases.yaml as case_floor on each row. A case that was added because of an incident gets a tighter floor than a case from the POC seed. The floor is the contract for that one case; the aggregate is reporting, not gating. (2026-05-07)
- Re-grade the week 2 cases as full trajectories
  Take the same eval/cases.yaml that scored 0.86 in week 2 and re-run it with full multi-turn rollouts capped at the realistic step ceiling for each case class. Expect the average to drop. The drop is the size of the horizon blindness, not a regression. (2026-05-06)
- Rubric drift
  The rubric file gets edited and old scores stop being comparable to new ones. The fix is a changelog and a SHA stamp on every score record. (2026-05-06)
- Step 1. The reviewer asks for the rubric.
  You hand them a path: rubric.yaml. Not a PDF, not a slide. The same file the engineer reads. (2026-05-06)
- Week 0 -- scoping call and one-pager
  30-minute call. Output is a written one-pager that names which of the seven services your shipped agent has to rebuild, which the prototype already needed (almost always: compaction and identity), and which are deferred to v2 (almost always: signed updates, until you have v0.5). (2026-05-06)
- Flavor 1. Platform FDE.
  The Palantir original. The engineer is sent to land a platform (Foundry, Gotham). The work they leave behind runs inside the platform. When the engagement ends, the customer keeps paying for the seat or the system stops running. (2026-05-05)
- The seven PRs a forward deployed engineer lands in week 2 (2026-05-05)
- 2003. Palantir founded.
  Peter Thiel, Alex Karp, Stephen Cohen, Joe Lonsdale and Nathan Gettings start Palantir to sell software to intelligence agencies. The customers cannot openly describe what they need. There is no normal product discovery loop. The role does not exist yet. (2026-05-02)
- rubric: <agent> v0 (skeleton, week 2 row of ratchet) (2026-05-02)
- 1. Faithfulness (weight 0.30, floor 0.74)
  The single most-watched axis. Did the output stay grounded in the inputs the agent was actually given? Ragas-judged on retrieval-augmented work, rubric-judged elsewhere. A weighted-average score that hides one hallucinated claim per ten cases is worse than a lower score that does not. (2026-05-01)
- tenant_id, the only axis where leak_tolerance is 0.0 by law, not by taste
  The customer organization. Hard boundary. (2026-04-27)
- 1. .claude/agents/<role>.md
  One Markdown file per subagent, checked into the client
- 1. Command(goto=\
  An agent function returns langgraph.types.Command and the framework quietly adds an edge that is not in the static graph. Your CI graph walker counts edges by reading graph.py and misses every runtime-only edge. A PR that adds three Command(goto=) returns looks like a 0-edge diff on the review screen.
- 1. graph.py or pipeline.py
  One file that defines the orchestration: nodes, edges, conditional fallback, entrypoint. LangGraph StateGraph, a Pydantic AI pipeline, or a custom scene DAG. Your engineers read this file first on every incident.
- 1. supervisor + capped turn budget
  One supervisor node routes to N worker nodes via add_conditional_edges. Workers return data. Routing lives in a pure function over state. A typed turn_budget field counts re-routes and terminates when it hits zero.
- 1. The orchestration source itself
  graph.py or pipeline.py, in your repo, on main. Model adapters sit behind a small pick_primary / pick_fallback function so swapping Bedrock for Vertex is a one-line change. No imports from a vendor runtime. No decorators tied to a control plane we keep after we leave.
- A new source lands in intake/.
  A paper, a regulation, an incident write-up, a transcript. The compiler agent reads it and drafts a candidate article in wiki-staging/. Nothing is promoted to wiki/ yet. The article is allowed to be wrong; the safety net comes next.
- AI Agent Blast Radius Database Write Scope
- Annex IV §2(g)
  Validation and testing procedures used, including information about the validation and testing data used and their main characteristics; metrics used to measure accuracy, robustness. The artifact in your repo: rubric.yaml (the metrics and thresholds) plus eval/cases.yaml (the validation data and its characteristics, with id, source, expected_traits, and rubric_weight per row).
- Artifact 1: per-alert evidence pack
  One YAML file per disposition in eval/_artifacts/, committed on every pipeline run. Carries alert_id, agent_version_sha, model_primary_pin, the input_snapshot, the agent_output, the human_review block, and a one-command rerun_command. The examiner reads any one file and reconstructs the decision. Independent testing draws its sample from the same directory; it does not need a vendor export.
- Box 1, interval 1 day
  New cases land here. So do cases that just failed. Box 1 is pulled into every PR run unconditionally, regardless of next_due. The harness is least confident about box 1, so it pays the most attention.
- Clock 1. Pre-merge harness.
  Runs in CI on every PR that touches agents/, eval/, or rubric.yaml. Reads eval/cases.yaml (80 rows: 68 from real traffic, 12 adversarial). Scores against rubric.yaml. Posts the scorecard to the PR with gh pr comment. Blocks merge on regression > 3 points. Cadence: per PR, blocking, sub-12-minute. Catches: regressions on cases the team has already seen.
- Corpus drift: the indexer ran on Tuesday and the agent has been wrong since
- Enumerate failure modes before drawing a single node
  Before we open graph.py, we write the FailureReason Literal in state.py. Six names, no more. Each name maps to an operational response a human can take, not an internal model error.
- Gate 1: judge-human agreement on a 50-case gold set
  Cohen
- Guides | PIAS
  In-depth guides on forward-deployed AI engineering, hiring embedded ML teams, build-vs-buy for AI, and agentic AI architecture patterns.
- Head slice (high-frequency intents)
- Inputs Anthropic explicitly does not redact for you
  Anthropic
- Inputs phrased in a way the rubric author did not anticipate
  Your eval writer imagined users saying \
- memory/policy.yaml
  What is allowed to enter long-term memory. Per-axis confidence floors, max writes per session, namespace allowlist, hard-block list (PII, credentials, prompts the user explicitly asked the agent to forget).
- Part 1: monitoring/alerts.yaml -- the alert spec, in the repo
  Every alert that can fire on the agent is one entry in a YAML file inside the client
- Pattern 1: Bedrock Agents (managed runtime)
  AWS owns the graph. You declare a supervisor agent, collaborator agents, and action groups; the execution loop is Bedrock
- POC seed cases
  The first 60 to 120 cases. Hand-written during the week-2 scoping call with the product lead. Cover the golden path and the named near-misses we already know about. ID prefix: poc-NNN. provenance.source: poc. provenance.incident_url: null. These are the cases the rubric was actually designed against.
- Q1. Is rubric.yaml a real file on main, or a Notion link?
  test -f rubric.yaml. If the file does not exist at the repo root on main, the harness is undefined. The single most common shelfware pattern: a one-page rubric in Notion that nothing in the repo references and no PR can be checked against.
- rubric.yaml
  The one file that defines what correct means for your workload. model_primary, model_fallback, rubric_min_score, ragas thresholds, max_per_case_regression. A new release is a one-line edit. If this file does not exist on main, your agent is not production-ready, regardless of whose framework you used.
- schema_valid
  Output parses against a Pydantic model. Pure regex/JSON-schema check. No model needed. Cost: 0 tokens. Runs in 4ms per case.
- Signal 1: label flip rate under 8 percent on a 30-day relabel pass
  Hand the same engineer the same gold set 30 days after they last labeled it, blind. Count the cases where their label flips. Healthy gold set: under 8 percent flip rate. Between 8 and 15: the rubric drifted, the engineer
- Staging corpus is curated. Production corpus is whatever the ingest job produced last night.
- Step 0: confirm the HF repo card actually says what you need
  Open the model card on huggingface.co. Confirm three things: the license is one your legal team already cleared (Apache-2.0, MIT, or a vetted Gemma or Qwen license), the parameter count and minimum GPU memory match a SKU you can already provision, and the tokenizer is one your orchestration code reads. If any of those are unclear, do not branch yet. A scoping note is faster than a reverted PR.
- Step 0: confirm the release supports the interfaces your agent uses
  Read the release notes for the three things your orchestration actually depends on: tool-use schema, structured-output mode, and streaming token shape. If any of those changed, the work is not a one-PR swap, it is a shim; stop here and schedule a scoping call. In practice most 4.x to 4.y bumps are clean swaps, most x to x+1 major bumps are not.
- Step 1: sample fails out of the trace store, do not sample successes
  We pull the last 7 days of traces, score them with the same rules + LLM judge the harness uses, and keep only the runs that failed. Successes are eval-set fodder. The failure dataset is a deliberately biased corpus. The whole point is to overweight the failures because aggregate pass rate already overweights the easy cases.
- System documentation, with versioning
- The 2am on-call test
- The demo had three hand-picked inputs. Production has every input.
- The Google Next demo had three rehearsed prompts. Production has 4,000 a day.
- The intent the team did not know existed
- The word \
  Every stakeholder has a private definition. Sales means the agent answers demo questions. SRE means it has a runbook. Compliance means a BAA is signed. Finance means ROI is measurable. Nobody has the authority to promote the pilot because nobody can say which file flipped. The PIAS engagement writes that definition into rubric.yaml in week 2, and every stakeholder reads the same threshold.
- Tokenizer changed under the splitter
- Week 2: pick the input window cap, then back-fill the tiers
  Start from your model