Senior AI Engineer, Evals

Product.ai

Hybrid

Los Angeles, CA

250K-450K Annually

Senior level

Lead verification architecture for LLM-driven agent loops: design oracle-separated verifiers, build regression corpora and eval harnesses, own throughput and truth-quality gates, set model-routing rules, and author corpora and specs that guarantee adversarially verified claims at scale.

The summary above was generated by AI

You build the oracle the model can't game - the verification an entire truth layer ships through.

Product.ai is the verified truth layer for shopping - the intelligence that tells you what's actually true about a product, including when not to buy. Profitable. Bootstrapped. No outside investors. No board. 20 people outbuilding companies 10× our size.

Strong people find us and keep finding us - they apply over months and years, because the field moves fast and the exact profile we need moves with it.

Why This Role Exists

Our moat is forged knowledge: claims about products that survive adversarial verification, with confidence tiers and citations attached. We produce it with Axiomatic Intelligence (ADP) - our patent-pending method that runs multi-provider adversarial research and distills it into tiered, citable verified claims. That forge is going industrial this year: hundreds of verified claim-sets across product categories, produced by AI agent loops instead of hand-run research. Verification is the one constraint that decides whether that scale-up produces truth or just volume, and it needs a dedicated owner - working directly with the founder, no layers between your verification architecture and the company's core asset.

It is also the most durable engineering investment in the building. Every model generation absorbs another layer of surface skill, so surface skills depreciate in months. Verification architecture is the one thing models never absorb - it compounds for years. This is the most fad-proof seat we have, and the non-coding oracle (how do you grade a verdict no human can check faster than the machine?) is an open frontier, not a solved problem. You'd be working at the edge of it.

The System You'll Need to Model

Judges lie in predictable directions. Models grade their own family roughly 30 points too kind. Humans judge model output barely above chance. So eval design here is oracle-separation design: the verifier must be structurally independent of the producer - separate models, separate context, separate incentives. You design the anchor the agents cannot author.
A truth forge under adversarial pressure. Research loops produce claims; competing frontier models attack them; what survives gets tiered confidence and a citation trail. Each claim class carries its own freshness decay - a price goes stale in days, a materials fact in years. Output must stay trustworthy while throughput climbs 100x.
Verification harnesses for agent loops across every discipline. Engineering, research, and content all run as long-lived agent loops governed by our architectural law - a three-tier system of constitutional rules, specifications, and code, with deterministic gates that fire when work is promoted. Those loops ground themselves in our shared brain: 8,600+ indexed documents the agents query to answer their own questions. Your regression corpora, golden sets, and refute-by-default verifier fleets are what make unattended runs safe.
Model-routing economics in a company where every workflow already runs on agents. Token ROI is a primary metric: every loop is instrumented for it, and the work is steering large token budgets toward business outcomes in real time. Which model class runs which work is engineering law, not a finance decision - we are quality-maximalist, because the expensive thing is a redo cycle, never tokens. You own the verification side of that ROI: proving the spend returned verified truth.
A system that outruns its own documentation. We built the harness before these loops were runnable at scale, and the architecture you join will not be the architecture 90 days later. You model the direction yourself; nobody writes you a brief.

If reading that energizes you, keep going. If it feels overwhelming or underspecified, this isn't the right fit.

What You Will Own

Forge throughput and the truth-quality gate. The number that matters: verified claims per week that survive adversarial challenge. You own both sides of it - loop throughput and the quality gates that keep scale from diluting truth. Like every outcome here, it is falsifiable: it ships with an evidence test a stranger could run.
The verification harness, company-wide. Every agent loop in the company - engineering, research, content - ships against verification you designed: oracle-separated verifier agents, guardrails, evals on agent behavior. Platform engineering operates the rails; you author the law and the corpora that run across them, and you own the verdict.
Regression corpora as the spec. Here the corpus is the contract: curated cases, paraphrase variants, adversarial attempts, golden sets. When behavior is disputed, the corpus decides. You own its growth, its honesty, and its bite.
The model-routing law. The standing rules for which model class runs which work, at what verification depth, with what fallback. Every company running agents will need this within two years; you'll have built ours from the inside.

Visibility here is registered architecture decisions and outcome movement - not hours, meetings, or activity.

Who You Are

You're a builder with a high technical bar whose leverage is judgment and taste - a product engineer who happens to live in verification. You decide what "verified" should mean and how you'll know it worked, and you write the spec the agents execute. Your leverage is judgment, not keystrokes.

You interrogate every green checkmark. A passing eval is a claim, and claims get challenged - you ask what the test could not have caught before you ask what it confirmed. You form working models of complex systems on your own, notice where your model is wrong, and update fast. You write clearly, because clear writing is evidence of clear thought.

You move between verification architecture and harness code without getting stuck at either altitude - a confidence-tier scheme in the morning, the verifier fleet that enforces it by evening. You treat agents as leverage you verify. You can do this job by hand and prove it - hand-verify a claim set, hand-grade a judge, hand-build a corpus - and that mastery is exactly what lets you trust the verdict when an agent produces it. You've earned the altitude you direct from; you didn't skip it.

You've probably built an eval harness another team came to depend on, a regression corpus that caught a real regression before users did, an LLM-judge pipeline where you measured the judge's bias instead of trusting it, or a CI gate for a non-deterministic system. We care about the artifact and the reasoning more than where you did it.

Who this isn't for. This role is wrong if you mainly optimize for leaderboard scores, or if you trust vendor-reported evals. It's wrong if your instinct when output is weak is to massage the wording rather than redesign the verification, and wrong if you'd rather publish findings than ship gates. It's wrong if your code is whatever the model handed you and you couldn't say why it's right, or if you treat agents as autocomplete you trust rather than leverage you verify - and wrong if you're comfortable letting an agent grade its own work. It's wrong if you think of yourself as a heads-down coder who wants tickets and defined scope rather than a builder who decides what to build. And it's the wrong place if you need defined process, fixed scope, and a program plan with milestones; verification here is steered live, not run as a project. You'll be happiest if you have high agency, think in corpora, and want your verification architecture to be the reason an entire truth layer can be trusted.

How We Evaluate

We don't run traditional AI engineering interviews.

Written artifact. Send us a system you built and the hardest failure you personally diagnosed in it - what broke, how you found it, and what you changed. An eval design doc, a judge-bias measurement, a corpus spec, or a verification postmortem all work. Writing quality is our first filter, and the failure story is the depth proof.
Video screen. Brief and async: 5-6 questions, about 15 minutes, whenever works for you.
Calls with company stakeholders. Short conversations with key members of the team.
Conversation with the founder. Chemistry and comprehension. Can you model the system above - and argue with it?
Paid work trial. One week of paid, real evals work in our real environment - live loops, live corpora, real verdicts. We watch how you ground yourself, whether you write the spec before the build, how you verify what your agents produce, and whether your self-assessment is honest.

Compensation & Ownership

Total first-year comp: $375,000 - $450,000 (base + ownership + profit sharing). Base: $250,000 - $300,000, top of market for senior AI engineering.

Ownership is real and liquid: Profits Interest Units (PIUs) - Class B Membership Interests at $0 strike, real ownership day one, capital-gains treatment. Annual pro-rata profit sharing from Free Cash Flow - real cash every year, not a promise tied to an exit. Annual Tender Offer - the company buys back vested interests at fair market value, so you can turn ownership into cash every year without waiting for an IPO.

100% premium coverage for you and your family. Effectively unlimited token budget, steered by ROI, never capped - high token usage is encouraged here, because our harness tracks what the spend returned.

Based in Los Angeles, California. Hybrid, with flexibility; for the right builder, we're open to remote.
