
Product.ai

Senior AI Engineer, Evals

Posted Yesterday
Hybrid
Metropolitan, CA
250K-450K Annually
Senior level
Own verification architecture and eval harnesses that produce adversarially-verified, tiered product claims. Build oracle-separated verifier agents, regression corpora, and routing laws to ensure throughput and truth-quality gates. Author corpora specs, measure judge bias, enforce deterministic verification gates, and prove token ROI across agent loops. Work directly with founder and platform teams to ship verifiers, corpora, and model-routing rules that scale verified claims safely.
The summary above was generated by AI
You build the oracle the model can't game - the verification an entire truth layer ships through.

Product.ai is the verified truth layer for shopping - the intelligence that tells you what's actually true about a product, including when not to buy. Profitable. Bootstrapped. No outside investors. No board. 20 people outbuilding companies 10× our size.

Strong people find us and keep finding us - they apply over months and years, because the field moves fast and the exact profile we need moves with it.

Why This Role Exists

Our moat is forged knowledge: claims about products that survive adversarial verification, with confidence tiers and citations attached. We produce it with Axiomatic Intelligence (ADP) - our patent-pending method that runs multi-provider adversarial research and distills it into tiered, citable verified claims. That forge is going industrial this year: hundreds of verified claim-sets across product categories, produced by AI agent loops instead of hand-run research. Verification is the one constraint that decides whether that scale-up produces truth or just volume, and it needs a dedicated owner - working directly with the founder, no layers between your verification architecture and the company's core asset.

It is also the most durable engineering investment in the building. Every model generation absorbs another layer of surface skill, so surface skills depreciate in months. Verification architecture is the one thing models never absorb - it compounds for years. This is the most fad-proof seat we have, and the non-coding oracle (how do you grade a verdict no human can check faster than the machine?) is an open frontier, not a solved problem. You'd be working at the edge of it.

The System You'll Need to Model

  • Judges lie in predictable directions. Models grade their own family roughly 30 points too generously. Humans judge model output barely above chance. So eval design here is oracle-separation design: the verifier must be structurally independent of the producer - separate models, separate context, separate incentives. You design the anchor the agents cannot author.
  • A truth forge under adversarial pressure. Research loops produce claims; competing frontier models attack them; what survives gets tiered confidence and a citation trail. Each claim class carries its own freshness decay - a price goes stale in days, a materials fact in years. Output must stay trustworthy while throughput climbs 100x.
  • Verification harnesses for agent loops across every discipline. Engineering, research, and content all run as long-lived agent loops governed by our architectural law - a three-tier system of constitutional rules, specifications, and code, with deterministic gates that fire when work is promoted. Those loops ground themselves in our shared brain: 8,600+ indexed documents the agents query to answer their own questions. Your regression corpora, golden sets, and refute-by-default verifier fleets are what make unattended runs safe.
  • Model-routing economics in a company where every workflow already runs on agents. Token ROI is a primary metric: every loop is instrumented for it, and the work is steering large token budgets toward business outcomes in real time. Which model class runs which work is engineering law, not a finance decision - we are quality-maximalist, because the expensive thing is a redo cycle, never tokens. You own the verification side of that ROI: proving the spend returned verified truth.
  • A system that outruns its own documentation. We built the harness before these loops were runnable at scale, and the architecture you join will not be the architecture 90 days later. You model the direction yourself; nobody writes you a brief.
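
A minimal sketch of three ideas from the list above: an oracle-separation check, a self-preference gap for judges, and per-class freshness decay. It is an illustration under stated assumptions, not Product.ai's implementation. The names `Grade`, `FRESHNESS_TTL`, `is_oracle_separated`, `self_preference_gap`, and `is_fresh` are hypothetical, and so are the freshness windows (the posting says only "days" for prices and "years" for materials facts). The gap here compares raw means; a fairer measurement would have both judges grade the same outputs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical freshness windows per claim class (assumed values).
FRESHNESS_TTL = {
    "price": timedelta(days=3),
    "materials": timedelta(days=3 * 365),
}


@dataclass(frozen=True)
class Grade:
    judge_family: str     # model family that issued the grade
    producer_family: str  # model family that produced the graded output
    score: float          # judge score on a 0-100 scale


def is_oracle_separated(producer_family: str, verifier_family: str) -> bool:
    """Structural check: the verifier must not share the producer's family."""
    return producer_family != verifier_family


def self_preference_gap(grades: list[Grade]) -> float:
    """Mean same-family score minus mean cross-family score, in points."""
    same = [g.score for g in grades if g.judge_family == g.producer_family]
    cross = [g.score for g in grades if g.judge_family != g.producer_family]
    if not same or not cross:
        raise ValueError("need both same-family and cross-family grades")
    return mean(same) - mean(cross)


def is_fresh(claim_class: str, verified_at: datetime, now: datetime) -> bool:
    """A claim stays fresh while its age is within its class's window."""
    return now - verified_at <= FRESHNESS_TTL[claim_class]
```

A positive gap means judges score their own family higher than other families; a verifier that fails `is_oracle_separated` should not be allowed to issue the verdict at all.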


If reading that energizes you, keep going. If it feels overwhelming or underspecified, this isn't the right fit.

What You Will Own

  • Forge throughput and the truth-quality gate. The number that matters: verified claims per week that survive adversarial challenge. You own both sides of it - loop throughput and the quality gates that keep scale from diluting truth. Like every outcome here, it is falsifiable: it ships with an evidence test a stranger could run.
  • The verification harness, company-wide. Every agent loop in the company - engineering, research, content - ships against verification you designed: oracle-separated verifier agents, guardrails, evals on agent behavior. Platform engineering operates the rails; you author the law and the corpora that run across them, and you own the verdict.
  • Regression corpora as the spec. Here the corpus is the contract: curated cases, paraphrase variants, adversarial attempts, golden sets. When behavior is disputed, the corpus decides. You own its growth, its honesty, and its bite.
  • The model-routing law. The standing rules for which model class runs which work, at what verification depth, with what fallback. Every company running agents will need this within two years; you'll have built ours from the inside.
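
One way to picture the model-routing law and the verification side of token ROI is a standing table plus one metric. This is a sketch under assumptions, not the company's actual rules. The work classes, model classes, depth values, and the `verified_claims_per_million_tokens` metric are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    model_class: str          # which model class runs the work
    verification_depth: int   # how many independent verifier passes
    fallback: Optional[str]   # model class to use if the primary fails


# Hypothetical routing table: work class -> standing rule.
ROUTING_LAW = {
    "claim_research": Route("frontier", 3, "mid"),
    "formatting": Route("small", 1, None),
}


def route(work_class: str) -> Route:
    """Look up the standing rule; unknown work is rejected, not guessed."""
    try:
        return ROUTING_LAW[work_class]
    except KeyError:
        raise KeyError(f"no routing rule for work class {work_class!r}") from None


def verified_claims_per_million_tokens(verified_claims: int, tokens: int) -> float:
    """Assumed ROI metric: surviving verified claims per million tokens spent."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return verified_claims * 1_000_000 / tokens
```

Rejecting unrouted work instead of defaulting to a model keeps the table the single source of the rule, so a missing entry surfaces as a gap in the law, not as silent spend.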


Visibility here is registered architecture decisions and outcome movement - not hours, meetings, or activity.

Who You Are

You're a builder with a high technical bar - a product engineer who happens to live in verification. You decide what "verified" should mean and how you'll know it worked, and you write the spec the agents execute. Your leverage is judgment and taste, not keystrokes.

You interrogate every green checkmark. A passing eval is a claim, and claims get challenged - you ask what the test could not have caught before you ask what it confirmed. You form working models of complex systems on your own, notice where your model is wrong, and update fast. You write clearly, because clear writing is evidence of clear thought.

You move between verification architecture and harness code without getting stuck at either altitude - a confidence-tier scheme in the morning, the verifier fleet that enforces it by evening. You treat agents as leverage you verify. You can do this job by hand and prove it - hand-verify a claim set, hand-grade a judge, hand-build a corpus - and that mastery is exactly what lets you trust the verdict when an agent produces it. You've earned the altitude you direct from; you didn't skip it.

You've probably built an eval harness another team came to depend on, a regression corpus that caught a real regression before users did, an LLM-judge pipeline where you measured the judge's bias instead of trusting it, or a CI gate for a non-deterministic system. We care about the artifact and the reasoning more than where you did it.

Who this isn't for. This role is wrong if you mainly optimize for leaderboard scores, or if you trust vendor-reported evals. It's wrong if your instinct when output is weak is to massage the wording rather than redesign the verification, and wrong if you'd rather publish findings than ship gates. It's wrong if your code is whatever the model handed you and you couldn't say why it's right, or if you treat agents as autocomplete you trust rather than leverage you verify - and wrong if you're comfortable letting an agent grade its own work. It's wrong if you think of yourself as a heads-down coder who wants tickets and defined scope rather than a builder who decides what to build. And it's the wrong place if you need defined process, fixed scope, and a program plan with milestones; verification here is steered live, not run as a project. You'll be happiest if you have high agency, think in corpora, and want your verification architecture to be the reason an entire truth layer can be trusted.

How We Evaluate

We don't run traditional AI engineering interviews.

  • Written artifact. Send us a system you built and the hardest failure you personally diagnosed in it - what broke, how you found it, and what you changed. An eval design doc, a judge-bias measurement, a corpus spec, or a verification postmortem all work. Writing quality is our first filter, and the failure story is the depth proof.
  • Video screen. Brief and async: 5-6 questions, about 15 minutes, whenever works for you.
  • Calls with company stakeholders. Short conversations with key members of the team.
  • Conversation with the founder. Chemistry and comprehension. Can you model the system above - and argue with it?
  • Paid work trial. One week of paid, real evals work in our real environment - live loops, live corpora, real verdicts. We watch how you ground yourself, whether you write the spec before the build, how you verify what your agents produce, and whether your self-assessment is honest.


Compensation & Ownership

Total first-year comp: $375,000 - $450,000 (base + ownership + profit sharing). Base: $250,000 - $300,000, top of market for senior AI engineering.

  • Profits Interest Units (PIUs): Class B Membership Interests at $0 strike, real ownership day one, capital-gains treatment.
  • Annual pro-rata profit sharing from free cash flow.
  • Annual tender liquidity.
  • 100% family premium coverage.
  • Effectively unlimited token budget, steered by ROI, never capped.

Based in Los Angeles, California. Hybrid, with flexibility; for the right builder, we're open to remote.


