Research Engineer - Evals

AGI, Inc.

Reposted 15 Days Ago

In-Office

San Francisco, CA, USA

Mid level

As a Research Engineer, you will develop evaluation frameworks for AI models and agents, ensuring reliable metrics for model performance and product readiness.

The summary above was generated by AI

Think Different. Build the Future. 🚀

Our Mission

Build everyday AGI. Trustworthy, consumer-grade agents that redefine human–AI collaboration for millions. Software shouldn’t wait for commands; it should partner with you, amplifying what you can do every single day.

Why AGI, Inc.

We’re a stealth team of elite founders and AI researchers, with backgrounds spanning Stanford, OpenAI, and DeepMind. We’re industry leaders in mobile and computer-use agents, bringing these capabilities to consumer scale.

Grounded in years of agent research, our AI is designed with trustworthiness and reliability as core pillars, not afterthoughts.

We are supported by tier-1 investors who funded the first generation of AI giants; now they’re backing us to build the next: everyday AGI.

If you see possibility where others see limits, read on.

You decide what "better" means.

Models, agents, and product features all ship behind one question: did this actually get better? Without a strong evals function, the lab ships vibes. With one, every training run, every prompt change, every agent capability moves a number we trust — and the team makes decisions on real signal, not the loudest opinion in the room.

You'll build the eval harness for AGI — across model capability, agentic behavior, on-device performance, and end-user experience. You'll set the bar for what counts as "shipped" and protect it from the gravity of product deadlines.

🤩 Tasks you will own

The eval suites that gate every model and agent release — capability, behavior, regressions, and human-rated rubrics that catch what automated evals miss
The dashboards and tooling that make researcher experiment loops fast and leadership decisions easy
The bar — what counts as ready to ship, and how we know

🤚 Areas where you will assist

Research, by making sure what we measure is what we want
Product engineers, by instrumenting real-user behavior on real devices
Partnerships, by translating "did it get better" into language an OEM partner can hold us to

📚 Skills you'll be expected to teach

How to measure non-deterministic systems — agent eval, tool use, long-horizon tasks, multilingual behavior
How to push back on a metric that's being gamed without breaking the team

🧑‍🎓 Skills you'll be expected to learn

On-device perf trade-offs and how they show up in real-user evals
What QA-ing AI at OEM scale actually looks like
The realities of shipping consumer agents to production partners

🏆 Timeline of success

After 30 days — You've audited every eval we run today and produced a sharp doc on what's good, what's noise, and what's missing. You've fixed the most embarrassing gap.

After 60 days — You've stood up a new eval surface — agentic, on-device, or behavioral — and the team is making real decisions on its output. Researchers come to you before launching a run, not after.

After 90 days — Releases now ship against your eval bar, not a vibe-check. You've caught a regression that would have shipped, and cleared a launch the team was nervous about. You're shaping the research roadmap by surfacing where we're flat, where we're climbing, and where we're lying to ourselves.

💰 Compensation

Competitive cash and meaningful equity. Top-tier relocation and immigration support. SF, in person.

How to apply

Send a link to an eval, benchmark, or measurement system you built — and one paragraph on what decision it changed. Plus your resume or LinkedIn. Every exceptional candidate hears back within 48 hours.

170 Saint Germain Ave, San Francisco, CA 94114, United States
