Dropbox

Staff Product Manager, AI Eval Platform

Reposted 21 Hours Ago
Remote
Hiring Remotely in Canada
191K-259K Annually
Expert/Leader
Lead the development of AI evaluation systems at Dropbox, measuring quality and reliability of AI features, collaborating across teams, and defining metrics for product improvements.
The summary above was generated by AI
Role Description

As a Staff Product Manager within the Dash organization, you will play a crucial role in shaping how we measure and evaluate our AI-powered assistant and features. Dropbox is seeking a Staff Product Manager to lead AI Evaluations (Evals) — the systems, metrics, and processes that measure the quality and reliability of AI-powered features across Dropbox. In this role, you’ll define how we evaluate model performance, accuracy, and user satisfaction across diverse AI surfaces like Dash, search, summarization, and intelligent organization. You will be responsible for a core platform that enables every product team at Dropbox to launch new AI features with confidence, armed with the tools to measure their success both online and offline.

You’ll collaborate closely with Applied AI, Data Science, and Research to design frameworks that ensure our AI features are helpful, safe, and high-quality. This includes everything from defining success metrics for model improvements to building scalable pipelines that assess qualitative and quantitative signals.

This role sits at the intersection of AI systems, data rigor, and product judgment — ideal for a PM who loves turning ambiguity into measurable progress and ensuring that every AI interaction meets a bar of excellence.

Responsibilities
  • Define and drive the roadmap for Dropbox’s AI Evaluation Framework, covering both quantitative metrics and human-in-the-loop systems.
  • Define the strategic vision and north-star framework for how Dropbox measures AI performance, setting unified principles for quality, correctness, relevance, and reliability across Dash and other AI features.
  • Own, end to end, the offline scoring pipelines, online instrumentation, dashboards, APIs, and LLM-as-Judge components used by all product teams.
  • Build and scale a self-serve measurement platform that enables any Dropbox team to launch features, run experiments, and measure performance with minimal friction.
  • Collaborate cross-functionally with ML, product, engineering, research, and data science to operationalize evaluation pipelines, design rubrics, and ensure metrics are valid, reproducible, and reliable.
  • Establish and maintain company-wide evaluation standards by defining rubrics, extending scorer taxonomies, and writing guidelines that become the foundation for AI quality measurement and benchmarking.
  • Integrate measurement systems into the product lifecycle by partnering with PMs and engineering to ensure evaluation and feedback loops are embedded from ideation through launch and iteration.
  • Communicate results, insights, and trade-offs to senior leadership, influencing product decisions and roadmap prioritization through clear storytelling backed by rigorous data.
Requirements
  • 10+ years of experience building measurement, analytics, or evaluation platforms, ideally in an ML/AI context (e.g. experimentation platforms, metrics infrastructure, evaluation pipelines), with an understanding of the end-to-end AI development lifecycle, from model training to deployment and monitoring.
  • BS/MS in Computer Science, Engineering, Business, Information Systems, Applied Math or Statistics, or relevant experience.
  • Experience designing and deploying evaluation frameworks and pipelines, e.g. offline vs. online evaluation, metric definition and calibration, and human + model adjudication where needed.
  • Deep understanding of ML evaluation, metrics, and statistics, e.g. AUC, precision/recall, calibration, bias detection, variance, and error analysis.
  • Technical fluency and the ability to partner with software engineers and data scientists. You are comfortable reasoning about pipelines, APIs, performance, scale, latency, and system tradeoffs, can engage in deep technical discussions, and can translate complex technical concepts into clear product requirements.
  • Strong cross-functional collaboration skills. You will need to work with PMs, researchers, engineers, data teams, labeling teams, and senior leaders.
  • Exceptional written and verbal communication skills, with a demonstrated ability to create clear, structured product documents and effectively communicate vision, trade-offs, and progress to stakeholders at all levels, including executives.
  • Bias, fairness, and robustness mindset. Experience (or sensitivity) in designing evaluations with fairness, adversarial robustness, and edge cases in mind.
Preferred Qualifications
  • Experience developing or implementing LLM-based evaluation frameworks within a RAG (Retrieval-Augmented Generation) context while leveraging LLM-as-a-Judge for online evaluations.
  • Hands-on experience with prompt evaluation, rubric design, human-in-the-loop evaluation, and adversarial test design.
  • Familiarity with experimentation at scale, including test design and measurement, e.g. A/B testing systems, causal inference, and counterfactual measurement.
  • 5+ years of experience building self-service internal platforms, ML infrastructure, SDKs, or APIs.
  • Experience building platforms or internal tools for technical users or developers and non-technical audiences alike. 
  • PhD or advanced degree in a quantitative field (CS, ML, statistics, etc.).
Compensation
Canada Pay Range
$191,300 - $258,700 CAD

Top Skills

AI
Data Platforms
LLM
Metrics
ML
HQ

Dropbox San Francisco, California, USA Office

Though remote is our primary way of working, meaningful in-person connection and collaboration is a critical part of Virtual First. Our San Francisco Studio is a place for teams to come together to host meetings, off-sites, and build community.
