Rhoda AI

Research Scientist / Engineer - Training Systems

Posted Yesterday
In-Office
Mountain View, CA, USA
Expert/Leader
Lead design and implementation of large-scale multimodal training systems. Diagnose and optimize compute, communication, and memory bottlenecks across thousands of GPUs; define parallelism strategies; build observability and regression detection tools; collaborate with researchers and infra to improve distributed efficiency and scaling for robotics world models.
The summary above was generated by AI

At Rhoda AI, we’re building the next generation of generalist intelligent robots. We own the full robotics stack, from high-performance hardware and robot systems to the infrastructure and state-of-the-art foundation world models that control our robots. Our robots are designed to be generalists capable of operating in complex, real-world environments and handling long-tail edge cases, made possible by our cutting-edge research and end-to-end system design. We've raised over $400M and are investing aggressively in model research, infrastructure, hardware development, and manufacturing scale-up to make generalist robotics a reality.

We're looking for a Staff / Principal ML Training Systems Engineer to own training systems performance end-to-end. You will define how our models train at scale — driving efficiency, scalability, and correctness across large-scale multimodal training. This is a core systems role, not infrastructure support. Your work directly determines how efficiently we use compute, how well models scale across thousands of GPUs, and how quickly research can iterate.

What You'll Do

Own training performance end-to-end

  • Diagnose and improve performance of large-scale multimodal training (vision, video, proprioception, actions, language)

  • Build systematic performance attribution: step-time decomposition (compute vs communication vs input pipeline), scaling curves across cluster sizes, and bottleneck identification and prioritization

  • Drive measurable gains in:

    • Distributed efficiency (comm/compute overlap, bucketization, topology-aware mapping, parallelism strategies)

    • Compute efficiency (kernel hotspots, operator fusion, attention optimization, framework/runtime overhead)

    • Memory efficiency (activation checkpointing, sequence packing/bucketing, fragmentation reduction)

Design training systems (not just tune them)

  • Define and evolve parallelism strategies: data / tensor / pipeline / sharding / hybrid approaches

  • Improve execution efficiency through communication scheduling and overlap, graph capture and execution optimization, and runtime-level improvements

  • Contribute to and extend training frameworks where needed

Make performance observable and measurable

  • Establish source-of-truth performance metrics: step-time breakdowns, MFU / throughput / scaling efficiency

  • Build tools to identify bottlenecks quickly, track performance across model families, and compare scaling behavior across configurations

  • Develop regression detection: microbenchmarks, performance baselines, and automated detection of efficiency regressions

Partner deeply with researchers

  • Work side-by-side with research scientists and research engineers — no silos

  • Translate model innovations into scalable, efficient implementations

  • Advise on training tradeoffs for robotics world models: long-horizon sequences, rollout/evaluation cadence, multimodal and variable-length data

Collaborate on cluster-level efficiency

  • Work with infrastructure/SRE teams to improve utilization across large distributed jobs, understand the impact of network and collective performance on training, and refine topology-aware job placement and scaling behavior

What We're Looking For

  • Proven track record improving large-scale distributed training performance

  • Deep hands-on experience with modern ML stacks (PyTorch required; JAX a plus)

  • Strong understanding of data / tensor / pipeline parallelism, sharded training (FSDP / ZeRO-style), communication patterns and overlap strategies, and scaling behavior across large GPU clusters

  • Strong systems intuition — ability to reason across compute, communication, and memory bottlenecks

  • Exceptional debugging and measurement ability: turn "training is slow" into clear bottlenecks, experiments, and validated improvements

  • High ownership mindset and comfort in a fast-moving environment

Nice to Have (But Not Required)

  • GPU kernel or compiler-level experience (CUDA, Triton, graph capture, operator fusion)

  • Experience with multimodal or video training (variable-length sequences, packing/bucketing)

  • Experience working on large-scale training frameworks or distributed runtimes

  • Familiarity with cluster topology, networking, and large-scale scheduling effects

Why This Role

  • Direct leverage on research velocity — every efficiency gain you make accelerates model iteration across the entire research team

  • Own the scalability and performance of large-scale multimodal training for real-world embodied intelligence, not static benchmarks

  • Improvements you make compound across every training run the company executes — high ownership, high impact, small elite team
