Rhoda AI

Research Engineer - Training Platform

Posted Yesterday
In-Office
Mountain View, CA, USA
Mid level
Build and maintain large-scale training orchestration and experiment management systems for distributed GPU model training. Implement observability, scheduling, artifact management, and tooling to optimize research iteration, cluster utilization, and reliability while collaborating with research and infra teams.
The summary above was generated by AI

At Rhoda AI, we’re building the next generation of generalist intelligent robots. We own the full robotics stack, from high-performance hardware and robot systems to the infrastructure and state-of-the-art foundation world models that control our robots. Our robots are designed to be generalists capable of operating in complex, real-world environments and handling long-tail edge cases, made possible by our cutting-edge research and end-to-end system design. We've raised over $400M and are investing aggressively in model research, infrastructure, hardware development, and manufacturing scale-up to make generalist robotics a reality.

We're looking for a Research Engineer to build and maintain the training platform that powers our model development: experiment orchestration, job management, observability, and the tooling that lets researchers move from idea to result as fast as possible.

What You'll Do

  • Build and maintain training orchestration systems for large-scale distributed model training across GPU clusters

  • Develop experiment management tooling: job configuration, tracking, reproducibility, and artifact management

  • Build observability infrastructure for training runs: loss curves, compute utilization, gradient statistics, and anomaly detection

  • Optimize and automate the research iteration loop from experiment launch to results analysis

  • Manage job scheduling and cluster utilization for efficient use of GPU compute

  • Build internal tooling and interfaces that help researchers move faster

  • Collaborate with training systems, data infrastructure, and research teams to support their platform needs

What We're Looking For

  • Strong software engineering skills with experience in MLOps or ML platform engineering

  • Familiarity with distributed training frameworks (PyTorch DDP, FSDP, DeepSpeed, Megatron, or similar)

  • Experience building experiment tracking, reproducibility, and artifact management systems

  • Comfortable managing and operating GPU cluster environments (Slurm, Kubernetes, or similar)

  • Strong reliability engineering instincts: monitoring, alerting, and failure recovery

Nice to Have (But Not Required)

  • Experience with training orchestration tools (Slurm, Ray, Kubernetes, or similar schedulers)

  • Familiarity with experiment tracking tools (Weights & Biases, MLflow, or custom solutions)

  • Experience supporting large model training pipelines (LLMs, VLMs, or video models)

  • Understanding of parallelism strategies and how they affect training efficiency and debugging

  • Experience with cloud-based training infrastructure (AWS, GCP, or Azure)

Why This Role

  • Your platform is the daily tool every researcher and engineer uses to train models

  • Improvements to training velocity and reliability compound across every experiment the team runs

  • High visibility with direct feedback from researchers and ML engineers

  • Build systems that scale from today's models to future frontier training runs
