Mind Robotics

Machine Learning Infrastructure Engineer

Reposted 4 Days Ago
In-Office
Palo Alto, CA, USA
Entry level
The role involves building systems for large-scale model training, focusing on distributed training, ML infrastructure, and GPU performance optimization.
The summary above was generated by AI
The Role

At Mind Robotics, we’re building generalized physical AI—robotic systems capable of dexterous, adaptive, and reasoning-intensive work in real-world industrial environments. Our ability to iterate quickly on large-scale models depends on world-class ML infrastructure.

We’re looking for a Machine Learning Infrastructure Engineer to build the core systems that enable fast, reliable, and scalable model training—powering everything from experimentation to production deployment.

Responsibilities
  • Design and implement scalable systems for training large ML models

  • Enable efficient workflows for data ingestion, training, and iteration

  • Develop and optimize distributed training systems across hundreds of GPUs

  • Implement strategies for parallelization, sharding, and efficient compute utilization

  • Improve training efficiency through techniques such as attention optimizations, kernel fusion, and memory management

  • Partner closely with modeling teams to accelerate iteration speed and reduce training costs

  • Build internal tools for experiment tracking, monitoring, and debugging

  • Implement systems for tracking training performance, failures, and resource utilization

  • Debug and resolve bottlenecks across the training stack

  • Provide lightweight infrastructure support for deploying and running models in production environments

  • Optimize inference performance and reliability where needed

  • Support core cloud infrastructure needs for training workloads (without heavy DevOps overhead)

  • Manage compute resources efficiently across training jobs

Qualifications
  • Strong experience building infrastructure for large-scale ML training

  • Deep understanding of how modern LLM/VLM systems are trained and scaled

  • Proven experience setting up and scaling distributed training across hundreds of GPUs

  • Strong understanding of parallelization strategies (data, model, pipeline parallelism)

  • Strong proficiency in Python programming

  • Expert-level proficiency in PyTorch and/or JAX

  • Strong understanding of techniques like attention optimization, kernel fusion, and efficient memory usage

Nice to Have
  • Experience supporting inference systems in production

  • Familiarity with robotics or embodied AI workloads

  • Experience building tools for experiment management and researcher productivity
