Causal Labs

Machine Learning - Infrastructure

Reposted 15 Days Ago
In-Office
San Francisco, CA, USA
Mid level
Design and maintain distributed ML training clusters, develop scalable pipelines for large datasets, and optimize performance for ML workloads.
The summary above was generated by AI

Our mission is general causal intelligence: AI capable of (1) predicting the future and (2) identifying the optimal actions to change that future.

To achieve this breakthrough, we are building a Large Physics foundation Model (LPM), because domains governed by physics have inherent cause-and-effect relationships, unlike visual or textual data.

Weather is the ideal training ground for an LPM. It is the most well-observed physical system, offering rapid, objective ground truth feedback from sensory observations and data at a scale that dwarfs what is used to train today’s LLMs.

Causal Labs is a team of researchers and engineers from self-driving, drug discovery, and robotics (including Google DeepMind, Cruise, Waymo, Meta, Nabla Bio, and Apple) who believe general causal intelligence will be the most important technical breakthrough for civilization.

We look for infrastructure engineers who are excited to tackle unsolved problems.

Our training and inference challenges demand deep expertise in setting up distributed training clusters and optimizing performance for large models. If you have experience building large-scale ML infrastructure in related fields such as language and vision models, robotics, or biology, join us on this mission.

Responsibilities

  • Design, deploy, and maintain large distributed ML training and inference clusters

  • Develop efficient, scalable end-to-end pipelines to manage petabyte-scale datasets and model training throughout the entire ML lifecycle

  • Research and test various training approaches including parallelization techniques and numerical precision trade-offs across different model scales

  • Analyze, profile, and debug low-level GPU operations to optimize performance

  • Stay up to date on research and bring new ideas into our work

What we’re looking for

We value a relentless approach to problem-solving, rapid execution, and the ability to quickly learn in unfamiliar domains.

  • Strong grasp of state-of-the-art techniques for optimizing training and inference workloads

  • Demonstrated proficiency with distributed training frameworks (e.g., FSDP, DeepSpeed) to train large foundation models

  • Knowledge of cloud platforms (GCP, AWS, or Azure) and their ML/AI service offerings

  • Familiarity with containerization and orchestration frameworks (e.g., Kubernetes, Docker)

  • Background working on distributed task management systems and scalable model serving & deployment architectures

  • Understanding of monitoring, logging, observability, and version control best practices for ML systems

You don’t have to meet every single requirement above.

Top Skills

AWS
Azure
Docker
GCP
Kubernetes
ML Training Frameworks