ML Systems Engineer

Periodic Labs

Reposted 2 Days Ago

In-Office

Menlo Park, CA, USA

300K-400K Annually

Expert/Leader

The ML Systems Engineer will design and manage efficient training and inference systems, optimize hardware utilization, and collaborate with researchers on RL loop integration, enhancing scientific discovery.

About Periodic Labs

We're an AI and physical sciences company building state-of-the-art models to accelerate breakthroughs across materials, energy, and beyond. Backed by world-class investors and growing rapidly, we operate at the pace the frontier requires. Our team brings deep expertise, genuine ownership, and an insatiable drive to push the boundaries of what's scientifically possible.

About the Role

You will own the systems layer that makes our frontier model training and inference fast, efficient, and tightly coupled to the RL feedback loop that drives scientific discovery.

This is not a pure infrastructure role and it is not a pure research role — it sits exactly at their intersection. You will go deep into the stack: scheduling, kernels, RDMA, weight synchronization, and communication primitives, while working shoulder-to-shoulder with researchers to co-design the algorithms and infrastructure together.

The RL loop is central to how Periodic Labs works. Models propose experiments, experiments generate data, and data feeds back into training. The speed and reliability of that loop are a direct multiplier on the pace of scientific discovery. You will own the infrastructure that makes it fast.

What You'll Do

Build rack- and topology-aware scheduling for GB-series GPUs across Ray, Slurm, and Kubernetes, minimizing latency and maximizing utilization across heterogeneous cluster configurations

Build online and offline profilers that surface bottlenecks across the training and inference stack and translate findings into actionable optimizations
Implement direct S3 checkpoint streaming to eliminate I/O bottlenecks in large-scale training runs
Run methodical benchmarking to identify optimal RL training configurations across model sizes, batch strategies, and hardware topologies
Write and optimize communication and GPU kernels to extract maximum throughput from the hardware

Design and implement zero-copy RDMA weight synchronization between training and inference to keep the RL loop tight and low-latency
Build fast sandbox execution environments that allow rapid rollout of model-generated actions and return of rewards without blocking the training pipeline

Engage directly with the SGLang, Megatron, and Ray communities — contributing upstream, influencing roadmaps, and pulling in improvements that benefit Periodic Labs’ workloads

Work in close collaboration with RL and pretraining researchers to co-design algorithms and infrastructure together — you will shape what is possible at the research level by knowing what is achievable at the systems level, and vice versa

The net result: high-throughput, fault-tolerant training and inference systems tightly coupled with a low-latency RL feedback loop that accelerates scientific discovery at every turn.

You Might Thrive in This Role if You Have Experience With

Large-scale inference infrastructure: load balancing, traffic shifting, scheduling, and serving architecture at production scale
Low-level systems programming: RDMA, NVLink, kernel-level work, and network stack optimization
GPU cluster scheduling and orchestration across Ray, Slurm, or Kubernetes, with awareness of rack topology and hardware locality
Writing and optimizing CUDA kernels, communication primitives, or distributed training collective operations
Profiling and benchmarking distributed ML systems to identify and eliminate bottlenecks across compute, memory, and network
Checkpoint management and streaming at scale, including direct cloud storage integration
Building or contributing to open source ML infrastructure projects (e.g., SGLang, Megatron-LM, vLLM, Ray)
Working directly with ML researchers on algorithm-infrastructure co-design — you understand the research well enough to make systems decisions that serve it

Mechanics

Minimum education: Bachelor’s degree or an equivalent combination of education and training or experience

Location: Our lab is in Menlo Park. We prefer candidates based in Menlo Park or San Francisco, but we can be flexible depending on the role

Compensation: The annual compensation range for this role is $300,000-$400,000

Visa sponsorship: Yes, we sponsor visas and will do everything we can to assist in this process with our legal support.

We’re building a team of the world’s best — the scientists, engineers, and problem-solvers who don’t just follow the frontier, they define it. If you’re driven to bring AI to life in the physical world and make discoveries that have never been made before, you belong here.
