Supercompute Engineer

Periodic Labs

Reposted 2 Hours Ago

In-Office

Menlo Park, CA, USA

350K-450K Annually

Senior level

The Supercompute Engineer will design, operate, and optimize high-performance computing (HPC) infrastructure, focusing on large-scale GPU and CPU clusters for AI and research workloads, to ensure performance and reliability for scientific discovery.

The summary above was generated by AI

About Periodic Labs

We’re an AI and physical sciences company building state-of-the-art models to accelerate breakthroughs across materials, energy, and beyond. Backed by world-class investors and growing rapidly, we operate at the pace the frontier requires. Our team brings deep expertise, genuine ownership, and an insatiable drive to push the boundaries of what’s scientifically possible.

About the Role

As a Supercompute Engineer at Periodic Labs, you will design, build, and operate the high-performance computing infrastructure that powers our AI and scientific research. Our models demand extreme compute at scale: large GPU and CPU clusters, high-speed interconnects, low-latency parallel storage, and workload schedulers that make every cycle count. You will work directly with researchers and infrastructure engineers to ensure our compute environment is fast, reliable, and optimized for scientific discovery at the frontier.

This is a deeply hands-on role. You will architect and tune systems, automate provisioning, diagnose performance bottlenecks, and design for resilience at scale. You’ll partner with research and ML teams to understand their workloads and shape an HPC environment that removes friction and accelerates science.

What You’ll Do

Design, deploy, and operate large-scale GPU and CPU clusters for AI training, scientific simulation, and research workloads
Manage and optimize high-speed interconnect fabrics (InfiniBand, RoCE) and parallel filesystems (Lustre, GPFS, WEKA, or equivalent) for maximum throughput and minimum latency
Own workload scheduling and resource management using Slurm, Kubernetes, or similar systems — tuning for throughput, fairness, and researcher productivity
Implement and maintain automated cluster provisioning, configuration management, and lifecycle tooling using Ansible, Terraform, or custom orchestration
Monitor cluster health, performance, and utilization; build dashboards and alerting to proactively identify and resolve bottlenecks
Partner with research and ML engineering teams to profile workloads, diagnose performance issues, and tune hardware and software stacks for specific computational demands
Design and implement backup, disaster recovery, and fault-tolerance strategies for research data and compute infrastructure
Evaluate and integrate new hardware (GPUs, accelerators, networking) and software technologies as the field evolves
Establish standards and runbooks for HPC operations, capacity planning, and incident response
Collaborate with security and infrastructure teams to implement access controls, network segmentation, and compliance controls appropriate for a research environment

You Will Thrive in This Role If You Have

Experience designing and operating large-scale HPC or GPU clusters in research, cloud, or enterprise environments
Deep knowledge of high-speed interconnects such as InfiniBand (HDR/NDR) or RoCE, including fabric management, tuning, and troubleshooting
Hands-on experience with parallel and distributed storage systems (Lustre, GPFS, WEKA, BeeGFS, or similar) — configuration, performance tuning, and capacity management
Experience with workload managers and schedulers such as Slurm, PBS Pro, LSF, or Kubernetes-based HPC orchestration
Linux systems administration at scale, including kernel tuning, NUMA optimization, CPU and memory affinity, and GPU driver management
Infrastructure automation using Ansible, Terraform, or equivalent — you treat infrastructure as code
Experience with GPU computing environments including CUDA, NCCL, MPI, and multi-node distributed training or simulation setups
Performance profiling, benchmarking, and tuning of computational workloads across CPU, GPU, memory, network, and storage
Experience with monitoring and observability tooling (Prometheus, Grafana, or equivalent) in large, heterogeneous compute environments
Ability to collaborate with researchers or data scientists to understand workload requirements and translate them into infrastructure decisions

Especially Strong Candidates May Also Have

Experience operating GPU clusters for large-scale AI or ML training workloads such as multi-node transformer training
Familiarity with AI accelerators beyond GPUs, such as TPUs, Trainium, or custom ASIC environments
Experience in mixed on-prem and cloud HPC environments, including burst-to-cloud or hybrid scheduling patterns
Background in scientific computing domains such as computational chemistry, physics simulation, or bioinformatics
Experience with containerized HPC environments (Singularity/Apptainer, Docker, or container-aware schedulers)
Knowledge of network security, access control, and compliance requirements for regulated research data
Contributions to open-source HPC tooling or published work on HPC system design or performance

Mechanics

Minimum education: Bachelor’s degree or an equivalent combination of education and training or experience

Location: Our lab is in Menlo Park. We prefer candidates based in Menlo Park or San Francisco, but can be flexible depending on the role.

Compensation: The annual base compensation range for this role is $350,000-$450,000

Visa sponsorship: Yes, we sponsor visas and will do everything we can to assist in this process with our legal support.

We’re building a team of the world’s best — the scientists, engineers, and problem-solvers who don’t just follow the frontier, they define it. If you’re driven to bring AI to life in the physical world and make discoveries that have never been made before, you belong here.
