Phizenix

ML Infrastructure Engineer

Reposted 14 Days Ago

In-Office

Menlo Park, CA

180K-200K Annually

Senior level

The ML Infrastructure Engineer will design distributed systems for ML training, optimize inference, build automation pipelines, and monitor production performance.

The summary above was generated by AI

ML Infrastructure Engineer
Menlo Park, CA | On-Site | Full-Time/Direct Hire

Looking for ML Infra experts (Bay Area preferred) with deep experience in CUDA, GPU optimization, vLLM, and LLM inference. The focus is purely on language models, not vision or audio.

Client Opportunity | Through Phizenix

Phizenix, a certified minority and women-led recruiting firm, is hiring on behalf of an AI startup pioneering diffusion-based large language models—built for faster generation, multimodal integration, and scalable enterprise deployment.

We’re looking for an ML Infrastructure Engineer to help build the infrastructure that powers large-scale model training and real-time inference. You’ll collaborate with world-class researchers and engineers to design high-performance, distributed systems that bring advanced LLMs into production.

Responsibilities

Design and manage distributed infrastructure for ML training at scale
Optimize model serving systems for low-latency inference
Build automated pipelines for data processing, model training, and deployment
Implement observability tools to monitor performance in production
Maximize resource utilization across GPU clusters and cloud environments
Translate research requirements into robust, scalable system designs

Must-Haves

Master's or PhD in Computer Science, Engineering, or a related field (or equivalent experience)
Strong foundation in software engineering, systems design, and distributed systems
Experience with cloud platforms (AWS, GCP, or Azure)
Proficiency in Python and at least one systems-level language (C++, Rust, or Go)
Hands-on experience with Docker, Kubernetes, and CI/CD workflows
Familiarity with ML frameworks like PyTorch or TensorFlow from a systems perspective
Understanding of GPU programming and high-performance infrastructure

Nice-to-Haves

Experience with large-scale ML training clusters and GPU orchestration
Knowledge of LLM-serving tools (vLLM, TensorRT, ONNX Runtime)
Experience with distributed training strategies (e.g., data/model/pipeline parallelism)
Familiarity with orchestration tools like Kubeflow or Airflow
Background in performance tuning, system profiling, and MLOps best practices

At Phizenix, we’re committed to supporting diverse and inclusive teams. This is your chance to shape the systems that power the next generation of AI innovation. Let’s build the future—together.

California Pay Range

$180,000–$200,000 USD

Top Skills

Airflow

AWS

Azure

C++

CI/CD

CUDA

Docker

GCP

GPU Optimization

Kubeflow

Kubernetes

LLM Inference

ONNX Runtime

Python

PyTorch

Rust

TensorFlow

TensorRT

vLLM

101 E. Vineyard Ave, Suite #119–115, Livermore, CA, United States, 94550

Similar Jobs

Nuro

Software Engineer

12 Hours Ago

In-Office

Mountain View, CA, USA

160K-241K Annually

Mid level

Artificial Intelligence • Automotive • Information Technology • Robotics

The role involves optimizing machine learning models, developing infrastructure for model life cycles, and collaborating across teams to enhance Nuro's autonomy technology.

Top Skills: C++, CUDA, JAX, Keras, Python, PyTorch, TensorFlow, Triton

Boson AI

Site Reliability Engineer

2 Days Ago

In-Office

Santa Clara, CA, USA

150K-250K Annually

Senior level

Artificial Intelligence • Machine Learning

As a Senior Site Reliability Engineer, you will manage HPC cluster operations, deploy infrastructure-as-code solutions, support research teams, and develop automation tools.

Top Skills: Ansible, AWS, Azure, Bash, Ceph, GCP, GitOps, GPUDirect, InfiniBand, Kubernetes, Linux, Python, RDMA, Terraform

Dyna Robotics

Infrastructure Engineer

14 Days Ago

In-Office

Redwood City, CA, USA

180K-270K Annually

Senior level

Robotics

The role involves designing and maintaining large-scale ML infrastructure, optimizing distributed training systems, and enhancing computing performance for model training.

Top Skills: Accelerate, AWS, Distributed Systems, GCP, High-Performance Computing, Kubernetes, PyTorch, TensorRT, Triton

What you need to know about the San Francisco Tech Scene

San Francisco and the surrounding Bay Area attract more startup funding than any other region in the world. Home to Stanford University and UC Berkeley, leading VC firms and several of the world’s most valuable companies, the Bay Area is the place to go for anyone looking to make it big in the tech industry. That said, San Francisco has a lot to offer beyond technology thanks to a thriving art and music scene, excellent food and a short drive to several of the country’s most beautiful recreational areas.

Key Facts About San Francisco Tech

Number of Tech Workers: 365,500; 13.9% of overall workforce (2024 CompTIA survey)
Major Tech Employers: Google, Apple, Salesforce, Meta
Key Industries: Artificial intelligence, cloud computing, fintech, consumer technology, software
Funding Landscape: $50.5 billion in venture capital funding in 2024 (PitchBook)
Notable Investors: Sequoia Capital, Andreessen Horowitz, Bessemer Venture Partners, Greylock Partners, Khosla Ventures, Kleiner Perkins
Research Centers and Universities: Stanford University; University of California, Berkeley; University of San Francisco; Santa Clara University; Ames Research Center; Center for AI Safety; California Institute for Regenerative Medicine
