Sciforium

Senior HPC & GPU Infrastructure Engineer

Posted 5 Days Ago
In-Office
San Francisco, CA
190K-250K Annually
Senior level
The Senior HPC & GPU Infrastructure Engineer maintains GPU compute clusters, leads system reliability, manages Linux environments, and optimizes ML infrastructure.

Sciforium is an AI infrastructure company developing next-generation multimodal AI models and a proprietary, high-efficiency serving platform. Backed by multi-million-dollar funding and direct sponsorship from AMD, with hands-on support from AMD engineers, the team is scaling rapidly to build the full stack powering frontier AI models and real-time applications.

About the role

We are seeking a Senior HPC & GPU Infrastructure Engineer to take full ownership of the health, reliability, and performance of our GPU compute cluster. You will be the primary custodian of our high-density accelerator environment and the linchpin between hardware operations, distributed systems, and machine learning workflows. This role spans everything from hands-on Linux systems engineering and GPU driver bring-up to maintaining the ML software stack (CUDA/ROCm, PyTorch, JAX, vLLM). If you love squeezing every bit of performance out of hardware, enjoy debugging GPUs at scale, and want to build world-class AI infrastructure, this role is for you.

What you'll do

1. System Health & Reliability (SRE)

  • On-Call Response: Act as the primary responder for system outages, GPU failures, node crashes, and cluster-wide incidents. Minimize downtime by resolving issues rapidly.

  • Cluster Monitoring: Implement and maintain monitoring for GPU health, thermal behavior, PCIe/NVLink topology issues, memory errors, and overall system load.

  • Vendor Liaison: Coordinate with data center staff, hardware vendors, and on-site technicians for repairs, RMA processing, and physical maintenance of the cluster.

2. Linux & Network Administration

  • OS Management: Install, patch, and maintain Linux distributions (Ubuntu / CentOS / RHEL). Ensure consistent configuration, kernel tuning, and automation for large node fleets.

  • Security & Access Controls: Configure VPNs, iptables/firewalls, SSH hardening, and network routing to secure our compute infrastructure.

  • Identity & Storage Management: Manage LDAP/FreeIPA/AD for user identity, and administer distributed file systems such as NFS, GPFS, or Lustre.

3. GPU & ML Stack Engineering

  • Deployment & Bring-Up: Lead deployment of new GPU nodes, including BIOS configuration, NUMA tuning, GPU topology validation, and cluster integration.

  • Driver & Kernel Management: Build and optimize kernel modules, and maintain GPU drivers and runtime stacks for both NVIDIA (CUDA) and AMD (ROCm).

  • Software Stack Maintenance: Maintain and optimize ML frameworks and libraries, including PyTorch, JAX, CUDA toolkit, cuDNN, ROCm, NCCL, and supporting runtime systems.

  • Advanced Debugging: Troubleshoot complex interactions involving GPUs, compilers, ML frameworks, and distributed training runtimes (e.g., vLLM compilation failures, CUDA memory leaks, ROCm kernel crashes).

Ideal candidate profile
  • 5+ years of experience in HPC, GPU cluster operations, Linux systems engineering, or similar roles.

  • Bachelor’s or Master’s degree in Computer Science, Computer Engineering, Electrical Engineering, or a related technical field.

  • Strong expertise with NVIDIA (H100/B200) or AMD (MI325X/MI355X) GPUs, including driver and kernel-level debugging.

  • Deep understanding of Linux internals, kernel modules, hardware bring-up, and systems performance tuning.

  • Experience with network security, including VPNs, iptables/firewalld, SSH, and identity management (LDAP/FreeIPA/AD).

  • Proficiency in Bash and Python for scripting, automation, and workflow tooling.

  • Familiarity with ML software stacks: CUDA toolkit, cuDNN, NCCL, ROCm, JAX/PyTorch runtime behavior.

  • Deep debugging experience with NVLink/NVSwitch fabrics and RDMA networking.

Nice-to-have
  • Experience with job schedulers such as Slurm, Kubernetes, or Run:AI.

  • Exposure to vLLM, model serving optimizations, or inference systems.

  • Hands-on experience with configuration management tools (Ansible, SaltStack, Terraform).

  • Previous experience supporting ML research teams in a startup or research-heavy environment.

Benefits include
  • Medical, dental, and vision insurance

  • 401k plan

  • Daily lunch, snacks, and beverages

  • Flexible time off

  • Competitive salary and equity

Equal opportunity

Sciforium is an equal opportunity employer. All applicants will be considered for employment without attention to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran or disability status.

Top Skills

Ansible
Bash
CentOS
CUDA
GPFS
JAX
Kubernetes
Lustre
NFS
Python
PyTorch
RHEL
ROCm
Ubuntu
HQ

Sciforium San Francisco, California, USA Office

San Francisco, CA, United States

Sciforium Los Altos, California, USA Office

4401 El Camino Real, Los Altos, California, United States, 94022


