The HPC Engineer will manage GPU clusters, implement distributed training techniques, optimize performance, and collaborate with data scientists on AI model development.
Headquartered in Silicon Valley, we are a newly established start-up where a collective of visionary scientists, engineers, and entrepreneurs is dedicated to transforming the landscape of biology and medicine through the power of Generative AI. Our team comprises leading minds and innovators in AI and biological science, pushing the boundaries of what is possible. We are dreamers who reimagine a new paradigm for biology and medicine.
We are committed to decoding biology holistically and enabling the next generation of life-transforming solutions. As the first mover in pan-modal Large Biological Models (LBM), we are pioneering a new era of biomedicine, with our LBM training driving ground-breaking advances and a transformative approach to healthcare. Our exceptionally strong R&D team and leadership in LLMs and generative AI position us at the forefront of this field. With our headquarters in Silicon Valley, California, and a branch office in Paris, we are poised to make a global impact.
Job Description
- GPU Cluster Management: Design, deploy, and maintain high-performance GPU clusters, ensuring their stability, reliability, and scalability. Monitor and manage cluster resources to maximize utilization and efficiency.
- Distributed/Parallel Training: Implement distributed computing techniques to enable parallel training of large deep learning models across multiple GPUs and nodes. Optimize data distribution and synchronization to achieve faster convergence and reduced training times (see the example sketch after this list).
- Performance Optimization: Fine-tune GPU clusters and deep learning frameworks to achieve optimal performance for specific workloads. Identify and resolve performance bottlenecks through profiling and system analysis.
- Deep Learning Framework Integration: Collaborate with data scientists and machine learning engineers to integrate distributed training capabilities into GenBio AI’s model development and deployment frameworks.
- Scalability and Resource Management: Ensure that the GPU clusters can scale effectively to handle increasing computational demands. Develop resource management strategies to prioritize and allocate computing resources based on project requirements.
- Troubleshooting and Support: Troubleshoot and resolve issues related to GPU clusters, distributed training, and performance anomalies. Provide technical support to users and resolve technical challenges efficiently.
- Documentation: Create and maintain documentation related to GPU cluster configuration, distributed training workflows, and best practices to ensure knowledge sharing and seamless onboarding of new team members.
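As a rough illustration of the distributed/parallel training responsibility above, here is a minimal sketch of multi-GPU data-parallel training with PyTorch DistributedDataParallel, launched with torchrun. The model, batch, and hyperparameters are placeholders for illustration only, not GenBio AI code.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker process;
    # init_process_group picks them up via the default env:// rendezvous.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and optimizer; a real job would build the actual network.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(100):
        # Placeholder batch; a real job would use a DataLoader with a
        # DistributedSampler so each rank sees a disjoint shard of the data.
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = model(x).square().mean()
        loss.backward()            # gradients are all-reduced across ranks here
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A multi-node run of this sketch would typically be launched with one process per GPU on each node, e.g. `torchrun --nnodes=2 --nproc_per_node=8 --rdzv_backend=c10d --rdzv_endpoint=<head-node>:29500 train.py` (the endpoint is a placeholder).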
Job Requirements:
- Master’s or Ph.D. degree in Computer Science or a related field, with a focus on High-Performance Computing, Distributed Systems, or Deep Learning.
- 2+ years of proven experience managing GPU clusters, including installation, configuration, and optimization.
- Strong expertise in distributed deep learning and parallel training techniques.
- Proficiency with popular deep learning and distributed training frameworks such as PyTorch, Megatron-LM, and DeepSpeed.
- Programming skills in Python and experience with GPU-accelerated libraries (e.g., CUDA, cuDNN).
- Knowledge of performance profiling and optimization tools for HPC and deep learning (see the profiling sketch after this list).
- Familiarity with resource management and scheduling systems (e.g., SLURM, Kubernetes).
- Strong background in distributed systems, cloud computing (AWS, GCP), and containerization (Docker, Kubernetes).
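To ground the profiling requirement above, here is a minimal sketch of locating GPU bottlenecks with torch.profiler; the toy model and batch are placeholders standing in for a real training workload.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model, optimizer, and batch for illustration only.
model = torch.nn.Sequential(
    torch.nn.Linear(2048, 2048),
    torch.nn.ReLU(),
    torch.nn.Linear(2048, 2048),
).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(64, 2048, device="cuda")

# Record CPU and CUDA activity for a handful of training steps.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    for _ in range(10):
        loss = model(x).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Print the operators that dominate GPU time to guide optimization work.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

The same table can be sorted by CPU time or exported as a Chrome trace to distinguish kernel-bound steps from data-loading or synchronization stalls.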
Join us as we embark on this journey to redefine the future of biology and medicine.
We are an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. GenBio AI participates in the U.S. Department of Homeland Security’s E-Verify program to confirm the employment eligibility of all newly hired employees. For more information on E-Verify, please visit www.e-verify.gov.
Top Skills
AWS
CUDA
cuDNN
Deep Learning
DeepSpeed
Distributed Computing
Docker
GCP
GPU Clusters
Kubernetes
Megatron-LM
Python
PyTorch
SLURM
GenBio AI Palo Alto, California, USA Office
Palo Alto, CA, United States, 94301