Senior System Software Engineer - GPU Performance

NVIDIA

Reposted 15 Days Ago

In-Office or Remote

Hiring Remotely in Santa Clara, CA, USA

148K-288K Annually

Mid level

Conduct performance analysis on large multi-GPU clusters, evaluate interactions between libraries and hardware, and develop tools for performance visualization and data analysis.

The summary above was generated by AI

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions from artificial intelligence to autonomous cars.

We are the GPU Communications Libraries and Networking team at NVIDIA. We deliver libraries like NCCL, NVSHMEM, and UCX for Deep Learning and HPC. We are looking for a motivated performance engineer to influence the roadmap of our communication libraries. Today's DL and HPC applications have huge compute demands and run at scales of up to tens of thousands of GPUs. The GPUs are connected with high-speed interconnects (e.g., NVLink, PCIe) within a node and with high-speed networking (e.g., InfiniBand, Ethernet) across nodes. Communication performance between the GPUs has a direct impact on end-to-end application performance, and the stakes are even higher at huge scales! This is an outstanding opportunity for someone with an HPC and performance background to advance the state of the art in this space. Are you ready to contribute to the development of innovative technologies and help realize NVIDIA's vision?

What you will be doing:

Conduct in-depth performance characterization and analysis on large multi-GPU and multi-node clusters.
Study the interaction of our libraries with all HW (GPU, CPU, Networking) and SW components in the stack
Evaluate proofs of concept and conduct trade-off analysis when multiple solutions are available
Triage and root-cause performance issues reported by our customers
Collect large volumes of performance data; build tools and infrastructure to visualize and analyze the information
Collaborate with a very dynamic team across multiple time zones

What we need to see:

M.S. (or equivalent experience) or PhD in Computer Science, or related field with relevant performance engineering and HPC experience
3+ years of experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM)
Experience conducting performance benchmarking and triage on large scale HPC clusters
Good understanding of computer system architecture, HW-SW interactions and operating systems principles (aka systems software fundamentals)
Ability to implement micro-benchmarks in C/C++ and to read and modify the code base when required
Ability to debug performance issues across the entire HW/SW stack. Proficient in a scripting language, preferably Python
Familiar with containers, cloud provisioning and scheduling tools (Kubernetes, SLURM, Ansible, Docker)
Adaptability and passion to learn new areas and tools. Flexibility to work and communicate effectively across different teams and timezones

Ways to stand out from the crowd:

Practical experience with InfiniBand/Ethernet networks in areas like RDMA, topologies, and congestion control
Experience debugging network issues in large scale deployments
Familiarity with CUDA programming and/or GPUs
Experience with Deep Learning frameworks such as PyTorch and TensorFlow

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 152,000 USD - 241,500 USD for Level 3, and 184,000 USD - 287,500 USD for Level 4.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 16, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

2701 San Tomas Expressway, Santa Clara, CA, United States
