xAI

Senior Engineer - Post-training Infrastructure

Reposted 15 Days Ago
In-Office
2 Locations
Senior level
Design, implement, and optimize large-scale distributed training systems for LLMs, with a focus on reinforcement learning. Troubleshoot and enhance system performance.
The summary above was generated by AI
About xAI

xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge.

Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity.

We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important.

All engineers and researchers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

About the Role

xAI is seeking experienced software engineers to design, implement, and optimize large-scale distributed training systems for LLMs, particularly in the areas of reinforcement learning (RL) and agents. The training system needs to be robust, fast, and reasonably flexible to support state-of-the-art research.

Focus

  • Design and implement distributed RL training systems
  • Profile, debug and optimize system performance
  • Software and algorithm co-design with researchers

Ideal Experiences

  • Built scalable training frameworks for AI models on HPC clusters, including but not limited to:
    • Scalable orchestration frameworks and tools
    • Reinforcement learning frameworks consisting of asynchronous training, inference, simulation, and other components
  • Experience configuring and troubleshooting operating systems for maximum performance
  • Experience building high-performance sandboxes, virtual machines, and simulations

Typical problems you will deal with

  1. We have a new algorithm that requires a certain computation / communication pattern that the current system either does not support or handles very inefficiently. How should we refactor or improve the current system to support this algorithm?
  2. We have a new pre-training model, which has very different traits compared to the previous generation. As a result, the current system does not quite work. How should we refactor or improve the current system?
  3. The amount of FLOPs / HBM bandwidth / throughput the current system can achieve is only “XXX”, which is only “YY%” of the theoretical limit. How much can we improve?
  4. The training run restarts / crashes every “XXX” hours. What are the root causes of these restarts / crashes? Is there any way to reduce the errors?

Tech Stack

  • Python / Rust / C++
  • JAX and PyTorch
  • CUDA and NCCL

Interview Process

After submitting your application, the team reviews your CV and statement of exceptional work.

If your application passes this stage, you will be invited to a 15-minute phone interview during which a member of our team will ask some basic questions.

If you clear the initial phone interview, you will enter the main process, which consists of four technical interviews:

  1. Coding assessment in a language of your choice.
  2. Two hands-on systems interviews: demonstrate practical skills in live problem-solving sessions that involve both system design and coding.
  3. Meet the Team: Present your past exceptional work and your vision with xAI to a small audience.

Our goal is to finish the main process within one week. We don’t rely on recruiters for assessments. Every application is reviewed by a member of our technical team. All interviews will be conducted via Google Meet.

Location

The role is based in the Bay Area (San Francisco and Palo Alto). Candidates are expected to be located near the Bay Area or open to relocation.

xAI is an equal opportunity employer and does not unlawfully discriminate based on race, color, religion, ethnicity, ancestry, national origin, sex (including pregnancy, childbirth, or related medical conditions), sexual orientation, gender, gender identity, gender expression, age, disability, medical conditions, genetic information, marital status, military or veteran status, or any other applicable legally protected characteristics. 

Qualified applicants with arrest or conviction records will be considered for employment in accordance with all applicable federal, state, and local laws, including the San Francisco Fair Chance Ordinance, Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act. 

For Los Angeles County (unincorporated) Candidates:

xAI reasonably believes that criminal history may have a direct, adverse and negative relationship on the following job duties, potentially resulting in the withdrawal of a conditional offer of employment: 

  • Access to information technology systems and confidential information, including proprietary and trade secret information, and/or user data;
  • Interacting with internal and/or external clients and colleagues; and
  • Exercising sound judgment.

California Consumer Privacy Act (CCPA) Notice

Top Skills

C++
CUDA
JAX
NCCL
Python
PyTorch
Rust
HQ

xAI San Francisco, California, USA Office

3180 18th St., San Francisco, CA, United States

xAI Palo Alto, California, USA Office

1450 Page Mill Road, Palo Alto, CA, United States
