xAI

Site Reliability Engineer - Kubernetes Platform

Sorry, this job was removed at 12:13 a.m. (PST) on Thursday, Feb 26, 2026
In-Office
Palo Alto, CA

About xAI

xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All employees are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.


About the Role

We are seeking a highly skilled Site Reliability Engineer to join our mission-driven team, focusing on designing, building, and optimizing Kubernetes clusters across multiple regions. In this role, you will leverage your expertise in Kubernetes orchestration and distributed systems to enhance the reliability, performance, and cost-effectiveness of xAI’s infrastructure. You will collaborate closely with engineering teams to deliver robust, scalable solutions that support large-scale AI workloads. The ideal candidate is passionate about automation, observability, and ensuring the integrity of critical systems in a fast-paced, innovative environment.

Responsibilities
  • Develop and optimize software to provision and manage Kubernetes clusters on-premises, enabling xAI to scale efficiently.
  • Enhance the reliability, performance, and cost-effectiveness of Kubernetes infrastructure to support large-scale AI and application workloads.
  • Collaborate with xAI engineers to understand workload requirements and design tailored Kubernetes solutions to meet their needs.
  • Implement robust observability, monitoring, and security practices to ensure the integrity, availability, and confidentiality of critical systems.
  • Manage storage infrastructure using Infrastructure-as-Code (IaC) tools such as Pulumi, Terraform, or Ansible.
  • Drive system reliability through incident management, postmortems, and the definition of clear SLAs and SLOs.
  • Contribute to the Kubernetes stack, applying expertise in CNI, CRI, CSI, and related components.
  • This is an in-person role based in Palo Alto, CA, with up to 25% travel required.
Required Qualifications
  • 5+ years of experience as a Site Reliability Engineer or similar role, with a focus on building and maintaining reliable, scalable systems.
  • Proven expertise in managing Kubernetes infrastructure using tools like Cluster API (CAPI) and kubeadm.
  • Proficiency in managing storage infrastructure with IaC tools such as Pulumi, Terraform, or Ansible.
  • Deep understanding of the Kubernetes stack, including CNI, CRI, CSI, and related components.
  • Demonstrated ability to improve system reliability through incident management, postmortems, and defining SLAs/SLOs.
Preferred Qualifications
  • Experience with high-traffic web or mobile application workloads, including optimizing Kubernetes for large-scale deployments.
  • Familiarity with chaos engineering, capacity planning, or similar practices for ensuring system resilience.
  • Proficiency with tools such as Kyverno or ArgoCD, or with Go programming, for infrastructure automation.
  • Strong sense of ownership, curiosity, and enthusiasm for tackling complex technical challenges.
  • Passion for problem-solving and a proactive drive to deliver impactful results.
  • A sense of adventure and humor to navigate challenges with a positive mindset.
Annual Salary Range

$180,000 - $440,000 USD

Benefits

Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks.

xAI is an equal opportunity employer. For details on data processing, view our Recruitment Privacy Notice.

HQ

xAI San Francisco, California, USA Office

3180 18th St., San Francisco, CA, United States

xAI Palo Alto, California, USA Office

1450 Page Mill Road, Palo Alto, CA, United States

What you need to know about the San Francisco Tech Scene

San Francisco and the surrounding Bay Area attract more startup funding than any other region in the world. Home to Stanford University and UC Berkeley, leading VC firms, and several of the world’s most valuable companies, the Bay Area is the place to go for anyone looking to make it big in the tech industry. That said, San Francisco has a lot to offer beyond technology, thanks to a thriving art and music scene, excellent food, and a short drive to several of the country’s most beautiful recreational areas.

Key Facts About San Francisco Tech

  • Number of Tech Workers: 365,500; 13.9% of overall workforce (2024 CompTIA survey)
  • Major Tech Employers: Google, Apple, Salesforce, Meta
  • Key Industries: Artificial intelligence, cloud computing, fintech, consumer technology, software
  • Funding Landscape: $50.5 billion in venture capital funding in 2024 (PitchBook)
  • Notable Investors: Sequoia Capital, Andreessen Horowitz, Bessemer Venture Partners, Greylock Partners, Khosla Ventures, Kleiner Perkins
  • Research Centers and Universities: Stanford University; University of California, Berkeley; University of San Francisco; Santa Clara University; Ames Research Center; Center for AI Safety; California Institute for Regenerative Medicine
