Andromeda (andromeda.ai)

Site Reliability Engineer - AI Infrastructure

Posted 8 Days Ago

In-Office or Remote

Hiring Remotely in San Francisco, CA, USA

Senior level

The Site Reliability Engineer will provision and manage Kubernetes clusters, build automation tools, debug customer issues, and improve infrastructure reliability.

The summary above was generated by AI

Site Reliability Engineer - AI Infrastructure

Location: Global Remote / San Francisco · Full-Time

About Andromeda

Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only for hyperscalers.

We began with a single managed cluster, but it filled almost instantly. Since then, we’ve been quietly building the systems, network, and orchestration layer that make the world’s AI infrastructure more accessible.

Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most. Our platform routes training and inference jobs across global supply, unlocking flexibility and efficiency in one of the fastest-growing markets on earth.

Our long-term vision is to build the liquidity layer for global AI compute: a marketplace that moves the infrastructure and workloads powering AGI much as the world's financial markets move capital.

We are expanding to new frontiers to find the brightest people working in AI infrastructure, research, and engineering.

What You’ll Do

Provision, configure, and operate Kubernetes-based clusters for customers across multiple providers.
Build automation and tooling to streamline cluster deployments and integrations.
Debug customer issues across networking, storage, scheduling, and system layers.
Improve reliability and scalability of both training and inference infrastructure.
Design and implement monitoring, alerting, and observability for critical systems.
Collaborate with engineering and product teams to plan and deliver infrastructure for new services.
Participate in on-call and incident response, leading postmortems and reliability improvements.

What We’re Looking For

5+ years of experience in SRE, DevOps, or infrastructure engineering roles.
Strong Linux systems and networking fundamentals.
Deep experience with Kubernetes and container orchestration at scale.
Proficiency with Infrastructure-as-Code (Terraform, Helm, Ansible, etc.).
Strong automation and scripting skills (Python, Go, or Bash).
Experience with observability stacks (Prometheus, Grafana, Loki, Datadog, etc.).
Track record of operating production systems and leading incident response.

Nice to Have

Exposure to ML/AI infrastructure or GPU-based systems (CUDA, Slurm, Triton, etc.).
Familiarity with high-performance networking (InfiniBand, NVLink) or distributed storage (VAST, Weka, Ceph).
Customer-facing support or consulting experience.

Why You’ll Love It Here

This is a builder’s role. You’ll have ownership and autonomy to shape how our systems run, working directly with customers and providers while building the foundation for reliable, scalable AI infrastructure.

Top Skills

Ansible

Bash

Datadog

Grafana

Helm

Kubernetes

Loki

Prometheus

Python

Terraform

228 Grant Ave, San Francisco, California, United States, 94108-4612
