Site Reliability Engineer

Blaxel

Reposted Yesterday

In-Office

San Francisco, CA, USA

175K-250K Annually

Mid level

The Site Reliability Engineer will ensure the reliability and performance of AI infrastructure, build core systems, handle incident response, and develop automation tools.

The role

We're looking for a world-class Site Reliability Engineer to ensure the reliability, performance, and scalability of our AI infrastructure platform.

You’ll be building and operating the core systems that power agentic AI at scale. Your mission: keep our ultra-low-latency, stateful, serverless compute engine rock-solid as we serve billions of agent requests for the most sophisticated AI teams in the world.

This role is highly technical and execution-heavy. You’ll own our reliability posture end-to-end—observability, performance tuning, incident ops, infrastructure health, and the automation systems that keep everything running smoothly. We want you to design new reliability systems, push the boundaries of automation, and continuously evolve the platform to meet the demands of next-generation AI workloads. If you're a builder who thrives on owning critical infrastructure at scale, this role is for you.

What you'll do

Working closely with the founders, the infra team, and the dev team, and using AI wherever it creates leverage, you will architect and operate the systems that keep Blaxel fast, resilient, and secure.

Architect, operate, and continuously improve the core infrastructure powering our 25ms cold-start compute engine.
Build and evolve our observability stack (metrics, traces, logs), ensuring we detect issues before users do.
Define, monitor, and drive SLOs/SLIs across key system surfaces to maintain world-class reliability.
Lead incident response with rigor: root cause analysis, post-mortems, and driving systemic fixes.
Design and implement self-healing, automated operational systems to eliminate toil and scale ops.
Work across compute, networking, storage, and sandboxed execution layers to tune performance under extreme workloads.
Build automation and tooling—often with AI agents—to streamline operations, debugging, capacity planning, and failure prediction.
Stress-test and push our systems to the edge: load testing, chaos engineering, and performance benchmarking.
Own security best practices at the infrastructure layer, from sandboxed compute to network isolation.
Partner with platform engineers to ensure reliability is designed into new features from day one.

Who you are

Deeply technical by default: Fluent across systems, cloud, networking, and distributed computing. You love debugging real failures, not theoretical ones.
AI-fluent operator: You understand how AI systems behave under scale, their unique resource patterns, and the infrastructure challenges of agentic frameworks.
Builder at heart: You want to invent new reliability systems—not just maintain existing ones. You thrive in a zero-to-one infra environment.
High-velocity execution: You have a strong bias for action and a track record of shipping reliable systems quickly with excellent judgment.
Automation-first mindset: You hate repeated manual work and instinctively reach for automation or AI-driven ops to scale yourself.
Calm under pressure: When incidents hit, you operate with clarity, precision, and ownership.
Data-driven engineer: You measure everything—latency, tail behavior, resource efficiency, reliability trends—and let data guide your decisions.

Required skills

3+ years in SRE, DevOps, or infrastructure engineering roles
Strong proficiency in at least one programming language such as Go, Rust, or Python
Hands-on experience with a major cloud provider (AWS, GCP)
Solid knowledge of Linux systems, networking fundamentals, and distributed systems
Experience with bare-metal servers and datacenter operations (PXE/iPXE provisioning, IPMI/BMC, RAID/NVMe, SR-IOV, high-throughput networking)
Experience with Kubernetes or similar orchestrators
Familiarity with observability stacks (Prometheus, Grafana, ELK, Datadog)
Experience building and maintaining CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins)
Strong debugging, problem-solving, and incident-management skills

Preferred

Experience with infrastructure-as-code tools such as Terraform or Pulumi
Knowledge of service mesh or API gateway technologies
Exposure to chaos engineering or resiliency-testing frameworks
Background in security best practices for cloud environments
Prior experience in high-growth or high-availability environments

Bonus

Experience with any of the following is a plus (not required):

Serverless compute systems
Sandboxed execution environments
Ultra-low-latency runtime engineering
Distributed key-value stores and databases
Chaos engineering
Rust, Go, or systems-level programming
Deep generative AI infrastructure

About Blaxel

Blaxel is AWS for AI agents. We’re a new kind of cloud computing infrastructure optimized for the unique demands of agentic AI, leveraging a purpose-built 25ms cold-start serverless compute engine.

Now processing billions of agent requests, we power the infrastructure behind coding agents and background AI tasks for top AI startups. Founders choose us when they hit the limits of general-purpose clouds. We solve the hard infrastructure problems (statefulness, ultra-low latency, and secure sandboxed code execution) so they can focus on building their core AI products.

We raised a $7.3M seed round led by First Round Capital.

