Sr. Site Reliability Engineer

Backblaze

Reposted 16 Days Ago

Remote

Hiring Remotely in United States

150K-200K Annually

Senior level

As a Sr. Site Reliability Engineer, you'll ensure service reliability, build automation, and collaborate on infrastructure improvements while mentoring others.

The summary above was generated by AI

About Backblaze

Backblaze is the object storage leader in the open cloud movement, fueling customer success with cloud storage built purposefully to unlock budgets, unburden administrators, and unleash innovators. Together with our partners, we’re helping customers break free from the restrictive, overpriced legacy solutions that hold them back, and blaze forward with the full power of the open cloud in their hands.

Founded in 2007, we scaled the business with less than $3 million in outside funding until 2021, when we did a traditional IPO on the Nasdaq stock exchange. Today, Backblaze generates over $100 million in revenue and is the leading specialized storage cloud, managing over three billion gigabytes of data storage for 500K+ customers in 175+ countries, including businesses, developers, IT professionals, and individuals.
But while there is a lot to celebrate in our past, there is almost as much opportunity ahead of us. We’re seeking a Sr. Site Reliability Engineer to join our team!

About the Role:

We are seeking a Senior Site Reliability Engineer (SRE) to help ensure the stability, scalability, and reliability of our services and infrastructure. This role focuses on building automation, maintaining observability, and supporting incident response to keep customer-facing systems performing at their best.
The SRE will collaborate with engineering, product, and operations teams to embed reliability practices into day-to-day development and operations while contributing to tools and processes that improve efficiency and reduce manual effort.

What You'll Do:

Service Reliability & Operations

Own and drive the availability, durability, and performance of critical services across all production environments.
Lead and champion complex projects from problem discovery through complete, cross-functional resolution, demonstrating high-level technical ownership.
Define, establish, and enforce service health standards, including working with engineering leadership to implement SLIs, SLOs, and error budget policies for multiple services.
Lead critical incident response and post-incident reviews, translating findings into strategic, long-term service improvements and architectural changes.
Mentor others and act as a subject matter expert in following and evolving established ITIL/OSS processes (incident, change, problem, and capacity management).

Automation & Tooling

Design and architect scalable automation solutions to eliminate toil and improve the efficiency of operational tasks across the entire platform.
Drive the strategic direction of monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, Catchpoint, ELK), and integrate them for comprehensive observability.
Build, maintain, and secure advanced CI/CD pipelines, configuration management, and complex infrastructure as code solutions (Terraform, Ansible, Jenkins).
Write production-grade code (Bash, Python, Go, etc.) to develop new reliability tools and enhance existing systems.

Collaboration

Act as a principal partner to engineering, product, and operations teams, consulting on resilient system design, architecture, and operation.
Lead and formalize the Production Readiness Review (PRR) process, ensuring robust operational handoff for all new services and features.
Lead capacity planning and disaster recovery strategy across critical infrastructure components.
Manage the relationship with vendors and service providers to troubleshoot systemic issues and ensure strict adherence to SLA performance.
Drive the creation of high-quality documentation, proactively share advanced learnings, and cultivate a reliability-first engineering culture across teams.

Continuous Improvement

Own the creation, maintenance, and dissemination of operational playbooks, runbooks, and detailed system documentation.
Proactively identify systemic, recurring issues and architect and drive the implementation of long-term improvements and strategic design action plans.
Be a leading voice in promoting and embedding reliability-focused practices within development and operations teams.

Qualifications:

Education & Experience

Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).
8+ years of progressive experience in site reliability, systems engineering, or operations.
Extensive experience designing, scaling, and operating large-scale, production-grade distributed systems.

Technical Skills

Expert-level Linux systems administration and advanced troubleshooting skills.
Experience leading security-minded operations, focusing on system-wide patching, hardening, and proactive vulnerability identification.
Deep mastery of service reliability concepts, including advanced monitoring, complex alerting strategy, leading incident response, and in-depth root cause analysis.
Advanced proficiency in at least one modern scripting/programming language (Python or Go strongly preferred).
Expert knowledge of incident response methodologies and operational best practices.
Proven experience designing and operating container orchestration platforms (Kubernetes, Docker) and applying microservices concepts (required).
Expert experience with HashiCorp products (Nomad, Vault, Terraform) in a production environment.

Preferred Attributes

Significant experience in a SaaS, service provider, or hyper-scale distributed systems environment.
Deep familiarity with ITIL/OSS practices and experience defining and enforcing SLOs and SLAs.
Exceptional problem-solving skills and a strong drive to learn and apply new, complex technologies.
Advanced experience with cloud platforms (AWS, GCP, or Azure) in a production setting.

Backblaze Perks:

Healthcare for family, including dental and vision
Competitive compensation and 401K
RSU grants for full-time employees
ESPP program
Flexible vacation policy
Maternity & paternity leave
MacBook Pro to use for work, plus a generous stipend to personalize your workstation
Childcare bonus (human children only)
Fertility treatment and support
Learning & development program
Commuter benefits
Culture that supports a healthy work-life balance

To provide greater transparency to candidates, we share base pay ranges for all US-based job postings regardless of state. We set standard base pay ranges for all roles based on function, level, and country location, benchmarked against similar-stage growth companies. Final offer amounts are determined by multiple factors, including candidate location, skills, depth of work experience, and relevant licenses/credentials, and may vary from the amounts listed below.

The expected salary range for this role is $150,000 - $200,000.

At Backblaze, we value being fair and good to our customers, partners, and employees. That’s why diversity, equity, and inclusion are at the core of our values. We are committed to fostering a workforce where all employees feel a sense of belonging regardless of race, ethnicity, nationality, gender, sexual orientation, age, religion, socio-economic status, ability, veteran status, and education. We believe that our dedication to cultivating a diverse workspace not only allows us to better serve our customers in over 175 countries but further reinforces our commitment to doing the right thing. We are proud to be an Equal Opportunity Employer.

To understand more about the data we collect and process as part of your application, please view our Backblaze Employee Privacy Notice.

500 Ben Franklin Ct., San Mateo, CA, United States, 94401

What you need to know about the San Francisco Tech Scene

San Francisco and the surrounding Bay Area attract more startup funding than any other region in the world. Home to Stanford University and UC Berkeley, leading VC firms and several of the world’s most valuable companies, the Bay Area is the place to go for anyone looking to make it big in the tech industry. That said, San Francisco has a lot to offer beyond technology thanks to a thriving art and music scene, excellent food and a short drive to several of the country’s most beautiful recreational areas.

Key Facts About San Francisco Tech

Number of Tech Workers: 365,500; 13.9% of overall workforce (2024 CompTIA survey)
Major Tech Employers: Google, Apple, Salesforce, Meta
Key Industries: Artificial intelligence, cloud computing, fintech, consumer technology, software
Funding Landscape: $50.5 billion in venture capital funding in 2024 (Pitchbook)
Notable Investors: Sequoia Capital, Andreessen Horowitz, Bessemer Venture Partners, Greylock Partners, Khosla Ventures, Kleiner Perkins
Research Centers and Universities: Stanford University; University of California, Berkeley; University of San Francisco; Santa Clara University; Ames Research Center; Center for AI Safety; California Institute for Regenerative Medicine
