Top Reliability Engineer Jobs in San Francisco, CA
Artificial Intelligence • Fintech • Information Technology • Logistics • Payments • Business Intelligence • Generative AI
Lead design, automation, and maintenance of cloud-based database infrastructure (primarily SQL Server and MySQL). Improve reliability with monitoring, HA/DR, automation, troubleshooting, on-call support, and mentoring of junior engineers while collaborating across teams.
Top Skills:
Aurora, AWS, Bash, Failover Clustering, MySQL, New Relic, Orchestrator, PMM, Python, RDS, Ruby, SQL Server, VividCortex
Healthtech • Social Impact • Software
Define and scale reliability practices across the company by creating SLO/SLA frameworks, improving observability, evolving incident response, building self-service tooling and scorecards, and driving cross-team adoption to enable teams to build and operate reliable production systems at scale.
Top Skills:
AWS, Datadog, EKS, Kubernetes, Postgres, Terraform
Artificial Intelligence • Cloud • Computer Vision • Hardware • Internet of Things • Software
The Senior Hardware Reliability Engineer ensures product reliability through planning, testing, and collaboration across engineering and operations. Responsibilities include leading investigations, analyzing failure data, and designing reliability strategies throughout the product lifecycle.
Top Skills:
Environmental Testing, Failure Analysis, Firmware Engineering, Hardware Reliability, Reliability Modeling, Stress Testing
Security • Software • Cybersecurity • Automation
As a Senior Site Reliability Engineer, you will enhance the reliability of Drata’s product teams through automation, architecture reviews, and operational excellence using cloud-native technologies.
Top Skills:
AIOps, AWS, Bash, Datadog, Docker, Git, GitHub Actions, Kubernetes, Linux, MySQL, Python, Terraform
Artificial Intelligence • Fintech • Payments • Business Intelligence • Financial Services • Generative AI
Lead design and delivery of scalable cloud infrastructure for the Spend product. Embed with development teams to drive reliability, performance, observability, incident response, and automation. Own SLOs, runbooks, DevOps metrics, and collaborate with central DevOps and security teams to ensure compliance and resilience. Lead infrastructure projects including new service launches, data centre migrations, and modernising data pipelines.
Top Skills:
Analytics Pipelines, AWS, Data Streaming, DevOps, GCP, Incident Response, Kubernetes, Observability, SLOs, SRE
Artificial Intelligence • Information Technology • Machine Learning • Marketing Tech • Software • Biotech • Design
The Hardware Reliability Engineer plans and executes reliability testing, develops testing methods, performs failure analysis, and collaborates with cross-functional teams to ensure product quality.
Top Skills:
Data Analysis, Electrical Engineering, Environmental Reliability, Mechanical Engineering, Reliability Testing
Artificial Intelligence • Machine Learning • Natural Language Processing • Software • Financial Services • Generative AI
As a Site Reliability Engineer, you'll design and improve critical production systems, lead incident response, and enhance observability while embedding with product teams to ensure reliability and performance at scale.
Top Skills:
AWS, C++, CI/CD, Go, Python, Rust
Artificial Intelligence • Information Technology • Machine Learning • Natural Language Processing • Productivity • Software • Generative AI
The SRE will ensure the reliability of backend systems, scale Kubernetes-based control planes, and improve automation mechanisms while managing incident processes.
Top Skills:
AWS, Azure, Docker, GCP, Java, Kubernetes, Linux, Terraform
Cloud • Information Technology • Machine Learning
Own, build, and operate production reliability tooling and systems across the cloud stack. Lead projects to improve availability, scalability, automation, observability, and incident response. Ship production services in Python/Go, participate in on-call rotations, reduce toil through automation, and maintain long-lived platform frameworks.
Top Skills:
Cloud-Native, Go, GPU-Accelerated Infrastructure, Kubernetes, Metrics, Python, SLOs/SLIs, Structured Logs, Tracing
eCommerce • Fintech • Payments • Software
The role involves ensuring software reliability and performance, managing incidents, developing infrastructure automation, and mentoring junior engineers within a platform team.
Top Skills:
AWS, CloudFormation, Datadog, Kubernetes, OpenTelemetry, Ruby, Ruby on Rails, Terraform
Reposted 10 Days Ago
AdTech
As a Site Reliability Engineer, you'll maintain the infrastructure for systems, ensure efficiency, automate processes, monitor databases, and participate in architecture discussions.
Top Skills:
Amazon Kinesis, AWS Lambda, AWS SNS, BigQuery, Docker, GCP (Google Cloud Platform), GitLab, Google Cloud Functions, Google Cloud Run, Google Pub/Sub, Grafana, Istio, Kafka, Kubernetes, MySQL, Prometheus, Spanner, SQL, Terraform
Reposted 10 Days Ago
Big Data • Cloud • Software • Database
The Senior Site Reliability Engineer will lead security design and implementation for cloud infrastructures, mentor teams, and automate security solutions.
Top Skills:
Ansible, AWS, Azure, Cloud Security Tools, CloudFormation, GCP, Go, Terraform
Artificial Intelligence • Machine Learning • Software • Generative AI
Help design, scale, and improve platform reliability: define SLOs/SLIs, run on-call and incident response, build observability, improve resilience to external dependencies, enhance CI/CD and deploy safety, optimize cost and capacity, and influence infrastructure architecture.
Top Skills:
Amplitude, AWS, Cloud Run, Containers, ECS, Fargate, Firebase, GCP, Kubernetes, Modal, Next.js, Node.js, Python, React, Redis, Sentry, Serverless, TypeScript, Upstash
Artificial Intelligence • Hardware • Robotics • Software
The role involves developing and executing test strategies for autonomous systems, collaborating with engineering teams for reliability, and analyzing data for risk assessments.
Top Skills:
Python, SQL
Reposted 11 Days Ago
Cloud • Software
Responsible for maintaining FedRAMP-compliant infrastructure, collaborating with software engineers, and ensuring system availability and security. Duties include infrastructure design, automation, monitoring, and incident response.
Top Skills:
AWS, Go, Kubernetes, Puppet, Python, Terraform
Reposted 2 Days Ago
Artificial Intelligence • Machine Learning • Natural Language Processing • Software • Conversational AI
The engineer will build and operate AI/ML infrastructure, managing services on AWS and bare metal, using tools like Kubernetes and Terraform.
Top Skills:
AWS, Bash, Go, Kubernetes, Python, Slurm, Terraform
Wearables
The Hardware Reliability Engineer will oversee reliability testing for Skip's wearable devices, manage FMEA, perform environmental stress testing, and lead root cause analyses to ensure product performance in real-world conditions.
Top Skills:
Environmental Testing, FMEA, HALT, HASS, Thermal Chambers, Vibration Tables, Weibull Analysis
Healthtech • Information Technology • Software • Telehealth
Lead reliability efforts for Zocdoc's cloud-based, consumer-facing services: monitor and maintain production systems, automate tooling and infrastructure, support scaling and performance, debug production incidents, and work with product teams to improve uptime and reliability.
Top Skills:
AWS, Distributed Systems, DNS, Docker, GCP, GenAI, HTTP, HTTPS, Kubernetes, Load Balancer, Microservices, NTP, Reverse Proxy, TCP/IP, TLS, Web Application Firewall
Blockchain • eCommerce • Fintech • Payments • Software • Financial Services • Cryptocurrency
The Senior Site Reliability Engineer will enhance reliability of Block's platform, improve incident response using AI tools, and coordinate incident management. Responsibilities include building reliable systems, standardizing tools, and leading high-severity incidents during on-call rotations.
Top Skills:
Amazon Web Services, Datadog, DynamoDB, gRPC, HTTP, Istio, Java, JSON, Kotlin, Kubernetes, LaunchDarkly, MySQL, Protocol Buffers, Terraform, Vitess
Artificial Intelligence • Cloud • Software • Infrastructure as a Service (IaaS)
As a Site Reliability Engineer, you will ensure system stability and resilience, define reliability standards, and automate operational processes while collaborating cross-functionally to improve performance and reduce incidents.
Top Skills:
Bash, CI/CD, Docker, Go, Grafana, Kubernetes, Linux, Prometheus, Python
Artificial Intelligence • Cloud • Consumer Web • Productivity • Software • App development • Data Privacy
The role involves defining reliability strategies, leading initiatives across teams, enhancing monitoring and incident response, and mentoring engineers at Dropbox.
Top Skills:
AI Technologies, Debugging, Distributed Systems, Incident Response, Observability, Reliability Risk Management, SLAs, SLOs
Artificial Intelligence • Machine Learning
Lead development of AI-assisted reliability tooling, own incident response end-to-end, improve observability and SLO/SLI frameworks, scale single-tenant SaaS operations, mentor engineers, and reduce recurring operational toil through engineering and automation.
Top Skills:
Cloud Platforms, Go, Kubernetes, Linux, LLM/AI Tooling, Logs and Tracing, Observability Tooling, Python, SLO/SLI Frameworks
Artificial Intelligence • Software • Energy • Renewable Energy
The Reliability Engineer will drive reliability for Solid-State Transformers, develop and carry out test plans, analyze data, document risks, and support hardware deployment.
Top Skills:
Ansys Sherlock, MATLAB, Python, R, ReliaSoft
Artificial Intelligence • Software • Generative AI
The Founding Platform & Reliability Engineer will design and operate reliable, scalable infrastructure for an AI storytelling platform, involving hands-on implementation and strategic decision-making.
Top Skills:
Amplitude, AWS, Cloud Run, Firebase, GCP, Modal, Next.js, Node.js, Python, React, Redis, Sentry, TypeScript, Upstash
Energy
The Reliability Engineer will define reliability requirements, conduct tests, analyze designs, and collaborate on hardware reliability within energy storage products.
Top Skills:
JMP, Minitab, Python