Top Reliability Engineer Jobs in San Francisco, CA
Artificial Intelligence • Consumer Web • Digital Media • Information Technology • Social Impact • Software
The Senior Site Reliability Engineer will manage system incidents, enhance monitoring and database infrastructure, and collaborate on scalable systems to maintain reliability as usage scales.
Top Skills:
AWS, ClickHouse, Kubernetes, MySQL, Postgres, Redis
Cloud • Digital Media • Information Technology
Operate and improve Kubernetes-based production systems, manage cluster lifecycle and networking, build CI/CD and GitOps pipelines, define SLOs and incident response, automate resolution with AI, implement monitoring/alerting, and drive reliability through automation and chaos engineering.
Top Skills:
Ansible, Argo CD, Bash, BGP, Calico, Ceph, Cilium, CNI Plugins, Coroot, Datadog, DNS, eBPF, Falco, Flux CD, Go, Grafana, Kubernetes, Loki, Longhorn, MetalLB, Prometheus, Python, SIEM, Terraform, Thanos, VictoriaMetrics, VXLAN, XDP
Healthtech • Insurance
The Senior Software Engineer will lead complex projects, mentor engineers, and ensure cloud infrastructure is resilient and automated. Responsibilities include developing software, managing production environments, and enforcing coding standards.
Top Skills:
Argo CD, AWS, GCP, GitHub Actions, Grafana, Istio, Kubernetes, Prometheus, Terraform
Artificial Intelligence • Software
As a Software Engineer on the Site Reliability team, you'll ensure system reliability, scalability, and observability while partnering with engineering teams and improving incident management processes.
Top Skills:
AWS, CI/CD Tooling, Container Orchestration, Datadog, Grafana, Prometheus, Terraform
Healthtech • Software
The Database Reliability Engineer manages and maintains cloud-based database infrastructures for SaaS applications, focusing on automation, process improvement, and collaboration with engineering teams.
Top Skills:
Ansible, AWS, Azure, Azure Data Factory, C#, Databricks, GCP, Git, Grafana, InfluxDB, MySQL, Postgres, PowerShell, Python, SQL, SQL Server, Terraform
Aerospace • Hardware • Logistics • Robotics • Software • Transportation
The Senior Site Reliability Engineer will lead cloud infrastructure initiatives, develop best practices, write software, and manage systems while working closely with developers. They will also participate in an on-call rotation and set high technical standards for interviews.
Top Skills:
AWS, Kafka, Kubernetes
Artificial Intelligence • Cloud • Software
The Senior Site Reliability Engineer will automate operations, improve workflows, manage secure infrastructure, and participate in on-call rotation for an AI-driven company.
Top Skills:
Arista, AWS, Bash, Ceph, Chef, CIFS, Cisco, DNS, Docker, ELK Stack, Fortinet, HP, HTTP, ICMP, IP, iSCSI, Jenkins, Kubernetes, Linux/Debian, Mesosphere, NFS, Node.js, Pivotal Greenplum, Postgres, Python, RabbitMQ, RAID, Ruby, S3, Scylla, SSH, SSL, Supermicro, TCP, TLS, Ubuntu
Robotics • Pharmaceutical
The Hardware Reliability Engineer ensures the robustness of robotic systems through testing, analysis, and collaboration across teams to improve designs and reduce risks.
Top Skills:
Onshape CAD, Python
Artificial Intelligence • Information Technology • Machine Learning • Marketing Tech • Software • Biotech • Design
The Hardware Reliability Engineer plans and executes reliability testing, develops testing methods, performs failure analysis, and collaborates with cross-functional teams to ensure product quality.
Top Skills:
Data Analysis, Electrical Engineering, Environmental Reliability, Mechanical Engineering, Reliability Testing
Artificial Intelligence • Machine Learning • Generative AI
As a Software Engineer in Infrastructure Reliability, you'll design and build resilient systems, optimize performance, improve automation, and collaborate with teams to enhance infrastructure reliability.
Top Skills:
AWS, Azure, CI/CD Pipelines, Datadog, ELK Stack, GCP, Grafana, Kubernetes, Prometheus, Splunk, Terraform
Artificial Intelligence • Information Technology • Software
Lead end-to-end platform reliability: define SLIs/SLOs, harden production architecture, ensure Kubernetes runtime and queue safety, run incident command for Sev1/Sev2, own observability/on-call/runbooks, and gate risky releases while delivering a prioritized reliability roadmap.
Top Skills:
BullMQ, Koa, Kubernetes, Node.js, PostGraphile, Postgres, React, Redis, TypeScript
Artificial Intelligence • Software • Energy • Renewable Energy
Lead and own reliability engineering for solid state transformer products across the lifecycle: develop reliability guidelines, run DFMEAs and physics-of-failure tests, specify and analyze accelerated life tests, perform root-cause failure analysis, mentor a reliability team, collaborate with design/manufacturing/quality/suppliers, and monitor field performance to drive predictive health and reliability improvements.
Top Skills:
3D Finite Element Modeling, ALTA, Bayesian Methods, BlockSim, Cross-Sectioning, CSAM, EDX, Maximum Likelihood Estimation, Monte Carlo Analysis, Optical Microscopy, ReliaSoft Synthesis Platform, RGA, SEM, Sherlock, Weibull Distribution, Weibull++, X-Ray, XFMEA
Artificial Intelligence • Machine Learning • Robotics • Software • Transportation • Design • Manufacturing
Lead reliability engineering for EV powertrain systems (EDU, high-voltage battery, power distribution). Define reliability targets, drive DFMEA, develop virtual and physical validation and PHM strategies, support field monitoring and corrective actions, embed durability in system design, and collaborate with suppliers and cross-functional teams to improve lifecycle reliability.
Top Skills:
SQL, PySpark, Python
Healthtech • Software
As a DevOps Engineer, you'll build and maintain scalable infrastructures, manage monitoring systems, provide operational support, and collaborate across teams to enhance the company's cloud environment.
Top Skills:
Ansible, AWS, Azure, Bash, Chef, Docker, GCP, GitHub Actions, Jenkins, Postgres, Puppet, Python, Terraform
Financial Services
Design, develop, and deploy robust platform solutions while ensuring reliability, scalability, and security of the system. Collaborate with teams to enhance tooling and automation.
Top Skills:
GCP, Kubernetes, Terraform
Healthtech • Information Technology • Software • Telehealth
The Senior Site Reliability Engineer will develop, monitor, and maintain distributed production systems, ensuring uptime for patients and providers while automating processes and supporting a large engineering team.
Top Skills:
AWS, Docker, GCP, Kubernetes
Artificial Intelligence • Big Data • Cloud • Software • Analytics • Infrastructure as a Service (IaaS) • Big Data Analytics
As an Airflow Reliability Engineer, you'll provide expertise in Apache Airflow, solve challenges for customers, and contribute to open-source projects, while enhancing your technical and customer-facing skills.
Top Skills:
Apache Airflow, AWS, Azure, Docker, GCP, Kubernetes, Postgres, Python, SQL
Information Technology • Legal Tech
The Senior Technology Site Reliability Engineer is responsible for maintaining and optimizing infrastructure and applications, ensuring reliability and performance while automating processes and collaborating with teams.
Top Skills:
AWS, Chef, Datadog, Go, Grafana, Java, Prometheus, Puppet, Python, Salt, Terraform
Artificial Intelligence • Software
Lead SRE pod and define reliability strategy. Scale ClickHouse and PostgreSQL for terabyte-level growth, optimize performance, build reliability patterns, automate operations, implement observability, and define SLOs and error budgets.
Top Skills:
Alerting, ClickHouse, Distributed Tracing, Failover, Go, Metrics, Partitioning, Postgres, Python, Replication, SLOs, TypeScript
Fintech • Payments
The Senior Staff SRE leads reliability engineering initiatives, drives operational excellence, mentors staff, and influences architecture to enhance system reliability and performance.
Top Skills:
AI/ML, AWS, Azure, Docker, ELK Stack, GCP, Grafana, Kubernetes, MySQL, NoSQL, Postgres, Splunk
Big Data • Healthtech • HR Tech • Machine Learning • Software • Telehealth • Big Data Analytics
The Staff Site Reliability Engineer will architect, operate, and improve the platform while ensuring security compliance and enhancing development processes.
Top Skills:
AWS, Elasticsearch, Istio, Kubernetes, NATS, Node.js, Postgres, Python, React, Terraform, TypeScript
Artificial Intelligence • Blockchain • Fintech • Financial Services • Cryptocurrency • NFT • Web3
The Senior Site Reliability Engineer will build and scale identity management tools, automate operations, ensure security, and support AWS, GCP, and Azure environments.
Top Skills:
Ansible, AWS, Azure, C#, Cloud Identity Providers, Docker, GCP, Go, Infrastructure as Code, Java, Kubernetes, Python, Ruby, Terraform
HR Tech • Information Technology • Professional Services • Sales • Software
Own and operate production-grade Kubernetes infrastructure on AWS, build GitOps CI/CD with GitHub Actions and ArgoCD, develop AI agents and internal DevOps tooling, maintain Datadog-based observability, and manage on-call incident response while collaborating with engineering teams to improve reliability and delivery speed.
Top Skills:
AI/LLM, Argo CD, AWS, CI/CD, Datadog, GitHub Actions, GitOps, Go, Kubernetes, Python
Aerospace
As a Senior Reliability Test Engineer, you'll develop and implement reliability test strategies, work with engineering teams, and build a reliability test program to ensure product quality and longevity.
Top Skills:
C, Python, SQL
Software
As an AI Support Engineer, you'll manage support requests, resolve user issues, optimize ML models, and contribute to product development.
Top Skills:
TensorRT