Top Reliability Engineer Jobs in San Francisco, CA
Software
Lead and manage engineering teams for ConductorOne's cloud infrastructure, ensuring reliability, security, and compliance while fostering team growth and culture.
Top Skills:
AI • CI/CD • Cloud Infrastructure • Kubernetes • Security Compliance (SOC 2, ISO 27001)
Blockchain • Fintech • Payments • Financial Services • Cryptocurrency • Web3
The Site Reliability Engineer will build and maintain infrastructure, improve software systems, develop scalable microservices, and ensure quality software delivery.
Top Skills:
AWS • Go • Google Cloud Platform • Java • Kubernetes • Azure • SQL
Fintech • Payments
The Senior Staff SRE leads reliability engineering initiatives, drives operational excellence, mentors staff, and influences architecture to enhance system reliability and performance.
Top Skills:
AI/ML • AWS • Azure • Docker • ELK Stack • GCP • Grafana • Kubernetes • MySQL • NoSQL • Postgres • Splunk
Fintech • Information Technology • Payments
Lead software engineering initiatives for Middleware Reliability Engineering by automating processes, enhancing system reliability, and promoting DevOps practices, impacting global payment systems.
Top Skills:
Ansible • AWS • Azure • Docker • ELK • GCP • Git • Go • Grafana • Java • Jenkins • Kubernetes • Prometheus • Python • Terraform
Robotics
The Site Reliability Engineer will design and operate scalable systems, own cloud infrastructure, implement observability tools, and ensure production excellence.
Top Skills:
AWS • Azure • Datadog • GCP • Kubernetes • Prometheus • Splunk • Terraform
Productivity
The Senior Site Reliability Engineer will enhance site reliability by monitoring systems, optimizing infrastructure, collaborating on engineering projects, and ensuring system stability.
Top Skills:
AWS • Docker • Kubernetes • Temporal
Software • Generative AI
As a Site Reliability Engineer at Fireworks AI, you'll ensure system reliability, manage incidents, develop monitoring solutions, and reduce operational toil, while collaborating with software engineers to embed reliability in the development lifecycle.
Top Skills:
AWS • Azure • Docker • ELK Stack • GCP • Go • Grafana • Kubernetes • Prometheus • Python
Artificial Intelligence • Machine Learning • Database
The role involves ensuring the reliability and performance of distributed database systems, developing monitoring strategies, and automating operations in a cloud-native environment.
Top Skills:
Ansible • Argo • AWS • Azure • Docker • GCP • GitLab CI • Go • Java • Jenkins • Kubernetes • Python • Terraform
Artificial Intelligence • Machine Learning • Generative AI
The Software Engineer in Reliability will ensure system scalability, reliability, and performance, collaborating with teams to improve infrastructure and handle incidents.
Top Skills:
Cloud Infrastructure • CloudFormation • Container Orchestration Platforms • Containerization Technologies • Datadog • Grafana • IaC Tools • Kubernetes • Microservices Architecture • Observability Tools • Programming Languages • Prometheus • Service Mesh Technologies • Splunk • Terraform
Software
Responsible for deploying observability platforms and automating their operation, developing software for system reliability, and leading cross-team collaboration on monitoring solutions.
Top Skills:
Ansible • Go • Kubernetes • Prometheus • PromQL • Terraform
Artificial Intelligence • Big Data • Machine Learning • Software
The role involves designing and implementing custom installations of the C3 AI Platform for Federal customers, ensuring uptime, and automating system processes while collaborating with cross-functional teams.
Top Skills:
Ansible • AWS • Azure • Bash • Kubernetes • Linux • Puppet • Python • Ruby • Terraform
Fintech
The Principal Site Reliability Engineer designs and implements software to enhance application performance and resilience while ensuring security standards. Responsibilities include automating application management, providing observability, and leading cross-functional teams. Mentorship and on-call rotation participation are expected.
Top Skills:
Aurora • AWS • Chef • Docker • DynamoDB • Git • Go • Java • Jenkins • JMS • Kafka • Kubernetes • Maven • Memcached • Oracle • Python • Redis • SQS • Swarm
Fintech • Software
The SRE is responsible for building cloud-native platforms, improving application reliability, and fostering collaboration within teams.
Top Skills:
CI/CD • Kubernetes • OpenShift • OpenStack • Prometheus • Splunk • VMware
Artificial Intelligence • Machine Learning • Natural Language Processing • Software • Generative AI
The Site Reliability Engineer will develop, deploy, and operate AI infrastructure, focusing on high-performance and scalable machine learning systems using Kubernetes and cloud platforms.
Top Skills:
AWS • Azure • C++ • GCP • Go • Kubernetes • OCI
Big Data • Cloud • Marketing Tech • Social Impact • Software
As a Senior Site Reliability Engineer, you will support product deployments, provide engineering support, maintain systems, and collaborate with teams globally to enhance infrastructure reliability.
Top Skills:
AWS • Cassandra • CircleCI • DynamoDB • GCP • Go • Jenkins • Kubernetes • NoSQL Databases • Python • ScyllaDB • SingleStore DB • Terraform
Big Data • Cloud • Software • Database
The Senior Site Reliability Engineer will support, maintain and grow the Atlas platform, focusing on automating processes and running multi-cloud environments.
Top Skills:
AWS • Azure • DNS • GCP • Go • HTTP • Linux • Python • Ruby • TLS
Artificial Intelligence • Software
As a Senior/Staff Network Reliability Engineer, you'll optimize and maintain Fluidstack's network platform, ensuring performance and reliability for AI and HPC workloads. Responsibilities include tuning networking protocols, deploying and validating switches, automating telemetry, conducting root-cause analyses, and collaborating with vendors.
Top Skills:
BGP • DPDK • eBPF • EVPN • Geneve • Go • Python • RDMA • Rust • TCP/IP • VXLAN • XDP
Artificial Intelligence • Information Technology
As a Site Reliability Engineer, maintain user-facing services, implement best practices for reliability, and manage production incidents.
Top Skills:
Ansible • Cloud Services • Kubernetes • Programming Languages • Terraform
Artificial Intelligence • HR Tech • Professional Services
As a Senior Site Reliability Engineer, you will leverage AI tools, manage AWS cloud infrastructure, operate Kubernetes clusters, and collaborate on reliability enhancements, automating incident responses and improving metrics.
Top Skills:
AWS • Datadog • Kubernetes • OpenTelemetry • Prometheus • Terraform
Artificial Intelligence • Healthtech • Information Technology • Software
As a Site Reliability Engineer, you will manage the production environment, focusing on infrastructure design, automation, and optimizing deployment pipelines to ensure high availability.
Top Skills:
Helm • Kafka • Kubernetes • Postgres • Python • Redis • Terraform • TypeScript
Consumer Web • Mobile
As a Site Reliability Engineer at Patreon, you'll improve AWS infrastructure, implement SRE practices, enhance Kubernetes capabilities, and develop automation tools.
Top Skills:
Ansible • AWS • Chef • Kubernetes • Puppet • Python • Terraform
Financial Services
The Senior Cluster Site Reliability Engineer will enhance the research compute cluster's uptime, reliability, and performance through engineering and operational improvements, ensuring high availability for researchers working on machine learning problems.
Top Skills:
Ansible • AWS • Ceph • Docker • ELK • GCP • Grafana • Horovod • HPC • InfiniBand • Kubeflow • Kueue • Loki • Lustre • MLflow • OpenTelemetry • Podman • Prometheus • Python • RDMA • Ruby • S3 • Singularity • Slurm • Terraform
Artificial Intelligence • Cloud • Information Technology • Software
The Senior Site Reliability Engineer is responsible for managing AI infrastructure, ensuring reliability through scalability, incident response, and collaboration with suppliers, focusing on Kubernetes and advanced GPU services.
Top Skills:
Ansible • Bash • Grafana • Kubernetes • Prometheus • Python
Information Technology
As a Site Reliability Engineer, you'll design and operate scalable storage systems and optimize performance for AI research data management.
Top Skills:
Go • Kubernetes • Pulumi • Rust
Energy
The Site Reliability Engineer will design and implement scalable systems, automate IT infrastructure management, and support deployed systems, ensuring high availability and performance.
Top Skills:
Active Directory • Ansible • AWS • Azure • Chef • JSON • Linux • Puppet • Python • REST • VMware • Windows Server • YAML