Top Senior Site Reliability Engineer Jobs in San Francisco, CA
Artificial Intelligence • Computer Vision • Hardware • Robotics • Metaverse
The Principal Staff SRE will lead initiatives in building and optimizing core infrastructure services on-prem and cloud, deploying and managing services at scale, and improving performance with automation and monitoring tools.
Top Skills:
DHCP, DNS, eBPF, Go, LDAP, Linux, NTP, Python, Terraform, XDP
Software
Lead a team of Support Engineers focused on AI Inference Infrastructure, managing incidents, optimizing performance, and improving operational practices.
Top Skills:
AI, Grafana, Kubernetes, Loki, ML, Prometheus
Software
The Lead Site Reliability Engineer will oversee the architecture and operational excellence of Mattermost's infrastructure, mentoring teams and driving strategic initiatives for performance and reliability in regulated sectors.
Top Skills:
AWS, Grafana, Kubernetes, Prometheus, Terraform
Information Technology
As a Site Reliability Engineer at New Era Technology, you'll focus on ensuring operational efficiency, creating reliable systems, and enhancing service performance through AWS expertise.
Top Skills:
AWS
Fintech
As a Site Reliability Engineer I, you'll enhance the reliability and maintainability of systems, develop applications, manage cloud infrastructure, and contribute to observability practices. You'll also participate in on-call rotations.
Top Skills:
Bash, Cloud Infrastructure, GenAI, Infrastructure as Code, Java, Linux, Python, Unix, Windows
Blockchain • Information Technology • Internet of Things
The Site Reliability Engineer will ensure system reliability, security, and performance by implementing infrastructure as code, CI/CD, and monitoring solutions.
Top Skills:
AWS, Azure, Bash, GCP, Go, Kubernetes, Python, Rust, Terraform
Artificial Intelligence • Machine Learning • Generative AI
As a Site Reliability Engineer, you will manage Kubernetes clusters, automate infrastructure, improve operational metrics, and enhance reliability across data centers.
Top Skills:
CloudFormation, Go, GPU, Kubernetes, Linux, Python, Terraform
Aerospace • Manufacturing
As a Site Reliability Engineer, you'll build and manage observability platforms for satellite communications, define SLOs/SLIs, and collaborate on incident response and deployment automation.
Top Skills:
Argo CD, AWS, ELK, GCP, Go, Grafana, Istio, Jaeger, Kubernetes, Linkerd, Loki, OpenTelemetry, Prometheus, Python, Tempo, Terraform
Aerospace • Manufacturing
The Staff Site Reliability Engineer will design and manage Aalyria's centralized observability platform, focus on metrics, logging, and tracing systems, implement SLOs and SLIs, automate deployments, and drive incident response strategies for enhanced reliability across satellite and cloud platforms.
Top Skills:
AWS, ELK, GCP, GitOps, Go, Grafana, Jaeger, Java, Kubernetes, Loki, OpenTelemetry, Prometheus, Python, Tempo, Terraform
Automotive
Design and implement scalable cloud infrastructure, monitor performance, automate processes, ensure security and compliance, and lead a DevOps team.
Top Skills:
AWS, Bash, CI/CD, Docker, ELK Stack, GCP, Grafana, Kubernetes, Prometheus, Python, Terraform
Artificial Intelligence • Information Technology • Logistics • Machine Learning • Software
Lead reliability initiatives for the production platform, manage incident response, define SLIs/SLOs, and enhance security by embedding it into delivery pipelines. Drive platform improvements in AWS and CI/CD processes.
Top Skills:
Aurora, AWS, Bazel, CI/CD, Dagster, dbt, DuckDB, DynamoDB, ECS, Java, JavaScript, Kubernetes, Python, Spacelift, SQS, SSM, Terraform, Trino, TypeScript
Big Data • Healthtech • Information Technology • Analytics
As a Lead Site Reliability Engineer, you'll design and manage scalable cloud infrastructure on GCP, optimize CI/CD processes, and ensure system reliability through observability and incident response, while mentoring others in a cross-product SRE group.
Top Skills:
Bash, GitLab CI/CD, GKE, Google Cloud Platform, Jenkins, Python, Sentry, Sumo Logic, Terraform
Artificial Intelligence
The Staff/Lead/Senior/Principal Site Reliability Engineer will establish SRE practices, ensure platform reliability, and support infrastructure scaling for enterprise AI workloads.
Top Skills:
AWS, Better Stack, CloudWatch, GitHub Actions, Grafana, Kubernetes, MongoDB, PagerDuty, Postgres, Prometheus, Terraform
Blockchain • Fintech • Payments • Financial Services • Cryptocurrency • Web3
The Site Reliability Engineer will build and maintain infrastructure, improve software systems, develop scalable microservices, and ensure quality software delivery.
Top Skills:
AWS, Go, Google Cloud Platform, Java, Kubernetes, Azure, SQL
Artificial Intelligence • Cloud • Software
As a Senior Site Reliability Engineer, you will manage and optimize AI-optimized compute services, ensure reliability, and implement monitoring systems. Responsibilities include creating Ansible playbooks, managing incident responses, and collaborating with suppliers to maintain system stability.
Top Skills:
Ansible, Bash, Grafana, Linux, Prometheus, Python
Cloud • Fintech • Information Technology • Software • Business Intelligence
As a Site Reliability Engineer, you will ensure production system reliability, optimize performance, respond to incidents, and collaborate on infrastructure improvements.
Top Skills:
Ansible, AWS, Bash, Datadog, Docker, ELK, Git, Grafana, Kubernetes, New Relic, OpenTelemetry, Prometheus, Python, React, Ruby, Ruby on Rails, Terraform
Information Technology • Software • Web3
As a Software Engineer focused on SRE and DevSecOps, you will design scalable infrastructure, implement CI/CD pipelines, and automate processes while collaborating with teams to enhance performance and security.
Top Skills:
Ansible, Bash, Datadog, Docker, GCP, Grafana, Kubernetes, Python, React, Rust, Solidity, Terraform, Web3
Cloud • Security • Software
The Site Reliability Engineer will design, automate and scale cloud infrastructure while ensuring uptime, performance, and security best practices.
Top Skills:
Ansible, AWS, Azure, Chef, Docker, GCP, Go, JavaScript, Kubernetes, Linux, Puppet, Python, Ruby, SaltStack, Terraform
AdTech • Marketing Tech • Analytics
As a Staff Software Engineer - SRE, you'll manage cloud infrastructure, improve application reliability, collaborate across teams, and support back-office systems.
Top Skills:
AWS, Datadog, Docker, Kafka, Kibana, Kubernetes, Linux, Postgres, Python, RDS, Redshift, Shell/Bash, Spark, Terraform
Hardware • Machine Learning • Security • Software
The Site Reliability Engineer will manage software deployment for IoT devices, improve observability, maintain dashboards, automate processes, and collaborate on incident responses.
Top Skills:
Ansible, AWS, Bash, C/C++, Datadog, Grafana, Groovy, Java, JavaScript, NoSQL, Postgres, Prometheus, Python, R, Sigma, SQL, Terraform
Artificial Intelligence • Cloud • Fintech • Machine Learning • Mobile • Software
The Staff Site Reliability Engineer will design, implement, and optimize infrastructure for AI services, ensure reliability and performance, and drive automation and observability excellence across engineering teams.
Top Skills:
Azure, Azure DevOps, Docker, ELK Stack, GitHub Actions, Grafana, Kubernetes, Mimir, Postgres, Prometheus, SQL Server, TeamCity, Terraform
Greentech • Software • Energy
This role involves managing cloud infrastructure, improving system reliability, automation, incident response, and mentoring engineers, requiring deep technical expertise and leadership skills.
Top Skills:
AWS, Bash, Datadog, Docker, GCP, JavaScript, Kubernetes, Linux, Python, TypeScript
Software
Lead and manage engineering teams for ConductorOne's cloud infrastructure, ensuring reliability, security, and compliance while fostering team growth and culture.
Top Skills:
AI, CI/CD, Cloud Infrastructure, Kubernetes, Security Compliance (SOC 2, ISO 27001)
Security • Software • Cybersecurity
Seeking a Site Reliability Engineer to manage software development tools for DevOps, optimize workflows, and ensure system performance and reliability while integrating AI-driven solutions.
Top Skills:
Artifactory, AWS, Azure, Bash, ClickUp, Confluence, Docker, Figma, FullStory, GCP, Git, Grafana, Jira, Kubernetes, Power BI, Prometheus, Python, Splunk, Terraform
Artificial Intelligence • Generative AI
Lead GPU cluster design and operations, manage Kubernetes, implement Infrastructure-as-Code, and develop observability stacks for high-performance AI models.
Top Skills:
Ansible, Argo CD, Bash, eBPF, Flux, GitOps, Grafana, Helm, InfiniBand, Kubernetes, NVIDIA DCGM, OpenTelemetry, Prometheus, Python, RDMA, Terraform