Get the job you really want.
Top Senior Site Reliability Engineer Jobs in San Francisco, CA
Software
The role involves managing compute infrastructure for decentralized applications, requiring critical thinking, documentation skills, and experience in Kubernetes and blockchain management.
Top Skills:
Blockchain, Gitops, Infrastructure-As-Code, Kubernetes, Programming Languages
Artificial Intelligence • Security • Software
You will develop and improve cloud infrastructure, support distributed systems, and write infrastructure-as-code while collaborating across teams.
Top Skills:
AWS, CloudFormation, Docker, Go, Java, Kubernetes, Python, Terraform
Artificial Intelligence • eCommerce • Retail
Lead the SRE and DevOps team, ensure infrastructure reliability, oversee cloud operations, drive automation, and collaborate cross-functionally.
Top Skills:
Azure, Bash, Ci/Cd, Datadog, Docker, Elk Stack, Go, Grafana, Kubernetes, Powershell, Prometheus, Python, Terraform
Aerospace • Big Data • Greentech • Hardware • Social Impact
Design, deploy, and operate compute services for on-premises and cloud satellite imaging platforms. Build reproducible, scalable, highly available deployments, troubleshoot distributed systems, optimize constrained environments, document and automate operations, and participate in on-call rotations to ensure reliability for customer-facing and air-gapped deployments.
Top Skills:
Alloy, Ansible, Bash, Cuda, Gitops, Grafana, Helm, JIRA, K3S, Kubernetes, Kustomize, Opentelemetry, Prometheus, Proxmox, Python, Rke2, Talos, Terraform
Software
Join the SRE team to improve monitoring, alerting, observability, and reliability of Fireblocks' production systems. Triage incidents, run RCA, create runbooks and automation (Python, Lambda, shell, Ansible, ArgoCD), collaborate with R&D/support, and participate in on-call rotation.
Top Skills:
Ansible, Argocd, AWS, Aws Lambda, Azure, Bash, Bitbucket, C++, Chef, Coralogix, Datadog, Docker, Gerrit, Git, Gitlab, GCP, Helm, JavaScript, Kubernetes, Linux, MySQL, New Relic, Nginx, Node.js, Phabricator, Prometheus, Puppet, Python, Shell, Splunk
Real Estate • Financial Services • PropTech
As a Site Reliability Engineer, you will support AWS Cloud products, optimize processes, enhance automation, and ensure system reliability and performance.
Top Skills:
Argocd, AWS, Azure Devops, Bash, Ci/Cd, Cloudwatch, Docker, Eks, Fluxcd, Git, Kubernetes, Powershell, Python, SQL, Terraform
Artificial Intelligence • Big Data • Machine Learning • Software
The role involves designing and implementing custom installations of the C3 AI Platform for Federal customers, ensuring uptime, and automating system processes while collaborating with cross-functional teams.
Top Skills:
Ansible, AWS, Azure, Bash, Kubernetes, Linux, Puppet, Python, Ruby, Terraform
Cloud • Software • Analytics
The Principal Cloud Site Reliability Engineer will lead the design and implementation of cloud infrastructure, manage CI/CD pipelines, mentor teams, and ensure secure, performant systems in AWS and Azure environments.
Top Skills:
Ansible, AWS, Azure, Bash, Chef, Docker, Elk, Grafana, Jenkins, Kubernetes, MongoDB, MySQL, Postgres, Prometheus, Puppet, Python, Rds, Salt, Terraform
Cloud • Software
In this role, you'll support large-scale applications, improve observability, mentor team members, and ensure reliability by collaborating on deployments and writing automation scripts while providing 24/7 support.
Top Skills:
Ansible, AWS, Bash, Confluence, Docker, Elk Stack, GCP, Gitlab Cicd, Grafana, Jenkins, JIRA, Kubernetes, Linux, MongoDB, MySQL, Nagios, Oci, Perl, Postgres, Prometheus, Puppet, Python, Terraform
Software
Lead SRE to define SRE strategy, architecture, and roadmap; design and operate containerized, compliant cloud environments; build observability, incident management, automation, and developer platform capabilities; mentor SRE team and collaborate with security, compliance, and product teams to ensure reliability at scale.
Top Skills:
AWS, Aws Marketplace, Azure, Azure Marketplace, GCP, Google Cloud Marketplace, Grafana, Kubernetes, Prometheus, Terraform
Computer Vision • Information Technology • Machine Learning • Natural Language Processing • Real Estate • Software
The SRE will maintain infrastructure for SaaS products on AWS, support developers, manage platform components, and handle IT tasks.
Top Skills:
AWS, Computer Vision, Iac, Large Language Models, Nlp, Terraform
Artificial Intelligence • Information Technology
As a Site Reliability Engineer, maintain user-facing services, implement best practices for reliability, and manage production incidents.
Top Skills:
Ansible, Cloud Services, Kubernetes, Programming Languages, Terraform
Artificial Intelligence • Machine Learning • Natural Language Processing • Software • Generative AI
The Site Reliability Engineer will develop, deploy, and operate AI infrastructure, focusing on high-performance and scalable machine learning systems using Kubernetes and cloud platforms.
Top Skills:
AWS, Azure, C++, GCP, Go, Kubernetes, Oci
Artificial Intelligence • Healthtech • Information Technology • Software
As a Site Reliability Engineer, you will manage the production environment, focusing on infrastructure design, automation, and optimizing deployment pipelines to ensure high availability.
Top Skills:
Helm, Kafka, Kubernetes, Postgres, Python, Redis, Terraform, Typescript
Information Technology • Software
As a DevOps Engineer, you'll design and scale secure systems, manage AWS environments, automate operations, and ensure operational excellence for revenue teams.
Top Skills:
Amazon Aurora, AWS, Docker, DynamoDB, Github Actions, Kafka, S3, Snowflake, Spark, Sqs, Terraform
Financial Services
The Senior Cluster Site Reliability Engineer will enhance the research compute cluster's uptime, reliability, and performance through engineering and operational improvements, ensuring high availability for researchers working on machine learning problems.
Top Skills:
Ansible, AWS, Ceph, Docker, Elk, GCP, Grafana, Horovod, Hpc, Infiniband, Kubeflow, Kueue, Loki, Lustre, Mlflow, Opentelemetry, Podman, Prometheus, Python, Rdma, Ruby, S3, Singularity, Slurm, Terraform
Artificial Intelligence • Information Technology • Machine Learning • Software • Cybersecurity • Generative AI • Data Privacy
Lead global SRE and infrastructure teams to ensure reliability, scalability, and cost-efficiency of production and developer platforms. Define cloud and Kubernetes architecture, IaC, CI/CD, SLOs/SLIs, incident management, and cloud cost optimization while partnering with Security, Product, Finance, and Engineering.
Top Skills:
AI, Automation, AWS, Ci/Cd, Cloud-Native Systems, GCP, Infrastructure As Code, Kubernetes, Terraform
Computer Vision • Machine Learning • Software
As a Site Reliability Engineer, ensure the reliability, performance, and scalability of Ditto's cloud infrastructure by developing observability solutions, leading incident management, and collaborating with product engineering teams.
Top Skills:
AWS, Azure, C, Datadog, GCP, Go, Grafana, Helm, Java, Kubernetes, Prometheus, Rust, Terraform
Software
As a Site Reliability Engineer, you will manage the reliability and scalability of platform infrastructure, build observability tools, and automate processes to enhance operational excellence.
Top Skills:
AWS, GCP, Go, Kubernetes, Pulumi, Python, Terraform
Fintech
The Principal Site Reliability Engineer designs and implements software to enhance application performance and resilience while ensuring security standards. Responsibilities include automating application management, providing observability, and leading cross-functional teams. Mentorship and on-call rotation participation are expected.
Top Skills:
Aurora, AWS, Chef, Docker, Dynamo Db, Git, Go, Java, Jenkins, Jms, Kafka, Kubernetes, Maven, Memcached, Oracle, Python, Redis, Sqs, Swarm
Artificial Intelligence • Healthtech • Software
The Staff Site Reliability Engineer will lead the reliability of production systems by defining SRE practices, improving observability, and ensuring fault-tolerance in cloud environments.
Top Skills:
AWS, Go, Kubernetes, Postgres, Python, Terraform, Typescript
Digital Media • Social Media • Software • Sports
Lead the technical architecture and execution of migration to AWS, drive developer enablement, and automate infrastructure using code-first principles.
Top Skills:
Aws Eks, Datadog, Github Actions, Go, Istio, K6, Kubernetes, Node.js, Terraform
Software
As a Site Reliability Engineer, you'll enhance system reliability, collaborate on production readiness, define SLIs/SLOs, and improve incident response.
Top Skills:
AWS, Datadog, Grafana, Kubernetes, Opentelemetry, Prometheus, Typescript
Cloud • Security • Software • Cybersecurity
The Senior Lead Site Reliability Engineer will ensure performance and uptime of security products, develop automation pipelines, and improve monitoring systems, working closely with various teams.
Top Skills:
Azure, Databricks, Docker, Go, Jenkins, Kubernetes, Python, Terraform
Software
Design, implement, and maintain scalable backend systems and APIs; build cloud infrastructure (preferably GCP) using Terraform; operate containerized workloads with Kubernetes; ensure reliability, security, and performance; participate in on-call rotations, architecture discussions, and cross-functional delivery.
Top Skills:
Ci/Cd, Cloud Automation, Container Orchestration, Go, Google Cloud Platform, Iam, Infrastructure As Code, Kubernetes, Microservices, Python, Service-Oriented Architecture, Terraform