Maximum of 25 job preferences reached.
Top Senior Site Reliability Engineer Jobs in San Francisco, CA
Artificial Intelligence • Machine Learning • Natural Language Processing • Software • Generative AI
The Site Reliability Engineer will develop, deploy, and operate AI infrastructure, focusing on high-performance and scalable machine learning systems using Kubernetes and cloud platforms.
Top Skills:
AWSAzureC++GCPGoKubernetesOci
Artificial Intelligence • Machine Learning • Software • Industrial
The Site Reliability Engineer will own backend services and cloud infrastructure, focusing on system reliability, scalability, and operational excellence for Archetype AI's platform.
Top Skills:
AWSAzureC++CloudFormationCudaGCPKubernetesPulumiPythonPyTorchRustTerraform
Artificial Intelligence • Machine Learning • Biotech • Generative AI
The Site Reliability Engineer will manage digital infrastructure, ensuring access to compute resources, automating processes, and maintaining resource visibility for researchers.
Top Skills:
AnsibleDockerGrafanaKubernetesPrometheusPythonTailscaleTalos Linux
Software
Design, implement, and maintain scalable backend systems and APIs; build cloud infrastructure (preferably GCP) using Terraform; operate containerized workloads with Kubernetes; ensure reliability, security, and performance; participate in on-call rotations, architecture discussions, and cross-functional delivery.
Top Skills:
Ci/CdCloud AutomationContainer OrchestrationGoGoogle Cloud PlatformIamInfrastructure As CodeKubernetesMicroservicesPythonService-Oriented ArchitectureTerraform
Computer Vision • Machine Learning • Software
As a Site Reliability Engineer, ensure the reliability, performance, and scalability of Ditto's cloud infrastructure by developing observability solutions, leading incident management, and collaborating with product engineering teams.
Top Skills:
AWSAzureCDatadogGCPGoGrafanaHelmJavaKubernetesPrometheusRustTerraform
Digital Media • Social Media • Software • Sports
Lead the technical architecture and execution of migration to AWS, drive developer enablement, and automate infrastructure using code-first principles.
Top Skills:
Aws EksDatadogGithub ActionsGoIstioK6KubernetesNode.jsTerraform
Software
As a Site Reliability Engineer, you'll enhance system reliability, collaborate on production readiness, define SLIs/SLOs, and improve incident response.
Top Skills:
AWSDatadogGrafanaKubernetesOpentelemetryPrometheusTypescript
Edtech
The Lead Software Engineer will lead the SRE team, focusing on reliability, performance optimization, security, and mentoring developers, while improving overall platform resilience.
Top Skills:
ActivejobAnsibleAWSAws CloudwatchEc2EcsElasticsearchGitGCPGoogle Cloud StackdriverJenkinsJIRAKubernetesMemcachedMongoDBNew RelicNode.jsPostgresRedisRuby On RailsSidekiqSpinnakerTerraformTerragrunt
Artificial Intelligence • Cloud • Information Technology • Software
Contribute to the reliability and performance of Mithril's GPU orchestration platform through automation, observability, and infrastructure management. Collaborate with the team to ensure scalability across multi-cloud environments while maintaining systems stability and implementing SLOs.
Top Skills:
AWSAzureGCPGoGrafanaKubernetesLinuxOpentelemetryPrometheusPulumiPythonTcp/IpTerraform
Artificial Intelligence • Information Technology • Software
The Site Reliability Engineer will ensure high availability and performance of CodeRabbit's AI-powered code review platform, enhancing system reliability through infrastructure ownership, performance engineering, and automation.
Top Skills:
AWSDatadogDockerElk StackGoogle Cloud PlatformGrafanaKubernetesLinuxNode.jsPrometheusTerraformTypescript
Artificial Intelligence • Consumer Web • Digital Media • Information Technology • Social Impact • Software
The Senior Site Reliability Engineer will manage system incidents, improve monitoring and logging, optimize database infrastructure, and collaborate on scaling systems efficiently.
Top Skills:
AWSClickhouseKubernetesMySQLPostgresRedis
Healthtech
Develop and implement processes to ensure high availability and reliability of services. Responsibilities include incident management, automation, capacity planning, and risk mitigation.
Top Skills:
AWSAzureDatadogDockerGrafanaJavaScriptNew RelicPrometheusPythonRubySplunkTerraform
New
Cut your apply time in half.
Use ourAI Assistantto automatically fill your job applications.
Use For Free
Internet of Things • Cybersecurity
The Site Reliability Engineer will manage AWS GovCloud infrastructure, ensuring compliance and high availability while driving automation, security, and incident response best practices.
Top Skills:
AnsibleAws GovcloudBashDockerElk StackGitlab Ci/CdGrafanaJenkinsKubernetesPrometheusPythonTerraform
Information Technology • Automation
The SRE/Infrastructure Engineer will architect and manage secure, scalable systems for automated penetration testing, optimizing reliability, and enhancing infrastructure based on customer demand. Responsibilities include maintaining production environments, leading technical discussions, and promoting high coding standards.
Top Skills:
AWSAzureCloudFormationElkGCPNew RelicOpentelemetryPostgresPrometheusTerraform
Artificial Intelligence • Software
The SRE at Fluidstack is responsible for ensuring infrastructure reliability and performance, handling complex production issues, and improving platform stability.
Top Skills:
AnsibleBashGoKubernetesPythonSlurmTerraform
Artificial Intelligence • Software • Generative AI
The Lead Site Reliability Engineer will drive technical strategy, ensure high service availability, manage cloud infrastructure, and lead a team to optimize systems and automate processes.
Top Skills:
AWSAzureDockerGoogle Cloud PlatformKubernetesTerraform
Healthtech • Software
The SRE Technical Project Manager will lead project delivery, incident management, automation processes, and uptime communication, partnering with SRE and development teams to ensure system stability and scalability.
Top Skills:
Ai BotsDatadogJIRAJira Service ManagementMs TeamsOpsgeniePagerduty
13 Days AgoSaved
Easy Apply
Easy Apply
Artificial Intelligence • Blockchain • Fintech • Financial Services • Cryptocurrency • NFT • Web3
The role involves leading AI product development, enhancing CI/CD frameworks, automating IT workflows, supporting AWS services, and driving cloud security best practices.
Top Skills:
AnsibleAWSBashChefCi/CdDockerGitKubernetesPuppetPythonRubySaltTerraform
Software • Cryptocurrency
Manage and scale Kubernetes clusters, automate infrastructure, optimize performance, maintain blockchain nodes, and improve system reliability while collaborating with product teams.
Top Skills:
Aws (Ec2Aws EksDatadogDockerIam)KubernetesOpentelemetryPulumiRdsS3Terraform
Cloud • Information Technology
As a Staff Site Reliability Engineer, you will enhance cloud product lines, ensuring real-time scalability, collaborating with teams, and automating builds.
Top Skills:
AnsibleAWSAzureBashDnsDockerEnvoyGCPGitGoGrafanaHaproxyHTTPJenkinsKafkaKubernetesLinuxMySQLOciOpentelemetryPostgresPrometheusPuppetPythonRedisTcp/IpTelegrafTerraformTls
Software
As a Site Reliability Engineer, you will enhance system reliability, manage cloud services, respond to incidents, and support network systems.
Top Skills:
AutomationCisco RoutingCloud ServicesF5 Load BalancingFortinet FirewallsInfrastructure AutomationMonitoringNetworking
Reposted 22 Days AgoSaved
Easy Apply
Easy Apply
Big Data • Cloud • Software • Database
Develop and maintain Kubernetes runtime environments, support developers, resolve critical issues, and participate in on-call rotations for production systems.
Top Skills:
AWSAzureCert-ManagerCorednsCrdsCriCsiGatekeeperGCPGoHelmKubernetesKustomizeOperatorsPythonTerraform
Cloud • Security • Cybersecurity
As a Junior Site Reliability Engineer, you will support cloud operations, implement automation for cloud infrastructure, and ensure system reliability and security.
Top Skills:
AnsibleAWSAzureBashElastic StackGCPJIRAPowershellPythonServicenowSplunkTerraform
Artificial Intelligence • Software
As a Senior Staff SRE Tech Lead, you'll oversee reliability and scalability, mentor engineers, optimize systems, and enhance data infrastructure.
Top Skills:
ClickhouseGoPostgresPythonTypescript
Information Technology • Mobile • Software
As a Site Reliability Engineer, you'll ensure system reliability and scalability, automate processes, optimize performance, and collaborate on system design.
Top Skills:
AWSAzureBashCloudFormationDatadogDockerElkGoGoogle Cloud PlatformGrafanaHelmKubernetesNew RelicPrometheusPulumiPythonTerraform
Let Your Resume Do The Work
Upload your resume to be matched with jobs you're a great fit for.
Success! We'll use this to further personalize your experience.
Popular Job Searches
All Filters
Total selected ()
No Results
No Results












.png)



.png)

















