Top Reliability Engineer Jobs in San Francisco, CA
Artificial Intelligence • Fintech • Machine Learning • Social Impact • Software
Lead technical direction for software architecture and cross-team initiatives focusing on scaling consumer-facing systems and maximizing loan originations while maintaining compliance and system integrity.
Top Skills:
AWS • CI/CD • Docker • GitHub Actions • Infrastructure as Code • React • Ruby on Rails
Hardware • Healthtech • Machine Learning • Software
Lead reliability engineering for electromechanical systems, including testing and validation of hardware to ensure performance and durability.
Top Skills:
Accelerated Life Testing • FMEA • HALT/HASS • JMP • LabVIEW • MATLAB • Python • SPC • Weibull Analysis
Artificial Intelligence • Healthtech
The Site Reliability Engineer will enhance system reliability, define observability standards, respond to incidents, and collaborate with engineering teams on performance and compliance improvements.
Top Skills:
AWS • Containerized Services • Distributed Workflows • Observability Tooling • Postgres • Serverless Compute
Energy • Renewable Energy
The Staff Reliability Engineer will ensure hardware reliability in high-voltage electronics, develop reliability test programs, and collaborate on design and testing across teams.
Top Skills:
HV Electronics • Power Conversion • Python
Artificial Intelligence • Cloud • Software • Infrastructure as a Service (IaaS)
The Site Reliability Engineer will ensure the reliability and performance of AI infrastructure, build core systems, handle incident response, and develop automation tools.
Top Skills:
AWS • Datadog • ELK • GCP • GitHub Actions • GitLab CI • Go • Grafana • Jenkins • Kubernetes • Linux • Prometheus • Pulumi • Python • Rust • Terraform
Software
The Site Reliability Engineer will enhance reliability, observability, and incident response of You.com's production services, while collaborating with teams to implement best practices and improve operational efficiency through tooling and automation.
Top Skills:
AWS • Bash • CI/CD • EKS • GHA • Git • Grafana • OpenTelemetry • Prometheus • Python • Terraform
Reposted 24 Days Ago
Fintech • Financial Services
The Staff Infrastructure Reliability Engineer leads Redfin's production database and storage systems, collaborating on strategies for reliability, scalability, and performance, while mentoring engineers and guiding complex technical discussions.
Top Skills:
AWS • AWS Aurora • AWS RDS • AWS S3 • DynamoDB • ElastiCache • OpenSearch • Postgres • Python • RDBMS
Fintech
As a Senior Site Reliability Engineer, you will ensure the reliability, scalability, and security of Prosper's Cloud Platform while designing AI-assisted operations and mentoring junior engineers.
Top Skills:
APM • CI/CD • Cloud • Infrastructure as Code • Kubernetes
Big Data • Cloud • Digital Media • Machine Learning • Mobile • Software • Industrial
Lead reliability for Autodesk GovCloud services by deploying, operating, and automating production systems. Define SLOs/SLIs, build observability and automation, run incident response and on-call rotation, ensure compliance (FedRAMP), perform resilience testing and toil reduction, and collaborate across engineering, security, and platform teams to improve service reliability and operability.
Top Skills:
APIs • AWS • AWS GovCloud • Azure • Bash • Caching Technologies • CI/CD • CloudWatch • Containers • Databases • Datadog • DNS • Dynatrace • FedRAMP • Go • IL4 • IL5 • Infrastructure as Code • Java • Kubernetes • Load Balancing • Messaging Systems • Networking • PowerShell • Python • Splunk • Storage Platforms
Social Media
Operate, scale, and improve a cloud-native platform on AWS and Kubernetes. Manage GitOps deployments with ArgoCD and Helm, provision infra with Terraform/Terragrunt, build CI/CD automation, enhance observability, respond to incidents, reduce operational toil through scripting, and collaborate with security and application teams to improve reliability and platform guardrails.
Top Skills:
ArgoCD • AWS • Bash • Containers • EKS • GitHub Actions • GitOps • Helm • IAM • Kubernetes • Linux • Python • Terraform • Terragrunt
Legal Tech • Software
Lead automation and optimization of Filevine's data platform: performance tune MSSQL/Postgres, optimize Snowflake, provision infrastructure with Terraform/AWS, run stateful containers on Kubernetes, integrate AI/LLM and MCP for operational automation, manage CI/CD, capacity planning, documentation, and serve in 24/7 on-call rotation.
Top Skills:
AWS • C# • Dapper • Docker • DynamoDB • Entity Framework • GitLab • Kubernetes • LLMs • MCP (Model Context Protocol) • Microsoft SQL Server (MSSQL) • Octopus Deploy • OpenSearch • Postgres • PowerShell • Python • Redis • Snowflake • Terraform
Artificial Intelligence • Information Technology
The Site Reliability Engineer will drive reliability for the Tinker platform, focusing on incident response, monitoring, and ensuring system resilience while collaborating across teams.
Top Skills:
Cloud Infrastructure • Kubernetes
Fintech • Software
As a Senior Site Reliability Engineer, you'll build and scale internal platform offerings, design monitoring systems, and collaborate with software engineers to ensure application performance and reliability.
Top Skills:
Ansible • AWS • CloudFormation • Datadog • Docker • ELK Stack • Grafana • gRPC • Java • Kubernetes • Postgres • Prometheus • Python • Terraform
Cloud
The role involves building and managing observability infrastructure in GCP, automating deployments, and optimizing data processes for high reliability.
Top Skills:
GKE • Go • GCP • Grafana • Kubernetes • OpenTelemetry • Python • Ruby • Splunk • Terraform
Artificial Intelligence • Software
Design, build, and scale control- and data-plane infrastructure for distributed AI workloads. Improve reliability, performance, scheduling, and observability for Ray clusters across cloud and on-prem environments. Support accelerator integration, container image management, and provide on-call troubleshooting and cross-team collaboration.
Top Skills:
AWS • Azure • Containers • GCP • Go • GPUs • Grafana • Kubernetes • Linux • Prometheus • Python • Ray • TPUs • VMs
Software
As a Senior DevOps / Platform Reliability Engineer, you will manage CI/CD pipelines, automate infrastructure, operate Kubernetes, and enhance observability while ensuring security and compliance for enterprise systems.
Top Skills:
Argo CD • Aurora MySQL • AWS • Bash • CloudFormation • EKS • ElastiCache • GitHub Actions • Grafana • Kubernetes • Linux • MSK • OpenTelemetry • Prometheus • Python • S3 • Terraform
Artificial Intelligence
The SRE/Infrastructure Engineer will manage Terraform and Kubernetes across cloud platforms, ensuring scalable infrastructure. Responsibilities include multi-cloud deployments, observability, and creating reusable components.
Top Skills:
AWS • Azure • Cloudflare • GCP • Kubernetes • Terraform
Digital Media • Gaming • News + Entertainment • Sports
As a Sr Principal Site Reliability Engineer, you will ensure maximum platform availability, lead incident response processes, drive automation, and collaborate across teams to optimize system performance and operational efficiency.
Top Skills:
Automation Tools • Cloud Technologies • Content Delivery Networks • Media Streaming Technologies • Monitoring Tools
Artificial Intelligence • Big Data • Machine Learning • Software
The role involves designing and implementing custom installations of the C3 AI Platform for Federal customers, ensuring uptime, and automating system processes while collaborating with cross-functional teams.
Top Skills:
Ansible • AWS • Azure • Bash • Kubernetes • Linux • Puppet • Python • Ruby • Terraform
Artificial Intelligence • Machine Learning • Database
The role involves ensuring the reliability and performance of distributed database systems, developing monitoring strategies, and automating operations in a cloud-native environment.
Top Skills:
Ansible • Argo • AWS • Azure • Docker • GCP • GitLab CI • Go • Java • Jenkins • Kubernetes • Python • Terraform
Artificial Intelligence • Healthtech • Software • Automation
Design, build, and operate Optura's multi-cloud, HIPAA-aware platform: run Kubernetes across cloud and customer on-prem/air-gapped environments, create unified deployment tooling (Helm/operators/GitOps), own SLOs/capacity/incident response, drive reliability, implement identity/networking/security controls, and build IaC/GitOps patterns in partnership with product and security teams.
Top Skills:
AKS • Argo CD • AWS • Azure • Backstage • Cluster API • Crossplane • Distributed Tracing • EKS • GCP • GitOps • GKE • Go • Grafana • Helm • KMS • Kubernetes • mTLS • OIDC • OpenShift • OpenTelemetry • Operators • Prometheus • Pulumi • Python • Rancher • Replicated • Secrets Management • Service Mesh • Talos • Terraform • VPC
Reposted 23 Days Ago
Big Data • Cloud • Software • Database
The Senior Site Reliability Engineer will develop and support distributed storage services, ensuring reliability and operational safety, with a focus on automation and efficiency.
Top Skills:
AWS • Azure • DNS • Go • Google Cloud Platform • Kubernetes • Linux • Python • TCP/IP • TLS
Big Data • Cloud • Software • Database
Seeking a Site Reliability Engineer with expertise in networking and distributed systems for building secure multi-cloud infrastructure. Responsibilities include maintaining network architecture and ensuring reliable service-to-service communication, involving a 24/7 on-call rotation.
Top Skills:
AWS • Azure • BGP • DNS • GCP • IPv6 • Kubernetes • Load Balancing • mTLS • Service Mesh • TCP/IP • TLS • VPCs • VPNs
Artificial Intelligence • Machine Learning • Generative AI
The Software Engineer in Reliability will ensure system scalability, reliability, and performance, collaborating with teams to improve infrastructure and handle incidents.
Top Skills:
Cloud Infrastructure • CloudFormation • Container Orchestration Platforms • Containerization Technologies • Datadog • Grafana • IaC Tools • Kubernetes • Microservices Architecture • Observability Tools • Programming Languages • Prometheus • Service Mesh Technologies • Splunk • Terraform
Artificial Intelligence • Healthtech • Information Technology • Software
As a Site Reliability Engineer, you will manage the production environment, focusing on infrastructure design, automation, and optimizing deployment pipelines to ensure high availability.
Top Skills:
Helm • Kafka • Kubernetes • Postgres • Python • Redis • Terraform • TypeScript