Top Infrastructure Engineer Jobs in San Francisco, CA
Artificial Intelligence • Natural Language Processing • Generative AI
As a Staff Infrastructure Engineer, you'll define and drive strategies for cloud-based compute clusters, ensuring secure, reliable, and scalable infrastructure. Responsibilities include lifecycle management, collaborating with cross-functional teams, and mentoring engineers.
Top Skills:
AWS, Azure, GCP, Go, Kubernetes, Python, Rust, Terraform
Artificial Intelligence • Machine Learning • Generative AI
The role focuses on building and maintaining infrastructure for ML training. Responsibilities include API design, improving performance, and debugging across systems.
Top Skills:
Distributed Systems, GPUs, Networking, Python, PyTorch, Storage
Software
The intern will join the Design Verification Infrastructure team to develop and maintain the Verification Platform, enhancing design verification technologies through collaboration and software development in Scala and Python.
Top Skills:
Chisel, CIRCT, EDA Tools, Python, Scala
Artificial Intelligence • Productivity • Software
As a core engineer on the Web Infrastructure team, you will enhance Notion's web client performance and development speed by improving load times, interaction latency, and providing tooling for product engineers.
Top Skills:
React, Webpack
Artificial Intelligence • Fintech • Machine Learning • Natural Language Processing • Payments • Software • Financial Services
The Lead Voice Infrastructure Engineer will design and operate telephony services, improve core workflows, and enhance reliability within AI-driven communication systems.
Top Skills:
C, C++, Go, PSTN, Python, RTP, SIP, WebRTC
Artificial Intelligence
The SRE/Infrastructure Engineer will manage Terraform and Kubernetes across cloud platforms, ensuring scalable infrastructure. Responsibilities include multi-cloud deployments, observability, and creating reusable components.
Top Skills:
AWS, Azure, Cloudflare, GCP, Kubernetes, Terraform
Cloud • Information Technology • Machine Learning
Lead end-to-end technical delivery of large-scale bare-metal GPU clusters for strategic customers: facility/rack design, GPU cluster bring-up, InfiniBand/RoCE fabric validation, HPC benchmarking and remediation, operational models for BMaaS, and cross-team product feedback. Act as primary technical customer contact, run proofs-of-concept, collaborate with engineering teams, and support security-sensitive, production-ready supercomputers.
Top Skills:
Ansible, Bare Metal as a Service (BMaaS), Bash, BIOS, BMC, Firmware, GB200, GPU Clusters, High-Speed Fabric, HPC, ib_write_bw, InfiniBand, Kubernetes, Linux, NCCL, NVIDIA HGX, NVLink, PXE Boot, Python, RoCE, Slurm, TCP/IP
Fintech • Information Technology • Software
The Senior Infrastructure Engineer will improve system reliability and efficiency, develop standards and tooling, and collaborate with engineering teams to optimize workflows and cloud infrastructure.
Top Skills:
Aurora PostgreSQL, AWS, CI/CD, Datadog, Docker, Kubernetes, Linux, OpenSearch, Prometheus, Python, Sumo Logic, Terraform
Artificial Intelligence • Information Technology • Software • Consulting
Design, build, and operate scalable GPU/accelerator infrastructure for large-scale training and inference. Implement scheduling, storage, networking (RDMA/InfiniBand/NCCL), observability, fault tolerance, security, and developer tooling. Partner with ML teams for capacity planning, cost optimization, automation, and operational runbooks.
Top Skills:
C++, CI/CD, DeepSpeed, FSDP, Go, GPU, InfiniBand, JAX, Kubernetes, Linux, Megatron-LM, NCCL, Python, PyTorch, Ray, Ray Train, RDMA, Slurm
Artificial Intelligence • Information Technology • Software • Consulting
Design, build, and operate petabyte-scale data pipelines and storage for AI training and evaluation. Implement ingestion, cleaning, versioning, lineage, high-throughput loaders, labeling/active-learning workflows, privacy controls, observability, and cost/performance optimizations while collaborating with ML researchers.
Top Skills:
Apache Beam, Spark, CI/CD, GPUs, Java, Kotlin, Python, Ray, Scala
Artificial Intelligence • Information Technology
Build, operate, and maintain research infrastructure (evaluation frameworks, RL training systems, experiment tracking, visualization). Develop scalable distributed pipelines, ensure reproducibility and observability, and partner with researchers and infrastructure teams to accelerate ML research and tooling adoption.
Top Skills:
JAX, Python, PyTorch, Ray, Rust, Spark
Reposted 9 Days Ago
Big Data • Cloud • Software • Database
The Security Software Engineer will design and implement security controls for MongoDB Atlas, collaborating across engineering teams and ensuring adherence to high security standards.
Top Skills:
AppArmor, C/C++, cgroups, eBPF, Go, Grafana, Java, Kubernetes, Python, Rust, seccomp, SELinux, Splunk, Terraform, VictoriaMetrics
Reposted 9 Days Ago
Big Data • Cloud • Software • Database
The Senior Site Reliability Engineer will lead security design and implementation for cloud infrastructures, mentor teams, and automate security solutions.
Top Skills:
Ansible, AWS, Azure, Cloud Security Tools, CloudFormation, GCP, Go, Terraform
Artificial Intelligence • Cloud • Computer Vision • Hardware • Internet of Things • Software
The Staff ML Engineer will design and operate Samsara's ML platform, collaborating with teams to enhance ML features and improve safety outcomes. Responsibilities include overseeing system reliability, leading technical direction, and mentoring engineers.
Top Skills:
AWS, Cloud Infrastructure, Kubernetes, Machine Learning, Ray, Spark
Agency • Professional Services • Consulting
Architect and build backend infrastructure for an AI-driven enterprise email platform. Design scalable distributed systems ingesting millions of messages/day, implement secure, compliance-ready and on-prem-capable systems, and serve as a hands-on senior engineer driving a zero-to-one product build.
Top Skills:
Backend Infrastructure, Distributed Systems, Encryption, Enterprise Email, On-Prem Deployment, Python, TypeScript
Software
As a Senior Backend Infrastructure Engineer, you'll develop systems that improve reliability and productivity for AI agent infrastructure, including deployment, observability, and data management.
Top Skills:
AWS, CI/CD, ClickHouse, Django, Docker, FastAPI, Kubernetes, Modal, Node.js, Postgres, Python, Redis, Terraform, TypeScript
Software
As a Staff Frontend Infrastructure Engineer, you'll build tools and systems for frontend engineers, focusing on design systems, developer tooling, and code quality infrastructure.
Top Skills:
ESLint, Next.js, Prettier, React, TypeScript, Vitest
Artificial Intelligence • Natural Language Processing • Generative AI
The ML Infrastructure Engineer will develop and scale AI safety systems infrastructure, optimize machine learning pipelines, and ensure reliable system performance while collaborating with research teams.
Top Skills:
Airflow, AWS, GCP, JAX, Kubernetes, Python, PyTorch, Spark, TensorFlow
Legal Tech • Professional Services
The Principal Cloud Infrastructure Engineer will design, analyze, and maintain cloud infrastructure, ensuring compliance with security standards and integrating application development with IT infrastructure.
Top Skills:
Azure DevOps, Bicep, Function Apps, Graph API, Kubernetes, Logic Apps, Azure, Power Platform, Terraform
Information Technology
As an ML Platform & Infrastructure Engineer, you'll design CI/CD pipelines for ML workflows, build evaluation infrastructure, and develop SDKs and tools to enhance experimentation. You'll track and visualize model performance while optimizing resources.
Top Skills:
AWS, Docker, GCP, Kubernetes, Python
Healthtech • Information Technology • Professional Services • Consulting
The Senior Cloud Infrastructure Engineer will design, deploy, and manage AWS infrastructure focusing on enterprise networking, security, and reliability while supporting cloud migrations in a HIPAA-regulated environment.
Top Skills:
AWS, Cisco, CloudFormation, HIPAA, Palo Alto, SD-WAN, Terraform, VMware
Artificial Intelligence • Computer Vision • Software
As an Infrastructure Engineer, you will secure, scale, and maintain core infrastructure, collaborate across teams, and optimize machine learning workflows.
Top Skills:
AWS, Bash Scripting, GCP, GitHub Actions, Helm, Kubernetes, Node.js, Python, PyTorch, Spacelift, TensorFlow, Terraform
Hardware • Software
The Staff Infrastructure Engineer will shape technical direction, support critical systems, and enhance automation across infrastructure, ensuring reliability and best practices.
Top Skills:
AWS, GCP, Linux, Terraform
Artificial Intelligence • Blockchain • Fintech • Financial Services • Cryptocurrency • NFT • Web3
Lead product vision and multi-year strategy for developer infrastructure across the code lifecycle. Own roadmap for CI/CD, release automation, testing, deployments, and production readiness; drive migrations to simplify systems, measure quality with scorecards, partner with Engineering/SRE/Security, and integrate emerging (AI) capabilities to improve developer velocity and reliability.
Top Skills:
AI-Powered Testing, Build Systems, CI/CD, Deployment Pipelines, DORA Metrics, Generative AI, Release Automation, Security, SRE, Testing Infrastructure
Computer Vision • Gaming • Sports • Esports
Lead bring-up, administration, and operations of a large GPU/AI training cluster. Serve as bridge between researchers and hardware, ensuring SLURM jobs, parallel filesystems, networking, and monitoring operate reliably. Work across provisioning, storage, VPN/access, and traditional Linux sysadmin tasks; assist with physical racking and on-site datacenter needs. Collaborate closely with a small research team in Tokyo or San Francisco.
Top Skills:
Ansible, Ceph, GPU, Grafana, HPC, K8s, Kubernetes, LDAP, Linux, MAAS, NVIDIA HGX, Parallel File Systems, Prometheus, Slinky, Slurm, Tailscale, VAST, Warewulf, WEKA