Kumo

Software Engineer Lead - Cloud Infrastructure

Reposted 19 Days Ago
Hybrid
Mountain View, CA
175K-250K Annually
Expert/Leader
Architect and operate scalable Kubernetes infrastructure for AI workloads, manage multi-cloud deployments, automate processes, and enhance system reliability.
The summary above was generated by AI
About Kumo.ai

Kumo is building the infrastructure layer for the next generation of enterprise AI — a platform that lets organizations turn their data into predictive intelligence instantly, without the heavy lifting of traditional ML pipelines. We have also built our own Relational Foundation Model that can provide predictions in seconds – no training, straight to business value!

Join a dynamic, rapidly expanding team of innovators from top-tier companies and institutions like Airbnb, LinkedIn, Pinterest, and Stanford, supported by the renowned Sequoia Capital. We're on the front lines of AI, solving some of its most challenging and impactful problems, and we've already delivered more than $500M in tangible value to industry giants like Reddit, DoorDash, and Databricks. If you thrive in a fast-paced environment, are driven by ambitious goals, and crave an opportunity for massive impact, this is your chance to shape the future of AI.

The Opportunity

We’re hiring a Lead / Staff+ Infrastructure Engineer to own the architecture, reliability, and evolution of Kumo’s multi-tenant AI platform. This is a hands-on leadership role: you’ll design high-leverage systems, make critical architectural decisions, mentor engineers, drive cross-functional roadmaps, and still spend a meaningful portion of your time writing code and running production services. If you’ve built large-scale cloud-native infrastructure, led cross-team infrastructure initiatives, and want to influence both product and platform at a technical and organizational level, this role is for you.

What You’ll Own

  • Set the technical vision and roadmap for Kumo’s multi-tenant infrastructure across AWS, Azure, and GCP, balancing scalability, reliability, cost, and security.
  • Lead architecture and design for critical systems: Kubernetes-based multi-tenancy, real-time inference clusters, training pipelines, and CI/CD for large ML workloads.
  • Hands-on implementation: build and evolve IaC, GitOps flows, cluster autoscaling, and automation that reduce toil and accelerate developer productivity.
  • Define and drive SLOs, SLIs, and capacity planning; lead incident response, postmortems, and systemic remediation.
  • Own cost optimization at scale — from resource scheduling to spot/commit strategies and cross-cloud lifecycle management.
  • Mentor and grow engineers: set standards for architecture reviews, design docs, code quality, and operational excellence.
  • Hire and help scale the team — participate in recruiting, interviewing, and onboarding top-tier infrastructure talent.

What You Bring

  • 5-8+ years building and operating production cloud-native infrastructure; proven track record leading infrastructure initiatives end-to-end.
  • Deep, practical experience with Kubernetes at scale (multi-tenant environments, cluster federation, or large fleet operations).
  • Strong multi-cloud operational experience (designing and running services across AWS/Azure/GCP) and cloud cost management.
  • Demonstrated distributed-systems design skills: able to make architectural trade-offs, comfortable shipping code in a high-velocity environment (Python, Go, or similar), and able to review complex PRs.
  • Proficiency in Go, Python, Rust or similar languages for automation tooling.
  • Excellent communicator: able to influence across engineering, ML science, product, and leadership — and to write clear design docs and trade-off analyses.

Nice to Have

  • Experience building infrastructure for ML/AI platforms or relational foundation models.
  • Background with Spark or large-scale data processing platforms (managed or self-hosted).
  • Familiarity with Kubernetes operators, controllers, CRDs, or service mesh patterns.
  • Expertise with Infrastructure-as-Code (Terraform/Pulumi) and GitOps (ArgoCD, Flux, Argo Workflows) in production.
  • Experience with tenant isolation, zero-trust identity models, and cloud security/compliance frameworks.
  • Prior experience building and scaling an infrastructure team (e.g., hiring, mentoring, org design).

We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

Top Skills

Ansible
Argo
AWS
Azure
Bash
Calico
CloudFormation
Docker
Envoy
Flux
GCP
Go
Grafana
Istio
Jenkins
Kubernetes
Make
Prometheus
Python
Rust
Terraform
Tigera
Traefik

HQ

Kumo Mountain View, California, USA Office

357 Castro St, Suite 200, Mountain View, CA, United States, 94041
