Thinking Machines Lab

Site Reliability Engineer (SRE)

Reposted 13 Days Ago
In-Office
San Francisco, CA, USA
350K-475K Annually
Mid level
The Site Reliability Engineer will drive reliability for the Tinker platform, focusing on incident response, monitoring, and ensuring system resilience while collaborating across teams.
The summary above was generated by AI

Thinking Machines Lab's mission is to empower humanity through advancing collaborative general intelligence. We're building a future where everyone has access to the knowledge and tools to make AI work for their unique needs and goals. 

We are scientists, engineers, and builders who’ve created some of the most widely used AI products, including ChatGPT and Character.ai, open-weights models like Mistral, as well as popular open source projects like PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About Tinker

Tinker is our fine-tuning API that empowers researchers and developers to customize frontier AI to their needs, opening access to capabilities that have previously been concentrated in a handful of labs. We manage the infrastructure while giving Tinkerers full flexibility to train open-weights models with their own data and algorithms, for their own needs. Tinker is rapidly adding new customers, features, and novel use cases. We’re hiring to grow the platform alongside the Tinker community.

About the Role

We're looking for a Site Reliability Engineer to drive the reliability of Tinker end-to-end. You'll work alongside research teams and the engineers building the platform to make every layer of the system more robust and resilient.

What You’ll Do
  • Define and own end-to-end reliability, from CI/CD flows to production observability and incident response.
  • Develop appropriate Service Level Objectives for distributed training systems, balancing job completion reliability and scheduling latency with development velocity.
  • Design and implement monitoring and observability across the full training path.
  • Drive incident response for Tinker platform issues, ensuring rapid recovery, thorough incident reviews, and systematic improvements that prevent recurrence.
  • Harden multi-tenant isolation and resource scheduling so that LoRA-based workload co-scheduling maximizes utilization without compromising reliability or data separation.
  • Collaborate with security teams to address production vulnerabilities.
Skills and Qualifications

Minimum qualifications:

  • Bachelor's degree or equivalent experience in computer science, engineering, or similar.
  • Experience in distributed systems, cloud infrastructure, or site reliability engineering.
  • Proficiency writing software to solve reliability problems, including building tooling and automation.
  • Experience with production incident response, postmortems, and systematic reliability improvement.
  • Strong communication skills and a track record of coordinating across engineering and research teams.

Preferred qualifications — we encourage you to apply if you meet some but not all of these:

  • Deep experience operating production cloud services at scale (e.g., public cloud platforms, internal cloud services).
  • Background in distributed training frameworks and how infrastructure failures surface in training behavior.
  • Track record building checkpoint and recovery systems for long-running distributed jobs.
  • Expertise in Kubernetes at scale: deploying, operating, debugging, and tuning clusters handling heterogeneous GPU workloads.
Logistics
  • Location: This role is based in San Francisco, California.
  • Compensation: Depending on background, skills and experience, the expected annual salary range for this position is $350,000 – $475,000 USD.
  • Visa sponsorship: We sponsor visas. While we can't guarantee success for every candidate or role, if you're the right fit, we're committed to working through the visa process together.
  • Benefits: Thinking Machines offers generous health, dental, and vision benefits, unlimited PTO, paid parental leave, and relocation support as needed.

As set forth in Thinking Machines' Equal Employment Opportunity policy, we do not discriminate on the basis of any protected group status under any applicable law.

Thinking Machines Lab will consider for employment qualified applicants with criminal histories in a manner consistent with the requirements of the California Fair Chance Act, the San Francisco Fair Chance Ordinance, and any other applicable state or local fair chance ordinance or law.
