Guild.ai

Engineer, Production Engineering

Posted 4 Days Ago
In-Office
San Francisco, CA, USA
Senior level
Own production infrastructure, security, and compliance for an AI-agent platform: manage Kubernetes on GCP, customer VPC deployments across clouds, observability, SOC2 readiness, pentest/bug-bounty coordination, IT identity, and automated CI/CD and progressive delivery to ensure secure, reliable production at scale.
The summary above was generated by AI
Engineer — Production Engineering

Location: San Francisco Bay Area (Hybrid/Onsite)
Type: Full-time
Stage: Early-stage startup

About the Role

We are building the control plane for AI agents in teams and companies.

As a Production Engineer, you will own the infrastructure, security, and compliance systems that allow our platform to ship fast and run reliably at scale. This is not a traditional ops role — you will write real code, contribute directly to the product, and own the full security and compliance surface of an early-stage company.

You'll work across Kubernetes infrastructure, cloud delivery, agent sandboxing, SOC2 compliance, IT systems, and production observability — and you'll contribute to the product itself, building security-sensitive features and auditing application code for vulnerabilities.

If you want to own the production backbone for the agent-native era — from a Terraform module to a pentest to an API key implementation — we want to talk.

What You'll Own

1. Cloud & Kubernetes Infrastructure

  • Our Stack: Manage and evolve our production and staging infrastructure on GCP (GKE) using Terraform. Own DNS, networking, and environment configuration end-to-end.

  • Customer Environments: Deploy and operate within customer VPCs across AWS, Azure, and GCP — adapting to varied infrastructure constraints, security requirements, and enterprise networking configurations.

  • Agent Sandboxing: Build and maintain Kubernetes-based sandboxing for agent execution, so that agents operate within strict network boundaries and route through our API gateway rather than having unfettered internet access.

  • Observability: Own our observability stack, including OpenTelemetry instrumentation and integrations with New Relic and Splunk, to give the team deep visibility into system performance and agent runtime behavior.

2. Security, Compliance & IT

  • SOC2 & Audits: Lead infrastructure and operational work to support SOC2 compliance, including audit preparation, evidence collection, and control implementation.

  • Penetration Testing & Bug Bounty: Manage our HackerOne engagement — coordinating pentests, triaging incoming bug bounty reports, and driving remediation.

  • Product Security: Audit application code for security vulnerabilities, contribute security-sensitive product features (e.g., API key management), and ensure product and infrastructure security are coherent end-to-end.

  • IT & Identity: Own our IT stack — Okta, device management, and access controls — keeping the company secure as we scale.

3. CI/CD & Progressive Delivery

  • Deployment Pipelines: Design and maintain safe, automated CI/CD workflows supporting rollout strategies like canary and blue-green deployments.

  • Release Velocity: Make shipping to production a routine, boring, highly automated non-event.

What We're Looking For

Strong Fit

  • Experience: 5+ years in Production Engineering, Platform Engineering, or a security-focused infrastructure role, ideally at a fast-growing startup or SaaS company.

  • Our Stack: Strong hands-on experience with Kubernetes and GCP in production; comfortable with Terraform for managing real infrastructure.

  • Code over Click: Strong programming skills (Python, Go, TypeScript, etc.) with a passion for automating away toil.

  • Security Depth: Hands-on experience with compliance frameworks (SOC2), vulnerability management, and secure system design.

Bonus Points

  • Background with multi-tenant SaaS or enterprise security and procurement requirements.

  • Exposure to AI/ML infrastructure, particularly agent runtimes.

  • Experience building security-sensitive product features alongside infrastructure work.

  • Experience supporting pentests and bug bounties.

  • Experience deploying and operating in customer VPCs or other external cloud environments across AWS, Azure, and/or GCP — navigating enterprise networking, security, and access constraints.

Why This Role is Unique
  • Broad Ownership: You'll own the full security and compliance surface of an early-stage company — from SOC2 to sandboxed agent execution to IT — while also contributing directly to the product.

  • Agent Infrastructure: You'll design infrastructure for autonomous AI agents, not just traditional web services — introducing unique sandboxing, observability, and security challenges.

  • Our Infra and Theirs: You'll operate across both our own production environment and customer cloud environments, requiring you to be fluent across AWS, Azure, and GCP.

  • High Autonomy: As an early hire, you'll have a seat at the table to choose the tools and define the architecture that carries us to scale.

Who Thrives Here
  • Engineers who are as comfortable reading application code for vulnerabilities as they are writing a Terraform module.

  • People who enjoy owning the full security and compliance surface, not just one layer of it.

  • Builders who can navigate the constraints of customer enterprise environments without losing velocity.

  • Those who are energized — not overwhelmed — by the breadth of an early-stage technical operations role.

HQ

Guild.ai San Francisco, California, USA Office

San Francisco, California, United States

Guild.ai Ross, California, USA Office

9 Woodside Way, Ross, California, United States, 94957-9698
