Bespoke Labs

Backend Engineer

Posted 15 Days Ago
Hybrid
Mountain View, CA, USA
Mid level
Design and maintain the infrastructure for RL environments, focusing on execution, performance optimization, and production excellence while collaborating with research teams and clients.
The summary above was generated by AI

About Bespoke Labs

Bespoke Labs is an applied AI research lab pioneering data and RL environment curation for training and evaluating agents.

Recently, we curated Open Thoughts, one of the best open reasoning datasets used by multiple frontier labs; trained SOTA specialized models such as Bespoke-MiniChart-7B and Bespoke-MiniCheck; and built the environment infrastructure that frontier labs and enterprises use to make their agents reliable.

Bespoke is uniquely positioned to capture a large share of the market for data and RL environment curation.

About the Role

We're looking for an Infrastructure Engineer to own the execution layer beneath our RL environments: the systems that let an agent operate inside a realistic, multi-tool world coherently for hours or days.

This is a hard systems problem disguised as an AI job. As the tasks agents can complete keep lengthening, the environments that train them have to stay coherent across far longer horizons than anything that exists today. That means sandboxing and isolation you can trust, execution that's fast and cheap enough to run at training scale, and the ability to snapshot, restore, inspect, and branch a running environment instead of treating every rollout as one-shot. You'll build the platform that makes all of this possible.

You'll work closely with our research and data teams, and directly with frontier labs and enterprise customers, to turn environment designs into infrastructure that runs reliably in production.

What You'll Do

  1. Environment Execution & Sandboxing

    • Design and own the sandboxing and execution layer that environments run inside. Build systems to snapshot and restore environment state (disk, process, and, where relevant, memory and accelerator state) so runs can be paused, resumed, inspected, and branched rather than executed once.

    • Develop the machinery to detect failure modes early in a rollout (reward hacks, infra faults, fairness issues) and to revert to a known-good state, patch, and continue.

    • Extend execution to long-horizon and multi-node environments, where an agent operates across many tools and services over hours or days.

  2. Performance & Scale

    • Own the performance characteristics of the platform: throughput, latency, and cost-per-rollout at scale.

    • Drive utilization and scheduling so we can run far more environment rollouts per dollar without sacrificing reliability.

    • Profile and remove bottlenecks across the stack, from container startup to environment teardown.

    • Build the observability that lets us understand what's happening inside thousands of concurrent, long-running rollouts.

  3. Environment Platform

    • Build and maintain the framework for specifying, packaging, and deploying RL environments, which is used internally by both humans and agents that author environments.

    • Create the tooling that lets researchers and environment authors debug a specific failure across hundreds of long agent traces.

  4. Collaboration & Production Excellence

    • Scale prototypes into production systems with reproducible workflows and high engineering standards.

    • Write the documentation and tools that let internal teams and external users build on the platform.

What We're Looking For

  1. Systems & Infrastructure

    • Strong track record building production systems or research infrastructure at scale: distributed systems, execution engines, container/sandboxing infrastructure, or similar.

    • Deep comfort with the systems layer: containers and isolation (e.g. namespaces, cgroups, VMs, gVisor/Firecracker-style sandboxing), filesystems, process and state management.

    • Experience making systems fast and cheap: profiling, scheduling, resource utilization, and cost optimization at scale.

    • Proficiency with cloud platforms (GCP, AWS) and distributed computing.

    • Strong engineering fundamentals and a systematic approach to testing, validation, and reliability.

  2. Execution & Ownership

    • Comfort operating in ambiguity.

    • Strong Python skills; comfort in a systems language (Rust, Go, or C++) is a plus.

    • Ability to use modern tools such as Claude Code effectively.

  3. Collaboration & Communication

    • Excellent communication skills for working with research teams and enterprise customers.

    • Ability to translate between research needs and infrastructure requirements.

    • Comfortable presenting technical work to diverse audiences.

Nice to Have

  • Experience with RL training or evaluation infrastructure, or the execution layer for agent rollouts.

  • Experience with checkpoint/snapshot-restore systems, CRIU, or distributed state management.

  • Background in high-throughput, low-latency execution systems.

  • Contributions to widely used infrastructure, datasets, benchmarks, or open-source systems.

  • Previous experience in a research engineering or infrastructure role at an AI or systems-heavy company.

Logistics

Location: Mountain View, CA

Compensation: Competitive salary and equity

Benefits: Health coverage and the opportunity to work directly with the world's leading AI research labs

HQ

Bespoke Labs Mountain View, California, USA Office

800 W El Camino Real, Mountain View, California, United States, 94040
