Cohere AI

Senior ML Systems Engineer, Frameworks & Tooling

Reposted 8 Hours Ago
In-Office or Remote
Hiring Remotely in San Francisco, CA, USA
Senior level
The Senior ML Systems Engineer will build and maintain the training framework for large-scale language models, focusing on distributed training and performance optimization.
The summary above was generated by AI

Who are we?

Cohere is the leading security-first enterprise AI company. We build cutting-edge foundation AI models and end-to-end products that are designed to solve real-world business problems.

We’re training and deploying frontier models for enterprises that are building AI systems. We believe that our work is instrumental to the widespread adoption of AI, and we are looking for folks who want to be part of that.

We obsess over what we build. Each one of us is responsible for contributing to increasing the capabilities of our models and the value they drive for our customers. Cohere is a team of researchers, engineers, designers, and more, who are all passionate about their craft.

We are a global technology company co-headquartered in Toronto and San Francisco, with key offices in London, New York City, Montreal, Seoul, Germany and Paris. Join us!

We’re looking for a senior engineer to help build, maintain and evolve the training framework that powers our frontier-scale language models. This role sits at the intersection of large-scale training, distributed systems, and HPC infrastructure. You will design and maintain the core components that enable fast, reliable, and scalable model training — and build the tooling that connects research ideas to thousands of GPUs.

If you enjoy working across the full stack of ML systems, this role gives you the opportunity and autonomy to have massive impact.

What You’ll Work On
  • Build and own the training framework responsible for large-scale LLM training.

  • Design distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, checkpointing).

  • Improve training throughput and stability on multi-node clusters (e.g., GB200/GB300, AMD, H200/H100).

  • Develop and maintain tooling for monitoring, logging, debugging, and developer ergonomics.

  • Collaborate closely with infra teams to ensure our cluster, container environments, and hardware configurations support high-performance training.

  • Investigate and resolve performance bottlenecks across the ML systems stack.

  • Build robust systems that ensure reproducible, debuggable, large-scale runs.

You Might Be a Good Fit If You Have
  • Strong engineering experience in large-scale distributed training or HPC systems.

  • Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops.

  • Experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar).

  • Comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines.

  • Experience working with containerized environments (Docker, Singularity/Apptainer).

  • A track record of building tools that increase developer velocity for ML teams.

  • Excellent judgment around trade-offs: performance vs complexity, research velocity vs maintainability.

  • Strong collaboration skills — you’ll work closely with infra, research, and deployment teams.

Nice to Have
  • Experience with training LLMs or other large transformer architectures.

  • Contributions to ML frameworks (PyTorch, JAX, DeepSpeed, Megatron, xFormers, etc.).

  • Familiarity with evaluation and serving frameworks (vLLM, TensorRT-LLM, custom KV caches).

  • Experience with data pipeline optimization, sharded datasets, or caching strategies.

  • Background in performance engineering, profiling, or low-level systems.

Bonus: papers published at top-tier venues (such as NeurIPS, ICML, ICLR, AISTATS, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP).

Why Join Us
  • You’ll work on some of the most challenging and consequential ML systems problems today.

  • You’ll collaborate with a world-class team working fast and at scale.

  • You’ll have end-to-end ownership over critical components of the training stack.

  • You’ll shape the next generation of infrastructure for frontier-scale models.

  • You’ll build tools and systems that directly accelerate research and model quality.

Sample Projects:

  • Build a high-performance data loading and caching pipeline.

  • Implement performance profiling across the ML systems stack.

  • Develop internal metrics and monitoring for training runs.

  • Build reproducibility and regression testing infrastructure.

  • Develop a performant fault-tolerant distributed checkpointing system.

Full-Time Employees at Cohere enjoy these Perks:
  • A weekly lunch stipend of $75/£75 or equivalent in your local currency.

  • Full health and dental benefits, including a separate budget for mental health.

  • RRSP matching, 401K, Pension Scheme.

  • 100% Parental Leave top-up for up to 6 months, for either parent.

  • Annual enrichment benefits:

    Arts & culture, fitness/wellness, quality time, and a workspace improvement credit.

    Education & learning stipend for conferences, courses, and coaching.

  • 6 weeks of paid vacation (30 working days!)

  • Budget for traveling to other offices if you are remote, plus an annual company offsite.

How and Where We Work:
  • Cohere is remote-friendly. We have offices in Toronto, San Francisco, New York City, London, Paris, Montreal, and more coming soon.

  • For those in the office: a daily lunch program, plenty of snacks, and regular community and social events.

  • For those not near an office: a co-working benefit so you can work alongside others in your city.

  • Everyone receives a $500 home office stipend to set up their workspace properly.

If any of the above doesn’t line up exactly with your experience, we still encourage you to apply.


We strive to create an inclusive work environment for all; we welcome applicants from all backgrounds and are committed to providing equal opportunities. Should you require any accommodations during the recruitment process, please submit an Accommodations Request Form, and we will work together to meet your needs.

We may use AI-enabled tools to screen and assess applicants against the criteria for this position. This helps our recruiters identify potentially qualified candidates, but it doesn't limit the applications our recruiters may review or consider.
