Featherless AI

Machine Learning Engineer — Training Optimization

Posted 3 Days Ago
In-Office or Remote
Hiring Remotely in World Golf Village, FL
Mid level
The ML Engineer will optimize large-scale model training pipelines, improve distributed training strategies, build robust infrastructure, and collaborate on training techniques and performance metrics.
About the Role

We’re looking for an ML Engineer focused on training optimization to help us scale and improve the training of large models. You’ll work at the intersection of research and production, optimizing training pipelines for speed, stability, and cost while collaborating closely with the researchers who push model architecture and capability forward.

This is a high-impact role with real ownership: your work directly affects how fast we can iterate, how large we can scale, and how efficiently we deploy new models.

What You’ll Do
  • Optimize large-scale model training pipelines (throughput, convergence, stability, and cost)

  • Improve distributed training strategies (data, model, and pipeline parallelism)

  • Tune optimizers, schedulers, batch sizing, and precision (bf16 / fp16 / fp8)

  • Reduce training time and compute cost via profiling, bottleneck analysis, and systems-level improvements

  • Collaborate with researchers on architecture-aware training strategies

  • Build and maintain robust training infrastructure (checkpointing, fault tolerance, reproducibility)

  • Evaluate and integrate new training techniques (e.g. gradient checkpointing, ZeRO, FSDP, custom kernels)

  • Own training performance metrics and continuously push them forward

What We’re Looking For
  • Strong experience training large neural networks (LLMs or similarly large models)

  • Hands-on experience with training optimization (not just model usage)

  • Solid understanding of:

    • Backpropagation, optimization algorithms, and training dynamics

    • Distributed systems for ML training

  • Experience with PyTorch (required)

  • Comfort working close to hardware (GPUs, memory, networking constraints)

  • Ability to move fluidly between research ideas and production-ready code

Nice to Have
  • Experience with large-scale distributed training (multi-node, multi-GPU)

  • Familiarity with DeepSpeed, FSDP, Megatron, or custom training stacks

  • Experience optimizing training on AMD or NVIDIA GPUs

  • Contributions to open-source ML infrastructure or research codebases

  • Exposure to non-Transformer architectures (RNNs, hybrid models, etc.)

Why Join Us
  • Real ownership at Series-A stage — your work shapes the company’s trajectory

  • Work on cutting-edge models and training systems at scale

  • Small, highly technical team with fast feedback loops

  • Strong emphasis on engineering quality and research rigor

  • Competitive compensation + meaningful equity

Top Skills

PyTorch

HQ

Featherless AI San Francisco, California, USA Office

San Francisco, California, United States


