Sanas

Staff+ Data Engineer (ML Infrastructure)

Reposted 17 Days Ago
In-Office
Palo Alto, CA, USA
Expert/Leader
The Staff+ Data Engineer will design and implement data infrastructure for Voice AI products, lead a team of data engineers, and optimize data systems for machine learning. Responsibilities include building data pipelines, driving infrastructure strategy, and collaborating with cross-functional teams.
The summary above was generated by AI

Sanas is pioneering the future of human communication. Founded by a team of Stanford researchers and entrepreneurs with deep industry experience, Sanas has developed the world's first real-time speech AI platform capable of accent translation, noise cancellation, speech enhancement, cross-language communication, and more.
Sanas makes conversations clearer, more inclusive, and more effective, removing barriers that prevent people from being understood, regardless of accent, background noise, or native language.
Sanas is currently one of the fastest-growing startups in Silicon Valley, growing from $16M to $50M ARR in 2025. The company's core business is profitable and is on track to end 2026 with more than $120M ARR. Our team combines deep expertise in model innovation and systems engineering with a design-minded product engineering culture to build and ship cutting-edge AI models and experiences entirely in-house.
Sanas is a 180-strong team, established in 2020. In this short span, we've successfully secured over $100 million in funding. Our innovation has been supported by the industry's leading investors, including Insight Partners, Google Ventures, Quadrille Capital, General Catalyst, Quiet Capital, and other influential investors. Our reputation is further solidified by collaborations with numerous Fortune 100 companies. With Sanas, you're not just adopting a product; you're investing in the future of communication.
If you're looking to have a significant role in roadmapping and driving technical direction, to deploy big, challenging ideas without much overhead, and to leave your mark on an ambitious, generational mission to change how the world thinks about speech + AI, then Sanas is a well-suited place for you.

About the Role

Our models are only as good as the data that trains them. As a Staff Data Engineer, you'll own the infrastructure that takes raw audio — millions of hours across accents, languages, noise conditions, and recording environments — and turns it into clean, reproducible, training-ready data at scale. You'll work directly with AI research scientists and ML engineers to design systems that move fast without breaking the data quality guarantees our models depend on.

Job Description

Data pipeline & lakehouse architecture

  • Design and implement large-scale data pipelines that ingest, transform, validate, and serve high-quality audio and metadata for AI model training, evaluation, and product telemetry.
  • Own the lakehouse architecture — table format choices (Iceberg vs. Delta Lake), partitioning strategies, metadata management, and schema evolution — with a bias toward reproducibility and auditability.
  • Build and maintain batch and streaming pipelines using Spark, Flink, and orchestration tooling (Airflow or Dagster), with a clear-eyed view of when each is the right tool.
  • Extend and maintain feature store infrastructure to serve low-latency, versioned features for both training and real-time inference.

Audio data at scale

  • Develop and maintain pipelines purpose-built for the unique challenges of audio data: large file volumes, time-series feature extraction, speaker and language metadata, and annotation versioning.
  • Build tooling that supports the full audio data lifecycle — from raw ingestion and quality filtering through augmentation, segmentation, and training split generation — with reproducibility guarantees at every stage.
  • Partner with ML engineers and research scientists to design data schemas, sampling strategies, and evaluation datasets that accurately reflect production conditions.
  • Own data pipelines that feed human-in-the-loop annotation workflows — ensuring clean round-trips between raw data, labeling platforms, and training-ready outputs.

Platform reliability & governance

  • Instrument pipelines with observability, data quality checks, lineage tracking, and alerting — so failures surface fast and root causes are traceable.
  • Drive build vs. buy decisions for data quality, observability, and cataloging tooling with a clear framework grounded in Sanas's scale and roadmap.
  • Own disaster recovery design for critical data assets — training datasets, evaluation benchmarks, and model checkpoints.

Technical leadership

  • Set the technical bar for the data engineering team — review designs and code, establish patterns, and document decisions in a way that raises the floor for everyone.
  • Work cross-functionally with AI research, infrastructure, product, and legal to align data architecture with business needs and regulatory requirements.
  • Contribute to hiring — identify strong candidates, conduct technical interviews, and help define what great looks like for data engineering at Sanas.

Qualifications

  • 5+ years of experience in data engineering, ML infrastructure, or data platform roles.
  • Deep expertise building distributed batch and streaming data systems in production.
  • Strong command of data processing frameworks (Spark, Flink, and Ray) and orchestrators (Airflow or Dagster).
  • Hands-on experience with cloud data platforms — Snowflake, Databricks, or ClickHouse — and object storage (S3, GCS) on AWS or GCP.
  • Solid understanding of data lifecycle management: privacy, security, compliance, and reproducibility from ingestion through model training.
  • Proven ability to work directly with ML researchers and engineers to translate model requirements into data infrastructure decisions.

Bonus

  • Direct experience with audio data pipelines — file handling at scale, time-series features, speaker metadata, or audio annotation tooling.
  • Familiarity with ASR, TTS, or speech enhancement model training workflows and the data requirements specific to each.
  • Experience with MLOps tooling — experiment tracking, dataset versioning (DVC, LakeFS), and training pipeline orchestration.
