Aldea

Data Engineer

Posted 16 Days Ago
In-Office or Remote
Hiring Remotely in San Francisco, CA
Mid level
The Data Engineer will design and scale data pipelines for AI research, process diverse datasets, generate synthetic data, and collaborate with ML engineers to optimize data quality.
The summary above was generated by AI

About Aldea

Aldea is a multi-modal foundational AI company reimagining the scaling laws of intelligence. We believe today's architectures create unnecessary bottlenecks for the evolution of software. Our mission is to build the next generation of foundational models that power a more expressive, contextual, and intelligent human–machine interface.


The Role 

We are hiring a Data Engineer to build the data infrastructure that powers Aldea's multi-modal AI research. You will design and scale data pipelines for pretraining, midtraining, and post-training at trillion-token scale, process diverse data sources across language and speech domains, and generate high-quality synthetic data for model training. 

This is a high-impact role where your work directly determines training quality and efficiency. If you're passionate about building data systems that power cutting-edge AI research, this role is for you.  

What You'll Do

  • Build and scale data pipelines for pretraining, midtraining, and post-training at trillion+ token scale across language and speech domains 
  • Process and curate large-scale datasets including cleaning, deduplication, quality filtering, and optimization for distributed training 
  • Generate synthetic data for model training and evaluation across diverse tasks and domains 
  • Design efficient data loading systems achieving high throughput across multi-node training clusters 
  • Build data versioning and reproducibility systems to track dataset compositions and enable reproducible experiments 
  • Collaborate with ML engineers and researchers to optimize pipelines and improve data quality

Minimum Qualifications

  • Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience 
  • 3+ years of experience building large-scale data pipelines for machine learning or data-intensive applications 
  • Strong programming skills in Python and experience with data processing frameworks (Spark, Dask, Ray, or similar) 
  • Experience with data quality techniques including deduplication, filtering, and validation at scale 
  • Proven ability to optimize data pipelines for performance and throughput in distributed systems 
  • Experience working with large datasets (100 GB to 10 TB+) and understanding of storage systems and data formats

Preferred Qualifications 

  • Experience building data pipelines for LLM pretraining or large-scale ML training 
  • Hands-on experience with synthetic data generation for language or speech models 
  • Experience with text processing at scale: tokenization, deduplication (MinHash, LSH), and quality assessment 
  • Familiarity with audio/speech data processing and dataset curation 
  • Knowledge of data contamination detection and dataset versioning best practices 
  • Experience optimizing data loaders for PyTorch or TensorFlow at scale 
  • Understanding of distributed storage systems (S3, GCS, HDFS) and data streaming patterns

Compensation & Benefits

  • Competitive base salary
  • Performance-based bonus aligned with research and model milestones
  • Equity participation
  • Comprehensive health, dental, and vision coverage
  • Flexible paid time off



Aldea is proud to be an equal-opportunity employer. We are committed to building a diverse and inclusive culture that celebrates authenticity to win as one. We do not discriminate on the basis of race, religion, color, national origin, gender, gender identity, sexual orientation, age, marital status, disability, protected veteran status, citizenship or immigration status, or any other legally protected characteristics.


Aldea uses E-Verify to confirm employment eligibility in compliance with federal law. For more information please visit: https://www.e-verify.gov.


Please note: We do not accept unsolicited resumes from recruiters or employment agencies and will not be responsible for any fees related to unsolicited resumes.

Top Skills

Dask
GCS
HDFS
Python
PyTorch
Ray
S3
Spark
TensorFlow
