Periodic Labs

Research Engineer - Data

Reposted 20 Days Ago
In-Office
Menlo Park, CA, USA
350K-400K Annually
Mid level
You will manage the data strategy for research: sourcing datasets, building pipelines, ensuring data quality, and collaborating with researchers to optimize model training.
About Periodic Labs

The most important scientific discoveries of our time won’t happen in a traditional lab. We’re an AI and physical sciences company building state-of-the-art models to accelerate breakthroughs across materials, energy, and beyond. Backed by world-class investors and growing rapidly, we operate at the pace the frontier requires. Our team brings deep expertise, genuine ownership, and an insatiable drive to push the boundaries of what’s scientifically possible.

About the Role

You will build and drive the data foundation for our research efforts. This means owning data strategy end-to-end: sourcing and procuring external datasets, integrating internally generated experimental data into the training stack, and ensuring the team always has the right data — in the right shape — to train and improve frontier models.

This role sits at the intersection of data engineering, research infrastructure, and strategy. You will work closely with pretraining, midtraining, and RL researchers to understand what data the models need, then build the pipelines and systems to get it there. The work spans collecting and organizing diverse data sources, improving data quality through deduplication and preprocessing, and ensuring that new experimental results are incorporated in a structured, repeatable way that makes them useful for model development.

What You’ll Do
  • Own data strategy across the training stack — identifying gaps, evaluating new sources, and shaping the overall data roadmap in collaboration with research leads

  • Source, evaluate, and procure external datasets across scientific domains including chemistry, physics, materials science, mathematics, and lab instrumentation

  • Build and maintain robust pipelines for ingesting, processing, and versioning large-scale datasets from heterogeneous sources

  • Design and implement new evaluation datasets and new RL environments to track and improve our key capabilities

  • Integrate internally generated experimental data — from lab instrumentation, simulations, and model outputs — into the training stack in a structured and repeatable way

  • Build tooling that makes it easy for researchers to inspect, query, and understand the data that goes into training runs

  • Stay current with research on data-efficient training, synthetic data generation, and data selection methods — and bring relevant ideas into production

You Will Thrive in This Role If You Have
  • Experience building large-scale data pipelines for LLM pretraining or midtraining, including web-scale or scientific corpora

  • Familiarity with dataset versioning, lineage tracking, and reproducibility tooling such as DVC, Delta Lake, or custom solutions

  • Experience sourcing and evaluating third-party datasets, including licensing considerations and quality assessment

  • Strong Python engineering skills and comfort building production-quality tooling in a research environment

  • Experience building evaluations and RL environments

  • Experience collaborating directly with ML researchers to translate data needs into pipeline requirements and back again

  • A research-oriented mindset — you run experiments on data, measure outcomes, and iterate with rigor

Especially Strong Candidates May Also Have
  • Experience curating scientific datasets specifically for domain-adaptive continued pretraining or instruction tuning

  • Familiarity with synthetic data generation methods, including model-generated data pipelines and quality verification

  • A background in a physical science or engineering discipline that informs how you think about scientific data quality and structure

  • Experience with multimodal data — integrating text, structured numerical data, molecular representations, or spectral data into unified training pipelines

Mechanics

Minimum education: Bachelor’s degree or similar experience

Location: Our lab is in Menlo Park. We prefer candidates based in Menlo Park or San Francisco, but can be flexible depending on the role.

Compensation: $250,000-$350,000 + equity

Visa sponsorship: Yes, we sponsor visas and will do everything we can to assist in this process with our legal support.

We’re building a team of the world’s best — the scientists, engineers, and problem-solvers who don’t just follow the frontier, they define it. If you’re driven to bring AI to life in the physical world and make discoveries that have never been made before, you belong here.
