LLM Dataset Engineer

Sciforium

Reposted 15 Days Ago

In-Office

San Francisco, CA, USA

155K-210K Annually

Senior level

Sciforium is an AI infrastructure company developing next-generation multimodal AI models and a proprietary, high-efficiency serving platform. Backed by multi-million-dollar funding and direct sponsorship from AMD, with hands-on support from AMD engineers, the team is scaling rapidly to build the full stack powering frontier AI models and real-time applications.

Role Overview

Sciforium is seeking a highly technical and visionary LLM Dataset Engineer to lead the strategy, creation, and curation of the massive datasets that power our foundation models. We believe that in the era of LLMs, data is the primary competitive advantage. In this role, you will own the end-to-end data lifecycle—from raw web-scale crawling to the fine-grained human-alignment datasets that define model behavior.

This position is ideal for a scientist who views data as a high-scale engineering challenge and an analytical puzzle. You will not just "provide" data; you will design the taxonomies, filtering heuristics, and post-training pipelines that ensure our models are world-class in reasoning, safety, and multimodal understanding.

Key Responsibilities

Foundation Dataset Strategy: Own the end-to-end creation of pre-training datasets for LLMs. This includes defining the mix of web data, code, books, and technical papers to optimize for downstream model performance.
Petabyte-Scale Curation: Design and implement sophisticated pipelines for data cleaning, exact/fuzzy deduplication, and high-quality signal extraction from petabytes of raw, unstructured data.
Post-Training & Alignment Data: Lead the development of high-quality post-training datasets, including Supervised Fine-Tuning (SFT) instructions, multi-turn dialogues, and preference modeling data (RLHF/DPO).
Multimodal Expansion: Drive the acquisition and processing of vision and video data, navigating the complexities of multimodal alignment, video compression, and temporal data consistency.
High-Performance Engineering: Develop high-throughput data processing scripts using Python, leveraging multiprocessing and multithreading to handle massive-scale ingestion and transformation without bottlenecks.
Data Profiling & Analysis: Conduct deep-dive statistical analysis on training corpora to identify biases, gaps in knowledge, and quality regressions, ensuring the "diet" of the model is mathematically balanced.
Synthetic Data Generation (Added Value): Design pipelines to generate reasoning-heavy synthetic data that fills gaps in natural datasets, using existing models for data labeling and refinement.

Must-Haves

5+ years of industry experience in Data Science or Machine Learning, with a proven track record of building and managing datasets for foundation models.
Deep Proficiency in Python: Expert-level skills with a focus on high-performance code, including multiprocessing, multithreading, and efficient memory management for large-scale data tasks.
Petabyte-Scale Experience: Demonstrated experience working with petabyte-scale datasets that have been directly used to train production-grade LLMs or Large Vision Models.
Dataset Reconstruction: Experience building massive LLM training sets from scratch, including raw web crawls (e.g., Common Crawl) and specialized domain data.
Post-Training Expertise: Hands-on experience building datasets for RLHF, DPO, and multi-turn instruction following, including the management of human-labeling workflows and quality gold-sets.
Data Tooling: Mastery of data-at-scale frameworks (e.g., Spark, Ray) or high-performance data-loading formats (e.g., WebDataset, Parquet).

Nice-to-Haves

Computer Vision (CV) Curation: Experience building large-scale image or video datasets from scratch (e.g., LAION-style pipelines).
Multimodal Crawling: Familiarity with large-scale crawling of multimodal data and the associated challenges of video processing, codecs, and compression.
Taxonomy Design: Experience in designing complex labeling schemas for reasoning, coding, and mathematical benchmarks.
Research Background: A Master’s or PhD in a quantitative field with a focus on data-centric AI or information retrieval.

Benefits include

Medical, dental, and vision insurance
401k plan
Daily lunch, snacks, and beverages
Flexible time off
Competitive salary and equity

Equal opportunity

Sciforium is an equal opportunity employer. All applicants will be considered for employment without attention to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran or disability status.

San Francisco, CA, United States

4401 El Camino Real, Los Altos, California, United States, 94022
