Are you passionate about pushing the boundaries of technology in the Gen AI space? Rohirrim is seeking a Senior Data Engineer to mentor engineers, provide technical direction, and drive the development of cutting-edge applications. If you thrive in a fast-paced environment and enjoy leading by example while staying hands-on with coding, we want to hear from you!
Why Join Rohirrim?
At Rohirrim, we're at the forefront of innovation in the Gen AI space. Joining our team means being part of a dynamic environment where your leadership and expertise make a tangible impact on our products and team growth.
As a Data Engineer at Rohirrim, you’ll design, build, and optimize the data pipelines and infrastructure that fuel our AI products. You’ll work closely with our AI/ML teams, product teams, customer success managers, and security/compliance partners to transform complex enterprise datasets into clean, reliable, structured foundations for Rohan deployments, especially in controlled, secure, or GovTech environments.
You’ll help us scale:
- ingestion pipelines
- vector stores
- embedding workflows
- metadata & document-processing frameworks
- Azure-native data services
…in a way that is fast, compliant, and deeply reliable.
- Blend capabilities in software engineering, data engineering, and DevOps to build and maintain scalable data ingestion pipelines for structured and unstructured data (documents, PDFs, knowledge bases, enterprise systems, APIs, etc.).
- Develop and operate ETL/ELT workflows that ensure data integrity, security, and lineage.
- Implement and optimize vector database systems and embeddings pipelines supporting RAG and AI features.
- Collaborate with ML engineers to support model training, evaluation, and feature engineering pipelines.
- Architect and manage Azure-based data infrastructure (e.g., Azure Functions, Azure Storage, Azure SQL, Azure Kubernetes Service, Azure OpenAI integrations).
- Build internal tools for metadata extraction, OCR/document parsing, text normalization, and validation.
- Ensure pipelines meet compliance, auditability, and security requirements (SOC2, FedRAMP, etc.).
- Support customer-specific data onboarding workflows for government and enterprise deployments.
- Monitor and improve pipeline performance, reliability, and scalability.
- 10+ years in Data Engineering, Software Engineering, or ML/Data Infrastructure roles.
- Strong experience with Python, SQL, and modern data engineering tools (Airflow, Dagster, dbt, Prefect, etc.).
- Experience building large-scale document extraction ETL pipelines (OCR, PDF parsing, metadata extraction, NLP preprocessing).
- Proficiency with Kubernetes, Docker, and containerized data pipelines deployed on Azure, AWS, and/or Google Cloud.
- Hands-on experience with relational databases (Postgres, SQL Server, MySQL) and non-relational systems such as Elasticsearch, Redis, and graph databases.
- Experience with document-heavy or text-heavy data processing (OCR, parsing, NLP preprocessing).
- Strong data quality, governance, lineage, and validation mindset.
- Excellent communicator who can align with ML, engineering, and product teams.
- Experience building or supporting GenAI / LLM / RAG pipelines.
- Experience with Azure OpenAI Service.
- Experience with MinIO.
- Background with knowledge graphs, semantic search, or indexing at scale.
- Familiarity with CI/CD pipelines in Azure DevOps, GitHub Actions, or similar.


