Hyphen Connect Limited
LLM Pre-training & Distributed Engineer (AI Infrastructure)
Design, orchestrate, and optimize large-scale LLM pre-training across 1,000+ GPUs. Implement 3D parallelism, manage GPU clusters (SLURM/Kubernetes), optimize InfiniBand/RDMA networking and memory, and automate checkpointing and failure recovery for long training runs.
We are seeking a highly skilled LLM Pre-training & Distributed Systems Engineer. This role is essential for orchestrating large-scale machine learning training runs and optimizing distributed infrastructure. The ideal candidate will have a deep understanding of GPU clusters and extensive systems engineering experience to keep training runs efficient and reliable.
Responsibilities:
- Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM.
- Optimize networking (InfiniBand/RDMA) and tune memory management to prevent out-of-memory errors.
- Automate checkpointing and failure recovery during month-long training runs.
Required Skills:
- Deep expertise in 3D parallelism (Data, Tensor, Pipeline).
- Experience managing SLURM or Kubernetes-based GPU clusters.
- Strong systems engineering background (C++, CUDA, Python).
What you need to know about the San Francisco Tech Scene
San Francisco and the surrounding Bay Area attract more startup funding than any other region in the world. Home to Stanford University and UC Berkeley, leading VC firms and several of the world’s most valuable companies, the Bay Area is the place to go for anyone looking to make it big in the tech industry. That said, San Francisco has a lot to offer beyond technology, thanks to a thriving art and music scene, excellent food and a short drive to several of the country’s most beautiful recreational areas.
Key Facts About San Francisco Tech
- Number of Tech Workers: 365,500; 13.9% of overall workforce (2024 CompTIA survey)
- Major Tech Employers: Google, Apple, Salesforce, Meta
- Key Industries: Artificial intelligence, cloud computing, fintech, consumer technology, software
- Funding Landscape: $50.5 billion in venture capital funding in 2024 (Pitchbook)
- Notable Investors: Sequoia Capital, Andreessen Horowitz, Bessemer Venture Partners, Greylock Partners, Khosla Ventures, Kleiner Perkins
- Research Centers and Universities: Stanford University; University of California, Berkeley; University of San Francisco; Santa Clara University; Ames Research Center; Center for AI Safety; California Institute for Regenerative Medicine


