Own and optimize the end-to-end LLM serving stack on Majestic hardware: port frameworks (vLLM/SGLang); implement batching, scheduling, paged KV cache, distributed inference, multi-modal preprocessing, speculative decoding, and prefix caching; and profile and eliminate bottlenecks across runtime, compiler, and hardware.
Description
The Role
In this high-impact role, you are the bridge between cutting-edge custom silicon and production-grade AI. You will own the end-to-end LLM serving stack on Majestic hardware, architecting everything from serving APIs down to KV cache management, batching, and scheduling. Your primary mission is to port leading frameworks like vLLM and SGLang to our accelerator and optimize them for peak performance. Because our architecture offers memory headroom, you won't just match traditional GPUs; you will shatter their limits on throughput, batch sizes, and context lengths. As you hunt down bottlenecks, your insights will directly steer our future kernel, compiler, and hardware development.
What You'll Own
- The serving stack, end to end — bring up and adapt a modern inference framework (vLLM, SGLang, or similar) to run on Majestic hardware.
- The runtime hot path — continuous batching, the scheduler, paged KV cache, and prefill/decode disaggregation.
- Distributed inference at scale — tensor, pipeline, and expert parallelism across accelerators, wired into our collective communication library (CCL).
- The multi-modal pipeline — image, audio, and video preprocessing, encoder integration, and mixed-modality batching.
- Inference-time techniques — speculative decoding, prefix caching, and structured decoding.
- End-to-end performance — profile, benchmark, and hunt down bottlenecks across the full serving path, feeding findings back to the kernel, compiler, and hardware teams.
What We're Looking For
- 3+ years building or operating production LLM inference and serving systems (5+ preferred).
- Deep, hands-on work with a modern inference framework (vLLM, SGLang, TensorRT-LLM, Fireworks, or similar), including its scheduler, paged attention / KV cache, model executor, and backend integration points.
- Strong Python and C++, with the ability to move fluidly between the two.
- A real grasp of transformer inference: the prefill/decode split, KV cache behavior, and how batching dynamics shape latency and throughput.
- Distributed inference experience: tensor and pipeline parallelism across multiple devices.
- An instinct for performance: you can profile an end-to-end stack and chase a regression from the serving API all the way down to the kernel.
