The Voleon Group

Senior Cluster Site Reliability Engineer

Reposted 14 Days Ago
In-Office or Remote
2 Locations
205K-235K Annually
Senior level
The Senior Cluster Site Reliability Engineer will enhance the research compute cluster's uptime, reliability, and performance through engineering and operational improvements, ensuring high availability for researchers working on machine learning problems.
The summary above was generated by AI
Voleon is a technology company that applies state-of-the-art machine learning techniques to real-world problems in finance. For nearly two decades, we have led our industry and worked at the frontier of applying machine learning to investment management. We have become a multibillion-dollar asset manager, and we have ambitious goals for the future. 

As a Senior Cluster Site Reliability Engineer (SRE), you will help scale our research compute cluster to meet our growing needs, and you will apply your engineering skills to ensure high uptime, reliability, and robustness. Our research clusters are at the core of our R&D, and you will be directly responsible for keeping this key resource available and performant. Your work will provide a world-class HPC platform that lets researchers focus on cutting-edge machine learning problems at scale. You will support both on-prem and cloud infrastructure, and work to provide the best experience to our technical staff. You will apply infrastructure-as-code (IaC), automation, and SRE principles to refine a product that operates 24/7 to support Voleon.

The Cluster Operations team works on the frontline to triage and mitigate real-time operational issues. You will be an integral member of this team, solving day-to-day issues with high urgency, while also engineering systemic improvements and architectural fixes to prevent recurring issues. You will collaborate with engineering teams to develop improvements to monitoring/telemetry. You will help design and oversee operational frameworks to ensure the cluster operates within a set of rigorous SLAs. 

Responsibilities

  • Be a first responder in the event of cluster outages or issues. Triage and resolve urgent issues as they arise
  • Ensure a high degree of cluster uptime (measured in multiple nines), and define and track SLAs to quantify reliability
  • Diagnose systemic/recurring patterns of problems, and engineer precision solutions to them in collaboration with engineering teams
  • Develop robust metrics and observability for cluster health and use those metrics to inform your work. Build out custom observability mechanisms when off-the-shelf ones won't do
  • Help software and research teams design policies around fair cluster usage, and help develop enforcement mechanisms for said policies
  • Assist in forecasting cluster growth, and help select appropriate scale-up strategies. Help optimize operations across dimensions of cost and usability

Requirements

  • 5+ years of experience in SRE or DevOps roles, preferably working as a senior engineer or tech lead
  • Knowledge of HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or machine learning training systems (Kubeflow, MLflow, Horovod)
  • Ability to develop scripts and utilities of moderate complexity in a common scripting language (Python, Ruby, etc.)
  • Familiarity with infrastructure-as-code and configuration management tools (Terraform, Ansible)
  • Experience with cloud infrastructure (AWS or GCP)
  • Familiarity designing and implementing modern observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry)
  • Experience with distributed storage technologies (Lustre, Ceph, S3)
  • Embodies a "system engineer" rather than "system administrator" mindset, thinking systematically and leveraging automation
  • Bachelor's degree in computer science

Preferred Qualifications

  • Hands-on experience with HPC frameworks (Slurm, Grid Engine) and Kubernetes-based job orchestrators (Airflow, Kueue, Kubeflow Pipelines), along with other distributed computing frameworks (Ray, Modin, Dask, Spark)
  • Familiarity with ML frameworks (PyTorch/TensorFlow, JAX, Horovod, DeepSpeed)
  • Familiarity with hybrid/on-prem environments
  • Experience with containerization (Docker, Podman, Singularity), particularly for HPC/batch compute environments
  • Experience with HPC networking (InfiniBand, RDMA)
  • Solid security/IAM foundations (Identity management systems, AWS/GCP IAM, Zero Trust)

The base salary range for this position is $205,000 to $235,000 in the location(s) of this posting. Individual salaries are determined through a variety of factors, including, but not limited to, education, experience, knowledge, skills, and geography. Base salary does not include other forms of total compensation such as bonus compensation and other benefits. Our benefits package includes medical, dental and vision coverage, life and AD&D insurance, 20 days of paid time off, 9 sick days, and a 401(k) plan with a company match.
 
“Friends of Voleon” Candidate Referral Program
If you have a great candidate in mind for this role, you could earn $15,000 if your referred candidate is successfully hired and employed by The Voleon Group; please use this form to submit your referral. For more details regarding eligibility and terms and conditions, please review the Voleon Referral Bonus Program.
 
Equal Opportunity Employer
The Voleon Group is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.

Top Skills

Ansible
AWS
Ceph
Docker
ELK
GCP
Grafana
Horovod
HPC
InfiniBand
Kubeflow
Kueue
Loki
Lustre
MLflow
OpenTelemetry
Podman
Prometheus
Python
RDMA
Ruby
S3
Singularity
Slurm
Terraform
HQ

The Voleon Group Berkeley, California, USA Office

Downtown, Berkeley, CA, United States, 94704
