Site Reliability Engineer
Domino has an ambitious vision for data science and machine learning. Our platform helps data science teams accelerate research, increase collaboration, and rapidly deploy predictive models. Our customers are the most sophisticated analytical organizations in the world, including Salesforce, Dell, Red Hat, Gap, Bristol-Myers Squibb, and Bayer. Backed by Sequoia Capital, Zetta Venture Partners, and Bloomberg Beta, we are at the epicenter of the data science revolution, helping companies build better cars, develop more effective medicine, or simply recommend the best song to play next.
You will join a team of high-performance engineers and have a significant impact on how we manage a growing infrastructure and deliver our services. You’ll be tasked with maintaining the health of the Domino platform in a variety of environments, enhancing our observability systems, engineering reliability into our stack, and governing our infrastructure.
We are especially interested in engineers with experience operating services on GCP or Azure, or implementing security policies and controls in cloud service providers.
Responsibilities
- Engineer reliability and performance into our product and services
- Instrument and monitor service health
- Manage and secure our cloud-based infrastructure
- Diagnose and fix issues in a distributed, containerized application
- Respond to incidents (on-call) and perform root cause analysis
- Implement and manage access control and security services
- Collaborate with developers and PMs to continuously improve Domino
- Develop tools and processes to improve efficiency and reduce toil
Qualifications
Tech we use is listed in parentheses; comparable experience is OK.
- Experience with managing cloud environments (AWS, GCP, Azure)
- Strong coding ability (Python, Bash)
- Systems fluency (Linux, storage, networking)
- Experience with container management (Kubernetes, Docker)
- Observability systems (New Relic, Prometheus)
- Operating stacks based on modern software components (Redis, ElasticSearch, RabbitMQ, MongoDB, PostgreSQL, Play)
- Programming experience (Python, Go, Bash)
- Infrastructure and configuration automation (Terraform, SaltStack)
- Exceptional problem-solving acumen