Onebrief

Senior Site Reliability Engineer, Hawaii

In-Office

2 Locations

180K-220K Annually

Senior level

We are hiring a Senior Site Reliability Engineer to ensure deployment stability and service quality, working in on-premise DoD and AWS environments.

About Onebrief

Onebrief is collaboration and AI-powered workflow software designed specifically for military staffs. By transforming this work, Onebrief makes the staff as a whole superhuman - meaning faster, smarter, and more efficient.

We take ownership, seek excellence, and play to win with the seriousness and camaraderie of an Olympic team. Onebrief operates as an all-remote company, though many of our employees work alongside our customers at military commands around the world.

Founded in 2019 by a group of experienced planners, Onebrief now has a team that spans veterans from all forces and global organizations, and technologists from leading-edge software companies. We’ve raised $123M+ from top-tier investors, including Battery Ventures, General Catalyst, Insight Partners, and Human Capital, and Onebrief is currently valued at $1.1B. With this continued growth, Onebrief is able to make an impact where it matters most.

Security Clearance, Location, and Onsite Notice:

This role requires regularly working on-site at customer locations on Oahu, Hawaii, specifically Camp H.M. Smith and Joint Base Pearl Harbor-Hickam.

If you are not currently within commuting distance, you must be willing to relocate (note that Onebrief will provide relocation assistance).

Active Top Secret Clearance required; SCI eligibility is a plus.

About The Role

We are hiring a Senior Site Reliability Engineer to join our Infrastructure & Security team. You’ll work closely with fellow SREs, security, and customer success.

You will be the first line of support for our mission critical deployments, and responsible for ensuring best-in-class service quality and issue resolution. You will work in both on-premise DoD environments and AWS cloud environments. Your lessons from the field will shape how our team works, from policy to implementation.

In addition to working on-site with the customer, you will contribute directly to solutions that increase the stability, performance, and security of our deployments, and improve the overall experience of deploying and managing Onebrief on-premise.

About You

You care deeply about reliability and treat it as a core feature of any application or platform, with a bias toward “reliability over novelty.” You think about infrastructure and operability as products to be automated, well-documented, and continuously improved, and you aim to leave systems easier to operate than you found them.

You are equally comfortable leading a post-incident review or diving into a kubectl shell to triage a complex production issue. You don't just fix problems; you translate constraints and failure modes into clear, automated guardrails and scalable, resilient architecture. For you, robust monitoring, actionable alerting, and insightful runbooks are core parts of the engineering process, not afterthoughts.

You mentor others, fostering a culture of blameless postmortems and proactive reliability. You collaborate naturally with application and platform teams, helping them move quickly but safely by building the tools, processes, and observability that make "fast recovery" a reality.

What You'll Do

You'll own the reliability, scalability, and security of the production application and/or platform. You will do this by:

Implementing a World-Class Observability Platform: Design, implement, and manage our monitoring, logging, and alerting stack (e.g., Prometheus, Loki, Alloy, and Grafana). You won't just track metrics; you'll create the actionable insights and automated alerting that allow teams to identify and resolve issues before they impact users.
Defining and Upholding Reliability: Define, measure, and own alerting that feeds into our Service Level Indicators (SLIs) and Service Level Objectives (SLOs), increasing trust internally and externally. You will be the organization's expert on what it means for our systems to be reliable and how to measure it.
Leading Incident Response: Act as an incident responder, and potentially incident commander, during critical incidents, and lead blameless post-mortems / After Action Reviews (AARs) that identify true root causes and drive automated, long-term solutions to prevent recurrence.
Automating for Scale and Security: Partner with platform engineers to design, build, and manage secure, resilient Kubernetes clusters and cloud/on-prem environments using Infrastructure-as-Code (Terraform, Ansible). You will embed security and compliance controls (RMF, STIGs) directly into this automation.
Eliminating Toil and Scaling the Team: Proactively identify and eliminate operational toil by building automation. You will partner with other teams to share best practices for air-gapped environments and support their readiness for production.

What We Look For

An active Top Secret clearance.
5+ years in Platform, DevOps, or Site Reliability Engineering with an infrastructure and operations focus.
Proven partner to DevOps/Platform and application teams; collaborates well across functions and shares context openly.
A deep understanding of incident response processes, with experience conducting thorough root cause analyses and driving continuous improvement.

Technical expertise

Infrastructure as Code: Terraform (or CloudFormation), Ansible.
Containers and orchestration: Kubernetes design, deployment, and operations.
CI/CD: experience building and maintaining pipelines (GitLab CI/CD, Jenkins, GitHub Actions).
Scripting: proficiency with at least one of Python, Go, or Bash.
Cloud: familiarity with AWS or AWS GovCloud.
Observability: Grafana stack, ELK stack, or Datadog.
Networking fundamentals: core protocols and secure configurations.

Bonus points (nice to have)

Experience in DoD environments and compliance frameworks (RMF, STIGs, ICD 503).
GitOps practices and toolchains.
Security‑minded design for sensitive environments.
Experience designing and implementing meaningful SLIs/SLOs (including error budgets) for complex, distributed systems.
Familiarity with on‑prem virtualization (VMware, Proxmox, Nutanix, Hyper-V, etc.).
Service mesh exposure (Istio, Linkerd).
Relevant certifications (e.g., AWS DevOps Engineer, CKA/CKAD).
Active Security+ or another DoD 8570.01-approved security credential, or the ability to obtain a valid credential within 3 months of employment.

Notice to Third Party Recruitment Agencies

Please note that Onebrief does not accept unsolicited resumes from recruiters or employment agencies. In the absence of an executed Recruitment Services Agreement, there will be no obligation to pay any referral compensation or recruiter fee. In the event a recruiter or agency submits a resume or candidate without an agreement, Onebrief explicitly reserves the right to pursue and hire those candidate(s) without any financial obligation to the recruiter or agency. Any unsolicited resumes, including those submitted to hiring managers, shall be deemed the property of Onebrief.

Top Skills

Ansible

AWS

Docker

DoD Compliance

Helm

Kubernetes

Linux

Terraform

VMware
