Calix

Staff Site Reliability Operations Engineer

Posted 3 Days Ago
Remote
Hiring Remotely in USA
136K-266K Annually
Senior level
Lead global platform reliability and observability on GCP. Architect full-stack networking (L1-L7), scale GKE, manage high-throughput Kafka pipelines, and maintain PostgreSQL/AlloyDB/BigQuery. Deploy Grafana telemetry stack and AIOps for intelligent alerting and automated incident response. Provide technical leadership, roadmap ownership, and mentor engineers in distributed systems and observability best practices.
The summary above was generated by AI
The Calix platform enables Communication Service Providers (CSPs) of all sizes to transform and future-proof their businesses. Through real-time data, automation, and actionable insights delivered via Calix One — our cloud-first, AI-powered platform — CSPs can simplify operations, collapse cost, and accelerate innovation. Calix One brings together the automation of everything and the experience of one, empowering customers to deliver differentiated subscriber experiences while driving acquisition, loyalty, and revenue growth. This is the Calix mission: to enable CSPs of all sizes to simplify, innovate, and grow, strengthening both their businesses and the communities they serve.
We’re at the forefront of a once-in-a-generation change in the broadband industry. Join us as we innovate, help our customers reach their potential, and connect underserved communities with unrivaled digital experiences.

Role Overview 

We are seeking a Staff Site Reliability Engineer (SRE) to lead our global platform reliability and drive our next-generation observability strategy on Google Cloud Platform (GCP). In this role, you will leverage Grafana Labs' complete telemetry stack and AIOps methodologies to build intelligent, self-healing infrastructure. You will bring deep expertise in scaling enterprise-grade Google Kubernetes Engine (GKE) topologies, managing high-throughput Kafka event streams, and maintaining high-performance PostgreSQL, AlloyDB, and BigQuery ecosystems at massive scale. Crucially, you will provide deep technical leadership across the entire networking stack, diagnosing complex issues from physical-layer transport up to application-layer protocols. 

This position is fully remote. You can work from anywhere in the United States or Canada, provided you have a reliable internet connection, and you will collaborate with a distributed engineering organization across multiple time zones.

Key Responsibilities: 

  • Full-Stack Network Architecture: Architect, optimize, and troubleshoot complex networking infrastructure spanning Layer 1 through Layer 7, ensuring low-latency data transport, secure edge routing, and seamless service mesh integration. 

  • Grafana Stack Architecture: Design, scale, and optimize our unified observability platform using the Grafana Labs suite (Grafana, Mimir, Loki, Tempo, and Beyla). 

  • AIOps & Intelligent Alerting: Deploy machine learning models and automated anomaly detection to cut through telemetry noise, reduce alert fatigue, and predict network or data pipeline bottlenecks. 

  • GKE Platform Engineering: Drive the architecture, scaling, security, and networking of production Google Kubernetes Engine (GKE) clusters. 

  • Data & Event Streaming Reliability: Tune and maintain high-throughput Apache Kafka clusters to guarantee low-latency event delivery and high availability.

  • Large-Scale Database Management: Ensure the performance, scalability, and disaster recovery readiness of our transactional and analytical data tiers across PostgreSQL, AlloyDB, and BigQuery. 

  • Automated Incident Response: Integrate AIOps insights with Grafana workflows to automate triage, accelerate root-cause analysis, and trigger auto-remediation scripts. 

  • Technical Leadership: Champion the long-term technical roadmap for distributed infrastructure engineering and GCP cloud-native observability standards. 

  • Mentorship: Coach senior and junior engineers on advanced debugging techniques, distributed systems thinking, and intelligent operations across a distributed workforce. 

Required Qualifications 

  • Location/Work Style: Proven track record of high autonomy and successful delivery in a 100% remote engineering environment. 

  • Experience: 8+ years in SRE, Production Engineering, or Distributed Systems infrastructure roles. 

  • Networking Expertise (L1-L7): Deep technical knowledge and debugging mastery across all OSI layers, including:

      • L1-L3: Physical/fiber infrastructure awareness, switching, and advanced routing protocols (BGP, OSPF).

      • L4: Transport layer tuning (TCP congestion control algorithms, UDP, QUIC).

      • L5-L7: Session management, TLS termination, DNS architecture, and advanced application protocols (HTTP/3, gRPC).

  • Orchestration & Containerization: Expert-level mastery of Google Kubernetes Engine (GKE) internals, custom controllers, multi-cluster networking, and GitOps workflows. 

  • Data Infrastructure: Proven track record managing high-throughput Apache Kafka pipelines and large-scale data environments across PostgreSQL, AlloyDB, and BigQuery. 

  • Grafana Ecosystem: Deep, hands-on experience deploying and managing Grafana Enterprise/Cloud, Prometheus/Mimir, Loki, and Tempo at scale. 

  • AIOps Implementation: Track record applying AI/ML techniques for time-series anomaly detection, log clustering, and correlation (e.g., Grafana Adaptive Metrics, BigPanda). 

  • Infrastructure as Code: Advanced, production-scale expertise utilizing HashiCorp Terraform exclusively to provision and manage multi-region GCP cloud architectures. 

  • Programming: High proficiency in Go and Python for building custom infrastructure tooling, Kubernetes operators, and data integration scripts. 

Preferred Attributes 

  • Remote Communicator: Exceptional written and verbal communication skills, with an emphasis on creating clear documentation for asynchronous alignment. 

  • GCP Expert: Deep knowledge of Google Cloud architectural best practices, Cloud SDN, Cloud Armor, Interconnect, Identity and Access Management (IAM), and cost optimization. 

  • Systems Thinker: Deep understanding of Linux internals, eBPF-based monitoring, kernel-level networking, and packet analysis tools (Wireshark, tcpdump). 

The base pay range for this position varies by geographic location. More information about the pay range specific to the candidate's location and other factors will be shared during the recruitment process. Individual pay is determined by location of residence and multiple other factors, including job-related knowledge, skills, and experience.

San Francisco Bay Area:

156,400 - 265,700 USD Annual

All Other US Locations:

136,000 - 231,000 USD Annual

As a part of the total compensation package, this role may be eligible for a bonus. For information on our benefits click here.

HQ

Calix San Jose, California, USA Office

2777 Orchard Pkwy, San Jose, CA, United States


