SRE Platform Engineer

GE Vernova

Posted Yesterday

Remote

Hiring Remotely in USA

Senior level

Operate and harden production EKS Kubernetes clusters across multiple AWS regions. Build IaC (Terraform, Ansible), implement policy-as-code, ensure security and compliance, manage observability (Prometheus/Grafana), perform L3 support and incident RCA, run platform-level testing and DR, automate toil, and partner with application teams for sizing and cost optimization to achieve high availability for critical cloud infrastructure.

The summary above was generated by AI

Job Description Summary

The Platform System Reliability Engineer is the primary operations engineer and operator of our EKS Kubernetes environment, which serves as the foundation for our global grid software SaaS products. This role focuses on the "middle-mile" of software delivery, ensuring that the underlying compute, networking, and storage layers are secure, hardened, scalable, and resilient to support critical energy infrastructure in the cloud. You will be responsible for the full lifecycle of production clusters, from initial bootstrapping through performance tuning, patching, and securing.

Job Description

Roles and Responsibilities

Day 0: Provision & Infrastructure Hardening

Kubernetes Cluster Orchestration: Help design and deploy hardened EKS clusters across multiple AWS regions, ensuring consistent security baselines.
Infrastructure as Code (IaC): Build and maintain reusable Terraform and Ansible modules for automated provisioning of cloud infrastructure services, including networking, compute, storage, queues, and caches.
Security Architecture: Implement "Policy as Code" guardrails and secure network perimeters (ESPs) in alignment with NERC CIP and IEC 62443 standards.
Operationalize Cloud Infrastructure: Standardize the runbooks and operating processes required to run critical infrastructure with the highest reliability.

Day 1: Platform Readiness & Scaling

Resource Governance: Define and enforce Kubernetes resource quotas, limit ranges, and Pod Priority classes to ensure mission-critical services receive prioritized compute resources.
Connectivity & Ingress: Manage the ingress strategy and service mesh architecture to facilitate secure, performant connectivity between distributed microservices.
Acceptance Testing: Lead platform-level smoke tests, load tests, and disaster recovery exercises to validate that the infrastructure can meet 99.99% uptime targets.
Sizing & Optimization: Partner with application teams to right-size containerized workloads, optimizing for both performance and cloud cost (FinOps).

Day 2: Operational Excellence & Tier 3 Support

L3 Escalation: Act as the highest technical escalation point for complex Kubernetes internals, troubleshooting issues such as failed pods, memory leaks, and network partitions.
Incident Response: Lead root cause analysis (RCA) for platform-level outages, implementing systemic fixes to prevent recurring failures.
Toil Elimination: Proactively identify and automate repetitive operational tasks—such as cluster upgrades and OS patching—to ensure the team spends at least 50% of their time on engineering improvements.
Observability Integration: Institutionalize platform monitoring using Prometheus and Grafana, creating dashboards that surface the "Golden Signals" of cluster health.

Technical Requirements

Kubernetes: 5 years of experience operating production-grade Kubernetes clusters at scale.
Orchestration & Observability Tools: Expert-level knowledge of multi-cluster management and performance tuning, plus experience implementing observability tools such as Prometheus/Grafana, Dynatrace, Splunk, or Datadog.
AWS Infrastructure: Deep hands-on experience with AWS core services (EKS, EC2, ALB, S3, RDS, MSK).
Automation Stack: Proficiency in Terraform, Ansible, and Python or Go for infrastructure automation, and in deployment tools such as ArgoCD or Flux.
Networking & Security: Strong understanding of, and hands-on experience with, cloud networking concepts such as VPCs, routing, and load balancing, and security configurations such as encryption and certificate management.

Education Qualification

Bachelor's degree in Computer Science or a STEM major (Science, Technology, Engineering, and Math) with advanced experience.

Experience

Professional Background: 6–8 years in SRE or Platform Engineering roles supporting mission-critical, 24/7 cloud environments.
Crisis Management: Proven track record as a structured incident responder who can handle production-down and break-glass scenarios in mission-critical applications.

Preferred Qualifications

Regulated Environments: Practical knowledge of NERC CIP, SOC2, ISO 27001, or IEC 62443 compliance standards in a SaaS context.
Certifications: AWS Certified DevOps Engineer – Professional, CKA (Certified Kubernetes Administrator), or SRE Practitioner Certification.
Critical Infrastructure: Experience supporting mission-critical systems in energy, utilities, or other high-stakes industrial sectors.

Business Acumen:
Understand key cross-functional concepts that impact the organization; be aware of business priorities and organizational dynamics.
Leadership:
Coach and mentor team members.
Be familiar with the concepts of costing hardware and software components. Work to ensure work is on time and within budget.
Deliver tasks on time and in alignment with architectural goals. Identify and raise issues, risks, and benefits.
Participate in change initiatives by implementing new directions and providing appropriate information and feedback.
Personal Attributes:
High level of energy and enthusiasm with the ability to thrive in a rapidly changing environment
Demonstrated customer focus – evaluates decisions through the eyes of the customer; builds strong customer relationships; creates processes with customer viewpoint; partners with customers
Change oriented – actively generates process improvements; champions and drives change initiatives; confronts
Ability to work with global teams, act independently and as part of a team
Apply values, policies, procedures and precedent to make timely, routine decisions of limited, clear choice
Open-minded toward new perspectives and ideas. Consider different or unusual solutions when appropriate
Resolve day-to-day issues related to strategy implementation. Escalate issues that impact the client and/or strategic initiatives
Strong analytical and problem-solving skills: communicates in a clear and succinct manner and effectively evaluates information and data to make decisions; anticipates obstacles and develops plans to resolve them

Additional Information

Relocation Assistance Provided: Yes

#LI-Remote - This is a remote position
