Senior Platform & Reliability Engineer (SRE)

Vizcom

Hybrid

San Francisco, CA, USA

Senior level

Lead end-to-end platform reliability: define SLIs/SLOs, harden production architecture, ensure Kubernetes runtime and queue safety, run incident command for Sev1/Sev2, own observability/on-call/runbooks, and gate risky releases while delivering a prioritized reliability roadmap.

The summary above was generated by AI

Agency Notice: We are not currently working with recruiting agencies for this role. Please do not contact Vizcom employees regarding this position. Any resumes submitted without a prior agreement will be considered unsolicited.

About Vizcom

Vizcom is a visual creation platform that combines modern web tooling with AI-powered workflows. Our stack includes a React/TypeScript frontend, Node/Koa + PostGraphile API services, PostgreSQL, Redis, BullMQ queues, and Kubernetes-based production infrastructure.

We’re hiring a senior owner of stability and infrastructure to ensure the platform is reliable, fast, and resilient as we scale.

Role Mission

Own service reliability end-to-end: prevent incidents, reduce blast radius when failures happen, and lead fast, high-quality recovery when production degrades.
This is a hands-on technical leadership role with authority to set reliability standards and enforce production guardrails.

Compensation

$200,000 – $250,000 base salary + meaningful equity

What You’ll Own

Reliability bar: Set and enforce SLIs/SLOs/error budgets for critical user flows.

Production architecture resilience: Drive failure isolation across API, workers, queues, and dependencies so one subsystem cannot take down core access.

Kubernetes runtime reliability: Define probe contracts, rollout/rollback standards, graceful shutdown behavior, scaling/resource policies, and startup safety.

Queue + job safety (BullMQ/Redis): Own poison pill containment and workload isolation.

Incident command quality: Lead Sev1/Sev2 response end-to-end (containment, communications, technical direction, RCA, corrective action execution).

Reliability operating system: Own observability quality (signals over noise), on-call effectiveness, runbooks, and postmortem discipline.

Release safety authority: Gate risky deploys and enforce reliability guardrails when production health is at risk.

Traits We’re Looking For

Calm, structured incident commander under pressure.
Thinks in failure modes and blast radius by default.
Pragmatic: can stabilize quickly, then implement durable fixes.
High ownership and strong written communication.

First 90 Days

Establish baseline reliability metrics and identify top platform risks.

Tighten incident response mechanics (roles, comms cadence, runbooks, status updates).

Deliver high-impact hardening fixes across probes/startup paths/queue safety.

Publish a prioritized 6–12 month reliability roadmap with clear ownership and milestones.

If possible, please include a write-up of one incident you personally led and send it to [email protected]:

1) what failed,

2) how you contained it,

3) what permanent fixes you shipped and how you measured their impact.

San Francisco, California, United States, 94103
