OpenArt

Senior Platform & Reliability Engineer

Posted 3 Hours Ago
Hybrid
San Francisco, CA, USA
Senior level
Help design, scale, and improve platform reliability: define SLOs/SLIs, run on-call and incident response, build observability, improve resilience to external dependencies, enhance CI/CD and deploy safety, optimize cost and capacity, and influence infrastructure architecture.
The summary above was generated by AI

🧑🏼‍💻 Senior Platform & Reliability Engineer

🎨 About OpenArt

OpenArt is an AI Storytelling and Visual Creation Platform used by millions worldwide. We’re building the next generation of creative tools powered by cutting-edge AI, enabling anyone to create videos, visuals, characters, and stories with unprecedented speed and imagination.

We believe the future of creativity is AI-native, and we're shaping that future.

🚀 Why Join OpenArt

  • Small team, massive surface area: senior engineers own real systems, not slices.

  • Ship at real scale: your work goes to millions of users, fast.

  • Founder-led engineering culture: both founders are technical and deeply involved in product and architecture.

  • AI-native product: you’ll design how cutting-edge AI models are exposed as real user experiences.

  • High ownership, low process: we value judgment, clarity, and speed over bureaucracy.

  • 7-10X growth in revenue for the past 2 years. Now you’ll play a critical role in helping the company scale to the next stage.

🎯 About the Role

We’re looking for a Senior Platform & Reliability Engineer to help design, scale, and improve the reliability of our infrastructure, from architectural decisions to hands-on implementation, observability, and cost optimization.

This is not a traditional ops or DevOps role. You’ll work across cloud infrastructure, distributed systems, backend services, and developer tooling, making pragmatic decisions that balance product velocity, system reliability, and cost efficiency—in a fast-moving, AI-native environment.

You’ll partner closely with product engineers to evolve the platform that powers OpenArt, contributing to key decisions around infrastructure architecture, improving multi-provider AI reliability, and helping us scale systems to millions of users—while raising the overall engineering bar.

🛠 What You’ll Do

  • Define and operationalize SLOs/SLIs across critical user journeys (generation, editing, payments/credits, uploads), and use them to guide prioritization and tradeoffs.

  • Participate in an on-call rotation and improve incident response (alert quality, runbooks, escalation paths), including leading blameless postmortems and driving follow-through on action items.

  • Improve system resilience at external boundaries (AI providers, storage, etc.), including timeouts, retries, circuit breakers, and fallback strategies.

  • Build and maintain end-to-end observability (logs, metrics, traces, dashboards) so engineers can quickly understand “what broke” and “why.”

  • Strengthen deploy safety through CI/CD improvements, automated rollbacks, canary releases, and feature flag patterns.

  • Contribute to the evolution of our infrastructure architecture, helping evaluate when to extend serverless patterns vs. adopt containerized or more managed approaches as we scale.

  • Improve cost visibility and efficiency, including per-request cost attribution, caching strategies, and capacity planning.

  • Act as a strong technical contributor, helping improve engineering practices, tooling, and system design decisions across the team.

🧑‍💻 What We’re Looking For

Core Requirements

  • 5+ years building and operating production systems where reliability and scaling are important.

  • Strong software engineering skills — you can build and ship production code, not just configure infrastructure.

  • Experience with cloud-native systems (AWS or GCP), including serverless/event-driven architectures and at least one container-based approach (e.g., ECS/Fargate, Cloud Run, Kubernetes).

  • Solid understanding of observability and reliability practices: metrics, alerting, tracing, and incident response.

  • Experience designing resilient systems with external dependencies (timeouts, retries/backoff, idempotency, circuit breakers).

  • Ability to communicate technical tradeoffs clearly to engineers across different domains.

  • Comfortable operating in ambiguous, fast-moving environments and taking ownership of problems.

Nice to Have

  • Experience building internal platform abstractions (e.g., job orchestration, API layers, workflow systems) that improve team velocity.

  • Track record of improving reliability metrics (e.g., MTTR, SLO attainment, latency) or reducing infrastructure cost.

  • Experience working in a startup or high-growth environment, with broad ownership across systems.

⚙ Tech Stack You’ll Work With

GCP, Cloud Run, Modal, Upstash, Sentry, Amplitude, Firebase, Redis, React/Next.js, Node.js, TypeScript, Python, etc.

💰 Compensation

  • Competitive base salary and bonus program

  • Equity: meaningful ownership in what you build

  • High autonomy, high growth environment

🌍 Work Setup

  • Bay Area preferred (hybrid allowed)

  • Visa sponsorship available

  • We’ll consider remote
