Muvr is building the future of on-demand logistics and moving services. Our platform powers real-time booking, pricing, matching, payments, and fulfillment across customers, drivers, and partners. As we scale, infrastructure reliability and operational excellence become product requirements. This role exists to keep production stable, observable, secure, and scalable so engineering teams can ship quickly without sacrificing uptime, correctness, or customer trust.
Role Overview
The DevOps / Site Reliability Engineer (SRE) owns the reliability foundations of Muvr’s platform. You will design and operate cloud infrastructure, improve deployment speed and safety, strengthen observability, and lead incident practices that prevent repeat failures.
This is a hands-on, production-ownership role for someone who values automation, low-toil systems, and practical guardrails that make delivery faster and safer at the same time. You will partner closely with Engineering, Security, Product, and adjacent teams to harden the platform as usage grows.
Key Responsibilities
1) Platform Reliability and Production Ownership
- Own uptime, latency, availability, and error-rate outcomes for core services.
- Establish SLOs, SLIs, and alerting aligned to customer impact and service health.
- Improve reliability through resilient patterns such as retries, timeouts, circuit breakers, load shedding, and queue protections.
- Reduce operational toil by building automation and self-service tools that improve engineering velocity and operational safety.
2) Cloud Infrastructure and Infrastructure as Code
- Design, build, and maintain scalable cloud infrastructure across AWS, GCP, or Azure environments.
- Automate provisioning, configuration, and change management using Infrastructure as Code, preferably Terraform.
- Improve disaster recovery readiness through backups, restore validation, redundancy, and failover planning.
- Maintain strong environment consistency across development, staging, and production to reduce deployment surprises and configuration drift.
3) CI/CD and Release Engineering
- Build and improve CI/CD pipelines to increase deployment frequency while reducing release risk.
- Standardize deployment practices, including versioning, environment promotion, staged rollouts, canary releases, and rollback mechanisms.
- Implement release guardrails such as required test gates, policy checks, dependency scanning, and secrets detection.
- Improve developer experience through faster builds, clearer failure signals, and more reliable deployment workflows.
4) Observability and Operational Excellence
- Build and maintain observability across logs, metrics, tracing, dashboards, and service-level visibility.
- Design alerting that catches critical failures early while minimizing noise and paging fatigue.
- Create runbooks and playbooks that are actionable under pressure and linked to specific alerts or operational scenarios.
- Improve MTTR through better instrumentation, faster diagnosis paths, and clearer service ownership.
5) Incident Management and Root-Cause Discipline
- Lead or coordinate incident response, including triage, communication, mitigation, recovery, and follow-through.
- Run blameless postmortems with clear root-cause narratives, contributing factors, and prevention actions.
- Ensure corrective actions are tracked to completion and meaningfully reduce recurrence.
- Establish incident severity levels, escalation paths, and communication templates that improve consistency during outages or degradation events.
6) Security and Compliance Baselines
- Partner with Engineering to implement security best practices, including least privilege, secrets management, encryption, and audit logging.
- Improve access hygiene through MFA coverage, key rotation, access reviews, and break-glass procedures.
- Identify infrastructure risks and drive remediation with clear prioritization, ownership, and operational follow-through.
- Support audit and compliance readiness through clear documentation, logging, and evidence-friendly processes when needed.
7) AI-Enabled Productivity and Execution
- Use AI tools thoughtfully to improve productivity, troubleshooting speed, documentation quality, and automation efficiency.
- Apply AI responsibly to support analysis, scripting, incident investigation, and workflow improvement while maintaining security, accuracy, and sound operational judgment.
Required
- 3+ years of experience in DevOps, Site Reliability Engineering, Infrastructure Engineering, or similar roles supporting production systems.
- Strong experience with at least one major cloud provider: AWS, GCP, or Azure.
- Experience building or maintaining CI/CD pipelines using GitHub Actions, Jenkins, CircleCI, or similar tools.
- Familiarity with containerization using Docker and orchestration platforms such as Kubernetes.
- Strong troubleshooting skills across infrastructure, core networking concepts, deployments, and service operations.
- Ability to write automation scripts and tooling using Bash, Python, or similar languages.
- Comfort using AI tools to improve efficiency and work quality, and a willingness to learn emerging AI workflows and apply them responsibly.
Preferred
- Experience supporting marketplace, logistics, dispatch, delivery, or other real-time operational platforms.
- Experience with observability tools such as Prometheus and Grafana, Datadog, New Relic, or similar platforms.
- Strong Infrastructure as Code experience using Terraform, CloudFormation, or equivalent tooling.
- Experience scaling distributed systems in production, including autoscaling, queue management, caching strategies, and traffic spike handling.
- Familiarity with security best practices and compliance expectations for production systems.
- Familiarity with tools and systems such as Slack, Google Workspace, ChatGPT, ClickUp, Hubstaff, GitHub, CI/CD platforms, Kubernetes, Terraform, Datadog, Grafana, cloud consoles, ticketing tools, and other infrastructure or reliability platforms.
Why Join Muvr
- Own reliability and infrastructure for a fast-growing real-time logistics marketplace.
- Take on a high-impact role shaping scalability, operational readiness, and production discipline.
- Partner directly with engineering leadership to build systems that scale safely and sustainably.
- Work on meaningful infrastructure problems where uptime, speed, and correctness directly affect real-world outcomes.
- Competitive compensation.


