HiBob

AI Infrastructure & Reliability Engineer

Posted 4 Hours Ago

Remote or Hybrid

Hiring Remotely in Israel

Mid level

The AI Infrastructure & Reliability Engineer will manage cloud infrastructure, CI/CD processes, observability practices, and AI operations to ensure platform reliability and performance.

Job Description
About Us
HiBob helps modern, mid-size businesses transform the way they manage people, giving HR and managers all they need to connect, engage, develop, and retain top talent. Since 2015, we've achieved consecutive triple-digit year-over-year growth, all backed by our amazing team of Bobbers from across the globe, making us the HRIS of choice for over 5,500 mid-size and multinational companies and over 1 million users.
Our HR platform is intuitive, data-driven, and built for the way people work today: globally, remotely, and collaboratively.
What this role is really about
You'll join a 3-person platform team within our Business Technology group, owning the internal infrastructure that our AI platform and its users depend on. This isn't a product engineering role, and it isn't ticket work or babysitting pipelines someone else built. You're building and operating the internal foundation that the company runs on. The work covers the full stack of platform engineering: core cloud infrastructure (AWS, Kubernetes, IaC), CI/CD pipelines, AI-driven infrastructure components, and the SRE and observability practice that keeps it all honest: metrics, alerting, incident response, and reliability standards. As our AI capabilities grow, so does the complexity underneath them, and staying ahead of that is central to the role. If you treat infrastructure as a product - reusable, automated, observable, and built to last - this is your kind of role.
Job Requirements

2-4 years of hands-on DevOps, SRE, or infrastructure engineering experience in production SaaS environments.
Strong AWS experience: multi-account architecture, cross-account IAM, serverless and event-driven services (Lambda, SQS, SNS, EventBridge), and EKS cluster management.
Proven Kubernetes experience in production, including cross-account migrations and stateful workload management.
Proficiency with Terraform - repository structure design, module architecture, and CI/CD pipeline implementation.
Hands-on experience building and maintaining GitHub Actions pipelines for end-to-end CI/CD workflows.
Working Python proficiency for scripting, internal tooling, and workflow automation.
Practical experience implementing observability stacks from scratch: metrics, logging, distributed tracing, and alerting.
Experience owning reliability practices: SLOs, incident response, and postmortem culture.

Nice to have

Hands-on experience operating LLM APIs in production: rate-limit and quota management, cost attribution per team/model, latency monitoring, and resilience patterns (retries, fallbacks, circuit breakers).
FinOps experience across cloud, AI, and observability spend.
Experience introducing self-healing or auto-remediation patterns in production.

Job Responsibilities

DevOps & AI-Driven Infrastructure - own CI/CD, deployment processes, and release reliability. Build and operate cloud infrastructure that is automated, intelligent, and continuously self-improving - not just managed.
- Design and build our Terraform repository and IaC pipeline from scratch - AI-assisted generation, drift detection, and policy enforcement built in.
- Build AI-driven GitHub Actions pipelines - automated code review, risk assessment, and intelligent deployment decisions.
- Manage Kubernetes workloads across AWS accounts - zero downtime, fully automated, nothing left behind.
- Embed AI into the operational layer - proactive drift detection, automated remediation, and intelligent scaling toward a self-healing runtime.
Reliability & SRE - improve uptime, resilience, and incident response.
- Define and enforce SLOs/SLIs, error budgets, and on-call practices.
- Lead incident response, postmortems, and systemic reliability improvements.
- Own AI-specific reliability: model latency SLOs, token quota monitoring, rate limit handling, fallback and retry strategies, and cost-per-request alerting.
Observability & Telemetry - increase visibility, reduce noise, improve troubleshooting.
- Establish and continuously evolve the observability stack: metrics, logs, distributed tracing, and alerting tuned for both application and AI workloads.
AI / LLM Operations - bringing AI systems to production and operating them at scale, with a focus on reliability, performance, and trust.
- Own the AI infrastructure layer: rate limits, quota management, latency SLOs, and fallback strategies (retries, circuit breakers).
- Operate LLM APIs in production with resilience and cost attribution per team/model.
FinOps & Cost Optimization - optimize AI, infra, and logging costs at scale.
- Build cost visibility and guardrails across AWS, LLM usage, and observability pipelines.

Benefits
Join our village
HiBob is a village filled with amazing people and we're especially proud of that. It's a place where Bobbers can be themselves. We're about fun, dreams, hopes and ambition, just as much as we are about precision, growth, and top performance. Becoming a Bobber means you'll receive competitive compensation, benefits, and pre-IPO equity alongside all of this:

Company share options plan
Flexible hybrid working model
Work from home allowance - to get your home office set up!
Payment for sick leave from the first day
2 Social Impact days per year for volunteering
Annual Headspace subscription and wellness benefits
Awesome employee referral program - $2,500 for each successful referral with an additional ambassador programme
Monthly Wolt Allowance
Transportation allowance
Dog-friendly
Temporary remote work from anywhere in the world for up to 2 months (after 6 months of employment)
Fun company and team social events (locally and virtually with our global teams)
Bob balance days - 4 additional days off within a calendar year, giving you a company-wide long weekend at the beginning of each quarter

If this sounds like something you've been looking for, we'd love to have you. Come on, join our village!

Top Skills

AWS

GitHub Actions

Kubernetes

Python

Terraform
