Zoom

Senior AI Engineer

Reposted 3 Days Ago
In-Office
San Jose, CA, USA
209K-275K Annually
Senior level
Develop and manage a Machine Learning platform, implementing user interfaces, ensuring security, optimizing performance, and collaborating with data scientists.
The summary above was generated by AI
Immigration sponsorship is not available for this position

Responsibilities:

• Develop the Machine Learning Platform management system.

• Design and implement intuitive user interfaces and APIs for seamless interaction with the platform.

• Ensure robust access control and security measures for the Machine Learning Platform.

• Regularly evaluate and enhance platform performance, scalability, and reliability. Integrate tools for data versioning, experiment tracking, and workflow orchestration.

• Build the toolchains, services, and pipelines for the model development workflow and the model serving architecture.

• Create automated pipelines for data preprocessing, feature engineering, and dataset versioning.

• Develop CI/CD pipelines for deploying models into production environments with minimal downtime.

• Enable support for distributed model training and hyperparameter optimization.

• Incorporate A/B testing frameworks for evaluating multiple model deployments.

• Collaborate with data scientists and engineers to streamline the model development lifecycle.

• Prioritize metrics for monitoring model training and inference. Implement logging and monitoring tools to track model performance, resource utilization, and throughput.

• Develop dashboards to visualize key metrics such as latency, accuracy, and drift detection in real time.

• Establish alerting mechanisms to detect and respond to anomalies or performance degradation.

• Continuously refine metric prioritization based on stakeholder feedback and evolving business goals.

• Develop and maintain the high-performance GPU infrastructure and clusters for LLM training.

• Optimize GPU utilization for large-scale training workloads, ensuring minimal resource wastage.

• Implement fault-tolerant and distributed training strategies for handling large language models (LLMs).

• Evaluate and integrate emerging hardware technologies, such as TPUs, into the training infrastructure.

• Regularly update cluster configurations to support new frameworks and model architectures.

• Manage scheduling and resource allocation for multi-tenant GPU clusters.

• Understand autoscaling for inference services and dynamic loading of multiple models.

• Design systems that dynamically allocate resources based on real-time demand for inference services.

• Develop mechanisms for loading and unloading models in memory to optimize latency and resource usage.

• Implement strategies for caching frequently used models to improve inference performance.

• Experiment with serverless architectures to further enhance scalability and cost efficiency.

• Ensure compatibility with edge devices and deploy lightweight models for edge inference.

• Support, troubleshoot, and resolve any issues during training and inference.

• Create detailed runbooks for common troubleshooting scenarios to reduce resolution times.

• Perform root cause analysis for failures and implement long-term fixes to prevent recurrence.

• Collaborate with DevOps and IT teams to ensure the stability of underlying infrastructure.

• Develop self-healing systems that can automatically recover from common training or inference issues.

• Provide technical support and guidance to data scientists and engineers working on the platform.

What we're looking for:

Requires a Bachelor's degree in Communications Engineering, Artificial Intelligence, Software Engineering, a related field, or a foreign degree equivalent. Must have 2 years of experience in job offered or related occupation. Must have 2 years of experience in:

• Designing, implementing, or optimizing large-scale distributed training systems using technologies like Horovod, DeepSpeed, PyTorch Distributed, or Ray;

• Tensor/model parallelism and pipeline parallelism;

• Utilizing cloud-native or on-prem infrastructure (Kubernetes, Docker, Slurm) to support scalable, fault-tolerant, and resource-efficient AI workloads across multi-node GPU clusters;

• Using performance profiling and optimization to diagnose and improve end-to-end training performance by optimizing data pipelines (e.g., DALI, tf.data), minimizing communication overhead (e.g., NCCL, gRPC), and tuning hardware-specific kernels (e.g., CUDA, Triton);

• Systems-level programming and automation with Python, Bash, and C++ or Go;

• Automating deployment and orchestration of AI workloads and monitoring using Prometheus, Grafana, Weights & Biases.

Telecommuting work arrangement permitted one day a week. Four days in office required. Position does not require domestic or international travel.

Zoom Communications, Inc.

Salary Range or On Target Earnings:

Minimum: $209,000.00
Maximum: $275,400.00

In addition to the base salary and/or OTE listed, Zoom has a Total Direct Compensation philosophy that takes into consideration base salary, bonus, and equity value.

Note: Starting pay will be based on a number of factors and commensurate with qualifications & experience.

We also have a location-based compensation structure; there may be a different range for candidates in this and other locations.

Ways of Working
Our structured hybrid approach is centered around our offices and remote work environments. The work style of each role (Hybrid, Remote, or In-Person) is indicated in the job description/posting.

Benefits
As part of our award-winning workplace culture and commitment to delivering happiness, our benefits program offers a variety of perks, benefits, and options to help employees maintain their physical, mental, emotional, and financial health; support work-life balance; and contribute to their community in meaningful ways.

About Us
Zoomies help people stay connected so they can get more done together. We set out to build the best collaboration platform for the enterprise, and today help people communicate better with products like Zoom Contact Center, Zoom Phone, Zoom Events, Zoom Apps, Zoom Rooms, and Zoom Webinars.
We’re problem-solvers, working at a fast pace to design solutions with our customers and users in mind. Find room to grow with opportunities to stretch your skills and advance your career in a collaborative, growth-focused environment.

Our Commitment

At Zoom, we believe great work happens when people feel supported and empowered. We’re committed to fair hiring practices that ensure every candidate is evaluated based on skills, experience, and potential. If you require an accommodation during the hiring process, let us know—we’re here to support you at every step.

We welcome people of different backgrounds, experiences, abilities and perspectives including qualified applicants with arrest and conviction records and any qualified applicants requiring reasonable accommodations in accordance with the law.

If you need assistance navigating the interview process due to a medical disability, please submit an Accommodations Request Form and someone from our team will reach out soon. This form is solely for applicants who require an accommodation due to a qualifying medical disability. Non-accommodation-related requests, such as application follow-ups or technical issues, will not be addressed.

Think of this opportunity as a marathon, not a sprint! We're building a strong team at Zoom, and we're looking for talented individuals to join us for the long haul. No need to rush your application – take your time to ensure it's a good fit for your career goals. We continuously review applications, so submit yours whenever you're ready to take the next step.

Our interviews are supported by BrightHire, a tool that helps us create a consistent and thoughtful interview experience and may include recordings. Please refer to our candidate privacy statement for more information on how we use your data.

HQ

Zoom San Jose, California, USA Office

55 Almaden Blvd, San Jose, CA, United States, 95113
