Engineering Manager, AI/ML and Data Infrastructure

August 1

🏢 In-office - Bay Area

Chan Zuckerberg Initiative

Building a more inclusive, just and healthy future for everyone.

201 - 500

Description

• Drive our MLOps processes and Systems Infrastructure Engineering efforts to keep our GPU cloud computing systems highly utilized and stable, and proactively guide the team in implementing the instrumentation and observability tooling integral to our AI platform.
• Own the on-call efforts for our GPU cloud computing systems, building out MLOps and Systems Infrastructure Engineering alerting and monitoring for our leading-edge, Kubernetes-based AI platform. This includes troubleshooting problems on the GPU platform infrastructure and with jobs running on the cluster and computing systems.
• Build out the MLOps and Systems Infrastructure Engineering team, growing it to support the large-scale capacity systems and AI training efforts we will be undertaking.
• Take responsibility for a variety of AI/ML development infrastructure, instrumentation, and telemetry projects that empower our team to support users across the AI/ML lifecycle. You will play a key role in simplifying and optimizing the systems and processes integral to our GPU cloud cluster operations, in a hybrid operations model where MLOps meets SRE.
• Mentor and manage your team so that each member can fulfill their role to the best of their abilities, provide skill and career coaching to help team members keep growing along their own career and life paths, and keep the team engaged in meaningful and interesting projects in service of our north star philanthropic mission.

Requirements

• Hands-on AI/ML model training platform operations experience in an environment with difficult data and systems platform challenges
• MLOps experience working with medium to large-scale GPU clusters in Kubernetes, HPC environments, or large-scale cloud-based ML deployments (Kubernetes preferred)
• BS, MS, or PhD degree in Computer Science or a related technical discipline, or equivalent experience
• 2+ years of experience managing MLOps teams
• 7+ years of relevant coding and systems experience
• 7+ years of systems architecture and design experience, with a broad range of experience across Data, AI/ML, Core Infrastructure, and Security Engineering
• Strong understanding of scaling containerized applications on Kubernetes or Mesos, including a solid understanding of AI/ML training with containers using secure AMIs and continuous deployment systems that integrate with Kubernetes or Mesos (Kubernetes preferred)
• Proficiency with Amazon Web Services (AWS), Google Cloud Platform (GCP), or Microsoft Azure, and experience with on-prem and colocation hosting environments
• Solid coding ability with a systems language such as Rust, C/C++, C#, Go, Java, or Scala
• Extensive experience with a scripting language such as Python, PHP, or Ruby (Python preferred)
• Working knowledge of NVIDIA CUDA and custom AI/ML libraries
• Knowledge of Linux systems optimization and administration
• Understanding of Data Engineering, Data Governance, Data Infrastructure, and AI/ML execution platforms
• PyTorch, Keras, or TensorFlow experience is a strong plus

Benefits

• CZI provides a generous 100% match on employee 401(k) contributions to support planning for the future.
• Annual funding for employees that can be used most meaningfully for them and their families, such as housing, student loan repayment, childcare, commuter costs, or other life needs.
• CZI Life of Service Gifts are awarded to employees to “live the mission” and support the causes closest to them.
• Paid time off to volunteer at an organization of your choice.
• Funding for select family-forming benefits.
• Relocation support for employees who need assistance moving to the Bay Area.
• And more!
