April 3
🏢 In-office - Bay Area
• Architect, build, and maintain the infrastructure that ensures highly available GPU workloads for training.
• Troubleshoot and resolve issues across GPU resources, networking, OS, drivers, and cloud environments, and automate detection of and recovery from such issues.
• Design, build, and maintain the infrastructure that powers our data curation product.
• Partner with researchers and engineers to bring new features and research capabilities to our customers.
• Ensure that our infrastructure and systems are reliable, secure, and worthy of our customers' trust.
• Have meaningful experience leading and building production ML infrastructure and platforms that deliver on major product initiatives.
• Are proficient in Python and in the most commonly used infrastructure tools: Linux, Kubernetes, Terraform / Pulumi, etc.
• Have strong knowledge of hardening cloud-native workloads, especially on K8s.
• Have experience maintaining a high bar for design, correctness, and testing.
• Have a humble attitude, an eagerness to help your colleagues, and a desire to do whatever it takes to make the team succeed.
• Own problems end-to-end and are willing to pick up whatever knowledge you're missing to get the job done.
• Have experience running data-processing workloads on K8s (e.g., Spark on K8s).
• Role based in Redwood City, CA
• Relocation assistance
• Visa sponsorship