Democratizing the use of advanced driver-assistance systems (ADAS) to reduce accidents and make driving more enjoyable.
March 16
🏢 In-office - Bay Area
• Support the GPU-based AI/ML cluster infrastructure, focusing on systems automation, configuration management and deployment at scale
• Improve our cluster health monitoring and auto-recovery pipeline
• Work with users to debug application performance issues
• Work with hardware and storage vendors to tune and optimize our servers, TrueNAS storage and network
• Automate and deploy GPU clusters with Ansible
• Tune performance and provision operating systems on Linux
• Manage HPC clusters, workloads and applications
• Be available for 24x7 on-call
• Bachelor’s degree in computer science, electrical engineering or a related field
• Strong understanding of Linux fundamentals and performance optimization (Ubuntu)
• Advanced experience configuring and managing SLURM systems, starting from scratch
• Demonstrable knowledge of TCP/IP, Linux operating system internals, filesystems, disk/storage technologies and storage protocols
• Experience collaborating with network and data center teams on large-scale cluster builds
• Experience with configuration management software, systems monitoring and alerting (Prometheus, Grafana, Telegraf, Splunk, etc.) and/or administering HPC workload managers (SLURM)
• Experience with high-throughput, low-latency networks, GPU-based computing systems, and/or high-performance storage systems
• Experience with SLURM and storage management of distributed parallel file systems is a plus
• 3+ years of additional equivalent experience or evidence of exceptional ability related to the position
• This is a contract position
• Office snacks & reimbursable meals* when in-office
• Equal Opportunity for Diversity & Inclusion