Photoreal 3D. For everyone.
3D Photography • Computer Vision • Machine Learning • Augmented Reality • 3D Computer Graphics
August 17
🏢 In-office - Bay Area
• Collaborate with researchers and engineers to specify the availability, performance, correctness, and efficiency requirements of the current and future versions of our GPU infrastructure.
• Work with multiple GPU cloud providers to scale up, scale down, maintain, and monitor our thousands of GPUs across many clusters.
• Design and implement solutions that keep our infrastructure scalable as demand rapidly increases.
• Implement and manage monitoring systems to proactively identify issues and anomalies in our production environment.
• Implement fault-tolerant and resilient design patterns to minimize service disruptions.
• Build and maintain automation tools to streamline repetitive tasks and improve system reliability.
• Participate in an on-call rotation alongside other infrastructure developers to respond to critical incidents and ensure 24/7 system availability.
• Develop and maintain service level objectives (SLOs) and service level indicators (SLIs) to measure and ensure system reliability.

• 10+ years of proven work experience as a reliability engineer, production engineer, infrastructure software engineer, or in a similar role at a fast-paced, rapidly scaling company.
• Strong proficiency in GPU cloud infrastructure, including the underlying concepts of scheduling, scaling, cloud storage, networking, and security.
• Proficiency in programming/scripting languages.
• Experience with containerization technologies and container orchestration platforms such as Kubernetes or equivalent.
• Knowledge of IaC tools such as Terraform, CloudFormation, or equivalent.
• Excellent problem-solving and troubleshooting skills.
• Strong communication and collaboration skills.
• Experience with observability tools such as Datadog, Prometheus, Grafana, Splunk, the ELK stack, or similar.
• Knowledge of security best practices in cloud environments.
• Experience as an SRE within the AI/ML space is strongly preferred.