Celestial AI

AI at the speed of light.

51 - 200 employees

Lead Reliability Engineer

May 18

🏢 In-office - Bay Area

💵 $175k - $200k / year

⏰ Full Time

🟠 Senior

👨🏻‍🔧 Site Reliability Engineer (SRE)

🛂 H1B Visa Sponsor

Description

• Develop and implement reliability strategies, standards, and processes customized for datacenter and high-performance computing applications, addressing unique challenges such as thermal management, power integrity, and workload variability
• Lead reliability testing and qualification activities tailored for datacenter and HPC environments, including stress testing, thermal cycling, and performance degradation analysis
• Collaborate closely with cross-functional teams, including hardware design, systems engineering, and datacenter operations, to integrate reliability considerations into product development and deployment processes
• Conduct thorough reliability analyses specific to datacenter and HPC applications, such as MTBF (Mean Time Between Failures) calculations, system-level fault tolerance assessments, and risk mitigation strategies
• Define reliability requirements and specifications for new products targeting datacenter and HPC markets, working closely with design teams to ensure compliance with industry standards and customer expectations
• Lead root cause analysis and corrective actions for reliability issues identified in datacenter and HPC environments, driving continuous improvement initiatives and implementing best practices
• Stay abreast of emerging technologies and industry trends in datacenter and HPC reliability engineering, leveraging this knowledge to enhance the reliability and performance of our systems

Requirements

• Bachelor's degree in Engineering or a related field; Master's or PhD preferred
• 15+ years of experience in reliability engineering, with a focus on datacenter and high-performance computing applications at the component, board, and system level
• Very strong understanding of the physics of failure, applied to drive material and process improvements for components
• Strong understanding of reliability principles, methodologies, and tools relevant to datacenter and HPC environments, such as reliability modeling, fault tolerance techniques, and performance optimization strategies
• Experience working with industry standards and guidelines specific to datacenter and HPC reliability, such as GR-468 and other relevant datacenter component qualification requirements
• Proven ability to lead cross-functional teams and drive reliability initiatives in fast-paced environments
• Excellent problem-solving skills and the ability to perform detailed root cause analysis in complex systems
• Effective communication skills and the ability to collaborate with internal teams and external stakeholders in the datacenter and HPC ecosystem

Benefits

• Health, vision, dental, and life insurance
• Collaborative and continuous-learning work environment
• Chance to work with smart and dedicated people engaged in developing the next-generation architecture for high-performance computing
