Lead Reliability Engineer

May 18

🏢 In-office - Bay Area

Apply Now
Logo of Celestial AI

Celestial AI

AI at the speed of light.

51 - 200

Description

• Develop and implement reliability strategies, standards, and processes customized for datacenter and high-performance computing applications, addressing unique challenges such as thermal management, power integrity, and workload variability • Lead reliability testing and qualification activities tailored for datacenter and HPC environments, including stress testing, thermal cycling, and performance degradation analysis • Collaborate closely with cross-functional teams, including hardware design, systems engineering, and datacenter operations, to integrate reliability considerations into product development and deployment processes • Conduct thorough reliability analyses specific to datacenter and HPC applications, such as MTBF (Mean Time Between Failures) calculations, system-level fault tolerance assessments, and risk mitigation strategies • Define reliability requirements and specifications for new products targeting datacenter and HPC markets, working closely with design teams to ensure compliance with industry standards and customer expectations • Lead root cause analysis and corrective actions for reliability issues identified in datacenter and HPC environments, driving continuous improvement initiatives and implementing best practices • Stay abreast of emerging technologies and industry trends in datacenter and HPC reliability engineering, leveraging this knowledge to enhance the reliability and performance of our systems

Requirements

• Bachelor's degree in Engineering or related field; Master's or PhD degree preferred • 15+ years of experience in reliability engineering, with a focus on datacenter and high-performance computing applications at component, board and system level • Very strong understanding on physics of failures to drive material and process improvements for components • Strong understanding of reliability principles, methodologies, and tools relevant to datacenter and HPC environments, such as reliability modeling, fault tolerance techniques, and performance optimization strategies • Experience working with industry standards and guidelines specific to datacenter and HPC reliability, such as GR-468 and other relevant datacenter component qualification requirements • Proven ability to lead cross-functional teams and drive reliability initiatives in fast-paced environments • Excellent problem-solving skills and the ability to perform detailed root cause analysis in complex systems • Effective communication skills and the ability to collaborate with internal teams and external stakeholders in the datacenter and HPC ecosystem

Benefits

• health, vision, dental and life insurance • collaborative and continuous learning work environment • chance to work with smart and dedicated people engaged in developing the next generation architecture for high performance computing

Apply Now
Built by Lior Neu-ner. I'd love to hear your feedback — Get in touch via DM or lior@techjobscalifornia.com
Jobs by Title
Account Executive jobsAccounting Manager jobsAccountant jobsAdministration jobsAdministrative Assistant jobsAnalytics Engineer jobsAndroid Engineer jobsAttorney jobsBackend Engineer jobsBusiness Development Rep jobsBusiness Operations & Strategy jobsChief of Staff jobsCivil Engineer jobsCloud Engineer jobsCommunity Manager jobsCompliance jobsContent Marketing Manager jobsContent Manager jobsContent Writer jobsCopywriter jobsCustomer Success jobsCustomer Support jobsData Analyst jobsDatabase Administrator jobsData Engineer jobsData Entry jobsData Scientist jobsDevOps jobsEcommerce jobsElectrical Engineer jobsEmail Marketing Manager jobsEngineering Manager jobsExecutive Assistant jobsController jobsFinancial Planning and Analysis jobsFull-stack Engineer jobsFrontend Engineer jobsGame Engineer jobsGeneral Counsel jobsGraphics Designer jobsGrowth Marketing jobsHuman Resources jobsiOS Engineer jobsInfluencer Marketing jobsInfrastructure Engineer jobsIT Support jobsMachine Learning Engineer jobsMarketing jobsMedical Writer jobsMechanical Engineer jobsOperations jobsParalegal jobsPerformance Marketing jobsProduct Analyst jobsProduct Designer jobsProduct Manager jobsProject Manager jobsProgram Manager jobsProduct Marketing jobsQA Engineer jobsSDET jobsRecruitment jobsRisk jobsSales jobsSales Development Rep jobsSales Engineer jobsSalesforce Administrator jobsSalesforce Analyst jobsSalesforce Consultant jobsSalesforce Developer jobsScrum Master / Agile Coach jobsSecurity Engineer jobsSEO Marketing jobsSite Reliability Engineer jobsSocial Media Manager jobsSoftware Engineer jobsSolutions Engineer jobsSupport Engineer jobsSystem Administrator jobsSystems Engineer jobsTax jobsTechnical Account Manager jobsTechnical Writer jobsTechnical Product Manager jobsUser Researcher jobs