Xero

Online accounting software. Connects to all things business: accountants, bookkeepers, banks, enterprise & apps.

Accounting • SaaS • Banking • Invoicing • Design

1001 - 5000

💰 $300M Post-IPO Debt on 2018-09

Senior Site Reliability Engineer - Reliability Enablement

2 days ago

🏡 Remote – Anywhere in California

⏰ Full Time

🟠 Senior

👨🏻‍🔧 Site Reliability Engineer (SRE)

🛂 H1B Visa Sponsor

AWS

Azure

Cloud

Distributed Systems

Google Cloud Platform

Java

JavaScript

Python

Terraform

Description

• Investigating operational surprises and supporting teams in post-incident activities
• Conducting in-depth incident analysis and maximizing post-incident learning across the organization
• Completing short-term reliability consultancy and enablement engagements, such as SLO reviews and facilitating pre-mortems
• Improving on-call health, uplifting observability and addressing any operational hotspots
• Identifying, planning and leading implementation of reliability uplift work and initiatives
• Supporting delivery of strategic features and initiatives with reliability and distributed systems expertise
• Observing and improving rituals and practices relating to production operations, incident response and incident learning

Requirements

• Solid experience in logging, monitoring and observability of a highly distributed system
• Experience leading incident management, response and troubleshooting efforts, including critical, complex and high-severity incidents
• Experience with post-incident reviews, incident analysis and learning from incidents
• Experience working in a tech or product company with comparable scale and complexity
• Systems thinking: understanding how systems and components interact and how they respond to failure
• Proficiency in one or more object-oriented programming languages (C#, JavaScript, Java, Python, etc.) or experience with infrastructure-as-code (e.g. Terraform, CloudFormation)
• Experience working with cloud providers such as AWS, Azure or GCP
• Experience designing, developing and operating distributed systems and large-scale software systems
• Strong experience delivering technical initiatives in an operational, site reliability or platform engineering capacity
• The ability to solve engineering challenges outside of your own team, including using influence rather than authority to enact change
• Demonstrated experience in reliability concepts like capacity management, autoscaling, deployment and release safety, software strategies for reliability, fault tolerance and graceful failure
• Experience implementing customer-focused Service Level Objectives (SLOs)
• Experience using software engineering to solve operational and reliability challenges
• Understanding of human factors, safety science and resilience engineering
• Experience working in environments with advanced security and networks
