Achieve higher availability with leaner teams by alleviating the pressure of incident management on your developers

Imagine it's 3 a.m., and your phone jolts you from deep sleep with an outage alert. You fumble to decipher the urgency: is it a minor hiccup or a full-blown catastrophe? You groggily search for the runbook and deliberate over who else to wake up.

While this scenario may sound familiar, it is neither a sustainable nor an optimal approach to incident response. With stripped-back teams and fewer hands on deck, relying on just one person in the dead of night creates stress and adds to an already-too-high developer cognitive load. Human judgement isn't perfect at the best of times, and expecting a handful of people to shoulder all the blame for any mishap is unrealistic.

This panel of engineering leaders shares how they reduce the burden of incident response for their teams. They advocate for a culture of shared responsibility across the board, offering practical strategies to educate the business about engineering practices during the chaos of an outage. Emphasizing transparency and teamwork to bolster system reliability not only eases the load on developers but also puts an end to the blame game. The result? Improved predictability and system reliability, and a boost in confidence for individual developers when the inevitable incident comes knocking.

Key takeaways

  • Learn how to instil a culture of incident awareness within the wider org
  • Understand what a well-documented, established incident response process looks like
  • Learn how to adopt DevOps practices that actually break down silos
  • Learn how to minimize the impact of incidents on productivity and performance