
Incident Management

Incidents. They’re inevitable. Servers crash, code deploys with bugs, and sometimes, despite our best efforts, things just break. As engineering leaders, we often focus on preventing incidents – and that’s good! But equally critical is how we lead through them. It's not just about restoring service; it’s about building trust, fostering learning, and demonstrating true technical leadership.

Too often, incident response is viewed as a purely technical exercise. Get the systems back up, then move on. But that’s a missed opportunity. A well-managed incident can be a powerful signal to the team, the organization, and even your customers. A poorly managed one? It can erode trust and create a culture of fear.

I recently witnessed a team turn a critical outage into a demonstration of resilience and trust. Instead of panicked scrambling, the incident commander calmly guided the investigation, empowered the team to explore solutions, and kept stakeholders informed with clear, concise updates. The result wasn't just a faster resolution, but a stronger, more confident team.

This isn't about becoming an on-call hero yourself. It's about setting the stage for your team to be heroes. Here’s how to lead through incidents, moving beyond simply extinguishing fires to building a more resilient and effective organization.

The Four Pillars of Leadership During an Incident

I've found that effective leadership during incidents boils down to four key pillars: Clarity, Empowerment, Support, and Learning.

1. Clarity: The North Star in the Chaos

When things are going wrong, ambiguity is the enemy. As a leader, you need to provide clear direction, even if all the information isn’t available. Think of clarity as the guiding light that helps the team navigate through the storm. This means:

  • Establishing a Single Source of Truth: Designate one place, such as a channel on your communication platform (Slack, Microsoft Teams, etc.) or a dedicated incident management system, where the incident status, action items, and communication are centralized. Avoid fragmenting information across multiple chats and emails.
  • Defining Roles & Responsibilities: Who's leading the investigation? Who’s communicating with stakeholders? Make these assignments clear immediately. Don't fall into the trap of everyone “helping” and no one taking ownership.
  • Prioritizing Communication: Stakeholders (customers, support teams, leadership) need to be informed, but not overwhelmed. Focus on what happened, the impact, and the estimated time to resolution. Avoid technical jargon unless absolutely necessary.
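
If it helps to make this concrete, here is a minimal sketch of a stakeholder update that always carries the same three facts plus the two named roles. The `StatusUpdate` class, its field names, and the output wording are all hypothetical, not part of any particular incident tool; adapt them to whatever your single source of truth happens to be.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StatusUpdate:
    """One stakeholder-facing update. All names here are illustrative."""
    what_happened: str
    impact: str
    eta: Optional[str] = None          # estimated time to resolution, if known
    commander: str = "unassigned"      # who is leading the investigation
    communicator: str = "unassigned"   # who is talking to stakeholders

    def render(self) -> str:
        # Say so plainly when the ETA is not known yet instead of guessing.
        eta = self.eta if self.eta else "unknown, next update to follow"
        return "\n".join([
            f"What happened: {self.what_happened}",
            f"Impact: {self.impact}",
            f"Estimated time to resolution: {eta}",
            f"Incident commander: {self.commander}",
            f"Stakeholder communication: {self.communicator}",
        ])


# Example values are made up purely for illustration.
example = StatusUpdate(
    what_happened="Checkout requests are failing",
    impact="Some customers cannot complete purchases",
    commander="alex",
    communicator="sam",
)
print(example.render())
```

A fixed shape like this keeps updates short and jargon-free, and the "unassigned" defaults make a missing owner visible at a glance.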

2. Empowerment: Trust Your Team to Solve the Problem

This is where a lot of leaders fall down. The urge to jump in and “fix it” is strong, but resist it! Your team is capable. Your job is to create the space for them to do their best work. Remember, empowerment isn't about avoiding involvement, but about shifting from directing to facilitating.

  • Delegating Authority: Give the incident commander (and their team) the authority to make decisions quickly. Don’t require endless approvals for every action.
  • Removing Roadblocks: Are there permissions issues? Do they need access to specific logs? Clear the path for them to investigate and resolve the issue.
  • Resisting the Urge to Micro-Manage: Trust their expertise. Ask clarifying questions, but avoid telling them how to solve the problem.

I remember one particularly stressful incident where a critical database connection failed. My initial instinct was to start digging through logs myself. But I stopped myself, reminded myself of the team’s expertise, and instead asked, “What information do you need from me to diagnose this?” The team quickly identified the issue and resolved it – and they felt empowered by the trust I placed in them.

3. Support: A Safe Space to Operate

Incidents are stressful. People are under pressure, and mistakes happen. Create a supportive environment where team members feel comfortable speaking up, asking questions, voicing concerns, and acknowledging errors.

  • Promote Blameless Postmortems: This is crucial. The goal isn't to assign blame, but to understand why the incident occurred and prevent it from happening again. Focus on systemic issues, not individual mistakes.
  • Encourage Psychological Safety: Team members should feel safe admitting when they’re stuck or need help. A culture of fear will stifle communication and exacerbate the problem.
  • Acknowledge the Effort: Even if the outcome isn't ideal, acknowledge the team’s hard work and dedication. A simple “thank you” can go a long way.

4. Learning: Turning Crisis into Opportunity

The real value of an incident lies in the lessons learned. Don't just fix the problem and move on; take the time to analyze what went wrong and how to prevent it from happening again.

  • Conduct a Thorough Postmortem: This should be a detailed analysis of the incident, including a timeline of events, root cause analysis, and action items. Project management or documentation tools (Hygger, Confluence, etc.) can be useful here.
  • Identify Systemic Issues: Don’t focus solely on the immediate cause; look for underlying patterns and vulnerabilities. Is there a lack of monitoring? Are there gaps in our testing?
  • Implement Corrective Actions: This is where the rubber meets the road. Prioritize action items and assign owners. Track progress and ensure that the lessons learned are actually implemented.
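
One way to keep corrective actions from quietly slipping is to give the postmortem a little structure. The sketch below is an assumption about how you might model it, not a prescribed format: `Postmortem`, `ActionItem`, and their methods are hypothetical names, and most teams would keep this in their documentation tool rather than in code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActionItem:
    """A corrective action from the postmortem (illustrative structure)."""
    description: str
    owner: str          # empty string means nobody owns it yet
    done: bool = False


@dataclass
class Postmortem:
    """Timeline, root cause, and action items for one incident."""
    title: str
    timeline: List[str] = field(default_factory=list)
    root_cause: str = ""
    action_items: List[ActionItem] = field(default_factory=list)

    def unowned_items(self) -> List[ActionItem]:
        # Items without an owner tend never to get done; surface them first.
        return [item for item in self.action_items if not item.owner.strip()]

    def open_items(self) -> List[ActionItem]:
        # Everything not yet completed, for tracking progress over time.
        return [item for item in self.action_items if not item.done]


# Example content is made up purely for illustration.
pm = Postmortem(title="Database connection failure")
pm.timeline.append("Alert fired; incident commander assigned")
pm.root_cause = "To be determined in the review"
pm.action_items.append(ActionItem("Add monitoring for connection errors", owner="dana"))
pm.action_items.append(ActionItem("Close the gap in integration tests", owner=""))

for item in pm.unowned_items():
    print(f"Needs an owner: {item.description}")
```

Reviewing `unowned_items()` at the end of the postmortem and `open_items()` in a regular follow-up is one simple way to check that lessons learned are actually implemented.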

Leading through incidents isn't about being a hero; it's about empowering your team to be heroes. By focusing on clarity, empowerment, support, and learning, you can turn crisis into opportunity and build a more resilient and effective organization.

Genuine care underpins each of these pillars. It means communicating clearly (Clarity), trusting your team’s judgment (Empowerment), providing psychological safety (Support), and investing in improvement (Learning). It's about showing your team that you value them as individuals, not just as problem-solvers.