Fires are situations where an issue or bug has escalated to the point of major impact on the business. In startups, fires occur often and can have crucial ramifications for the business. As a result, firefighting is an easy way for PMs to quickly build equity with the business and become a trusted team member.
Firefighting comes down to 4 key skills:
Diagnosing
Mitigating
Communicating
Resolving
This article goes into firefighting and how to consistently diagnose, mitigate, communicate, and resolve issues.
Diagnosing
Diagnosing comes down to 2 key ideas: having the right people and environment in place to diagnose and following a structured path that has a high likelihood of resulting in a correct diagnosis.
Environment and People:
Diagnosis can be the most challenging step, as the anxiety and stress from the issue can get in the way of critical thinking. The more you can create an environment for thoughtful problem-solving, the better your odds of success and the faster you’ll find a solution. Set yourself and others up for success by using calm tones when discussing the issue, reminding and reassuring everyone that you’ll arrive at a solution, and, when applicable, using humor to defuse tension and put the issue in context.
Most importantly, ensure that the right stakeholders are involved at the right level. Make sure the people actively working on the issue have the ability to solve it, and that you’re in contact with the teams dealing with the backlash. If you don’t know who should be working on the issue, escalate the question to someone who can make that determination.
I’ve found that the single most valuable thing a PM brings to diagnosing is focus. Stress, incomplete information, the desire to know more, and uncertainty all make it hard to stay focused on doing the right things. As a PM, your job is to constantly align the team on identifying the best next step, pursuing that step, and reassessing the path. If the team is consistently doing this, you’ll effectively diagnose and resolve any issues you face.
Path to Diagnosis:
I’ve found that the most reliable path to a diagnosis is to create a timeline of events that can effectively explain why the issue occurred and why we’re seeing the current impact. Timeline building is a great way to ensure that you don’t miss any key variables in the diagnosis, and once you have that narrative defined, it’ll be much easier to identify the solution. The narrative should answer a few questions:
What is happening and where is it happening?
How does it happen?
Who is affected now and who might be affected later?
Why did it start happening and why didn’t we notice it until now?
When going through this process, you should be able to prove your narrative is right by checking your assumptions against existing cases and finding consistency across every case you see. If every impacted user or instance is consistent with your assumptions, then you have likely proven yourself right. Finally, ask yourself: if you were wrong, why would that be? This extra step of challenging your assumptions can be key to expanding your understanding of the problem and identifying its missing pieces.
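The consistency check described above can be sketched in a few lines of Python. Everything here is hypothetical and for illustration only: the `Case` fields, the sample data, and the narrative itself (that only version 2.3.0 in the EU is affected) are assumptions, not part of any real incident. The idea is simply to state the narrative as a prediction and list every case it fails to explain.

```python
from dataclasses import dataclass


@dataclass
class Case:
    """One user or instance examined during diagnosis (hypothetical fields)."""
    case_id: str
    impacted: bool      # did this case actually show the problem?
    app_version: str
    region: str


def narrative_predicts_impact(case: Case) -> bool:
    """Hypothetical narrative: only version 2.3.0 in the EU is affected."""
    return case.app_version == "2.3.0" and case.region == "eu"


def find_inconsistencies(cases: list[Case]) -> list[Case]:
    """Return every case where the narrative's prediction disagrees with reality."""
    return [c for c in cases if narrative_predicts_impact(c) != c.impacted]


# Made-up sample of cases gathered while building the timeline.
cases = [
    Case("u1", True, "2.3.0", "eu"),
    Case("u2", False, "2.2.9", "eu"),
    Case("u3", True, "2.3.0", "us"),   # impacted, but the narrative says it should not be
    Case("u4", False, "2.3.0", "us"),
]

mismatches = find_inconsistencies(cases)
for case in mismatches:
    print(f"Narrative does not explain case {case.case_id}: {case}")
```

An empty list of mismatches supports the narrative. Any mismatch, such as the `u3` case here, is a prompt to ask why you might be wrong before settling on a solution.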
With this information, you can work with technical and operations teams to correctly define the solution for the issue and the follow-through.
Mitigating
Mitigating is about reducing the breadth and depth of the fire’s impact. In some cases, you can skip this step altogether and just focus on solving the issue. Mitigation is valuable when fully solving the issue will take too long and stopping the bleeding can buy you time and focus. If you can’t get affected users back to normal, ask yourself whether you can prevent further users from being exposed, as this can often make a huge difference in the total damage of the issue.
Mitigation should not be done by default and should instead be a step decided upon after analysis. I’ve found the following rules to be a good way to decide whether it’s worth investing:
Do invest if it aligns with the ultimate solution.
Do invest if it significantly improves the quality of life for those impacted or reduces further damage.
Don’t invest if it significantly gets in the way of solving the issue.
Don’t invest if there’s any risk the mitigation compounds the issue.
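As a rough illustration, the four rules above can be written as a small decision function. This is a sketch under one assumption the rules do not state explicitly: that either “don’t” rule overrides the “do” rules. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MitigationOption:
    """Answers to the four questions above for one proposed mitigation."""
    aligns_with_final_fix: bool
    meaningfully_reduces_harm: bool
    blocks_resolution_work: bool
    could_compound_issue: bool


def should_invest(option: MitigationOption) -> bool:
    # Assumption in this sketch: the two "don't" rules act as vetoes.
    if option.blocks_resolution_work or option.could_compound_issue:
        return False
    # Otherwise at least one "do" rule has to hold.
    return option.aligns_with_final_fix or option.meaningfully_reduces_harm


# Hypothetical options a team might weigh during a fire.
disable_feature = MitigationOption(
    aligns_with_final_fix=False,
    meaningfully_reduces_harm=True,
    blocks_resolution_work=False,
    could_compound_issue=False,
)
risky_hotfix = MitigationOption(
    aligns_with_final_fix=True,
    meaningfully_reduces_harm=True,
    blocks_resolution_work=False,
    could_compound_issue=True,
)

print(should_invest(disable_feature))
print(should_invest(risky_hotfix))
```

In practice the answers are judgment calls rather than clean booleans, so the value of writing the rules this way is mostly in forcing the team to answer all four questions before acting.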
If you are pursuing mitigation, ensure that your stakeholders understand the nature of this action and don’t interpret it as actually solving the problem.
Communicating
Throughout the issue, you need to keep people up to speed on what’s happening so that they can tackle their end of the problem-solving. Great communication can often be a game-changer during fires. There are three key ideas to great communication:
Communicate what you know
Communicate what you’re doing
Communicate regularly
Note: When communicating, avoid misleading teams, especially if they’re going to relay information to customers, as misinformation can often make the problem worse. Do not guess at the cause or impact of an issue you do not yet understand, as this may create further confusion and stress. You can, however, share information that is not fully validated if you caveat it properly and are at least 80% confident that it’s right, as this can help teams navigate the issue.
Deliver your updates in structured messages calling out what’s happening, who’s impacted, what’s been done to address it, what’s next, and key things that need to be done (if applicable).
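One way to keep updates consistent is to fill in the same template every time. The sketch below is only an illustration: the `FireUpdate` class, its field names, and the sample incident are all hypothetical, and the same structure works just as well as a saved message template in a chat tool.

```python
from dataclasses import dataclass, field


@dataclass
class FireUpdate:
    """One structured status update, mirroring the parts listed above."""
    whats_happening: str
    whos_impacted: str
    done_so_far: list[str]
    whats_next: str
    action_items: list[str] = field(default_factory=list)  # optional asks of the audience

    def render(self) -> str:
        lines = [
            f"What's happening: {self.whats_happening}",
            f"Who's impacted: {self.whos_impacted}",
            "What's been done:",
            *[f"  - {item}" for item in self.done_so_far],
            f"What's next: {self.whats_next}",
        ]
        # Only include the last section when there is something to ask for.
        if self.action_items:
            lines.append("Action needed:")
            lines.extend(f"  - {item}" for item in self.action_items)
        return "\n".join(lines)


# Made-up incident used purely as an example.
update = FireUpdate(
    whats_happening="Checkout is failing for some card payments (cause not yet confirmed).",
    whos_impacted="Customers paying by card; other payment methods appear unaffected.",
    done_so_far=["Engineering is reviewing the most recent deploy."],
    whats_next="Next update in 30 minutes.",
    action_items=["Support: hold refunds until the next update."],
)
message = update.render()
print(message)
```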
You want to communicate what you reliably know as well as what the team’s next steps are. Specifically, you want to make sure the following things are understood:
The reach and depth of the issue.
Who else may be impacted in the future.
How you’re mitigating that impact.
What the team is doing next to mitigate or resolve the issue fully (and ideally what the perceived timeline of that is).
Ultimately, you’re trying to keep everyone up to speed as close to real time as possible, but resolving the issue should always be your first priority. For how often to share updates, aim for every 30-60 minutes during the early stages of an issue, as this will: 1. build trust across the group that you will keep everyone up to speed as needed (and, as a result, minimize distractions) and 2. empower people to properly follow up with customers/stakeholders.
Resolving
When it comes to resolving the issue, the key thing a PM can do is provide the business and customer context so that the right solution is pursued. In any resolution decision-making process options will be presented and these options will vary in their user impact and costs. By understanding your business and customers you can help teams make the right decision for how to solve the issue.
When looking for the ideal solution remember that:
The ideal solution to any issue will remove the problem entirely for the affected users.
It will usually do this without creating a secondary, more significant problem, although it may sometimes create a less significant one.
Oftentimes, fully resolving the issue will require a more significant solution, such as refactoring or revamping part of the system.
It’s important to delineate what the today part of the solution is and what the tomorrow part of the solution is so that everyone is aligned on how to proceed.
When pursuing a solution, do NOT attempt to release an untested, or minimally tested, solution, especially under heavy fatigue.
The single worst thing you can do in a fire situation is attempt a solution you haven’t fully vetted, as this can, and often will, make things worse. The wrong solution can easily turn a moderate problem into a huge one, and it can reset your problem-solving, furthering the damage and frustration.
As the technical solution is being implemented, work with ops teams on the customer follow-through for the solution. Oftentimes, the lengths you and the team go to in order to restore what was broken and make amends with customers can significantly improve the standing of your business and your relationship with those customers. Nail down the customer communication and any operational follow-through, whether it’s cleanup, reimbursement, or committing to preventative work.
Post Fire
Once an issue is fully resolved, invest in a retrospective on what went wrong and how you can prevent the same thing from happening again. Use these sessions to identify the work needed to minimize future issues, as well as the protocol to follow when they do occur.
Whenever possible, put monitoring or tracking in place for the issue area, as this will help you identify and resolve issues faster in the future. Strong monitoring is the best way to prevent issues from escalating, and every fire is an opportunity to improve how you’re tackling prevention.
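Monitoring does not have to start out sophisticated. As a minimal sketch, a check like the one below could run on a schedule against whatever counts your system already records. The 5% threshold and the 100-request minimum are placeholder assumptions, not recommendations, and in practice this logic usually lives in a dedicated monitoring tool rather than custom code.

```python
def error_rate(failures: int, total: int) -> float:
    """Fraction of failed requests; 0.0 when there is no traffic."""
    return failures / total if total else 0.0


def should_alert(
    failures: int,
    total: int,
    threshold: float = 0.05,   # placeholder: alert above a 5% error rate
    min_requests: int = 100,   # placeholder: ignore very small samples
) -> bool:
    """Alert when the error rate exceeds the threshold on a meaningful sample."""
    if total < min_requests:
        return False  # too little traffic to judge in this sketch
    return error_rate(failures, total) > threshold


print(should_alert(failures=2, total=1000))
print(should_alert(failures=80, total=1000))
```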
As mentioned before, follow-up work will likely be necessary to prevent the issue from happening again. Fires create momentum to improve the overall health and safety of your system, so use that momentum to tackle any system issues that may lead to further fires and to get the necessary projects onto the roadmap.
Overall, firefighting is a stressful part of PM’ing but it can also be an immensely powerful tool for building reliability as a team and in your system. Take each fire as an opportunity to learn more about your system, your team, and yourself and you’ll find that these learnings will help you become a better PM.