Errors happen. They always have and they always will, and the IT world is no exception. The increasing complexity of our systems and solutions makes them more vulnerable to human, machine, and integration mistakes. We can (and should!) do our best to avoid them by writing good tests, simulating malfunctions, doing regular code reviews, etc., but we should also be prepared for problems. And not only to fix them: it is even more important to learn from them. In this article, I want to share my thoughts about a process called Root Cause Analysis (RCA), one of the most useful techniques for learning your lesson after a problem.
Let’s be honest: no one likes talking about their mistakes. Also, we (as developers) prefer building stuff to sitting through long and arduous discussions and analyses. But trust me, fixing our processes is worth the effort. There is a popular analogy describing why you need RCAs. Imagine a huge company producing cars. One day an employee slips on the floor and breaks his leg. The immediate action is to take him to the hospital and put the leg in a cast. But is that the end of the story? Not yet. Let’s say the company changes nothing after that event. A few days later another person breaks their leg, and a week after that one of the machines stops working, delaying the whole production line. As you may expect, this is not a random series of events. The investigation discovers that people slipped on oil leaking from the machine, and that the missing oil caused the machine’s piston to seize. Finding the real root cause of the first accident could have prevented the second injury, the production delays, and huge financial problems for the company.
The same story can happen to your system. If you do not find the real root of the errors in your system but just patch them superficially, they will reoccur. These small problems may also be just the first symptoms of the real problem. Do not ignore them! The faster you react, the smaller the impact and the lower the repair costs will be. You should not only fix the problem but also take a while to understand why it happened and how to improve your processes to avoid it in the future.
OK, we know why it’s important, but the question is: how do you conduct a useful root cause analysis? Fortunately, many people have asked that question before and we can use their knowledge. There are many different approaches and frameworks, but I’ll describe how I approach these problems, based on the 6-step method defined by the American Society for Quality (ASQ) and the 5-why method developed by Sakichi Toyoda from Toyota. The two approaches are not mutually exclusive; in fact, they complement each other. ASQ describes the big picture and the steps to take, while the 5-why method focuses on the third step of ASQ’s process.
Let’s be honest: this system is not creative or breathtaking. Given some time, you could probably come up with something similar. But I like simple and well-defined checklists to follow. Thanks to them, you know you didn’t miss any important step and can focus on the areas that are more problematic or specific to your problem. (By the way, I strongly recommend the book “The Checklist Manifesto” to fully understand why checklists are so great.) So, what is the RCA framework defined by ASQ?
Let’s focus on the third step from the list above, “Finding the root cause”. It sounds easy, but how do you do it, you may ask. Here we can use the 5-why method mentioned previously. It’s also super easy to remember and perform. The only thing you have to do is ask the short but powerful question “why?” several times (five in the original method, but sometimes fewer or more are needed). Let’s see it in the previous example of the employee who broke their leg at the company.
Our event: One of the employees broke their leg while working in the factory.
As you can see, solving only the first issue would be ineffective: the oil would keep leaking and cause additional problems. The same applies to our systems. It’s important not only to put a band-aid on the wound but also to heal the real problem. And that’s why RCAs are so useful!
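If it helps to see the idea in code, here is a minimal Python sketch of a 5-why chain for the broken-leg story. The `FiveWhys` class and its method names are my own hypothetical illustration, not part of the ASQ or Toyota methods, and the answers are taken from the car factory analogy earlier in the article.

```python
from dataclasses import dataclass, field


@dataclass
class FiveWhys:
    """Hypothetical helper: a chain of "why?" answers for one event."""
    event: str
    answers: list[str] = field(default_factory=list)

    def ask_why(self, answer: str) -> "FiveWhys":
        """Record the answer to the next "why?" and return self for chaining."""
        self.answers.append(answer)
        return self

    def root_cause(self) -> str:
        """Treat the deepest answer so far as the root cause candidate."""
        return self.answers[-1] if self.answers else self.event

    def report(self) -> str:
        """Render the chain as an indented, human-readable outline."""
        lines = [f"Event: {self.event}"]
        for depth, answer in enumerate(self.answers, start=1):
            lines.append(f"{'  ' * depth}Why #{depth}? {answer}")
        return "\n".join(lines)


# Answers follow the car factory analogy from this article.
analysis = (
    FiveWhys("An employee broke their leg at work")
    .ask_why("They slipped on the floor")
    .ask_why("There was oil on the floor")
    .ask_why("A machine was leaking oil")
)

print(analysis.report())
print("Root cause candidate:", analysis.root_cause())
```

The chain stops wherever you stop asking, so the last answer is only a candidate. In a real analysis you would keep asking until the answer points at something you can actually change.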
The best way to learn is from mistakes… made by others ☺ Be a good citizen and share what you learned from your mistakes to help others avoid them. Post it on your company’s Slack, send an email with your thoughts, or write a Medium article. Who knows? Maybe you’ll become a superhero for someone and save their system.