Whenever you release code into production and shortly afterward realize that you’ve also injected a number of defects, you’ll more than likely want to take a three-phased approach to resolving the trouble.
- The first phase is emergency triage: all the initial tasks you need to accomplish to ensure your customers are still able to use your software.
- The second phase is about understanding the scope and severity of the defects that were injected and putting tasks in place for your team so that these defects are resolved before you attempt another deployment into production.
- The third phase is about setting in motion new processes and control measures to prevent a similar event from happening in the future.
At Lighthouse Technologies, we’ve seen many clients struggle with code quality and with injecting defects into their production environment. We’ve helped them successfully triage their production issues while also putting in place control measures just like those we’re about to cover in this post.
So as you continue reading, know that these tactics and the overall framework we’re laying out are time-tested and client-approved. That’s not to say that what we would recommend for you is exactly the same thing, so if you’re facing issues with defects in production or overall code quality, please give us a call so we can go over your unique situation and ensure that you are successful. Taking action on your own carries risk, because you won’t have our experts advising your team as you implement what we’re suggesting here today.
That said, let’s go over our three-phased approach to triaging this hypothetical, yet all too common, scenario. Then, if you have any questions, please feel free to reach out to us and we’ll be very happy to address your situation specifically.
Step 1: Triage
The very first thing to decide when you realize that your deployment either has defects in it or has failed outright is whether you’re going to revert to the version that was in use prior to the deployment, or live with the changes and fix the issues in either your quality assurance environment or in production directly. More often than not, this is going to require notifying leadership or involving them directly, so that they are at the very least informed of what’s going on and okay with your approach and recommendations.
Once you’ve gotten leadership buy-in, or at least communicated your course of action to leadership, your next step depends on that decision. If you revert, you roll back to the previous version, retest that version in production, and then move on to phase two. If you instead decide to fix what has been injected into production, in either the quality assurance environment or production directly, you’re more often than not going to want to find the developers who coded the defective features and call them into an emergency bridge to begin working on the solutions in real time.
Pro Tip: be sure to let the developers know ahead of time that their code is about to go into production, and provide them a link to the deployment call. That way, should there be an issue while deploying the code, they can hop on the call and address it head-on in the least amount of time possible.
With your developers actively working to fix the issues in either the quality assurance environment or production, you should then gain an understanding of the level of effort involved in fixing the problems moving forward. To do this, you’re going to want a quick and dirty impact assessment completed so that your leadership team and customer support team can communicate effectively with the impacted clients, while not over-communicating to unaffected clients and causing undue reputational damage.
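As a rough illustration of what such a quick impact assessment might look like in code, here is a minimal Python sketch. It assumes you can list each defect by the feature it affects and that you know which features each client uses. The `Defect` class, the `assess_impact` function, the severity scale (1 = most severe), and the sample client names are all hypothetical and only for illustration, not a prescribed tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Defect:
    """A defect found after deployment (hypothetical structure)."""
    feature: str   # feature or area of the application the defect affects
    severity: int  # assumed scale: 1 = most severe, larger = less severe


def assess_impact(defects, client_features):
    """Map each impacted client to the worst severity they are exposed to.

    defects: iterable of Defect
    client_features: dict of client name -> set of features that client uses
    Clients who use none of the defective features are left out, so support
    can avoid contacting them unnecessarily.
    """
    # Worst (lowest-numbered) severity seen for each defective feature.
    worst_by_feature = {}
    for defect in defects:
        current = worst_by_feature.get(defect.feature)
        if current is None or defect.severity < current:
            worst_by_feature[defect.feature] = defect.severity

    impacted = {}
    for client, features in client_features.items():
        hits = [worst_by_feature[f] for f in features if f in worst_by_feature]
        if hits:
            impacted[client] = min(hits)
    return impacted


if __name__ == "__main__":
    defects = [Defect("checkout", 1), Defect("reports", 3)]
    clients = {
        "acme": {"checkout", "search"},
        "globex": {"reports"},
        "initech": {"search"},
    }
    print(assess_impact(defects, clients))
```

In this sketch, only clients that use a defective feature appear in the result, which gives leadership and customer support a short list of who to contact first and lets them leave everyone else alone.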
Presuming you have your impact assessment completed and the developers are actively working on solutions in an open bridge call, your follow-on actions should focus on two things: ensuring that leadership and clients are communicated with effectively, and ensuring that the fixes put in place actually resolve the problem without causing additional issues.
Pro Tip: many times, we see clients fail to update technical specs or architectural diagrams at this point, and in doing so they lose track of how their application is built. So be sure to follow up after your triage calls with updated documentation.
For step 3, it comes down to ensuring that we learn from our mistakes and prevent them from happening in the future. It’s important to note here that you, as the leader or facilitator, need to view all of this from the perspective of wanting to fix the process, not punish the people involved. Everyone has good intentions. It’s not that we want these things to happen or that we allow them to happen; it’s that we don’t have a trigger or mechanism in place that would prevent them from happening. This is a core belief for us at Lighthouse Technologies: a person, or people, are not the problem. It’s the process they’re put into that is causing these problems. If we put a person in a high-quality process, they’re far more likely to consistently produce high-quality results.
So here are some things that we recommend you put in place to ensure that your production deployments are always successful and your process is high-quality.
- Test Automation: If you haven’t already done so, you need to not just have a program that builds out automated test cases, but also understand why you choose to automate test cases for certain aspects or areas of your application. Don’t apply a broad brush to test automation. Understand the investment involved, and maximize the return on that investment by automating test cases for portions of the application that are typically weak or that have a large number of dependencies.
- Test Sooner: involve your testers earlier in the process. Have them do quality checks on user stories that are still in your dev team’s runway and haven’t made it into the Sprint yet. Also have your testers work with your business analysts and Scrum Masters to ensure that they are thinking of alternatives when writing acceptance criteria and business rules, and when considering how the individual user stories contribute to the overall feature and whether something might be missing. High-quality user stories result in fewer bugs and less rework.
- Always Test: our last hot tip also focuses on work that’s done much earlier in the process. Your testers should have access to a production-like environment where they can continuously test the application for usability and performance. This test environment is typically another big investment, but the payoff is an application that remains scalable and reliable.
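To make the Test Automation tip above a little more concrete, here is a minimal Python sketch of one way you might rank areas of an application as automation candidates. It assumes you track, per area, a count of past defects (a stand-in for "typically weak") and a count of dependencies. The function name, the field names, the weights, and the sample areas are hypothetical, and the scoring formula is only an illustrative assumption, not an established standard.

```python
def rank_automation_candidates(areas, defect_weight=2.0, dependency_weight=1.0):
    """Return area names ordered from strongest to weakest automation candidate.

    areas: list of dicts with hypothetical keys
        "name"          - area of the application
        "past_defects"  - number of defects historically found in that area
        "dependencies"  - number of other components the area depends on
    The weights are illustrative assumptions; tune them to your own data.
    """
    scored = []
    for area in areas:
        score = (defect_weight * area["past_defects"]
                 + dependency_weight * area["dependencies"])
        scored.append((score, area["name"]))

    # Highest score first; ties are broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


if __name__ == "__main__":
    areas = [
        {"name": "billing", "past_defects": 5, "dependencies": 8},
        {"name": "profile", "past_defects": 1, "dependencies": 2},
        {"name": "search", "past_defects": 3, "dependencies": 10},
    ]
    print(rank_automation_candidates(areas))
```

The point of a sketch like this is not the exact formula. It is that the decision about where to spend automation effort is written down and can be revisited, rather than applied with a broad brush.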
This is our quick and dirty look into how to triage and fix some of the most common quality assurance issues we see. If you’ve made it this far, you’re probably hoping to get additional details. To put it frankly, though, we could write entire books on this subject, and people have made entire careers out of ensuring software is of high quality. In fact, we’ve made it our entire careers!
This is why if you have follow-up questions, or you would like to know something more specific, all it takes is for you to reach out to us to get the answers that you need.
Best of luck in your next production deployment, and if you have any questions, please feel free to let us know.