Defect Prioritization: Everything you ever wanted to know but were too afraid to admit that you needed to ask.
One of the biggest agile religious debates that seems to get people up in arms is backlog prioritization when planning has to balance known defects (especially in production) against new feature work. Let’s dig in and find a sensible approach here.
First of all, realize that it isn’t a fun situation for developers: fixing a defect older than two weeks, even if you wrote the code yourself, is like looking at someone else’s work. Especially if it is the cause of a problem, it hurts inside to look at it. You wish you could rewrite the whole thing because you’ve grown and learned and you’re a better developer than you were then! I get it. Naturally, it sucks that you can’t do that because you’d end up down a refactoring rabbit hole due to all the other code that depends on your code. So you end up feeling like you’re adding duct tape to a hole in the Hoover Dam. If you’re fixing a bug in code you’ve never seen before, it’s like your Product Owner told you to fix a hole in the Hoover Dam, but said it in a language you don’t speak, while handing you a box of duct tape and shoving you in the opposite direction of where you need to go. Support engineers – if they actually like what they do – are a very special breed and should be your best friends. Get them a Snickers bar and a thank you card sometime.
Second, the “triage” work for identifying the importance of bugs (if that happens at all), in which a manual QA tester writes up a ticket and picks a “Criticality” level, is a joke. Even if you had an elaborate definition for each level, who cares? A crash that impacts 70% of users is “critical” why? The defect is critical to… what? To whom? How many people? How much money?
Now, the lazy moral high ground of newly trained agilists is to insist you should’ve never let the bugs out at all. Six Sigma Quality baby!!! That’s a nice thought, and a very valuable standard if you are starting a completely new project on the latest and greatest stacks. You know, an iOS 9+ iPhone 6s or later mobile application from scratch. Then, I do advise you to build less than you think you should, think harder about whether or not each feature is actually important, and ensure that no defect gets into the App Store.
That isn’t most software and that isn’t the problem established software companies are grappling with while in the middle of an agile-at-scale transformation. If you are a Product Owner for 10% of an application older than three years, you definitely have defects and you definitely need a rule of thumb for what to do about them. It isn’t your fault, but it is your responsibility. Operating systems evolved underneath you. Hardware was replaced. Vendors changed. SDKs stopped getting updated. People changed. Deprecations occurred. Now you have a list of 1,000 decisions to make. In that scenario, the moral high ground “you shouldn’t have made any defects!” is lazy and unhelpful. That’s not the reality and it provides no answer for what to do once you already have defects in production. There are really four approaches to consider.
1 – All About the Money:
On the one hand, calculating the ROI of every User Story and then attempting to apply the same methods to your production or other leftover defects will require a pretty rigorous approach to finance, accounting, and statistics. A simple example – if a LinkedIn share crashes every 100th time I cross-post to Twitter, what is the ROI of fixing it? I’m not a paying customer, and neither is Twitter. Should I just leave the crash and hope people don’t complain too much? No, I don’t think anyone would suggest that. That said, there definitely is a statistical algorithm for whether or not that crash is likely to impact my decision to become a Premium Member in the future. But if the crash takes 11 hours to fix, test, and deploy while the data mining, analysis, etc. takes 60 hours to gain a certainty of 75% – why on earth would anyone not fix the defect and move on? Eric Ries popularized the saying “Metrics are people too” while Ash Maurya adapts this to say “Metrics are people first.” That is to say, if you have a crash count of 13, that is not a very actionable metric. If you have a percentage of engaged users experiencing a crash in version 1.9.3 – you have an actionable metric, but that metric represents real people who are annoyed in real life about that crash! Quantitative data needs to drive qualitative insight, not ever-more-complicated quantitative analysis.
On the other hand, there are important occasions when the money makes a difference. If you have customers with a Service Level Agreement, or paid SaaS users threatening to leave, or your actionable product metrics are moving in a scary direction on account of a defect or the perception of poor performance, the money should be the incentive you need to prioritize fixes over any new feature. Once a customer is gone they are incredibly unlikely to come back.
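To make the difference between a raw crash count and an actionable metric concrete, here is a minimal Python sketch. The session records, the field layout, and the function name crash_affected_share are all hypothetical, made up for illustration; a real product would pull this from its own analytics store.

```python
from collections import defaultdict

# Hypothetical session records: (user_id, app_version, crashed).
# The data and field layout are assumptions for illustration only.
sessions = [
    ("u1", "1.9.3", True),
    ("u1", "1.9.3", False),
    ("u2", "1.9.3", False),
    ("u3", "1.9.3", True),
    ("u4", "1.9.2", False),
    ("u5", "1.9.2", False),
]


def crash_affected_share(records):
    """Return {version: fraction of distinct users who hit at least one crash}."""
    users = defaultdict(set)
    crashed_users = defaultdict(set)
    for user_id, version, crashed in records:
        users[version].add(user_id)
        if crashed:
            crashed_users[version].add(user_id)
    return {version: len(crashed_users[version]) / len(users[version])
            for version in users}


# The raw count says little on its own...
raw_crash_count = sum(1 for _, _, crashed in sessions if crashed)

# ...while the per-version share says how many real people were affected.
share = crash_affected_share(sessions)

for version, fraction in sorted(share.items()):
    print(f"{version}: {fraction:.0%} of users experienced a crash")
```

The raw count lumps every version and every user together. The per-version share points at which release is hurting which people, which is the kind of number you can actually act on.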
2 – Actionable Product Metric:
Oops! We stopped talking about money and started talking about product metrics! There is a good reason for that – if you prioritize the development effort that improves Acquisition, Activation, Retention, Referral & Revenue (AARRR!) then you are by default increasing the money. ROI is not even the money question to solve, is it? If you have a fixed team contributing to the revenue of a product, ROI variability or Gross Margin variability is what you actually want to track – as long as the cost per month to maintain and improve the product is outpaced by the growth of revenue from paying customers, the ROI is there and the Margins are there and everyone is happy! The problem is that revenue, ROI, and Gross Margin are extremely lagging indicators of success. They are a good indication of the stability of the performance of a company over time, which gives investors the confidence they need to keep the money there, but multi-year lagging indicators are very poor metrics for the decision-making of the teams that are managing, maintaining, and improving the product. Growth of total users or active users can be a great indication of possible network effects long-term. Both of these long-term performance indicators are symptoms of competitive advantage. NOT the cause.
Now we have the cart before the horse though. If you have legacy production defects in your backlog and want to move to cohort-based split-testing, your defects are the NOISE in your SOUND. In the example crash above, if you knew cross-posting to Twitter was an important proxy metric for Referral, the possibility of a crash is also the possibility that I don’t try to engage a second time. If that defect existed before you began using cohort analysis and split-testing, your viral coefficient is already distorted. So if your agile release train is making an exciting and important stop at the actionable metrics station, make sure to prioritize any defects that could distort the reliability of your pirate metrics and future experimentation.
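As a rough illustration of that distortion, here is a small Python sketch. It assumes the common definition of the viral coefficient as invites sent per user multiplied by the conversion rate of those invites. The function names, the retry_rate parameter, and every number except the one-in-100 crash rate are assumptions for illustration, not measurements.

```python
def viral_coefficient(invites_per_user, conversion_rate):
    """Common definition: K = invites sent per user * invite conversion rate."""
    return invites_per_user * conversion_rate


def delivered_invites(attempted_per_user, crash_rate, retry_rate=0.0):
    """Invites that actually go out when some share attempts crash.

    retry_rate is the (assumed) fraction of crashed attempts the user retries
    successfully; 0.0 models users who never try a second time.
    """
    failed = attempted_per_user * crash_rate
    return attempted_per_user - failed + failed * retry_rate


# Assumed numbers: 4 share attempts per user, 20% of invites convert,
# and the cross-post crashes once every 100 attempts.
attempted = 4.0
conversion = 0.20
crash_rate = 0.01

k_without_defect = viral_coefficient(attempted, conversion)
k_measured = viral_coefficient(delivered_invites(attempted, crash_rate), conversion)

print(f"K without the defect: {k_without_defect:.3f}")
print(f"K you would measure:  {k_measured:.3f}")
```

This only captures the shares lost directly to the crash. If a crash also discourages future sharing, as suggested above, the real gap would be larger, and any split-test baseline taken while the defect is live inherits that bias.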
3 – Fuzzy-Weighted Economic Value Algorithm:
A core concept for how the backlog should be prioritized in a lean-agile environment is maximizing the flow of value-add and minimizing waste. If you are continuously deploying, the scope of feature releases can take a back seat to actually committing to the long-term awesomeness of the product. Of course, as Ash Maurya says, the product isn’t the product, the product is the sustainable business model – and that’s not a fixed-duration project, that’s a commitment to continuous improvement of your unique value proposition. So in good lean-agile, at any given Program Increment planning session, you do not need to be certain that your ROI calculations are perfect or that your developers will be fully allocated or that your schedule is on track. You have a fixed release cadence; only scope may vary. You have a large backlog; selecting the best possible thing to build matters. As we said above, if your revenue growth outpaces your cost growth, the ROI takes care of itself.
The Weighted Shortest Job First approach is very useful for exactly this. Because everything else is a lagging indicator of good decisions, the Relative Cost of Delay and the Relative Job Size are the most important factors in job sequencing. Let me reiterate. Sequence = prioritization. Under the assumption that small legacy defects require small individual effort while large value-add features take multiple sprints to roll out, you’ll likely always fix your bugs FIRST. Which is good. No one likes a crappy product, no matter how much “SOCIAL!!” you add to it. Burn your customers long enough and they will abandon you. Every software product is replaceable if you make the pain of use greater than the pain of switching to an alternative.
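Here is a minimal Python sketch of that sequencing rule, dividing Relative Cost of Delay by Relative Job Size and working on the highest ratio first. The backlog items and their scores are invented for illustration; in practice the relative scores come out of your own estimation sessions.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cost_of_delay: int  # relative score, higher means delay hurts more
    job_size: int       # relative score, higher means more effort

    @property
    def wsjf(self) -> float:
        # Weighted Shortest Job First: relative cost of delay / relative size.
        return self.cost_of_delay / self.job_size


# Hypothetical backlog with made-up relative scores.
backlog = [
    Job("New social feed", cost_of_delay=13, job_size=13),
    Job("Fix cross-post crash", cost_of_delay=8, job_size=2),
    Job("Refactor legacy sharing module", cost_of_delay=5, job_size=8),
]

# Sequence = prioritization: the highest WSJF score goes first.
sequence = sorted(backlog, key=lambda job: job.wsjf, reverse=True)

for job in sequence:
    print(f"{job.wsjf:5.2f}  {job.name}")
```

With these made-up scores the small defect fix lands ahead of the big feature even though the feature has the larger cost of delay, which is the "bugs FIRST" effect described above.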
Over time, three things will happen. The outstanding defects will reach the point where they require large-scale refactoring effort. Your stakeholders will (hopefully) get wise to the fact that a smaller improvement with more certain economic value-add is more likely to get prioritized. And at that point, you have flow, and hopefully rational planning and discussion will rule the day on deciding what to do next. On that note….
4 – Politically-Intelligent Fuzzy-Weighted Economic Value Algorithm:
If you aren’t entirely lean-agile, aka you are still mid-transformation, aka “the top” still works using their old plan-driven paradigm while somewhere down the line an agile-savvy person tries to smooth the flow of that work, you need to add something for portfolio-level politics that impact your program-level prioritization. In this case, while you may not share it too publicly, adding more scores for which stakeholder you are pleasing should be considered. You can then weight each of the relative scores, like proxy-voting for your stakeholders.
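One way to picture the proxy-voting idea is the Python sketch below. Everything in it is hypothetical: the stakeholder names, the weights, the relative scores, and the choice to divide the weighted value by job size are assumptions for illustration, not a prescribed formula.

```python
# Hypothetical weights: how much each "voter" counts in the blended value.
# Tune these to your own organization, and revisit them as trust grows.
weights = {"economic": 0.5, "vp_marketing": 0.3, "cto": 0.2}

# Hypothetical backlog: relative scores per voter plus a relative job size.
items = {
    "Fix cross-post crash": {
        "scores": {"economic": 8, "vp_marketing": 2, "cto": 8},
        "size": 2,
    },
    "New social feed": {
        "scores": {"economic": 5, "vp_marketing": 13, "cto": 2},
        "size": 13,
    },
}


def weighted_value(scores, voter_weights):
    """Blend the relative scores, one weighted 'proxy vote' per voter."""
    return sum(weight * scores.get(voter, 0)
               for voter, weight in voter_weights.items())


def priority(item):
    """Weighted value per unit of effort, in the spirit of WSJF."""
    return weighted_value(item["scores"], weights) / item["size"]


ranked = sorted(items, key=lambda name: priority(items[name]), reverse=True)

for name in ranked:
    print(f"{priority(items[name]):5.2f}  {name}")
```

Keeping the weights in one visible place makes the concession explicit: you can see exactly how much of the ranking is economics and how much is politics, and shift the balance as the organization matures.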
Yes, that means the old-school problems will be continued because you are giving the important people at the top some blank-check preferred stock when it comes to your backlog prioritization. The unfortunate alternative is that you live in silly denial that their perception matters and that backlog prioritization is as much a political question as an economic value question, and the people “at the top” or “in the business” continue to hate agile and second-guess every decision you make. Concessions to a powerful VP today help earn you the trust you need in order to move prioritization to a more rational approach later. Admit where you are, challenge it bit-by-bit, and work to improve it.
What does that look like in practice? Hopefully there is an IT leader at the top too in your large-scale silo-heavy organization. Hopefully that IT person or a Quality person up “at the top” can be one of the political variables in your weighting approach. Don’t pit the CEO against the CTO as a Product Owner, that just makes you look like a chump who can’t make difficult decisions yourself. Gain buy-in and provide enough visibility before planning sessions that no one gets blindsided by your decision to prioritize refactoring over that VP’s screams for “SOCIAL!”
To reiterate – THE BACKLOG IS A POLITICAL ARTIFACT.
Your solution is only partially economic or financial or social. This is not a democracy. Don’t go asking for votes. It really isn’t a democratic republic, either. And it definitely isn’t just a user story meritocracy. Slapping a relative business value tag on each item and sorting is begging for failure and distrust of your methods. It looks lazy and it is. You have to influence the right people to make the right decisions and that takes work. If you’re lucky, it’s close to a full-time job and you have job security now. Congratulations. But seriously go fix those bugs. They’re lame.