Software quality and delivery speed are antagonistic by nature. Quality assurance usually takes a considerable share of the time and effort in the software development cycle, so it naturally tends to slow down delivery. Conversely, an increase in feature delivery speed often results in quality issues piling up.
Frequent delivery of quality software is crucial when building consumer-facing products such as HomeToGo, which heavily depend on a fast A/B testing cycle to measure customer response and make data-informed product decisions.
In this article, I share a few key techniques and tips that help ship new features and code changes into production multiple times a day while still ensuring a high level of quality.
It is definitely worth investing considerable effort into bug prevention and quality assurance before the release. However, some bugs and malfunctions will happen, sometimes critical ones, no matter how hard you try to prevent them upfront. Besides your own software, there are many other risk factors at play in production environments, like hardware crashes or third-party APIs that start throwing errors or unexpectedly go out of service.
Therefore, I find it useful to complement prevention measures with monitoring and reaction capabilities, to quickly detect and resolve the bugs and malfunctions that slip through prevention.
In the end, you build two main lines of defense. The prevention defense line stops the vast majority of bugs from reaching production. In case some bugs still get into production, the monitoring defense line helps to catch them quickly and keep the negative impact to a minimum.
Let’s iterate through the main components of the prevention defense line and how they help to catch the bugs before the release.
Automated test suites are the backbone of bug prevention. They provide a safety net against regressions and broken functionality as you change or add code in small steps. Fast feedback is essential: you make a step and want to know immediately whether everything still works as expected, before taking another step. Therefore, your test suite must be fast.
A good rule is to keep the running time of the whole test suite under 5–10 minutes, always preferring the shorter end.
Tests that run longer than 10 minutes introduce significant unproductive intervals while a developer waits for feedback. Such slow test suites discourage coding in small steps and, even worse, can cause frequent context switches: a developer starts working on something else to save time, but has to come back to the first feature if the tests don't pass.
Make sure your test suite has the form of a real pyramid: a big, fat bottom made of fast unit tests and, if applicable, only a small set of user interface tests at the top. Integration and user interface tests are much slower than unit tests, so limit their number to the necessary minimum to keep the running time of the whole suite below 5–10 minutes. There's a useful article about the test pyramid on Martin Fowler's Bliki.
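To make the bottom of the pyramid concrete, here is a minimal sketch of a fast unit test in Python's standard unittest framework. The apply_discount function is a hypothetical example, chosen only to show the kind of small, I/O-free test that keeps a suite fast:

```python
import unittest

# Hypothetical pricing helper, used only for illustration.
def apply_discount(price, percent):
    """Return the price reduced by the given discount percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (100 - percent) / 100, 2)

class ApplyDiscountTest(unittest.TestCase):
    # Fast, isolated unit tests: no I/O, no network, milliseconds to run.
    def test_regular_discount(self):
        self.assertEqual(apply_discount(200.0, 10), 180.0)

    def test_zero_discount_keeps_price(self):
        self.assertEqual(apply_discount(99.99, 0), 99.99)

    def test_invalid_discount_rejected(self):
        with self.assertRaises(ValueError):
            apply_discount(100.0, 150)
```

A suite built from thousands of tests like these still finishes in seconds to a few minutes, which is what makes the 5–10 minute budget achievable.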
Automated tests are part of the codebase and deserve the same treatment as your code. Keeping test suites well-factored and fast requires the same software engineering skills as writing the code, so why not let developers write all those tests in the first place?
In my experience, developers have proved to be in the best position to keep tests in sync with ongoing functionality changes and to understand the underlying dependencies between features.
Continuous Integration is another key practice to ensure both delivery speed and software quality. Have a central continuous integration server, which builds the complete software service and runs all its tests automatically after each commit to the main branch.
Remember, you only do Continuous Integration when you integrate your code into the main branch at least once a day. Preferably, you integrate more frequently.
Keeping the main build green must be the top priority of all developers. This way, you learn to keep the development version of your software stable and deployable at any moment, which unlocks a remarkable maturity milestone for your development team: you actually practice Continuous Delivery!
Use trunk-based development to really reap the benefits of Continuous Integration. In my experience, it works very well with in-house development teams. Trunk-based development makes the whole software delivery cycle much simpler and faster: there's just one code branch to maintain, test and deploy. In practice, ensuring the quality of small, frequent deployments directly from the main branch is much easier than that of bigger deployments made once or twice a week.
Feature toggles help to elevate quality assurance and testing to another level. A feature toggle is a development technique that allows you to easily switch a feature on or off via a configuration flag. It decouples deployment from release: a new feature is deployed into production disabled, so it is not exposed to users yet. After the deployment, a developer can use an internal dashboard to switch the feature on just for themselves and check whether it actually works as expected in the real production environment. If the developer finds a bug, they can still fix it before the feature is switched on for the users. Feature toggles can further serve as a tool for canary releases or A/B testing, where a feature is enabled for a subgroup of users to gather information about its impact on the systems and the users before releasing it to everyone.
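As a sketch of the mechanics (the class and flag names are hypothetical, not HomeToGo's actual implementation), a toggle store supporting per-user enablement and percentage rollouts might look like this in Python:

```python
import hashlib

class FeatureToggles:
    """In-memory toggle store; real systems usually back this with a config service."""

    def __init__(self):
        # name -> {"enabled": bool, "allow_users": set, "percent": int}
        self._flags = {}

    def set_flag(self, name, enabled=False, allow_users=(), percent=0):
        self._flags[name] = {
            "enabled": enabled,
            "allow_users": set(allow_users),
            "percent": percent,
        }

    def is_enabled(self, name, user_id=None):
        flag = self._flags.get(name)
        if flag is None:
            return False          # unknown features default to off
        if flag["enabled"]:
            return True           # globally switched on for everyone
        if user_id is None:
            return False
        if user_id in flag["allow_users"]:
            return True           # e.g. the developer testing in production
        # Deterministic percentage rollout for canary releases / A/B tests:
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
        return bucket < flag["percent"]
```

Enabling a flag only for a developer's own user id lets them verify the feature in production, and raising `percent` later turns the same flag into a canary release or A/B test without another deployment.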
At any given moment, HomeToGo production websites contain multiple disabled features that are in active development and not yet finished. Our users will see them only once they are finished, but developers and product managers can already test and tune the features in production much earlier. This way, even work-in-progress features are being continuously integrated into the whole service. That's Continuous Integration :)
In addition, an automated 1-click deployment process greatly increases speed and reduces human errors. As you deploy multiple times a day, you want to automate all the manual time-consuming error-prone steps as much as possible.
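As an illustration of the idea (the step names are hypothetical placeholders for real build and release commands), a 1-click deployment can be modeled as a single script that runs every step in order and stops at the first failure:

```python
class DeployError(Exception):
    pass

def run_pipeline(steps):
    """Run deployment steps in order; stop at the first failure.

    Each step is a (name, callable) pair; the callable returns True on success.
    The list of completed step names is returned, so a wrapper can decide
    whether a failed deployment needs the already-run steps rolled back.
    """
    completed = []
    for name, step in steps:
        if not step():
            raise DeployError(f"step failed: {name!r} after {completed}")
        completed.append(name)
    return completed

# Hypothetical steps; real ones would shell out to your build and CD tooling.
steps = [
    ("run tests", lambda: True),
    ("build artifact", lambda: True),
    ("upload artifact", lambda: True),
    ("switch traffic", lambda: True),
]
```

The point is not the specific steps but that every manual action gets encoded once, so the deploy button always performs the exact same sequence.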
The foundation of the second line of defense is building extensive real-time monitoring of all important tech & business metrics and an alert system which triggers real-time warnings if key metrics drop. If a malfunction occurs in production, especially a critical one, you want to detect it right away and quickly deal with it.
Let's face it, when you have a big, complex distributed system, it is practically impossible to thoroughly test all the possible workflow scenarios within a reasonable time on a frequent basis. That's when the monitoring & alert system comes to the rescue. You tune your automated test suite to cover the most important workflows and functionality, and rely on monitoring to catch the rest of the malfunctions in production in real time.
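A minimal sketch of such a check in Python (the metric names and thresholds are made up for illustration; a real setup would stream live values from a monitoring backend and page on-call developers):

```python
from dataclasses import dataclass

@dataclass
class Alert:
    metric: str
    value: float
    threshold: float

def check_metrics(current, thresholds):
    """Compare live metric values against minimum healthy thresholds.

    `current` and `thresholds` are dicts keyed by metric name; any metric
    that falls below its threshold produces an Alert.
    """
    alerts = []
    for metric, minimum in thresholds.items():
        value = current.get(metric)
        if value is not None and value < minimum:
            alerts.append(Alert(metric, value, minimum))
    return alerts

# Hypothetical business metrics: conversion rate and search success rate.
thresholds = {"conversion_rate": 0.02, "search_success_rate": 0.95}
current = {"conversion_rate": 0.013, "search_success_rate": 0.97}
```

With these example values, only `conversion_rate` trips an alert, which is exactly the signal you want right after a deployment.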
At HomeToGo, we continuously run tens of A/B tests in parallel. Each of our users is exposed to multiple new features at the same time. Running automated tests for each possible feature combination would literally take days and our delivery speed would drop dramatically.
The monitoring system helps to better manage risks in everyday development. You implement a feature change, your tests are green, you test the feature on your development and maybe staging environment, and you are ready to deploy it to production. You push the deploy button, then you check the relevant monitoring dashboards to see if the metrics are still healthy. In case any metrics turn red, you have a few options. If the situation looks serious, it is a no-brainer to hit the rollback button and analyse the reasons later. In case the malfunction is small and does not actually affect users, you might also choose to quickly prepare a fix and deploy it.
At HomeToGo, rollbacks are rare and the absolute majority of deployments go smoothly. We have a fully automated 1-click rollback, similar to the 1-click deployment.
If an incident occurs, it is important to perform a post-mortem and adjust your defense lines accordingly. Whenever possible, prefer tuning your prevention defense line to catch the bug already before the release. Adding a fast unit test against the regression is one of the best options.
Ensuring both fast delivery and a high level of quality is much easier when development teams accept end-to-end ownership of the whole software delivery cycle, including deployment, monitoring and maintenance. There are many resulting benefits.
Having a full overview and control of both defense lines, developers can make sure that important use cases are covered by an optimal combination of both defense lines. Wherever the monitoring defense line is thinner, they can build the prevention defense line thicker, and vice versa.
Putting developers in charge of deployment also sends a clear signal to the developers themselves that they are ultimately responsible for the quality of their work.
The moment a developer decides to push the deploy button is an important psychological moment. It is a conscious act of accepting responsibility. It makes the developer think about what could possibly go wrong, how they will notice if anything goes wrong, and what measures will minimize the risks, before they actually hit the deploy button. A developer should not be blamed for using the rollback functionality. However, lessons must be learned; rollbacks cannot become a "normal" practice.
When developers own deployment, monitoring and maintenance, they get very fast, direct feedback from the production environment and their users, so they become well aware of how their software behaves in production. They can use this feedback to increase the performance, stability and quality of their software, as well as to improve the judgement calls they make to manage risks. There are no handoffs, no sign-offs, no internal cycles to QA specialists and back to slow them down or delay feedback.
Developers build both defense lines against malfunctions and continuously tune them as they learn from incident post-mortems. This way, developers can leverage both defense lines to ensure quality while maintaining fast shipping speed. At the time of writing, the uptime of HomeToGo websites has been consistently at least 99.99% over the last 12 months, while we continuously ship features and improvements more than 50 times a day.
Thanks to Audrius Bugas, VP Technology at HomeToGo and Stephan Claus, Head of Data Analytics at HomeToGo, who provided valuable inputs and reviewed the article.
Follow us on Medium for more insights on the tech industry.