QA Instability Implies Production Instability

Many companies that have trouble delivering software on time exhibit a common pathology. Developers working on the next release are frequently interrupted by production support issues with the current release. These interruptions never appear in project schedules, but they can take up half of the developers' hours. Once you include the cost of task-switching, less than half of their available time goes to new feature work.

Invariably, when I see a lot of developer effort going into production support, I also find an unreliable QA environment. It is unreliable in two senses: it is frequently not available for testing, and the system's behavior in QA is not a good predictor of its behavior in production.

QA environments are often configured differently than production. Not just in the usual sense of consuming a QA version of external services, but also in more basic ways. Server topology may be different. Memory settings, capacity ratios, and the presence of network components can all vary. QA often has a much simpler traffic routing scheme than production, particularly when a CDN is involved.
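One way to make these differences visible is to diff the two environments' settings mechanically instead of discovering them during an incident. The following is a minimal sketch, not a definitive tool: the function name `config_drift` and every setting name and value in `PROD` and `QA` are hypothetical, chosen only to mirror the kinds of differences described above.

```python
from typing import Any, Dict, Tuple

ABSENT = "<absent>"  # placeholder shown when a setting exists in only one environment


def config_drift(prod: Dict[str, Any], qa: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {setting: (production_value, qa_value)} for every setting that differs."""
    drift = {}
    for key in sorted(prod.keys() | qa.keys()):
        prod_value = prod.get(key, ABSENT)
        qa_value = qa.get(key, ABSENT)
        if prod_value != qa_value:
            drift[key] = (prod_value, qa_value)
    return drift


# Hypothetical settings, for illustration only.
PROD = {
    "app_servers": 8,
    "db_servers": 2,
    "heap_mb": 8192,
    "cdn_enabled": True,
    "load_balancer": "hardware",
}
QA = {
    "app_servers": 1,
    "db_servers": 1,
    "heap_mb": 2048,
    "cdn_enabled": False,
}

for setting, (prod_value, qa_value) in config_drift(PROD, QA).items():
    print(f"{setting}: production={prod_value!r} qa={qa_value!r}")
```

Some drift is deliberate, such as a smaller server count. The point of a report like this is that the differences are known and reviewed rather than accidental.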

The other major source of QA unavailability has to do with data refreshes. QA environments either run with a minuscule, curated test data set, or they use some form of backward migration from production data. Each backward migration can be very disruptive, leading to one or more days where QA is not available.

Disruption arises when testers have to do manual data setup in order to test new features. These setups get overwritten with the next refresh. Sometimes, production data must be cleansed or scrubbed of PII before use in QA. This cleansing process often introduces its own data quality problems. The backward migration process must also be kept up to date so it can propagate data back into the schema for the next release. This requires copying data and schema into QA, then forward-migrating the schema according to the new release.
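The refresh sequence described above (copy data and schema, scrub PII, forward-migrate to the next release's schema) can be sketched as a single repeatable function. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module with in-memory databases as stand-ins for real production and QA databases, and the `users` table, its columns, the `scrub_pii` rule, and the `MIGRATIONS` list are all hypothetical.

```python
import sqlite3

# Hypothetical forward migrations that the next release applies to the schema.
MIGRATIONS = [
    "ALTER TABLE users ADD COLUMN friend_count INTEGER NOT NULL DEFAULT 0",
]


def scrub_pii(conn: sqlite3.Connection) -> None:
    """Replace personal data with synthetic values derived from the row id."""
    conn.execute(
        "UPDATE users SET "
        "email = 'user' || id || '@example.invalid', "
        "full_name = 'Test User ' || id"
    )


def refresh_qa(prod_snapshot: sqlite3.Connection, qa: sqlite3.Connection) -> None:
    """Copy production data and schema into QA, scrub PII, then forward-migrate."""
    prod_snapshot.backup(qa)       # step 1: copy data and current-release schema
    scrub_pii(qa)                  # step 2: remove PII before anyone tests with it
    for statement in MIGRATIONS:   # step 3: bring the schema up to the next release
        qa.execute(statement)
    qa.commit()


# Stand-in for a production snapshot running the current release's schema.
prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, full_name TEXT)")
prod.execute("INSERT INTO users (email, full_name) VALUES ('alice@realmail.test', 'Alice Real')")
prod.commit()

qa = sqlite3.connect(":memory:")
refresh_qa(prod, qa)
rows = qa.execute("SELECT id, email, full_name, friend_count FROM users").fetchall()
print(rows)
```

Keeping the three steps in one function means the scrub rule and the migrations are exercised on every refresh, so they are less likely to fall out of date between releases.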

When many teams contend to get into a QA environment, that contention can result in lost time as well. Time is lost in delays when one team cannot move their code into QA during another team's test. It is also lost when one team overwrites test data that a different team had set up. And it can be lost when one team's code has bugs that prevent other teams from proceeding with their tests. Suppose one team works on login and registration, while another team works on friend requests. Clearly, the friend requests team cannot do their testing when login is broken. This last issue also applies across service boundaries: a service consumer may not be able to test because the QA version of their service provider is broken.

Finally, problems in QA simply take a lower priority than problems in production. Thus, the operations team may be fully consumed with current production issues, leaving the QA environment broken for extended periods. In a vicious feedback loop, this makes it likely that the next release will also create production problems.

My recommendations are these:

Give priority to well-functioning test environments.
Virtualize your test environments, so you can avoid inter-team dependencies on a QA environment.
Automate the backward data propagation, and make it part of spinning up a QA environment. When you must scrub PII, automate that process so that every QA environment can draw from a snapshot of cleansed data without impinging on the production DBAs.
If your QA stays unavailable because there are too many production issues, recognize that this is a self-sustaining pattern. You can temporarily redirect a "SWAT" team to fix QA and it will pay dividends for all future releases.
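To illustrate the second and third recommendations together, here is a rough sketch of giving each team its own QA database, cloned from one shared snapshot of already-cleansed data. It makes several assumptions: a SQLite file stands in for the cleansed snapshot, a directory per team stands in for a virtualized environment, and the names `provision_qa`, `login-team`, and `friends-team` are hypothetical.

```python
import shutil
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path


def provision_qa(cleansed_snapshot: Path, env_root: Path, team: str) -> Path:
    """Give one team a private copy of the cleansed snapshot and return its path."""
    env_dir = env_root / team
    env_dir.mkdir(parents=True, exist_ok=True)
    db_path = env_dir / "qa.db"
    shutil.copyfile(cleansed_snapshot, db_path)
    return db_path


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Stand-in for a snapshot that has already been scrubbed of PII.
    snapshot = root / "cleansed_snapshot.db"
    with closing(sqlite3.connect(snapshot)) as conn:
        conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
        conn.execute("INSERT INTO users (email) VALUES ('user1@example.invalid')")
        conn.commit()

    login_db = provision_qa(snapshot, root / "envs", "login-team")
    friends_db = provision_qa(snapshot, root / "envs", "friends-team")

    # One team wipes its own test data...
    with closing(sqlite3.connect(login_db)) as conn:
        conn.execute("DELETE FROM users")
        conn.commit()

    # ...and the other team's environment is unaffected.
    with closing(sqlite3.connect(friends_db)) as conn:
        remaining = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

print(remaining)
```

Because each team works against its own copy, one team's data setup cannot be overwritten by another's, and no one needs to ask the production DBAs for a fresh extract.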