Together with the Software Improvement Group we set out to investigate whether the quality of developer test code influences the software development process. Now, why should the quality of your unit and/or integration tests actually matter?
Let's first establish that you produce production code, the code that actually delivers the functionality, and test code, the code that you write to (unit) test the production code. Whether you follow a test-driven, test-with, or test-after approach to writing your tests, you want to make sure that your production code is covered.
Does coverage tell the story?
The big question here is whether coverage really tells the complete story. Can we be sure that a high level of test coverage means we have good tests? And what are good tests? Are they tests that (1) completely test the system, (2) find a lot of bugs and make it easy to locate their cause, and (3) are easy to evolve? Test coverage only gives a partial answer here, as it mainly addresses the first point: completeness.
More insights into why we should keep using test coverage, and why we should nevertheless be careful when interpreting it, can be found in this blog post by Arie van Deursen: Test Coverage: Not for Managers?
A Test Code Quality Model
Together with colleagues we constructed a test code quality model that tries to address the 3 aforementioned criteria for testing:
- How completely is the system tested?
- How effectively is the system tested? Does the test code enable developers to detect defects and locate the cause of these defects?
- How maintainable is the test code?
Completeness. For the first item, completeness, we rely on test coverage and the Assertions-McCabe ratio. We explicitly do not rely on test coverage alone, because it is easy to reach high levels of coverage without really testing all (special) circumstances. That is why we add the Assertions-McCabe ratio: it indicates the ratio between the number of actual points of testing (assertions) in the test code and the number of decision points in the production code. The metric is inspired by the Cyclomatic-Number test adequacy criterion.
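To make this concrete, here is a minimal sketch of how such a ratio could be computed. The assertion-matching regular expression and the idea of passing in the production code's total number of decision points are simplifying assumptions made for illustration; in practice a static analysis tool would supply the latter.

```python
import re
from pathlib import Path

# Simplified pattern for JUnit-style assertions (assertEquals, assertTrue, ...).
ASSERT_PATTERN = re.compile(r"\bassert\w*\s*\(")

def count_assertions(test_dir: str) -> int:
    """Count assertion statements in all Java test files under test_dir."""
    return sum(
        len(ASSERT_PATTERN.findall(path.read_text(encoding="utf-8")))
        for path in Path(test_dir).rglob("*.java")
    )

def assertions_mccabe_ratio(test_dir: str, decision_points: int) -> float:
    """Ratio of assertions in the test code to decision points (McCabe
    complexity) in the production code."""
    return count_assertions(test_dir) / decision_points
```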
Effectiveness. In order to measure how effective unit test code is, we rely on the assertion density and directness metrics. Firstly, assertion density aims at measuring the ability of the test code to detect defects in the parts of the production code that it covers. It can be seen as the actual testing value delivered for a given testing effort: the points where testing value is delivered are the assertion statements, while an indicator of the testing effort is the number of lines of test code. This brings us to the assertion density metric, which is defined as #assertions / LOC_testcode.
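Continuing the sketch above, assertion density could be computed in the same style; counting non-blank lines as LOC is another simplifying assumption.

```python
def assertion_density(test_dir: str) -> float:
    """Assertions per line of test code (#assertions / LOC_testcode),
    using non-blank lines as a simple proxy for LOC.
    Uses Path and count_assertions from the previous sketch."""
    loc = sum(
        1
        for path in Path(test_dir).rglob("*.java")
        for line in path.read_text(encoding="utf-8").splitlines()
        if line.strip()
    )
    return count_assertions(test_dir) / loc
```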
Another important part of evaluating effectiveness is determining whether the tests are able to provide the location of a defect, to facilitate the fixing process. When each unit is tested individually by the test code, a broken test that corresponds to a single unit immediately pinpoints the defect. Directness measures the extent to which the production code is covered directly, i.e., the percentage of production code that is called directly by the test code. This metric tries to distinguish pure unit testing from integration testing.
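As a small, hypothetical illustration of the idea (not code from the study): in the snippet below, parse() is covered directly by the first test, but only indirectly, via summarise(), by the second.

```python
# Hypothetical production code
def parse(raw: str) -> list[str]:
    return raw.strip().split(",")

def summarise(raw: str) -> int:
    return len(parse(raw))

# Test code
def test_parse_direct():
    # parse() is covered *directly*: the test calls it itself.
    assert parse(" a,b ") == ["a", "b"]

def test_summarise_indirect():
    # parse() is executed here too, but only *indirectly* via summarise();
    # for the directness metric only summarise() counts as directly covered.
    assert summarise("a,b,c") == 3
```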
Maintainability. For the third item we rely on the expertise of the Software Improvement Group to determine the maintainability of the test code (albeit with some changes; see the paper). Explained very briefly, we look at the amount of duplication in test cases, the size of each test case, the complexity of the unit test code, and the coupling between tests.
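As a hypothetical illustration of two of these properties, duplication and test size: repeating the same fixture code in every test method counts against duplication, whereas extracting it into a shared setUp() keeps each test small and focused.

```python
import unittest
from collections import Counter

class Cart:
    """Minimal production class, only here to make the example runnable."""
    def __init__(self):
        self._items = Counter()
    def add(self, name, qty):
        self._items[name] += qty
    def remove(self, name, qty):
        self._items[name] -= qty
    def count(self, name):
        return self._items[name]

class CartTest(unittest.TestCase):
    def setUp(self):
        # Shared fixture instead of duplicating this code in every test.
        self.cart = Cart()
        self.cart.add("apple", 2)

    def test_add_item(self):
        self.assertEqual(self.cart.count("apple"), 2)

    def test_remove_item(self):
        self.cart.remove("apple", 1)
        self.assertEqual(self.cart.count("apple"), 1)

if __name__ == "__main__":
    unittest.main()
```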
Fixing bugs, implementing features… does test code quality play a role?
We first calibrated the test code quality model described above using 86 software systems (14 open source, 72 closed source). Once we had a calibrated model, we applied it to 18 open source software systems representing 75 years of development.
We subsequently examined the Issue Tracking System (ITS) of each of the 18 open source systems. For each system we determined the average “open” time of a bug report or feature request, i.e., how long it took the developers to deal with an item.
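A minimal sketch of how such an average open time could be computed from an ITS export; the CSV layout and the field names ('created', 'resolved') are assumptions made for illustration, not the actual data format we used.

```python
import csv
from datetime import datetime
from statistics import mean

def average_open_days(csv_path: str) -> float:
    """Average number of days an issue stayed open before it was resolved.
    Assumes a CSV export with ISO-formatted 'created' and 'resolved' columns."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        open_times = [
            (datetime.fromisoformat(row["resolved"])
             - datetime.fromisoformat(row["created"])).total_seconds() / 86400
            for row in csv.DictReader(f)
            if row["resolved"]  # skip issues that are still open
        ]
    return mean(open_times)
```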
Finally, we investigated whether high-quality test code correlates with teams of software engineers being able to implement new features more quickly and deal with bug reports more quickly.
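The kind of analysis involved can be sketched as below; the per-system numbers are made up and the choice of Spearman's rank correlation here is illustrative rather than a description of the exact statistical procedure in the paper.

```python
from scipy.stats import spearmanr

# Made-up per-system data points: overall test code quality rating
# and the number of issues the team resolved per month.
test_quality = [2.5, 3.0, 3.5, 4.0, 4.5, 2.0]
issues_per_month = [10, 14, 18, 25, 30, 8]

rho, p_value = spearmanr(test_quality, issues_per_month)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```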
First observation. Test code quality does not seem to influence the speed with which defects are resolved. Our assumption here is that with high-quality test code in place, the easiest-to-fix bugs (the “low-hanging fruit”) are already out of the system, so the most difficult-to-fix bugs remain.
Second observation. Test code quality is positively correlated with the number of issues (both feature requests and bug reports) that a team can process per month.
Take home message
Testing goes beyond merely detecting defects. If high-quality test code is present, it does influence the number of issues that a development team can handle. In contrast, and seemingly counter-intuitive at first sight, having high-quality tests in your project does not decrease the time needed to solve a bug. We assume this is related to the fact that the easiest-to-solve bugs have already been caught by the tests, leaving the more complicated bugs, which take more time to solve, in the system.
Interested in more?
This is joint work with Dimitrios Athanasiou, Ariadi Nugroho and Joost Visser. More information about this investigation can be found in our paper, which is scheduled for publication in the IEEE Transactions on Software Engineering. A pre-print copy can be found here.