Old Habits Die Hard: Why Refactoring for Understandability Does Not Give Immediate Benefits

Large, long-lived software systems often suffer from technical debt: shortcuts taken to increase development speed that result in code that is difficult to maintain. Technical debt can be justifiable, e.g., to quickly get a new product on the market. However, just like actual debt, if it is not repaid, one has to pay interest, which takes the form of the additional time developers need to understand and change complex code. Intuitively then, we could refactor the code to pay off the technical debt that has accumulated.

This also fits the argumentation of Robert C. Martin in his book Clean Code, in which he urges developers to take care of their code and write clean code, claiming that the structure of code has an impact on development productivity.

Recently, one of my MSc students (Erik Ammerlaan, @eammerlaan) set out to investigate whether refactoring actually increases the understandability of code. He performed his investigation in an industrial environment, involving 30 developers from one company (Exact). Those 30 developers work in 11 different teams spread over 2 countries.

In order to measure understandability, he set up 5 experiments that required the participants to make small changes to the code. Starting from the hypothesis “If given the refactored code, developers finish the coding task earlier than if the original code were given”, the 30 developers had to work with either the existing version of the code or the refactored version.

Clean code: influencing understandability both positively and negatively

When the original and refactored code were shown side-by-side after each experiment, most developers appreciated the refactored code and preferred it over the original code. They recognized that having ‘clean code’ would make the codebase more understandable. Yet, this seems to contrast with our results. In the 5 experiments, we observed that the productivity of developers can be influenced both positively and negatively by code that has been refactored for understandability when performing small coding tasks. Depending on the task and the individual, so-called clean code does not immediately improve one’s understanding of the code if one is used to working with the old structures present in the source code. As such, we can say that old programming (understanding) habits do die hard.

A matter of quality?

We noticed an important difference in the quality of the solutions implemented in original and refactored code, serving as a first indicator that good quality code prevents developers from writing sloppy code. This is in line with the broken windows theory, a criminological theory that can also be applied to refactoring: a dirty code base makes developers feel that they can get away with ‘quick and dirty’ code changes, while they might be less inclined to do so in a clean code base.

Conclusion

Our experiments showed that refactored code in general provides no instant benefits with respect to understandability, and that developers need time to adjust. We fully acknowledge that in the longer term, refactoring benefits might become much clearer and are likely to go beyond understandability (e.g., extensibility, reusability, testability, …).

More information?

Erik Ammerlaan, Wim Veninga, Andy Zaidman. Old Habits Die Hard: Why Refactoring for Understandability Does Not Give Immediate Benefits.
Proceedings of the 22nd International Conference on Software Analysis, Evolution and Reengineering (SANER), pages xxx-xxx. IEEE, Montreal, Canada, March 2015.
[LINK TO PDF]

First Steps in Testing Analytics: Does Test Code Quality Matter?

Together with the Software Improvement Group we set out to investigate whether the quality of the developer test code influences the development process of software. Now, why should the quality of the unit and/or integration tests actually matter?

Let’s first establish that you produce production code, the code that actually delivers the functionality, and test code, the code that you write to (unit) test that production code. Whether you follow a test-driven, test-with, or test-after approach to writing your tests, you want to make sure that you have your production code covered.

Does coverage tell the story?

The big question here is whether coverage really tells the complete story. Can we be sure that a high level of test coverage means we have good tests? And what are good tests? Are these tests that (1) completely test the system, (2) find a lot of bugs and make it easy to locate them, and (3) are easy to evolve? Test coverage gives you only a partial answer here, as it mainly addresses the first point: completeness.

More insights on why we should keep using test coverage, but also why we should be careful when interpreting it, can be found in this blog post by Arie van Deursen: Test Coverage: Not for Managers?

A Test Code Quality Model

Together with colleagues we constructed a test code quality model that tries to address the 3 aforementioned criteria for testing:

  1. How completely is the system tested?
  2. How effectively is the system tested?
    Does the test code enable developers to detect defects and locate the cause of these defects?
  3. How maintainable is the test code?

Completeness. For the first item, completeness, we rely on test coverage and the Assertions-McCabe ratio. We explicitly do not rely on test coverage alone, because it is easy to reach high levels of coverage without really testing all (special) circumstances. That is why we add the Assertions-McCabe ratio: it indicates the ratio between the number of actual points of testing in the test code and the number of decision points in the production code. The metric is inspired by the Cyclomatic-Number test adequacy criterion.

Effectiveness. In order to measure how effective unit test code is, we rely on the assertion density and directness metrics. Firstly, assertion density aims at measuring the ability of the test code to detect defects in the parts of the production code that it covers. This can be seen as the actual testing value delivered for a given testing effort: the actual points where testing is delivered are the assertion statements, while an indicator of the testing effort is the number of lines of test code. This brings us to the assertion density metric, defined as #assertions / LOC_testcode.

Another important part of evaluating effectiveness is determining whether the tests are able to provide the location of a defect, to facilitate the fixing process. When each unit is tested individually by the test code, a broken test that corresponds to a single unit immediately pinpoints the defect. Directness measures the extent to which the production code is covered directly, i.e., the percentage of code that is called directly by the test code. This metric tries to distinguish pure unit testing from integration testing.
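As a toy illustration of the idea behind directness (not the measurement approach used in the paper, which analyzes real systems), one can trace a test run and check which production functions are invoked with the test itself as the immediate caller:

```python
import sys

def measure_directness(test_fn, prod_functions):
    """Toy sketch: fraction of production functions that the test calls
    directly (the test is the immediate caller), rather than only
    indirectly through other production code."""
    prod_names = {f.__name__ for f in prod_functions}
    directly_called = set()

    def tracer(frame, event, arg):
        if event == "call" and frame.f_code.co_name in prod_names:
            if frame.f_back.f_code.co_name == test_fn.__name__:
                directly_called.add(frame.f_code.co_name)
        return None  # no line-level tracing needed

    sys.settrace(tracer)
    try:
        test_fn()
    finally:
        sys.settrace(None)
    return len(directly_called) / len(prod_names)

# Hypothetical production and test code:
def helper():
    return 1

def api():
    return helper() + 1

def test_api():
    assert api() == 2
```

In this example, `api` is covered directly while `helper` is only reached through `api`, yielding a directness of 0.5: a signal that `helper` is tested in integration style rather than in isolation.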

Maintainability. For the third item, we rely on the expertise of the Software Improvement Group to determine the maintainability of the test code (albeit with some changes, see the paper); explained very briefly, we look at the amount of duplication in test cases, the size of each test case, the complexity of unit test code and the coupling between tests.

Fixing bugs, implementing features… does test code quality play a role?

We first calibrated the test code quality model above with 86 software systems (14 open source, 72 closed source). Once we had a calibrated test code quality model, we applied our model on 18 open source software systems representing 75 years of development.

We subsequently looked at the Issue Tracking System (ITS) of each of the 18 open source software systems. For each system we looked at the average “open” time of a bug report or feature request, i.e., how long an item remained open before developers resolved it.

Finally, we investigated whether high-quality test code means that teams of software engineers are able to implement new features more quickly, and whether bug reports can be dealt with more quickly.

First observation. Test code quality does not seem to influence the speed at which defects can be resolved. Our assumption here is that with high-quality test code, the easiest-to-fix bugs (the “low-hanging fruit”) are already out of the system and the most difficult-to-fix bugs remain.

Second observation. High-quality test code is positively correlated with the number of issues (both feature requests and bug reports) per month that a team can process.
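The statistical machinery behind such observations can be illustrated with a rank correlation, e.g., Spearman's rho, which correlates the ranks of two variables (say, test code quality ratings vs. issues handled per month). The sketch below ignores tied ranks for simplicity; the actual statistical setup in the paper may differ in its details:

```python
def ranks(values):
    """Rank each value (1 = smallest); assumes no ties."""
    sorted_vals = sorted(values)
    return [sorted_vals.index(v) + 1 for v in values]

def spearman_rho(x, y):
    """Spearman's rank correlation coefficient (no tie correction):
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))
```

A rho close to +1 means the rankings agree (higher quality, more issues processed), close to -1 that they are inverted, and close to 0 that no monotonic relation is visible.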

Take home message

Testing goes beyond merely detecting defects. If high-quality test code is present, it does influence the number of issues that a development team can handle. In contrast, and seemingly counter-intuitive at first sight, having high-quality tests in your project does not decrease the time needed to solve a bug. We assume this is related to the fact that the easiest-to-solve bugs have already been caught by the tests, leaving the more complicated bugs, which take more time to solve, in the system.

Interested in more?

This is joint work with Dimitrios Athanasiou, Ariadi Nugroho and Joost Visser. More information about this investigation can be found in our paper, which is scheduled for publication in the IEEE Transactions on Software Engineering. A pre-print copy can be found here.

Web API Growing Pains: Stories from Client Developers and Their Code

Together with my PhD student Tiago Espinha, we have been investigating what it means for developers to rely on so-called Web APIs. Let’s first get one thing out of the way: what is a Web API? “A (server-side) Web API is a programmatic interface to a defined request-response message system, typically expressed in JSON or XML, which is exposed via the web—most commonly by means of an HTTP-based web server.” [Wikipedia]

So, what does it mean? First of all: it opens up possibilities. In particular, remote data sources become available to all programmers. You want to incorporate Google Maps in your application? No problem, there is a Web API for that. You want to get up-to-date stock information? Just find the right Web API, integrate it in your application and you are ready to go.

Sounds like a great deal, right? Without importing entire libraries, you can integrate data sources in a fairly light-weight manner. But what does it really mean for software developers?

Let’s take a deeper look. First of all, software engineers are no strangers to APIs; they are used to them. No software engineer or team of software engineers creates everything from scratch: they always rely on standard functionality offered to them through a framework or an API. They have come to love (or hate) the XML parser they are using… but they know perfectly well what this library can do for them (and what it cannot). If these software engineers are lucky, the provider of the API sometimes releases a new version, fixing some of the bugs of the previous version or adding new functionality. In some cases, this means rewriting part of their own application, because the changes to the API are breaking, meaning that the interface has changed to such an extent that the old functionality no longer works or is no longer available.

However, the great thing is that when there are breaking changes, the software engineers can (in most cases) decide for themselves whether they upgrade to the new version. And, perhaps more importantly, when they perform this upgrade.

Is this also true for Web APIs? In our paper we started investigating this. So, what did we find out for the APIs of Google, Facebook, Twitter and Netflix? Our main finding is that software engineers are really at the mercy of the Web API providers. The Web API provider can dictate pretty much everything: from which functionality survives into the next version of the Web API, to how much time the software engineers get to cope with the new version. We have seen that Facebook provides a 90-day window to cope with breaking changes. Google, on the other hand, announced a 6-month transition period, but as the deadline drew closer, they succumbed to pressure from developers and extended that period.

Loosely coupled, strongly tied?

While we’ve seen that there is no real consensus or industry standard on how Web API evolution should happen, we did realize something quite disturbing. While Web APIs are typically promoted for their loose coupling, we have seen signs that software engineers are really at the mercy of the Web API providers when it comes to upgrading. Old versions of a Web API are typically retired pretty quickly, thus forcing developers to upgrade and effectively creating a strong tie.

Some recommendations for Web API providers

  1. Communicate what will change
  2. Communicate when these changes become permanent
  3. Do not change too often (or you may lose your customers)
  4. Do not let old versions linger too long (you will have a hard time pulling the plug on them)
  5. Keep usage data of your system to see who is affected by changes
  6. Provide documentation (and more documentation!)
  7. Indicate the stability status per Web API feature
  8. Organize blackout tests that show which features will be brought offline soon

More reading?
Tiago Espinha, Andy Zaidman, Hans-Gerhard Gross. Web API Growing Pains: Stories from Client Developers and Their Code.
Proceedings of the joint meeting of the Conference on Software Maintenance and Reengineering and the Working Conference on Reverse Engineering (CSMR-WCRE), pages 84-93. IEEE, Antwerp, February 2014.
[PDF link]


In my research I try to understand how we can take away obstacles for software engineers to test more. Test more, test more efficiently with the ultimate goal to produce higher quality software.

Earlier this week I had a nice conversation with one of my MSc students and his company supervisors (thanks Erik, Wim, Jerry and Wietze). One of the observations we discussed is that some companies hardly experience any improvement from automated testing compared to manual testing [1]. This observation was made by Kasurinen et al. when studying 31 organizations in 2009: 10% of those organizations were using automated testing; 30% were using agile development methodologies.

The study by Kasurinen et al. is not the only one to note that automated testing is not always seen as beneficial compared to manual testing. Engstrom and Runeson report that 50% of the respondents to their survey perform as much automated testing as manual testing [2].

Now, it is easy to understand that manual testing remains very important, as manual testing is a lot more agile, i.e., the tester can follow a hunch, something a machine (an automated test) cannot do. However, the benefit of an automated test lies in the fact that it can be repeated, over and over again. As such, it is perfectly suited to detect regressions… but that is the long-term view, right? So, in a sense, manual testing leads to short-term benefits, while automated testing has more rewards in the long term, of which detecting regressions, creating software with a better structure and having (executable) documentation are probably the most important ones.

This leads me to the main point of this blog post. Why is the adoption of agile methodologies happening in more organizations than the adoption of automated testing? Both topics are “in fashion” and much talked about, but it seems that everyone is using SCRUM or some other methodology, while not everyone is actually doing automated testing.

My feeling is that this might have to do with the fact that the benefits of SCRUM and other agile methodologies can be felt in the short term, through better control of the process and the ability to respond better to changes in the requirements. The true benefits of automated testing, on the other hand, can only be seen in the long term.

When considering the definition of a SCRUM Master (taken from Wikipedia on December 19th, 2013; http://en.wikipedia.org/wiki/Scrum_(software_development)):

Scrum is facilitated by a Scrum Master, who is accountable for removing impediments to the ability of the team to deliver the product goals and deliverables. The Scrum Master is not a traditional team lead or project manager, but acts as a buffer between the team and any distracting influences.

… my advice would be to also install a Test Master: an independent team member who is concerned with finding the right balance between manual and automated testing, keeping both short-term and longer-term quality assurance goals in mind.

Of course, I might be totally wrong… but I am looking forward to discussing this further with you over the coming weeks and months!

[1] Jussi Kasurinen, Ossi Taipale, and Kari Smolander. Software test automation in practice: empirical observations. Advances in Software Engineering, 2010.
[2] Emelie Engstrom and Per Runeson. A qualitative survey of regression testing practices. In Product-Focused Software Process Improvement, pages 3–16. Springer, 2010.

First blog post

One of my New Year’s resolutions for 2014 is to start a blog. Here it is!

The catalyst for this idea to start blogging is the TestRoots project which will be starting in January 2014. As the name suggests, TestRoots is about software testing, with a particular focus on unit testing. I will use this blog to discuss some of my ideas concerning unit testing and my research. So stay tuned for that first blog post.

For now, my best wishes for 2014!