Test Education: The Writing is On the Wall

A few days ago, I launched a very simple Twitter poll.


As the results show, around half of the 60 respondents answered that they learned about software testing “on the job”. Obviously, this is but a Twitter poll, but it strengthens an intuition I have about why people are doing so little testing. Before going into that: there are many, many factors that influence this.

But let’s focus on CS education for a minute. Many computer science curricula do not contain a dedicated testing course. I have no firm evidence of this, but I looked at quite a few curricula. Instead, these universities opt for a more spread-out way of teaching testing. A bit of testing is taught during the introductory programming course, a bit during the software engineering class, and the rest is expected to come with the many projects students are doing. I am absolutely not saying that this is a wrong strategy, but it does raise some important questions:

  • are we sure that students get the importance of testing when it is just a tiny topic in one or two courses?
  • are we sure that students actually do software testing in their projects?
  • are we sure that students are sufficiently trained in testing? This goes from professors knowing what exactly is taught w.r.t. testing in each of the courses…

Back to my Twitter poll, I also got a lot of feedback (thanks for that!). Part of that feedback was a pointer to a paper entitled “The Impact of Software Testing Education on Code Reliability: An Empirical Assessment” (see the paper here; unfortunately it is behind a paywall). While the paper contains many interesting insights and some good arguments, the thing that struck me most is their observation that CS teachers are not always well-trained in testing themselves (although their way of measuring this might be questionable, and perhaps too theoretical). This is frightening if you ask me.

At TU Delft, we start by teaching the students to write very simple unit tests in week 3 of their first year. Programming assignments without any unit tests are simply rejected. The project that immediately follows their basic programming courses is again test-focused. The real testing course then comes at the end of their first year, where @avandeursen teaches them the basics of testing. At the Master level we have a dedicated “advanced software testing” course. I am absolutely not saying that this is the way to go, but we do hope it shows students that testing is not something you might want to do; it is something that is essential.

If we are ever going to produce software engineers capable of writing reliable software, the writing is on the wall if you ask me.


Is Testing (on StackOverflow) Dead?

I just picked up on this blogpost from StackOverflow that introduces “StackOverflow Trends” (see https://stackoverflow.blog/2017/05/09/introducing-stack-overflow-trends/). The StackOverflow Trends Tool allows you to track how the ~8000 programming language and technology questions shift over time.

How about testing?

I immediately started by filling in some keywords related to (unit) testing and the following graph came out.


Now what is striking to me is that general questions on “unit-testing” and “testing” show a considerable drop in terms of the percentage of questions being asked on SO on these topics. So, I thought this could be related to the fact that questions shift towards more specific technologies. So I also included some more specific technologies, like jUnit, Cucumber, MSTest, etc. These are all showing a flat line, so the percentage of questions pertaining to these technologies remains pretty much constant.

So, I tried to imagine which new technologies have come into place or at least could have become more important. Maybe testing of web applications? So, if I then look at web application testing and think of Selenium, I get the following graph:


There is a clear increase in these technologies! It even seems that the term “selenium” is as popular as the more general term “testing” in the previous graph!

What about mobile app testing then? I found this website with the 10 most popular Android and iOS testing tools, so I ran all 10 of them through the StackOverflow Trends tool. Only 1 out of the 10 showed up: Appium.

Apparently, either not that much mobile app testing is going on, or everyone knows what they are doing…

@TimvdLippe suggested that I also look into “mocking” and “JavaScript testing”. So I also explored these and found this for mocking:


It seems here that the more general search term “mocking” is being replaced by more specific, tool-oriented search terms.

So what about JavaScript then?

Going by this website http://stateofjs.com/2016/testing/, I tried all testing tools mentioned. A lot of them don’t show up in the available tags on the StackOverflow trend tool… @TimvdLippe also suggested looking at Github, so I also included the JS testing tools with most stars. This is what comes out:


Seems JavaScript testing is gaining traction, but if you look at the overall percentage of questions pertaining to it, it is still very, very low if you ask me.


So, what’s my take?

Well, either people really know what they are doing when it comes to testing, or people don’t ask the test-related questions on StackOverflow. Or, and this might actually be the most likely one: testing is really not happening that often.

I would love to hear your thoughts!

Are we unit testing or integration testing with jUnit?

At the recent SCAM 2016 in Raleigh there was a discussion on what kind of tests are being written in jUnit, a popular unit test framework for Java. The discussion was also held on Twitter and it made me think of some work that an MSc student did a while ago. More precisely, Joep Weijers did this work during his thesis work at the TOPDesk company in 2012. A link to his MSc thesis is here.

A short summary of what he did follows here.

Joep Weijers considered test classes from Maven, JFreeChart, Apache commons and an industrial software system. The results for the open source systems are as follows:

Project                      Total jUnit tests   Unit tests   Integration tests   % integration tests
Apache Commons Collections   1005                232          773                 76.7%
Apache Commons Digester      200                 17           183                 91.5%
JFreeChart                   2209                236          1973                89.3%
Maven core                   237                 15           222                 93.7%
Maven model                  148                 96           52                  35.1%

What is clear from the results is that 4 out of the 5 systems have a majority of integration tests. Only in Maven model are integration tests in the minority, with a share of 35.1%.

Behind the scenes

In his MSc thesis Joep used a heuristic to determine whether a jUnit test is a unit or integration test.

A brief description of this heuristic follows: JUnitCategorizer first determines the Class Under Test for a jUnit test. Sometimes it is a clear-cut case where only 1 class is exercised in a jUnit test, but sometimes multiple classes are used. When multiple classes are involved in a jUnit test, we need to take additional steps. We need to determine whether mock objects are used. Two situations can be distinguished here: either a mocking framework is used, or mock objects are created manually. The framework-based situation is easier to detect, while in the case of manual mock objects, we rely on naming conventions to find mock objects.

Finally, if we remove the mock objects and classes from e.g. standard libraries, we count the number of remaining classes in the jUnit test. If that is one, we label the test as being a unit test, if > 1, we classify it as an integration test.
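To make the counting step concrete, here is a minimal sketch in Python of how such a classification could look. It is not JUnitCategorizer itself: the record type, the name hints for manual mocks and the library prefixes are all hypothetical choices for illustration, and the text above does not say what happens when no classes remain, so the sketch returns "unknown" in that case.

```python
from dataclasses import dataclass, field

# Hypothetical naming conventions used to spot hand-written mock objects.
MOCK_NAME_HINTS = ("Mock", "Stub", "Fake", "Dummy")
# Hypothetical package prefixes that count as "standard libraries".
LIBRARY_PREFIXES = ("java.", "javax.", "org.junit.")


@dataclass
class JUnitTestInfo:
    name: str
    classes_used: set  # fully qualified names of classes touched by the test
    framework_mocks: set = field(default_factory=set)  # mocked via a framework


def simple_name(qualified: str) -> str:
    """Return the class name without its package."""
    return qualified.rsplit(".", 1)[-1]


def is_manual_mock(qualified: str) -> bool:
    """Guess from the class name whether this is a hand-written mock."""
    name = simple_name(qualified)
    return any(name.startswith(h) or name.endswith(h) for h in MOCK_NAME_HINTS)


def remaining_classes(test: JUnitTestInfo) -> set:
    """Drop framework mocks, manual mocks and library classes."""
    return {
        c for c in test.classes_used
        if c not in test.framework_mocks
        and not is_manual_mock(c)
        and not c.startswith(LIBRARY_PREFIXES)
    }


def classify(test: JUnitTestInfo) -> str:
    """Label a test by the number of non-mock, non-library classes it uses."""
    count = len(remaining_classes(test))
    if count == 1:
        return "unit"
    if count > 1:
        return "integration"
    return "unknown"  # case not covered by the summary above
```

For example, a test that touches `com.example.Order`, `com.example.Customer` and `com.example.MockMailer` would be labelled an integration test, because two non-mock classes remain after filtering.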

This brief summary is far from complete and the reader is invited to read more in Joep’s thesis. A manual investigation of the heuristic indicates that its classification precision is 95%.

Additional reading: Joep Weijers, Extending Project Lombok to improve JUnit tests. MSc thesis. Delft University of Technology, 2012.


Old Habits Die Hard: Why Refactoring for Understandability Does Not Give Immediate Benefits

Large, long-lived software systems often cope with the problem of technical debt: shortcuts that were taken to increase development speed, but that result in code that is difficult to maintain. Technical debt can be justifiable, e.g., for quickly getting a new product on the market. However, just like actual debts, if the debt is not repaid, one has to pay interest, which takes the form of additional time that developers need to understand and change complex code. Intuitively then, we could refactor the code to pay off the technical debt that has accumulated.

This also fits with the argumentation of Robert C. Martin in his book Clean Code in which he urges developers to take care of their code and start writing clean code as he claims that the structure of code has an impact on the development productivity.

Recently then, one of my MSc students (Erik Ammerlaan, @eammerlaan) set out to investigate whether refactoring actually increases the understandability of code. He performed his investigation in an industrial environment and involved 30 developers from one company (Exact). Those 30 developers are working in 11 different teams spread out over 2 countries.

In order to measure the understandability, he set up 5 experiments that required the participants to make small changes to the code. Starting from the hypothesis that “If given the refactored code, developers finish the coding task earlier than if the original code were given.”, the 30 developers had to work with either the existing version of the code, or the refactored version.

Clean code: influencing understandability both positively and negatively

When the original and refactored code were shown side-by-side after each experiment, most developers appreciated the refactored code and preferred it over the original code. They recognized that having ‘clean code’ would make the codebase more understandable. Yet, this seems to contrast with our results. In the 5 experiments, we observed that the productivity of developers can be influenced both positively and negatively by code that is refactored for understandability when performing small coding tasks. Depending on the task and the individual, so-called clean code does not immediately improve one’s understanding of code, if one is used to working with the old structures present in source code. As such, we can say that old programming (understanding) habits do die hard.

A matter of quality?

We noticed an important difference in the quality of the solutions implemented in original and refactored code, serving as a first indicator that good quality code prevents developers from writing sloppy code. This is in line with the broken windows theory, a criminological theory that can also be applied to refactoring: a dirty code base makes developers feel that they can get away with ‘quick and dirty’ code changes, while they might be less inclined to do so in a clean code base.


Our experiments showed that having refactored code in general provides no instant benefits with respect to understandability and that developers need time to adjust. We fully acknowledge that in the longer term, refactoring benefits might become much clearer and are likely to go beyond understandability (e.g., extensibility, reusability, testability, …).

More information?

Erik Ammerlaan, Wim Veninga, Andy Zaidman. Old Habits Die Hard: Why Refactoring for Understandability Does Not Give Immediate Benefits.
Proceedings of the 22nd International Conference on Software Analysis, Evolution and Reengineering (SANER), pages xxx-xxx. IEEE, Montreal, Canada, March 2015.

First Steps in Testing Analytics: Does Test Code Quality Matter?

Together with the Software Improvement Group we set out to investigate whether the quality of the developer test code influences the development process of software. Now, why should the quality of the unit and/or integration tests actually matter?

Let’s first establish that you produce production code, the code that actually delivers the functionality, and test code, the code that you write to (unit) test the production code. Whether you follow a test-driven, test-with or test-after approach to writing your tests, you want to make sure that you have your production code covered.

Does coverage tell the story?

The big question here is whether coverage really tells the complete story. Can we be sure that if we have a high level of test coverage, we will have good tests? And what are good tests? Are these tests that (1) completely test the system, (2) find a lot of bugs and make it easy to locate them, and (3) are easy to evolve? Test coverage will give you a partial answer here, as it mainly answers our first point: completeness.

More insights on why we should keep using test coverage, but why we should also be careful while interpreting it can be found in this blog post by Arie van Deursen: Test Coverage: Not for Managers? 

A Test Code Quality Model

Together with colleagues we constructed a test code quality model that tries to address the 3 aforementioned criteria for testing:

  1. How completely is the system tested?
  2. How effectively is the system tested?
    Does the test code enable developers to detect defects and locate the cause of these defects?
  3. How maintainable is the test code?

Completeness. For the first item, completeness, we rely on test coverage and the assertions-McCabe ratio. We explicitly don’t only rely on test coverage, because it is easy to reach high levels of coverage, without really testing all (special) circumstances. That is why we add the assertions-McCabe ratio. The assertions-McCabe ratio metric indicates the ratio between the number of the actual points of testing in the test code and of the decision points in the production code. The metric is inspired by the Cyclomatic-Number test adequacy criterion.
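As an illustration of the ratio described above, the following Python sketch counts assertion calls in Java test source and decision points in Java production source using regular expressions. Both regular expressions are crude approximations introduced here for illustration only (they ignore comments and string literals, for instance); they are not the measurement method used in the paper.

```python
import re

# Assumption: any call whose name starts with "assert" is a point of testing.
ASSERT_CALL = re.compile(r"\bassert\w*\s*\(")
# Assumption: these keywords and operators approximate decision points.
DECISION_POINT = re.compile(r"\b(?:if|for|while|case|catch)\b|&&|\|\||\?")


def count_assertions(test_source: str) -> int:
    """Count assertion calls in the test code."""
    return len(ASSERT_CALL.findall(test_source))


def count_decision_points(production_source: str) -> int:
    """Count (approximate) decision points in the production code."""
    return len(DECISION_POINT.findall(production_source))


def assertions_mccabe_ratio(test_source: str, production_source: str) -> float:
    """Ratio of assertions in test code to decision points in production code."""
    decisions = count_decision_points(production_source)
    if decisions == 0:
        raise ValueError("production code contains no decision points")
    return count_assertions(test_source) / decisions


PRODUCTION = """
int sign(int x) {
    if (x > 0) return 1;
    if (x < 0) return -1;
    return 0;
}
"""

TESTS = """
@Test void positive() { assertEquals(1, sign(5)); }
@Test void negative() { assertEquals(-1, sign(-5)); }
@Test void zero()     { assertEquals(0, sign(0)); }
"""

# 3 assertions over 2 decision points
print(assertions_mccabe_ratio(TESTS, PRODUCTION))  # prints 1.5
```

A ratio well below 1 would suggest that there are branches in the production code that no assertion could be checking, even if those lines are reported as covered.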

Effectiveness. In order to measure how effective unit test code is, we rely on the assertion density and directness metrics. Firstly, assertion density aims at measuring the ability of the test code to detect defects in the parts of the production code that it covers. This could be measured as the actual testing value that is delivered given a certain testing effort. The actual points where testing is delivered are the assertion statements. At the same time, an indicator for the testing effort is the lines of test code. This brings us to the assertion density metric, which is defined as #assertions/LOC_testcode.

Another important part of evaluating the effectiveness is determining whether the tests are able to provide the location of the defect to facilitate the fixing process. When each unit is tested individually by the test code, a broken test that corresponds to a single unit immediately pinpoints the defect. Directness measures the extent to which the production code is covered directly, i.e. the percentage of code that is being called directly by the test code. This metric tries to distinguish pure unit testing from integration testing.
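The two effectiveness metrics can be written down as small functions. This is a minimal Python sketch under stated assumptions: the inputs (assertion count, lines of test code, sets of unit names) are assumed to come from some other analysis step, and treating "code" as a set of named units such as methods is a simplification made here, not something the text prescribes.

```python
def assertion_density(num_assertions: int, test_loc: int) -> float:
    """#assertions / LOC_testcode."""
    if test_loc <= 0:
        raise ValueError("test code must have at least one line")
    return num_assertions / test_loc


def directness(directly_called: set, production_units: set) -> float:
    """Percentage of production units that test code calls directly.

    Assumption: units are identified by name (e.g. method names).
    """
    if not production_units:
        raise ValueError("no production units given")
    covered = directly_called & production_units
    return 100.0 * len(covered) / len(production_units)


# Hypothetical numbers: 12 assertions in 120 lines of test code, and
# tests that directly call 2 of the 4 production methods.
density = assertion_density(12, 120)
direct = directness({"parse", "render"}, {"parse", "render", "load", "save"})
print(density, direct)
```

With these hypothetical inputs, half of the production methods are only reached indirectly (or not at all), so a failing test would not immediately pinpoint a defect in `load` or `save`.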

Maintainability. For the third item, we rely on the expertise of the Software Improvement Group to determine the maintainability of the test code (albeit with some changes, see the paper); explained very briefly, we look at the amount of duplication in test cases, the size of each test case, the complexity of unit test code and the coupling between tests.

Fixing bugs, implementing features… does test code quality play a role?

We first calibrated the test code quality model above with 86 software systems (14 open source, 72 closed source). Once we had a calibrated test code quality model, we applied our model on 18 open source software systems representing 75 years of development.

We subsequently looked at the Issue Tracking System (ITS) of each of the 18 open source software systems. For each system we looked at the average “open” time of a bug report or feature request, i.e., we looked at how long it took for developers to work on an item.
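As a small illustration of what the average “open” time means, the sketch below computes it in Python from pairs of opened and closed timestamps. The input format is an assumption made for this example; real issue trackers export their data in various shapes, and the dates used are made up.

```python
from datetime import datetime
from statistics import mean

SECONDS_PER_DAY = 86400


def average_open_days(issues) -> float:
    """Average time in days between opening and closing an issue.

    `issues` is an iterable of (opened, closed) datetime pairs.
    """
    durations = [
        (closed - opened).total_seconds() / SECONDS_PER_DAY
        for opened, closed in issues
    ]
    if not durations:
        raise ValueError("no issues given")
    return mean(durations)


# Hypothetical issues: one open for 2 days, one open for 4 days.
issues = [
    (datetime(2013, 1, 1), datetime(2013, 1, 3)),
    (datetime(2013, 1, 10), datetime(2013, 1, 14)),
]
print(average_open_days(issues))  # prints 3.0
```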

Finally, we investigated whether high-quality test code correlates with teams of software engineers being able to implement new features more quickly and to deal with bug reports more quickly.

First observation. Test code quality does not seem to influence the speed by which defects can be resolved. Our assumption here is that having high-quality test code means that the easiest to fix bugs (the “low-hanging fruit”) are already out of the system and the most difficult to fix bugs remain.

Second observation. Test code quality is positively correlated with the number of issues (valid for both feature requests and bug reports) per month that a team can process.

Take home message

Testing goes beyond merely detecting defects. If high-quality test code is present, it does influence the number of issues that a development team can handle. In contrast, and at first sight seemingly counter-intuitive, having high-quality tests in your project does not decrease the time needed to solve a bug, which we assume to be related to the fact that the easiest to solve bugs have already been caught by the tests, thereby leaving the more complicated bugs in the system, which take more time to solve.

Interested in more?

This is joint work with Dimitrios Athanasiou, Ariadi Nugroho and Joost Visser. More information about this investigation can be found in our paper, which is scheduled for publication in the IEEE Transactions on Software Engineering. A pre-print copy can be found here.

Web API Growing Pains: Stories from Client Developers and Their Code

Together with my PhD student Tiago Espinha, I have been investigating what it means for developers to rely on so-called Web APIs. Let’s first get one thing out of the way: what is a Web API? “A (server-side) Web API is a programmatic interface to a defined request-response message system, typically expressed in JSON or XML, which is exposed via the web—most commonly by means of an HTTP-based web server.” [Wikipedia].

So, what does it mean? First of all: it opens up possibilities. In particular, remote data sources become available to all programmers. You want to incorporate Google Maps in your application? No problem, there is a Web API for that. You want to get up-to-date stock information? Just find the right Web API, integrate it in your application and you are ready to go.

Sounds like a great deal, right? Without importing entire libraries, you can integrate data sources in a fairly light-weight manner. But what does it really mean for software developers?

Let’s take a deeper look. First of all, software engineers are no strangers to APIs; they are used to them. No single software engineer or team of software engineers creates everything from scratch: they always rely on standard functionality offered to them through a framework or an API. They have come to love (or hate) the XML parser they are using… but they know perfectly well what this library can do for them (and what it cannot). If these software engineers are lucky, the provider of the API sometimes releases a new version of the API, fixing some of the bugs of the previous version or adding some new functionality. In some cases, this means rewriting part of their own application, because the changes to the API are breaking, meaning that the interface to the API has changed to such an extent that the old functionality is no longer working or no longer available.

However, the great thing is that if there are these breaking changes, the software engineers can (in most cases) decide for themselves whether they upgrade to this new version and, perhaps more importantly, when they perform this upgrade.

Is this also true for Web APIs? In our paper we started investigating this. So, what did we find out for APIs from Google, Facebook, Twitter and Netflix? Our main finding is that software engineers are really at the mercy of the Web API providers. The Web API provider can dictate pretty much everything: from which functionality survives into the next version of the Web API, to how much time the software engineers get to cope with the new version of the Web API. What we’ve seen is that Facebook provides a 90-day window to cope with breaking changes. Google on the other hand announced a 6 month transition period, but when the deadline came closer and closer, they succumbed to pressure from the developers to extend that period.

Loosely coupled, strongly tied?

While we’ve seen that there is no real consensus or industry standard on how Web API evolution should happen, we did realize something quite disturbing. While Web APIs are typically promoted because of their loose coupling, we have seen signs that software engineers are really at the mercy of the Web API providers when it comes to upgrading. Old versions of the Web API are typically discontinued pretty quickly, thus forcing clients to upgrade and effectively creating a strong tie.

Some recommendations for Web API providers

  1. Communicate what will change
  2. Communicate when these changes become permanent
  3. Do not change too often (or you can lose your customers)
  4. Do not let old versions linger too long (you will have a hard time pulling the plug on the old version)
  5. Keep usage data of your system to see who is affected by changes
  6. Provide documentation (and more documentation!)
  7. Indicate the stability status per Web API feature
  8. Organize blackout tests that show which features will be brought offline soon
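Some of these recommendations (when changes become permanent, stability status per feature, blackout tests) could be supported by a simple feature registry on the provider side. The Python sketch below is purely illustrative: the `Feature` record, the `Stability` levels, the 30-day window and the feature names are hypothetical and are not taken from the paper or from any of the studied Web APIs.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import List, Optional


class Stability(Enum):
    # Hypothetical stability levels a provider might publish per feature.
    STABLE = "stable"
    EXPERIMENTAL = "experimental"
    DEPRECATED = "deprecated"


@dataclass(frozen=True)
class Feature:
    name: str
    stability: Stability
    sunset: Optional[date] = None  # date on which removal becomes permanent


def blackout_candidates(features: List[Feature], today: date,
                        window_days: int = 30) -> List[str]:
    """Names of deprecated features that go offline within `window_days`.

    These are the features a blackout test would temporarily switch off
    to warn client developers before the permanent removal.
    """
    return [
        f.name
        for f in features
        if f.stability is Stability.DEPRECATED
        and f.sunset is not None
        and 0 <= (f.sunset - today).days <= window_days
    ]


# Hypothetical registry of Web API features.
registry = [
    Feature("v1/search", Stability.DEPRECATED, date(2014, 3, 1)),
    Feature("v1/timeline", Stability.DEPRECATED, date(2014, 12, 1)),
    Feature("v2/search", Stability.STABLE),
]
print(blackout_candidates(registry, today=date(2014, 2, 15)))
```

Publishing such a registry alongside the documentation would also cover the first two recommendations: clients can see what will change and when.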

More reading?
Tiago Espinha, Andy Zaidman, Hans-Gerhard Gross. Web API Growing Pains: Stories from Client Developers and Their Code.
Proceedings of the joint meeting of the Conference on Software Maintenance and Reengineering and the Working Conference on Reverse Engineering (CSMR-WCRE), pages 84-93. IEEE, Antwerp, February 2014.


In my research I try to understand how we can take away obstacles for software engineers to test more: test more, and test more efficiently, with the ultimate goal of producing higher quality software.

Earlier this week I had a nice conversation with one of my MSc students and his company supervisors (thanks Erik, Wim, Jerry and Wietze). One of the observations we were discussing is that some companies hardly experience any improvements from automated testing when compared to manual testing [1]. This observation was made by Kasurinen et al. when studying 31 organizations in 2009. 10% of those organizations were using automated testing; 30% of those organizations were using agile development methodologies.

The study by Kasurinen et al. is not the only one to note that automated testing is not always seen as beneficial compared to manual testing. Engstrom and Runeson report that 50% of the respondents in their survey perform as much automated testing as they perform manual testing [2].

Now, it is easy to understand that manual testing remains very important, as manual testing is a lot more agile, i.e., the tester can follow a hunch, something a machine (an automated test) cannot do. However, the benefit of the automated test lies in the fact that it can be repeated, over and over again. As such, it is perfectly suited to detect regressions… but that is the long term view, right? So, in a sense, manual testing leads to short-term benefits, while automated testing has more rewards in the long term, of which detecting regressions, creating software with a better structure and having (executable) documentation are probably the most important ones.

This leads me to my main point of this blog post. Why is the adoption of agile methodologies happening in more organizations than automated testing is? Both topics are “in fashion” and are much talked about, but it seems that everyone is using SCRUM or some other methodology, but not everyone is actually doing automated testing.

My feeling is that this might have to do with the fact that the benefits of SCRUM and other agile methodologies can be felt in the short-term through better control of the process and the ability to better respond to changes in the requirements. On the other hand, the true benefits of automated testing can only be seen in the long term.

When considering the definition of a SCRUM Master (taken from Wikipedia on December 19th, 2013; http://en.wikipedia.org/wiki/Scrum_(software_development)):

Scrum is facilitated by a Scrum Master, who is accountable for removing impediments to the ability of the team to deliver the product goals and deliverables. The Scrum Master is not a traditional team lead or project manager, but acts as a buffer between the team and any distracting influences.

… my advice would be to also install a Test Master, an independent team member who is concerned with finding the right balance between manual and automated testing, and between short-term quality assurance goals and longer-term quality goals.

Of course, I might be totally wrong… but I am looking forward to discussing this further with you over the coming weeks and months!

[1] Jussi Kasurinen, Ossi Taipale, and Kari Smolander. Software test automation in practice: empirical observations. Advances in Software Engineering, 2010.
[2] Emelie Engstrom and Per Runeson. A qualitative survey of regression testing practices. In Product-Focused Software Process Improvement, pages 3–16. Springer, 2010.