Identifying a finite test set that captures the essential behaviour of a program well enough to expose all of its faults is a long-standing problem. This problem is traditionally addressed with syntactic adequacy metrics, but these can be impractical to satisfy, and may be misleading even when they are satisfied. One intuitive notion of adequacy, which has been discussed in theoretical terms over the past three decades, is behavioural coverage: if it is possible to infer an accurate model of a system from its test executions, then the test set must be adequate. Despite its intuitive appeal, this idea has remained almost entirely theoretical, because inferred models have been expected to be exact (generally an infeasible task), and because it has offered no pragmatic interim measure of adequacy to guide test set generation. To bridge the gap to practice, we present a technique that quantifies behavioural adequacy using k-fold cross-validation, a common technique in machine learning. We show that this technique is suitable not only for measuring behavioural adequacy, but also for guiding search-based test generation to automatically produce test sets that optimise this adequacy. Experiments with our BESTEST prototype indicate that such test sets not only come with a statistically valid measurement of adequacy, but also detect significantly more defects.
Testing Software Behaviour
Neil Walkinshaw, Gordon Fraser
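
To make the idea summarised in the abstract concrete, the following is a minimal sketch of how k-fold cross-validation could turn "can an accurate model be inferred from the test executions?" into a number between 0 and 1. It is an illustration under stated assumptions, not the BESTEST implementation: the names `infer_model`, `behavioural_adequacy`, and `sign_label` are hypothetical, a 1-nearest-neighbour lookup stands in for whatever model inference technique is actually used, and plain held-out accuracy stands in for the scoring measure.

```python
import random
from typing import Callable, List, Sequence, Tuple

# One test execution: the inputs supplied and the output observed.
Execution = Tuple[Tuple[float, ...], object]


def infer_model(train: Sequence[Execution]) -> Callable[[Tuple[float, ...]], object]:
    """Stand-in learner (an assumption): predict the output of the
    nearest previously observed input (1-nearest-neighbour)."""
    def predict(x: Tuple[float, ...]) -> object:
        nearest = min(
            train,
            key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], x)),
        )
        return nearest[1]
    return predict


def behavioural_adequacy(executions: Sequence[Execution], k: int = 5, seed: int = 0) -> float:
    """Mean held-out prediction accuracy over k folds of the test executions."""
    if not 2 <= k <= len(executions):
        raise ValueError("k must be between 2 and the number of executions")
    shuffled: List[Execution] = list(executions)
    random.Random(seed).shuffle(shuffled)
    # Striding yields k disjoint, non-empty folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        # Infer a model from every fold except the held-out one.
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = infer_model(train)
        correct = sum(model(x) == y for x, y in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / k


def sign_label(x: float) -> str:
    """Hypothetical program under test."""
    return "neg" if x < 0 else "zero" if x == 0 else "pos"


# Executions recorded from a hypothetical test set of 21 inputs.
executions = [((float(i),), sign_label(float(i))) for i in range(-10, 11)]
print(behavioural_adequacy(executions, k=5))
```

A score near 1 means that models inferred from part of the test set predict the remaining executions well; a low score suggests that the test set exercises behaviour that the rest of the set does not account for. Because the score is defined for any test set, it can in principle serve as the fitness value that a search-based test generator tries to increase.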