Data Generators for Testing the Numerical Correctness of Software
Introduction
Facilities are provided here to help users and developers of scientific software implement a general methodology for evaluating the numerical correctness of the results produced by the software.
The software to be tested (referred to here as the test software) is assumed to be an implementation of an algorithm for achieving a stated computational aim, e.g., the computational aim might be to calculate the sample mean and sample standard deviation of a data set. The basis of the approach is the design and use of reference data sets and corresponding reference results to undertake “black-box” testing of the test software. The reference data sets and corresponding reference results are generated in a manner consistent with the stated computational aim of the test software. The results returned by the test software for the reference data sets are then compared objectively with the reference results using quality metrics or performance measures. Finally, the performance measures are interpreted in order to decide whether the test software meets requirements and is fit for its intended purpose.
The facilities take the form of data generators, implemented to be portable across computer platforms and to provide flexible, reproducible reference data sets. There are two modes of use for the data generators, described below.
Access to the Data Generators
The data generators are available here. If you have not used them before, please read the rest of this page first.
Mode 1: Generate a reference data set with corresponding reference results
The user supplies values for input parameters to the generator that define the reference results and other properties of the reference data set. The data generator provides as its output a reference data set with corresponding reference results and other information necessary for the calculation of performance measures. In this way the user is able to obtain a reference data set that mimics one that is likely to be encountered in the user’s own application and for which the reference results are known. The ability to generate reference data sets that are representative of those likely to be encountered in practice is important in order to establish “fitness for purpose” of the test software. This mode of use of the data generator, in which the reference results are part of the output, can be helpful in the preliminary stages of testing the software. In this mode of use, quality metrics and performance measures are calculated by the user using the information provided.
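One simple way a Mode 1 generator for the mean/standard-deviation example could work is sketched below: random data are standardised and then rescaled so that the sample mean and sample standard deviation match the user-supplied reference results exactly (up to rounding). The function name and parameters are hypothetical; the actual generators may use different constructions.

```python
import random
import statistics

def generate_reference_data(n, mean, std, seed=0):
    """Sketch of a Mode 1 generator: return a data set of size n whose
    sample mean and sample standard deviation equal the supplied
    reference results (up to rounding).  Seeding the random number
    generator makes the reference data set reproducible."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    zm = statistics.mean(z)
    zs = statistics.stdev(z)
    # Standardise, then rescale to the requested reference results.
    return [mean + std * (v - zm) / zs for v in z]

data = generate_reference_data(50, mean=100.0, std=5.0)
print(statistics.mean(data), statistics.stdev(data))
```

Varying the seed yields many distinct reference data sets that all share the same known reference results, which is useful when exercising the test software over a range of inputs representative of the user's application.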
Mode 2: Generate a reference data set and calculate performance measures
In the same way as in Mode 1, the user specifies values for input parameters that define properties of the reference data set. The data generator provides as its only output a reference data set. The reference results corresponding to the reference data set are set by randomly perturbing the values of the input parameters, and are not provided as part of the output of the generator. Having applied the test software to the reference data set to obtain test results, the user may then upload the test results to the data generator for the calculation of performance measures. This mode of use of the data generator, in which the reference results are not part of the output but performance measures are calculated, provides independent testing of the software.
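The perturbation idea behind Mode 2 can be sketched as follows, again for the mean/standard-deviation example. The perturbation scheme, function name, and 1 % magnitude are illustrative assumptions only; the actual generators withhold the perturbed values and accept uploaded test results instead.

```python
import random
import statistics

def mode2_generator(n, mean, std, seed=1):
    """Sketch of the Mode 2 idea: the reference results are random
    perturbations of the user-supplied parameters, so the user cannot
    predict them from the inputs alone."""
    rng = random.Random(seed)
    # Perturb the requested parameters by up to 1 % (illustrative choice).
    true_mean = mean * (1.0 + 0.01 * rng.uniform(-1.0, 1.0))
    true_std = std * (1.0 + 0.01 * rng.uniform(-1.0, 1.0))
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    zm, zs = statistics.mean(z), statistics.stdev(z)
    data = [true_mean + true_std * (v - zm) / zs for v in z]
    return data  # the perturbed reference results are deliberately withheld

data = mode2_generator(30, mean=100.0, std=5.0)
# The user now runs the test software on `data` and uploads its results to
# the generator, which alone knows the perturbed reference results.
```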
For each data generator, procedures are provided to describe each mode of use of the generator. In particular, the procedures describe the format of the output of the generator, how the output may be used in the calculation of performance measures (Mode 1), and the format in which test results are required to be uploaded to the generator (Mode 2). Information about the calculation of quality metrics and performance measures, and their interpretation, is available.
Quality metrics and performance measures
The output of the procedures provided with each data generator is, for each reference data set, the value of a quality metric that quantifies the numerical performance of the test software for that data set. The quality metric may take the form of:
- An absolute measure of the numerical correctness of the test result returned by the test software.
- A performance measure that indicates for the test result the number of significant figures of accuracy lost by the test software compared with reference software for the specified computation. For example, a performance measure near zero indicates that the accuracy of the result returned by the test software is the same as for reference software, whereas a performance measure of eight indicates that the test software is losing eight decimal digits of accuracy more than reference software. A large value of the performance measure may indicate the use of an unstable parametrisation of the problem, the use of an unstable algorithm or inappropriate formula, or that the test software is defective in some way.
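A performance measure of the second kind can be sketched as follows: the relative error of the test result is compared with the computational precision, and the base-10 logarithm of the ratio gives the number of decimal digits of accuracy lost beyond what reference software would achieve. The function below is an assumed illustration of this idea; the exact formula used by the generators may differ.

```python
import math
import sys

def digits_lost(test_result, reference_result,
                precision=sys.float_info.epsilon):
    """Illustrative performance measure: the approximate number of decimal
    digits of accuracy lost by the test software relative to reference
    software working at the given computational precision."""
    rel_err = abs(test_result - reference_result) / abs(reference_result)
    if rel_err == 0.0:
        return 0.0  # agrees with the reference result to full precision
    return max(0.0, math.log10(rel_err / precision))

# A result with a relative error of 1e-8, in IEEE double precision
# (precision about 2.2e-16), loses roughly 8 decimal digits of accuracy.
print(digits_lost(1.0 + 1e-8, 1.0))
```

On this scale a value near zero indicates the test software is performing as well as reference software, matching the interpretation given above.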
More information about quality metrics and performance measures is available.
To calculate the above performance measure, the user will need to know the computational precision of the environment in which the user’s software is working. Guidance on setting this value is available.
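For software working in IEEE double precision, one way to determine this value in Python, for example, is to read the machine epsilon directly or to estimate it empirically; other environments (spreadsheets, single precision, decimal arithmetic) will have different values, and the guidance referred to above should be consulted.

```python
import sys

# For IEEE double precision the computational precision (machine epsilon)
# can be read directly from the floating-point environment:
print(sys.float_info.epsilon)

# Or estimated empirically: halve eps until adding half of it to 1.0
# no longer changes the stored value.
eps = 1.0
while 1.0 + eps / 2.0 != 1.0:
    eps /= 2.0
print(eps)
```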