Generation, exploitation and understanding of synthetic data sets
There is currently a strong drive to make better use of data collected within healthcare. However, health data is often difficult to access because of privacy concerns and legal restrictions. It is also rarely available in a standardised format, as it relies on labelling that is often hospital, clinician, and system dependent.
Synthetic data can generally be defined as data generated from key characteristics of an original dataset. As such, it maintains many of the essential properties of that dataset without some of its drawbacks, such as privacy issues. Additionally, synthetic datasets can be engineered both to be larger than the original dataset and to contain more so-called ‘edge cases’: cases that occur rarely in practice and are therefore harder to diagnose because they are less recognisable. Because of these advantages, interest in such datasets has increased in recent years, as they can be very valuable for purposes such as training machine learning algorithms.
This case study uses over 23,000 images from CT and megavoltage CT scanners, gathered from 800 patient cases acquired for the VoxTox study between 2007 and 2017 (Burnet et al. 2017), as the base dataset for trialling methods that generate synthetic megavoltage CT images from CT images and vice versa. Because the synthetic images represent measurands in the same units as the original images, this provides a unique opportunity to determine their quality quantitatively. We propose metrics for the assessment of generated output that are based on image and batch characteristics. In particular, we examine the advantages and disadvantages of cyclical generative adversarial networks, an increasingly popular method for the production of synthetic data. The parameter space of such networks is large and is often fine-tuned on a case-by-case basis, with seemingly small changes producing large effects on the output. Such effects need to be understood and addressed if these methods are to be used for clinical purposes.
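As an illustration of what image-based and batch-based assessment could look like when real and synthetic images share the same units, the following is a minimal sketch only. It is not the set of metrics proposed in this work: the function names, the choice of mean absolute error, bias and root-mean-square error per image, the total variation distance between pooled intensity histograms per batch, and the default intensity range are all assumptions made for the example.

```python
import numpy as np


def image_metrics(real, synth):
    """Per-image agreement between a real and a synthetic image in the same units.

    Hypothetical helper: the metric choice is an assumption, not the study's method.
    """
    real = np.asarray(real, dtype=np.float64)
    synth = np.asarray(synth, dtype=np.float64)
    if real.shape != synth.shape:
        raise ValueError("real and synth must have the same shape")
    diff = synth - real
    return {
        "mae": float(np.mean(np.abs(diff))),       # mean absolute error
        "bias": float(np.mean(diff)),              # signed mean difference
        "rmse": float(np.sqrt(np.mean(diff ** 2))),
    }


def batch_histogram_distance(real_batch, synth_batch, bins=64,
                             value_range=(-1000.0, 2000.0)):
    """Total variation distance between pooled intensity histograms of two batches.

    Returns 0.0 for identical distributions and 1.0 for disjoint ones.
    The default value_range is an assumed intensity window, not a value from the study.
    """
    h_real, _ = np.histogram(np.asarray(real_batch).ravel(),
                             bins=bins, range=value_range)
    h_synth, _ = np.histogram(np.asarray(synth_batch).ravel(),
                              bins=bins, range=value_range)
    if h_real.sum() == 0 or h_synth.sum() == 0:
        raise ValueError("no values fall inside value_range")
    p = h_real / h_real.sum()    # normalise counts to probabilities
    q = h_synth / h_synth.sum()
    return float(0.5 * np.abs(p - q).sum())
```

Per-image metrics of this kind need paired real and synthetic images of the same anatomy, whereas the batch-level histogram comparison does not, so it could also be applied when the two sets are unpaired.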
This work is ongoing and we are keen to work with as many people as possible. Please contact us if you would like to know more or want to discuss the challenges in your work environment.