Metadata contains detailed information about the primary measurement data, such as when, where, and under what conditions it was acquired, and is vital in forming data provenance. The failure to record metadata digitally, and to link it to the primary measurement data, is currently a barrier to reproducible science across all sectors. It reduces:
- data interoperability - we do not understand the data's characteristics
- data traceability - we do not know the data's history and lineage
- data reusability - we lack the information needed to reuse historic data, directly or in meta-analyses, or to compare multiple datasets
This need for linked metadata applies to any measurement data and therefore cross-cuts all sectors, but it is particularly acute in the Health and Life Sciences, Energy and Environment, and Advanced Manufacturing sectors, which often feature complex systems with large datasets from many different sources and long-term experiments.
NPL have developed software tools that capture and combine metadata with the primary data at the point of measurement, and these can be applied to any measured data. As an example, we used mass spectrometry data from the CRUK Rosetta project led by NPL. We developed software that automatically collated all available information from an instrument (in this example, Waters' mass spectrometry imaging instruments) and generated a standardised, structured metadata file. To capture information that is not available digitally, such as manual operations, we also developed a simple webform so that experimentalists can record it in a structured way. We have packaged the entire workflow behind a simple, button-operated user interface for ease of use.
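The merge of automatically harvested instrument parameters with webform entries can be sketched as follows. This is an illustrative sketch only, not NPL's implementation: the field names (`instrument`, `manual_operations`, `schema_version`) and the example values are assumptions chosen to show the shape of a standardised, structured metadata record.

```python
import json
from datetime import datetime, timezone

def build_metadata(instrument_info: dict, webform_entries: dict) -> dict:
    """Merge automatically collated instrument parameters with manually
    entered webform fields into one structured, machine-readable record."""
    return {
        "schema_version": "1.0",                       # illustrative schema tag
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "instrument": instrument_info,                 # e.g. vendor, acquisition settings
        "manual_operations": webform_entries,          # e.g. operator, sample preparation
    }

# Illustrative usage: write the combined record out as a standardised JSON file
record = build_metadata(
    {"vendor": "Waters", "technique": "mass spectrometry imaging"},
    {"operator": "A. N. Other", "sample_prep": "matrix sprayed"},
)
with open("metadata.json", "w") as f:
    json.dump(record, f, indent=2)
```

Writing the merged record at the point of measurement, rather than reconstructing it later, is what keeps the metadata reliably linked to the primary data.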
Our solution automatically combined the webform and instrument metadata and linked it with an existing sample management system to form a standardised, machine-readable, comprehensive and rich metadata record for an experiment. The primary measurement data and any associated measurements, such as calibration data, data from complementary techniques and standard operating procedures, are linked with the metadata in a single 'data container'. Linking the data in this way provides traceability via an unbroken chain of data provenance and measurements, which will improve confidence in data, repeatability of experiments and reuse of data.
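A minimal sketch of such a data container, assuming a zip archive with a JSON manifest (the format, file names, and checksum scheme here are assumptions for illustration, not the container format actually used):

```python
import hashlib
import json
import zipfile
from pathlib import Path

def make_container(container_path: str, metadata: dict, files: list) -> None:
    """Bundle metadata and measurement files into one zip 'data container'.
    A SHA-256 checksum is recorded per file so the provenance chain stays
    verifiable: any later change to a file breaks its recorded checksum."""
    manifest = {"metadata": metadata, "files": []}
    with zipfile.ZipFile(container_path, "w") as zf:
        for path in map(Path, files):
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest["files"].append({"name": path.name, "sha256": digest})
            zf.write(path, arcname=path.name)
        zf.writestr("manifest.json", json.dumps(manifest, indent=2))

# Illustrative usage: write a tiny stand-in primary-data file, then containerise it
Path("spectrum.raw").write_bytes(b"\x00\x01\x02")
make_container("experiment.zip", {"experiment": "MSI run"}, ["spectrum.raw"])
```

Keeping primary data, associated measurements, and metadata in one addressable unit is what makes the provenance chain unbroken: nothing has to be rediscovered from file-system conventions later.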
Comprehensive and standardised data provenance can address issues in reproducibility by identifying potential differences between datasets that may affect the results. To improve the findability and longevity of data, our tools upload the data container to a database, where it is tagged with its metadata, providing automated curation that is fully searchable with standard database queries.
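What "fully searchable with standard database queries" means in practice can be shown with a small in-memory SQLite stand-in. The schema, column names, and example rows are hypothetical; the point is only that once each container is tagged with its metadata, finding matching datasets is an ordinary SQL query.

```python
import sqlite3

# In-memory stand-in for the curation database (illustrative schema)
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE containers (
    id INTEGER PRIMARY KEY,
    container_path TEXT,      -- where the data container is stored
    technique TEXT,           -- metadata tag: measurement technique
    instrument TEXT,          -- metadata tag: instrument vendor/model
    acquired_on TEXT          -- metadata tag: acquisition date
)""")
conn.executemany(
    "INSERT INTO containers (container_path, technique, instrument, acquired_on)"
    " VALUES (?, ?, ?, ?)",
    [("run01.zip", "MSI", "Waters", "2021-03-01"),
     ("run02.zip", "LC-MS", "Waters", "2021-03-02")],
)

# A standard query over the metadata tags finds the matching datasets
rows = conn.execute(
    "SELECT container_path FROM containers WHERE technique = ?", ("MSI",)
).fetchall()
print(rows)  # [('run01.zip',)]
```

Because the tags come from the structured metadata captured at measurement time, no manual cataloguing step is needed to make new data findable.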
We are now able to digitally capture metadata in a standardised way that links to the primary data at the point of measurement. Furthermore, we can add multiple associated datasets within a single data container, allowing large-cohort, multi-modal, longitudinal, and multi-site studies to be linked for efficient data searches and better reuse of data. These tools enable the creation of a FAIR (Findable, Accessible, Interoperable, and Reusable) compliant database that can address reproducibility and traceability issues. Additionally, the structured metadata can be used to automatically generate text outlining experimental settings and parameters, enabling efficient and accurate writing of reports and publications in line with minimum reporting standards.
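The automated generation of methods text from structured metadata can be sketched as a simple template over the metadata fields. The field names (`instrument`, `technique`, `pixel_size_um`) and the sentence template are illustrative assumptions, not a minimum reporting standard.

```python
def methods_paragraph(meta: dict) -> str:
    """Render a minimal methods sentence from a structured metadata record.
    Field names here are illustrative, not a formal reporting standard."""
    return (
        f"Data were acquired on a {meta['instrument']} instrument "
        f"using {meta['technique']} with a spatial resolution of "
        f"{meta['pixel_size_um']} um."
    )

# Illustrative usage with hypothetical metadata values
text = methods_paragraph({
    "instrument": "Waters",
    "technique": "mass spectrometry imaging",
    "pixel_size_um": 50,
})
print(text)
```

Because every value is drawn directly from the captured metadata, the generated text cannot drift out of sync with what was actually recorded at the instrument.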
These tools were applied to data from the Rosetta CRUK Grand Challenge Project, led by NPL's NiCE-MSI group, where we established a curated database of terabytes of complex experimental data. The work to establish the database included the digital recording of manual operations, storage of large volumes of data, creation of machine-readable metadata and data provenance chains, and centralising data by linking associated complementary techniques for future use. By establishing a curated database for the Rosetta Project, we can have as much knowledge of, and confidence in, these data in 20 years' time as we do now, which has enormous potential for future opportunities and impact from these datasets.