SYNTHIA set to support synthetic data generation

Ultimately, SYNTHIA hopes to build trust regarding the usefulness of synthetic data and facilitate its responsible use by researchers.

11 September 2024
Thin purple wavy lines with small, brightly coloured dots, all against a dark background, representing data flows. Image by SkillUp via Shutterstock.

As the name suggests, synthetic data is data that has been artificially generated to mimic real patient data. It offers a potential solution to several issues in health research, including a lack of real, high-quality datasets that can be used in research and concerns about patient privacy. However, questions remain regarding the quality of synthetic datasets, and the best synthetic data generation methods to use in different situations.

The aim of the new IHI project SYNTHIA is to deliver validated, reliable tools and methods for synthetic data generation (SDG). The tools will cover multiple data types, including lab results, clinical notes, genomics, imaging and m-health data. SYNTHIA also hopes to enable the generation of longitudinal data.

To focus its efforts, and to showcase the utility of synthetic data in different areas, the project will address six diseases: two solid tumours (lung cancer and breast cancer), two blood cancers (multiple myeloma and diffuse large B-cell lymphoma), one neurodegenerative disease (Alzheimer’s disease) and one metabolic disease (type 2 diabetes).

‘In the era of precision medicine, with drugs targeting specific gene mutations, new tools are required to deal with patient's data privacy. Whole genome sequencing, digital imaging, and electronic health record data are the ID of any individual person. All of them are required to being able to provide the patient with the best available treatment. Nevertheless, personal data privacy is a must,’ said the project’s academic lead, Guillermo Sanz of IISLaFe in Spain. ‘Generation of efficient synthetic databases by using artificial intelligence is the unique way to pursue the goals of maintaining data privacy while offering the tools to advance in precision medicine. The SYNTHIA project, a new pioneering public-private partnership, is the first IHI synthetic data project to deal with this urgent need.’

The project outputs will be made available to the research community through a dedicated online platform. In addition to synthetic data generation workflows that can be used in different situations, the platform will include assessment frameworks to help users evaluate the generated synthetic data in terms of privacy (risk of re-identification), quality (representativeness and similarity to real data), and applicability (suitability for the intended use). Finally, the platform will host a repository of high-quality synthetic datasets, each of which will be labelled with its suitability for specific applications.

Ultimately, the platform will help to build trust among stakeholders regarding the usefulness of synthetic data and facilitate its responsible use by the health research community.

‘Synthetic data has huge potential to enhance research and product development in healthcare by augmenting available data,’ said GE HealthCare’s VP for AI Smart Devices, Gopal Avinash. ‘Along with GE HealthCare’s AI strategy, synthetic data can help mitigate bias and drifts in algorithms and reduce privacy risks. Synthetic data can also help speed up the development of robust and generalisable AI models in the healthcare industry. We are excited to explore and develop these methods while enhancing data standards and guidelines to build safe and effective synthetic data-based models with our expert collaborators within SYNTHIA.’