General Information
Conflict of Interest
- SV_57KskaKADT3U9Aq-R_34ClOo3Ba3mERRn.pdf
- SV_57KskaKADT3U9Aq-R_27xIbAr89RxUBOQ.pdf
- SV_57KskaKADT3U9Aq-R_8CSYSOCUjHIppKC.pdf
- SV_57KskaKADT3U9Aq-R_8agkcRxlJS2QLm2.pdf
- COI FORM GB (pdf)
Request Clinical Trials
Associated Trial(s):
- NCT00361335 - A Multicenter, Randomized, Double-blind, Placebo-controlled Trial of Golimumab, a Fully Human Anti-TNFa Monoclonal Antibody, Administered Intravenously, in Subjects with Active Rheumatoid Arthritis Despite Methotrexate Therapy
- NCT03090100 - A Phase 3, Multicenter, Randomized, Double-blind Study Evaluating the Comparative Efficacy of CNTO 1959 (Guselkumab) and Secukinumab for the Treatment of Moderate to Severe Plaque-type Psoriasis
Data Request Status
Status: Ongoing
Research Proposal
Project Title: Validation of the data generation method using Variational Autoencoders (VAEs) based on real clinical trial data.
Scientific Abstract:
Background: Generative artificial intelligence, and Variational Autoencoders (VAEs) in particular, has been shown in studies such as those by Chadebec et al. to significantly improve the accuracy of classifying complex medical images, even with small datasets. These models create reliable new observations in high-dimensional, low-sample-size settings while preserving the underlying structure and characteristics of the population. These methods are now being tested on small-scale tabular clinical data.
Objective: We aim to assess the robustness and reliability of our data augmentation methodology using individual datasets from past clinical trials.
Study Design: (1) train our VAE models with a dataset to generate artificial data, (2) evaluate the consistency of the generated data, (3) conduct medical and mathematical validation of the generated data by experts.
Participants: Artificial patient datasets generated from real patient databases of past clinical trials conducted by Janssen Laboratory, as referenced on YODA, and three specialists in the relevant pathologies, who will validate the generated data.
Primary and Secondary Outcome Measures: Our research will produce artificial clinical data files and a methodology for qualitative and quantitative human supervision to check the relevance of the data generated.
Statistical Analysis: Classification accuracy will measure experts' ability to distinguish real from artificially generated patients (reliability), while multivariate histograms and covariance data will evaluate the augmented data's representativeness and confidentiality.
Brief Project Background and Statement of Project Significance:
The ORIGA project, led by Professor Stéphanie Allassonnière and Dr. Jean-Louis Fraysse from BOTdesign, aims to create artificial patients using generative AI (VAEs). ORIGA is the first European platform for augmenting artificial patient data. It will feature a user interface through which clients upload their databases; BOTdesign teams will then prepare the data and augment it according to the pathology addressed and the data typology (selection of the augmentation model). The augmented data will then be validated by human and mathematical guarantee committees before being made available to the client.
Professor Allassonnière and her team augmented real imaging data in Alzheimer's disease in 2021 and 2023, demonstrating the reliability and representativeness of the artificial patients by comparing them to a cohort of real patients. An ongoing scientific publication (Professor Allassonnière, Université Paris Cité; Professor Minville, CHU de Toulouse; Dr. Ferré, CHU de Toulouse) describes the augmentation of tabular data in anesthesia. Virtual patient cohorts are credible alternatives to real patients, supported by the French agency AIS (Agence de l'Innovation en Santé) under the France 2030 framework. A White Paper, with Janssen Laboratory participation, was published on this topic at the end of April.
To ultimately gain acceptance of these methods as evidence by regulators (MAA for drugs and approval of Medical Devices), we must produce scientific results demonstrating that data generated from real completed clinical studies are as reliable as the original real patients. To achieve this, we must first demonstrate the mathematical reliability of our model for generating artificial patients as well as clinical relevance, relying on robust datasets such as those proposed on the YODA platform.
Specific Aims of the Project: As part of the development of the ORIGA platform, we utilize a generative AI method (VAEs) to generate artificial health data from small clinical datasets of real patients. Following an initial proof of concept on imaging data (in Alzheimer's disease) and tabular data (pre-anesthesia consultation), we seek to validate our algorithmic data augmentation model in other increasingly complex contexts (number of parameters, other pathologies, etc). We would like to access the databases of the clinical studies NCT00361335 in rheumatology and NCT03090100 in dermatology. We need 2 databases because we do not know in advance their richness and the challenges they will represent (types of measurements, composite scores, etc). The aim of this study is to validate and refine our method of statistical data augmentation.
Study Design: Methodological research
What is the purpose of the analysis being proposed? Please select all that apply.: Develop or refine statistical methods
Software Used: Python
Data Source and Inclusion/Exclusion Criteria to be used to define the patient sample for your study: All the individual data made available in each clinical trial, without exclusion, will be used to prove the effectiveness of our statistical method of data augmentation.
Primary and Secondary Outcome Measure(s) and how they will be categorized/defined for your study:
Our research will produce artificial clinical data files and a methodology for human supervision of the generated data, both qualitative (checking the relevance of the data generated) and quantitative (histograms of descriptive statistical analyses on the augmented data); please refer to the statistical analysis plan below.
Main Predictor/Independent Variable and how it will be categorized/defined for your study: Our study will involve, for each cohort, 2 randomized populations: the control population (C) and the treated population (T). We will carry out cross-validation using our own classifier. Taking a sample of x% of the C population, we will vary the augmentation multiple and compare the augmented base with the characteristics of the T population to determine the "humanly acceptable" augmentation limit (representativeness and reliability scores). We will also evaluate a confidentiality score to validate data anonymization regarding the augmented population.
Other Variables of Interest that will be used in your analysis and how they will be categorized/defined for your study:
All variables will be used to augment and form part of a classifier. The NCT03090100 data specification file is not available. For NCT00361335, data domains will be used as follows:
- DEMO: all data used for augmentation except RACE;
- AE, CONDHIS, CONMEDS, CONPROC, CRITERIA, CVINFO, DEATH, DIAGINFO, DIS_EVAL, DRUGPREP, DSTATUS, EXPHX, EXPOSURE, EXPOSUREPK, HAQ, IFRTX, IVRSDOSE, IVRSRAND, JNTSCORE, LAB, MANIFST, MEDRVIEW, NOTTX, QOL, RXHX, SUBSTAT, TB_INFO, VISITS, VITALS: all data used for augmentation.
Statistical Analysis Plan:
To build artificial patient data, our method uses deep neural networks to generate high-quality synthetic health data. We employ a model called Variational Autoencoder (VAE), which imposes a probability distribution on the latent space, allowing for more flexible and controlled data sampling. For details on our method:
1. VAE: Incorporates generative hierarchical models, enhancing flexibility in data sampling via latent variables with parametric distributions.
2. Latent Space Geometry: Treats latent space as a Riemannian manifold, improving sampling reliability and diversity by capturing data's geometric structure.
3. Riemannian Sampling: Utilizes Hamiltonian Monte Carlo, leveraging latent space geometry for accurate, diverse synthetic data that mirrors real-world health data.
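As a rough illustration of the first point, the sketch below shows the two latent-space ingredients of a standard VAE: the reparameterized draw z = mu + sigma * eps and the closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior. It is a minimal sketch in plain Python under those assumptions; it does not implement the Riemannian latent geometry or the Hamiltonian Monte Carlo sampler described above, and the function names and example values are hypothetical.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterized draw z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


rng = random.Random(0)                   # fixed seed for reproducibility
mu, log_var = [0.5, -1.0], [0.0, -2.0]   # illustrative encoder outputs (assumed)
z = sample_latent(mu, log_var, rng)      # one latent sample
kl = kl_to_standard_normal(mu, log_var)  # regularization term of the VAE loss
```

In a full model, `mu` and `log_var` would come from the encoder network and `z` would be passed to the decoder to produce one artificial patient vector.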
The statistical analysis plan relies on the study of 3 scores - for more details on each score's formula, please refer to attached document "YODA Project - Acceptance criteria":
1. Reliability: Generated data will be assessed by a panel of medical experts. The experts will review the artificial data to determine its compliance with clinical trends observed in real data. Specific criteria include the consistency of diagnoses, treatments, and reported patient outcomes. Acceptance thresholds are:
- Very Good: Synthetic data is almost indistinguishable from real data, with clinical consistency above 95%.
- Average: Synthetic data shows minor but acceptable discrepancies, with clinical consistency between 75% and 95%.
- Poor: Synthetic data shows significant discrepancies, with clinical consistency below 75%.
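The reliability thresholds above can be expressed as a small grading function. This is only a sketch: it assumes clinical consistency is available as a fraction between 0 and 1 (the exact formula is in the attached acceptance-criteria document, not reproduced here), and it assumes that values exactly at 75% or 95% count as Average, which the text does not specify. The function name and the example figures are hypothetical.

```python
def reliability_grade(clinical_consistency):
    """Map a clinical consistency rate in [0, 1] to the grades listed above."""
    if not 0.0 <= clinical_consistency <= 1.0:
        raise ValueError("clinical_consistency must be within [0, 1]")
    if clinical_consistency > 0.95:
        return "Very Good"
    if clinical_consistency >= 0.75:   # assumption: both boundaries count as Average
        return "Average"
    return "Poor"


# Hypothetical example: experts judged 172 of 200 synthetic records consistent.
grade = reliability_grade(172 / 200)
```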
2. Representativeness will be evaluated by several statistical scores: mean (μ) of each variable of the patient vector, standard deviation (σ) for all variables independently, skewness, kurtosis, Frobenius distance between covariance matrices, and conditional means and covariances.
For each statistical score, item-level thresholds are:
- First and second-order moments (mean and standard deviation): maximum deviation of 5%
- Third-order and fourth-order moments (skewness and kurtosis): differences will be analyzed without a fixed threshold, since a good threshold is hard to determine and strongly case-dependent
- Covariance matrices Frobenius distance: below 1
- Conditional means: tolerance of 5%
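A possible way to check the item-level thresholds for means, standard deviations, and covariance matrices is sketched below, using only the Python standard library. It assumes that "deviation" means the relative difference between the synthetic and the real statistic (undefined when the real statistic is zero) and that the Frobenius distance is computed on sample covariance matrices; the helper names and the toy data are hypothetical.

```python
import math
import statistics


def relative_deviation(real_value, synthetic_value):
    """Deviation of the synthetic statistic relative to the real one (real_value != 0)."""
    return abs(synthetic_value - real_value) / abs(real_value)


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length patient vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(d)] for i in range(d)]


def frobenius_distance(a, b):
    """Frobenius norm of the difference between two same-shaped matrices."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))


# Toy data: two variables per patient (values are illustrative only).
real = [[50.0, 1.2], [61.0, 1.9], [47.0, 1.1], [58.0, 1.6]]
synthetic = [[51.0, 1.3], [60.0, 1.8], [48.0, 1.1], [57.0, 1.7]]

checks = {}
for j in range(2):
    real_col = [r[j] for r in real]
    syn_col = [s[j] for s in synthetic]
    mean_dev = relative_deviation(statistics.mean(real_col), statistics.mean(syn_col))
    std_dev = relative_deviation(statistics.stdev(real_col), statistics.stdev(syn_col))
    checks[j] = (mean_dev <= 0.05, std_dev <= 0.05)   # 5% thresholds from the list above

cov_distance = frobenius_distance(covariance_matrix(real), covariance_matrix(synthetic))
covariance_ok = cov_distance < 1.0                    # "below 1" threshold
```

Because the Frobenius distance depends on the scale of the variables, a fixed threshold of 1 is only meaningful if variables are put on a common scale first; the text does not say which normalization is used.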
In addition to moment measures, three tests will be applied:
- Kolmogorov-Smirnov Test (KS Test)
- Wasserstein Distance
- Jensen-Shannon Divergence (JS Divergence)
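For a single numeric variable, the three measures listed above can be sketched in plain Python as follows. These are simplified, assumed implementations for illustration: the KS function returns only the test statistic (no p-value), the Wasserstein function handles only samples of equal size, and the Jensen-Shannon divergence is computed on histograms with an arbitrary bin count. None of these choices is specified in the proposal.

```python
import bisect
import math


def ks_statistic(x, y):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    gap = 0.0
    for v in xs + ys:
        cdf_x = bisect.bisect_right(xs, v) / len(xs)
        cdf_y = bisect.bisect_right(ys, v) / len(ys)
        gap = max(gap, abs(cdf_x - cdf_y))
    return gap


def wasserstein_1d(x, y):
    """1-Wasserstein distance between two samples of the same size."""
    if len(x) != len(y):
        raise ValueError("this simplified version needs equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(x), sorted(y))) / len(x)


def js_divergence(x, y, bins=10):
    """Jensen-Shannon divergence (log base 2, so within [0, 1]) of two histograms."""
    lo, hi = min(min(x), min(y)), max(max(x), max(y))
    width = (hi - lo) / bins or 1.0      # avoid a zero width for constant data

    def histogram(values):
        counts = [0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        return [c / len(values) for c in counts]

    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    p, q = histogram(x), histogram(y)
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Illustrative values for one variable in the real and synthetic cohorts.
real = [4.1, 5.0, 5.2, 6.3, 7.0, 7.4]
synthetic = [4.3, 4.9, 5.5, 6.1, 6.8, 7.7]
scores = {
    "ks": ks_statistic(real, synthetic),
    "wasserstein": wasserstein_1d(real, synthetic),
    "js": js_divergence(real, synthetic, bins=4),
}
```

All three scores are zero for identical samples and grow as the synthetic distribution drifts away from the real one.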
Global representativeness score acceptance thresholds are:
- Very Good Representativeness: Tolerance below 5% for all metrics and statistical tests.
- Good Representativeness: Tolerance between 5% and 10% for all metrics and statistical tests.
- Poor Representativeness: Tolerance above 10% for all metrics and statistical tests.
3. Confidentiality will be evaluated through the following distances: Euclidean Distance, Manhattan Distance, Cosine Distance, Nearest Neighbor Distance Ratio (NNDR). By using these four metrics together, we can comprehensively demonstrate that synthetic data points are sufficiently distant from the original data points, ensuring that confidentiality is maintained. This combination of metrics covers absolute geometric distance, dimension-specific differences, and directional similarity, providing a robust assessment of the protection of individual data records. Acceptance thresholds are:
- Very Good Confidentiality: above 1 for Euclidean and Manhattan distances, above 0.3 for Cosine distance, and NNDR tolerance below 0.01.
- Good Confidentiality: between 0.5 and 1 for Euclidean and Manhattan distances, between 0.2 and 0.3 for Cosine distance, and NNDR tolerance between 0.01 and 0.05.
- Poor Confidentiality: below 0.5 for Euclidean and Manhattan distances, below 0.2 for Cosine distance, and NNDR tolerance above 0.05.
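The four distances can be sketched as below for one synthetic record compared with a set of real records. This is an illustrative sketch only: the exact formulas and any normalization are defined in the attached acceptance-criteria document, which is not reproduced here. NNDR is taken in its usual sense, the distance to the nearest real record divided by the distance to the second nearest; the function names and data are hypothetical, and the code assumes at least two distinct real records and no all-zero vectors.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute per-variable differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def nndr(record, real_records, distance=euclidean):
    """Distance to the nearest real record divided by distance to the second nearest."""
    nearest, second = sorted(distance(record, r) for r in real_records)[:2]
    return nearest / second


# Illustrative records (assumed to be on a common scale).
real_records = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
synthetic_record = [3.0, 0.0]

min_euclidean = min(euclidean(synthetic_record, r) for r in real_records)
min_manhattan = min(manhattan(synthetic_record, r) for r in real_records)
ratio = nndr(synthetic_record, real_records)
```

A larger minimum distance means the synthetic record sits farther from every real patient. An NNDR close to 1 means the record is not markedly closer to one real patient than to the others.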
Narrative Summary: The rise of AI in healthcare opens new opportunities, especially in generating artificial data from small real datasets. This can address data scarcity while preserving quality. Our study aims to validate our method using Variational Autoencoders (VAEs) to generate artificial data from clinical trials. We will train models, assess data consistency, and verify medical plausibility with experts. This could revolutionize artificial data use, providing larger, reliable datasets for significant medical research advancements.
Project Timeline:
Project start date: July 1st, 2024
Data augmentation and completion of results analysis: September 30, 2024
Manuscript writing: October 30, 2024
First submission for publication: November 15, 2024
Publication of results: May 15, 2025
Dissemination Plan: The results will be published in open-access scientific journals.
Bibliography:
(1) AD Course Map charts Alzheimer’s disease progression (April 2021). Igor Koval, Alexandre Bône, Maxime Louis, Thomas Lartigue, Simona Bottani, Arnaud Marcoux, Jorge Samper-González, Ninon Burgos, Benjamin Charlier, Anne Bertrand, Stéphane Epelbaum, Olivier Colliot, Stéphanie Allassonnière, Stanley Durrleman
(2) Forecasting individual progression trajectories in Alzheimer’s disease (February 2023). Etienne Maheux, Igor Koval, Juliette Ortholand, Colin Birkenbihl, Damiano Archetti, Vincent Bouteloup, Stéphane Epelbaum, Carole Dufouil, Martin Hofmann-Apitius, Stanley Durrleman
(3) Data Augmentation in High Dimensional Low Sample Size Setting Using a Geometry-Based Variational Autoencoder (March 2023). Clement Chadebec, Elina Thibeau-Sutre, Ninon Burgos, Stephanie Allassonniere
Supplementary Material: YODA-Project-Acceptance-criteria.pdf