| Title: | Synthetic Clinical Data Generation and Privacy-Preserving Validation |
|---|---|
| Description: | Generates synthetic clinical datasets that preserve statistical properties while reducing re-identification risk. Implements Gaussian copula simulation, bootstrap with noise injection, and Laplace noise perturbation, with built-in utility and privacy validation metrics. Useful for privacy-aware data sharing in multi-site clinical research. Validates synthetic data quality via distributional similarity (Kolmogorov-Smirnov), discriminative accuracy (real-vs-synthetic classifier), and nearest-neighbor privacy ratio. Methods described in Jordon et al. (2022) <doi:10.48550/arXiv.2205.03257> and Snoke et al. (2018) <doi:10.1111/rssa.12358>. |
| Authors: | Cuiwei Gao [aut, cre, cph] |
| Maintainer: | Cuiwei Gao <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.1 |
| Built: | 2026-05-19 10:00:21 UTC |
| Source: | https://github.com/cuiweig/syntheticdata |
Runs all three synthesis methods on the same data and returns a comparative validation table.
compare_methods(data, n = nrow(data), seed = NULL)
data |
A data frame of real data. |
n |
Number of synthetic records. Default: same as input. |
seed |
Random seed passed to |
A method_comparison object (tibble) with columns:
method, metric, value, interpretation.
Jordon J, et al. (2022). Synthetic Data – what, why and how? arXiv preprint arXiv:2205.03257. doi:10.48550/arXiv.2205.03257
set.seed(42)
real <- data.frame(x = rnorm(100), y = rnorm(100))
compare_methods(real, seed = 42)
Trains a predictive model on synthetic data and evaluates it on real data. Compares to a model trained on real data (gold standard). Measures whether synthetic data preserves predictive signal.
model_fidelity(x, outcome, predictors = NULL)
x |
A |
outcome |
Character. Name of the outcome column. |
predictors |
Character vector (optional). Predictor columns. Default: all other numeric columns. |
The real-data baseline uses in-sample evaluation (train and test on the same real data) to provide an upper bound on achievable performance. The synthetic-data model is also evaluated on real data, so the comparison reflects how well the synthetic data preserves predictive signal.
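The train-on-synthetic, test-on-real comparison described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the helpers `auc_rank` and `tstr_auc` are hypothetical names, logistic regression is an assumed model choice, and the stand-in "synthetic" data is simply a fresh draw from the same generating model.

```r
# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) identity.
auc_rank <- function(score, label) {
  r <- rank(score)                       # average ranks handle ties
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Hypothetical helper: fit a logistic model on `train`, score `test`.
tstr_auc <- function(train, test, outcome) {
  fit <- glm(as.formula(paste(outcome, "~ .")), data = train,
             family = binomial())
  p <- predict(fit, newdata = test, type = "response")
  auc_rank(p, test[[outcome]])
}

set.seed(1)
n <- 300
real <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
real$y <- rbinom(n, 1, plogis(1.5 * real$x1 - real$x2))

# Assumption: stand-in synthetic data drawn from the same model as `real`.
syn <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
syn$y <- rbinom(n, 1, plogis(1.5 * syn$x1 - syn$x2))

# Both models are evaluated on the real data; the real-trained model is
# therefore scored in-sample, as in the description above.
result <- data.frame(
  train_data = c("real", "synthetic"),
  metric = "AUC",
  value = c(tstr_auc(real, real, "y"), tstr_auc(syn, real, "y"))
)
result
```

A small gap between the two rows suggests the synthetic data carries most of the predictive signal; because the real-data row is in-sample, it should be read as an optimistic reference rather than an unbiased estimate.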
A tibble with columns: train_data, metric, value.
For binary outcomes the metric is AUC; for continuous outcomes
it is R-squared.
Jordon J, et al. (2022). Synthetic Data – what, why and how? arXiv preprint arXiv:2205.03257. doi:10.48550/arXiv.2205.03257
set.seed(42)
real <- data.frame(
  x1 = rnorm(200),
  x2 = rnorm(200),
  y = rbinom(200, 1, 0.3)
)
syn <- synthesize(real, seed = 42)
model_fidelity(syn, outcome = "y")
Evaluates re-identification risk of synthetic data through multiple privacy metrics: nearest-neighbor distance ratio, membership inference accuracy, and attribute disclosure risk.
privacy_risk(x, sensitive_cols = NULL)
x |
A |
sensitive_cols |
Character vector (optional). Columns considered sensitive for attribute disclosure assessment. |
A privacy_assessment object (tibble) with columns:
metric, value, risk_level.
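One of the metrics named in the description, the nearest-neighbor distance ratio, can be sketched in base R as follows. The exact definition used by the package is not stated here, so this sketch assumes one plausible form: the median distance from each synthetic record to its nearest real record, divided by the median distance from each real record to its nearest other real record. The helpers `nn_dist` and `nn_ratio` are hypothetical names, and only numeric columns are handled.

```r
# Hypothetical helper: distance from each row of `a` to its nearest row of `b`.
# With exclude_self = TRUE, `a` and `b` are the same set and the zero
# self-distances on the diagonal are ignored.
nn_dist <- function(a, b, exclude_self = FALSE) {
  a <- as.matrix(a)
  b <- as.matrix(b)
  # Squared Euclidean distances via |a|^2 + |b|^2 - 2 a.b
  d2 <- outer(rowSums(a^2), rowSums(b^2), "+") - 2 * a %*% t(b)
  d2[d2 < 0] <- 0                        # guard against rounding error
  if (exclude_self) diag(d2) <- Inf
  sqrt(apply(d2, 1, min))
}

# Hypothetical helper: assumed definition of the privacy ratio.
nn_ratio <- function(real, syn) {
  mu <- colMeans(real)
  s <- apply(real, 2, sd)
  # Standardize both data sets with the real-data mean and sd
  real_z <- scale(real, center = mu, scale = s)
  syn_z <- scale(syn, center = mu, scale = s)
  median(nn_dist(syn_z, real_z)) /
    median(nn_dist(real_z, real_z, exclude_self = TRUE))
}

set.seed(1)
real <- data.frame(age = rnorm(100, 65, 10), sbp = rnorm(100, 130, 20))

# Independent draw: no synthetic record is derived from a real one.
fresh <- data.frame(age = rnorm(100, 65, 10), sbp = rnorm(100, 130, 20))

# Near-copies of the real records: an intentionally "leaky" synthetic set.
leaky <- data.frame(age = real$age + rnorm(100, sd = 0.01),
                    sbp = real$sbp + rnorm(100, sd = 0.01))

ratio_fresh <- nn_ratio(real, fresh)
ratio_leaky <- nn_ratio(real, leaky)
c(fresh = ratio_fresh, leaky = ratio_leaky)
```

Under this assumed definition, near-copies of real records drive the ratio toward zero, while an independent draw from the same distribution gives a ratio in the vicinity of 1.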
Snoke J, et al. (2018). General and specific utility measures for synthetic data. Journal of the Royal Statistical Society A, 181(3):663–688. doi:10.1111/rssa.12358
set.seed(42)
real <- data.frame(age = rnorm(100, 65, 10), sbp = rnorm(100, 130, 20))
syn <- synthesize(real, seed = 42)
privacy_risk(syn)
Creates a synthetic version of the input data that preserves marginal distributions and pairwise correlations while adding controlled noise for privacy protection.
synthesize(
  data,
  method = c("parametric", "bootstrap", "noise"),
  n = nrow(data),
  noise_level = 0.1,
  seed = NULL
)
data |
A data frame of real clinical data. |
method |
Synthesis method:
|
n |
Number of synthetic records. Default: same as input. |
noise_level |
For |
seed |
Random seed for reproducibility. If non-NULL, the global RNG state is saved before and restored after synthesis so that calling code is not affected. |
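The save-and-restore behavior described for `seed` can be illustrated with a small base R pattern. This is a sketch of the general technique, not the package's code; `with_local_seed` is a hypothetical helper name.

```r
# Hypothetical helper: evaluate `expr` under a fixed seed, then put the
# caller's global RNG state back exactly as it was.
with_local_seed <- function(seed, expr) {
  if (!is.null(seed)) {
    had_seed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
    old_seed <- if (had_seed) get(".Random.seed", envir = globalenv()) else NULL
    on.exit({
      if (had_seed) {
        assign(".Random.seed", old_seed, envir = globalenv())
      } else {
        rm(list = ".Random.seed", envir = globalenv())
      }
    }, add = TRUE)
    set.seed(seed)
  }
  expr  # lazily evaluated argument: forced here, after set.seed()
}

set.seed(123)
a <- runif(1)

set.seed(123)
draw <- with_local_seed(42, rnorm(3))  # uses seed 42 internally
b <- runif(1)                          # continues the seed-123 stream

identical(a, b)
```

Because the state is restored on exit, the caller's random stream is the same whether or not the seeded call happened in between.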
The parametric method uses a Gaussian copula approach: marginal distributions are estimated empirically and the joint dependence structure is captured via the correlation matrix of normal scores. This preserves both marginal shapes and pairwise associations while generating genuinely new observations.
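The steps in this paragraph can be sketched in base R. This is a minimal illustration assuming numeric columns only and interpolated empirical quantiles for the back-transformation; the function name `gaussian_copula_sample` is hypothetical and the package's actual implementation may differ in its details.

```r
# Hypothetical helper: Gaussian copula sampling for numeric columns.
gaussian_copula_sample <- function(data, n = nrow(data)) {
  x <- as.matrix(data)
  m <- nrow(x)
  # 1. Normal scores: ranks -> (0, 1) -> standard normal quantiles
  z <- apply(x, 2, function(col) qnorm(rank(col) / (m + 1)))
  # 2. Dependence structure: correlation matrix of the normal scores
  r <- cor(z)
  # 3. Correlated standard normals: rows of Z %*% chol(r) have covariance r
  z_new <- matrix(rnorm(n * ncol(x)), nrow = n) %*% chol(r)
  # 4. Back-transform through each column's empirical quantile function
  u_new <- pnorm(z_new)
  out <- matrix(NA_real_, nrow = n, ncol = ncol(x),
                dimnames = list(NULL, colnames(x)))
  for (j in seq_len(ncol(x))) {
    out[, j] <- quantile(x[, j], probs = u_new[, j], type = 7, names = FALSE)
  }
  as.data.frame(out)
}

set.seed(1)
age <- rnorm(500, 65, 10)
real <- data.frame(age = age, sbp = 100 + 0.5 * age + rnorm(500, 0, 10))
syn <- gaussian_copula_sample(real)

# Compare the pairwise correlation in real and synthetic data
c(real = cor(real$age, real$sbp), synthetic = cor(syn$age, syn$sbp))
```

With interpolated quantiles, synthetic values fall between observed values rather than repeating them, but they never leave the observed range of each column.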
A synthetic_data object (list) with components:
$synthetic (tibble of synthetic records), $real (tibble of
the original data, retained for downstream validation),
$method, $n_original, $n_synthetic, $variables.
Jordon J, et al. (2022). Synthetic Data – what, why and how? arXiv preprint arXiv:2205.03257. doi:10.48550/arXiv.2205.03257
set.seed(42)
real <- data.frame(
  age = rnorm(200, 65, 10),
  sbp = rnorm(200, 130, 20),
  sex = sample(c("M", "F"), 200, replace = TRUE),
  outcome = rbinom(200, 1, 0.3)
)
syn <- synthesize(real, method = "parametric", seed = 42)
syn
Computes utility and privacy metrics comparing synthetic data to the original real dataset.
validate_synthetic(
  x,
  metrics = c("distributional", "correlation", "discriminative", "privacy")
)
x |
A |
metrics |
Character vector of metrics:
|
Utility metrics assess how well the synthetic data preserves statistical properties. Privacy metrics assess the risk of re-identification.
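As an illustration of a distributional utility check, the two-sample Kolmogorov-Smirnov statistic can be computed per numeric column with `stats::ks.test`. The wrapper `ks_by_column` is a hypothetical helper, and how the package aggregates or interprets these statistics is not assumed here.

```r
# Hypothetical helper: two-sample KS statistic for each shared numeric column.
ks_by_column <- function(real, syn) {
  cols <- intersect(names(real), names(syn))
  cols <- cols[vapply(real[cols], is.numeric, logical(1))]
  stat <- vapply(cols, function(v) {
    unname(ks.test(real[[v]], syn[[v]])$statistic)
  }, numeric(1))
  data.frame(variable = cols, ks_statistic = unname(stat))
}

set.seed(1)
real <- data.frame(age = rnorm(200, 65, 10), sbp = rnorm(200, 130, 20))

# Stand-in synthetic sets: one matching the real marginals, one with a
# shifted age distribution.
good <- data.frame(age = rnorm(200, 65, 10), sbp = rnorm(200, 130, 20))
bad <- data.frame(age = rnorm(200, 75, 10), sbp = rnorm(200, 130, 20))

ks_good <- ks_by_column(real, good)
ks_bad <- ks_by_column(real, bad)
ks_good
ks_bad
```

The statistic is the largest gap between the two empirical distribution functions, so values near 0 indicate similar marginals and larger values indicate a mismatch.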
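A discriminative accuracy near 0.5 means a classifier trained to separate real from synthetic records performs no better than chance, so the synthetic data is hard to distinguish from the real data. A privacy ratio > 1 means synthetic records are not closer to real records than real records are to each other.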
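The discriminative accuracy idea can be sketched with a cross-validated real-vs-synthetic classifier. This is an assumption-laden illustration: `discriminative_accuracy` is a hypothetical helper, logistic regression is an assumed choice of classifier (it mainly detects shifts in means), and the classifier used by the package is not specified here.

```r
# Hypothetical helper: k-fold accuracy of a real-vs-synthetic classifier.
discriminative_accuracy <- function(real, syn, folds = 5) {
  d <- rbind(cbind(real, is_syn = 0), cbind(syn, is_syn = 1))
  fold <- sample(rep_len(seq_len(folds), nrow(d)))  # random fold labels
  correct <- logical(nrow(d))
  for (k in seq_len(folds)) {
    test <- fold == k
    fit <- glm(is_syn ~ ., data = d[!test, ], family = binomial())
    p <- predict(fit, newdata = d[test, ], type = "response")
    correct[test] <- (p > 0.5) == (d$is_syn[test] == 1)
  }
  mean(correct)
}

set.seed(1)
real <- data.frame(age = rnorm(300, 65, 10), sbp = rnorm(300, 130, 20))

# Stand-in synthetic sets: one from the same distribution, one with a
# clearly shifted age distribution.
close <- data.frame(age = rnorm(300, 65, 10), sbp = rnorm(300, 130, 20))
shifted <- data.frame(age = rnorm(300, 80, 10), sbp = rnorm(300, 130, 20))

acc_close <- discriminative_accuracy(real, close)
acc_shifted <- discriminative_accuracy(real, shifted)
c(close = acc_close, shifted = acc_shifted)
```

Accuracy is estimated on held-out folds so that an overfitted classifier does not overstate how distinguishable the two data sets are.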
A synthetic_validation object (tibble) with columns:
metric, value, interpretation.
Snoke J, et al. (2018). General and specific utility measures for synthetic data. Journal of the Royal Statistical Society A, 181(3):663–688. doi:10.1111/rssa.12358
set.seed(42)
real <- data.frame(age = rnorm(100, 65, 10), sbp = rnorm(100, 130, 20))
syn <- synthesize(real, seed = 42)
validate_synthetic(syn)