| Title: | Lineage Frequency Dynamics from Genomic Surveillance Counts |
|---|---|
| Description: | Models pathogen lineage frequency dynamics from genomic surveillance count data. Provides a unified interface for multinomial logistic regression, hierarchical partial-pooling models, and the Piantham approximation for relative reproduction number estimation. Features include rolling-origin backtesting, standardized forecast scoring, Compositional Adaptive Prediction Sets (CAPS) for horizon-aware calibrated forecasting, lineage collapsing, emergence detection, and sequencing power analysis. Designed for real-time public health surveillance of any variant-resolved pathogen. Methods described in Abousamra, Figgins, and Bedford (2024) <doi:10.1371/journal.pcbi.1012443>. |
| Authors: | Cuiwei Gao [aut, cre, cph] |
| Maintainer: | Cuiwei Gao <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.6.0 |
| Built: | 2026-05-19 06:12:43 UTC |
| Source: | https://github.com/cuiweig/lineagefreq |
Reallocates sequencing resources across regions in real time, directing effort toward strata where uncertainty reduction has the highest decision value. Unlike static Neyman allocation (which optimises for a single time point), this function adapts to evolving variant dynamics across multiple surveillance rounds.
adaptive_design( data, capacity, n_rounds = NULL, strategy = c("thompson", "ucb"), target_lineage = NULL, exploration = 2, seed = NULL )adaptive_design( data, capacity, n_rounds = NULL, strategy = c("thompson", "ucb"), target_lineage = NULL, exploration = 2, seed = NULL )
data |
An |
capacity |
Total sequencing capacity per round (integer). |
n_rounds |
Number of allocation rounds to simulate. Default
|
strategy |
Allocation strategy: |
target_lineage |
Character; lineage to optimise detection
for. Default |
exploration |
Exploration parameter for UCB. Default 2.0. Larger values explore more; smaller values exploit current best regions. |
seed |
Random seed. Default |
Thompson sampling draws from the posterior distribution of frequency estimates and allocates proportional to the posterior variance. Regions with high uncertainty receive more sequences, but the stochastic draws naturally balance exploration (sampling uncertain regions) and exploitation (sampling where variants are most prevalent).
UCB allocates proportional to the upper confidence bound
of the estimation error: where is the
current estimation uncertainty in region , is
the cumulative allocation, is the round, and
is the exploration parameter.
An adaptive_allocation S3 class with components:
Tibble with round, region,
n_allocated, uncertainty, frequency.
Tibble with per-region totals and mean allocation.
Character; strategy used.
Integer; per-round capacity.
surveillance_value for EVOI analysis.
sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.3, "B" = 0.9), n_timepoints = 12, seed = 1) ad <- adaptive_design(sim, capacity = 200, n_rounds = 8) adsim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.3, "B" = 0.9), n_timepoints = 12, seed = 1) ad <- adaptive_design(sim, capacity = 200, n_rounds = 8) ad
Applies a sequential probability ratio test (SPRT) or CUSUM procedure to lineage frequency data, determining when accumulated evidence is sufficient to declare a variant "emerging" rather than sampling noise. Controls the false alarm rate while minimising detection delay.
alert_threshold( data, method = c("sprt", "cusum"), alpha = 0.05, beta = 0.1, delta_0 = 0, delta_1 = 0.03, threshold = 5 )alert_threshold( data, method = c("sprt", "cusum"), alpha = 0.05, beta = 0.1, delta_0 = 0, delta_1 = 0.03, threshold = 5 )
data |
An |
method |
Detection method: |
alpha |
False alarm probability. Default 0.05. |
beta |
Missed detection probability. Default 0.10. |
delta_0 |
Null hypothesis growth rate (no emergence). Default 0 (frequency is stable). |
delta_1 |
Alternative hypothesis growth rate (emergence). Default 0.03 (3 percent per-week increase on logit scale). |
threshold |
CUSUM decision threshold. Default 5.0.
Only used when |
SPRT (Wald, 1945) computes the log-likelihood ratio
between the alternative (lineage is growing at rate
) and the null (frequency is stable). The test
stops when the cumulative log-ratio crosses the upper boundary
(declare emerging) or the
lower boundary (declare stable).
CUSUM accumulates deviations from expected frequency
under the null:
where is the observed frequency change and
is the allowance (half the shift to detect). An alert is raised
when .
A tibble with columns lineage, date,
statistic (log-likelihood ratio or CUSUM value),
alert (logical), direction (emerging/declining/
stable), and confidence (1 - alpha).
Wald A (1945). Sequential tests of statistical hypotheses. Annals of Mathematical Statistics, 16(2), 117–186.
summarize_emerging for non-sequential
trend tests, detection_horizon for prospective
power analysis.
sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.5, "B" = 0.8), n_timepoints = 15, seed = 1) alerts <- alert_threshold(sim) alertssim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.5, "B" = 0.8), n_timepoints = 15, seed = 1) alerts <- alert_threshold(sim) alerts
Generic function to convert various data formats into lfq_data objects. Methods can be defined for specific input classes to enable seamless interoperability with other genomic surveillance packages.
as_lfq_data(x, ...) ## S3 method for class 'lfq_data' as_lfq_data(x, ...) ## S3 method for class 'data.frame' as_lfq_data(x, ...)as_lfq_data(x, ...) ## S3 method for class 'lfq_data' as_lfq_data(x, ...) ## S3 method for class 'data.frame' as_lfq_data(x, ...)
x |
An object to coerce. |
... |
Additional arguments passed to methods. For the
|
An lfq_data object.
An lfq_data object.
An lfq_data object.
lfq_data() for the primary constructor.
df <- data.frame( date = rep(as.Date("2024-01-01") + c(0, 7), each = 2), lineage = rep(c("A", "B"), 2), count = c(80, 20, 60, 40) ) x <- as_lfq_data(df, lineage = lineage, date = date, count = count) xdf <- data.frame( date = rep(as.Date("2024-01-01") + c(0, 7), each = 2), lineage = rep(c("A", "B"), 2), count = c(80, 20, 60, 40) ) x <- as_lfq_data(df, lineage = lineage, date = date, count = count) x
Convert lfq_data to long-format tibble
## S3 method for class 'lfq_data' as.data.frame(x, ...)## S3 method for class 'lfq_data' as.data.frame(x, ...)
x |
An lfq_data object. |
... |
Ignored. |
A tibble with all columns.
data(sarscov2_us_2022) x <- lfq_data(sarscov2_us_2022, lineage = variant, date = date, count = count, total = total) as.data.frame(x)data(sarscov2_us_2022) x <- lfq_data(sarscov2_us_2022, lineage = variant, date = date, count = count, total = total) as.data.frame(x)
Augment data with fitted values from an lfq_fit object
augment.lfq_fit(x, ...)augment.lfq_fit(x, ...)
x |
An |
... |
Ignored. |
A tibble with columns: .date, .lineage, .fitted_freq,
.observed, .pearson_resid.
sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), seed = 1) fit <- fit_model(sim) augment.lfq_fit(fit)sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), seed = 1) fit <- fit_model(sim) augment.lfq_fit(fit)
Plot lineage frequency model results
## S3 method for class 'lfq_fit' autoplot( object, type = c("frequency", "advantage", "trajectory", "residuals"), generation_time = NULL, ... )## S3 method for class 'lfq_fit' autoplot( object, type = c("frequency", "advantage", "trajectory", "residuals"), generation_time = NULL, ... )
object |
An |
type |
Plot type: |
generation_time |
Required when |
... |
Ignored. |
A ggplot object.
sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), seed = 1) fit <- fit_model(sim) autoplot(fit) autoplot(fit, type = "advantage", generation_time = 5)sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), seed = 1) fit <- fit_model(sim) autoplot(fit) autoplot(fit, type = "advantage", generation_time = 5)
Plot a lineage frequency forecast
## S3 method for class 'lfq_forecast' autoplot(object, ...)## S3 method for class 'lfq_forecast' autoplot(object, ...)
object |
An |
... |
Ignored. |
A ggplot object.
sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), seed = 1) fit <- fit_model(sim) fc <- forecast(fit, horizon = 14) autoplot(fc)sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), seed = 1) fit <- fit_model(sim) fc <- forecast(fit, horizon = 14) autoplot(fc)
Evaluates forecast accuracy by repeatedly fitting models on historical data and comparing predictions to held-out observations. This implements the evaluation framework described in Abousamra et al. (2024).
backtest( data, engines = "mlr", origins = "weekly", horizons = c(7L, 14L, 21L, 28L), min_train = 42L, ... )backtest( data, engines = "mlr", origins = "weekly", horizons = c(7L, 14L, 21L, 28L), min_train = 42L, ... )
data |
An lfq_data object. |
engines |
Character vector of engine names to compare.
Default |
origins |
How to select forecast origins:
|
horizons |
Integer vector of forecast horizons in days.
Default |
min_train |
Minimum training window in days. Origins earlier
than |
... |
Additional arguments passed to |
Implements the rolling-origin evaluation framework described in Abousamra et al. (2024), Section 2.4. At each origin date, the model is fit on data up to that date and forecasts are compared to held-out future observations. This avoids look-ahead bias and provides an honest assessment of real-time forecast accuracy.
An lfq_backtest object (tibble subclass) with columns:
Date used as the training cutoff.
Date being predicted.
Forecast horizon in days.
Engine name.
Lineage name.
Predicted frequency (median).
Lower prediction bound.
Upper prediction bound.
Observed frequency at target_date.
Abousamra E, Figgins M, Bedford T (2024). Fitness models provide accurate short-term forecasts of SARS-CoV-2 variant frequency. PLoS Computational Biology, 20(9):e1012443. doi:10.1371/journal.pcbi.1012443
score_forecasts() to compute accuracy metrics,
compare_models() to rank engines.
sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), n_timepoints = 20, seed = 1) bt <- backtest(sim, engines = "mlr", horizons = c(7, 14), min_train = 42) btsim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), n_timepoints = 20, seed = 1) bt <- backtest(sim, engines = "mlr", horizons = c(7, 14), min_train = 42) bt
Assesses whether prediction intervals from backtesting are
well-calibrated by computing PIT (Probability Integral Transform)
values, reliability diagrams, and uniformity tests. A perfectly
calibrated forecaster produces PIT values that are uniformly
distributed on .
calibrate(x, observed = NULL, n_bins = 10L) ## S3 method for class 'lfq_backtest' calibrate(x, observed = NULL, n_bins = 10L) ## S3 method for class 'lfq_forecast' calibrate(x, observed = NULL, n_bins = 10L)calibrate(x, observed = NULL, n_bins = 10L) ## S3 method for class 'lfq_backtest' calibrate(x, observed = NULL, n_bins = 10L) ## S3 method for class 'lfq_forecast' calibrate(x, observed = NULL, n_bins = 10L)
x |
An |
observed |
For |
n_bins |
Number of bins for the PIT histogram and reliability diagram. Default 10. |
The PIT value for a single forecast-observation pair is defined as
the quantile of the observation within the forecast distribution.
Under the assumption of Gaussian prediction intervals (which the
parametric simulation in forecast approximates),
the PIT is computed as
where is the predicted median,
is derived from the prediction interval width, and is
the standard normal CDF.
The reliability diagram plots observed coverage against nominal coverage at levels 10 through 90 percent. Perfect calibration lies on the diagonal.
A calibration_report object (S3 class) with
components:
Numeric vector of PIT values in .
Tibble with bin, count,
density, expected columns.
Tibble with nominal, observed
coverage at each level.
List with statistic and p_value
from the Kolmogorov-Smirnov test for uniformity.
Integer; number of forecast-observation pairs used.
Gneiting T, Balabdaoui F, Raftery AE (2007). Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B, 69(2), 243–268. doi:10.1111/j.1467-9868.2007.00587.x
recalibrate for post-hoc recalibration,
score_forecasts for proper scoring rules.
sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), n_timepoints = 20, seed = 1) bt <- backtest(sim, engines = "mlr", horizons = c(7, 14), min_train = 42) cal <- calibrate(bt) calsim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), n_timepoints = 20, seed = 1) bt <- backtest(sim, engines = "mlr", horizons = c(7, 14), min_train = 42) cal <- calibrate(bt) cal
Assess calibration of multivariate forecasts using the energy score, joint coverage, and multivariate rank histograms.
calibrate_joint(bt, n_bins = 10L)calibrate_joint(bt, n_bins = 10L)
bt |
An |
n_bins |
Integer; number of rank histogram bins (default 10). |
The energy score is a multivariate proper scoring rule that
generalises the CRPS (Gneiting & Raftery, 2007, Section 5).
For a deterministic forecast and observation :
where the norm is the Aitchison distance on the simplex.
Joint coverage at level : the fraction of observed
composition vectors falling within Aitchison distance
of the forecast, where is derived
from the empirical distribution of distances.
The multivariate rank histogram uses the pre-rank approach: the rank of the observation among an ensemble is computed using the Aitchison distance to the ensemble mean.
A joint_calibration_report S3 object (list) with:
Tibble with origin_date, horizon, energy_score.
Overall mean energy score.
Tibble with nominal level, observed coverage, using Aitchison distance from backtest results.
Tibble with bin, count, density, expected.
Number of forecast-observation vectors.
Real surveillance data covering the Omicron BA.1 to BA.2 variant replacement in the United States, December 2021 through June 2022. This is one of the best-documented variant replacement events and serves as an independent validation dataset.
cdc_ba2_transitioncdc_ba2_transition
A data frame with 150 rows and 4 columns:
Biweek ending date (Date).
Lineage name: BA.1, BA.2, BA.2.12.1, BA.4/5, Other.
Approximate sequence count per biweek (integer).
CDC weighted proportion estimate (numeric).
CDC COVID Data Tracker (data.cdc.gov, public domain).
Lyngse FP, et al. (2022). Household transmission of SARS-CoV-2 Omicron variant of concern subvariants BA.1 and BA.2 in Denmark. Nature Communications, 13:5760. doi:10.1038/s41467-022-33498-0
data(cdc_ba2_transition) vd <- lfq_data(cdc_ba2_transition, date = date, lineage = lineage, count = count) fit <- fit_model(vd, engine = "mlr", pivot = "BA.1") # BA.2 Rt ~ 1.34 (consistent with published estimates) growth_advantage(fit, type = "relative_Rt", generation_time = 3.2)data(cdc_ba2_transition) vd <- lfq_data(cdc_ba2_transition, date = date, lineage = lineage, count = count) fit <- fit_model(vd, engine = "mlr", pivot = "BA.1") # BA.2 Rt ~ 1.34 (consistent with published estimates) growth_advantage(fit, type = "relative_Rt", generation_time = 3.2)
Real surveillance data from the CDC's national genomic surveillance program covering the emergence and dominance of the SARS-CoV-2 JN.1 lineage in the United States, October 2023 through June 2024.
cdc_sarscov2_jn1cdc_sarscov2_jn1
A data frame with 171 rows and 4 columns:
Biweek ending date (Date).
Lineage name (character): JN.1, XBB.1.5, EG.5.1, HV.1, HK.3, BA.2.86, KP.2, KP.3, JN.1.11.1, Other.
Approximate sequence count per biweek (integer).
CDC weighted proportion estimate (numeric).
Derived from CDC's published weighted variant proportion estimates.
Approximate biweekly sequence counts were reconstructed from
proportions using a reference total of 8,000 sequences per period.
The original proportions are retained in the proportion column.
CDC COVID Data Tracker, SARS-CoV-2 Variant Proportions. Dataset ID: jr58-6ysp. https://data.cdc.gov/Laboratory-Surveillance/SARS-CoV-2-Variant-Proportions/jr58-6ysp
Public domain (U.S. Government Work, 17 USC 105).
Ma KC, et al. (2024). Genomic Surveillance for SARS-CoV-2 Variants. MMWR, 73(42):941–948. doi:10.15585/mmwr.mm7342a1
data(cdc_sarscov2_jn1) vd <- lfq_data(cdc_sarscov2_jn1, date = date, lineage = lineage, count = count) fit <- fit_model(vd, engine = "mlr") growth_advantage(fit, type = "relative_Rt", generation_time = 5)data(cdc_sarscov2_jn1) vd <- lfq_data(cdc_sarscov2_jn1, date = date, lineage = lineage, count = count) fit <- fit_model(vd, engine = "mlr") growth_advantage(fit, type = "relative_Rt", generation_time = 5)
Extract coefficients from a lineage frequency model
## S3 method for class 'lfq_fit' coef(object, type = c("growth_rate", "all"), ...)## S3 method for class 'lfq_fit' coef(object, type = c("growth_rate", "all"), ...)
object |
An |
type |
What to return: |
... |
Ignored. |
Named numeric vector.
sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), seed = 1) fit <- fit_model(sim) coef(fit)sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), seed = 1) fit <- fit_model(sim) coef(fit)
Merges lineages that never exceed a frequency or count threshold into a single group (default "Other"). Useful for reducing noise from dozens of low-frequency lineages.
collapse_lineages( x, min_freq = 0.01, min_count = 10L, other_label = "Other", mapping = NULL )collapse_lineages( x, min_freq = 0.01, min_count = 10L, other_label = "Other", mapping = NULL )
x |
An lfq_data object. |
min_freq |
Minimum peak frequency a lineage must reach at any time point to be kept. Default 0.01 (1%). |
min_count |
Minimum total count across all time points to be kept. Default 10. |
other_label |
Label for the collapsed group. Default "Other". |
mapping |
Optional named character vector for custom grouping.
Names = original lineage names, values = group names. If provided,
|
An lfq_data object with rare lineages merged.
sim <- simulate_dynamics(n_lineages = 6, advantages = c(A = 1.3, B = 1.1, C = 0.95, D = 0.5, E = 0.3), n_timepoints = 12, seed = 1) collapsed <- collapse_lineages(sim, min_freq = 0.05) attr(collapsed, "lineages")sim <- simulate_dynamics(n_lineages = 6, advantages = c(A = 1.3, B = 1.1, C = 0.95, D = 0.5, E = 0.3), n_timepoints = 12, seed = 1) collapsed <- collapse_lineages(sim, min_freq = 0.05) attr(collapsed, "lineages")
Summarises and ranks engines across horizons based on forecast accuracy scores.
compare_models(scores, by = "engine")compare_models(scores, by = "engine")
scores |
Output of |
by |
Grouping variable(s). Default |
A tibble with average scores per group, sorted by MAE.
sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), n_timepoints = 20, seed = 1) bt <- backtest(sim, engines = "mlr", horizons = c(7, 14), min_train = 42) sc <- score_forecasts(bt) compare_models(sc)sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), n_timepoints = 20, seed = 1) bt <- backtest(sim, engines = "mlr", horizons = c(7, 14), min_train = 42) sc <- score_forecasts(bt) compare_models(sc)
Produces distribution-free prediction intervals with finite-sample
coverage guarantees using split conformal inference. Unlike the
parametric intervals from forecast, conformal
intervals require no distributional assumptions on the residuals
and are valid under exchangeability.
conformal_forecast( fit, data, horizon = 28L, ci_level = 0.95, method = c("split", "aci"), cal_fraction = 0.3, gamma = 0.05, seed = NULL )conformal_forecast( fit, data, horizon = 28L, ci_level = 0.95, method = c("split", "aci"), cal_fraction = 0.3, gamma = 0.05, seed = NULL )
fit |
An |
data |
The |
horizon |
Number of days to forecast. Default 28. |
ci_level |
Target coverage level. Default 0.95. |
method |
Conformal method: |
cal_fraction |
Fraction of the data reserved for the calibration set (split conformal only). Default 0.3. |
gamma |
Learning rate for adaptive conformal inference. Default 0.05. Controls how quickly the coverage target adjusts in response to observed miscoverage. |
seed |
Random seed for the calibration split. Default
|
Split conformal prediction partitions the training data
into a proper training set and a calibration set. The model is
refit on the training set, and conformity scores (absolute
residuals) are computed on the calibration set. The prediction
interval at a new point is the point forecast plus or minus the
quantile of the
calibration scores. This provides exact marginal
coverage under exchangeability (Vovk et al. 2005).
Adaptive conformal inference (ACI) (Gibbs and Candes, 2021) adjusts the miscoverage level online to maintain long-run coverage even when the data distribution shifts over time, as is typical in surveillance data during variant replacement waves.
An lfq_forecast object with conformal prediction
intervals. The object is fully compatible with
autoplot and other forecast methods.
Vovk V, Gammerman A, Shafer G (2005). Algorithmic Learning in a Random World. Springer.
Gibbs I, Candes E (2021). Adaptive conformal inference under distribution shift. Advances in Neural Information Processing Systems, 34.
forecast for parametric prediction
intervals, calibrate for calibration diagnostics.
sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), n_timepoints = 20, seed = 1) fit <- fit_model(sim, engine = "mlr") fc_conf <- conformal_forecast(fit, sim, horizon = 21) fc_confsim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), n_timepoints = 20, seed = 1) fit <- fit_model(sim, engine = "mlr") fc_conf <- conformal_forecast(fit, sim, horizon = 21) fc_conf
Produces a compositional prediction region that respects the simplex constraint (frequencies sum to 1) using Aitchison geometry. Unlike marginal conformal prediction, joint prediction guarantees that the entire frequency vector is covered, not just individual lineages.
conformal_forecast_joint( fit, data, horizon = 28L, ci_level = 0.95, cal_fraction = 0.3, seed = NULL )conformal_forecast_joint( fit, data, horizon = 28L, ci_level = 0.95, cal_fraction = 0.3, seed = NULL )
fit |
An |
data |
An |
horizon |
Integer; forecast horizon in days (default 28). |
ci_level |
Numeric in (0,1); coverage target (default 0.95). |
cal_fraction |
Numeric in (0,1); fraction of dates for calibration (default 0.3). |
seed |
Optional integer for reproducibility. |
The nonconformity score is the Aitchison distance between predicted and observed compositions. The Aitchison distance equals the Euclidean distance in the isometric log-ratio (ILR) transformed space, which respects the geometry of the simplex (Aitchison, 1986).
The prediction region is the set of all compositions within
Aitchison distance of the point forecast, where
is the quantile of calibration distances.
Marginal intervals are obtained by projecting this region onto
each coordinate axis, then intersecting with .
An lfq_conformal_joint S3 object (list) with:
The point forecast (lfq_forecast).
Conformal radius in Aitchison distance.
Tibble with .date, .lineage, .lower_joint, .upper_joint — marginal bounds projected from the joint region.
Tibble with .lower_marginal, .upper_marginal from standard marginal conformal prediction.
Tibble indicating whether joint intervals are wider or narrower than marginal intervals per lineage.
Aitchison distances on calibration set.
Nominal coverage level.
Number of calibration compositions.
Aitchison J (1986). The Statistical Analysis of Compositional Data. Chapman & Hall.
Vovk V, Gammerman A, Shafer G (2005). Algorithmic Learning in a Random World. Springer.
Given current sequencing capacity and a hypothetical variant at
initial prevalence growing at rate ,
estimates the number of surveillance periods (weeks) until the
probability of detection exceeds a specified threshold.
detection_horizon( initial_prev = 0.001, growth_rate = 1.3, n_per_period = 500L, n_periods = 26L, detection_threshold = 1L, confidence = 0.95 )detection_horizon( initial_prev = 0.001, growth_rate = 1.3, n_per_period = 500L, n_periods = 26L, detection_threshold = 1L, confidence = 0.95 )
initial_prev |
Initial prevalence of the emerging variant. Default 0.001 (0.1 percent). |
growth_rate |
Per-week multiplicative growth rate on the frequency scale. Default 1.3 (30 percent per week). |
n_per_period |
Sequences collected per surveillance period. Default 500. |
n_periods |
Maximum number of periods to evaluate. Default 26 (6 months of weekly data). |
detection_threshold |
Minimum number of sequences of the target lineage required to declare detection. Default 1. |
confidence |
Detection probability threshold. Default 0.95. |
At each period , the variant prevalence is modelled as
logistic growth: where is the per-period growth rate.
The probability of detecting at least sequences in
draws at prevalence is
.
Cumulative detection is .
A tibble with columns period, prevalence,
detection_prob (cumulative detection probability), and
detected (logical). An attribute weeks_to_detection
contains the first period where detection probability exceeds
the confidence threshold, or NA if not reached.
sequencing_power for static sample size
calculations, alert_threshold for sequential
detection.
dh <- detection_horizon(initial_prev = 0.005, growth_rate = 1.2, n_per_period = 300) dhdh <- detection_horizon(initial_prev = 0.005, growth_rate = 1.2, n_per_period = 300) dh
Simulates real-time deployment: at each origin, all preceding origins form the conformal calibration set, and the forecast at the next origin is evaluated. No future data is ever used for calibration — this is a genuine expanding-window evaluation.
evaluate_prospective( data, engine = "mlr", horizons = c(7L, 14L), min_train = 42L, min_cal = 5L, ci_level = 0.95, gamma = 0.05, ... )evaluate_prospective( data, engine = "mlr", horizons = c(7L, 14L), min_train = 42L, min_cal = 5L, ci_level = 0.95, gamma = 0.05, ... )
data |
An |
engine |
Character; estimation engine (default |
horizons |
Integer vector; forecast horizons in days
(default |
min_train |
Integer; minimum training window in days (default 42). |
min_cal |
Integer; minimum calibration origins before conformal intervals are computed (default 5). |
ci_level |
Numeric in (0,1); nominal coverage (default 0.95). |
gamma |
Numeric; ACI learning rate (default 0.05). |
... |
Passed to |
At each origin with min_cal:
Fit the model on data up to .
Compute residuals at all previous origins
(the calibration set).
Static conformal: radius =
quantile of absolute calibration residuals.
ACI: radius adjusted by online update of .
Evaluate coverage at .
This differs from conformal_forecast() which uses a
fixed temporal split. Here, the calibration set expands over time,
mimicking a surveillance agency accumulating validation data.
An lfq_prospective S3 object (list) with:
Tibble with origin_date, target_date, horizon, lineage, predicted, observed, lower_param, upper_param, lower_static, upper_static, lower_aci, upper_aci, radius_static, radius_aci, alpha_aci.
Tibble summarising cumulative coverage by method (parametric, static conformal, ACI) at each step.
Tibble with method, overall coverage, mean width, Winkler score.
Total number of evaluation origins.
Removes time points with very low total counts and lineages observed at too few time points.
filter_sparse(x, min_total = 10L, min_timepoints = 3L)filter_sparse(x, min_total = 10L, min_timepoints = 3L)
x |
An lfq_data object. |
min_total |
Minimum total sequences per time point. Default 10. |
min_timepoints |
Minimum number of time points a lineage must appear to be retained. Default 3. |
An lfq_data object with sparse entries removed.
sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), seed = 1) filtered <- filter_sparse(sim, min_total = 100)sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), seed = 1) filtered <- filter_sparse(sim, min_total = 100)
A penalised multinomial logistic regression engine that incorporates Deep Mutational Scanning (DMS) escape scores as informative priors on variant fitness. This is valuable for early-emergence scenarios where a new lineage has few observed sequences but laboratory-measured phenotypic data (e.g., ACE2 binding affinity, antibody escape) are available.
fit_dms_prior(data, dms_scores, lambda = 1, pivot = NULL, ci_level = 0.95)fit_dms_prior(data, dms_scores, lambda = 1, pivot = NULL, ci_level = 0.95)
data |
An |
dms_scores |
Named numeric vector of DMS-derived fitness priors. Names correspond to lineage identifiers. Values are on the log growth rate scale (positive = fitter than average). Lineages not in the vector receive a prior of 0. |
lambda |
Regularisation strength (penalty weight). Default
1.0. Larger values pull estimates more strongly toward the DMS
prior. At |
pivot |
Reference lineage name. Default |
ci_level |
Confidence level. Default 0.95. |
The approach uses penalised maximum likelihood where the penalty is proportional to the squared difference between the estimated growth rate and the DMS-derived prior. This implements an empirical Bayes shrinkage: with abundant data, the penalty has little effect; with sparse data, estimates are pulled toward the DMS prior.
The penalised log-likelihood is:
where is the standard multinomial log-likelihood,
is the growth rate for lineage , and
is the DMS prior. The Hessian is adjusted
accordingly, ensuring correct confidence interval widths.
An lfq_fit object compatible with all downstream
functions (forecast, growth_advantage, etc.).
Dadonaite B, Crawford KHD, Radford CE, et al. (2023). A pseudovirus system enables deep mutational scanning of the full SARS-CoV-2 spike. Cell, 186(6), 1263–1278. doi:10.1016/j.cell.2023.02.001
Bloom JD, Neher RA (2023). Fitness effects of mutations to SARS-CoV-2 proteins. Virus Evolution, 9(2), vead055. doi:10.1093/ve/vead055
fit_model for the standard MLR engine.
sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.3, "B" = 0.9), n_timepoints = 8, total_per_tp = 100, seed = 1) # DMS suggests lineage A has fitness advantage dms <- c("A" = 0.04, "B" = -0.02) fit_dms <- fit_dms_prior(sim, dms_scores = dms, lambda = 2) growth_advantage(fit_dms)sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.3, "B" = 0.9), n_timepoints = 8, total_per_tp = 100, seed = 1) # DMS suggests lineage A has fitness advantage dms <- c("A" = 0.04, "B" = -0.02) fit_dms <- fit_dms_prior(sim, dms_scores = dms, lambda = 2) growth_advantage(fit_dms)
Unified interface for modeling lineage frequency dynamics. Supports multiple engines that share the same input/output contract.
fit_model( data, engine = c("mlr", "hier_mlr", "piantham", "fga", "garw"), pivot = NULL, horizon = 28L, ci_level = 0.95, ... )fit_model( data, engine = c("mlr", "hier_mlr", "piantham", "fga", "garw"), pivot = NULL, horizon = 28L, ci_level = 0.95, ... )
data |
An lfq_data object. |
engine |
Model engine to use:
|
pivot |
Reference lineage name. Growth rates are reported relative to this lineage (fixed at 0). Default: the lineage with the highest count at the earliest time point. |
horizon |
Forecast horizon in days (stored for later use by
|
ci_level |
Confidence level for intervals. Default 0.95. |
... |
Engine-specific arguments passed to the internal engine
function. For |
The MLR engine models the frequency of lineage at time
as:
where is the intercept, is the
growth rate per time_scale days (default 7), and the pivot
lineage has . Parameters are estimated
by maximum likelihood via optim(method = "BFGS") with the
Hessian used for asymptotic Wald confidence intervals.
The constant growth rate assumption is appropriate for monotonic
variant replacement periods (typically 2–4 months). For longer
periods or non-monotonic dynamics, use the window argument to
restrict the estimation window, or consider the "garw" engine
which allows time-varying growth advantages.
An lfq_fit object (S3 class), a list containing:
Engine name (character).
Named numeric vector of growth rates per
time_scale days (pivot = 0).
Named numeric vector of intercepts.
Name of pivot lineage.
Character vector of all lineage names.
Tibble of fitted frequencies.
Tibble with observed, fitted, Pearson residuals.
Variance-covariance matrix.
Model fit statistics.
Sample and model sizes.
As specified.
The matched call.
Abousamra E, Figgins M, Bedford T (2024). Fitness models provide accurate short-term forecasts of SARS-CoV-2 variant frequency. PLoS Computational Biology, 20(9):e1012443. doi:10.1371/journal.pcbi.1012443
Piantham C, Linton NM, Nishiura H (2022). Predicting the trajectory of replacements of SARS-CoV-2 variants using relative reproduction numbers. Viruses, 14(11):2556. doi:10.3390/v14112556
growth_advantage() to extract fitness estimates,
forecast() for frequency prediction, backtest() for
rolling-origin evaluation.
sim <- simulate_dynamics( n_lineages = 3, advantages = c("JN.1" = 1.3, "KP.3" = 0.9), n_timepoints = 15, seed = 42 ) fit <- fit_model(sim, engine = "mlr") fitsim <- simulate_dynamics( n_lineages = 3, advantages = c("JN.1" = 1.3, "KP.3" = 0.9), n_timepoints = 15, seed = 42 ) fit <- fit_model(sim, engine = "mlr") fit
Partitions the observed growth advantage of each lineage into
two components: intrinsic transmissibility (fitness in a fully
naive population) and immune escape (additional fitness gained
from evading population immunity). This decomposition follows the
framework of Figgins and Bedford (2025), where the effective
fitness of lineage is modelled as
with representing intrinsic transmissibility and
the proportion of the population with neutralising
immunity against .
fitness_decomposition(fit, landscape, generation_time)fitness_decomposition(fit, landscape, generation_time)
fit |
An |
landscape |
An |
generation_time |
Generation time in days (required for converting growth rates to reproduction numbers). |
The decomposition proceeds as follows. For each non-pivot
lineage, the observed growth rate from the MLR
fit is taken as the total fitness difference relative to the
reference. The immune escape component is estimated from the
immunity differential:
where is the mean population immunity against
lineage over the fitted period. The intrinsic
transmissibility component is the residual:
When cross-immunity data are available in the landscape object, effective immunity against each lineage is computed as the weighted sum of immunity sources, discounted by the cross-immunity matrix.
A fitness_decomposition object (S3 class) with
components:
Tibble with columns: lineage,
observed_advantage (total growth rate), beta
(intrinsic transmissibility), escape_contribution
(immune escape component), transmissibility_fraction
(proportion due to intrinsic fitness),
escape_fraction (proportion due to immune escape).
The input lfq_fit object.
The input immune_landscape.
Generation time used.
Figgins MD, Bedford T (2025). Jointly modeling variant frequencies and case counts to estimate relative variant severity. medRxiv. doi:10.1101/2024.12.02.24318334
immune_landscape for constructing the
immunity input, selective_pressure for
population-level early warning signals.
sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.3, "B" = 0.9), n_timepoints = 15, seed = 1) fit <- fit_model(sim, engine = "mlr") imm_data <- data.frame( date = rep(unique(sim$.date), each = 3), lineage = rep(c("A", "B", "ref"), length(unique(sim$.date))), immunity = c(rep(c(0.2, 0.5, 0.4), length(unique(sim$.date)))) ) il <- immune_landscape(imm_data, date, lineage, immunity) fd <- fitness_decomposition(fit, il, generation_time = 5) fdsim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.3, "B" = 0.9), n_timepoints = 15, seed = 1) fit <- fit_model(sim, engine = "mlr") imm_data <- data.frame( date = rep(unique(sim$.date), each = 3), lineage = rep(c("A", "B", "ref"), length(unique(sim$.date))), immunity = c(rep(c(0.2, 0.5, 0.4), length(unique(sim$.date)))) ) il <- immune_landscape(imm_data, date, lineage, immunity) fd <- fitness_decomposition(fit, il, generation_time = 5) fd
Forecast lineage frequencies (generic)
forecast(object, ...)forecast(object, ...)
object |
A model object. |
... |
Additional arguments passed to methods. |
A forecast object.
Projects lineage frequencies forward in time using the fitted model. Prediction uncertainty is quantified by parametric simulation from the estimated parameter distribution.
## S3 method for class 'lfq_fit' forecast( object, horizon = 28L, ci_level = 0.95, n_sim = 1000L, sampling_noise = TRUE, effective_n = 100L, ... )## S3 method for class 'lfq_fit' forecast( object, horizon = 28L, ci_level = 0.95, n_sim = 1000L, sampling_noise = TRUE, effective_n = 100L, ... )
object |
An |
horizon |
Number of days to forecast. Default 28 (4 weeks). |
ci_level |
Confidence level for prediction intervals. Default 0.95. |
n_sim |
Number of parameter draws for prediction intervals. Default 1000. |
sampling_noise |
Logical; add multinomial sampling noise to
prediction intervals? Default |
effective_n |
Effective sample size for multinomial sampling noise. Default 100, corresponding to a typical weekly sequencing volume per reporting unit. Smaller values produce wider (more conservative) prediction intervals. |
... |
Unused. |
An lfq_forecast object (tibble subclass) with columns:
Date.
Lineage name.
Median predicted frequency.
Lower prediction bound.
Upper prediction bound.
"fitted" or "forecast".
sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), seed = 1) fit <- fit_model(sim, engine = "mlr") fc <- forecast(fit, horizon = 21) fcsim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), seed = 1) fit <- fit_model(sim, engine = "mlr") fc <- forecast(fit, horizon = 21) fc
Returns a single-row tibble of model-level summary statistics.
glance.lfq_fit(x, ...)glance.lfq_fit(x, ...)
x |
An |
... |
Ignored. |
A single-row tibble with columns: engine, n_lineages,
n_timepoints, nobs, df, logLik, AIC, BIC, pivot,
convergence.
sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), seed = 1) fit <- fit_model(sim) glance.lfq_fit(fit)sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), seed = 1) fit <- fit_model(sim) glance.lfq_fit(fit)
Computes relative fitness of each lineage from a fitted model. Supports four output formats for different use cases.
growth_advantage( fit, type = c("growth_rate", "relative_Rt", "selection_coefficient", "doubling_time"), generation_time = NULL, ci_level = NULL )growth_advantage( fit, type = c("growth_rate", "relative_Rt", "selection_coefficient", "doubling_time"), generation_time = NULL, ci_level = NULL )
fit |
An |
type |
Output type:
|
generation_time |
Mean generation time in days. Required for
|
ci_level |
Confidence level for intervals. Default uses the level from the fitted model. |
The "relative_Rt" and "selection_coefficient" types use the
Piantham approximation (Piantham et al. 2022,
doi:10.3390/v14112556), which assumes:
Variants are in their exponential growth phase (not saturating due to population-level immunity).
All variants share the same generation time distribution.
Growth advantage reflects transmissibility differences; immune escape is not separately identified.
Confidence intervals for "relative_Rt" are computed in
log-space (Wald intervals on log-Rt), which is more accurate
than the linear delta method for ratios.
A tibble with columns:
Lineage name.
Point estimate.
Lower confidence bound.
Upper confidence bound.
Type of estimate.
Name of pivot (reference) lineage.
Piantham C, Linton NM, Nishiura H (2022). Predicting the trajectory of replacements of SARS-CoV-2 variants using relative reproduction numbers. Viruses, 14(11):2556. doi:10.3390/v14112556
Abousamra E, Figgins M, Bedford T (2024). Fitness models provide accurate short-term forecasts of SARS-CoV-2 variant frequency. PLoS Computational Biology, 20(9):e1012443. doi:10.1371/journal.pcbi.1012443
sim <- simulate_dynamics( n_lineages = 3, advantages = c("JN.1" = 1.3, "KP.3" = 0.9), n_timepoints = 15, seed = 42 ) fit <- fit_model(sim, engine = "mlr") # Growth rates per week growth_advantage(fit, type = "growth_rate") # Relative Rt (needs generation time) growth_advantage(fit, type = "relative_Rt", generation_time = 5)sim <- simulate_dynamics( n_lineages = 3, advantages = c("JN.1" = 1.3, "KP.3" = 0.9), n_timepoints = 15, seed = 42 ) fit <- fit_model(sim, engine = "mlr") # Growth rates per week growth_advantage(fit, type = "growth_rate") # Relative Rt (needs generation time) growth_advantage(fit, type = "relative_Rt", generation_time = 5)
Assembles time-varying population immunity estimates against
each circulating lineage from seroprevalence surveys, vaccination
records, or infection history data. The resulting object serves
as input to fitness_decomposition for disentangling
intrinsic transmissibility from immune escape.
immune_landscape( data, date, lineage, immunity, type = "combined", cross_immunity = NULL )immune_landscape( data, date, lineage, immunity, type = "combined", cross_immunity = NULL )
data |
A data frame with columns for date, lineage, and immunity level. |
date |
Column name (unquoted or string) containing dates. |
lineage |
Column name containing lineage/variant identifiers. |
immunity |
Column name containing population-level immunity estimates (proportion, 0–1 scale) against each lineage. |
type |
Character vector specifying immunity source(s):
|
cross_immunity |
Optional numeric matrix of cross-immunity
between lineages. Rows and columns correspond to lineages;
entry |
Immunity estimates may come from multiple sources. Seroprevalence surveys provide direct measurements but are infrequent and geographically sparse. Vaccination coverage data are more widely available but do not capture waning or variant-specific escape. Model-based reconstructions (e.g., from case and death data) can fill gaps but introduce model dependence.
immune_landscape() accepts any of these as input. The
critical requirement is that immunity is expressed on a 0–1
scale representing the proportion of the population with
neutralising protection against each lineage at each time point.
An immune_landscape object (S3 class) with
components:
Tibble with columns date, lineage,
immunity, and optionally type.
Character vector of lineage names.
Two-element Date vector.
Matrix or NULL.
fitness_decomposition for downstream
analysis.
# Simulated immunity data imm_data <- data.frame( date = rep(seq(as.Date("2024-01-01"), by = "week", length.out = 10), each = 3), lineage = rep(c("BA.5", "XBB.1.5", "JN.1"), 10), immunity = c( rep(c(0.6, 0.4, 0.1), 5), rep(c(0.55, 0.45, 0.25), 5)) ) il <- immune_landscape(imm_data, date = date, lineage = lineage, immunity = immunity) il# Simulated immunity data imm_data <- data.frame( date = rep(seq(as.Date("2024-01-01"), by = "week", length.out = 10), each = 3), lineage = rep(c("BA.5", "XBB.1.5", "JN.1"), 10), immunity = c( rep(c(0.6, 0.4, 0.1), 5), rep(c(0.55, 0.45, 0.25), 5)) ) il <- immune_landscape(imm_data, date = date, lineage = lineage, immunity = immunity) il
A simulated dataset of weekly influenza A/H3N2 clade sequence counts over a single Northern Hemisphere season (24 weeks).
influenza_h3n2influenza_h3n2
A data frame with 96 rows and 4 columns:
Collection date (Date, weekly).
Clade name (character).
Number of sequences (integer).
Total sequences for this week (integer).
Simulated based on 'Nextstrain' influenza clade dynamics.
data(influenza_h3n2) x <- lfq_data(influenza_h3n2, lineage = clade, date = date, count = count, total = total) xdata(influenza_h3n2) x <- lfq_data(influenza_h3n2, lineage = clade, date = date, count = count, total = total) x
Test if an object is an lfq_data object
is_lfq_data(x)is_lfq_data(x)
x |
Object to test. |
Logical scalar.
d <- data.frame(date = Sys.Date(), lineage = "A", count = 10) x <- lfq_data(d, lineage = lineage, date = date, count = count) is_lfq_data(x) is_lfq_data(d)d <- data.frame(date = Sys.Date(), lineage = "A", count = 10) x <- lfq_data(d, lineage = lineage, date = date, count = count) is_lfq_data(x) is_lfq_data(d)
Pipe-friendly growth advantage extraction
lfq_advantage(fit, type = "relative_Rt", generation_time = NULL, ...)lfq_advantage(fit, type = "relative_Rt", generation_time = NULL, ...)
fit |
An |
type |
Output type. Default |
generation_time |
Mean generation time in days. |
... |
Passed to |
A tibble of growth advantages.
Validates, structures, and annotates lineage count data for downstream modeling and analysis. This is the entry point for all lineagefreq workflows.
lfq_data( data, lineage, date, count, total = NULL, location = NULL, min_total = 10L )lfq_data( data, lineage, date, count, total = NULL, location = NULL, min_total = 10L )
data |
A data frame containing at minimum columns for lineage identity, date, and count. |
lineage |
< |
date |
< |
count |
< |
total |
< |
location |
< |
min_total |
Minimum total count per time point. Time points below this are flagged as unreliable. Default 10. |
Performs the following validation and processing:
Checks that all required columns exist and have correct types.
Coerces character dates to Date via ISO 8601 parsing.
Ensures counts are non-negative integers.
Replaces NA counts with 0 (with warning).
Aggregates duplicate lineage-date rows by summing (with warning).
Computes per-time-point totals and frequencies.
Flags time points below min_total as unreliable.
Sorts by date ascending, then lineage alphabetically.
An lfq_data object (a tibble subclass) with standardized
columns:
.lineageLineage identifier (character).
.dateCollection date (Date).
.countSequence count (integer).
.totalTotal sequences at this time point (integer).
.freqObserved frequency (numeric).
.reliableLogical; TRUE if .total >= min_total.
.locationLocation, if provided (character).
All original columns are preserved.
d <- data.frame( date = rep(seq(as.Date("2024-01-01"), by = "week", length.out = 8), each = 3), lineage = rep(c("JN.1", "KP.3", "Other"), 8), n = c(5, 2, 93, 12, 5, 83, 28, 11, 61, 50, 20, 30, 68, 18, 14, 80, 12, 8, 88, 8, 4, 92, 5, 3) ) x <- lfq_data(d, lineage = lineage, date = date, count = n) xd <- data.frame( date = rep(seq(as.Date("2024-01-01"), by = "week", length.out = 8), each = 3), lineage = rep(c("JN.1", "KP.3", "Other"), 8), n = c(5, 2, 93, 12, 5, 83, 28, 11, 61, 50, 20, 30, 68, 18, 14, 80, 12, 8, 88, 8, 4, 92, 5, 3) ) x <- lfq_data(d, lineage = lineage, date = date, count = n) x
Returns information about all modeling engines available in lineagefreq. Core engines (mlr, hier_mlr, piantham) are always available. Bayesian engines (fga, garw) require 'CmdStan'.
lfq_engines()lfq_engines()
A tibble with columns: engine, type, time_varying,
available, description.
lfq_engines()lfq_engines()
Enables tidyverse-style chaining:
data |> lfq_fit("mlr") |> lfq_forecast(28) |> lfq_score()
lfq_fit(data, engine = "mlr", ...)lfq_fit(data, engine = "mlr", ...)
data |
An lfq_data object. |
engine |
Engine name. Default |
... |
Passed to |
An lfq_fit object.
data(sarscov2_us_2022) sarscov2_us_2022 |> lfq_data(lineage = variant, date = date, count = count, total = total) |> lfq_fit("mlr") |> lfq_advantage(generation_time = 5)data(sarscov2_us_2022) sarscov2_us_2022 |> lfq_data(lineage = variant, date = date, count = count, total = total) |> lfq_fit("mlr") |> lfq_advantage(generation_time = 5)
Pipe-friendly forecasting
lfq_forecast(fit, horizon = 28L, ...)lfq_forecast(fit, horizon = 28L, ...)
fit |
An |
horizon |
Forecast horizon in days. Default 28. |
... |
Passed to |
An lfq_forecast object.
Runs backtest and returns scores in one step.
lfq_score( data, engines = "mlr", horizons = c(14, 28), metrics = c("mae", "coverage"), ... )lfq_score( data, engines = "mlr", horizons = c(14, 28), metrics = c("mae", "coverage"), ... )
data |
An lfq_data object. |
engines |
Character vector of engine names. |
horizons |
Forecast horizons in days. |
metrics |
Score metrics. |
... |
Passed to |
A tibble of scores.
Returns TRUE if 'CmdStanR' and 'CmdStan' are installed.
Bayesian engines require this.
lfq_stan_available()lfq_stan_available()
Logical scalar.
lfq_stan_available()lfq_stan_available()
Returns a one-row-per-lineage summary with growth rates, fitted frequencies at first and last time points, and growth advantage in multiple scales.
lfq_summary(fit, generation_time = NULL)lfq_summary(fit, generation_time = NULL)
fit |
An |
generation_time |
Generation time for Rt calculation. |
A tibble.
sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), seed = 1) fit <- fit_model(sim) lfq_summary(fit, generation_time = 5)sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), seed = 1) fit <- fit_model(sim) lfq_summary(fit, generation_time = 5)
Reports lineagefreq version and availability of optional backends. Useful for reproducibility and bug reports.
lfq_version()lfq_version()
A list with components: version, r_version,
stan_available, engines.
lfq_version()lfq_version()
Creates a panel plot of forecast accuracy by engine and horizon.
plot_backtest(scores)plot_backtest(scores)
scores |
Output of |
A ggplot object.
sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), n_timepoints = 20, seed = 1) bt <- backtest(sim, engines = "mlr", horizons = c(7, 14), min_train = 42) sc <- score_forecasts(bt) plot_backtest(sc)sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), n_timepoints = 20, seed = 1) bt <- backtest(sim, engines = "mlr", horizons = c(7, 14), min_train = 42) sc <- score_forecasts(bt) plot_backtest(sc)
Plot adaptive allocation
## S3 method for class 'adaptive_allocation' plot(x, ...)## S3 method for class 'adaptive_allocation' plot(x, ...)
x |
An |
... |
Unused. |
A ggplot object.
Produces either a reliability diagram (default) or a PIT histogram.
## S3 method for class 'calibration_report' plot(x, type = c("reliability", "pit"), ...)## S3 method for class 'calibration_report' plot(x, type = c("reliability", "pit"), ...)
x |
A |
type |
Which panel to display: |
... |
Unused. |
A ggplot object.
Plot EVOI curve
## S3 method for class 'evoi' plot(x, ...)## S3 method for class 'evoi' plot(x, ...)
x |
An |
... |
Unused. |
A ggplot object.
Stacked bar plot showing the partition of growth advantage into intrinsic transmissibility and immune escape components.
## S3 method for class 'fitness_decomposition' plot(x, ...)## S3 method for class 'fitness_decomposition' plot(x, ...)
x |
A |
... |
Unused. |
A ggplot object.
Plot population immunity landscape
## S3 method for class 'immune_landscape' plot(x, ...)## S3 method for class 'immune_landscape' plot(x, ...)
x |
An |
... |
Unused. |
A ggplot object.
Plot Prospective Evaluation Results
## S3 method for class 'lfq_prospective' plot(x, type = c("coverage", "radius", "comparison"), ...)## S3 method for class 'lfq_prospective' plot(x, type = c("coverage", "radius", "comparison"), ...)
x |
An |
type |
Character; one of |
... |
Ignored. |
A ggplot object.
Print a lineage frequency model
## S3 method for class 'lfq_fit' print(x, ...)## S3 method for class 'lfq_fit' print(x, ...)
x |
An |
... |
Ignored. |
Invisibly returns x.
sim <- simulate_dynamics(n_lineages = 3, advantages = c("JN.1" = 1.3, "KP.3" = 0.9), n_timepoints = 15, seed = 42) fit <- fit_model(sim, engine = "mlr") print(fit)sim <- simulate_dynamics(n_lineages = 3, advantages = c("JN.1" = 1.3, "KP.3" = 0.9), n_timepoints = 15, seed = 42) fit <- fit_model(sim, engine = "mlr") print(fit)
A convenience wrapper for reading surveillance count data from CSV files into lfq_data format. Expects a file with at least columns for date, lineage, and count.
read_lineage_counts( file, date = "date", lineage = "lineage", count = "count", ... )read_lineage_counts( file, date = "date", lineage = "lineage", count = "count", ... )
file |
Path to CSV file. |
date |
Name of the date column. Default |
lineage |
Name of the lineage column. Default |
count |
Name of the count column. Default |
... |
Additional arguments passed to |
An lfq_data object.
# Read the bundled example CSV f <- system.file("extdata", "example_counts.csv", package = "lineagefreq") x <- read_lineage_counts(f) x# Read the bundled example CSV f <- system.file("extdata", "example_counts.csv", package = "lineagefreq") x <- read_lineage_counts(f) x
Applies post-hoc recalibration to improve the coverage properties
of prediction intervals from forecast. Two methods
are available: isotonic regression (nonparametric, monotonicity-
preserving) and Platt scaling (logistic, parametric).
recalibrate(forecast_obj, bt, method = c("isotonic", "platt"))recalibrate(forecast_obj, bt, method = c("isotonic", "platt"))
forecast_obj |
An |
bt |
An |
method |
Recalibration method: |
Isotonic regression: Learns the monotone mapping from nominal coverage levels to observed coverage using the backtest data, then inverts it to find the nominal level that achieves the desired empirical coverage. This is a nonparametric approach that requires no distributional assumptions.
Platt scaling: Fits a logistic regression of the form
where is the standardised residual, and uses the fitted
model to adjust prediction interval widths.
Both methods require a backtest object with sufficient data to estimate the calibration mapping reliably (at least 30 forecast-observation pairs recommended).
An lfq_forecast object with recalibrated
.lower and .upper bounds. The object retains all
attributes of the original forecast and can be passed to
autoplot.
Platt JC (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 61–74.
calibrate for calibration diagnostics.
sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), n_timepoints = 20, seed = 1) fit <- fit_model(sim, engine = "mlr") fc <- forecast(fit, horizon = 21) bt <- backtest(sim, engines = "mlr", horizons = c(7, 14), min_train = 42) fc_recal <- recalibrate(fc, bt, method = "isotonic")sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), n_timepoints = 20, seed = 1) fit <- fit_model(sim, engine = "mlr") fc <- forecast(fit, horizon = 21) bt <- backtest(sim, engines = "mlr", horizons = c(7, 14), min_train = 42) fc_recal <- recalibrate(fc, bt, method = "isotonic")
Allows third-party packages to register custom engines with
fit_model(). This enables an extensible plugin architecture
similar to 'parsnip' engine registration.
register_engine( name, fit_fn, description = "", type = "frequentist", time_varying = FALSE )register_engine( name, fit_fn, description = "", type = "frequentist", time_varying = FALSE )
name |
Engine name (character scalar). |
fit_fn |
Function with signature |
description |
One-line description of the engine. |
type |
|
time_varying |
Logical; does the engine support time-varying growth advantages? |
Invisibly returns the updated engine registry.
# Register a custom engine my_engine <- function(data, pivot = NULL, ci_level = 0.95, ...) { # Custom implementation... .engine_mlr(data, pivot = pivot, ci_level = ci_level, ...) } register_engine("my_mlr", my_engine, "Custom MLR wrapper") lfq_engines()# Register a custom engine my_engine <- function(data, pivot = NULL, ci_level = 0.95, ...) { # Custom implementation... .engine_mlr(data, pivot = pivot, ci_level = ci_level, ...) } register_engine("my_mlr", my_engine, "Custom MLR wrapper") lfq_engines()
A simulated dataset of weekly SARS-CoV-2 variant sequence counts for the United States in 2022. Includes the BA.1 to BA.2 to BA.4/5 to BQ.1 transition dynamics. This is simulated data (not real GISAID data) to avoid license restrictions while preserving realistic statistical properties.
sarscov2_us_2022sarscov2_us_2022
A data frame with 200 rows and 4 columns:
Collection date (Date, weekly).
Variant name (character): BA.1, BA.2, BA.4/5, BQ.1, Other.
Number of sequences assigned to this variant in this week (integer).
Total sequences for this week (integer).
Simulated based on parameters from published CDC MMWR genomic surveillance reports and 'Nextstrain' public data.
data(sarscov2_us_2022) x <- lfq_data(sarscov2_us_2022, lineage = variant, date = date, count = count, total = total) xdata(sarscov2_us_2022) x <- lfq_data(sarscov2_us_2022, lineage = variant, date = date, count = count, total = total) x
Computes standardized accuracy metrics from backtesting results.
score_forecasts( bt, metrics = c("mae", "rmse", "coverage", "wis", "crps", "log_score", "dss", "calibration") )score_forecasts( bt, metrics = c("mae", "rmse", "coverage", "wis", "crps", "log_score", "dss", "calibration") )
bt |
An |
metrics |
Character vector of metrics to compute:
|
A tibble with columns: engine, horizon, metric,
value.
Bracher J, Ray EL, Gneiting T, Reich NG (2021). Evaluating epidemic forecasts in an interval format. PLoS Computational Biology, 17(2):e1008618. doi:10.1371/journal.pcbi.1008618
Gneiting T, Raftery AE (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477), 359–378. doi:10.1198/016214506000001437
compare_models() to rank engines based on scores.
sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), n_timepoints = 20, seed = 1) bt <- backtest(sim, engines = "mlr", horizons = c(7, 14), min_train = 42) score_forecasts(bt)sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), n_timepoints = 20, seed = 1) bt <- backtest(sim, engines = "mlr", horizons = c(7, 14), min_train = 42) score_forecasts(bt)
Computes a population-level selective pressure metric from genomic surveillance data alone, without requiring case counts or epidemiological data. The metric quantifies how rapidly the variant landscape is shifting and serves as an early warning signal for epidemic growth that is robust to case underreporting.
selective_pressure(fit, method = c("mean", "variance"))selective_pressure(fit, method = c("mean", "variance"))
fit |
An |
method |
Aggregation method: |
The approach follows the framework of Figgins and Bedford (2025), where selective pressure is defined as the variance-weighted mean growth rate across circulating lineages:
This represents the expected rate at which the population-average fitness is increasing, measured entirely from sequence data.
When method = "mean", the metric is positive when fitter-
than-average lineages are increasing in frequency, indicating
population-level adaptation. A sustained positive value precedes
epidemic growth because it means the effective reproduction number
of the average circulating virus is increasing.
When method = "variance", the metric captures the
heterogeneity of fitness across co-circulating lineages. High
variance indicates strong directional selection; low variance
indicates near-neutral drift.
This metric requires only genomic surveillance data. It does not require case counts, hospitalisations, or wastewater data, making it applicable in settings where epidemiological reporting is incomplete or delayed.
A tibble with columns:
Date.
Selective pressure value.
Lineage with highest frequency.
Frequency of dominant lineage.
Figgins MD, Bedford T (2025). Jointly modeling variant frequencies and case counts to estimate relative variant severity. medRxiv. doi:10.1101/2024.12.02.24318334
sim <- simulate_dynamics(n_lineages = 4, advantages = c("A" = 1.4, "B" = 1.1, "C" = 0.8), n_timepoints = 15, seed = 1) fit <- fit_model(sim, engine = "mlr") sp <- selective_pressure(fit) spsim <- simulate_dynamics(n_lineages = 4, advantages = c("A" = 1.4, "B" = 1.1, "C" = 0.8), n_timepoints = 15, seed = 1) fit <- fit_model(sim, engine = "mlr") sp <- selective_pressure(fit) sp
Estimates the minimum number of sequences needed to detect a lineage at a given frequency with specified precision.
sequencing_power(target_precision = 0.05, current_freq = 0.02, ci_level = 0.95)sequencing_power(target_precision = 0.05, current_freq = 0.02, ci_level = 0.95)
target_precision |
Desired half-width of the frequency confidence interval. Default 0.05 (plus/minus 5 percentage points). |
current_freq |
True or assumed frequency of the target lineage. Can be a vector for multiple scenarios. Default 0.02 (2%). |
ci_level |
Confidence level. Default 0.95. |
Uses the normal approximation to the binomial:
where z is the critical value, p is frequency, E is precision.
A tibble with columns: current_freq,
target_precision, required_n, ci_level.
# How many sequences to estimate a 2% lineage within +/-5%? sequencing_power() # Multiple scenarios sequencing_power(current_freq = c(0.01, 0.02, 0.05, 0.10))# How many sequences to estimate a 2% lineage within +/-5%? sequencing_power() # Multiple scenarios sequencing_power(current_freq = c(0.01, 0.02, 0.05, 0.10))
Generates synthetic lineage frequency data under a multinomial sampling model with configurable growth advantages. Useful for model validation, power analysis, and teaching.
simulate_dynamics( n_lineages = 4L, n_timepoints = 20L, total_per_tp = 500L, advantages = NULL, reference_name = "ref", start_date = Sys.Date(), interval = 7L, overdispersion = NULL, seed = NULL )simulate_dynamics( n_lineages = 4L, n_timepoints = 20L, total_per_tp = 500L, advantages = NULL, reference_name = "ref", start_date = Sys.Date(), interval = 7L, overdispersion = NULL, seed = NULL )
n_lineages |
Number of lineages including reference. Default 4. |
n_timepoints |
Number of time points. Default 20. |
total_per_tp |
Sequences per time point. A single integer
(constant across time) or a vector of length |
advantages |
Named numeric vector of per-week multiplicative
growth advantages for non-reference lineages. Length must equal
|
reference_name |
Name of the reference lineage. Default |
start_date |
Start date for the time series. Default today. |
interval |
Days between consecutive time points. Default 7 (weekly data). |
overdispersion |
If |
seed |
Random seed for reproducibility. |
An lfq_data object with an additional true_freq column
containing the true (pre-sampling) frequencies.
# JN.1 grows 1.3x per week, KP.3 declines at 0.9x sim <- simulate_dynamics( n_lineages = 3, advantages = c("JN.1" = 1.3, "KP.3" = 0.9), n_timepoints = 15, seed = 42 ) sim# JN.1 grows 1.3x per week, KP.3 declines at 0.9x sim <- simulate_dynamics( n_lineages = 3, advantages = c("JN.1" = 1.3, "KP.3" = 0.9), n_timepoints = 15, seed = 42 ) sim
Tests whether each lineage's frequency is significantly increasing over time using a binomial GLM. Useful for early warning of lineages that may warrant enhanced surveillance.
summarize_emerging(data, threshold = 0.01, min_obs = 3L, p_adjust = "holm")summarize_emerging(data, threshold = 0.01, min_obs = 3L, p_adjust = "holm")
data |
An lfq_data object. |
threshold |
Minimum current frequency to test. Default 0.01. |
min_obs |
Minimum time points observed. Default 3. |
p_adjust |
P-value adjustment method. Default |
A tibble with columns: lineage, first_seen, last_seen,
n_timepoints, current_freq, growth_rate, p_value,
p_adjusted, significant, direction.
sim <- simulate_dynamics( n_lineages = 4, advantages = c(emerging = 1.5, stable = 1.0, declining = 0.7), n_timepoints = 12, seed = 42) summarize_emerging(sim)sim <- simulate_dynamics( n_lineages = 4, advantages = c(emerging = 1.5, stable = 1.0, declining = 0.7), n_timepoints = 12, seed = 42) summarize_emerging(sim)
Summarise a lineage frequency model
## S3 method for class 'lfq_fit' summary(object, ...)## S3 method for class 'lfq_fit' summary(object, ...)
object |
An |
... |
Ignored. |
Invisibly returns object.
sim <- simulate_dynamics(n_lineages = 3, advantages = c("JN.1" = 1.3, "KP.3" = 0.9), n_timepoints = 15, seed = 42) fit <- fit_model(sim, engine = "mlr") summary(fit)sim <- simulate_dynamics(n_lineages = 3, advantages = c("JN.1" = 1.3, "KP.3" = 0.9), n_timepoints = 15, seed = 42) fit <- fit_model(sim, engine = "mlr") summary(fit)
Produces a multi-panel display combining calibration diagnostics, detection power, estimation quality, and current variant landscape into a single figure suitable for weekly surveillance reports. Designed for programme managers rather than statisticians.
surveillance_dashboard(fit, data, bt = NULL, target_prevalence = 0.01)surveillance_dashboard(fit, data, bt = NULL, target_prevalence = 0.01)
fit |
An |
data |
The |
bt |
Optional |
target_prevalence |
Prevalence for detection power calculation. Default 0.01 (1 percent). |
The dashboard contains up to four panels: (1) current frequency landscape, (2) growth advantage forest plot, (3) detection power curve, and (4) calibration reliability diagram (if backtest data are provided).
A list of ggplot objects with class
surveillance_dashboard. A print method renders all
panels.
surveillance_value for EVOI analysis,
alert_threshold for sequential detection.
sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.3, "B" = 0.9), n_timepoints = 15, seed = 1) fit <- fit_model(sim, engine = "mlr") panels <- surveillance_dashboard(fit, sim)sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.3, "B" = 0.9), n_timepoints = 15, seed = 1) fit <- fit_model(sim, engine = "mlr") panels <- surveillance_dashboard(fit, sim)
Quantifies the marginal decision value of sequencing additional samples from each region or stratum, based on the current posterior uncertainty of variant frequency estimates. The EVOI captures how much the expected estimation error decreases per additional sequence, enabling cost-effective resource allocation.
surveillance_value( fit, n_current = 100L, n_additional = seq(10L, 200L, by = 10L), target_lineage = NULL )surveillance_value( fit, n_current = 100L, n_additional = seq(10L, 200L, by = 10L), target_lineage = NULL )
fit |
An |
n_current |
Integer vector of current sample sizes per stratum. If a single integer, assumed equal across all lineages. |
n_additional |
Integer vector of candidate additional sample
sizes to evaluate. Default |
target_lineage |
Optional character; compute EVOI
specifically for this lineage. Default |
The EVOI is computed as the expected reduction in mean squared
estimation error for lineage frequency when additional
sequences are observed. Under the multinomial likelihood with
Gaussian approximation to the posterior, the variance of the
frequency estimate scales as , and additional
samples reduce this by the factor .
The marginal EVOI (the value of one additional sequence) is the derivative of the EVOI curve. It decreases monotonically with sample size, exhibiting the diminishing returns characteristic of information-theoretic quantities.
An evoi S3 class with components:
Tibble with n_additional, evoi,
marginal_evoi columns.
Numeric; current estimation variance (sum of growth rate SEs squared).
Character or NULL.
adaptive_design for dynamic allocation,
sequencing_power for static sample size planning.
sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.3, "B" = 0.9), n_timepoints = 15, seed = 1) fit <- fit_model(sim, engine = "mlr") ev <- surveillance_value(fit, n_current = 500) evsim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.3, "B" = 0.9), n_timepoints = 15, seed = 1) fit <- fit_model(sim, engine = "mlr") ev <- surveillance_value(fit, n_current = 500) ev
Tidy a fitness decomposition
tidy.fitness_decomposition(x, ...)tidy.fitness_decomposition(x, ...)
x |
A |
... |
Unused. |
A tibble with the decomposition results.
Converts model results to a tidy tibble, compatible with the broom package ecosystem.
tidy.lfq_fit(x, conf.int = TRUE, conf.level = 0.95, ...)tidy.lfq_fit(x, conf.int = TRUE, conf.level = 0.95, ...)
x |
An |
conf.int |
Include confidence intervals? Default |
conf.level |
Confidence level. Default 0.95. |
... |
Ignored. |
A tibble with columns: lineage, term, estimate,
std.error, conf.low, conf.high.
sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), seed = 1) fit <- fit_model(sim) tidy.lfq_fit(fit)sim <- simulate_dynamics(n_lineages = 3, advantages = c("A" = 1.2, "B" = 0.8), seed = 1) fit <- fit_model(sim) tidy.lfq_fit(fit)
Remove a registered engine
unregister_engine(name)unregister_engine(name)
name |
Engine name to remove. |
Invisibly returns the updated registry.