| Title: | Ensemble Learning Framework for Diagnostic and Prognostic Modeling |
|---|---|
| Description: | Provides a framework to build and evaluate diagnosis or prognosis models using stacking, voting, and bagging ensemble techniques with various base learners. The package also includes tools for visualization and interpretation of models. The development version of the package is available on 'GitHub' at <https://github.com/xiaojie0519/E2E>. The methods are based on the foundational work of Breiman (1996) <doi:10.1007/BF00058655> on bagging and Wolpert (1992) <doi:10.1016/S0893-6080(05)80023-1> on stacking. |
| Authors: | Shanjie Luan [aut, cre] |
| Maintainer: | Shanjie Luan <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.3 |
| Built: | 2026-05-18 09:58:54 UTC |
| Source: | https://github.com/xiaojie0519/e2e |
Applies a trained diagnostic model (single or ensemble) to a new dataset to generate predictions. It can handle various model objects created by the package, including single caret models, Bagging, Stacking, Voting, and EasyEnsemble objects.
apply_dia( trained_model_object, new_data, label_col_name = NULL, pos_class = "Positive", neg_class = "Negative" )
trained_model_object |
A trained model object from |
new_data |
A data frame containing the new samples for prediction. The first column must be the sample ID. |
label_col_name |
An optional character string specifying the name of the
column in new_data that contains the true labels. |
pos_class |
A character string for the positive class label used in the
model's probability predictions. Defaults to "Positive". |
neg_class |
A character string for the negative class label. Defaults to "Negative". This parameter
is mainly for consistency, as prediction focuses on pos_class. |
A data frame with three columns: sample (the sample IDs), label
(the true labels from new_data, or NA if not available/specified), and score
(the predicted probability for the positive class).
# Assuming `bagging_results` and `test_dia` are available from previous steps
# bagging_model <- bagging_results$model_object

# Example 1: Default behavior - use the second column of test_dia as label
# predictions <- apply_dia(
#   trained_model_object = bagging_model,
#   new_data = test_dia
# )

# Example 2: Explicitly specify the label column by name
# predictions_explicit <- apply_dia(
#   trained_model_object = bagging_model,
#   new_data = test_dia,
#   label_col_name = "outcome"
# )

# Example 3: Predict on data without labels
# test_data_no_labels <- test_dia[, -2]  # Remove outcome column
# predictions_no_label <- apply_dia(
#   trained_model_object = bagging_model,
#   new_data = test_data_no_labels,
#   label_col_name = NA  # Explicitly disable label extraction
# )
Generates risk scores for new patients using a trained model.
apply_pro(trained_model_object, new_data, time_unit = "day")
trained_model_object |
A trained object (class |
new_data |
Data frame of new patients. |
time_unit |
Time unit for data preparation. |
Data frame with IDs, outcomes (if available), and risk scores.
Implements a Bagging (Bootstrap Aggregating) ensemble for diagnostic models. It trains multiple base models on bootstrapped samples of the training data and aggregates their predictions by averaging probabilities.
bagging_dia( data, base_model_name, n_estimators = 50, subset_fraction = 0.632, tune_base_model = FALSE, threshold_choices = "default", positive_label_value = 1, negative_label_value = 0, new_positive_label = "Positive", new_negative_label = "Negative", seed = 456 )
data |
A data frame where the first column is the sample ID, the second is the outcome label, and subsequent columns are features. |
base_model_name |
A character string, the name of the base diagnostic model to use (e.g., "rf", "lasso"). This model must be registered. |
n_estimators |
An integer, the number of base models to train. |
subset_fraction |
A numeric value between 0 and 1, the fraction of samples to bootstrap for each base model. |
tune_base_model |
Logical, whether to enable tuning for each base model. |
threshold_choices |
A character string (e.g., "f1", "youden", "default") or a numeric value (0-1) for determining the evaluation threshold for the ensemble. |
positive_label_value |
A numeric or character value in the raw data representing the positive class. |
negative_label_value |
A numeric or character value in the raw data representing the negative class. |
new_positive_label |
A character string, the desired factor level name for the positive class (e.g., "Positive"). |
new_negative_label |
A character string, the desired factor level name for the negative class (e.g., "Negative"). |
seed |
An integer, for reproducibility. |
A list containing the model_object, sample_score, and evaluation_metrics.
initialize_modeling_system_dia, evaluate_model_dia
# This example assumes your package includes a dataset named 'train_dia'.
# If not, create a toy data frame first.
if (exists("train_dia")) {
  initialize_modeling_system_dia()
  bagging_rf_results <- bagging_dia(
    data = train_dia,
    base_model_name = "rf",
    n_estimators = 5,  # Reduced for a quick example
    threshold_choices = "youden",
    positive_label_value = 1,
    negative_label_value = 0,
    new_positive_label = "Case",
    new_negative_label = "Control"
  )
  print_model_summary_dia("Bagging (RF)", bagging_rf_results)
}
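The bagging procedure described above (bootstrap resampling, one base model per resample, averaged probabilities) can be sketched in base R. This is only an illustration under stated assumptions: it uses logistic regression via glm() as a stand-in base learner, simulated data, and hypothetical variable names; it is not the package's internal implementation.

```r
# Minimal bagging sketch in base R (illustrative; not the package's own code).
set.seed(456)

# Toy data laid out as described above: ID, label, then features (all simulated).
n <- 120
toy <- data.frame(
  ID    = paste0("S", seq_len(n)),
  label = rbinom(n, 1, 0.5),
  x1    = rnorm(n),
  x2    = rnorm(n)
)
toy$x1 <- toy$x1 + toy$label  # give x1 some signal

n_estimators    <- 5
subset_fraction <- 0.632
n_draw <- floor(subset_fraction * n)

# Fit one logistic regression per bootstrap sample (rows drawn with replacement).
base_models <- lapply(seq_len(n_estimators), function(i) {
  idx <- sample(n, size = n_draw, replace = TRUE)
  glm(label ~ x1 + x2, data = toy[idx, ], family = binomial())
})

# Aggregate by averaging the predicted positive-class probabilities.
prob_matrix  <- sapply(base_models, predict, newdata = toy, type = "response")
bagged_score <- rowMeans(prob_matrix)

head(data.frame(sample = toy$ID, label = toy$label, score = bagged_score))
```

Averaging probabilities rather than hard votes keeps a continuous score, so a threshold can still be chosen afterwards.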
Implements Bootstrap Aggregating (Bagging) for survival models. It trains multiple base models on bootstrapped subsets and averages the risk scores. This method reduces variance and improves stability.
bagging_pro( data, base_model_name, n_estimators = 10, subset_fraction = 0.632, tune_base_model = FALSE, time_unit = "day", years_to_evaluate = c(1, 3, 5), seed = 456 )
data |
Input data frame (ID, Status, Time, Features). |
base_model_name |
Character string name of the base model (e.g., "rsf_pro"). |
n_estimators |
Integer. Number of bootstrap iterations. |
subset_fraction |
Numeric (0-1). Fraction of data to sample in each iteration. |
tune_base_model |
Logical. Whether to tune each base model (computationally expensive). |
time_unit |
Time unit of the input data. |
years_to_evaluate |
Numeric vector of years for time-dependent AUC evaluation. |
seed |
Integer seed for reproducibility. |
A list containing the ensemble object, sample scores, and evaluation metrics.
Calculates various classification performance metrics (Accuracy, Precision, Recall, F1-score, Specificity, True Positives, etc.) for binary classification at a given probability threshold.
calculate_metrics_at_threshold_dia( prob_positive, y_true, threshold, pos_class, neg_class )
prob_positive |
A numeric vector of predicted probabilities for the positive class. |
y_true |
A factor vector of true class labels. |
threshold |
A numeric value between 0 and 1, the probability threshold above which a prediction is considered positive. |
pos_class |
A character string, the label for the positive class. |
neg_class |
A character string, the label for the negative class. |
A list containing:
Threshold: The threshold used.
Accuracy: Overall prediction accuracy.
Precision: Precision for the positive class.
Recall: Recall (Sensitivity) for the positive class.
F1: F1-score for the positive class.
Specificity: Specificity for the negative class.
TP, TN, FP, FN, N: Counts of True Positives, True Negatives,
False Positives, False Negatives, and total samples.
y_true_ex <- factor(c("Negative", "Positive", "Positive", "Negative", "Positive"),
                    levels = c("Negative", "Positive"))
prob_ex <- c(0.1, 0.8, 0.6, 0.3, 0.9)
metrics <- calculate_metrics_at_threshold_dia(
  prob_positive = prob_ex,
  y_true = y_true_ex,
  threshold = 0.5,
  pos_class = "Positive",
  neg_class = "Negative"
)
print(metrics)
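To make the listed metrics concrete, the following base R sketch derives them from the confusion counts for the same toy vectors. The helper name metrics_sketch is hypothetical, and the strict "greater than" comparison is an assumption taken from the wording "above which"; the package function may treat ties at the threshold differently.

```r
# Illustrative helper (hypothetical name); not the package function itself.
metrics_sketch <- function(prob_positive, y_true, threshold, pos_class, neg_class) {
  # Assumption: a score strictly above the threshold is called positive.
  pred <- ifelse(prob_positive > threshold, pos_class, neg_class)

  tp <- sum(pred == pos_class & y_true == pos_class)
  tn <- sum(pred == neg_class & y_true == neg_class)
  fp <- sum(pred == pos_class & y_true == neg_class)
  fn <- sum(pred == neg_class & y_true == pos_class)

  # Guard against division by zero when a class is never predicted or observed.
  safe_div <- function(a, b) if (b > 0) a / b else NA_real_
  precision <- safe_div(tp, tp + fp)
  recall    <- safe_div(tp, tp + fn)

  list(
    Threshold   = threshold,
    Accuracy    = (tp + tn) / length(y_true),
    Precision   = precision,
    Recall      = recall,
    F1          = safe_div(2 * precision * recall, precision + recall),
    Specificity = safe_div(tn, tn + fp),
    TP = tp, TN = tn, FP = fp, FN = fn, N = length(y_true)
  )
}

y_true_ex <- factor(c("Negative", "Positive", "Positive", "Negative", "Positive"),
                    levels = c("Negative", "Positive"))
prob_ex <- c(0.1, 0.8, 0.6, 0.3, 0.9)
m <- metrics_sketch(prob_ex, y_true_ex, 0.5, "Positive", "Negative")
str(m)
```

With these five samples every score above 0.5 belongs to a positive case, so the sketch reports three true positives, two true negatives, and no errors.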
Trains a single Decision Tree model using caret::train (via rpart method)
for binary classification.
dt_dia(X, y, tune = FALSE, cv_folds = 5)
X |
A data frame of features. |
y |
A factor vector of class labels. |
tune |
Logical, whether to perform hyperparameter tuning for |
cv_folds |
An integer, the number of cross-validation folds for |
A caret::train object representing the trained Decision Tree model.
set.seed(42)
n_obs <- 50
X_toy <- data.frame(
  FeatureA = rnorm(n_obs),
  FeatureB = runif(n_obs, 0, 100)
)
y_toy <- factor(sample(c("Control", "Case"), n_obs, replace = TRUE),
                levels = c("Control", "Case"))

# Train the model
dt_model <- dt_dia(X_toy, y_toy)
print(dt_model)
Trains an Elastic Net-regularized logistic regression model
using caret::train (via glmnet method) for binary classification.
en_dia(X, y, tune = FALSE, cv_folds = 5)
X |
A data frame of features. |
y |
A factor vector of class labels. |
tune |
Logical, whether to perform hyperparameter tuning for |
cv_folds |
An integer, the number of cross-validation folds for |
A caret::train object representing the trained Elastic Net model.
set.seed(42)
n_obs <- 50
X_toy <- data.frame(
  FeatureA = rnorm(n_obs),
  FeatureB = runif(n_obs, 0, 100)
)
y_toy <- factor(sample(c("Control", "Case"), n_obs, replace = TRUE),
                levels = c("Control", "Case"))

# Train the model
en_model <- en_dia(X_toy, y_toy)
print(en_model)
Fits a Cox model with Elastic Net regularization (mixture of L1 and L2 penalties). Alpha is fixed at 0.5.
en_pro(X, y_surv, tune = FALSE)
X |
A data frame of predictors. |
y_surv |
A |
tune |
Logical. If TRUE, performs internal tuning (currently handled by cv.glmnet automatically). |
An object of class survival_glmnet and pro_model.
Evaluates the performance of a trained diagnostic model using various metrics relevant to binary classification, including AUROC, AUPRC, and metrics at an optimal or specified probability threshold.
evaluate_model_dia( model_obj = NULL, X_data = NULL, y_data, sample_ids, threshold_choices = "default", pos_class, neg_class, precomputed_prob = NULL, y_original_numeric = NULL )
model_obj |
A trained model object (typically a |
X_data |
A data frame of features corresponding to the data used for evaluation.
Required if |
y_data |
A factor vector of true class labels for the evaluation data. |
sample_ids |
A vector of sample IDs for the evaluation data. |
threshold_choices |
A character string specifying the thresholding strategy ("default", "f1", "youden") or a numeric probability threshold value (0-1). |
pos_class |
A character string, the label for the positive class. |
neg_class |
A character string, the label for the negative class. |
precomputed_prob |
Optional. A numeric vector of precomputed probabilities
for the positive class. If provided, |
y_original_numeric |
Optional. The original numeric/character vector of labels.
If not provided, it's inferred from |
A list containing:
sample_score: A data frame with sample (ID), label (original numeric),
and score (predicted probability for positive class).
evaluation_metrics: A list of performance metrics:
Threshold_Strategy: The strategy used for threshold selection.
_Threshold: The chosen probability threshold.
Accuracy, Precision, Recall, F1, Specificity: Metrics
calculated at _Threshold.
AUROC: Area Under the Receiver Operating Characteristic curve.
AUROC_95CI_Lower, AUROC_95CI_Upper: 95% confidence interval for AUROC.
AUPRC: Area Under the Precision-Recall curve.
set.seed(42)
n_obs <- 50
X_toy <- data.frame(
  FeatureA = rnorm(n_obs),
  FeatureB = runif(n_obs, 0, 100)
)
y_toy <- factor(sample(c("Control", "Case"), n_obs, replace = TRUE),
                levels = c("Control", "Case"))
ids_toy <- paste0("Sample", 1:n_obs)

# 2. Train a model
rf_model <- rf_dia(X_toy, y_toy)

# 3. Evaluate the model using F1-score optimal threshold
eval_results <- evaluate_model_dia(
  model_obj = rf_model,
  X_data = X_toy,
  y_data = y_toy,
  sample_ids = ids_toy,
  threshold_choices = "f1",
  pos_class = "Case",
  neg_class = "Control"
)
str(eval_results)
Comprehensive evaluation of survival models using:
Harrell's Concordance Index (C-index).
Time-dependent Area Under the ROC Curve (AUROC) at specified years.
Kaplan-Meier analysis comparing high vs. low risk groups (based on median split).
evaluate_model_pro( trained_model_obj = NULL, X_data = NULL, Y_surv_obj, sample_ids, years_to_evaluate = c(1, 3, 5), precomputed_score = NULL, meta_normalize_params = NULL )
trained_model_obj |
A trained model object (optional if precomputed_score provided). |
X_data |
Features for prediction (optional if precomputed_score provided). |
Y_surv_obj |
True survival object. |
sample_ids |
Vector of IDs. |
years_to_evaluate |
Numeric vector of years for time-dependent AUC. |
precomputed_score |
Numeric vector of pre-calculated risk scores. |
meta_normalize_params |
Internal use. |
A list containing a dataframe of scores and a list of evaluation metrics.
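Two of the quantities named above, Harrell's C-index and the median split into risk groups, are simple enough to sketch in base R. The function name c_index_sketch and the toy vectors are hypothetical, and the pair-counting rule below is the textbook definition (a pair is comparable when the subject with the shorter time had an event); the package may rely on a library routine with different tie handling.

```r
# Illustrative Harrell's C-index (hypothetical helper; not the package's code).
c_index_sketch <- function(time, status, risk) {
  concordant <- 0
  comparable <- 0
  n <- length(time)
  for (i in seq_len(n)) {
    if (status[i] != 1) next          # subject i must have an observed event
    for (j in seq_len(n)) {
      if (time[j] > time[i]) {        # subject j survived longer than i
        comparable <- comparable + 1
        if (risk[i] > risk[j]) {
          concordant <- concordant + 1      # higher risk, earlier event
        } else if (risk[i] == risk[j]) {
          concordant <- concordant + 0.5    # tied risk scores count half
        }
      }
    }
  }
  concordant / comparable
}

# Toy data: four patients, the third one censored.
time   <- c(5, 10, 15, 20)
status <- c(1, 1, 0, 1)
risk   <- c(0.9, 0.2, 0.4, 0.7)

c_idx <- c_index_sketch(time, status, risk)
print(c_idx)

# Median split into high- and low-risk groups, as used for the KM comparison.
risk_group <- ifelse(risk > median(risk), "High", "Low")
print(table(risk_group))
```

Here five pairs are comparable and three of them are ordered correctly by the risk score, giving a C-index of 0.6.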
Evaluates model performance from a data frame of predictions,
calculating metrics like AUROC, AUPRC, F1 score, etc. This function is designed
for use with prediction results, such as the output from apply_dia.
evaluate_predictions_dia( prediction_df, threshold_choices = "default", pos_class = "Positive", neg_class = "Negative" )
prediction_df |
A data frame containing predictions. Must contain
the columns |
threshold_choices |
A character string specifying the thresholding strategy ("default", "f1", "youden") or a numeric probability threshold value (0-1). |
pos_class |
A character string for the positive class label used in reporting.
Defaults to "Positive". |
neg_class |
A character string for the negative class label used in reporting.
Defaults to "Negative". |
This function strictly requires the label column in prediction_df to adhere
to the following format:
1: Represents the positive class.
0: Represents the negative class.
NA: Will be ignored during calculation.
The function will stop with an error if any other values are found in the label column.
A named list containing all calculated performance metrics.
# # Create a sample prediction data frame
# predictions_df <- data.frame(
#   sample = 1:10,
#   label = c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0),
#   score = c(0.9, 0.2, 0.8, 0.6, 0.3, 0.4, 0.95, 0.1, 0.7, 0.5)
# )
#
# # Evaluate the predictions using the 'f1' threshold strategy
# evaluation_results <- evaluate_predictions_dia(
#   prediction_df = predictions_df,
#   threshold_choices = "f1"
# )
#
# print(evaluation_results)
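The label rule stated above (only 1, 0, or NA) and a threshold-free metric such as AUROC can be illustrated without the package. In this sketch the helper names check_labels_sketch and auroc_sketch are hypothetical, and the AUROC uses the standard rank-sum (Mann-Whitney) identity, which may differ from the routine the package actually calls.

```r
# Illustrative helpers (hypothetical names); not the package's internals.

# Stop unless every label is 1, 0, or NA.
check_labels_sketch <- function(label) {
  bad <- !(is.na(label) | label %in% c(0, 1))
  if (any(bad)) stop("label column may only contain 1, 0, or NA")
  invisible(TRUE)
}

# AUROC via the rank-sum identity; rows with NA labels are dropped first.
auroc_sketch <- function(label, score) {
  keep  <- !is.na(label)
  label <- label[keep]
  score <- score[keep]
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  r <- rank(score)  # average ranks handle tied scores
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

predictions_df <- data.frame(
  sample = 1:10,
  label  = c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0),
  score  = c(0.9, 0.2, 0.8, 0.6, 0.3, 0.4, 0.95, 0.1, 0.7, 0.5)
)

check_labels_sketch(predictions_df$label)
auc <- auroc_sketch(predictions_df$label, predictions_df$score)
print(auc)
```

In this toy frame every positive sample scores higher than every negative one, so the AUROC is 1.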
Calculates performance metrics for external prediction sets.
evaluate_predictions_pro(prediction_df, years_to_evaluate = c(1, 3, 5))
prediction_df |
Data frame with columns |
years_to_evaluate |
Years for AUC. |
List of evaluation metrics.
Generates and returns a ggplot object for Receiver Operating Characteristic (ROC) curves, Precision-Recall (PRC) curves, or confusion matrices.
figure_dia(type, data, file = NULL)
type |
String, specifies the type of plot to generate. Options are "roc", "prc", or "matrix". |
data |
A list object containing model evaluation results. It must include:
|
file |
Optional. A string specifying the path to save the plot (e.g.,
"plot.png"). If |
A ggplot object. If the file argument is provided, the plot is also
saved to the specified path.
# Create example data for a diagnostic model
external_eval_example_dia <- list(
  sample_score = data.frame(
    ID = paste0("S", 1:100),
    label = sample(c(0, 1), 100, replace = TRUE),
    score = runif(100, 0, 1)
  ),
  evaluation_metrics = list(
    Final_Threshold = 0.53
  )
)

# Generate an ROC curve plot object
roc_plot <- figure_dia(type = "roc", data = external_eval_example_dia)
# To display the plot, simply run:
# print(roc_plot)

# Generate a PRC curve and save it to a temporary file
# tempfile() creates a safe, temporary path as required by CRAN
temp_prc_path <- tempfile(fileext = ".png")
figure_dia(type = "prc", data = external_eval_example_dia, file = temp_prc_path)

# Generate a Confusion Matrix plot
matrix_plot <- figure_dia(type = "matrix", data = external_eval_example_dia)
Generates and returns a ggplot object for Kaplan-Meier (KM) survival curves or time-dependent ROC curves.
figure_pro(type, data, file = NULL, time_unit = "days")
type |
"km" or "tdroc" |
data |
list with:
|
file |
optional path to save |
time_unit |
"days" (default), "months", or "years" for df$time |
ggplot object
Creates SHAP (SHapley Additive exPlanations) plots to explain feature contributions by training a surrogate model on the original model's scores.
figure_shap(data, raw_data, target_type, file = NULL, model_type = "xgboost")
data |
A list containing |
raw_data |
A data frame with original features. The first column must be the sample ID. |
target_type |
String, the analysis type: "diagnosis" or "prognosis".
This determines which columns in |
file |
Optional. A string specifying the path to save the plot. If |
model_type |
String, the surrogate model for SHAP calculation. "xgboost" (default) or "lasso". |
A patchwork object combining SHAP summary and importance plots. If file is
provided, the plot is also saved.
# --- Example for a Diagnosis Model ---
set.seed(123)
train_dia_data <- data.frame(
  SampleID = paste0("S", 1:100),
  Label = sample(c(0, 1), 100, replace = TRUE),
  FeatureA = rnorm(100, 10, 2),
  FeatureB = runif(100, 0, 5)
)
model_results <- list(
  sample_score = data.frame(ID = paste0("S", 1:100), score = runif(100, 0, 1))
)

# Generate SHAP plot object
shap_plot <- figure_shap(
  data = model_results,
  raw_data = train_dia_data,
  target_type = "diagnosis",
  model_type = "xgboost"
)
# To display the plot:
# print(shap_plot)
Determines an optimal probability threshold for binary classification based on maximizing F1-score or Youden's J statistic.
find_optimal_threshold_dia( prob_positive, y_true, type = c("f1", "youden"), pos_class, neg_class )
prob_positive |
A numeric vector of predicted probabilities for the positive class. |
y_true |
A factor vector of true class labels. |
type |
A character string, specifying the optimization criterion: "f1" for F1-score or "youden" for Youden's J statistic (Sensitivity + Specificity - 1). |
pos_class |
A character string, the label for the positive class. |
neg_class |
A character string, the label for the negative class. |
A numeric value, the optimal probability threshold.
y_true_ex <- factor(c("Negative", "Positive", "Positive", "Negative", "Positive"),
                    levels = c("Negative", "Positive"))
prob_ex <- c(0.1, 0.8, 0.6, 0.3, 0.9)

# Find threshold maximizing F1-score
opt_f1_threshold <- find_optimal_threshold_dia(
  prob_positive = prob_ex,
  y_true = y_true_ex,
  type = "f1",
  pos_class = "Positive",
  neg_class = "Negative"
)
print(opt_f1_threshold)

# Find threshold maximizing Youden's J
opt_youden_threshold <- find_optimal_threshold_dia(
  prob_positive = prob_ex,
  y_true = y_true_ex,
  type = "youden",
  pos_class = "Positive",
  neg_class = "Negative"
)
print(opt_youden_threshold)
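The Youden criterion (Sensitivity + Specificity - 1) can be sketched as a scan over candidate thresholds. The helper youden_threshold_sketch is hypothetical, and two choices in it are assumptions: candidates are taken from the observed scores, and a score strictly above the threshold is called positive. The package function may search a different grid and therefore return a different value.

```r
# Illustrative threshold search (hypothetical helper; not the package function).
youden_threshold_sketch <- function(prob_positive, y_true, pos_class) {
  candidates <- sort(unique(prob_positive))  # assumption: scan observed scores
  is_pos <- y_true == pos_class

  j_stat <- vapply(candidates, function(t) {
    pred_pos <- prob_positive > t            # assumption: strict comparison
    sens <- sum(pred_pos & is_pos) / sum(is_pos)
    spec <- sum(!pred_pos & !is_pos) / sum(!is_pos)
    sens + spec - 1
  }, numeric(1))

  # which.max() returns the first maximum, i.e. the lowest best threshold.
  candidates[which.max(j_stat)]
}

y_true_ex <- factor(c("Negative", "Positive", "Positive", "Negative", "Positive"),
                    levels = c("Negative", "Positive"))
prob_ex <- c(0.1, 0.8, 0.6, 0.3, 0.9)

best_t <- youden_threshold_sketch(prob_ex, y_true_ex, "Positive")
print(best_t)
```

For these scores the first candidate that separates the classes perfectly is 0.3, where sensitivity and specificity are both 1.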
Trains a Gradient Boosting Machine (GBM) model using caret::train
for binary classification.
gbm_dia(X, y, tune = FALSE, cv_folds = 5, tune_length = 10)
X |
A data frame of features. |
y |
A factor vector of class labels. |
tune |
Logical, whether to perform hyperparameter tuning for |
cv_folds |
An integer, the number of cross-validation folds for |
tune_length |
An integer, the number of random parameter combinations to try when tune=TRUE. Only used when search="random". Default is 10. |
A caret::train object representing the trained GBM model.
set.seed(42)
n_obs <- 200
X_toy <- data.frame(
  FeatureA = rnorm(n_obs),
  FeatureB = runif(n_obs, 0, 100)
)
y_toy <- factor(sample(c("Control", "Case"), n_obs, replace = TRUE),
                levels = c("Control", "Case"))

# Train the model with default parameters
gbm_model <- gbm_dia(X_toy, y_toy)
print(gbm_model)

# Train with extensive tuning (random search)
gbm_model_tuned <- gbm_dia(X_toy, y_toy, tune = TRUE, tune_length = 30)
print(gbm_model_tuned)
Fits a stochastic gradient boosting model using the Cox Partial Likelihood distribution. Supports random search for hyperparameter optimization.
gbm_pro(X, y_surv, tune = FALSE, cv.folds = 5, max_tune_iter = 10)
X |
A data frame of predictors. |
y_surv |
A |
tune |
Logical. If TRUE, performs random search. |
cv.folds |
Integer. Number of cross-validation folds. |
max_tune_iter |
Integer. Maximum iterations for random search. |
An object of class survival_gbm and pro_model.
Retrieves a list of all diagnostic model functions currently registered in the internal environment.
get_registered_models_dia()
A named list where names are the registered model names and values are the corresponding model functions.
register_model_dia, initialize_modeling_system_dia
# Ensure system is initialized to see the default models
initialize_modeling_system_dia()
models <- get_registered_models_dia()

# See available model names
print(names(models))
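A registry of this kind, mapping model names to functions in an internal environment, can be sketched in a few lines of base R. Everything here is hypothetical (the environment, the helper names, and the toy "majority" model) and is meant only to show the pattern, not how the package stores its models.

```r
# Illustrative model registry (all names hypothetical; not the package's code).
registry_sketch <- new.env(parent = emptyenv())

# Store a model-fitting function under a name.
register_sketch <- function(name, fn) {
  stopifnot(is.character(name), length(name) == 1, is.function(fn))
  assign(name, fn, envir = registry_sketch)
  invisible(NULL)
}

# Return a named list: names are model names, values are the functions.
get_registered_sketch <- function() {
  mget(ls(registry_sketch), envir = registry_sketch)
}

# A toy "model" that always predicts the most frequent class.
register_sketch("majority", function(X, y) {
  names(which.max(table(y)))
})

models <- get_registered_sketch()
print(names(models))
print(models$majority(NULL, c("Case", "Control", "Case")))
```

Keeping the registry in its own environment lets ensemble functions look models up by name at call time, which is why a base model "must be registered" before it can be used.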
Retrieves the list of available models.
get_registered_models_pro()
Named list of functions.
Implements the EasyEnsemble algorithm. It trains multiple base models on balanced subsets of the data (by undersampling the majority class) and aggregates their predictions.
imbalance_dia( data, base_model_name = "rf", n_estimators = 10, tune_base_model = FALSE, threshold_choices = "default", positive_label_value = 1, negative_label_value = 0, new_positive_label = "Positive", new_negative_label = "Negative", seed = 456 )
data |
A data frame where the first column is the sample ID, the second is the outcome label, and subsequent columns are features. |
base_model_name |
A character string, the name of the base diagnostic model to use (e.g., "xb", "rf"). This model must be registered. |
n_estimators |
An integer, the number of base models to train (number of subsets). |
tune_base_model |
Logical, whether to enable tuning for each base model. |
threshold_choices |
A character string (e.g., "f1", "youden", "default") or a numeric value (0-1) for determining the evaluation threshold for the ensemble. |
positive_label_value |
A numeric or character value in the raw data representing the positive class. |
negative_label_value |
A numeric or character value in the raw data representing the negative class. |
new_positive_label |
A character string, the desired factor level name for the positive class (e.g., "Positive"). |
new_negative_label |
A character string, the desired factor level name for the negative class (e.g., "Negative"). |
seed |
An integer, for reproducibility. |
A list containing the model_object, sample_score, and evaluation_metrics.
initialize_modeling_system_dia, evaluate_model_dia
# 1. Initialize the modeling system
initialize_modeling_system_dia()

# 2. Create an imbalanced toy dataset
set.seed(42)
n_obs <- 100
n_minority <- 10
data_imbalanced_toy <- data.frame(
  ID = paste0("Sample", 1:n_obs),
  Status = c(rep(1, n_minority), rep(0, n_obs - n_minority)),
  Feat1 = rnorm(n_obs),
  Feat2 = runif(n_obs)
)

# 3. Run the EasyEnsemble algorithm
# n_estimators is reduced for a quick example
easyensemble_results <- imbalance_dia(
  data = data_imbalanced_toy,
  base_model_name = "rf",
  n_estimators = 3,
  threshold_choices = "f1"
)
print_model_summary_dia("EasyEnsemble (RF)", easyensemble_results)
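The undersampling step at the heart of this approach can be sketched in base R: each subset keeps every minority sample and draws an equally sized random sample from the majority class. The sketch assumes logistic regression via glm() as a stand-in base learner and averages probabilities as the aggregation rule; both are illustrative assumptions rather than the package's implementation.

```r
# Illustrative balanced-undersampling ensemble (not the package's own code).
set.seed(42)
n_obs <- 100
n_minority <- 10
toy <- data.frame(
  ID     = paste0("Sample", seq_len(n_obs)),
  Status = c(rep(1, n_minority), rep(0, n_obs - n_minority)),
  Feat1  = rnorm(n_obs),
  Feat2  = runif(n_obs)
)

minority_idx <- which(toy$Status == 1)
majority_idx <- which(toy$Status == 0)
n_estimators <- 3

# Each subset: all minority rows plus an equal number of majority rows.
subsets <- lapply(seq_len(n_estimators), function(i) {
  c(minority_idx, sample(majority_idx, length(minority_idx)))
})

# One base model per balanced subset (glm is an assumed stand-in learner).
models <- lapply(subsets, function(idx) {
  glm(Status ~ Feat1 + Feat2, data = toy[idx, ], family = binomial())
})

# Aggregate by averaging predicted probabilities over the full data set.
scores <- rowMeans(sapply(models, predict, newdata = toy, type = "response"))
summary(scores)
```

Because every subset is balanced, each base model sees the minority class as often as the majority class, while the ensemble as a whole still uses many different majority samples.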
Initializes the diagnostic modeling system by loading required
packages and registering default diagnostic models (Random Forest, XGBoost,
SVM, MLP, Lasso, Elastic Net, Ridge, LDA, QDA, Naive Bayes, Decision Tree, GBM).
This function should be called once before using models_dia() or ensemble methods.
initialize_modeling_system_dia()
Invisible NULL. Initializes the internal model registry.
# Initialize the system (typically run once at the start of a session or script)
initialize_modeling_system_dia()

# Check if a default model like Random Forest is now registered
"rf" %in% names(get_registered_models_dia())
Initializes the environment and registers default survival models (Lasso, Elastic Net, Ridge, RSF, StepCox, GBM, XGBoost, PLS).
initialize_modeling_system_pro()
Executes a complete diagnostic modeling workflow including single models, bagging, stacking, and voting ensembles across training and multiple test datasets. Returns structured results with AUROC values for visualization.
int_dia(
  ...,
  model_names = NULL,
  tune = TRUE,
  n_estimators = 10,
  seed = 123,
  positive_label_value = 1,
  negative_label_value = 0,
  new_positive_label = "Positive",
  new_negative_label = "Negative"
)
... |
Data frames for analysis. The first is the training dataset; all subsequent arguments are test datasets. |
model_names |
Character vector specifying which models to use. If NULL (default), uses all registered models. |
tune |
Logical, enable hyperparameter tuning. Default TRUE. |
n_estimators |
Integer, number of bootstrap samples for bagging. Default 10. |
seed |
Integer for reproducibility. Default 123. |
positive_label_value |
Value representing positive class. Default 1. |
negative_label_value |
Value representing negative class. Default 0. |
new_positive_label |
Factor level name for positive class. Default "Positive". |
new_negative_label |
Factor level name for negative class. Default "Negative". |
A list containing all_results, auroc_matrix, model_categories, and dataset_names.
Extends int_dia by adding imbalance-specific models (EasyEnsemble).
Produces a comprehensive set of models optimized for imbalanced datasets.
int_imbalance(
  ...,
  model_names = NULL,
  tune = TRUE,
  n_estimators = 10,
  seed = 123,
  positive_label_value = 1,
  negative_label_value = 0,
  new_positive_label = "Positive",
  new_negative_label = "Negative"
)
... |
Data frames for analysis. The first is the training dataset; all subsequent arguments are test datasets. |
model_names |
Character vector specifying which models to use. If NULL (default), uses all registered models. |
tune |
Logical, enable hyperparameter tuning. Default TRUE. |
n_estimators |
Integer, number of bootstrap samples for bagging. Default 10. |
seed |
Integer for reproducibility. Default 123. |
positive_label_value |
Value representing positive class. Default 1. |
negative_label_value |
Value representing negative class. Default 0. |
new_positive_label |
Factor level name for positive class. Default "Positive". |
new_negative_label |
Factor level name for negative class. Default "Negative". |
Same structure as int_dia with additional imbalance-handling models.
## Not run: 
imbalanced_results <- int_imbalance(train_imbalanced, test_imbalanced)

## End(Not run)
Executes a complete prognostic (survival) modeling workflow including single models, bagging, and stacking ensembles. Returns C-index and time-dependent AUROC metrics.
int_pro(
  ...,
  model_names = NULL,
  tune = TRUE,
  n_estimators = 10,
  seed = 123,
  time_unit = "day",
  years_to_evaluate = c(1, 3, 5)
)
... |
Data frames for survival analysis. The first is the training dataset; all subsequent arguments are test datasets. Expected format: first column is the sample ID, second is the outcome (0/1), third is the time, and the remaining columns are features. |
model_names |
Character vector specifying which models to use. If NULL (default), uses all registered prognostic models. |
tune |
Logical, enable tuning. Default TRUE. |
n_estimators |
Integer, bagging iterations. Default 10. |
seed |
Integer for reproducibility. Default 123. |
time_unit |
Time unit in data: "day", "month", or "year". Default "day". |
years_to_evaluate |
Numeric vector of years for time-dependent AUROC. Default c(1,3,5). |
A list with:
all_results: All model outputs
cindex_matrix: C-index values (models × datasets)
avg_auroc_matrix: Average time-dependent AUROC (models × datasets)
model_categories: Model category labels
dataset_names: Dataset identifiers
## Not run: 
prognosis_results <- int_pro(train_pro, test_pro1, test_pro2)

## End(Not run)
Trains a Lasso-regularized logistic regression model using caret::train
(via glmnet method) for binary classification.
lasso_dia(X, y, tune = FALSE, cv_folds = 5)
X |
A data frame of features. |
y |
A factor vector of class labels. |
tune |
Logical, whether to perform hyperparameter tuning for |
cv_folds |
An integer, the number of cross-validation folds for |
A caret::train object representing the trained Lasso model.
set.seed(42)
n_obs <- 50
X_toy <- data.frame(
  FeatureA = rnorm(n_obs),
  FeatureB = runif(n_obs, 0, 100)
)
y_toy <- factor(sample(c("Control", "Case"), n_obs, replace = TRUE),
                levels = c("Control", "Case"))

# Train the model
lasso_model <- lasso_dia(X_toy, y_toy)
print(lasso_model)
Fits a Cox proportional hazards model regularized by the Lasso (L1) penalty. Uses cross-validation to select the optimal lambda.
lasso_pro(X, y_surv, tune = FALSE)
X |
A data frame of predictors. |
y_surv |
A |
tune |
Logical. If TRUE, performs internal tuning (currently handled by cv.glmnet automatically). |
An object of class survival_glmnet and pro_model.
library(survival)

# Create dummy data
set.seed(123)
df <- data.frame(
  time = rexp(50),
  status = sample(0:1, 50, replace = TRUE),
  var1 = rnorm(50),
  var2 = rnorm(50)
)
y <- Surv(df$time, df$status)
x <- df[, c("var1", "var2")]

model <- lasso_pro(x, y)
print(class(model))
Trains a Linear Discriminant Analysis (LDA) model using caret::train
for binary classification.
lda_dia(X, y, tune = FALSE, cv_folds = 5)
X |
A data frame of features. |
y |
A factor vector of class labels. |
tune |
Logical, whether to perform hyperparameter tuning (currently ignored for LDA). |
cv_folds |
An integer, the number of cross-validation folds for |
A caret::train object representing the trained LDA model.
set.seed(42)
n_obs <- 50
X_toy <- data.frame(
  FeatureA = rnorm(n_obs),
  FeatureB = runif(n_obs, 0, 100)
)
y_toy <- factor(sample(c("Control", "Case"), n_obs, replace = TRUE),
                levels = c("Control", "Case"))

# Train the model
lda_model <- lda_dia(X_toy, y_toy)
print(lda_model)
Loads a CSV file containing patient data, extracts features, and converts the label column into a factor suitable for classification models. Handles basic data cleaning like trimming whitespace and type conversion.
load_and_prepare_data_dia(
  data_path,
  label_col_name,
  positive_label_value = 1,
  negative_label_value = 0,
  new_positive_label = "Positive",
  new_negative_label = "Negative"
)
data_path |
A character string, the file path to the input CSV data. The first column is assumed to be a sample ID. |
label_col_name |
A character string, the name of the column containing the class labels. |
positive_label_value |
A numeric or character value that represents the positive class in the raw data. |
negative_label_value |
A numeric or character value that represents the negative class in the raw data. |
new_positive_label |
A character string, the desired factor level name for the positive class (e.g., "Positive"). |
new_negative_label |
A character string, the desired factor level name for the negative class (e.g., "Negative"). |
A list containing:
X: A data frame of features (all columns except ID and label).
y: A factor vector of class labels, with levels new_negative_label
and new_positive_label.
sample_ids: A vector of sample IDs (the first column of the input data).
pos_class_label: The character string used for the positive class factor level.
neg_class_label: The character string used for the negative class factor level.
y_original_numeric: The original numeric/character vector of labels.
# Create a dummy CSV file in a temporary directory for demonstration
temp_csv_path <- tempfile(fileext = ".csv")
dummy_data <- data.frame(
  ID = paste0("Patient", 1:50),
  Disease_Status = sample(c(0, 1), 50, replace = TRUE),
  FeatureA = rnorm(50),
  FeatureB = runif(50, 0, 100),
  CategoricalFeature = sample(c("X", "Y", "Z"), 50, replace = TRUE)
)
write.csv(dummy_data, temp_csv_path, row.names = FALSE)

# Load and prepare data from the temporary file
prepared_data <- load_and_prepare_data_dia(
  data_path = temp_csv_path,
  label_col_name = "Disease_Status",
  positive_label_value = 1,
  negative_label_value = 0,
  new_positive_label = "Case",
  new_negative_label = "Control"
)

# Check prepared data structure
str(prepared_data$X)
table(prepared_data$y)

# Clean up the dummy file
unlink(temp_csv_path)
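The label conversion described above can be illustrated with base R alone. This is a minimal sketch, assuming the raw labels are recoded with `factor()`; it is not the package's internal code, and the variable names are only illustrative.

```r
# Raw labels as they might appear in the label column of the input data
raw_labels <- c(1, 0, 0, 1, 1)

# Values and names mirroring the function's arguments (illustrative)
positive_label_value <- 1
negative_label_value <- 0
new_positive_label <- "Case"
new_negative_label <- "Control"

# Recode to a factor whose levels are ordered negative first, positive second;
# any raw value matching neither label becomes NA
y <- factor(
  raw_labels,
  levels = c(negative_label_value, positive_label_value),
  labels = c(new_negative_label, new_positive_label)
)

levels(y)
table(y)
```

Placing the negative level first matches the level order stated for `y` in the return value.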
Performs linear transformation of data to the range 0 to 1. Essential for stacking ensembles to normalize risk scores from heterogeneous base learners.
min_max_normalize(x, min_val = NULL, max_val = NULL)
x |
A numeric vector. |
min_val |
Optional reference minimum value (e.g., from training set). |
max_val |
Optional reference maximum value (e.g., from training set). |
A numeric vector of normalized values.
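As an illustration of the transformation, the following is a minimal sketch of min-max scaling with optional reference bounds. The helper `min_max_sketch` is hypothetical and is not the source of `min_max_normalize`; in particular, how the package treats constant vectors or out-of-range values is an assumption here.

```r
# Hypothetical helper illustrating (x - min) / (max - min) scaling
min_max_sketch <- function(x, min_val = NULL, max_val = NULL) {
  # Fall back to the observed range of x when no reference values are given
  if (is.null(min_val)) min_val <- min(x, na.rm = TRUE)
  if (is.null(max_val)) max_val <- max(x, na.rm = TRUE)
  span <- max_val - min_val
  # Assumption: a zero-width range maps every value to 0.5
  if (span == 0) return(rep(0.5, length(x)))
  (x - min_val) / span
}

train_scores <- c(2, 4, 6, 10)
test_scores <- c(3, 12)

# Scale the training scores by their own range
train_scaled <- min_max_sketch(train_scores)

# Reuse the training range so test scores land on the same scale;
# in this sketch, values beyond the training range fall outside 0 to 1
test_scaled <- min_max_sketch(
  test_scores,
  min_val = min(train_scores),
  max_val = max(train_scores)
)
```

Passing the training minimum and maximum when scaling test data keeps base-learner scores comparable across datasets.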
Trains a Multi-Layer Perceptron (MLP) neural network model
using caret::train for binary classification.
mlp_dia(X, y, tune = FALSE, cv_folds = 5)
X |
A data frame of features. |
y |
A factor vector of class labels. |
tune |
Logical, whether to perform hyperparameter tuning using |
cv_folds |
An integer, the number of cross-validation folds for |
A caret::train object representing the trained MLP model.
set.seed(42)
n_obs <- 50
X_toy <- data.frame(
  FeatureA = rnorm(n_obs),
  FeatureB = runif(n_obs, 0, 100)
)
y_toy <- factor(sample(c("Control", "Case"), n_obs, replace = TRUE),
                levels = c("Control", "Case"))

# Train the model
mlp_model <- mlp_dia(X_toy, y_toy)
print(mlp_model)
Trains and evaluates one or more registered diagnostic models on a given dataset.
models_dia(
  data,
  model = "all_dia",
  tune = FALSE,
  seed = 123,
  threshold_choices = "default",
  positive_label_value = 1,
  negative_label_value = 0,
  new_positive_label = "Positive",
  new_negative_label = "Negative"
)
data |
A data frame where the first column is the sample ID, the second is the outcome label, and subsequent columns are features. |
model |
A character string or vector of character strings, specifying which models to run. Use "all_dia" to run all registered models. |
tune |
Logical, whether to enable hyperparameter tuning for individual models. |
seed |
An integer, for reproducibility of random processes. |
threshold_choices |
A character string naming a threshold strategy (e.g., "f1", "youden", "default"), a numeric value between 0 and 1, or a named list/vector that assigns a different strategy or value to each model. |
positive_label_value |
A numeric or character value in the raw data representing the positive class. |
negative_label_value |
A numeric or character value in the raw data representing the negative class. |
new_positive_label |
A character string, the desired factor level name for the positive class (e.g., "Positive"). |
new_negative_label |
A character string, the desired factor level name for the negative class (e.g., "Negative"). |
A named list, where each element corresponds to a run model and
contains its trained model_object, sample_score data frame, and
evaluation_metrics.
initialize_modeling_system_dia, evaluate_model_dia
# This example assumes your package includes a dataset named 'train_dia'.
# If not, you should create a toy data frame similar to the one below.
#
# train_dia <- data.frame(
#   ID = paste0("Patient", 1:100),
#   Disease_Status = sample(c(0, 1), 100, replace = TRUE),
#   FeatureA = rnorm(100),
#   FeatureB = runif(100)
# )

# Ensure the 'train_dia' dataset is available in the environment
# For example, if it is exported by your package:
# data(train_dia)

# Check if 'train_dia' exists, otherwise skip the example
if (exists("train_dia")) {
  # 1. Initialize the modeling system
  initialize_modeling_system_dia()

  # 2. Run selected models
  results <- models_dia(
    data = train_dia,
    model = c("rf", "lasso"), # Run only Random Forest and Lasso
    threshold_choices = list(rf = "f1", lasso = 0.6), # Different thresholds
    positive_label_value = 1,
    negative_label_value = 0,
    new_positive_label = "Case",
    new_negative_label = "Control",
    seed = 42
  )

  # 3. Print summaries
  for (model_name in names(results)) {
    print_model_summary_dia(model_name, results[[model_name]])
  }
}
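To make the "youden" option of `threshold_choices` more concrete, here is a minimal base R sketch of choosing a cutoff by Youden's J statistic (sensitivity + specificity - 1). The helper `youden_threshold` is hypothetical; the package's own threshold search may use different candidate cutoffs or tie handling.

```r
# Hypothetical helper: pick the score cutoff that maximizes Youden's J
youden_threshold <- function(scores, labels, positive = "Positive") {
  candidates <- sort(unique(scores))
  is_pos <- labels == positive
  j_values <- vapply(candidates, function(cutoff) {
    pred_pos <- scores >= cutoff
    sensitivity <- sum(pred_pos & is_pos) / sum(is_pos)
    specificity <- sum(!pred_pos & !is_pos) / sum(!is_pos)
    sensitivity + specificity - 1
  }, numeric(1))
  # which.max() returns the first maximum, so ties resolve to the lowest cutoff
  candidates[which.max(j_values)]
}

# Toy predicted probabilities and true labels
scores <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
labels <- c("Negative", "Negative", "Negative", "Negative", "Positive",
            "Negative", "Positive", "Positive", "Positive")

best_cutoff <- youden_threshold(scores, labels)
predicted <- ifelse(scores >= best_cutoff, "Positive", "Negative")
```

A fixed numeric threshold, such as the `lasso = 0.6` entry in the example above, skips this search and applies the given cutoff directly.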
High-level API to train and evaluate multiple survival models in batch.
models_pro(
  data,
  model = "all_pro",
  tune = FALSE,
  seed = 123,
  time_unit = "day",
  years_to_evaluate = c(1, 3, 5)
)
data |
Input data frame. |
model |
Character vector of model names or "all_pro". |
tune |
Logical. Enable hyperparameter tuning? |
seed |
Random seed. |
time_unit |
Time unit of input. |
years_to_evaluate |
Years for AUC calculation. |
A list of model results.
Trains a Naive Bayes model using caret::train for binary classification.
nb_dia(X, y, tune = FALSE, cv_folds = 5)
X |
A data frame of features. |
y |
A factor vector of class labels. |
tune |
Logical, whether to perform hyperparameter tuning using |
cv_folds |
An integer, the number of cross-validation folds for |
A caret::train object representing the trained Naive Bayes model.
set.seed(42)
n_obs <- 50
X_toy <- data.frame(
  FeatureA = rnorm(n_obs),
  FeatureB = runif(n_obs, 0, 100)
)
y_toy <- factor(sample(c("Control", "Case"), n_obs, replace = TRUE),
                levels = c("Control", "Case"))

# Train the model
nb_model <- nb_dia(X_toy, y_toy)
print(nb_model)
Creates a heatmap visualization with performance metrics across models and datasets, including category annotations and summary bar plots.
plot_integrated_results(results_obj, metric_name = "AUROC", output_file = NULL)
results_obj |
Output from |
metric_name |
Character string for metric used (e.g., "AUROC", "C-index"). |
output_file |
Optional file path to save plot. If NULL, plot is displayed. |
A ggplot object (invisibly).
## Not run: 
results <- int_dia(train_dia, test_dia)
plot_integrated_results(results, "AUROC")

## End(Not run)
Fits a Cox model using Partial Least Squares reduction for high-dimensional data.
pls_pro(X, y_surv, tune = FALSE)
X |
A data frame of predictors. |
y_surv |
A |
tune |
Logical. If TRUE, performs internal tuning (currently handled by cv.glmnet automatically). |
An object of class survival_plsRcox and pro_model.
A unified S3 generic method to generate prognostic risk scores from various trained model objects. This decouples the prediction implementation from the high-level evaluation logic, facilitating extensibility.
predict_pro(object, newdata, ...)
object |
A trained model object with class |
newdata |
A data frame containing features for prediction. |
... |
Additional arguments passed to specific methods. |
A numeric vector representing the prognostic risk score (higher values typically indicate higher risk).
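The S3 dispatch pattern described here can be sketched in a few lines of base R. The generic `predict_sketch` and the class `linear_risk` are hypothetical names used only for illustration; they are not part of the package, and the real methods of `predict_pro` depend on the underlying model type.

```r
# Hypothetical generic: dispatches on the class of the model object
predict_sketch <- function(object, newdata, ...) {
  UseMethod("predict_sketch")
}

# Fallback when no method matches the object's class
predict_sketch.default <- function(object, newdata, ...) {
  stop("No prediction method for class: ", class(object)[1])
}

# Method for a toy linear risk model that stores named coefficients
predict_sketch.linear_risk <- function(object, newdata, ...) {
  features <- as.matrix(newdata[, names(object$coef), drop = FALSE])
  as.numeric(features %*% object$coef)
}

# A toy model object carrying two classes, checked from left to right
toy_model <- structure(
  list(coef = c(var1 = 0.5, var2 = -1)),
  class = c("linear_risk", "pro_model")
)

newdata <- data.frame(var1 = c(2, 0), var2 = c(1, 1))
risk <- predict_sketch(toy_model, newdata)
```

Because evaluation code only calls the generic, supporting a new model type under this pattern means adding one method rather than editing the evaluation logic.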
Prints a formatted summary of the evaluation metrics for a diagnostic model, either from training data or new data evaluation.
print_model_summary_dia(model_name, results_list, on_new_data = FALSE)
model_name |
A character string, the name of the model (e.g., "rf", "Bagging (RF)"). |
results_list |
A list containing model evaluation results, typically
an element from the output of |
on_new_data |
Logical, indicating whether the results are from applying
the model to new, unseen data ( |
NULL. Prints the summary to the console.
# Example for a successfully evaluated model
successful_results <- list(
  evaluation_metrics = list(
    Threshold_Strategy = "f1",
    `_Threshold` = 0.45,
    AUROC = 0.85,
    AUROC_95CI_Lower = 0.75,
    AUROC_95CI_Upper = 0.95,
    AUPRC = 0.80,
    Accuracy = 0.82,
    F1 = 0.78,
    Precision = 0.79,
    Recall = 0.77,
    Specificity = 0.85
  )
)
print_model_summary_dia("MyAwesomeModel", successful_results)

# Example for a failed model
failed_results <- list(evaluation_metrics = list(error = "Training failed"))
print_model_summary_dia("MyFailedModel", failed_results)
Formatted console output of model performance.
print_model_summary_pro(model_name, results_list)
model_name |
Name of the model. |
results_list |
Result object containing |
Trains a Quadratic Discriminant Analysis (QDA) model using caret::train
for binary classification.
qda_dia(X, y, tune = FALSE, cv_folds = 5)
X |
A data frame of features. |
y |
A factor vector of class labels. |
tune |
Logical, whether to perform hyperparameter tuning (currently ignored for QDA). |
cv_folds |
An integer, the number of cross-validation folds for |
A caret::train object representing the trained QDA model.
set.seed(42)
n_obs <- 50
X_toy <- data.frame(
  FeatureA = rnorm(n_obs),
  FeatureB = runif(n_obs, 0, 100)
)
y_toy <- factor(sample(c("Control", "Case"), n_obs, replace = TRUE),
                levels = c("Control", "Case"))

# Train the model
qda_model <- qda_dia(X_toy, y_toy)
print(qda_model)
Registers a user-defined or pre-defined diagnostic model function with the internal model registry. This allows the function to be called later by its registered name, facilitating a modular model management system.
register_model_dia(name, func)
name |
A character string, the unique name to register the model under. |
func |
A function, the R function implementing the diagnostic model.
This function should typically accept |
Invisible NULL. Called for its side effect of adding the model function to the registry.
get_registered_models_dia, initialize_modeling_system_dia
# Example of a dummy model function for registration
my_dummy_rf_model <- function(X, y, tune = FALSE, cv_folds = 5) {
  message("Training dummy RF model...")
  # This is a placeholder and doesn't train a real model.
  # It returns a list with a structure similar to a caret train object.
  list(method = "dummy_rf")
}

# Initialize the system before registering
initialize_modeling_system_dia()

# Register the new model
register_model_dia("dummy_rf", my_dummy_rf_model)

# Verify that the model is now in the list of registered models
"dummy_rf" %in% names(get_registered_models_dia())
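For readers curious how a name-to-function registry of this kind can work, the following is a minimal sketch built on an R environment. All names here (`.registry_sketch`, `register_sketch`, `get_registered_sketch`) are hypothetical; the package's internal registry may be organized differently.

```r
# Hypothetical registry: an environment mapping model names to functions
.registry_sketch <- new.env(parent = emptyenv())

register_sketch <- function(name, func) {
  stopifnot(is.character(name), length(name) == 1, is.function(func))
  assign(name, func, envir = .registry_sketch)
  invisible(NULL)
}

get_registered_sketch <- function() {
  as.list(.registry_sketch)
}

# Register a toy "model" that always predicts the most frequent class
register_sketch("majority", function(X, y, tune = FALSE, cv_folds = 5) {
  list(method = "majority", prediction = names(which.max(table(y))))
})

# Look the model up by name and call it, as batch code might do
train_fun <- get_registered_sketch()[["majority"]]
fit <- train_fun(
  X = data.frame(a = 1:3),
  y = factor(c("Case", "Case", "Control"))
)
```

Keeping the signature `function(X, y, tune, cv_folds)` consistent across registered functions is what lets batch code call any model by name.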
Registers a model function into the internal system environment, making it available for batch execution.
register_model_pro(name, func)
name |
String identifier for the model. |
func |
The model training function. |
Trains a Random Forest model using caret::train for binary classification.
rf_dia(X, y, tune = FALSE, cv_folds = 5)
X |
A data frame of features. |
y |
A factor vector of class labels. |
tune |
Logical, whether to perform hyperparameter tuning using |
cv_folds |
An integer, the number of cross-validation folds for |
A caret::train object representing the trained Random Forest model.
set.seed(42)
n_obs <- 50
X_toy <- data.frame(
  FeatureA = rnorm(n_obs),
  FeatureB = runif(n_obs, 0, 100)
)
y_toy <- factor(sample(c("Control", "Case"), n_obs, replace = TRUE),
                levels = c("Control", "Case"))

# Train the model
rf_model <- rf_dia(X_toy, y_toy)
print(rf_model)
Trains a Ridge-regularized logistic regression model using caret::train
(via glmnet method) for binary classification.
ridge_dia(X, y, tune = FALSE, cv_folds = 5)
X |
A data frame of features. |
y |
A factor vector of class labels. |
tune |
Logical, whether to perform hyperparameter tuning for |
cv_folds |
An integer, the number of cross-validation folds for |
A caret::train object representing the trained Ridge model.
set.seed(42)
n_obs <- 50
X_toy <- data.frame(
  FeatureA = rnorm(n_obs),
  FeatureB = runif(n_obs, 0, 100)
)
y_toy <- factor(sample(c("Control", "Case"), n_obs, replace = TRUE),
                levels = c("Control", "Case"))

# Train the model
ridge_model <- ridge_dia(X_toy, y_toy)
print(ridge_model)
Fits a Cox model with Ridge (L2) regularization.
ridge_pro(X, y_surv, tune = FALSE)
X |
A data frame of predictors. |
y_surv |
A |
tune |
Logical. If TRUE, performs internal tuning (currently handled by cv.glmnet automatically). |
An object of class survival_glmnet and pro_model.
Fits a Random Survival Forest using the log-rank splitting rule.
Includes capabilities for hyperparameter tuning via grid search over ntree,
nodesize, and mtry.
rsf_pro(X, y_surv, tune = FALSE, tune_params = NULL)
X |
A data frame of predictors. |
y_surv |
A |
tune |
Logical. If TRUE, performs grid search for optimal hyperparameters based on C-index. |
tune_params |
Optional data frame containing the grid for tuning. |
An object of class survival_rsf and pro_model.
Implements a Stacking ensemble. It trains multiple base models, then uses their predictions as features to train a meta-model.
stacking_dia(
  results_all_models,
  data,
  meta_model_name,
  top = 5,
  tune_meta = FALSE,
  threshold_choices = "f1",
  seed = 789,
  positive_label_value = 1,
  negative_label_value = 0,
  new_positive_label = "Positive",
  new_negative_label = "Negative"
)
results_all_models |
A list of results from |
data |
A data frame where the first column is the sample ID, the second is the outcome label, and subsequent columns are features. Used for training the meta-model. |
meta_model_name |
A character string, the name of the meta-model to use (e.g., "lasso", "gbm"). This model must be registered. |
top |
An integer, the number of top-performing base models (ranked by AUROC) to select for the stacking ensemble. |
tune_meta |
Logical, whether to enable tuning for the meta-model. |
threshold_choices |
A character string (e.g., "f1", "youden", "default") or a numeric value (0-1) for determining the evaluation threshold for the ensemble. |
seed |
An integer, for reproducibility. |
positive_label_value |
A numeric or character value in the raw data representing the positive class. |
negative_label_value |
A numeric or character value in the raw data representing the negative class. |
new_positive_label |
A character string, the desired factor level name for the positive class (e.g., "Positive"). |
new_negative_label |
A character string, the desired factor level name for the negative class (e.g., "Negative"). |
A list containing the model_object, sample_score, and evaluation_metrics.
models_dia, evaluate_model_dia
# 1. Initialize the modeling system
initialize_modeling_system_dia()

# 2. Create a toy dataset for demonstration
set.seed(42)
data_toy <- data.frame(
  ID = paste0("Sample", 1:60),
  Status = sample(c(0, 1), 60, replace = TRUE),
  Feat1 = rnorm(60),
  Feat2 = runif(60)
)

# 3. Generate mock base model results (as if from models_dia)
# In a real scenario, you would run models_dia() on your full dataset
base_model_results <- models_dia(
  data = data_toy,
  model = c("rf", "lasso"),
  seed = 123
)

# 4. Run the stacking ensemble
stacking_results <- stacking_dia(
  results_all_models = base_model_results,
  data = data_toy,
  meta_model_name = "gbm",
  top = 2,
  threshold_choices = "f1"
)
print_model_summary_dia("Stacking (GBM)", stacking_results)
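The stacking idea itself (base-model predictions used as features for a meta-model) can be sketched without the package, using only `stats::glm`. This is an illustrative simplification, not the implementation of `stacking_dia`: both the base learners and the meta-learner are logistic regressions, and the meta-features are in-sample predictions, whereas out-of-fold predictions would be the stricter choice.

```r
set.seed(42)
n <- 200
toy <- data.frame(Feat1 = rnorm(n), Feat2 = rnorm(n))
# Simulated binary outcome that depends on both features
toy$Status <- rbinom(n, 1, plogis(1.5 * toy$Feat1 - toy$Feat2))

# Level 0: two base learners, each seeing a different feature
base_a <- glm(Status ~ Feat1, data = toy, family = binomial())
base_b <- glm(Status ~ Feat2, data = toy, family = binomial())

# Base-model probabilities become the meta-features
meta_features <- data.frame(
  score_a = predict(base_a, type = "response"),
  score_b = predict(base_b, type = "response"),
  Status = toy$Status
)

# Level 1: the meta-model learns how to combine the base scores
meta_model <- glm(Status ~ score_a + score_b,
                  data = meta_features, family = binomial())
stacked_score <- predict(meta_model, type = "response")
```

In `stacking_dia`, the `top` argument restricts which base models contribute meta-features, by AUROC rank; the sketch simply uses both.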
Implements a Stacking Ensemble (Super Learner). It uses the risk scores from top-performing base models as meta-features to train a second-level meta-learner.
stacking_pro(
  results_all_models,
  data,
  meta_model_name,
  top = 3,
  tune_meta = FALSE,
  time_unit = "day",
  years_to_evaluate = c(1, 3, 5),
  seed = 789
)
results_all_models |
List of results from |
data |
Training data. |
meta_model_name |
Name of the meta-learner (e.g., "lasso_pro"). |
top |
Integer. Number of top base models to include based on C-index. |
tune_meta |
Logical. Tune the meta-learner? |
time_unit |
Time unit. |
years_to_evaluate |
Evaluation years. |
seed |
Integer seed. |
A list containing the stacking object and evaluation results.
Fits a Cox model and performs backward stepwise selection based on AIC.
stepcox_pro(X, y_surv, tune = FALSE)
X |
A data frame of predictors. |
y_surv |
A |
tune |
Logical. If TRUE, performs internal tuning (currently handled by cv.glmnet automatically). |
An object of class survival_stepcox and pro_model.
Trains a Support Vector Machine (SVM) model with a linear kernel
using caret::train for binary classification.
svm_dia(X, y, tune = FALSE, cv_folds = 5)
X |
A data frame of features. |
y |
A factor vector of class labels. |
tune |
Logical, whether to perform hyperparameter tuning using |
cv_folds |
An integer, the number of cross-validation folds for |
A caret::train object representing the trained SVM model.
set.seed(42)
n_obs <- 50
X_toy <- data.frame(
  FeatureA = rnorm(n_obs),
  FeatureB = runif(n_obs, 0, 100)
)
y_toy <- factor(sample(c("Control", "Case"), n_obs, replace = TRUE),
                levels = c("Control", "Case"))

# Train the model
svm_model <- svm_dia(X_toy, y_toy)
print(svm_model)
A test dataset for evaluating diagnostic models, with a structure
identical to train_dia.
test_dia
A data frame with rows for samples and 22 columns:
character. Unique identifier for each sample.
integer. The binary outcome (0 or 1).
The remaining 20 columns: numeric. Gene expression levels.
Stored in data/test_dia.rda.
A test dataset for evaluating prognostic models, with a structure
identical to train_pro.
test_pro
A data frame with rows for samples and 31 columns:
character. Unique identifier for each sample.
integer. The event status (0 or 1).
numeric. The time to event or censoring.
The remaining 28 columns: numeric. Gene expression levels.
Stored in data/test_pro.rda.
A training dataset for diagnostic models, containing sample IDs, binary outcomes, and gene expression features.
train_dia
A data frame with rows for samples and 22 columns:
character. Unique identifier for each sample.
integer. The binary outcome, where 1 typically represents a positive case and 0 a negative case.
The remaining 20 columns: numeric. Gene expression levels.
This dataset is used to train machine learning models for diagnosis. The column names starting with 'AC', 'AL', 'LINC', etc., are feature variables.
Stored in data/train_dia.rda.
A training dataset for prognostic models, containing sample IDs, survival outcomes (time and event status), and gene expression features.
train_pro
A data frame with rows for samples and 31 columns:
character. Unique identifier for each sample.
integer. The event status, where 1 indicates an event occurred and 0 indicates censoring.
numeric. The time to event or censoring.
The remaining 28 columns: numeric. Gene expression levels.
This dataset is used to train machine learning models for prognosis. The features are typically gene expression values.
Stored in data/train_pro.rda.
Implements a Voting ensemble, combining predictions from multiple base models through soft or hard voting.
voting_dia( results_all_models, data, type = c("soft", "hard"), weight_metric = "AUROC", top = 5, seed = 789, threshold_choices = "f1", positive_label_value = 1, negative_label_value = 0, new_positive_label = "Positive", new_negative_label = "Negative" )
results_all_models |
A list of results from |
data |
A data frame where the first column is the sample ID, the second is the outcome label, and subsequent columns are features. Used for evaluation. |
type |
A character string, "soft" for weighted average of probabilities or "hard" for majority class voting. |
weight_metric |
A character string, the metric to use for weighting base models in soft voting (e.g., "AUROC", "F1"). Ignored for hard voting. |
top |
An integer, the number of top-performing base models (ranked by
|
seed |
An integer, for reproducibility. |
threshold_choices |
A character string (e.g., "f1", "youden", "default") or a numeric value (0-1) for determining the evaluation threshold for the ensemble. |
positive_label_value |
A numeric or character value in the raw data representing the positive class. |
negative_label_value |
A numeric or character value in the raw data representing the negative class. |
new_positive_label |
A character string, the desired factor level name for the positive class (e.g., "Positive"). |
new_negative_label |
A character string, the desired factor level name for the negative class (e.g., "Negative"). |
A list containing the model_object, sample_score, and evaluation_metrics.
models_dia, evaluate_model_dia
# 1. Initialize the modeling system
initialize_modeling_system_dia()
# 2. Create a toy dataset for demonstration
set.seed(42)
data_toy <- data.frame(
  ID = paste0("Sample", 1:60),
  Status = sample(c(0, 1), 60, replace = TRUE),
  Feat1 = rnorm(60),
  Feat2 = runif(60)
)
# 3. Train the base models with models_dia
base_model_results <- models_dia(
  data = data_toy,
  model = c("rf", "lasso"),
  seed = 123
)
# 4. Run the soft voting ensemble
soft_voting_results <- voting_dia(
  results_all_models = base_model_results,
  data = data_toy,
  type = "soft",
  weight_metric = "AUROC",
  top = 2,
  threshold_choices = "f1"
)
print_model_summary_dia("Soft Voting", soft_voting_results)
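The difference between soft and hard voting described for the type argument can be illustrated in base R. The sketch below is not taken from voting_dia: the probabilities, the AUROC values, the weight normalisation, and the 0.5 cut-off are hypothetical choices made for the example.

```r
# Hypothetical positive-class probabilities from three base models
# (rows = samples, columns = models)
probs <- cbind(
  rf    = c(0.90, 0.20, 0.60, 0.40),
  lasso = c(0.80, 0.30, 0.40, 0.30),
  svm   = c(0.70, 0.10, 0.55, 0.60)
)

# Hypothetical AUROC of each base model, used as soft-voting weights
auroc <- c(rf = 0.85, lasso = 0.80, svm = 0.75)
w <- auroc / sum(auroc)  # normalise so the weights sum to 1

# Soft voting: weighted average of the predicted probabilities
soft_score <- as.vector(probs %*% w)

# Hard voting: each model casts one vote at a 0.5 cut-off, majority wins
votes      <- probs >= 0.5
hard_score <- rowMeans(votes)  # fraction of models voting "Positive"
hard_class <- ifelse(hard_score > 0.5, "Positive", "Negative")

print(round(soft_score, 3))
print(hard_class)
```

Soft voting keeps the confidence of each model and lets better-performing models count for more, while hard voting only uses each model's class decision, which is why a weighting metric has no effect on it.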
Trains an Extreme Gradient Boosting (XGBoost) model using caret::train
for binary classification.
xb_dia(X, y, tune = FALSE, cv_folds = 5, tune_length = 20)
X |
A data frame of features. |
y |
A factor vector of class labels. |
tune |
Logical, whether to perform hyperparameter tuning using |
cv_folds |
An integer, the number of cross-validation folds for |
tune_length |
An integer, the number of random parameter combinations to try when tune=TRUE. Only used when search="random". Default is 20. |
A caret::train object representing the trained XGBoost model.
Fits an XGBoost model using the Cox proportional hazards objective function.
xgb_pro(X, y_surv, tune = FALSE)
X |
A data frame of predictors. |
y_surv |
A |
tune |
Logical. If TRUE, performs internal tuning (currently handled by cv.glmnet automatically). |
An object of class survival_xgboost and pro_model.
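XGBoost's Cox objective ("survival:cox") expects the outcome as a single numeric label: the observed time, made negative for censored samples. The sketch below shows that encoding on simulated data and, if the xgboost package is installed, fits a small model with it. Whether xgb_pro prepares its labels in exactly this way is an assumption; the data and hyperparameters are hypothetical.

```r
set.seed(123)
n <- 100
X <- data.frame(g1 = rnorm(n), g2 = rnorm(n))

# Simulated follow-up times and event indicator (1 = event, 0 = censored)
time   <- rexp(n, rate = 0.1 * exp(0.5 * X$g1))
status <- rbinom(n, 1, 0.7)

# Encode censoring in the sign of the label, as "survival:cox" expects
cox_label <- ifelse(status == 1, time, -time)

if (requireNamespace("xgboost", quietly = TRUE)) {
  dtrain <- xgboost::xgb.DMatrix(data = as.matrix(X), label = cox_label)
  fit <- xgboost::xgb.train(
    params  = list(objective = "survival:cox", eta = 0.1, max_depth = 2),
    data    = dtrain,
    nrounds = 20
  )
  # Predictions are relative risk scores: larger values mean higher hazard
  risk <- predict(fit, dtrain)
  print(head(risk))
}
```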