Generates a (list of) lgb.Dataset. Unsupported for clusters. Requires the Matrix and lightgbm packages.

Laurae.lgb.dmat(data, label = NULL, missing = NA, save_names = NULL,
  save_keep = TRUE, clean_mem = FALSE, progress_bar = TRUE, ...)

Arguments

data

Type: matrix or dgCMatrix or data.frame or data.table or filename, or potentially a list of any of them. The data to convert to lgb.Dataset. When a list is provided, an lgb.Dataset is generated for each element of the list. RAM usage required is 2x the RAM usage of the input data, and 3x for data.frame and data.table because of the internal matrix conversion performed before binary matrix generation.
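
As a rough pre-flight check, the memory multipliers stated above can be applied to the in-memory size of the input. The sketch below uses only base R; the helper name estimate_peak_ram is hypothetical and not part of the package, and the result is only as accurate as the 2x and 3x figures themselves.

```r
# Hypothetical helper (not part of the package): applies the documented
# 2x / 3x multipliers to the in-memory size of the input object.
estimate_peak_ram <- function(data) {
  input_bytes <- as.numeric(object.size(data))
  # data.frame and data.table go through an intermediate matrix copy (3x);
  # data.table objects also inherit from data.frame, so one check covers both.
  multiplier <- if (is.data.frame(data)) 3 else 2
  input_bytes * multiplier
}

random_mat <- matrix(runif(10000, 0, 1), nrow = 1000)
random_df <- data.frame(random_mat)

mat_peak <- estimate_peak_ram(random_mat)
df_peak <- estimate_peak_ram(random_df)
cat(sprintf("matrix: ~%.2f MB, data.frame: ~%.2f MB\n",
            mat_peak / 1024^2, df_peak / 1024^2))
```

For a list input, the same helper could be applied to each element with sapply to see which element dominates.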

label

Type: numeric, or a list of numeric. The label of associated rows in data. Use NULL for passing no labels.

missing

Type: numeric. The value used to represent missing values in data. Defaults to NA (and missing values for dgCMatrix).

save_names

Type: character or NULL, or a list of characters. If names are provided, the generated lgb.Dataset are stored physically to the drive. When data is a list (along with a list of labels), save_names may also be a list; if a list is provided for data but not for save_names, files are stored sequentially by name. Defaults to NULL.
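
A minimal sketch of saving a list of datasets to disk follows. It assumes the function is exported by a package named Laurae, that save_names accepts one path per list element, and that the file names and the ".buffer" extension (both hypothetical) are acceptable. The call is guarded so the snippet does nothing when the packages are not installed.

```r
# Illustrative output paths; the names and the ".buffer" extension are
# assumptions, not something the package prescribes.
out_dir <- tempdir()
save_paths <- list(file.path(out_dir, "train.buffer"),
                   file.path(out_dir, "valid.buffer"))

# Guarded: skipped entirely when the required packages are unavailable.
if (requireNamespace("Laurae", quietly = TRUE) &&
    requireNamespace("lightgbm", quietly = TRUE)) {
  set.seed(0)
  mats <- list(matrix(runif(10000), nrow = 1000),
               matrix(runif(5000), nrow = 500))
  labs <- list(runif(1000), runif(500))
  # Save both datasets to disk, but ask for only the second to be returned
  dsets <- Laurae::Laurae.lgb.dmat(data = mats, label = labs,
                                   save_names = save_paths,
                                   save_keep = c(FALSE, TRUE))
  print(file.exists(unlist(save_paths)))
}
```

Writing to tempdir() keeps the example from leaving files in the working directory; real use would point save_names at a persistent location.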

save_keep

Type: logical, or a list of logicals. When names are provided, save_keep selects which lgb.Dataset are retained and returned to the user. Useful when generating a list of lgb.Dataset but keeping only some of them. When FALSE, NULL is returned in place of the lgb.Dataset. Defaults to TRUE.

clean_mem

Type: logical. Whether to force garbage collection at the end of each matrix construction in order to reclaim RAM. Defaults to FALSE.

progress_bar

Type: logical. Whether to print a progress bar in case of list inputs. Defaults to TRUE.
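
To picture how clean_mem and progress_bar interact on list inputs, here is a conceptual sketch in base R. It is not the package's implementation: convert_one and convert_all are hypothetical names, as.matrix stands in for the real lgb.Dataset construction, and the progress bar comes from utils::txtProgressBar rather than whatever the package uses internally.

```r
# Conceptual sketch only: convert_one() is a hypothetical stand-in for the
# real per-element lgb.Dataset construction.
convert_one <- function(x) as.matrix(x)

convert_all <- function(data_list, clean_mem = FALSE, progress_bar = TRUE) {
  n <- length(data_list)
  out <- vector("list", n)
  # txtProgressBar requires max > min, so skip the bar for empty input
  progress_bar <- progress_bar && n > 0
  if (progress_bar) pb <- txtProgressBar(min = 0, max = n, style = 3)
  for (i in seq_along(data_list)) {
    out[[i]] <- convert_one(data_list[[i]])
    # Force garbage collection after each element to reclaim RAM early
    if (clean_mem) invisible(gc(verbose = FALSE))
    if (progress_bar) setTxtProgressBar(pb, i)
  }
  if (progress_bar) close(pb)
  out
}

res <- convert_all(list(matrix(1:4, 2), data.frame(a = 1:2, b = 3:4)),
                   clean_mem = TRUE, progress_bar = FALSE)
```

Calling gc() inside the loop trades some speed for a lower peak memory footprint, which is the same trade-off clean_mem = TRUE describes.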

...

More arguments to pass to lightgbm::lgb.Dataset.

Value

The lgb.Dataset, or a list of lgb.Dataset when a list is provided for data.

Examples

library(Matrix)
library(lightgbm)
set.seed(0)

# Generate lgb.Dataset from matrix
random_mat <- matrix(runif(10000, 0, 1), nrow = 1000)
random_labels <- runif(1000, 0, 1)
lgb_from_mat <- Laurae.lgb.dmat(data = random_mat,
                                label = random_labels,
                                missing = NA)

# Generate lgb.Dataset from data.frame
random_df <- data.frame(random_mat)
random_labels_2 <- runif(1000, 0, 1)
lgb_from_df <- Laurae.lgb.dmat(data = random_df,
                               label = random_labels,
                               missing = NA)

# Generate lgb.Dataset from respective elements of a list with progress bar
# while keeping memory usage as low as theoretically possible
random_list <- list(random_mat, random_df)
random_labels_3 <- list(random_labels, random_labels_2)
lgb_from_list <- Laurae.lgb.dmat(data = random_list,
                                 label = random_labels_3,
                                 missing = NA,
                                 progress_bar = TRUE,
                                 clean_mem = TRUE)
#> | | 0 % ~calculating |+++++++++++++++++++++++++ | 50% ~00s |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 00s

# Generate lgb.Dataset from respective elements of a list and keep only first
# while keeping memory usage as low as theoretically possible
lgb_from_list <- Laurae.lgb.dmat(data = random_list,
                                 label = random_labels_3,
                                 missing = NA,
                                 save_keep = c(TRUE, FALSE),
                                 clean_mem = TRUE)
#> | | 0 % ~calculating |+++++++++++++++++++++++++ | 50% ~00s |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 00s