(Un)Stratified k-fold for any type of label: kfold • LauraeDS

This function creates (un)stratified folds from a label vector.

kfold(y, k = 5, type = "random", seed = 0, named = TRUE)

Arguments

y	Type: numeric. The label vector (not a factor).
k	Type: integer. The number of folds to create. Causes issues if `length(y) < k` (i.e., more folds than samples). Defaults to `5`.
type	Type: character. How the folds are built: `stratified` (keep the same label proportions in every fold, for classification), `treatment` (make the folds mutually exclusive with respect to the label vector, which is then used as a grouping vector), `pseudo` (pseudo-random, attempts to minimize the variance between folds, for regression), or `random` (fully random folds). Defaults to `random`.
seed	Type: integer. The seed for the random number generator. Defaults to `0`.
named	Type: boolean. Whether the folds should be named. Defaults to `TRUE`.
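
To make the `stratified` option more concrete, here is a minimal base R sketch of one common way to build stratified folds: shuffle the rows of each label value, then deal them out to the folds in turn. The helper `stratified_folds_sketch` is hypothetical and is not the LauraeDS implementation; it only mirrors the documented arguments `y`, `k`, and `seed`.

```r
# Hypothetical helper, not part of LauraeDS: a base R sketch of stratified folds.
stratified_folds_sketch <- function(y, k = 5, seed = 0) {
  stopifnot(length(y) >= k)  # more folds than samples cannot work
  set.seed(seed)
  fold_id <- integer(length(y))
  for (rows in split(seq_along(y), y)) {
    # Shuffle the rows carrying this label, then deal them out round-robin.
    rows <- rows[sample.int(length(rows))]
    fold_id[rows] <- rep_len(seq_len(k), length(rows))
  }
  # Using a factor keeps all k folds in the result, even if one ends up empty.
  folds <- split(seq_along(y), factor(fold_id, levels = seq_len(k)))
  names(folds) <- paste0("Fold", seq_len(k))
  folds
}

y <- c(rep(0, 250), rep(1, 250))
folds <- stratified_folds_sketch(y, k = 5, seed = 111)
fold_means <- sapply(folds, function(idx) mean(y[idx]))
```

Because every label value is dealt out evenly, each fold in this toy case holds the same share of both classes. Always starting the round-robin at the first fold is a simplification that can unbalance fold sizes when there are many small classes.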

Value

A list with one integer vector per fold, where each integer is a row number.
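
A list of row numbers like this can drive a cross-validation loop directly. The sketch below uses a hand-built fold list with the same shape as the return value, so the toy data, the fold contents, and the `lm` model are all made up for illustration. It also assumes that each fold's rows are the held-out rows, which this page does not state explicitly.

```r
# Toy data and a hand-built fold list shaped like the documented return value
# (a named list of integer row numbers). The contents are illustrative only.
df <- data.frame(x = 1:10, y = (1:10) * 2)
folds <- list(Fold1 = c(1L, 6L), Fold2 = c(2L, 7L), Fold3 = c(3L, 8L),
              Fold4 = c(4L, 9L), Fold5 = c(5L, 10L))

rmse <- numeric(length(folds))
for (i in seq_along(folds)) {
  test_rows <- folds[[i]]       # assumption: the fold's rows are held out
  train <- df[-test_rows, ]     # every other row is used for fitting
  test <- df[test_rows, ]
  fit <- lm(y ~ x, data = train)
  rmse[i] <- sqrt(mean((predict(fit, newdata = test) - test$y)^2))
}
names(rmse) <- names(folds)
```

Negative indexing (`df[-test_rows, ]`) is what turns a vector of held-out rows into the complementary training set.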

Details

Unlike Laurae::kfold, please do not use `stratified` for regression; use `pseudo` instead. I received complaints about weird fold generation when using stratification with regression labels: it just does not work the way it was intended. Use `stratified` for classification stratification and `pseudo` for regression stratification.
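
One way to see why stratifying on a regression label is problematic: when nearly every label value is unique, each stratum holds a single row and there is nothing left to balance. The sketch below shows that, followed by a generic workaround that sorts the label and shuffles fold assignments within consecutive blocks of `k` values. The helper `blocked_folds_sketch` is hypothetical; it is not claimed to be what `pseudo` does internally.

```r
y <- 1:5000

# With a continuous label, every value here is its own stratum of size 1.
largest_stratum <- max(table(y))

# Generic workaround (an assumption, not the LauraeDS algorithm): sort the
# label, then hand out a shuffled set of fold ids within each block of k rows.
blocked_folds_sketch <- function(y, k = 5, seed = 0) {
  set.seed(seed)
  ord <- order(y)
  block <- ceiling(seq_along(ord) / k)
  fold_id <- integer(length(y))
  for (rows in split(ord, block)) {
    # A trailing short block receives only the first few shuffled ids.
    fold_id[rows] <- sample.int(k)[seq_along(rows)]
  }
  split(seq_along(y), factor(fold_id, levels = seq_len(k)))
}

folds <- blocked_folds_sketch(y, k = 5, seed = 111)
fold_means <- sapply(folds, function(idx) mean(y[idx]))
```

Each fold receives exactly one value from every block of five neighbouring labels, so the fold means cannot drift far from the overall mean.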

Examples

# Reproducible stratified folds (pseudo, for regression labels)
data <- 1:5000
folds1 <- kfold(y = data, k = 5, type = "pseudo", seed = 111)
folds2 <- kfold(y = data, k = 5, type = "pseudo", seed = 111)
identical(folds1, folds2)
#> [1] TRUE

# Treatments
data <- rep(1:50, each = 50)
str(kfold(y = data, k = 5, type = "treatment"))
#> List of 5
#>  $ Fold1: int [1:500] 451 452 453 454 455 456 457 458 459 460 ...
#>  $ Fold2: int [1:500] 101 102 103 104 105 106 107 108 109 110 ...
#>  $ Fold3: int [1:500] 1 2 3 4 5 6 7 8 9 10 ...
#>  $ Fold4: int [1:500] 151 152 153 154 155 156 157 158 159 160 ...
#>  $ Fold5: int [1:500] 51 52 53 54 55 56 57 58 59 60 ...
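
The output above suggests that every label value ends up entirely inside one fold. The sketch below reproduces that kind of group-wise split in base R and checks the property. It is a hypothetical illustration, not the LauraeDS code, and the variable names are made up.

```r
y <- rep(1:50, each = 50)

# Hypothetical group-wise split (not the LauraeDS code): assign each distinct
# label value to one fold, then collect the rows carrying that value.
set.seed(0)
labels <- unique(y)
label_fold <- sample(rep_len(seq_len(5), length(labels)))
fold_id <- label_fold[match(y, labels)]
folds <- split(seq_along(y), fold_id)
names(folds) <- paste0("Fold", seq_along(folds))

# Each label value should appear in exactly one fold.
folds_per_label <- tapply(fold_id, y, function(f) length(unique(f)))
exclusive <- all(folds_per_label == 1)
```

The same `tapply` check can be run on any fold list by first expanding it into a per-row fold id.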

# Stratified Classification
data <- c(rep(0, 250), rep(1, 250))
folds <- kfold(y = data, k = 5, type = "stratified")
for (i in seq_along(folds)) {
  print(mean(data[folds[[i]]]))
}
#> [1] 0.5
#> [1] 0.5
#> [1] 0.5
#> [1] 0.5
#> [1] 0.5

# Stratified Regression
data <- 1:5000
folds <- kfold(y = data, k = 5, type = "pseudo")
for (i in seq_along(folds)) {
  print(mean(data[folds[[i]]]))
}
#> [1] 2504.919
#> [1] 2483.742
#> [1] 2496.716
#> [1] 2500.756
#> [1] 2516.367

# Stratified Multi-class Classification
data <- c(rep(0, 250), rep(1, 250), rep(2, 250))
folds <- kfold(y = data, k = 5, type = "stratified")
for (i in seq_along(folds)) {
  print(mean(data[folds[[i]]]))
}
#> [1] 1
#> [1] 1
#> [1] 1
#> [1] 1
#> [1] 1

# Unstratified Regression
data <- 1:5000
folds <- kfold(y = data, k = 5, type = "random")
for (i in seq_along(folds)) {
  print(mean(data[folds[[i]]]))
}
#> [1] 2527.465
#> [1] 2446.88
#> [1] 2518.532
#> [1] 2502.391
#> [1] 2507.232

# Unstratified Multi-class Classification
data <- c(rep(0, 250), rep(1, 250), rep(2, 250))
folds <- kfold(y = data, k = 5, type = "random")
for (i in seq_along(folds)) {
  print(mean(data[folds[[i]]]))
}
#> [1] 0.9866667
#> [1] 0.96
#> [1] 1.066667
#> [1] 0.92
#> [1] 1.066667
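
For multi-class labels, the mean of the label is only a rough balance check, since different class mixes can share the same mean. A per-fold class table is more informative. The sketch below builds random fold ids by hand (a hypothetical stand-in for `type = "random"`, not the LauraeDS code) and tabulates labels per fold.

```r
y <- c(rep(0, 250), rep(1, 250), rep(2, 250))

# Hypothetical random folds: shuffle an evenly repeated vector of fold ids.
set.seed(0)
fold_id <- sample(rep_len(seq_len(5), length(y)))

# Rows are folds, columns are label values.
counts <- table(fold = fold_id, label = y)
proportions <- prop.table(counts, margin = 1)  # class shares within each fold
```

With unstratified folds the per-fold class shares typically wander around one third each; with stratified folds they would all sit at one third.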
