Last Updated on July 28, 2022 by Jay
In this tutorial we’ll show how to do a train/test split in R. Unlike Python, where scikit-learn’s train_test_split() does it in one call, base R takes a few lines of code, but don’t worry, we’ll go through the steps.
Dataset
Let’s load the iris dataset for the demonstration. The dataset contains 150 records. Let’s not worry about what is inside the dataset; our focus is on how to split the data.
library(datasets)
data("iris")
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
nrow(iris)
[1] 150
We will split the data into 75% training and 25% test data.
train_pct = 0.75
Universal Train Test Split Approach In R
Although existing R libraries offer convenient functions for the train test split, we want to create a universal solution so that we don’t have to rely on any library.
In R, the function rownames() returns the row names of a data.frame object, similar to .index in pandas. By default, the row names are the numbers 1 to n (the size of the dataset), returned as character strings, as the quoted output below shows.
class(iris)
[1] "data.frame"
rownames(iris)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" "16" "17"
[18] "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30" "31" "32" "33" "34"
[35] "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45" "46" "47" "48" "49" "50" "51"
[52] "52" "53" "54" "55" "56" "57" "58" "59" "60" "61" "62" "63" "64" "65" "66" "67" "68"
[69] "69" "70" "71" "72" "73" "74" "75" "76" "77" "78" "79" "80" "81" "82" "83" "84" "85"
[86] "86" "87" "88" "89" "90" "91" "92" "93" "94" "95" "96" "97" "98" "99" "100" "101" "102"
[103] "103" "104" "105" "106" "107" "108" "109" "110" "111" "112" "113" "114" "115" "116" "117" "118" "119"
[120] "120" "121" "122" "123" "124" "125" "126" "127" "128" "129" "130" "131" "132" "133" "134" "135" "136"
[137] "137" "138" "139" "140" "141" "142" "143" "144" "145" "146" "147" "148" "149" "150"
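Because rownames() returns character strings, it can help to see how they relate to integer row positions. The following is a minimal sketch, assuming a freshly loaded iris with its default row names; row_labels and row_positions are illustrative names, not part of the original code.

```r
library(datasets)
data("iris")

# rownames() returns character strings, not numbers
row_labels <- rownames(iris)
is.character(row_labels)   # TRUE

# Integer positions 1..n, independent of whatever the row names are
row_positions <- seq_len(nrow(iris))

# With default row names, the labels convert cleanly to the positions
identical(as.integer(row_labels), row_positions)   # TRUE
```

Sampling integer positions rather than labels keeps the approach working even when a data frame carries custom row names.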
Our approach is simple:
- Randomly select 112 (~75% of the total records) row indexes from the total of 150 records.
- Add an indicator column “train_data” to the dataframe, assigning the value of 1 for those 112 records.
The sample() function randomly selects a sample of a specified size from a set of items. By default it samples without replacement, so no item is picked twice. For example, sample(20, 10) randomly selects 10 distinct numbers from the integers 1 to 20.
set.seed(0) ## ensures reproducibility
sample(20, 10)
[1] 14 4 7 1 2 13 18 11 16 15
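The properties that matter for a train/test split can be checked directly. This is a small sketch, with picked as an illustrative variable name, showing that the draw has the requested size, contains no duplicates, and is repeatable under the same seed.

```r
set.seed(0)                # any fixed seed makes the draw repeatable
picked <- sample(20, 10)

length(picked)             # 10
anyDuplicated(picked)      # 0: no value is drawn twice
all(picked %in% 1:20)      # TRUE

# Re-seeding with the same value reproduces the same draw
set.seed(0)
identical(picked, sample(20, 10))   # TRUE
```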
Now let’s sample the row numbers of the iris dataframe and store them in a variable called train_rows. Since floor(0.75 * 150) = floor(112.5) = 112, train_rows should contain 112 elements, which length() confirms.
set.seed(0) ## ensures reproducibility
train_rows <- sample(nrow(iris), floor(train_pct * nrow(iris)), replace=FALSE)
length(train_rows)
[1] 112
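As a sanity check, the rows that were not drawn can be derived as the complement of train_rows. The sketch below assumes the same seed and percentage as above; n_total, n_train, and test_rows are hypothetical names introduced only for illustration.

```r
library(datasets)
data("iris")

train_pct <- 0.75
n_total <- nrow(iris)                    # 150
n_train <- floor(train_pct * n_total)    # floor(112.5) = 112

set.seed(0)
train_rows <- sample(n_total, n_train, replace = FALSE)

# Rows NOT drawn for training (hypothetical helper variable)
test_rows <- setdiff(seq_len(n_total), train_rows)

length(test_rows)                          # 38
length(intersect(train_rows, test_rows))   # 0: the two sets do not overlap
```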
Let’s now create a column called “train_data”. We first assign 0 to all records, then assign 1 to the records selected in the train_rows variable.
iris[,'train_data'] = 0
iris[train_rows, 'train_data'] = 1
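With the indicator column in place, one possible way to obtain the two subsets is to filter on it. This is a minimal sketch in base R that repeats the steps above so it runs on its own; train_df and test_df are illustrative names, not part of the original code.

```r
library(datasets)
data("iris")

train_pct <- 0.75
set.seed(0)
train_rows <- sample(nrow(iris), floor(train_pct * nrow(iris)), replace = FALSE)

# Indicator column: 1 for training rows, 0 for everything else
iris[, "train_data"] <- 0
iris[train_rows, "train_data"] <- 1

# Filter on the indicator to get the two subsets (hypothetical names)
train_df <- iris[iris$train_data == 1, ]
test_df  <- iris[iris$train_data == 0, ]

nrow(train_df)   # 112
nrow(test_df)    # 38
```

Keeping the indicator in the original dataframe means the split can be re-derived later without re-running sample().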