How To Do Train Test Split In R


Last Updated on July 28, 2022 by Jay

In this tutorial we’ll talk about how to do a train test split in the R language. Unlike Python’s sklearn, which handles the split in a single function call, base R takes a few lines of code. Don’t worry, we’ll go through the steps.

Train Test Split In R

Dataset

Let’s load the iris dataset for the demonstration. The dataset contains 150 records. Let’s not worry about what is inside the dataset; our focus is on how to split the data.

library(datasets)
data("iris")

head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

nrow(iris)
[1] 150

We will split the data into 75% training and 25% test data.

train_pct = 0.75

Universal Train Test Split Approach In R

Although existing R libraries offer convenient functions for the train test split, we want to create a universal solution so that we don’t have to rely on any library.

In R, the function rownames() returns the row names of a data.frame object, which play a role similar to pandas’ .index. By default, the row names are the integers 1 to n (the number of rows in the dataset), returned as character strings.

class(iris)
[1] "data.frame"

rownames(iris)
  [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11"  "12"  "13"  "14"  "15"  "16"  "17" 
 [18] "18"  "19"  "20"  "21"  "22"  "23"  "24"  "25"  "26"  "27"  "28"  "29"  "30"  "31"  "32"  "33"  "34" 
 [35] "35"  "36"  "37"  "38"  "39"  "40"  "41"  "42"  "43"  "44"  "45"  "46"  "47"  "48"  "49"  "50"  "51" 
 [52] "52"  "53"  "54"  "55"  "56"  "57"  "58"  "59"  "60"  "61"  "62"  "63"  "64"  "65"  "66"  "67"  "68" 
 [69] "69"  "70"  "71"  "72"  "73"  "74"  "75"  "76"  "77"  "78"  "79"  "80"  "81"  "82"  "83"  "84"  "85" 
 [86] "86"  "87"  "88"  "89"  "90"  "91"  "92"  "93"  "94"  "95"  "96"  "97"  "98"  "99"  "100" "101" "102"
[103] "103" "104" "105" "106" "107" "108" "109" "110" "111" "112" "113" "114" "115" "116" "117" "118" "119"
[120] "120" "121" "122" "123" "124" "125" "126" "127" "128" "129" "130" "131" "132" "133" "134" "135" "136"
[137] "137" "138" "139" "140" "141" "142" "143" "144" "145" "146" "147" "148" "149" "150"
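
Because the row names come back as character strings, it can help to see how they relate to numeric row positions. The snippet below is a minimal sketch; the names df, rn and idx are illustrative and not part of the original walkthrough.

```r
# Work on a copy so the built-in dataset itself is left untouched.
df <- datasets::iris

# rownames() returns character strings, not numbers.
rn <- rownames(df)
class(rn)

# Convert to integers when numeric positions are needed.
idx <- as.integer(rn)

# For a data frame with default row names, these are simply 1..n.
identical(idx, seq_len(nrow(df)))
```

This is why the rest of the tutorial can work directly with the numbers 1 to nrow(iris) instead of the row names themselves.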

Our approach is simple:

  1. Randomly select 112 (~75% of the total records) row indexes from the total of 150 records.
  2. Add an indicator column “train_data” to the dataframe, assigning the value of 1 for those 112 records.

The sample() function randomly selects a sample of a specified size from a set of items. For example, sample(20, 10) randomly selects 10 numbers from the integers 1 to 20. By default it samples without replacement, so no number is picked twice.

set.seed(0)  ## ensures reproducibility
sample(20,10)
 [1] 14  4  7  1  2 13 18 11 16 15
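
To see what set.seed() buys us, the short check below draws the same sample twice. It is only a sketch; the variable names a and b are made up for illustration.

```r
# Resetting the seed before each call makes the two draws identical.
set.seed(0)
a <- sample(20, 10)
set.seed(0)
b <- sample(20, 10)
identical(a, b)

# sample() defaults to replace = FALSE, so no value is drawn twice.
anyDuplicated(a) == 0

# Every drawn value lies in the range 1 to 20.
all(a %in% 1:20)
```

Without the set.seed() calls, the two draws would almost always differ.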

Now let’s sample the iris dataframe and store the selected row index numbers in a variable called train_rows. Since floor(0.75 * 150) = floor(112.5) = 112, train_rows should contain 112 elements, and checking its length confirms this.

set.seed(0)  ## ensures reproducibility
train_rows <- sample(nrow(iris), floor(train_pct * nrow(iris)), replace = FALSE)

length(train_rows)
[1] 112
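
The rows that were not sampled make up the test set. As a minimal sketch (test_rows, n and n_train are names introduced here, not in the original code), the complement can be computed with setdiff():

```r
train_pct <- 0.75
n <- nrow(datasets::iris)          # 150 rows
n_train <- floor(train_pct * n)    # floor(112.5) = 112

set.seed(0)
train_rows <- sample(n, n_train, replace = FALSE)

# Everything not picked for training belongs to the test set.
test_rows <- setdiff(seq_len(n), train_rows)

length(test_rows)                         # 150 - 112 = 38
length(intersect(train_rows, test_rows))  # 0, the two sets do not overlap
```

Using floor() means that any fractional row goes to the test set, which is why the split is 112/38 rather than exactly 75/25.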

Let’s now create a column called “train_data”. We first assign 0 to all records, then assign 1 to the records whose row numbers are in train_rows.

iris[,'train_data'] = 0
iris[train_rows, 'train_data'] = 1
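
With the indicator column in place, the two subsets can be pulled out by filtering on it. The block below is a sketch rather than the definitive way to finish the split: it works on a copy named iris_df, and the helper train_test_split() is a hypothetical function written for this example, not something provided by base R.

```r
# Rebuild the indicator column on a copy of iris so the example is self-contained.
iris_df <- datasets::iris
train_pct <- 0.75
set.seed(0)
train_rows <- sample(nrow(iris_df), floor(train_pct * nrow(iris_df)), replace = FALSE)
iris_df[, "train_data"] <- 0
iris_df[train_rows, "train_data"] <- 1

# Filter on the indicator, then drop it so it is not mistaken for a feature.
train <- iris_df[iris_df$train_data == 1, ]
test  <- iris_df[iris_df$train_data == 0, ]
train$train_data <- NULL
test$train_data  <- NULL
nrow(train)
nrow(test)

# Hypothetical helper that wraps the same steps for any data frame.
train_test_split <- function(df, train_pct = 0.75, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  n <- nrow(df)
  train_rows <- sample(n, floor(train_pct * n), replace = FALSE)
  # setdiff() stays correct even if train_rows happens to be empty.
  test_rows <- setdiff(seq_len(n), train_rows)
  list(train = df[train_rows, , drop = FALSE],
       test  = df[test_rows, , drop = FALSE])
}

parts <- train_test_split(datasets::iris, train_pct = 0.75, seed = 0)
```

Here parts$train and parts$test hold the same rows as train and test above, although parts$train keeps them in sampled order rather than in original row order.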

Additional Resources

How to Do Train Test Split in Sklearn
