2018-05-06

10-Fold Cross-Validation

Data

The data index file data.csv contains 1935 records. The records are split into 10 roughly equal parts (folds) for cross-validation of the model.

ID	task	file	structure	content	language	score
1	TASK1	161102007511.txt	3.0	3.0	3.0	9.0
2	TASK1	161102008210.txt	3.0	3.5	3.0	9.5
……
1935	TASK3	161102007425.txt	4.0	3.5	3.5	11
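Since 1935 is not divisible by 10, the folds cannot all be the same size. The base R sketch below only illustrates the arithmetic of an even split. It assigns fold labels purely at random, which is a simplification and not how caret's createFolds() works (for a numeric outcome, createFolds() groups the values by percentiles and samples within those groups). The seed and variable names are arbitrary choices for this illustration.

```r
# Illustration only: unstratified random assignment of 1935 rows to 10 folds.
n <- 1935   # number of records in data.csv
k <- 10     # number of folds

set.seed(2018)  # arbitrary seed so the shuffle is repeatable

# repeat the labels 1..k until there are n of them, then shuffle
fold_id <- sample(rep_len(seq_len(k), n))

# count how many rows land in each fold
fold_sizes <- table(fold_id)
print(fold_sizes)
```

With this scheme, five folds hold 194 records and five hold 193. Fold sizes produced by createFolds() may differ slightly because of its stratification.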

Splitting the data

Use the createFolds() function from the R caret package to split the data:

# load package caret 
library(caret)

# read in data
data <- read.csv('data.csv')

# create 10 random folds (k = 10 is the default, stated here explicitly)
folds <- createFolds(data$score, k = 10)

# add a new column folds to the original data
data$folds <- 0

for (i in seq_along(folds)) {
    data$folds[folds[[i]]] <- i
}
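Before relying on the fold labels, it can help to check them and save the result. The following is a minimal sketch under stated assumptions: it uses a small made-up data frame instead of the real index, and out_file is a hypothetical output path (the original notes do not show how the file was written).

```r
# Toy stand-in for the real index: 20 rows with a fold label already assigned.
# All values here are made up for illustration.
set.seed(1)
data <- data.frame(
    ID    = 1:20,
    score = round(runif(20, 6, 12) * 2) / 2,
    folds = sample(rep_len(1:10, 20))
)

# every row should carry a fold label between 1 and 10
stopifnot(all(data$folds %in% 1:10))

# fold sizes should be roughly equal
fold_sizes <- table(data$folds)
print(fold_sizes)

# write the index with the new column; out_file is a hypothetical path
out_file <- file.path(tempdir(), "data_with_folds.csv")
write.csv(data, out_file, row.names = FALSE)

# read it back to confirm the folds column survived the round trip
check <- read.csv(out_file)
stopifnot(all(check$folds == data$folds))
```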

The new index file data.csv now includes a folds column:

ID	task	file	structure	content	language	score	folds
1	TASK1	161102007511.txt	3.0	3.0	3.0	9.0	3
2	TASK1	161102008210.txt	3.0	3.5	3.0	9.5	4
……
1935	TASK3	161102007425.txt	4.0	3.5	3.5	11	8
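With the folds column in place, each fold can serve once as the test set while the other nine are used for training. The sketch below shows that loop on made-up data. The linear model score ~ content is only a placeholder, since these notes do not say which model is being validated, and the toy values are not taken from data.csv.

```r
# Toy stand-in for the indexed data (values are made up for illustration)
set.seed(42)
n <- 50
data <- data.frame(
    content = sample(seq(2, 4, by = 0.5), n, replace = TRUE),
    folds   = sample(rep_len(1:10, n))
)
data$score <- 3 * data$content + rnorm(n, sd = 0.5)

# one RMSE value per held-out fold
rmse <- numeric(10)

for (k in 1:10) {
    train <- data[data$folds != k, ]   # the nine training folds
    test  <- data[data$folds == k, ]   # the held-out fold

    # placeholder model: swap in the model actually being evaluated
    fit  <- lm(score ~ content, data = train)
    pred <- predict(fit, newdata = test)

    rmse[k] <- sqrt(mean((test$score - pred)^2))
}

# average error across the 10 folds
mean_rmse <- mean(rmse)
print(mean_rmse)
```

Storing the fold label in the index file, rather than re-sampling on every run, keeps the split fixed so that different models are compared on identical partitions.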