2018-02-03

Stratified Sampling in R

Data

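The task column of the data index file data.csv contains three categories: task 1, task 2, and task 3. From each category, 80% of the rows are drawn at random as the training corpus, and the remaining 20% serve as the test corpus.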

ID	task	file	structure	content	language	score
1	TASK1	161102007511.txt	3.0	3.0	3.0	9.0
2	TASK1	161102008210.txt	3.0	3.5	3.0	9.5
……
1935	TASK3	161102007425.txt	4.0	3.5	3.5	11
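
Because the 80% target applies within each category, it can help to count the rows per task before sampling. The following is a minimal sketch on a made-up data frame; the name toy and the stratum sizes 10, 20, and 30 are hypothetical and are not taken from data.csv.

```r
# toy index with hypothetical stratum sizes (not the real data.csv)
toy <- data.frame(ID = 1:60,
                  task = rep(c("TASK1", "TASK2", "TASK3"), times = c(10, 20, 30)))

# number of rows in each task category
counts <- table(toy$task)

# 80% of each category goes to training, the remainder to testing
n_train <- round(0.8 * counts)
n_test  <- counts - n_train

print(rbind(total = counts, train = n_train, test = n_test))
```

When the categories differ in size, the per-category training sizes differ too, so a single shared sample size would not give 80% of every category.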

Sampling

The strata() function from the R sampling package performs the stratified sampling. The resulting training and test indexes are saved as train.csv and test.csv.

# clear memory
rm(list=ls(all.names=TRUE))

# load sampling package
library(sampling)

# read in data index
task.idx <- read.csv("data.csv")

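# select train and test data: 80% of each task category goes to training
# (sizes are given in the order the categories first appear in the data, as strata() expects)
task.order <- unique(as.character(task.idx$task))
n <- as.vector(round(0.8 * table(task.idx$task)[task.order]))
sub_train <- strata(task.idx, stratanames="task", size=n, method="srswor")
data_train <- task.idx[sub_train$ID_unit,]
data_test  <- task.idx[-sub_train$ID_unit,]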

# save the index of train and test data
write.csv(data_train, file="train.csv", quote=F, row.names=F)
write.csv(data_test, file="test.csv", quote=F, row.names=F)
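
To sanity-check a split like this one, it may be useful to confirm that the training and test sets do not overlap and that each task keeps roughly the intended share. The sketch below uses base R only (split() and sample.int() rather than the sampling package) on a made-up data frame. The names toy, rows_by_task, and train_rows, the seed, and the stratum sizes are all hypothetical.

```r
# toy index with hypothetical stratum sizes (not the real data.csv)
set.seed(42)  # fix the random draw so the split is reproducible
toy <- data.frame(ID = 1:60,
                  task = rep(c("TASK1", "TASK2", "TASK3"), times = c(10, 20, 30)))

# for each task, draw 80% of its row indices without replacement
rows_by_task <- split(seq_len(nrow(toy)), toy$task)
train_rows <- unlist(lapply(rows_by_task, function(rows) {
  rows[sample.int(length(rows), round(0.8 * length(rows)))]
}))

toy_train <- toy[train_rows, ]
toy_test  <- toy[-train_rows, ]

# sanity checks: no overlap between the sets, and no rows lost
stopifnot(length(intersect(toy_train$ID, toy_test$ID)) == 0)
stopifnot(nrow(toy_train) + nrow(toy_test) == nrow(toy))

# share of each task that ended up in the training set
print(table(toy_train$task) / table(toy$task))
```

The same two stopifnot() checks and the final ratio table could be applied to data_train and data_test, with task.idx in place of toy.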