Training, Test, and Validation Sets, and How to Split Them

I. Training, Test, and Validation Sets

In machine-learning prediction tasks we need to estimate a model's generalization error in order to select the best model. If we used all of the data for training, the resulting model would naturally fit that data best and would also score well when tested on it, but it might perform much worse on other data. To guard against this kind of overfitting, the data is split into a training set, a validation set, and a test set.

Their roles are:

  • Training set: used to fit the model
  • Validation set: used to evaluate the model's predictions and tune the corresponding hyperparameters
  • Test set: used to assess the generalization ability of the final, trained model

A vivid analogy: the training set is like a high-school senior's workbook, the validation set is like the mock exams, and the test set is the real exam at the end.

How should the proportions of the training, validation, and test sets be chosen?
In traditional machine learning the typical ratio is training/validation/test = 50/25/25. However, when the model needs little tuning beyond straightforward fitting, or when the training portion already covers training + validation (as in cross-validation), a simple training/test = 70/30 split also works. The hold-out example below implements the 50/25/25 split.

II. Splitting Methods

1. Hold-out

Hold-out is the simplest way to split a dataset: the samples are partitioned once by a fixed ratio. The corresponding scikit-learn function is train_test_split.

from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm
from sklearn.metrics import precision_score

iris = datasets.load_iris()
X = iris.data
y = iris.target

# First split off the test set: 25% of the full data
X2, X_test, y2, y_test = train_test_split(X, y,
                                          test_size=0.25,
                                          stratify=y,      # stratified sampling by label
                                          random_state=0)  # random seed

# Then split the remaining 75% into train and validation;
# 1/3 of it (= 25% of the full data) becomes the validation set,
# giving the 50/25/25 ratio described above
X_train, X_valid, y_train, y_valid = train_test_split(X2, y2,
                                                      test_size=1/3,
                                                      stratify=y2,
                                                      random_state=0)

clf = svm.SVC(gamma='scale')
clf.fit(X_train, y_train)

y_valid_pred = clf.predict(X_valid)
print('Precision_valid', precision_score(y_valid, y_valid_pred, average='micro'))

y_test_pred = clf.predict(X_test)
print('Precision_test', precision_score(y_test, y_test_pred, average='micro'))

However, this approach is not ideal, and it has two major drawbacks: first, it wastes data, since the validation and test sets are never used for training; second, both tuning and the final estimate depend on a single random split, so it is easy to overfit the validation set, and there is no convenient way to correct for an unlucky split.
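To see the second point concretely, here is a minimal sketch (assuming the same iris data and SVC model used throughout this post) that repeats the hold-out split with different random seeds; the reported score varies noticeably from split to split:

from sklearn.model_selection import train_test_split
from sklearn import datasets, svm
from sklearn.metrics import precision_score

iris = datasets.load_iris()
X, y = iris.data, iris.target

# The same model, evaluated on different random hold-out splits
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                              stratify=y, random_state=seed)
    clf = svm.SVC(gamma='scale').fit(X_tr, y_tr)
    score = precision_score(y_te, clf.predict(X_te), average='micro')
    print(f'seed={seed}: {score:.3f}')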

2. Cross-validation

Cross-validation first divides the dataset D into k folds. In each round, k-1 folds are used as the training set and the remaining fold as the validation set, with a different fold held out each round. After k rounds of training, the mean of the k validation results is returned; hence the name "k-fold cross-validation".
When the dataset is large, k can be set smaller; when the dataset is small, k can be set larger.
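Before looking at the individual splitter classes, note that scikit-learn can run this whole loop in one call with cross_val_score. A minimal sketch, using the same iris data and SVC model as the examples below:

from sklearn.model_selection import cross_val_score
from sklearn import datasets, svm

iris = datasets.load_iris()
clf = svm.SVC(gamma='scale')

# 5-fold cross-validation; returns one score per fold
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print(scores.mean())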

(1) K-Fold

KFold is the most basic form of k-fold cross-validation.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn import datasets
from sklearn import svm
from sklearn.metrics import precision_score

iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,   # split ratio
                                                    stratify=y,      # stratified sampling by label
                                                    random_state=0)  # random seed

clf = svm.SVC(gamma='scale')

precision_scores = []

kf = KFold(n_splits=5, random_state=0, shuffle=True)
for train_index, valid_index in kf.split(X_train, y_train):
    X_train_s, X_valid_s = X_train[train_index], X_train[valid_index]
    y_train_s, y_valid_s = y_train[train_index], y_train[valid_index]
    clf.fit(X_train_s, y_train_s)
    y_pred = clf.predict(X_valid_s)
    precision_scores.append(precision_score(y_valid_s, y_pred, average='micro'))

print('Precision_valid', np.mean(precision_scores))

# Note: clf keeps the fit from the last fold
y_test_pred = clf.predict(X_test)
print('Precision_test', precision_score(y_test, y_test_pred, average='micro'))

(2) StratifiedKFold

StratifiedKFold is used just like KFold, but it samples in a stratified way, ensuring that the class proportions in the training and validation folds match those of the original dataset.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn import datasets
from sklearn import svm
from sklearn.metrics import precision_score

iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,   # split ratio
                                                    stratify=y,      # stratified sampling by label
                                                    random_state=0)  # random seed

clf = svm.SVC(gamma='scale')

precision_scores = []

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_index, valid_index in skf.split(X_train, y_train):
    X_train_s, X_valid_s = X_train[train_index], X_train[valid_index]
    y_train_s, y_valid_s = y_train[train_index], y_train[valid_index]
    clf.fit(X_train_s, y_train_s)
    y_pred = clf.predict(X_valid_s)
    precision_scores.append(precision_score(y_valid_s, y_pred, average='micro'))

print('Precision_valid', np.mean(precision_scores))

y_test_pred = clf.predict(X_test)
print('Precision_test', precision_score(y_test, y_test_pred, average='micro'))
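A related class is StratifiedShuffleSplit, which draws a user-specified number of independent stratified splits rather than k disjoint folds. A sketch of swapping it into the loop above (reusing X_train and y_train from that example):

from sklearn.model_selection import StratifiedShuffleSplit

# 10 independent stratified splits, each with a 60/40 train/validation ratio
sss = StratifiedShuffleSplit(n_splits=10, train_size=0.6, test_size=0.4,
                             random_state=0)
for train_index, valid_index in sss.split(X_train, y_train):
    ...  # same loop body as above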

(3) GroupKFold

GroupKFold is similar in usage, but the folds are built from predefined groups: all samples sharing a group label are kept together, so a group never appears in both the training and validation sets. This is useful when, for example, several samples come from the same patient or user and must not leak across the split.

import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score

X = np.array([[1], [2], [3], [4], [5], [6], [7]])
y = np.array([1, 1, 1, 2, 1, 2, 1])

# Group label of each sample; samples sharing a label stay in the same fold
groups = np.array([1, 1, 1, 1, 1, 2, 2])

clf = DecisionTreeClassifier()

precision_scores = []

gkf = GroupKFold(n_splits=2)
for train_index, valid_index in gkf.split(X, y, groups=groups):
    X_train_s, X_valid_s = X[train_index], X[valid_index]
    y_train_s, y_valid_s = y[train_index], y[valid_index]
    clf.fit(X_train_s, y_train_s)
    y_pred = clf.predict(X_valid_s)
    precision_scores.append(precision_score(y_valid_s, y_pred, average='micro'))

print('Precision_valid', np.mean(precision_scores))

(4) ShuffleSplit

ShuffleSplit performs random-permutation cross-validation: it generates a user-specified number of independent splits, shuffling the samples first and then dividing them into a train/validation pair each time.
ShuffleSplit can serve as an alternative to KFold because it gives finer control over the number of iterations and the size of each side of the split.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import ShuffleSplit
from sklearn import datasets
from sklearn import svm
from sklearn.metrics import precision_score

iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,   # split ratio
                                                    stratify=y,      # stratified sampling by label
                                                    random_state=0)  # random seed

clf = svm.SVC(gamma='scale')

precision_scores = []

ss = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
for train_index, valid_index in ss.split(X_train, y_train):
    X_train_s, X_valid_s = X_train[train_index], X_train[valid_index]
    y_train_s, y_valid_s = y_train[train_index], y_train[valid_index]
    clf.fit(X_train_s, y_train_s)
    y_pred = clf.predict(X_valid_s)
    precision_scores.append(precision_score(y_valid_s, y_pred, average='micro'))

print('Precision_valid', np.mean(precision_scores))

y_test_pred = clf.predict(X_test)
print('Precision_test', precision_score(y_test, y_test_pred, average='micro'))

3. Leave-one-out (LOO)

Leave-one-out is the special case of k-fold cross-validation with k equal to the number of samples m: each round holds out exactly one sample as the validation set. The method is computationally expensive.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import LeaveOneOut
from sklearn import datasets
from sklearn import svm
from sklearn.metrics import precision_score

iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,   # split ratio
                                                    stratify=y,      # stratified sampling by label
                                                    random_state=0)  # random seed

clf = svm.SVC(gamma='scale')

precision_scores = []

loo = LeaveOneOut()  # equivalent to KFold(n_splits=len(X_train))
for train_index, valid_index in loo.split(X_train, y_train):
    X_train_s, X_valid_s = X_train[train_index], X_train[valid_index]
    y_train_s, y_valid_s = y_train[train_index], y_train[valid_index]
    clf.fit(X_train_s, y_train_s)
    y_pred = clf.predict(X_valid_s)
    precision_scores.append(precision_score(y_valid_s, y_pred, average='micro'))

print('Precision_valid', np.mean(precision_scores))

y_test_pred = clf.predict(X_test)
print('Precision_test', precision_score(y_test, y_test_pred, average='micro'))

4. Bootstrapping

Bootstrapping is based on bootstrap sampling, i.e. sampling with replacement. Each draw randomly picks one sample from the dataset D, copies it into a new dataset D', and puts it back into D; repeating this m times yields a dataset D' of m samples. The probability that a given sample is never drawn is (1 - 1/m)^m, which tends to 1/e ≈ 0.368 as m grows, so roughly 36.8% of the samples never appear in D'; these "out-of-bag" samples can serve as the test set.
Bootstrapping is useful when the dataset is small and hard to split effectively into training and test sets.

import numpy as np
from sklearn import datasets
from sklearn import svm
from sklearn.metrics import precision_score

iris = datasets.load_iris()
X = iris.data
y = iris.target

# Draw len(X) sample indices uniformly at random, with replacement
rng = np.random.RandomState(0)
bootstrap_idx = rng.randint(0, len(X), size=len(X))

# Build the training set from the drawn indices
X_train = X[bootstrap_idx]
y_train = y[bootstrap_idx]

# The out-of-bag samples (those never drawn, ~36.8% on average)
# act as the test set
oob_mask = ~np.isin(np.arange(len(X)), bootstrap_idx)
X_test, y_test = X[oob_mask], y[oob_mask]

clf = svm.SVC(gamma='scale')
clf.fit(X_train, y_train)

y_test_pred = clf.predict(X_test)
print('Precision_test', precision_score(y_test, y_test_pred, average='micro'))