Iterative Algorithms: AdaBoost
Algorithm Overview
The basic idea of AdaBoost is to combine many weak classifiers (usually decision stumps, i.e. one-level decision trees) into a single strong classifier. At each iteration the algorithm reweights the training samples so that the next weak learner concentrates on the examples misclassified so far.
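As a quick illustration (not part of the original write-up), scikit-learn ships a ready-made implementation; the sketch below assumes scikit-learn is installed and relies on its default base learner, which is a depth-1 decision tree, i.e. a decision stump:

```python
# A minimal sketch using scikit-learn's AdaBoostClassifier on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# By default the base learner is a depth-1 decision tree (a decision stump).
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
print('test accuracy:', clf.score(X_test, y_test))
```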
Advantages and Disadvantages
The main advantages of AdaBoost are:
- When used as a classifier, it achieves high accuracy.
- Within the AdaBoost framework, a wide range of classification and regression models can serve as the weak learner, which makes it very flexible.
- As a simple binary classifier, it is easy to construct and its results are interpretable.
- It is not prone to overfitting.

The main disadvantage of AdaBoost is:
- It is sensitive to outliers: anomalous samples may receive very high weights during the iterations, degrading the prediction accuracy of the final strong learner.
Training Algorithm
1. Error rate

The weighted error rate of the k-th weak classifier $G_k(x)$ on the training set is
$$
e_k = P(G_k(x_i) \neq y_i) = \sum\limits_{i=1}^{m}w_{ki}I(G_k(x_i) \neq y_i)
$$
where $w_{ki}$ is the weight of sample $i$ in round $k$.
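For a concrete feel for this quantity, here is a small numpy check; the weights and predictions are hypothetical, made up for illustration:

```python
import numpy as np

# hypothetical round-k sample weights (sum to 1) and predictions
w = np.array([0.1, 0.1, 0.2, 0.3, 0.3])
y      = np.array([ 1, -1,  1,  1, -1])   # true labels
G_pred = np.array([ 1, -1, -1,  1, -1])   # weak classifier output
e_k = np.sum(w * (G_pred != y))           # weighted error rate
print(e_k)  # 0.2: only the weight of the one misclassified sample counts
```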
2. Classifier weight

The weight of the k-th weak classifier in the final ensemble is
$$
\alpha_k = \frac{1}{2}\log\frac{1-e_k}{e_k}
$$
Note that $\alpha_k > 0$ only when $e_k < 0.5$, i.e. when the weak learner beats random guessing, and it grows as $e_k$ shrinks.
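Plugging in a few error rates shows how $\alpha_k$ behaves (a toy check, not from the original text):

```python
import numpy as np

def alpha(e_k):
    return 0.5 * np.log((1 - e_k) / e_k)

print(alpha(0.3))   # ~0.424: better than random, positive weight
print(alpha(0.5))   # 0.0:    random guessing, no vote
print(alpha(0.1))   # ~1.099: very accurate, large weight
```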
3. Sample weight update

$$
D_{k+1,i} = \frac{D_{k,i}}{Z_k}\exp(-\alpha_k y_i G_k(x_i))
$$
where $Z_k$ is the normalization factor
$$
Z_k = \sum\limits_{i=1}^{m}D_{k,i}\exp(-\alpha_k y_i G_k(x_i))
$$
which keeps the weights summing to 1. Misclassified samples ($y_i G_k(x_i) = -1$) get their weights increased, while correctly classified ones get theirs decreased.
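The update takes only a couple of numpy lines; this sketch reuses the hypothetical weights and predictions from the error-rate example above:

```python
import numpy as np

w = np.array([0.1, 0.1, 0.2, 0.3, 0.3])   # D_k (hypothetical)
y      = np.array([ 1, -1,  1,  1, -1])
G_pred = np.array([ 1, -1, -1,  1, -1])
alpha_k = 0.5 * np.log((1 - 0.2) / 0.2)    # e_k = 0.2 from above

unnorm = w * np.exp(-alpha_k * y * G_pred)  # raise misclassified, lower correct
Z_k = unnorm.sum()                          # normalization factor
D_next = unnorm / Z_k
print(D_next)  # the misclassified sample's weight grows from 0.2 to 0.5
```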
4. Final ensemble

$$
f(x) = \operatorname{sign}\left(\sum\limits_{k=1}^{K}\alpha_kG_k(x)\right)
$$
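The final prediction is just the sign of the weighted vote; a hypothetical two-stump example:

```python
import numpy as np

alphas = np.array([0.42, 0.65])            # hypothetical classifier weights
G_preds = np.array([[ 1, -1,  1],          # stump 1 predictions for 3 samples
                    [-1, -1,  1]])         # stump 2 predictions
f_x = np.sign(alphas @ G_preds)            # weighted vote, then take the sign
print(f_x)  # [-1. -1.  1.]: stump 2 outvotes stump 1 on the first sample
```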
5. Algorithm flow

Input: training dataset $T=\{(x_1, y_1),(x_2, y_2),(x_3, y_3),\ldots,(x_n, y_n)\}$
Output: the final strong classifier $f(x)$
Initialization: for the first round, assume the samples are uniformly distributed, so every sample gets the same weight:
$$
D_1=(w_{11},w_{12},\ldots,w_{1n}), \quad w_{1i}=\frac{1}{n}
$$
Loop over $k=1,2,\ldots,K$:
- Train a weak classifier $G_k(x)$ on the weighted data.
- Compute the weighted error rate $e_k$ of the current classifier.
- Compute the classifier weight $\alpha_k$.
- Update the weight distribution of the training data for the next iteration.

Termination: the loop ends when the training error of the combined classifier falls below a chosen threshold (or reaches 0), when some $e_k \geq 0.5$ (the weak learner is no better than random guessing), or when the maximum number of iterations is reached.
Combined classifier:
$$
f(x) = \operatorname{sign}\left(\sum\limits_{k=1}^{K}\alpha_kG_k(x)\right)
$$
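Before the full implementation below, here is a compact sketch of the training loop that mirrors the steps above; the `train_stump` helper is hypothetical and stands in for any weak-learner fitting routine:

```python
import numpy as np

def boost(X, y, train_stump, K):
    """Minimal AdaBoost loop; train_stump(X, y, w) -> (predict_fn, predictions)."""
    n = len(y)
    w = np.full(n, 1.0 / n)                  # D_1: uniform weights
    stumps, alphas = [], []
    for k in range(K):
        G_k, pred = train_stump(X, y, w)     # fit weak learner on current weights
        e_k = np.sum(w * (pred != y))        # step 1: weighted error rate
        if e_k >= 0.5:                       # no better than random: stop
            break
        alpha_k = 0.5 * np.log((1 - e_k) / max(e_k, 1e-16))  # step 2
        w = w * np.exp(-alpha_k * y * pred)  # step 3: reweight samples
        w /= w.sum()                         # normalize by Z_k
        stumps.append(G_k)
        alphas.append(alpha_k)
    # step 4: final classifier is the sign of the weighted vote
    return lambda Xq: np.sign(sum(a * G(Xq) for a, G in zip(alphas, stumps)))
```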
Code

A decision-stump AdaBoost implementation, following the code from Machine Learning in Action:
```python
import numpy as np


def creatdata():
    """Toy 2-D dataset from Machine Learning in Action."""
    datMat = np.mat([[1., 2.1],
                     [1.5, 1.6],
                     [1.3, 1.],
                     [1., 1.],
                     [2., 1.]])
    classLabels = [1.0, 1.0, -1.0, -1.0, 1.0]
    return datMat, classLabels


def classify(datamat, index, threshval, threshineq):
    """Decision stump: predict -1 on one side of the threshold, +1 on the other."""
    retarray = np.ones((np.shape(datamat)[0], 1))
    if threshineq == 'lt':
        retarray[datamat[:, index] <= threshval] = -1.0
    else:
        retarray[datamat[:, index] > threshval] = -1.0
    return retarray


def buildonetree(data, labels, D):
    """Find the best decision stump under the current sample weights D."""
    datamat = np.mat(data)
    labelsmat = np.mat(labels).T
    m, n = np.shape(datamat)
    numsteps = 10.0
    besttree = {}
    bestclass = np.mat(np.zeros((m, 1)))
    minerror = float('inf')
    for i in range(n):  # try every feature
        rangemin = datamat[:, i].min()
        rangemax = datamat[:, i].max()
        step = (rangemax - rangemin) / numsteps
        for j in range(-1, int(numsteps) + 1):  # sweep thresholds across the range
            threshval = rangemin + float(j) * step
            for inequal in ['lt', 'gt']:  # try both inequality directions
                predictvalue = classify(datamat, i, threshval, inequal)
                errarr = np.mat(np.ones((m, 1)))
                errarr[predictvalue == labelsmat] = 0
                weighterror = float(D.T * errarr)  # weighted error rate e_k
                if weighterror < minerror:
                    minerror = weighterror
                    bestclass = predictvalue.copy()
                    besttree['dim'] = i
                    besttree['thresh'] = threshval
                    besttree['ineq'] = inequal
    return besttree, minerror, bestclass


def buildadaboost(datainput, labels, numit=40):
    """Train AdaBoost: returns a list of weighted decision stumps."""
    weakclassarr = []
    m = np.shape(datainput)[0]
    D = np.mat(np.ones((m, 1)) / m)  # D_1: uniform sample weights
    aggclassest = np.mat(np.zeros((m, 1)))  # running weighted vote of all stumps
    for i in range(numit):
        besttree, minerror, bestclass = buildonetree(datainput, labels, D)
        # classifier weight alpha_k; the 1e-16 guards against division by zero
        alpha = float(0.5 * np.log((1.0 - minerror) / max(minerror, 1e-16)))
        besttree['alpha'] = alpha
        weakclassarr.append(besttree)
        # reweight samples: increase on mistakes, decrease on correct predictions
        expon = np.multiply(-1 * alpha * np.mat(labels).T, bestclass)
        D = np.multiply(D, np.exp(expon))
        D = D / D.sum()  # normalize (divide by Z_k)
        aggclassest += alpha * bestclass
        aggErrors = np.multiply(np.sign(aggclassest) != np.mat(labels).T, np.ones((m, 1)))
        errorRate = aggErrors.sum() / m
        print('errorRate:', errorRate)
        if errorRate == 0.0:
            break
    return weakclassarr


def adaclassify(data, classarr):
    """Classify samples with the trained list of weighted stumps."""
    datamat = np.mat(data)
    m = np.shape(datamat)[0]
    aggclass = np.mat(np.zeros((m, 1)))
    for i in range(len(classarr)):
        classarray = classify(datamat, classarr[i]['dim'], classarr[i]['thresh'], classarr[i]['ineq'])
        aggclass += classarr[i]['alpha'] * classarray
    return np.sign(aggclass)


def loaddata(filename):
    """Load a whitespace-separated file; the last column is the label."""
    data = []
    label = []
    with open(filename) as fr:
        for line in fr:
            linearr = [float(x) for x in line.strip().split()]
            data.append(linearr[:-1])
            label.append(linearr[-1])
    return data, label


if __name__ == "__main__":
    data, label = loaddata('horseColicTraining2.txt')
    classifyclass = buildadaboost(data, label, 10)
    test, testlabel = loaddata('horseColicTest2.txt')
    prediction = adaclassify(test, classifyclass)
    m = len(testlabel)  # 67 samples in the original test file; computed instead of hardcoded
    err = np.mat(np.ones((m, 1)))
    error = err[prediction != np.mat(testlabel).T].sum() / m
    print(error)
```
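The main block expects the horse-colic data files from Machine Learning in Action; if they are not at hand, the otherwise unused `creatdata()` helper above gives a quick smoke test:

```python
# Smoke test on the built-in toy dataset (uses the functions defined above).
datMat, classLabels = creatdata()
classifiers = buildadaboost(datMat, classLabels, numit=9)
print(adaclassify([[5., 5.], [0., 0.]], classifiers))  # expect [[1.], [-1.]]
```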
Summary

Since I am working from Machine Learning in Action, there is very little mathematical derivation here. Once I have coded my way through everything, I plan to fill in the math by following Hung-yi Lee's lectures and the watermelon book (Zhou Zhihua's Machine Learning). The next post should be on linear regression (there is quite a lot of material there). Keep going!