Machine Learning Algorithms: AdaBoost

Published on 2020-09-29


Iterative Algorithm (AdaBoost)

Algorithm Overview

The basic idea of AdaBoost is to combine multiple weak classifiers (typically single-level decision trees, i.e. decision stumps) in a sensible way so that together they form one strong classifier.
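For a quick feel of this weak-to-strong idea before the hand-rolled version below, here is a minimal sketch using scikit-learn (this assumes scikit-learn is available; its AdaBoostClassifier uses depth-1 decision trees, i.e. decision stumps, as the default weak learner):

import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# the same five toy points used in the code section below
X = np.array([[1., 2.1], [1.5, 1.6], [1.3, 1.], [1., 1.], [2., 1.]])
y = np.array([1, 1, -1, -1, 1])

clf = AdaBoostClassifier(n_estimators=10)  # 10 boosting rounds
clf.fit(X, y)
print(clf.predict(X))  # predicted labels for the training points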

Pros and Cons

The main advantages of AdaBoost:

  1. As a classifier, AdaBoost achieves high accuracy.

  2. Within the AdaBoost framework, a wide range of classification and regression models can be used to build the weak learner, which makes it very flexible.

  3. Built from simple binary classifiers, the model is easy to construct and its results are easy to interpret.

  4. It is not prone to overfitting.

The main disadvantage of AdaBoost:

  1. It is sensitive to outliers: anomalous samples can accumulate large weights over the iterations, hurting the prediction accuracy of the final strong learner.

Training Algorithm

1. Error rate

$$
e_k = P(G_k(x_i) \neq y_i) = \sum\limits_{i=1}^{m}D_{k,i}\,I(G_k(x_i) \neq y_i)
$$
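In code this is just the total weight of the misclassified samples. A minimal numpy sketch (the names and numbers here are illustrative, not from the text):

import numpy as np

D = np.array([0.2, 0.2, 0.2, 0.2, 0.2])  # current sample weights D_k, summing to 1
y = np.array([1, 1, -1, -1, 1])          # true labels y_i
pred = np.array([1, -1, -1, -1, 1])      # weak classifier outputs G_k(x_i)

e_k = D[pred != y].sum()  # only the second sample is misclassified, so e_k = 0.2
print(e_k)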

2. Classifier weight

$$
\alpha_k = \frac{1}{2}\ln\frac{1-e_k}{e_k}
$$
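For instance, a weak classifier with $e_k = 0.2$ receives weight $\alpha_k = \frac{1}{2}\ln\frac{0.8}{0.2} \approx 0.693$, while one with $e_k = 0.5$ (no better than random guessing) receives $\alpha_k = 0$; the lower a classifier's error, the louder its voice in the final vote.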

3. Sample weight update

$$
D_{k+1,i} = \frac{D_{k,i}}{Z_k}\exp(-\alpha_ky_iG_k(x_i))
$$

where $Z_k$ is a normalization factor that keeps the weights summing to 1:
$$
Z_k = \sum\limits_{i=1}^{m}D_{k,i}\exp(-\alpha_ky_iG_k(x_i))
$$
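Continuing the illustrative numbers from the error-rate sketch above, one round of the update then looks like this:

import numpy as np

D = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
y = np.array([1, 1, -1, -1, 1])
pred = np.array([1, -1, -1, -1, 1])  # only the second sample is wrong
e_k = 0.2

alpha = 0.5 * np.log((1 - e_k) / e_k)   # ≈ 0.693
D_next = D * np.exp(-alpha * y * pred)  # up-weight the mistake, down-weight the rest
D_next = D_next / D_next.sum()          # divide by Z_k so the weights sum to 1 again
print(D_next)                           # [0.125 0.5 0.125 0.125 0.125]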

4. Combination

$$
f(x) = \operatorname{sign}\left(\sum\limits_{k=1}^{K}\alpha_kG_k(x)\right)
$$
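For example, with three stumps weighted $\alpha = (1.5, 0.3, 0.4)$ (illustrative values), a sample that the first stump labels $+1$ and the other two label $-1$ still gets $f(x) = \operatorname{sign}(1.5 - 0.3 - 0.4) = +1$: the most accurate stump outvotes the other two, so the combination is a weighted vote rather than a simple majority.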

5. Algorithm flow

Input: training set $T=\{(x_1, y_1),(x_2, y_2),(x_3, y_3),\dots,(x_n, y_n)\}$, where $y_i \in \{-1, +1\}$

Output: the final strong classifier $f(x)$

Initialization: in the first round, assume the samples are uniformly distributed, so all weights are equal:
$$
D_1=(w_{1,1},w_{1,2},\dots,w_{1,n}),\qquad w_{1,i}=\frac{1}{n}
$$
Loop: for $k = 1, 2, 3, \dots, K$:

  • Train a weak classifier $G_k(x)$ on the data weighted by $D_k$
  • Compute the weighted error rate $e_k$ of the current classifier
  • Compute the current classifier's weight $\alpha_k$
  • Update the sample weight distribution to $D_{k+1}$ for the next round

Loop exit condition: the combined classifier's error drops below some threshold (and a weak learner is only useful while its own $e_k$ stays under 0.5, i.e. better than random guessing), or the maximum number of iterations is reached

Combined classifier:
$$
f(x) = \operatorname{sign}\left(\sum\limits_{k=1}^{K}\alpha_kG_k(x)\right)
$$

Code

import numpy as np


def creatdata():
    # Toy dataset: five 2-D points with labels in {+1, -1}
    datMat = np.mat([[1., 2.1],
                     [1.5, 1.6],
                     [1.3, 1.],
                     [1., 1.],
                     [2., 1.]])
    classLabels = [1.0, 1.0, -1.0, -1.0, 1.0]
    return datMat, classLabels


def classify(datamat, index, threshval, threshineq):
    # Decision stump prediction: threshold one feature; 'lt' assigns -1 to
    # samples at or below the threshold, anything else assigns -1 above it
    retarray = np.ones((np.shape(datamat)[0], 1))
    if threshineq == 'lt':
        retarray[datamat[:, index] <= threshval] = -1.0
    else:
        retarray[datamat[:, index] > threshval] = -1.0
    return retarray


def buildonetree(data, labels, D):
    # Find the best decision stump (feature, threshold, direction)
    # under the current sample weight vector D
    datamat = np.mat(data)
    labelsmat = np.mat(labels).T
    m, n = np.shape(datamat)
    numsteps = 10.0
    besttree = {}
    bestclass = np.mat(np.zeros((m, 1)))
    minerror = float('inf')
    for i in range(n):
        rangemin = datamat[:, i].min()
        rangemax = datamat[:, i].max()
        step = (rangemax - rangemin) / numsteps
        for j in range(-1, int(numsteps) + 1):
            threshval = rangemin + float(j) * step
            for inequal in ['lt', 'gt']:  # try both inequality directions
                predictvalue = classify(datamat, i, threshval, inequal)
                errarr = np.mat(np.ones((m, 1)))
                errarr[predictvalue == labelsmat] = 0
                weighterror = float(D.T * errarr)  # weighted error rate e_k
                if weighterror < minerror:
                    minerror = weighterror
                    bestclass = predictvalue.copy()
                    besttree['dim'] = i
                    besttree['thresh'] = threshval
                    besttree['ineq'] = inequal
    return besttree, minerror, bestclass


def buildadaboost(datainput, labels, numit=40):
    weakclassarr = []
    m = np.shape(datainput)[0]
    D = np.mat(np.ones((m, 1)) / m)  # initial weights D_1: uniform distribution
    aggclassest = np.mat(np.zeros((m, 1)))
    for i in range(numit):
        besttree, minerror, bestclass = buildonetree(datainput, labels, D)
        # alpha_k = 0.5 * ln((1 - e_k) / e_k); 1e-16 guards against division by zero
        alpha = float(0.5 * np.log((1.0 - minerror) / max(minerror, 1e-16)))
        besttree['alpha'] = alpha
        weakclassarr.append(besttree)
        # weight update: D_{k+1,i} = D_{k,i} * exp(-alpha_k * y_i * G_k(x_i)) / Z_k
        expon = np.multiply(-1 * alpha * np.mat(labels).T, bestclass)
        D = np.multiply(D, np.exp(expon))
        D = D / D.sum()  # normalize by Z_k
        aggclassest += alpha * bestclass  # running weighted vote of all stumps so far
        aggErrors = np.multiply(np.sign(aggclassest) != np.mat(labels).T, np.ones((m, 1)))
        errorRate = aggErrors.sum() / m
        print('errorRate:', errorRate)
        if errorRate == 0.0: break
    return weakclassarr


def adaclassify(data, classarr):
    # Apply every trained stump and return the sign of the alpha-weighted vote
    datamat = np.mat(data)
    m = np.shape(datamat)[0]
    aggclass = np.mat(np.zeros((m, 1)))
    for i in range(len(classarr)):
        classarray = classify(datamat, classarr[i]['dim'], classarr[i]['thresh'], classarr[i]['ineq'])
        aggclass += np.multiply(classarr[i]['alpha'], classarray)
    return np.sign(aggclass)


def loaddata(filename):
    # Load whitespace-separated data; the last column is the label
    data = []
    label = []
    with open(filename) as fr:
        for line in fr:
            if not line.strip():
                continue
            linearr = [float(x) for x in line.strip().split()]
            data.append(linearr[0:-1])
            label.append(linearr[-1])
    return data, label


if __name__ == "__main__":
    data, label = loaddata('horseColicTraining2.txt')
    classifyclass = buildadaboost(data, label, 10)
    test, testlabel = loaddata('horseColicTest2.txt')
    prediction = adaclassify(test, classifyclass)
    m = len(testlabel)
    err = np.mat(np.ones((m, 1)))
    error = err[prediction != np.mat(testlabel).T].sum() / m  # test set error rate
    print(error)
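The creatdata helper above is never exercised by the main block; as a quick sanity check, the same pipeline can also be run on the five toy points (a usage sketch reusing the functions defined above):

datMat, classLabels = creatdata()
classifiers = buildadaboost(datMat, classLabels, 9)
# (5, 5) should land in the +1 class and (0, 0) in the -1 class
print(adaclassify([[5., 5.], [0., 0.]], classifiers))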

Summary

Since I'm working from Machine Learning in Action, there is very little mathematical derivation here; once I've coded my way through everything, I plan to fill in the math while following Hung-yi Lee's lectures and the watermelon book (🍉). Next up should be linear regression (there's quite a lot to it). Keep going!