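A First Look at XGBoost

Background

XGBoost is an implementation of the gradient boosting framework that handles regression, classification, ranking, and other tasks. It is known for strong predictive performance and fast training.

Getting Started

This post does not cover the theory; it only walks through basic usage.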

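Setting Up the Environment

This post uses the simplest installation route: running XGBoost from Python.

Creating an isolated Python environment

First, create an isolated Python environment.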
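
The post does not say how the environment was created. One possible approach is the standard-library `venv` module, sketched below. The helper name `create_env` and the directory name `xgb-env` are placeholders chosen for illustration, not something taken from the original setup.

```python
import venv
from pathlib import Path


def create_env(env_dir, with_pip=False):
    """Create an isolated virtual environment at env_dir and return its path."""
    env_path = Path(env_dir)
    # with_pip=True additionally bootstraps pip into the new environment
    venv.create(env_path, with_pip=with_pip)
    return env_path
```

Calling `create_env("xgb-env", with_pip=True)` should be roughly equivalent to running `python3 -m venv xgb-env` in a shell. After activating the environment (`source xgb-env/bin/activate` on Linux or macOS), packages installed with `pip install` and no `sudo` stay inside that environment.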

Installing numpy


sudo pip3 install --index-url http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com numpy

Installing scipy


sudo pip3 install --index-url http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com scipy

If this conflicts with an existing installation, you can tell pip to ignore the already-installed package:


sudo pip3 install --index-url http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com scipy --ignore-installed scipy

Installing xgboost

Next, install the xgboost package:


sudo pip3 install --index-url http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com xgboost

Result


Successfully installed xgboost-0.81
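
To confirm from Python which versions ended up installed, a small check along these lines may help. It is only a sketch: `installed_versions` is a hypothetical helper built on the standard-library `importlib.metadata` module (Python 3.8 or later), and the package list simply mirrors the steps so far.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment
            versions[name] = None
    return versions


# Packages installed in the steps so far
for name, version in installed_versions(["numpy", "scipy", "xgboost"]).items():
    print(f"{name}: {version or 'not installed'}")
```

The same helper can be rerun with `matplotlib` and `scikit-learn` added to the list once the later steps are done.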

Installing matplotlib

xgboost needs it for plotting.


sudo pip3 install --index-url http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com matplotlib

Setting up scikit-learn

Installing scikit-learn


sudo pip3 install --index-url http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com -U scikit-learn

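Checking that sklearn is installed


from sklearn import svm

# Fit a tiny SVM classifier on two points as a smoke test
X = [[0, 0], [1, 1]]
y = [0, 1]
clf = svm.SVC()
clf.fit(X, y)
# Print the predicted label for a new point
print(clf.predict([[2., 2.]])[0])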

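Running the Examples

The examples here follow the four examples in the article 史上最详细的 XGBoost 实战 ("The Most Detailed XGBoost Hands-On Guide Ever"):

Classification with the native XGBoost interface
Regression with the native XGBoost interface
Classification with the Scikit-learn interface
Regression with the Scikit-learn interface

Some of the examples in that article do not run as written, so they have been modified slightly here.

That article also explains what each parameter does; interested readers can consult it directly.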

Classification with the native XGBoost interface

demo1.py:


# -*- coding: UTF-8 -*-
from sklearn.datasets import load_iris
import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split

# The Iris dataset: a flower classification problem.
iris = load_iris()

# Each row of X looks like [5.1, 3.5, 1.4, 0.2]: four features,
# all positive floats measured in centimeters.
## Sepal.Length
## Sepal.Width
## Petal.Length
## Petal.Width
X = iris.data

# Each y is a class label such as 0, identifying the iris species.
## Iris Setosa
## Iris Versicolour
## Iris Virginica
y = iris.target

# Hold out 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=61664521)

params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',
    'num_class': 3,
    'gamma': 0.1,
    'max_depth': 6,
    'lambda': 2,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'min_child_weight': 3,
    'silent': 1,  # accepted by xgboost 0.81; newer releases use 'verbosity' instead
    'eta': 0.1,
    'seed': 1000,
    'nthread': 4,
}

dtrain = xgb.DMatrix(X_train, label=y_train)
num_rounds = 500

# Train the model. The dict is passed directly: it works across versions,
# whereas params.items() can fail on newer xgboost releases.
model = xgb.train(params, dtrain, num_rounds)

# Predict on the test set
dtest = xgb.DMatrix(X_test)
ans = model.predict(dtest)

# Compute the accuracy
cnt1 = 0  # correct predictions
cnt2 = 0  # wrong predictions
for i in range(len(y_test)):
    if ans[i] == y_test[i]:
        cnt1 += 1
    else:
        cnt2 += 1

print("Accuracy: %.2f %% " % (100 * cnt1 / (cnt1 + cnt2)))

# Plot feature importance
plot_importance(model)
plt.show()

Output


Accuracy: 93.33 %
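
The counting loop in demo1.py can be written more compactly. The sketch below uses a hypothetical helper named `accuracy` and toy values standing in for `ans` and `y_test`; it computes the same quantity and is not a change to the original script.

```python
def accuracy(predictions, labels):
    """Return the fraction of predictions that equal their true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("cannot compute accuracy of an empty sequence")
    # 'multi:softmax' returns class indices as floats, and 2.0 == 2 is True
    correct = sum(1 for p, t in zip(predictions, labels) if p == t)
    return correct / len(labels)


# Toy values standing in for `ans` and `y_test`
toy_predictions = [0.0, 2.0, 1.0, 1.0, 0.0]
toy_labels = [0, 2, 1, 2, 0]
print("Accuracy: %.2f %%" % (100 * accuracy(toy_predictions, toy_labels)))
# prints: Accuracy: 80.00 %
```

In the script itself the call would be `accuracy(ans, y_test)`. scikit-learn's `accuracy_score(y_test, ans)` from `sklearn.metrics` computes the same fraction.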

Regression with the native XGBoost interface

demo2.py:


# -*- coding: UTF-8 -*-

from sklearn.datasets import load_diabetes
import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split

# Load the raw data
## The diabetes dataset, a classic dataset for regression.
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

# Hold out 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=61664521)

params = {
    'booster': 'gbtree',
    'objective': 'reg:gamma',
    'gamma': 0.1,
    'max_depth': 5,
    'lambda': 3,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'min_child_weight': 3,
    'silent': 1,  # accepted by xgboost 0.81; newer releases use 'verbosity' instead
    'eta': 0.1,
    'seed': 1000,
    'nthread': 4,
}

dtrain = xgb.DMatrix(X_train, label=y_train)
num_rounds = 300

# Train the model. The dict is passed directly: it works across versions,
# whereas params.items() can fail on newer xgboost releases.
model = xgb.train(params, dtrain, num_rounds)

# Predict on the test set
dtest = xgb.DMatrix(X_test)
ans = model.predict(dtest)

# Plot feature importance
plot_importance(model)
plt.show()

Result
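
The regression script computes predictions but only plots feature importance, so it does not show how close the predictions are to the targets. One way to add a numeric score is to compute error metrics from `ans` and `y_test`. The sketch below uses a hypothetical helper `regression_errors` and toy numbers rather than real model output.

```python
import math


def regression_errors(predictions, targets):
    """Return (RMSE, MAE) for paired predictions and true target values."""
    if len(predictions) != len(targets) or len(targets) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    residuals = [float(p) - float(t) for p, t in zip(predictions, targets)]
    # Root mean squared error penalizes large misses more heavily
    rmse = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    # Mean absolute error is the average size of a miss
    mae = sum(abs(r) for r in residuals) / len(residuals)
    return rmse, mae


# Toy values standing in for `ans` and `y_test`
toy_predictions = [150.0, 98.0, 203.0, 77.0]
toy_targets = [154.0, 95.0, 200.0, 77.0]
rmse, mae = regression_errors(toy_predictions, toy_targets)
print("RMSE: %.2f, MAE: %.2f" % (rmse, mae))
# prints: RMSE: 2.92, MAE: 2.50
```

In demo2.py the call would be `regression_errors(ans, y_test)`, placed after the prediction step.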

Classification with the Scikit-learn interface


from sklearn.datasets import load_iris
import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split

# Load the iris data
iris = load_iris()

X = iris.data
y = iris.target

# Hold out 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=61664521)

# Train the model ('silent' is accepted by xgboost 0.81; newer releases use 'verbosity')
model = xgb.XGBClassifier(max_depth=5, learning_rate=0.1, n_estimators=160, silent=True, objective='multi:softmax')
model.fit(X_train, y_train)

# Predict on the test set
ans = model.predict(X_test)

# Compute the accuracy
cnt1 = 0  # correct predictions
cnt2 = 0  # wrong predictions
for i in range(len(y_test)):
    if ans[i] == y_test[i]:
        cnt1 += 1
    else:
        cnt2 += 1

print("Accuracy: %.2f %% " % (100 * cnt1 / (cnt1 + cnt2)))

# Plot feature importance
plot_importance(model)
plt.show()

Result


Accuracy: 93.33 %
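
The Scikit-learn wrapper uses `learning_rate` and `n_estimators` where the native example used `eta` and `num_rounds`. The sketch below collects the renames that, to my understanding, apply between the two interfaces. `NATIVE_TO_SKLEARN` and `to_sklearn_kwargs` are hypothetical names made up for this illustration, and the mapping is worth checking against the documentation for the installed version.

```python
# Native parameter name -> Scikit-learn wrapper keyword (assumed mapping)
NATIVE_TO_SKLEARN = {
    "eta": "learning_rate",
    "lambda": "reg_lambda",
    "alpha": "reg_alpha",
    "seed": "random_state",
    "nthread": "n_jobs",
}

# The classifier wrapper infers the number of classes from the labels
INFERRED_BY_WRAPPER = {"num_class"}


def to_sklearn_kwargs(native_params, num_rounds):
    """Translate native-interface params into wrapper-style keyword arguments."""
    kwargs = {
        NATIVE_TO_SKLEARN.get(key, key): value
        for key, value in native_params.items()
        if key not in INFERRED_BY_WRAPPER
    }
    # The round count is a separate argument to xgb.train,
    # but a constructor keyword (n_estimators) in the wrapper
    kwargs["n_estimators"] = num_rounds
    return kwargs


native = {"eta": 0.1, "lambda": 2, "max_depth": 6, "seed": 1000, "num_class": 3}
print(to_sklearn_kwargs(native, 500))
```

Parameters such as `max_depth`, `gamma`, `subsample`, and `objective` keep the same name in both interfaces, which is why the helper passes unknown keys through unchanged.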

Regression with the Scikit-learn interface


# -*- coding: UTF-8 -*-

from sklearn.datasets import load_diabetes
import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split

# Load the raw data
## The diabetes dataset, a classic dataset for regression.
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

# Hold out 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train the model ('silent' is accepted by xgboost 0.81; newer releases use 'verbosity')
model = xgb.XGBRegressor(max_depth=5, learning_rate=0.1, n_estimators=160, silent=True, objective='reg:gamma')
model.fit(X_train, y_train)

# Predict on the test set
ans = model.predict(X_test)

# Plot feature importance
plot_importance(model)
plt.show()

Result
