
XGBoost

Summary

  • Builds on GBDT, adding many algorithmic innovations and engineering work to enable fast, distributed training
  • Through a step-by-step derivation using the second-order term of the Taylor expansion, it arrives at an equation that scores the gain of a split: $$\mathcal{L}_{split}=\frac{1}{2}\left[\frac{\left(\sum_{i\in I_L}g_i\right)^2}{\sum_{i\in I_L}h_i+\lambda}+\frac{\left(\sum_{i\in I_R}g_i\right)^2}{\sum_{i\in I_R}h_i+\lambda}-\frac{\left(\sum_{i\in I}g_i\right)^2}{\sum_{i\in I}h_i+\lambda}\right]-\gamma$$
  • Proposes two methods to prevent over-fitting
    • The predictions of each newly trained tree are scaled by a shrinkage factor before being added to the ensemble
    • Sub-sampling of features (columns), the same idea as in Random Forests
  • Describes two methods for finding the best split point
    • Enumerate all possible split points and pick the best one
    • Use the second-order derivatives as weights and divide the data into equal-weight buckets; the bucket boundaries become the candidate split points, which are then scanned in turn. This method also works in a distributed setting
  • For sparse data, proposes learning a default direction: while enumerating split points, the instances with missing values are assigned to one of the two branches, and whichever direction scores better becomes the default
  • The paper also contains many engineering innovations
    • A block structure is designed for the heavy sorting required in tree training: each feature is sorted by value and stored as an index, so the sorting can be done once before training starts
    • Storage is also optimized for the CPU cache, mainly by keeping memory accesses as contiguous as possible
    • When data does not fit in memory, an out-of-core mechanism writes the excess to disk and a dedicated thread pre-fetches it when it is needed, raising the maximum scale of the training data. On top of this, block compression and block sharding are used to speed up training
  • Some additional differences from a traditional GBM (a parameter-mapping sketch follows this list)
    • Uses second-order Taylor expansion information (Newton-style updates) to compute the descent direction
    • Does not search for a step size
    • Adds a regularization term
    • Supports parallel processing
    • Has built-in rules for handling missing values
    • Uses post-pruning rather than pre-pruning (as in GBM)
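
A minimal sketch, assuming the `xgboost` Python package and synthetic toy data, of how the ideas above map onto common training parameters; the values are illustrative only:

```python
import numpy as np
import xgboost as xgb

# Toy data; NaN marks missing entries, which are routed by the learned
# default directions (sparsity-aware split finding).
X = np.random.rand(1000, 10)
X[np.random.rand(*X.shape) < 0.1] = np.nan
y = np.random.randint(0, 2, size=1000)

dtrain = xgb.DMatrix(X, label=y)   # NaN is treated as missing by default

params = {
    "objective": "binary:logistic",
    "eta": 0.1,                # shrinkage factor applied to each new tree
    "lambda": 1.0,             # L2 penalty on leaf weights (the λ above)
    "gamma": 0.5,              # per-leaf complexity penalty (the γ above)
    "colsample_bytree": 0.8,   # column (feature) sub-sampling
    "max_depth": 6,
    "tree_method": "approx",   # quantile-based approximate split finding
}

booster = xgb.train(params, dtrain, num_boost_round=100)
```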

ABSTRACT

We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning.

INTRODUCTION

The most important factor behind the success of XGBoost is its scalability in all scenarios.

  • a novel tree learning algorithm for handling sparse data
  • a theoretically justified weighted quantile sketch procedure that enables handling instance weights in approximate tree learning
  • an effective cache-aware block structure for out-of-core tree learning

We also make additional improvements, such as a regularized learning objective.

TREE BOOSTING IN A NUTSHELL

Regularized Learning Objective

$$\mathcal{L}(\phi)=\sum_i l(\hat{y}_i, y_i)+\sum_k \Omega(f_k)$$

where $\Omega(f_k)=\gamma T+\frac{1}{2}\lambda\|w\|^2$. It penalizes the complexity of the model to avoid over-fitting.

Gradient Tree Boosting

The model is trained in an additive manner:

$$\mathcal{L}^{(t)}=\sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)}+f_t(x_i)\right)+\Omega(f_t)$$

Second-order approximation can be used to quickly optimize the objective:

$$\mathcal{L}^{(t)}\simeq\sum_{i=1}^{n}\left[l\left(y_i,\hat{y}^{(t-1)}\right)+g_i f_t(x_i)+\frac{1}{2}h_i f_t^2(x_i)\right]+\Omega(f_t)$$

where $g_i=\partial_{\hat{y}^{(t-1)}}\, l\left(y_i,\hat{y}^{(t-1)}\right)$ and $h_i=\partial^2_{\hat{y}^{(t-1)}}\, l\left(y_i,\hat{y}^{(t-1)}\right)$
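
For example, with the squared error loss $l(y_i,\hat{y})=\frac{1}{2}(y_i-\hat{y})^2$ (used here only for illustration), the statistics reduce to:

$$g_i=\hat{y}^{(t-1)}-y_i,\qquad h_i=1$$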

Defining $I_j=\{i\mid q(x_i)=j\}$ as the instance set of leaf $j$, we can rewrite $\mathcal{L}$ as follows:

$$\tilde{\mathcal{L}}^{(t)}=\sum_{j=1}^{T}\left[\left(\sum_{i\in I_j}g_i\right)w_j+\frac{1}{2}\left(\sum_{i\in I_j}h_i+\lambda\right)w_j^2\right]+\gamma T$$

For a fixed structure $q(x)$, we can compute the optimal weight $w_j^*$ of leaf $j$ by setting the derivative with respect to $w_j$ to 0:

$$w_j^*=-\frac{\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i+\lambda}$$
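
Writing $G_j=\sum_{i\in I_j}g_i$ and $H_j=\sum_{i\in I_j}h_i$, each leaf contributes the quadratic $G_j w_j+\frac{1}{2}(H_j+\lambda)w_j^2$, so setting its derivative to zero gives the weight above:

$$\frac{\partial}{\partial w_j}\left[G_j w_j+\frac{1}{2}(H_j+\lambda)w_j^2\right]=G_j+(H_j+\lambda)w_j=0\;\Longrightarrow\;w_j^*=-\frac{G_j}{H_j+\lambda}$$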

Substituting $w_j^*$ back gives the corresponding optimal objective value:

$$\tilde{\mathcal{L}}^{(t)}(q)=-\frac{1}{2}\sum_{j=1}^{T}\frac{\left(\sum_{i\in I_j}g_i\right)^2}{\sum_{i\in I_j}h_i+\lambda}+\gamma T$$

This equation can be used as a scoring function to measure the quality of a tree structure q

We can also write down the loss reduction after a split, which is used to evaluate candidate splits:

$$\mathcal{L}_{split}=\frac{1}{2}\left[\frac{\left(\sum_{i\in I_L}g_i\right)^2}{\sum_{i\in I_L}h_i+\lambda}+\frac{\left(\sum_{i\in I_R}g_i\right)^2}{\sum_{i\in I_R}h_i+\lambda}-\frac{\left(\sum_{i\in I}g_i\right)^2}{\sum_{i\in I}h_i+\lambda}\right]-\gamma$$
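
A minimal NumPy sketch of this scoring function (not the actual XGBoost implementation): given the gradient/hessian sums of the candidate left and right children, it returns the gain of the split.

```python
import numpy as np

def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    """Loss reduction of splitting a node into (left, right) children."""
    def leaf_score(G, H):
        return G * G / (H + lam)
    # Gain = 1/2 [score(left) + score(right) - score(parent)] - gamma
    return 0.5 * (leaf_score(G_L, H_L) + leaf_score(G_R, H_R)
                  - leaf_score(G_L + G_R, H_L + H_R)) - gamma

def leaf_weight(G, H, lam=1.0):
    """Optimal leaf weight w_j* = -G / (H + lambda)."""
    return -G / (H + lam)
```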

Shrinkage and Column Subsampling

Prevent over-fitting:

  • shrinkage
  • column (feature) sub-sampling (better than row sub-sampling)

SPLIT FINDING ALGORITHMS

Basic Exact Greedy Algorithm

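The exact greedy algorithm enumerates all possible splits on all features by scanning the instances sorted by each feature's value. A minimal single-feature sketch (NumPy arrays, reusing the hypothetical `split_gain` helper sketched above):

```python
import numpy as np

def best_split_exact(x, g, h, lam=1.0, gamma=0.0):
    """Scan all split points of one feature x with gradients g and hessians h."""
    order = np.argsort(x)                  # sort instances by feature value
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()                # parent sums
    G_L = H_L = 0.0
    best_gain, best_value = 0.0, None
    for i in range(len(x) - 1):
        G_L += g[i]; H_L += h[i]           # move instance i into the left child
        if x[i] == x[i + 1]:               # can only split between distinct values
            continue
        gain = split_gain(G_L, H_L, G - G_L, H - H_L, lam, gamma)
        if gain > best_gain:
            best_gain, best_value = gain, (x[i] + x[i + 1]) / 2
    return best_value, best_gain
```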

Approximate Algorithm

To support effective gradient tree boosting when:

  • the data does not fit entirely into memory
  • training runs in a distributed setting

we need an approximate algorithm:

  • first proposes candidate splitting points according to percentiles of the feature distribution
  • then maps the continuous features into buckets split by these candidate points
  • finally aggregates the statistics and finds the best solution among the proposals based on the aggregated statistics (a sketch of these steps follows below)
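
A minimal NumPy sketch of these three steps for a single feature, reusing the hypothetical `split_gain` helper (plain, unweighted percentiles here; the weighted variant is described in the next subsection):

```python
import numpy as np

def best_split_approx(x, g, h, n_buckets=32, lam=1.0, gamma=0.0):
    """Approximate split finding: propose candidates, bucket, aggregate, scan."""
    # 1) propose candidate split points from percentiles of the feature
    candidates = np.unique(np.percentile(x, np.linspace(0, 100, n_buckets + 1)[1:-1]))
    # 2) map continuous values into buckets delimited by the candidates
    bucket = np.searchsorted(candidates, x)
    # 3) aggregate gradient statistics per bucket, then scan the candidates
    G_b = np.bincount(bucket, weights=g, minlength=len(candidates) + 1)
    H_b = np.bincount(bucket, weights=h, minlength=len(candidates) + 1)
    G, H = G_b.sum(), H_b.sum()
    best_gain, best_value = 0.0, None
    G_L = H_L = 0.0
    for j, s in enumerate(candidates):
        G_L += G_b[j]; H_L += H_b[j]       # left child = instances with x <= s
        gain = split_gain(G_L, H_L, G - G_L, H - H_L, lam, gamma)
        if gain > best_gain:
            best_gain, best_value = gain, s
    return best_value, best_gain
```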

Weighted Quantile Sketch

Define the rank function $r_k(z)$ as the proportion of instances whose $k$-th feature value is smaller than $z$, weighted by the second-order gradient statistics:

$$r_k(z)=\frac{1}{\sum_{(x,h)\in D_k}h}\sum_{(x,h)\in D_k,\,x<z}h$$

where $D_k=\{(x_{1k},h_1),(x_{2k},h_2),\dots,(x_{nk},h_n)\}$ collects the $k$-th feature values and second-order gradients of the training instances.

The goal is to find candidate split points $\{s_{k1},s_{k2},\dots,s_{kl}\}$ such that:

$$|r_k(s_{k,j})-r_k(s_{k,j+1})|<\epsilon,\qquad s_{k,1}=\min_i x_{ik},\quad s_{k,l}=\max_i x_{ik}$$

where $\epsilon$ is an approximation factor, so that there are roughly $1/\epsilon$ candidate points
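
A minimal sketch of the idea (not the paper's actual merge/prune sketch data structure): each instance is weighted by its second-order gradient $h_i$, and candidates are chosen so that consecutive ones are roughly $\epsilon$ apart in cumulative weighted rank.

```python
import numpy as np

def weighted_quantile_candidates(x, h, eps=0.1):
    """Pick candidates so consecutive ones differ by < eps in h-weighted rank."""
    order = np.argsort(x)
    x_sorted, h_sorted = x[order], h[order]
    # cumulative fraction of total hessian weight up to and including each value
    # (an inclusive approximation of the rank function r_k above)
    rank = np.cumsum(h_sorted) / h_sorted.sum()
    candidates = [x_sorted[0]]             # s_{k,1} = min_i x_{ik}
    last_rank = 0.0
    for value, r in zip(x_sorted, rank):
        if r - last_rank >= eps:
            candidates.append(value)
            last_rank = r
    candidates.append(x_sorted[-1])        # s_{k,l} = max_i x_{ik}
    return np.unique(np.array(candidates))
```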

Sparsity-aware Split Finding

We propose to add a default direction in each tree node


The key improvement is to only visit the non-missing entries $I_k$.

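A minimal NumPy sketch of the idea, reusing the hypothetical `split_gain` helper from above: only the non-missing entries are scanned, and the aggregated statistics of the missing entries are tried on each side; the better-scoring side becomes the default direction.

```python
import numpy as np

def best_split_with_default(x, g, h, lam=1.0, gamma=0.0):
    """Sparsity-aware split finding: learn a default direction for missing values."""
    missing = np.isnan(x)
    G_all, H_all = g.sum(), h.sum()
    G_miss, H_miss = g[missing].sum(), h[missing].sum()
    # Only visit the non-missing entries, sorted by feature value.
    order = np.argsort(x[~missing])
    xs, gs, hs = x[~missing][order], g[~missing][order], h[~missing][order]
    best = (0.0, None, None)               # (gain, threshold, default direction)
    G_L = H_L = 0.0
    for i in range(len(xs) - 1):
        G_L += gs[i]; H_L += hs[i]
        if xs[i] == xs[i + 1]:
            continue
        thr = (xs[i] + xs[i + 1]) / 2
        # missing values default to the right child
        gain_r = split_gain(G_L, H_L, G_all - G_L, H_all - H_L, lam, gamma)
        # missing values default to the left child
        gain_l = split_gain(G_L + G_miss, H_L + H_miss,
                            G_all - G_L - G_miss, H_all - H_L - H_miss, lam, gamma)
        if gain_r > best[0]:
            best = (gain_r, thr, "right")
        if gain_l > best[0]:
            best = (gain_l, thr, "left")
    return best
```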

SYSTEM DESIGN

Column Block for Parallel Learning

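In the paper, the data is stored in in-memory blocks with each column sorted by its feature value; this sorting is done once before training and reused by every later split search, and the sorted columns can be scanned in parallel across features. A minimal sketch of the pre-sorted column index idea (NumPy; the real compressed block layout is more involved):

```python
import numpy as np

class ColumnBlock:
    """Pre-sorted column index: sort each feature once, reuse for every split search."""
    def __init__(self, X):
        # For every feature, the row indices ordered by that feature's value.
        self.sorted_index = np.argsort(X, axis=0)
        self.X = X

    def scan(self, feature, g, h):
        """Yield (feature value, g_i, h_i) in sorted order without re-sorting."""
        for row in self.sorted_index[:, feature]:
            yield self.X[row, feature], g[row], h[row]
```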

Cache-aware Access

We allocate an internal buffer in each thread, fetch the gradient statistics into it, and then perform accumulation in a mini-batch manner.
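
Python cannot demonstrate the actual CPU-cache effect, but the access pattern can be sketched: the row indices coming from the sorted columns make the reads of $g_i$ and $h_i$ non-contiguous, so they are first gathered into a small contiguous buffer and only then accumulated (a rough sketch, with a hypothetical `batch` size):

```python
import numpy as np

def accumulate_cache_aware(sorted_rows, g, h, batch=64):
    """Gather gradient statistics into a small contiguous buffer, then accumulate."""
    G = H = 0.0
    for start in range(0, len(sorted_rows), batch):
        idx = sorted_rows[start:start + batch]
        g_buf = g[idx]            # gather the non-contiguous reads into one buffer
        h_buf = h[idx]
        G += g_buf.sum()          # accumulation now runs over contiguous memory
        H += h_buf.sum()
    return G, H
```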


Blocks for Out-of-core Computation

During computation, it is important to use an independent thread to pre-fetch the block into a main memory buffer

It is important to reduce the overhead and increase the throughput of disk IO

We mainly use two techniques to improve the out-of-core computation:

  • Block Compression
  • Block Sharding
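
A minimal sketch of the pre-fetch idea using Python threads; `block_files` and `load_block` (which would read and, e.g., decompress one block) are hypothetical stand-ins for the actual on-disk block format:

```python
import queue
import threading

def iterate_blocks(block_files, load_block, capacity=2):
    """Pre-fetch disk blocks on an independent thread into a bounded in-memory buffer."""
    buf = queue.Queue(maxsize=capacity)
    DONE = object()

    def prefetch():
        for path in block_files:
            buf.put(load_block(path))   # disk IO overlaps with training
        buf.put(DONE)

    threading.Thread(target=prefetch, daemon=True).start()
    while True:
        block = buf.get()
        if block is DONE:
            return
        yield block                      # training consumes blocks already in memory
```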


END TO END EVALUATIONS

System Implementation

skip

Dataset and Setup

skip

Classification

(classification results figure from the paper)

Learning to Rank

(learning-to-rank results figure from the paper)

Out-of-core Experiment

(out-of-core scaling figure from the paper)

Distributed Experiment

It is able to take advantage of out-of-core computing and smoothly scale to all 1.7 billion examples

CONCLUSION

skip