• Builds on GBDT, adding many algorithmic innovations and engineering work to achieve fast, distributed training
  • First, through a step-by-step derivation using the second-order term of a Taylor expansion, derives a formula that scores the gain of a candidate split point
  • Proposes two methods to prevent over-fitting
    • The predictions of each newly trained tree are multiplied by a shrinkage factor before being added to the forest
    • Sub-sampling of features (columns), the same method used in Random Forests
  • Introduces two methods for finding the best split point
    • Enumerate all split points and pick the best one
    • Use the second-order derivatives as weights and partition the data into equal-weight buckets; the bucket boundaries serve as candidate split points, which are then scanned in turn. This method can be used in distributed computation
  • For sparse data, proposes learning a default path: when enumerating split points, the missing entries are assigned to one of the two branches, and the direction that performs better becomes the default path
  • The paper also contains many engineering innovations
    • To handle the large amount of sorting required during tree training, it designs a block structure: each feature is sorted by feature value and stored as an index, so the sorting work is finished before training starts
    • It also optimizes storage for the CPU cache, mainly by keeping the CPU's memory accesses as contiguous as possible
    • When memory is insufficient, an out-of-core mechanism writes the excess data to disk and has an independent thread pre-fetch it into memory before it is needed, raising the maximum size of the training data. On top of this, block compression and block sharding further speed up training
  • Additional differences from traditional GBM
    • Uses the second-order term of the Taylor expansion (Newton's method) to compute the descent direction
    • Does not search for a step size
    • Adds a regularization term
    • Supports parallel processing
    • Has built-in rules for handling missing values
    • Uses post-pruning instead of pre-pruning (as in GBM)
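The shrinkage update described in the notes can be sketched as follows; this is a minimal illustration assuming a squared loss and a trivial constant-output "tree", not the paper's implementation:

```python
import numpy as np

def boost_with_shrinkage(y, n_rounds=10, eta=0.3):
    """Additive training with shrinkage: each new tree's prediction is
    scaled by eta before being added to the ensemble. Each 'tree' here
    is just the mean residual (a constant), enough to show the update."""
    pred = np.zeros_like(y, dtype=float)
    for _ in range(n_rounds):
        residual = y - pred               # negative gradient of squared loss
        tree_output = np.full_like(pred, residual.mean())
        pred += eta * tree_output         # shrinkage factor damps each tree
    return pred
```

With `eta = 0.3` the ensemble approaches the target geometrically: more rounds are needed, but each individual tree has less influence, leaving room for future trees to improve the model.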


We propose a novel sparsity-aware algorithm for sparse data and a weighted quantile sketch for approximate tree learning.


A key factor behind its success is scalability in all scenarios.

  • a novel tree learning algorithm for handling sparse data
  • a theoretically justified weighted quantile sketch procedure that enables handling instance weights in approximate tree learning
  • an effective cache-aware block structure for out-of-core tree learning

We also make additional improvements, such as a regularized learning objective.


Regularized Learning Objective

The objective is:

$$\mathcal{L}(\phi) = \sum_i l(\hat y_i, y_i) + \sum_k \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda\|w\|^2$$

where $T$ is the number of leaves and $w$ is the vector of leaf weights. The regularization term penalizes the complexity of the model to avoid over-fitting.
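As a concrete illustration, the penalty can be computed from a tree's leaf weights alone; the helper name and default values below are hypothetical:

```python
import numpy as np

def tree_complexity(leaf_weights, gamma=1.0, lam=1.0):
    """Omega(f) = gamma * T + 0.5 * lambda * ||w||^2, where T is the
    number of leaves and w the vector of leaf weights."""
    w = np.asarray(leaf_weights, dtype=float)
    return gamma * w.size + 0.5 * lam * np.dot(w, w)
```

A tree pays a fixed price `gamma` for every extra leaf and a quadratic price for large leaf weights, which biases learning toward small, smooth trees.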

Gradient Tree Boosting

The model is trained in an additive manner; at step $t$ we add the tree $f_t$ that most improves the objective:

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat y_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$

A second-order Taylor approximation can be used to quickly optimize the objective; after removing constant terms:

$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n}\left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t)$$

where $g_i = \partial_{\hat y_i^{(t-1)}} l(y_i, \hat y_i^{(t-1)})$ and $h_i = \partial^2_{\hat y_i^{(t-1)}} l(y_i, \hat y_i^{(t-1)})$ are the first- and second-order gradients of the loss.
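For common losses the gradient statistics have closed forms; a small sketch (function names are ours, assuming a squared loss with a 1/2 factor and a logistic loss on raw margins):

```python
import numpy as np

def grad_hess_squared(y, y_hat):
    """g_i and h_i for the squared loss l = 0.5 * (y_hat - y)^2."""
    return y_hat - y, np.ones_like(y_hat)

def grad_hess_logistic(y, y_hat):
    """g_i and h_i for the logistic loss; y_hat is the raw margin
    (before the sigmoid), y is the 0/1 label."""
    p = 1.0 / (1.0 + np.exp(-y_hat))
    return p - y, p * (1.0 - p)
```

Only these two per-instance statistics are needed downstream, which is what makes the framework loss-agnostic.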

Defining $I_j = \{i \mid q(x_i) = j\}$ as the instance set of leaf $j$, we can rewrite the objective as:

$$\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T}\left[\left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right) w_j^2\right] + \gamma T$$

For a fixed structure $q(x)$, we can compute the optimal weight $w_j^*$ of leaf $j$ by setting the derivative with respect to $w_j$ to zero:

$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$

and calculate the corresponding optimal objective value:

$$\tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2}\sum_{j=1}^{T}\frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$

This equation can be used as a scoring function to measure the quality of a tree structure $q$.

Since enumerating all tree structures is impossible, a greedy algorithm splits one leaf at a time; the loss reduction from splitting an instance set $I$ into $I_L$ and $I_R$ is:

$$\mathcal{L}_{split} = \frac{1}{2}\left[\frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda}\right] - \gamma$$
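The optimal leaf weight and the split gain can be sketched directly from these formulas (hypothetical helper names; `lam` and `gamma` play the roles of $\lambda$ and $\gamma$):

```python
import numpy as np

def leaf_weight(g, h, lam=1.0):
    """Optimal leaf weight w* = -G / (H + lambda) for the instances
    falling into one leaf, where G and H sum the g_i and h_i."""
    return -np.sum(g) / (np.sum(h) + lam)

def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Loss reduction of a split:
    0.5 * [GL^2/(HL+lam) + GR^2/(HR+lam) - (GL+GR)^2/(HL+HR+lam)] - gamma."""
    GL, HL = np.sum(g_left), np.sum(h_left)
    GR, HR = np.sum(g_right), np.sum(h_right)
    score = lambda G, H: G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR)
                  - score(GL + GR, HL + HR)) - gamma
```

A split is only worth taking when the gain exceeds zero, so `gamma` acts as a minimum-gain threshold, which is how the regularizer prunes the tree.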

Shrinkage and Column Subsampling

Two additional techniques to prevent over-fitting:

  • shrinkage: scale the newly added tree's weights by a factor $\eta$ after each boosting step
  • column (feature) sub-sampling: according to user feedback, this prevents over-fitting even more than traditional row sub-sampling
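Column sub-sampling amounts to drawing a random subset of feature indices per tree; a minimal sketch (function name and defaults are illustrative):

```python
import numpy as np

def sample_columns(n_features, colsample=0.5, rng=None):
    """Pick a random subset of feature indices for one tree, as in
    random forests; split finding for that tree only considers
    these columns."""
    rng = np.random.default_rng(rng)
    k = max(1, int(round(colsample * n_features)))
    return np.sort(rng.choice(n_features, size=k, replace=False))
```

Besides regularization, this also speeds up split finding, since fewer columns have to be scanned per node.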


Basic Exact Greedy Algorithm

[Figure: exact greedy split finding algorithm]
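The exact greedy algorithm for a single feature can be sketched as a sorted sweep that accumulates $G_L$, $H_L$ and scores every boundary (a simplified single-feature version, not the paper's code):

```python
import numpy as np

def exact_greedy_split(x, g, h, lam=1.0):
    """Enumerate every split point on one feature: sort by feature
    value, sweep left-to-right accumulating G_L and H_L, and score
    each boundary with the gain formula. Returns (best_gain, threshold)."""
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()
    GL = HL = 0.0
    best_gain, best_thr = -np.inf, None
    for i in range(len(x) - 1):
        GL += g[i]; HL += h[i]
        if x[i] == x[i + 1]:            # cannot split between equal values
            continue
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                      - G**2 / (H + lam))
        if gain > best_gain:
            best_gain, best_thr = gain, 0.5 * (x[i] + x[i + 1])
    return best_gain, best_thr
```

The sort dominates the cost, which is exactly why the paper pre-sorts each feature once into the block structure described later.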

Approximate Algorithm

To support effective gradient tree boosting when:

  • the data does not fit entirely into memory
  • training runs in a distributed setting

we need an approximate algorithm:

  • first proposes candidate splitting points according to percentiles of the feature distribution
  • then maps the continuous features into buckets split by these candidate points
  • finally aggregates the statistics per bucket and finds the best split among the proposals
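The three steps above can be sketched end-to-end for one feature; this simplified version uses plain (unweighted) percentiles and is only meant to show the bucketing flow:

```python
import numpy as np

def approximate_best_split(x, g, h, n_buckets=4, lam=1.0):
    """Approximate split finding: propose candidates at feature
    percentiles, map values into buckets, aggregate (G, H) per bucket,
    then scan only the candidate boundaries."""
    # 1. candidate split points from percentiles of the feature
    qs = np.linspace(0, 100, n_buckets + 1)[1:-1]
    candidates = np.unique(np.percentile(x, qs))
    # 2. map continuous values to buckets; bucket j covers (c_{j-1}, c_j]
    bucket = np.searchsorted(candidates, x, side='left')
    # 3. aggregate gradient statistics per bucket
    nb = len(candidates) + 1
    G, H = np.zeros(nb), np.zeros(nb)
    np.add.at(G, bucket, g)
    np.add.at(H, bucket, h)
    # 4. best split among the candidate thresholds only
    Gtot, Htot = G.sum(), H.sum()
    best_gain, best_thr = -np.inf, None
    GL = HL = 0.0
    for j, thr in enumerate(candidates):
        GL += G[j]; HL += H[j]
        GR, HR = Gtot - GL, Htot - HL
        gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                      - Gtot**2 / (Htot + lam))
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_gain, best_thr
```

Because only per-bucket sums travel between workers, this formulation parallelizes and distributes naturally, at the cost of only considering candidate thresholds.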

Weighted Quantile Sketch

Define the rank function $r_k(z)$ as the proportion of instances whose $k$-th feature value is smaller than $z$, weighted by the second-order gradient $h$:

$$r_k(z) = \frac{1}{\sum_{(x,h) \in \mathcal{D}_k} h} \sum_{(x,h) \in \mathcal{D}_k,\ x < z} h$$

the goal is to find candidate split points $\{s_{k1}, s_{k2}, \dots, s_{kl}\}$ such that consecutive candidates are less than $\epsilon$ apart in weighted rank:

$$\left|r_k(s_{k,j}) - r_k(s_{k,j+1})\right| < \epsilon, \qquad s_{k1} = \min_i x_{ik},\quad s_{kl} = \max_i x_{ik}$$

where $\epsilon$ is an approximation factor, so there are roughly $1/\epsilon$ candidate points
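A simplified, exact (non-streaming) version of the weighted quantile idea can be sketched as follows; the paper's actual contribution is a mergeable sketch with provable guarantees, which this toy version does not implement:

```python
import numpy as np

def weighted_rank(xs, hs, z):
    """r_k(z): fraction of the total second-order weight h carried by
    instances with feature value strictly less than z."""
    xs, hs = np.asarray(xs, float), np.asarray(hs, float)
    return hs[xs < z].sum() / hs.sum()

def weighted_quantile_candidates(xs, hs, eps):
    """Greedy pass over the sorted values: emit a candidate whenever
    the accumulated weight since the last candidate reaches eps of
    the total, keeping min and max as the first and last candidates."""
    order = np.argsort(xs)
    xs = np.asarray(xs, float)[order]
    hs = np.asarray(hs, float)[order]
    total = hs.sum()
    candidates = [xs[0]]
    acc = 0.0
    for x, h in zip(xs, hs):
        acc += h
        if acc / total >= eps and x != candidates[-1]:
            candidates.append(x)
            acc = 0.0
    if candidates[-1] != xs[-1]:
        candidates.append(xs[-1])
    return candidates
```

With uniform weights this reduces to ordinary equal-frequency quantiles; the point of the $h$-weighting is that the second-order approximation is a weighted squared loss with weight $h_i$, so buckets should be equal in $h$, not in count.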

Sparsity-aware Split Finding

We propose to add a default direction in each tree node: when a value is missing, the instance is classified into the default direction, which is learned from the data

[Figure: example tree with default directions for missing values]

[Figure: sparsity-aware split finding algorithm]

The key improvement is to only visit the non-missing entries $I_k$, which makes the complexity linear in the number of non-missing entries
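The sparsity-aware scan can be sketched by computing, at each candidate threshold, the gain with the missing-value statistics lumped into either branch (hypothetical helper, single feature, NaN marks missing):

```python
import numpy as np

def best_split_with_default(x, g, h, lam=1.0):
    """Sparsity-aware split finding: only non-missing entries are
    scanned; the missing entries' statistics are lumped into the left
    or the right branch, and the direction with the larger gain becomes
    the default direction of the node."""
    miss = np.isnan(x)
    Gm, Hm = g[miss].sum(), h[miss].sum()      # stats of missing rows
    xs, gs, hs = x[~miss], g[~miss], h[~miss]
    order = np.argsort(xs)
    xs, gs, hs = xs[order], gs[order], hs[order]
    G, H = gs.sum() + Gm, hs.sum() + Hm
    score = lambda GG, HH: GG * GG / (HH + lam)
    parent = score(G, H)
    best = (-np.inf, None, None)               # (gain, threshold, default)
    GL = HL = 0.0
    for i in range(len(xs) - 1):
        GL += gs[i]; HL += hs[i]
        if xs[i] == xs[i + 1]:
            continue
        thr = 0.5 * (xs[i] + xs[i + 1])
        # option 1: missing values follow the right branch
        gain_r = 0.5 * (score(GL, HL) + score(G - GL, H - HL) - parent)
        # option 2: missing values follow the left branch
        gain_l = 0.5 * (score(GL + Gm, HL + Hm)
                        + score(G - GL - Gm, H - HL - Hm) - parent)
        for gain, default in ((gain_r, 'right'), (gain_l, 'left')):
            if gain > best[0]:
                best = (gain, thr, default)
    return best
```

Each candidate threshold is scored twice (missing left, missing right), but the sweep itself still touches only the non-missing entries, which is where the linear-in-non-missing complexity comes from.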

[Figure: sparsity-aware vs. basic algorithm on sparse data]


Column Block for Parallel Learning

Data is stored in in-memory units called blocks, in compressed column (CSC) format, with each column sorted by the corresponding feature value; this sorting is done once before training and reused in later iterations.

[Figure: block structure for parallel learning]

Cache-aware Access

The sorted block layout makes gradient accumulation access rows in a non-contiguous order, causing cache misses. For the exact greedy algorithm, we allocate an internal buffer in each thread, fetch the gradient statistics into it, and then perform accumulation in a mini-batch manner.
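The mini-batch accumulation idea can be sketched as follows; the buffer here is a NumPy copy standing in for a thread-local array, so this only illustrates the access pattern, not the actual speedup:

```python
import numpy as np

def accumulate_minibatch(idx, g, h, batch=64):
    """Cache-aware accumulation sketch: instead of following the
    non-contiguous row indices idx one at a time, first gather the
    gradient statistics for a mini-batch into a small contiguous
    buffer, then accumulate from the buffer."""
    G = H = 0.0
    for start in range(0, len(idx), batch):
        rows = idx[start:start + batch]
        g_buf = g[rows].copy()       # contiguous thread-local buffer
        h_buf = h[rows].copy()
        G += g_buf.sum()
        H += h_buf.sum()
    return G, H
```

The gather step turns a long run of random reads into short bursts followed by sequential accumulation, which is what reduces cache misses in the real implementation.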

[Figure: impact of cache-aware prefetching on performance]

Blocks for Out-of-core Computation

During computation, it is important to use an independent thread to pre-fetch the block into a main memory buffer, so that computation can proceed concurrently with disk reading. Since disk reading still dominates, it is also important to reduce the overhead and increase the throughput of disk IO.

We mainly use two techniques to improve the out-of-core computation:

  • Block Compression: blocks are compressed by column on disk and decompressed on the fly by an independent thread while being read
  • Block Sharding: the data is sharded across multiple disks, each with its own pre-fetcher thread, to increase disk throughput

[Figure: out-of-core throughput with compression and sharding]


System Implementation


Dataset and Setup



[Table: dataset summary]

Learning to Rank

[Table: learning-to-rank performance comparison]

Out-of-core Experiment

[Figure: out-of-core experiment results]

Distributed Experiment

XGBoost is able to take advantage of out-of-core computing and smoothly scale to all 1.7 billion examples