
## Summary

• Builds on GBDT with many algorithmic innovations and engineering work to achieve fast, distributed training
• Through a step-by-step derivation using the second-order term of the Taylor expansion, arrives at a formula for scoring the gain of a split point (derived in the tree boosting section below)
• Proposes two methods to prevent over-fitting:
• the predictions of each newly trained tree are multiplied by a shrinkage factor before being added to the ensemble
• column (feature) sub-sampling (the same idea as in Random Forests)
• Describes two methods for finding the best split point:
• exact greedy: enumerate all candidate split points and pick the best one
• approximate: use the second-order gradients as instance weights and partition the feature values into equal-weight buckets; the bucket boundaries become the candidate split points, which are then scanned in turn. This method also works in the distributed setting
• For sparse data, proposes learning a default direction: while enumerating split points, the missing entries are assigned to one of the two branches, and whichever direction works better becomes the default
• The paper also contains many engineering innovations:
• to handle the large amount of sorting needed during tree training, it designs a Block structure that stores each feature pre-sorted by value together with indices back to the instances, so the sorting is done once before training
• the storage layout is also optimized for the CPU cache, mainly by keeping memory accesses as contiguous as possible
• when memory is insufficient, an out-of-core mechanism writes the extra data to disk and an independent thread pre-fetches it before it is needed, which raises the maximum size of the training data; on top of this, block compression and block sharding speed up training further
• Additional differences from traditional GBM:
• uses the second-order term of the Taylor expansion (Newton's method) to compute the descent direction
• does not search for a step size
• adds a regularization term
• supports parallel processing
• has built-in handling of missing values
• uses post-pruning instead of the pre-pruning used by GBM

## ABSTRACT

We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning.

## INTRODUCTION

The most important factor behind the success of XGBoost is its scalability in all scenarios

• a novel tree learning algorithm for handling sparse data
• a theoretically justified weighted quantile sketch procedure enables handling instance weights in approximate tree learning
• We propose an effective cache-aware block structure for out-of-core tree learning.

we also make additional improvements, such as a regularized learning objective

## TREE BOOSTING IN A NUTSHELL

### Regularized Learning Objective

The regularized objective is $\mathcal{L}(\phi) = \sum_i l(\hat y_i, y_i) + \sum_k \Omega(f_k)$, where $\Omega(f_k) = \gamma T + \frac {1} {2}\lambda||w||^2$ ($T$ is the number of leaves and $w$ the vector of leaf weights)
The regularization term penalizes the complexity of the model to avoid over-fitting

The model is trained in an additive manner:
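$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\!\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)\right) + \Omega(f_t)$$

i.e. at step $t$ the tree $f_t$ that most improves the current predictions $\hat{y}_i^{(t-1)}$ is added greedily.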

Second-order approximation can be used to quickly optimize the objective:
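$$\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n}\left[l\!\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t^2(\mathbf{x}_i)\right] + \Omega(f_t)$$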

where $g_i = \partial_{\hat y_i^{(t - 1)}}l(y_i, \hat y_i^{(t - 1)})$ and $h_i = \partial^2_{\hat y_i^{(t - 1)}}l(y_i, \hat y_i^{(t - 1)})$

Defining $I_j = \{i \mid q(\mathbf{x}_i) = j\}$ as the instance set of leaf $j$, we can rewrite $\tilde{\mathcal{L}}^{(t)}$ as follows:
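$$\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T}\left[\left(\sum_{i\in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i\in I_j} h_i + \lambda\right) w_j^2\right] + \gamma T$$

(the constant term $l(y_i, \hat{y}_i^{(t-1)})$ from the expansion has been dropped, $\Omega(f_t) = \gamma T + \frac{1}{2}\lambda\sum_j w_j^2$ has been expanded, and instances are grouped by leaf because every $i \in I_j$ shares the prediction $f_t(\mathbf{x}_i) = w_j$).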

For a fixed structure $q(\mathbf{x})$, we can compute the optimal weight $w_j^*$ of leaf $j$ by setting the derivative with respect to $w_j$ to 0:
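$$w_j^{*} = -\frac{\sum_{i\in I_j} g_i}{\sum_{i\in I_j} h_i + \lambda}$$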

and calculate the corresponding optimal value by:
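$$\tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2}\sum_{j=1}^{T}\frac{\left(\sum_{i\in I_j} g_i\right)^2}{\sum_{i\in I_j} h_i + \lambda} + \gamma T$$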

This equation can be used as a scoring function to measure the quality of a tree structure q

We can also give the function of the loss reduction after the split:
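$$\mathcal{L}_{split} = \frac{1}{2}\left[\frac{\left(\sum_{i\in I_L} g_i\right)^2}{\sum_{i\in I_L} h_i + \lambda} + \frac{\left(\sum_{i\in I_R} g_i\right)^2}{\sum_{i\in I_R} h_i + \lambda} - \frac{\left(\sum_{i\in I} g_i\right)^2}{\sum_{i\in I} h_i + \lambda}\right] - \gamma$$

(where $I_L$ and $I_R$ are the instance sets of the left and right child and $I = I_L \cup I_R$); a negative value means the split does not recover the $\gamma$ cost of the extra leaf.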

### Shrinkage and Column Subsampling

Prevent over-fitting:

• shrinkage (scale each new tree by a learning-rate-like factor $\eta$; see the update rule below)
• column (feature) sub-sampling (prevents over-fitting even more than row sub-sampling)
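In formula form, shrinkage scales the contribution of each newly added tree by the factor $\eta$:

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta\, f_t(\mathbf{x}_i)$$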

## SPLIT FINDING ALGORITHMS

### Approximate Algorithm

To support effective gradient tree boosting when:

• the data does not fit entirely into memory
• training runs in a distributed setting

we need an approximate algorithm (a code sketch follows the list):

• first proposes candidate splitting points according to percentiles of feature distribution
• then maps the continuous features into buckets split by these candidate points
• aggregates the statistics and finds the best solution among proposals based on the aggregated statistics
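A minimal NumPy sketch of this procedure for a single feature (the helper names are mine; plain unweighted percentiles stand in for the paper's weighted quantile sketch, and only the global proposal variant is shown):

```python
import numpy as np

def propose_candidates(x, num_buckets=32):
    """Propose candidate split points as percentiles of one feature column.
    (Unweighted percentiles; the paper refines this with the weighted
    quantile sketch so that the h_i act as instance weights.)"""
    percentiles = np.linspace(0, 100, num_buckets + 1)[1:-1]
    return np.unique(np.percentile(x, percentiles))

def best_approx_split(x, g, h, lam=1.0, gamma=0.0, num_buckets=32):
    """Score candidate splits for one feature using bucketed gradient statistics."""
    candidates = propose_candidates(x, num_buckets)
    # Map each value to a bucket delimited by the candidate points, then
    # aggregate g and h per bucket. Only these per-bucket sums need to be
    # merged across workers, which is what makes the method distributable.
    bucket = np.searchsorted(candidates, x)
    G = np.bincount(bucket, weights=g, minlength=len(candidates) + 1)
    H = np.bincount(bucket, weights=h, minlength=len(candidates) + 1)
    G_total, H_total = G.sum(), H.sum()

    best_gain, best_split = 0.0, None
    G_left = H_left = 0.0
    for k, split_value in enumerate(candidates):
        G_left += G[k]
        H_left += H[k]
        G_right, H_right = G_total - G_left, H_total - H_left
        # Structure-score gain from the derivation above.
        gain = 0.5 * (G_left**2 / (H_left + lam)
                      + G_right**2 / (H_right + lam)
                      - G_total**2 / (H_total + lam)) - gamma
        if gain > best_gain:
            best_gain, best_split = gain, split_value
    return best_split, best_gain
```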

### Weighted Quantile Sketch

define rank function:
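$$r_k(z) = \frac{1}{\sum_{(x,h)\in D_k} h}\sum_{(x,h)\in D_k,\ x < z} h$$

(where $D_k = \{(x_{1k}, h_1), \dots, (x_{nk}, h_n)\}$ collects the feature-$k$ values of the training instances together with their second-order gradients; $r_k(z)$ is thus the $h$-weighted fraction of instances whose feature value is smaller than $z$).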

the goal is to find candidate split points $\{s_{k1}, s_{k2}, \dots, s_{kl}\}$, such that:
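$$\left|r_k(s_{k,j}) - r_k(s_{k,j+1})\right| < \epsilon, \qquad s_{k1} = \min_i x_{ik},\quad s_{kl} = \max_i x_{ik}$$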

where $\epsilon$ is an approximation factor, so there are roughly $1/\epsilon$ candidate points and each bucket carries roughly equal total weight; the weight is $h_i$ because the second-order objective can be viewed as weighted squared loss with weights $h_i$

### Sparsity-aware Split Finding

We propose to add a default direction in each tree node

The key improvement is to only visit the non-missing entries $I_k$
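A minimal sketch of how the default direction could be learned during exact split finding (hypothetical function of my own; NaN marks missing values): each candidate threshold is scored twice, once sending the missing entries left and once right, and the better direction is kept as the default.

```python
import numpy as np

def best_split_with_default(x, g, h, lam=1.0, gamma=0.0):
    """Sparsity-aware split search for one feature (illustrative).
    Only non-missing entries are enumerated; missing entries contribute
    their aggregated g/h to whichever branch they are tentatively sent to,
    and the better of the two assignments becomes the default direction."""
    missing = np.isnan(x)
    G_miss, H_miss = g[missing].sum(), h[missing].sum()
    # Sort only the non-missing entries (the set I_k in the paper).
    order = np.argsort(x[~missing])
    xs, gs, hs = x[~missing][order], g[~missing][order], h[~missing][order]
    G_total, H_total = g.sum(), h.sum()

    def leaf_score(G, H):
        return G * G / (H + lam)

    best_gain, best_threshold, best_default = 0.0, None, None
    G_left = H_left = 0.0
    for i in range(len(xs) - 1):
        G_left += gs[i]
        H_left += hs[i]
        if xs[i] == xs[i + 1]:
            continue  # cannot split between identical feature values
        threshold = 0.5 * (xs[i] + xs[i + 1])
        for default in ("left", "right"):
            GL = G_left + (G_miss if default == "left" else 0.0)
            HL = H_left + (H_miss if default == "left" else 0.0)
            GR, HR = G_total - GL, H_total - HL
            gain = 0.5 * (leaf_score(GL, HL) + leaf_score(GR, HR)
                          - leaf_score(G_total, H_total)) - gamma
            if gain > best_gain:
                best_gain, best_threshold, best_default = gain, threshold, default
    return best_threshold, best_default, best_gain
```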

## SYSTEM DESIGN

### Cache-aware Access

we allocate an internal buffer in each thread, fetch the gradient statistics into it, and then perform accumulation in a mini-batch manner.
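A toy sketch of that mini-batch pattern (my own function name; Python cannot show the cache effect itself, only the access pattern — in the C++ implementation the buffer is sized to fit in cache):

```python
import numpy as np

def accumulate_prefix_stats(sorted_idx, g, h, batch=4096):
    """Instead of reading g/h in the fully random order induced by the sorted
    feature index, gather each mini-batch of row indices into a small
    contiguous buffer first, then accumulate the running prefix sums used by
    split search from that buffer."""
    prefix_G = np.empty(len(sorted_idx))
    prefix_H = np.empty(len(sorted_idx))
    G = H = 0.0
    for start in range(0, len(sorted_idx), batch):
        rows = sorted_idx[start:start + batch]
        g_buf, h_buf = g[rows], h[rows]      # gather step ("fetch into the buffer")
        for j, (gi, hi) in enumerate(zip(g_buf, h_buf)):
            G += gi
            H += hi
            prefix_G[start + j] = G          # accumulation now reads contiguously
            prefix_H[start + j] = H
    return prefix_G, prefix_H
```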

### Blocks for Out-of-core Computation

During computation, it is important to use an independent thread to pre-fetch the block into a main memory buffer

It is important to reduce the overhead and increase the throughput of disk IO

We mainly use two techniques to improve the out-of-core computation:

• Block Compression (each column block is compressed on disk and decompressed on the fly by the pre-fetch thread)
• Block Sharding (the data is sharded across multiple disks, each with its own pre-fetch thread, to raise throughput; a minimal sketch of the pre-fetch idea follows)
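A minimal sketch of the pre-fetch idea, under assumptions of my own (zlib-compressed pickles stand in for the paper's column-wise block compression; sharding would simply run one reader thread per disk):

```python
import pickle
import queue
import threading
import zlib

def block_reader(paths, out_q):
    """Pre-fetch thread: read and decompress the next block from disk while
    the main thread is still training on the previous one."""
    for path in paths:
        with open(path, "rb") as f:
            raw = f.read()
        out_q.put(pickle.loads(zlib.decompress(raw)))
    out_q.put(None)  # sentinel: no more blocks

def train_out_of_core(paths, update_fn, buffer_blocks=2):
    """Consume blocks from a bounded in-memory buffer fed by the pre-fetcher."""
    q = queue.Queue(maxsize=buffer_blocks)
    threading.Thread(target=block_reader, args=(paths, q), daemon=True).start()
    while (block := q.get()) is not None:
        update_fn(block)  # gradient/statistics computation on the current block
```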

## END TO END EVALUATIONS

skip

skip

### Distributed Experiment

It is able to take advantage of out-of-core computing and smoothly scale to all 1.7 billion examples

skip