
A Gradient Boosting Machine

GBDT = Gradient Boosting + Decision Tree
other names:

  • MART (Multiple Additive Regression Trees)
  • GBRT (Gradient Boosting Regression Trees)
  • TreeNet

Ensemble learning

why ensemble?

  • statistical: several different hypotheses may fit the training data equally well, and averaging them reduces the risk of committing to the wrong one
  • computational: individual learners can get stuck in different local optima, and combining their solutions can land closer to the true function
  • representational: a combination of base hypotheses can express functions that no single base learner can represent on its own

ensemble types

bagging

Bootstrap AGGregatING
trains weak learners in parallel, each on a bootstrap sample of the training data
reduces variance
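
A minimal sketch of the bagging idea, assuming scikit-learn regression trees as the weak learners (the function names and parameter values are illustrative, not from the original notes):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, n_learners=50, seed=0):
    """Train weak learners independently, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    learners = []
    for _ in range(n_learners):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap: sample rows with replacement
        tree = DecisionTreeRegressor(max_depth=3)    # a weak (shallow) learner
        tree.fit(X[idx], y[idx])
        learners.append(tree)
    return learners

def bagging_predict(learners, X):
    """Averaging many roughly independent predictions lowers the ensemble's variance."""
    return np.mean([t.predict(X) for t in learners], axis=0)
```

Because the learners never see each other, the training loop above parallelizes trivially.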

boosting

trains weak learners sequentially, each new learner correcting the errors of the current ensemble
reduces bias

  • AdaBoost (reweights the training examples; see the sketch after this list)
  • gradient boosting
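
A compact AdaBoost sketch for binary labels in {-1, +1}, to make the sequential idea concrete; the helper names and the stump depth are assumptions made for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """AdaBoost with decision stumps; y must be in {-1, +1}.
    Each round upweights the examples the current ensemble gets wrong,
    so later learners concentrate on the remaining errors."""
    w = np.full(len(y), 1.0 / len(y))            # example weights, kept normalized
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)    # weight of this weak learner
        w *= np.exp(-alpha * y * pred)           # mistakes get larger weights
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    return np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
```

Gradient boosting, derived next, generalizes this sequential scheme to arbitrary differentiable loss functions.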

Gradient boosting

optimization in parameter space

optimization goal:

$$P^* = \arg\min_{P} \Phi(P) = \arg\min_{P} E_{y,x}\,\psi\big(y, F(x; P)\big)$$

the solution for the parameters takes the form:

$$P^* = \sum_{m=0}^{M} p_m$$

where $p_0$ is an initial guess and $\{p_m\}_1^M$ are successive increments (“steps” or “boosts”)

steepest-descent

define the increments $\{p_m\}_1^M$ as follows:

  1. the current gradient $g_m$ is computed:
     $$g_m = \{g_{jm}\} = \left[ \frac{\partial \Phi(P)}{\partial P_j} \right]_{P = P_{m-1}}$$
     where $P_{m-1} = \sum_{i=0}^{m-1} p_i$
  2. the step size $\rho_m$ is found by line search:
     $$\rho_m = \arg\min_{\rho} \Phi(P_{m-1} - \rho\, g_m)$$
  3. the step is taken to be $p_m = -\rho_m g_m$ (a numerical sketch of these three steps follows the list)
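
A numerical sketch of the three steps on a toy objective; the linear model, the synthetic data, and the crude grid line search are all assumptions made for illustration:

```python
import numpy as np

def steepest_descent(phi, grad_phi, p0, n_steps=100):
    """Follows steps 1-3 above: gradient, line search, increment p_m = -rho_m * g_m."""
    P = np.asarray(p0, dtype=float)        # running sum P_{m-1} = p_0 + ... + p_{m-1}
    for _ in range(n_steps):
        g = grad_phi(P)                                   # step 1: current gradient
        rho = min(np.linspace(0.0, 1.0, 101),             # step 2: crude grid line search
                  key=lambda r: phi(P - r * g))
        P = P - rho * g                                   # step 3: take the step
    return P

# toy objective: empirical squared loss of a linear model F(x; P) = x . P
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
phi = lambda P: np.mean((y - X @ P) ** 2)
grad_phi = lambda P: -2.0 * X.T @ (y - X @ P) / len(y)

print(steepest_descent(phi, grad_phi, p0=np.zeros(3)))    # approaches [1.0, -2.0, 0.5]
```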

optimization in function space

we consider F(x) evaluated at each point x to be a “parameter” and seek to minimize

$$F^*(x) = \arg\min_{F(x)} E_{y,x}\,\psi\big(y, F(x)\big)$$

we take the solution to be

$$F^*(x) = \sum_{m=0}^{M} f_m(x)$$

where $f_0(x)$ is an initial guess and $\{f_m(x)\}_1^M$ are incremental functions

current gradient:

$$g_m(x) = E_y\!\left[ \left. \frac{\partial\, \psi\big(y, F(x)\big)}{\partial F(x)} \,\right|\, x \right]_{F(x) = F_{m-1}(x)}$$

and

$$F_{m-1}(x) = \sum_{i=0}^{m-1} f_i(x)$$

the multiplier $\rho_m$ is given by the line search

$$\rho_m = \arg\min_{\rho} E_{y,x}\,\psi\big(y, F_{m-1}(x) - \rho\, g_m(x)\big)$$

and the steepest-descent step is:

$$f_m(x) = -\rho_m\, g_m(x)$$
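
As a sanity check on the sign and meaning of $g_m(x)$, take the squared-error loss $\psi(y, F) = \tfrac{1}{2}(y - F)^2$ (a standard special case, not spelled out in the original). Then

$$g_m(x) = E_y\big[F(x) - y \,\big|\, x\big]_{F(x) = F_{m-1}(x)} = F_{m-1}(x) - E[y \mid x]$$

so $-g_m(x) = E[y \mid x] - F_{m-1}(x)$: each steepest-descent increment $f_m(x)$ moves $F_{m-1}(x)$ toward the conditional mean by a fraction $\rho_m$ of the remaining residual.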

finite data

the above approach breaks down when the joint distribution of $(y, x)$ is represented only by a finite data sample; we therefore restrict the increments to a parameterized class of base learners $h(x; a)$ and fit them to the data:

$$(\beta_m, a_m) = \arg\min_{\beta, a} \sum_{i=1}^{N} \psi\big(y_i, F_{m-1}(x_i) + \beta\, h(x_i; a)\big)$$

and then

$$F_m(x) = F_{m-1}(x) + \beta_m\, h(x; a_m)$$

when this is hard to solve directly, fit the base learner to the negative gradient (the pseudo-residuals) by least squares:

$$a_m = \arg\min_{a, \beta} \sum_{i=1}^{N} \big[ -g_m(x_i) - \beta\, h(x_i; a) \big]^2$$

and then obtain the step size $\beta_m$ by line search, as in the generic case:

$$\beta_m = \arg\min_{\beta} \sum_{i=1}^{N} \psi\big(y_i, F_{m-1}(x_i) + \beta\, h(x_i; a_m)\big)$$
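
Putting the finite-data recipe together, here is a from-scratch sketch that uses regression trees as the base learner $h(x; a)$, which is the GBDT setting; the function names, the use of scikit-learn trees, and scipy's scalar minimizer for the line search are assumptions made for illustration:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, loss, loss_grad, n_rounds=100, max_depth=3):
    """Finite-data gradient boosting with regression trees as h(x; a):
       1. pseudo-residuals r_i = -g_m(x_i), the negative gradient at F_{m-1}(x_i)
       2. fit a tree h(x; a_m) to the r_i by least squares
       3. line search for the multiplier beta_m
       4. F_m = F_{m-1} + beta_m * h(x; a_m)"""
    F = np.zeros(len(y))                    # F_0: initial guess
    trees, betas = [], []
    for _ in range(n_rounds):
        r = -loss_grad(y, F)                                    # step 1
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, r)                                          # step 2
        h = tree.predict(X)
        beta = minimize_scalar(lambda b: loss(y, F + b * h)).x  # step 3
        F = F + beta * h                                        # step 4
        trees.append(tree)
        betas.append(beta)
    return trees, betas

# squared-error loss: the pseudo-residuals are just the ordinary residuals y - F
sq_loss = lambda y, F: 0.5 * np.mean((y - F) ** 2)
sq_grad = lambda y, F: F - y
```

With squared-error loss the pseudo-residuals reduce to the ordinary residuals $y_i - F_{m-1}(x_i)$, recovering the simple residual-fitting picture of boosting.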

conclusion

Gradient boosting is steepest descent in function space: each round fits a base learner $h(x; a_m)$ to the negative gradient of the loss at the current model, scales it by a line-search step, and adds it to the ensemble. GBDT is the special case in which the base learner is a regression tree.
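
In practice one would normally reach for a library implementation; a minimal usage example with scikit-learn's GradientBoostingRegressor (the dataset and hyperparameter values are arbitrary choices for illustration):

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbdt = GradientBoostingRegressor(
    n_estimators=300,    # M: number of boosting rounds
    learning_rate=0.05,  # shrinkage applied to each step
    max_depth=3,         # size of each regression tree h(x; a)
)
gbdt.fit(X_train, y_train)
print("test R^2:", gbdt.score(X_test, y_test))
```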
