
A Gradient Boosting Machine

GBDT = Gradient Boosting + Decision Tree
other names:

  • MART (Multiple Additive Regression Trees)
  • GBRT (Gradient Boosting Regression Trees)
  • TreeNet

Ensemble learning

why ensemble?

  • statistical: several different hypotheses may fit the training data equally well, and averaging them reduces the risk of committing to the wrong one
  • computational: individual learners can get stuck in different local optima, and combining their solutions can land closer to the true function
  • representational: a combination of base hypotheses can express functions that no single base learner can represent on its own

ensemble types

bagging

Bootstrap AGGregatING
trains weak learners in parallel, each on a bootstrap sample of the training data
reduces variance
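
A minimal sketch of the bagging idea, assuming scikit-learn regression trees as the weak learners (the function names and parameter values are illustrative, not from the original notes):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, n_learners=50, seed=0):
    """Train weak learners independently, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    learners = []
    for _ in range(n_learners):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap: sample rows with replacement
        tree = DecisionTreeRegressor(max_depth=3)    # a weak (shallow) learner
        tree.fit(X[idx], y[idx])
        learners.append(tree)
    return learners

def bagging_predict(learners, X):
    """Averaging many roughly independent predictions lowers the ensemble's variance."""
    return np.mean([t.predict(X) for t in learners], axis=0)
```

Because the learners never see each other, the training loop above parallelizes trivially.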

boosting

trains weak learners sequentially, each new learner correcting the errors of the current ensemble
reduces bias

  • AdaBoost (reweights the training examples; see the sketch after this list)
  • gradient boosting
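
A compact AdaBoost sketch for binary labels in {-1, +1}, to make the sequential idea concrete; the helper names and the stump depth are assumptions made for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """AdaBoost with decision stumps; y must be in {-1, +1}.
    Each round upweights the examples the current ensemble gets wrong,
    so later learners concentrate on the remaining errors."""
    w = np.full(len(y), 1.0 / len(y))            # example weights, kept normalized
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)    # weight of this weak learner
        w *= np.exp(-alpha * y * pred)           # mistakes get larger weights
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    return np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
```

Gradient boosting, derived next, generalizes this sequential scheme to arbitrary differentiable loss functions.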

Gradient boosting

optimization in parameter space

optimization goal:

$$P^* = \arg\min_{P} \Phi(P) = \arg\min_{P} E_{y,x}\,\psi\big(y, F(x; P)\big)$$

the solution for the parameters takes the form:

$$P^* = \sum_{m=0}^{M} p_m$$

where $p_0$ is an initial guess and $\{p_m\}_1^M$ are successive increments (“steps” or “boosts”)

steepest-descent

define the increments $\{p_m\}_1^M$ as follows:

  1. the current gradient $g_m$ is computed:
     $$g_m = \{g_{jm}\} = \left[ \frac{\partial \Phi(P)}{\partial P_j} \right]_{P = P_{m-1}}$$
     where $P_{m-1} = \sum_{i=0}^{m-1} p_i$
  2. the step size $\rho_m$ is found by line search:
     $$\rho_m = \arg\min_{\rho} \Phi(P_{m-1} - \rho\, g_m)$$
  3. the step is taken to be $p_m = -\rho_m g_m$ (a numerical sketch of these three steps follows the list)
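
A numerical sketch of the three steps on a toy objective; the linear model, the synthetic data, and the crude grid line search are all assumptions made for illustration:

```python
import numpy as np

def steepest_descent(phi, grad_phi, p0, n_steps=100):
    """Follows steps 1-3 above: gradient, line search, increment p_m = -rho_m * g_m."""
    P = np.asarray(p0, dtype=float)        # running sum P_{m-1} = p_0 + ... + p_{m-1}
    for _ in range(n_steps):
        g = grad_phi(P)                                   # step 1: current gradient
        rho = min(np.linspace(0.0, 1.0, 101),             # step 2: crude grid line search
                  key=lambda r: phi(P - r * g))
        P = P - rho * g                                   # step 3: take the step
    return P

# toy objective: empirical squared loss of a linear model F(x; P) = x . P
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
phi = lambda P: np.mean((y - X @ P) ** 2)
grad_phi = lambda P: -2.0 * X.T @ (y - X @ P) / len(y)

print(steepest_descent(phi, grad_phi, p0=np.zeros(3)))    # approaches [1.0, -2.0, 0.5]
```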

optimization in function space

we consider F(x) evaluated at each point x to be a “parameter” and seek to minimize

$$F^*(x) = \arg\min_{F(x)} E_{y,x}\,\psi\big(y, F(x)\big)$$

we take the solution to be

$$F^*(x) = \sum_{m=0}^{M} f_m(x)$$

where $f_0(x)$ is an initial guess and $\{f_m(x)\}_1^M$ are incremental functions

current gradient:

$$g_m(x) = E_y\!\left[ \left. \frac{\partial\, \psi\big(y, F(x)\big)}{\partial F(x)} \,\right|\, x \right]_{F(x) = F_{m-1}(x)}$$

and

$$F_{m-1}(x) = \sum_{i=0}^{m-1} f_i(x)$$

the multiplier $\rho_m$ is given by the line search

$$\rho_m = \arg\min_{\rho} E_{y,x}\,\psi\big(y, F_{m-1}(x) - \rho\, g_m(x)\big)$$

and the steepest-descent step is:

$$f_m(x) = -\rho_m\, g_m(x)$$
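
As a sanity check on the sign and meaning of $g_m(x)$, take the squared-error loss $\psi(y, F) = \tfrac{1}{2}(y - F)^2$ (a standard special case, not spelled out in the original). Then

$$g_m(x) = E_y\big[F(x) - y \,\big|\, x\big]_{F(x) = F_{m-1}(x)} = F_{m-1}(x) - E[y \mid x]$$

so $-g_m(x) = E[y \mid x] - F_{m-1}(x)$: each steepest-descent increment $f_m(x)$ moves $F_{m-1}(x)$ toward the conditional mean by a fraction $\rho_m$ of the remaining residual.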

finite data

the above approach breaks down when the joint distribution of $(y, x)$ is represented only by a finite data sample; we therefore restrict the increments to a parameterized class of base learners $h(x; a)$ and fit them to the data:

$$(\beta_m, a_m) = \arg\min_{\beta, a} \sum_{i=1}^{N} \psi\big(y_i, F_{m-1}(x_i) + \beta\, h(x_i; a)\big)$$

and then

$$F_m(x) = F_{m-1}(x) + \beta_m\, h(x; a_m)$$

when this is hard to solve directly, fit the base learner to the negative gradient (the pseudo-residuals) by least squares:

$$a_m = \arg\min_{a, \beta} \sum_{i=1}^{N} \big[ -g_m(x_i) - \beta\, h(x_i; a) \big]^2$$

and then obtain the step size $\beta_m$ by line search, as in the generic case:

$$\beta_m = \arg\min_{\beta} \sum_{i=1}^{N} \psi\big(y_i, F_{m-1}(x_i) + \beta\, h(x_i; a_m)\big)$$
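
Putting the finite-data recipe together, here is a from-scratch sketch that uses regression trees as the base learner $h(x; a)$, which is the GBDT setting; the function names, the use of scikit-learn trees, and scipy's scalar minimizer for the line search are assumptions made for illustration:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, loss, loss_grad, n_rounds=100, max_depth=3):
    """Finite-data gradient boosting with regression trees as h(x; a):
       1. pseudo-residuals r_i = -g_m(x_i), the negative gradient at F_{m-1}(x_i)
       2. fit a tree h(x; a_m) to the r_i by least squares
       3. line search for the multiplier beta_m
       4. F_m = F_{m-1} + beta_m * h(x; a_m)"""
    F = np.zeros(len(y))                    # F_0: initial guess
    trees, betas = [], []
    for _ in range(n_rounds):
        r = -loss_grad(y, F)                                    # step 1
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, r)                                          # step 2
        h = tree.predict(X)
        beta = minimize_scalar(lambda b: loss(y, F + b * h)).x  # step 3
        F = F + beta * h                                        # step 4
        trees.append(tree)
        betas.append(beta)
    return trees, betas

# squared-error loss: the pseudo-residuals are just the ordinary residuals y - F
sq_loss = lambda y, F: 0.5 * np.mean((y - F) ** 2)
sq_grad = lambda y, F: F - y
```

With squared-error loss the pseudo-residuals reduce to the ordinary residuals $y_i - F_{m-1}(x_i)$, recovering the simple residual-fitting picture of boosting.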

conclusion

Gradient boosting is steepest descent in function space: each round fits a base learner $h(x; a_m)$ to the negative gradient of the loss at the current model, scales it by a line-search step, and adds it to the ensemble. GBDT is the special case in which the base learner is a regression tree.
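
In practice one would normally reach for a library implementation; a minimal usage example with scikit-learn's GradientBoostingRegressor (the dataset and hyperparameter values are arbitrary choices for illustration):

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbdt = GradientBoostingRegressor(
    n_estimators=300,    # M: number of boosting rounds
    learning_rate=0.05,  # shrinkage applied to each step
    max_depth=3,         # size of each regression tree h(x; a)
)
gbdt.fit(X_train, y_train)
print("test R^2:", gbdt.score(X_test, y_test))
```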
