ABSTRACT
CTR estimation for rare events is essential but challenging, often incurring huge variance due to the sparsity of the data.
INTRODUCTION
A small number of popular items accounts for a large fraction of Internet traffic, while the overwhelming majority of page/ad pairs are extremely rare events.
BACKGROUND
For a given page/ad pair (p, a) with I impressions and C clicks, the standard CTR estimate C/I is inevitably unreliable when impressions are scarce.
Data Hierarchies
$\hat{r}_i = \frac{C_i + \alpha}{I_i + \alpha + \beta}$
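A minimal sketch (not code from the source) of what this formula does: with pseudo-counts $\alpha$ (clicks) and $\beta$ (non-clicks), sparse pairs are pulled toward the prior mean $\alpha/(\alpha+\beta)$ while well-observed pairs stay close to their raw CTR. The default values $\alpha=10, \beta=50$ are the ones that appear later in the experiment table.

```python
def smoothed_ctr(clicks, impressions, alpha=10.0, beta=50.0):
    """Beta-prior smoothing: (clicks + alpha) / (impressions + alpha + beta)."""
    return (clicks + alpha) / (impressions + alpha + beta)

# A pair with 1 click out of 20 impressions is shrunk toward the prior mean
# alpha / (alpha + beta) = 10 / 60, while a pair with 5,000 clicks out of
# 100,000 impressions is barely changed.
print(smoothed_ctr(1, 20))         # ~0.1375 instead of the raw 0.05
print(smoothed_ctr(5000, 100000))  # ~0.0501, close to the raw 0.05
```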
Data Continuity
$\hat{C}_j = C_j, \quad j = 1$
$\hat{C}_j = \gamma C_j + (1 - \gamma) \hat{C}_{j-1}, \quad j = 2, \dots, M$
where $\gamma$ is the smoothing factor, $0 < \gamma < 1$
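A minimal sketch of this recursion (an assumed helper, not code from the source; the value of $\gamma$ here is arbitrary), applied to a day-level click or impression series:

```python
def exponential_smoothing(series, gamma=0.3):
    """Smooth a time series: s_1 = x_1, s_j = gamma * x_j + (1 - gamma) * s_{j-1}."""
    smoothed = [float(series[0])]
    for x in series[1:]:
        smoothed.append(gamma * x + (1.0 - gamma) * smoothed[-1])
    return smoothed

daily_clicks = [3, 0, 7, 1, 0, 2]
print(exponential_smoothing(daily_clicks, gamma=0.3))
```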
On its own, this temporal smoothing is often less effective for rare events.
We can first apply exponential smoothing to the impressions and clicks, and then leverage the data hierarchy on top, combining the two smoothing methods.
SMOOTHING CTR ESTIMATION
Leveraging Data Hierarchies with Empirical Bayes Methods
Two assumptions:
- for each publisher/account pair, there is an underlying (unknown) probability distribution of CTR, and the CTR of every page/ad pair under that publisher/account pair can be regarded as a stochastic outcome drawn from this distribution
- for every page/ad pair, the observed impressions and clicks are generated from a distribution governed by the CTR of that pair
The maximum-likelihood estimates of $\alpha$ and $\beta$ can be computed by fixed-point iteration.
Stopping criterion:
- 1000 iterations
- changes in both $\alpha, \beta$ are smaller than 1E-10
smoothed CTR: $\hat{r}_i = \frac{C_i + \alpha}{I_i + \alpha + \beta}$
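A sketch of the estimation step, using the standard Beta-Binomial fixed-point update with digamma functions; the exact update in the source may differ, and the initial values of $\alpha, \beta$ below are assumptions. The stopping criterion follows the notes above (at most 1000 iterations, or changes in both parameters below 1E-10).

```python
import numpy as np
from scipy.special import psi  # digamma function

def fit_beta_prior(impressions, clicks, alpha=1.0, beta=1.0,
                   max_iter=1000, tol=1e-10):
    """Fixed-point iteration for the Beta prior over the page/ad pairs
    of one publisher/account pair (standard Beta-Binomial MLE update)."""
    I = np.asarray(impressions, dtype=float)
    C = np.asarray(clicks, dtype=float)
    for _ in range(max_iter):
        denom = np.sum(psi(I + alpha + beta) - psi(alpha + beta))
        new_alpha = alpha * np.sum(psi(C + alpha) - psi(alpha)) / denom
        new_beta = beta * np.sum(psi(I - C + beta) - psi(beta)) / denom
        converged = abs(new_alpha - alpha) < tol and abs(new_beta - beta) < tol
        alpha, beta = new_alpha, new_beta
        if converged:
            break
    return alpha, beta

# Example: fit the prior on a group of page/ad pairs, then smooth each pair.
impressions = [12000, 30000, 15000, 50000, 11000]
clicks = [30, 240, 45, 600, 20]
alpha, beta = fit_beta_prior(impressions, clicks)
smoothed = (np.array(clicks) + alpha) / (np.array(impressions) + alpha + beta)
```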
Utilizing Event History with Exponential Smoothing
Combining Data Hierarchies and Temporal Smoothing
The combination yields better performance than either individual model alone.
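A sketch of how the two methods could be chained, following the note in the Background section (exponential smoothing on impressions and clicks first, then the hierarchy-based Beta smoothing); $\alpha, \beta$ are assumed to come from the empirical Bayes fit above, and $\gamma$ is arbitrary here.

```python
import numpy as np

def exp_smooth_last(series, gamma=0.3):
    """Return the exponentially smoothed value after the last time step."""
    s = float(series[0])
    for x in series[1:]:
        s = gamma * x + (1.0 - gamma) * s
    return s

def combined_smoothed_ctr(daily_impressions, daily_clicks, alpha, beta, gamma=0.3):
    """daily_impressions / daily_clicks: one list of day-level counts per page/ad pair.
    Smooth the counts over time, then shrink toward the Beta prior of the group."""
    I_hat = np.array([exp_smooth_last(d, gamma) for d in daily_impressions])
    C_hat = np.array([exp_smooth_last(d, gamma) for d in daily_clicks])
    return (C_hat + alpha) / (I_hat + alpha + beta)
```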
EXPERIMENTAL RESULTS
Evaluation Metrics
$r$ and $\hat{r}$ are the true CTR and the estimated CTR respectively
MSE
KL-divergence
when $r$ or $\hat{r}$ takes an extreme value (0 or 1), $D_{KL}$ is not well defined; such pairs are ignored
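A sketch of the two metrics as described above, assuming the KL divergence is taken between Bernoulli distributions with parameters $r$ and $\hat{r}$ and then averaged over pairs, skipping the extreme values as noted:

```python
import numpy as np

def mse(r, r_hat):
    """Mean squared error between true and estimated CTRs."""
    r, r_hat = np.asarray(r, dtype=float), np.asarray(r_hat, dtype=float)
    return float(np.mean((r - r_hat) ** 2))

def avg_kl(r, r_hat):
    """Average Bernoulli KL divergence D_KL(r || r_hat), skipping pairs where
    r or r_hat equals 0 or 1 (where the divergence is not well defined)."""
    r, r_hat = np.asarray(r, dtype=float), np.asarray(r_hat, dtype=float)
    ok = (r > 0) & (r < 1) & (r_hat > 0) & (r_hat < 1)
    r, r_hat = r[ok], r_hat[ok]
    return float(np.mean(r * np.log(r / r_hat)
                         + (1 - r) * np.log((1 - r) / (1 - r_hat))))
```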
MSE and the average KL divergence in general agree
Experiments
- all the page/ad pairs we consider in this chapter have at least 10,000 impressions
- exclude the publisher/account pairs that do not have at least 10 page/ad pairs
Results on Smoothing with Data Hierarchies
The sparser the data, the more dramatic the effect of the smoothing.
Results on Temporal Exponential Smoothing
Autohome CTR framework
Use the posterior (smoothed CTR) features together with GBDT (a sketch follows the table below).
variant | test AUC |
---|---|
hierarchy smoothing | 0.664616 |
default values ($\alpha=10, \beta=50$) | 0.664404 |
no smoothing | 0.664340 |
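A hypothetical end-to-end sketch of the feature setup behind the table (placeholder data, labels, and feature names, not the Autohome pipeline; sklearn's GradientBoostingClassifier stands in for whatever GBDT implementation was used):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 20000

# Placeholder per-row statistics and one extra placeholder feature.
clicks = rng.poisson(2, size=n)
impressions = clicks + rng.poisson(50, size=n)
other_feature = rng.normal(size=n)

# Posterior (smoothed CTR) feature; alpha/beta are either the fixed defaults
# or the values from the empirical Bayes fit.
alpha, beta = 10.0, 50.0
posterior_ctr = (clicks + alpha) / (impressions + alpha + beta)

# Placeholder labels, weakly tied to the features so the AUC is non-trivial.
y = rng.binomial(1, np.clip(posterior_ctr + 0.02 * other_feature, 0.01, 0.5))

X = np.column_stack([posterior_ctr, other_feature])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = GradientBoostingClassifier().fit(X_tr, y_tr)
print("test AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```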
Results on Manually Constructed (Synthetic) Data
Compare three settings:
- no smoothing
- smoothing with fixed $\alpha, \beta$ values
- smoothing with estimated (calculated) $\alpha, \beta$ values
As we can see, with very small samples, fixed-value smoothing is better than no smoothing; but once there are hundreds of samples or more, fixed-value smoothing gives worse results.
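A rough simulation harness (an assumed setup, not the original manually constructed data) for this comparison: draw true CTRs from a Beta distribution, simulate clicks, and compare per-pair MSE at different sample sizes. The parameters below are assumptions; which estimator wins depends on the sample size and on how close the fixed prior is to the true CTR distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
true_alpha, true_beta = 4.0, 36.0     # assumed "true" prior (mean CTR = 10%)
fixed_alpha, fixed_beta = 10.0, 50.0  # the fixed defaults used above
n_pairs = 2000

for impressions in (10, 200, 2000):
    r = rng.beta(true_alpha, true_beta, size=n_pairs)  # true CTR per pair
    clicks = rng.binomial(impressions, r)

    raw = clicks / impressions                                                 # no smoothing
    fixed = (clicks + fixed_alpha) / (impressions + fixed_alpha + fixed_beta)  # fixed values

    # Rough stand-in for the empirical Bayes fit: moment-match a Beta prior
    # to the raw CTRs (the notes use the fixed-point iteration instead).
    m, v = raw.mean(), raw.var()
    k = max(m * (1 - m) / max(v, 1e-12) - 1, 1e-6)
    est_alpha, est_beta = m * k, (1 - m) * k
    est = (clicks + est_alpha) / (impressions + est_alpha + est_beta)

    print(impressions,
          "no smoothing:", round(float(np.mean((raw - r) ** 2)), 6),
          "fixed:", round(float(np.mean((fixed - r) ** 2)), 6),
          "estimated:", round(float(np.mean((est - r) ** 2)), 6))
```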