Generative Adversarial Nets

总结

生成对抗网络由生成器G和分类器D组成，生成器G接受噪音生成样本，D负责区分这些样本是否是由D生成出来的
D和G共同优化目标函数$\underset {G} {min} \underset {D} {max} V(D,G) = E_{x \sim p_{data}(x)}[log(D(x))] + E_{x \sim P_{g_z(z)}}[log(1 - D(g(z)))]$ 一个是让值尽量大，一个是让值尽量小，即所谓“对抗”
该目标函数会在$p_g = p_{data}$处收敛到最优，即生成分布与训练数据分布完全一致函数值为$-log4$
关于收敛性的证明：一个凸函数集合的上界的次梯度（可导的时候等于梯度）完全包含最后函数的梯度，所以在求导做梯度下降的时候整体仍然按照凸函数的性质收敛
使用Gaussian Parzen window(youtube视频)方法对模型进行评估, Parzen Window 其实是用已有数据到待测数据点的“距离”平均值来表示待测点出现在已有数据表示的分布上的概率（或者说在已有数据拟合出的概率密度函数上的值）
- 通用概率密度函数$P(x) = \frac {1} {n}\sum_{i = 1}^{n}\frac {1} {h^d}K(\frac {x - x_i} {h})$
- Parzen window中的核函数可以有多种，使用Gaussian核的就叫做Gaussian Parzen window$P(x) = \frac {1} {n} \sum^n_{i = 1}\frac {1} {h\sqrt {2\pi}^d}exp(-\frac 1 2(\frac {x - x_i} {h})^2 )$
- 算出概率之后再过一遍$score = -log(1 - P(x))$(猜测)
优点
- 不使用马尔科夫链
- 训练模型的时候不需要inference
- 可拟合sharp的分布
- 训练样本并不直接作用于G的训练，保证参数独立
- 可使用多种模型拟合G和D
缺点
- G和D需要同步训练（两次D一次G）
- 没有$p_g(z)$分布的显示表示

Abstract

propose a new framework for estimating generative models via adversarial process

simultaneously train two models:

a generative model G
a discriminative model D

a unique solution exists, with G recovering the train data distribution and D equal to $\frac 1 2$ everywhere

Introduction

a discriminative model that learns to determine whether a sample is from the model distribution or the data distribution. The generative model can be thought of as analogous to a team of counterfeiters

Relate Work

restricted Boltzmann machines(RBM)
deep Boltzmann machines(DBMs)
deep belief networks(DBNs)
score matching
noise-contrastive estimation
generative stochastic network

Adversarial nets

G is a differentiable function to output a generate data by $G(z;\theta_g)$

D is a multilayer perce ptron $D(x;\theta_d)$ to output a single scalar that represent whether a data is fabricated by G

D and G play the following two-player minimax game with value function $V(G, D)$

$\underset {G} {min} \underset {D} {max}V(D,G) = E_{x\sim p_{data}(x)}[logD(x)] + E_{x \sim p_z(z)}[log(1 - D(G(z)))]$

we alternate between k steps of optimizing D and one step of optimizing G

Early in learning, when G is poor $log(1 - D(G(z)))$ saturates. we use $\underset {G} {max} logD(G(z))$ instead. It’s provides much stronger gradients early in learning

Theoretical Results

Global Optimality of $p_g = p_{data}$

the global minimum of the virtual training criterion $C(G)$ is achieved if and only if $p_g = p_{data}$. At that point, $C(G)$ achieves the value - $log4$

$C(G) = -log(4) + 2 * JSD(p_{data}||p_g)$

Since the Jensen–Shannon divergence between two distributions is always non-negative and zero only when they are equal

Convergence of Algorithm 1

If G and D have enough capacity, and at each step of Algorithm 1, the discriminator
is allowed to reach its optimum given G, and $p_g$ is updated so as to improve the criterion

proof:
The subderivatives of a supremum of convex functions include the derivative of the function at the point where the maximum is attained

一个凸函数集合的上界的次导数包含达到最大值的函数的导数

看不懂……

Experiments

database:

MNIST
TFD
CIFAR-10

activation:
generator use mixture of rectifier linear and sidmod as activations; discriminator use maxout activation

We estimate probability of the test set data under $p_g$ by fitting a Gaussian Parzen window to the samples generated with G and reporting the log-likelihood under this distribution

Advantages and disadvantages

disadvantages:

there is no explicit representation of $p_g(x)$
and that D must be synchronized well with G during training (in particular, G must not be trained too much without updating D, in order to avoid “the Helvetica scenario” in which G collapses too many values of z to the same value of x to have enough diversity to model $p_{data}$)

advantages:

Markov chains are never needed
no inference is needed during learning
a wide variety of functions can be incorporated into the model
generator network not being updated directly with data examples, but only with gradients flowing through the discriminator
they can represent very sharp, even degenerate distributions

Conclusions and future work

extensions in future:

A conditional generative model $p(x | c)$ can be obtained by adding c as input to both G and D
Learned approximate inference can be performed by training an auxiliary network to predict z given x
Semi-supervised learning