## 总结

- A GAN consists of a generator G and a discriminator D: G takes noise as input and generates samples, while D tries to tell whether a sample came from the training data or was generated by G
- D and G jointly optimize the objective $\underset {G} {min} \underset {D} {max} V(D,G) = E_{x \sim p_{data}(x)}[log(D(x))] + E_{z \sim p_z(z)}[log(1 - D(G(z)))]$: one player tries to make the value as large as possible, the other as small as possible, hence "adversarial"
- The objective converges to its global optimum at $p_g = p_{data}$, i.e. when the generated distribution exactly matches the training data distribution; the optimal value is $-log4$
- On the convergence proof: the subderivatives of the supremum of a set of convex functions (equal to the gradient wherever differentiable) contain the derivative of the function attaining the supremum, so gradient descent on the overall criterion still converges by the usual convexity argument
- The model is evaluated with the Gaussian Parzen window method (see the YouTube video). A Parzen window estimates the probability of a query point under the distribution represented by existing data, roughly by averaging kernel "distances" from the data points to the query point (i.e. the value of the density function fitted to the data)
- General kernel density estimate: $P(x) = \frac {1} {n}\sum_{i = 1}^{n}\frac {1} {h^d}K(\frac {x - x_i} {h})$
- The kernel in a Parzen window can be of many kinds; with a Gaussian kernel it is called a Gaussian Parzen window: $P(x) = \frac {1} {n} \sum^n_{i = 1}\frac {1} {(h\sqrt {2\pi})^d}exp(-\frac 1 2 \|\frac {x - x_i} {h}\|^2 )$
- After computing the probability it is apparently passed through $score = -log(1 - P(x))$ (my guess)
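As a sketch of the Gaussian Parzen window formula above (function and variable names are mine, not from the paper), the density is best evaluated in log space with the log-sum-exp trick for numerical stability:

```python
import numpy as np

def gaussian_parzen_log_prob(x, samples, h):
    """Log-density of query point x under a Gaussian Parzen window
    fitted to `samples` (one isotropic Gaussian of bandwidth h per sample).

    x: shape (d,); samples: shape (n, d); returns log P(x) as a float.
    """
    n, d = samples.shape
    sq_dists = np.sum((samples - x) ** 2, axis=1)   # ||x - x_i||^2 per sample
    # log of each kernel term: -||x - x_i||^2 / (2 h^2) - d * log(h * sqrt(2 pi))
    log_kernels = -sq_dists / (2.0 * h ** 2) - d * np.log(h * np.sqrt(2.0 * np.pi))
    m = log_kernels.max()                           # log-sum-exp trick
    return float(m + np.log(np.mean(np.exp(log_kernels - m))))
```

For a single 1-D sample at 0 with h = 1, this returns $-\frac 1 2 log(2\pi) \approx -0.919$, the log-density of a standard Gaussian at its mean.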

- Advantages
  - No Markov chains are needed
  - No inference is needed while training the model
  - Can represent very sharp, even degenerate distributions
  - Training samples never act on G directly; G is updated only through gradients flowing through D, so its parameters are not tied directly to the data
  - A wide variety of models can be used for G and D

- Disadvantages
  - D and G must be trained in sync (k steps of D for each step of G)
  - There is no explicit representation of $p_g(x)$

## Abstract

proposes a new framework for estimating generative models via an adversarial process

simultaneously train two models:

- a generative model G
- a discriminative model D

a unique solution exists, with G recovering the training data distribution and D equal to $\frac 1 2$ everywhere

## Introduction

a discriminative model learns to determine whether a sample is from the model distribution or the data distribution. The generative model can be thought of as analogous to a team of counterfeiters producing fake currency, while the discriminative model is analogous to the police trying to detect it

## Related Work

- restricted Boltzmann machines (RBMs)
- deep Boltzmann machines (DBMs)
- deep belief networks (DBNs)
- score matching
- noise-contrastive estimation (NCE)
- generative stochastic networks (GSNs)

## Adversarial nets

G is a differentiable function that maps a noise sample z to generated data, written $G(z;\theta_g)$

D is a multilayer perceptron $D(x;\theta_d)$ that outputs a single scalar representing the probability that x came from the data rather than from G

D and G play the following two-player minimax game with value function $V(G, D)$:

$\underset {G} {min} \underset {D} {max} V(D, G) = E_{x \sim p_{data}(x)}[log(D(x))] + E_{z \sim p_z(z)}[log(1 - D(G(z)))]$

we alternate between k steps of optimizing D and one step of optimizing G
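The alternating scheme can be sketched as a training-loop skeleton (the `d_step`, `g_step`, `sample_noise`, and `sample_data` callables are hypothetical stand-ins for the real SGD updates and samplers, not code from the paper):

```python
def train_gan(d_step, g_step, sample_noise, sample_data,
              iters=100, k=1, batch=64):
    """Skeleton of Algorithm 1: k discriminator updates, then one
    generator update, per outer iteration."""
    for _ in range(iters):
        for _ in range(k):
            z = [sample_noise() for _ in range(batch)]  # noise minibatch
            x = [sample_data() for _ in range(batch)]   # data minibatch
            d_step(z, x)   # ascend V(D, G) in theta_d
        z = [sample_noise() for _ in range(batch)]
        g_step(z)          # descend log(1 - D(G(z))) in theta_g
```

With `iters=10, k=3` this performs 30 discriminator updates and 10 generator updates, matching the k-to-1 alternation described above.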

Early in learning, when G is poor, $log(1 - D(G(z)))$ saturates, so instead we train G to maximize $log(D(G(z)))$. This provides much stronger gradients early in learning
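A quick numerical illustration of why the alternative objective helps: differentiating both objectives with respect to $s = D(G(z))$ shows that when D confidently rejects fakes ($s$ near 0), the original loss has a bounded gradient while the alternative one grows like $1/s$ (the specific values of `s` below are illustrative):

```python
import numpy as np

# s = D(G(z)): early in training the discriminator easily rejects fakes, so s ~ 0
s = np.array([0.001, 0.01, 0.1])

grad_saturating = -1.0 / (1.0 - s)   # d/ds of log(1 - s): stays near -1 (weak signal)
grad_alternative = 1.0 / s           # d/ds of log(s): grows like 1/s (strong signal)
```

At s = 0.001 the saturating loss gradient has magnitude about 1.001, while the alternative objective's is 1000, which is why maximizing $log(D(G(z)))$ trains much faster early on.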

## Theoretical Results

### Global Optimality of $p_g = p_{data}$

the global minimum of the virtual training criterion $C(G)$ is achieved if and only if $p_g = p_{data}$. At that point, $C(G)$ attains the value $-log4$
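The missing middle step (from Section 4.1 of the paper): for a fixed G the optimal discriminator has a closed form, and substituting it into V expresses $C(G)$ via the Jensen–Shannon divergence:

```latex
D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}, \qquad
C(G) = \max_D V(G, D) = -\log 4 + 2 \cdot JSD(p_{data} \,\|\, p_g)
```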

Since the Jensen–Shannon divergence between two distributions is always non-negative, and zero only when they are equal, $C(G) = -log4 + 2 \cdot JSD(p_{data} \| p_g)$ is minimized exactly when $p_g = p_{data}$
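This can be checked numerically on discrete distributions (a toy sketch; `criterion` computes the value $-log4 + 2 \cdot JSD$ of the game under an optimal discriminator, and the helper names are mine):

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions with matching support."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def criterion(p_data, p_g):
    """C(G) = -log 4 + 2 * JSD(p_data || p_g): the minimax value
    when D is optimal for the current G."""
    m = 0.5 * (p_data + p_g)
    jsd = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)
    return -np.log(4.0) + 2.0 * jsd
```

For any distribution p, `criterion(p, p)` returns exactly $-log4 \approx -1.386$, and any mismatched pair gives a strictly larger value.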

### Convergence of Algorithm 1

If G and D have enough capacity, and at each step of Algorithm 1 the discriminator is allowed to reach its optimum given G, and $p_g$ is updated so as to improve the criterion, then $p_g$ converges to $p_{data}$

proof:

The subderivatives of a supremum of convex functions include the derivative of the function at the point where the maximum is attained

In other words: for a fixed D the criterion is convex in $p_g$, so $\underset {D} {sup}\, V(G, D)$ is a supremum of convex functions of $p_g$ and therefore itself convex; at the D attaining the supremum, the gradient of V with respect to $p_g$ is a valid subgradient of that supremum. Gradient steps on $p_g$ computed with an optimal D are therefore subgradient steps on a convex criterion and converge

## Experiments

datasets:

- MNIST
- TFD
- CIFAR-10

activations:

the generator uses a mixture of rectified linear (ReLU) and sigmoid activations; the discriminator uses maxout activations

We estimate the probability of the test set data under $p_g$ by fitting a Gaussian Parzen window to samples generated by G and reporting the log-likelihood under this distribution (the Gaussian bandwidth $\sigma$ is chosen by cross-validation on the validation set)

## Advantages and disadvantages

disadvantages:

- there is no explicit representation of $p_g(x)$
- D must be well synchronized with G during training (in particular, G must not be trained too much without updating D, in order to avoid "the Helvetica scenario" in which G collapses too many values of z to the same value of x and loses the diversity needed to model $p_{data}$)

advantages:

- Markov chains are never needed
- no inference is needed during learning
- a wide variety of functions can be incorporated into the model
- the generator network is never updated directly with data examples, only with gradients flowing through the discriminator
- they can represent very sharp, even degenerate distributions

## Conclusions and future work

extensions in future:

- A conditional generative model $p(x | c)$ can be obtained by adding c as input to both G and D
- Learned approximate inference can be performed by training an auxiliary network to predict z given x
- Semi-supervised learning