Fast R-CNN

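Summary

This paper improves on R-CNN and SPPnet by combining ideas from both
- substantially faster training and testing
- end-to-end, single-stage training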

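Analysis of the drawbacks of prior work

R-CNN
- multi-stage training
- training is slow and also takes up disk space
- inference is slow as well
SPPnet
- multi-stage training
- the fine-tuning stage cannot update the conv layers' parameters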

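Network architecture

Borrowing the idea from SPPnet, the convolutional computation runs only once over the whole image per training pass; the conv features of each proposal are then cropped out by mapping the proposal onto the feature map
These features then pass through the RoI pooling layer (whose window size and stride are computed dynamically from the input size) to produce a fixed-length feature vector
The separate SVM training stage is dropped; the network's own softmax is used directly for classification
The same feature vector also feeds the bounding-box regression branch, and the two branches are combined into a single loss function and trained jointly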

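Composition of training samples

The paper first analyzes why SPPnet cannot fine-tune the conv layers: when the proposals in a batch come from different images, the conv features have to be recomputed for each of them, which falls back into R-CNN's old pattern and takes too long
So during sampling, each batch is built as far as possible from proposals of the same image
Experiments show that drawing proposals from the same image causes no convergence problems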

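FC layer decomposition

Once the conv computation has been streamlined, the FC layers take up a disproportionate share of the time (38.7% of total compute time)
The computation is simplified with a matrix factorization: the cost uv is reduced to t(u + v), where $t \ll \min(u, v)$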

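Other considerations

Is fine-tuning the entire network useful? Not much, and it slows training considerably, so usually only the later layers are fine-tuned (exactly which layers depends on network depth)
Does multi-task training help? Combining the classification loss with the bounding-box regression loss works best, possibly because joint feedback from the classification and box-regression tasks helps the network learn better
Scale invariance: multi-scale or single-scale? Multi-scale works better, but only marginally. Large networks cannot (for now) use multi-scale because of GPU memory limits, yet a large network at a single scale beats small and medium networks at multiple scales, so the network architecture is what matters. Prefer the deepest network available with single-scale input
Is more training data needed? Yes, the more training data the better
Do SVMs outperform softmax? No, which feels like a bit of a reversal: R-CNN argued at length for a separate training stage for classification, yet here everything is merged and the result is better. Besides the paper's explanation that softmax training introduces competition between classes, my own view is that another reason is that the merged pipeline also folds in bounding-box regression, and the two tasks improve each other
Do more proposals improve results? No, experiments show that adding ever more proposals lowers mAP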

VOC test results

time: 1830ms
VOC 2007
- 07 data: 66.9%
- 07+12 data: 70%
VOC 2010
- 12 data: 66.1%
- 07++12 data: 68.8%
VOC 2012
- 12 data: 65.7%
- 07++12 data: 68.4%

Abstract

Fast R-CNN trains the very deep VGG16 network 9× faster than R-CNN

Introduction

two primary challenges:

numerous candidate object locations (often called “proposals”) must be processed
these candidates provide only rough localization that must be refined to achieve precise localization

We propose a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations.

R-CNN and SPPnet

R-CNN’s drawbacks:

training is a multi-stage pipeline
training is expensive in space and time
object detection is slow

R-CNN is slow because it performs a ConvNet forward pass for each object proposal

SPPnet computes the convolutional feature map only once per image

SPPnet’s drawbacks:

training is a multi-stage pipeline
fine-tuning algorithm cannot update the convolutional layers

contributions

higher detection quality
training is single-stage, using a multi-task loss
training can update all network layers
no disk storage is required for feature caching

Fast R-CNN Architecture and Training

the network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map
for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers:
- one that produces softmax probability estimates over K object classes plus a catch-all “background” class
- another that outputs four real-valued numbers for each of the K object classes

the RoI pooling layer

each RoI is defined by a four-tuple (r, c, h, w) that specifies its top-left corner (r, c) and its height and width (h, w)

the RoI layer is simply the special case of the spatial pyramid pooling layer used in SPPnets ¹¹ in which there is only one pyramid level.
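To make the pooling step concrete, here is a minimal pure-Python sketch of RoI max pooling on a single-channel feature map. The function name `roi_max_pool`, the floor/ceil bin boundaries, and the assumption that the RoI is already given in feature-map coordinates are choices made for this illustration, not details taken from the paper.

```python
import math

def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one RoI of a 2-D feature map into an out_h x out_w grid.

    roi is (r, c, h, w): top-left corner (r, c) plus height and width,
    assumed here to be in feature-map coordinates with h, w >= 1.
    """
    r, c, h, w = roi
    pooled = []
    for i in range(out_h):
        # Bin i covers rows [r0, r1); floor/ceil keeps every bin non-empty.
        r0 = r + math.floor(i * h / out_h)
        r1 = r + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            c0 = c + math.floor(j * w / out_w)
            c1 = c + math.ceil((j + 1) * w / out_w)
            row.append(max(feature_map[y][x]
                           for y in range(r0, r1)
                           for x in range(c0, c1)))
        pooled.append(row)
    return pooled

# Toy 6x6 feature map whose value at (row, col) is 6 * row + col.
fmap = [[row * 6 + col for col in range(6)] for row in range(6)]
print(roi_max_pool(fmap, (1, 1, 4, 4), 2, 2))  # [[14, 16], [26, 28]]
```

Whatever the RoI size, the output grid is always out_h x out_w, which is what lets the following fc layers take a fixed-length input.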

initializing from pre-trained networks

three transformations:

the last max pooling layer is replaced by a RoI pooling layer
the network’s last fully connected layer and softmax are replaced with the two sibling layers
the network is modified to take two data inputs: a list of images and a list of RoIs in those images

fine-tuning for detection

Question:

why is SPPnet unable to update weights below the spatial pyramid pooling layer?

Answer:

back-propagation through the SPP layer is highly inefficient when each training sample (i.e. RoI) comes from a different image
the inefficiency stems from the fact that each RoI may have a very large receptive field, often spanning the entire input image
because the forward pass must process the entire receptive field, the training inputs are large

Solution:

sampled hierarchically
first by sampling N images and then by sampling R/N RoIs from each image
RoIs from the same image share computation and memory in the forward and backward passes
this strategy does not slow training convergence in practice
N = 2 and R = 128
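The hierarchical sampling above can be sketched in a few lines of Python. The helper `sample_minibatch`, the dictionary layout of the proposals, and the fallback to sampling with replacement are assumptions made for this illustration only.

```python
import random

def sample_minibatch(rois_by_image, num_images=2, total_rois=128, rng=random):
    """Pick N images first, then R/N RoIs from each of them."""
    per_image = total_rois // num_images
    image_ids = rng.sample(sorted(rois_by_image), num_images)
    batch = []
    for image_id in image_ids:
        candidates = rois_by_image[image_id]
        # Sample without replacement when possible; fall back to sampling
        # with replacement if an image has too few proposals (assumption).
        if len(candidates) >= per_image:
            chosen = rng.sample(candidates, per_image)
        else:
            chosen = [rng.choice(candidates) for _ in range(per_image)]
        batch.extend((image_id, roi) for roi in chosen)
    return batch

# Hypothetical data: 5 images with 200 dummy (r, c, h, w) proposals each.
proposals = {f"img_{k}": [(0, 0, 10 + n, 10 + n) for n in range(200)]
             for k in range(5)}
batch = sample_minibatch(proposals)
print(len(batch), len({image_id for image_id, _ in batch}))
```

With N = 2 and R = 128 the batch holds 128 RoIs but only two images need a conv forward pass, which is where the shared computation comes from.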

multi-task loss

We use a multi-task loss L on each labeled RoI to jointly train for classification and bounding-box regression

$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda [u \geq 1] L_{loc}(t^u, v)$

in which $L_{cls}(p, u) = -\log p_u$
For background RoIs there is no notion of a ground-truth bounding box and hence $L_{loc}$ is ignored
$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}}smooth_{L_1}(t^u_i - v_i)$
All experiments use $\lambda = 1$

$smooth_{L_1}(x) = \begin{cases} 0.5x^2, & \text{if } |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$
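The loss for a single RoI can be written out directly from the formulas above. This is a minimal sketch; the function names and the convention that index 0 of `p` is the background class are assumptions of the example.

```python
import math

def smooth_l1(x):
    """Quadratic near zero, linear elsewhere."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def multi_task_loss(p, u, t_u, v, lam=1.0):
    """Loss for one RoI.

    p   : class probabilities, index 0 assumed to be background
    u   : ground-truth class index
    t_u : predicted box offsets (x, y, w, h) for class u
    v   : regression targets (x, y, w, h)
    """
    l_cls = -math.log(p[u])
    # The indicator [u >= 1] switches the box loss off for background RoIs.
    if u >= 1:
        l_loc = sum(smooth_l1(ti - vi) for ti, vi in zip(t_u, v))
    else:
        l_loc = 0.0
    return l_cls + lam * l_loc

# Foreground RoI: both terms contribute.
print(multi_task_loss([0.5, 0.5], 1, (0.5, 2.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)))
# Background RoI: only the classification term remains.
print(multi_task_loss([0.5, 0.5], 0, (9.0, 9.0, 9.0, 9.0), (0.0, 0.0, 0.0, 0.0)))
```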

mini-batch sampling

batch-size = 128
sampling 64 RoIs from each image
images are horizontally flipped with probability 0.5
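When an image is flipped, its RoIs have to be mirrored as well. A small sketch of that bookkeeping, assuming the (r, c, h, w) pixel convention used for RoIs in these notes; the helper name `maybe_flip_rois` is hypothetical.

```python
import random

def maybe_flip_rois(rois, image_width, flip_prob=0.5, rng=random):
    """Decide once per image whether to flip, and mirror its RoIs to match.

    Each RoI is (r, c, h, w) in pixel coordinates.
    Returns (flipped, rois); the caller flips the image pixels if flipped.
    """
    flipped = rng.random() < flip_prob
    if not flipped:
        return False, list(rois)
    # A box covering columns c .. c+w-1 lands on columns
    # image_width-c-w .. image_width-c-1 after mirroring.
    return True, [(r, image_width - c - w, h, w) for (r, c, h, w) in rois]

print(maybe_flip_rois([(0, 2, 4, 3)], image_width=10, flip_prob=1.0))
# (True, [(0, 5, 4, 3)])
```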

back-propagation through RoI pooling layers

backwards function:

$\frac {\partial L} {\partial x_i} = \sum_r \sum_j[i = i^*(r,j)]\frac {\partial L} {\partial y_{rj}}$
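Read as code, the formula says: every pooled output y_rj sends its gradient back to the one input x_i that won the max, and gradients add up when RoIs overlap. A minimal sketch, assuming the forward pass stored the winning flat index i*(r, j) for every output; the names are illustrative.

```python
def roi_pool_backward(num_inputs, argmax, grad_out):
    """Accumulate dL/dx from dL/dy for max-pooled RoIs.

    argmax[r][j]   : i*(r, j), flat index of the input that was the max
                     for output j of RoI r (recorded in the forward pass)
    grad_out[r][j] : dL/dy_rj
    """
    grad_in = [0.0] * num_inputs
    for r, winners in enumerate(argmax):
        for j, i_star in enumerate(winners):
            # An input selected by several outputs receives the sum.
            grad_in[i_star] += grad_out[r][j]
    return grad_in

# Two overlapping RoIs that both selected input 1 for their first output.
argmax = [[1, 3], [1, 2]]
grad_out = [[0.5, 1.0], [2.0, 0.25]]
print(roi_pool_backward(4, argmax, grad_out))  # [0.0, 2.5, 0.25, 1.0]
```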

SGD hyper-parameters

the fully connected layers used for softmax classification and bounding-box regression are initialized from zero-mean Gaussian distributions with standard deviations 0.01 and 0.001, respectively

biases are initialized to 0
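A rough sketch of this initialization for the two sibling layers. The helper `init_fc`, the choice of K = 20 classes, and the 4096-dimensional input are assumptions for illustration (they match PASCAL VOC and VGG16's last fc layer, but are not stated in these notes).

```python
import random

def init_fc(out_dim, in_dim, std, rng=random):
    """Zero-mean Gaussian weights with the given std, biases set to 0."""
    weights = [[rng.gauss(0.0, std) for _ in range(in_dim)]
               for _ in range(out_dim)]
    biases = [0.0] * out_dim
    return weights, biases

K = 20           # number of object classes (assumed)
feat_dim = 4096  # length of the RoI feature vector (assumed)

# Classification head: K classes plus background, std 0.01.
cls_w, cls_b = init_fc(K + 1, feat_dim, std=0.01)
# Box-regression head: four numbers per class, std 0.001.
bbox_w, bbox_b = init_fc(4 * K, feat_dim, std=0.001)
print(len(cls_w), len(bbox_w))
```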

scale invariance

two ways:

“brute force” learning
image pyramids

Fast R-CNN Detection

truncated SVD for faster detection

the FC layers take nearly half of the forward-pass time

$W \approx U\Sigma_tV^T$

W is u × v
U is u × t
$\Sigma_t$ is t × t
V is v × t
truncated SVD reduces the parameter count from uv to t(u + v), where $t \ll \min(u, v)$
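One way to apply the factorization is to replace the single fc layer by two smaller ones with no non-linearity in between: the first multiplies by $\Sigma_tV^T$ without a bias, the second multiplies by U and keeps the original bias. The NumPy sketch below checks this on a toy matrix; the sizes, the helper name `compress_fc`, and the exactly rank-t test matrix are assumptions chosen so that the approximation is exact up to rounding.

```python
import numpy as np

def compress_fc(W, t):
    """Split y = W x + b into two layers using the top-t singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    first = np.diag(s[:t]) @ Vt[:t, :]  # t x v, applied first, no bias
    second = U[:, :t]                   # u x t, applied second, keeps b
    return first, second

rng = np.random.default_rng(0)
u, v, t = 64, 256, 8
# Toy weight matrix of rank exactly t, so truncation loses nothing here.
W = rng.standard_normal((u, t)) @ rng.standard_normal((t, v))
b = rng.standard_normal(u)
x = rng.standard_normal(v)

first, second = compress_fc(W, t)
y_full = W @ x + b
y_fast = second @ (first @ x) + b

params_before = W.size                   # u * v
params_after = first.size + second.size  # t * (u + v)
print(params_before, params_after, np.allclose(y_full, y_fast))
```

On a real, full-rank weight matrix the two outputs would differ slightly, and t trades accuracy against speed.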

Main Results

three main results:

state-of-the-art mAP on VOC 2007, 2010, and 2012
fast training and testing compared to R-CNN, SPPnet
fine-tuning conv layers in VGG16 improves mAP

VOC 2010 and 2012 results

VOC 2007 results

training and testing time

truncated SVD

truncated SVD can reduce detection time by more than 30% with only a small (0.3 percentage point) drop in mAP

which layers to fine-tune

validate that fine-tuning the conv layers is important for VGG16

training through the RoI pooling layer is important for very deep nets

Question:

Does this mean that all conv layers should be fine-tuned?

Answer:

no
the lower layers are generic and task-independent, so fine-tuning them has no meaningful effect on mAP

Design Evaluation

does multi-task training help?

scale invariance: to brute force or finesse?

deep ConvNets are adept at directly learning scale invariance

do we need more training data?

A good object detector should improve when supplied with more training data

roughly tripling the number of images to 16.5k
improves mAP on VOC07 test from 66.9% to 70.0%

do SVMs outperform softmax?

Reason:

softmax, unlike one-vs-rest SVMs, introduces competition between classes when scoring a RoI

are more proposals always better?

sparse set of object proposals
dense set

swamping the deep classifier with more proposals does not help, and even slightly hurts, accuracy

preliminary MS COCO results

skip

Conclusion

skip

Reference

¹¹. K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.