0%

## 总结

• 新的detection方法的特点
• 利用了高拟合能力的卷积神经网络
• 缓解了训练detection任务数据量少的问题

### 物体定位问题

• 当成回归问题做——效果不好
• 用滑动窗口做——新的数据集图片比较大，不适合

• (用selective search)生成proposals
• 通过CNN对每个proposals生成固定长度的特征向量
• 用多个SVM判定向量是否属于特定类别
• 非极大值抑制筛选最终结果

### 数据量少的问题

• 先使用无监督训练，然后有监督fine-tuning

• 先用大量的无关classification数据训练网络初始参数
• 然后用特定类别有监督fine-tuning
• 迁移学习思想

### 可拓展性

• 生成proposal的时候是类别无关的
• CNN抽取特征的时候是类别无关的
• 最后的SVM判别是和类别数量相关，但只是线性相关
• 非极大值抑制过程也是线性相关
• 所以在目标类别很多的时候依旧能保证高效率，而不用去做近似计算

### 训练细节

#### CNN fine-tuning

• 把预训练的最后一层替换为目标类别数+1（背景算作一个类别）的分类层
• 所有和标记框IoU >= 0.5的proposal当做标记框所在类别的正样本否则当做负样本
• 学习率是预训练阶段的十分之一（0.001）
• 用32个正样本和96个负样本组成一个128的batch（这个比例主要是因为正样本少）

#### 特征分类

• 比fine-tuning的样本筛选更加严格
• 只有标记框抽取出来的特征当做正样本
• 与标记框IoU低于0.3(grid search出来的)的当做负样本
• IoU大于0.3但是小于1的扔掉

fine-tuning的样本选取方式不适合用于精确定位位置，所以单独训练模型去干这个事

### 特征可视化

• 选定一个特征（本文选取的是$pool_5$里的特征）
• 用大量的proposal过神经网络，得到这个特征的值
• 按特征值从大到小排列
• 非极大值抑制筛选proposal
• 把较大的proposal输出即为此特征的抽取信息的直观表现

### 分类特征值选择

• CNN最后的抽象特征有多层，本文对用不同的特征作为分类特征做了实验
• 如果没有经过fine-tuning那么用靠前网络层的更好($fc_6 > fc_7$)
• 经过fine-tuning的就用后面层作为分类特征值

### 网络结构选择

CNN作为特征抽取器，其结构对于最后的效果有非常大的影响

### bounding-box回归

• 用中心点x, y坐标，方框长和宽四维数据去定义边框
• 训练四个映射，把四维数据映射为真实值
• 用CNN $pool_5$的值代表proposal信息
• 采用LR模型
• 只把proposal和真实边框IoU大于等于0.6的当做训练数据
• 正则化参数很重要

### 与OverFeat对比

• OverFeat用的是multi-scale pyramid而不是selective search
• OverFeat是为每个类别分别训练一个bounding-box regression模型

### VOC测试结果

• time: 13000ms
• VOC 2010: 53.7%
• VOC 2011&2012: 53.3%
• VOC 2013: 31.4%

## Abstract

propose a detection algorithm:

• one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects
• when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost

RCNN: Regions with CNN features

## Introduction

we focused on two problems:

• localizing objects with a deep network
• training a high-capacity model with only a small quantity of annotated detection data

### localizing objects with a deep network

old way:

• approach frames localization as a regression problem may not fare well in practice
• our network is too deep to use a sliding-window

new method:

• recognition using regions
• generate proposal
• extracts a fixed-length feature vector from each proposal using a CNN
• classifies each region with category-specific linear SVMs

### scarce labeled data

old way:

• use unsupervised pre-training, followed by supervised fine-tuning

new method:

• supervised pre-training on a large auxiliary dataset
• followed by domain-specific fine-tuning on a small dataset

## Object Detection with R-CNN

three modules:

• generates category-independent region proposals
• a large convolutional neural network that extracts a fixed-length feature vector from each region
• a set of class-specific linear SVMs

### module design

#### region proposals

use selective search to enable a controlled comparison with prior detection work

use fast mode of selective search

#### feature extraction

use AlexNet

##### transformations
• isotropically scales
• with context(B)
• without context(C)
• anisotropically scales(D)

The amount of context padding (p) is defined as a border size around the original object proposal in the transformed input coordinate frame

warping with context padding (p = 16 pixels) outperformed

### test-time detection

given all scored regions in an image, we apply a greedy non-maximum suppression

#### run-time analysis

two properties make detection efficient:

• all CNN parameters are shared across all categories
• the feature vectors computed by the CNN are low-dimensional

13s/image on a GPU or 53s/image on a CPU

scalable:

• CNN feature are class-independent
• The only class-specific computations are dot products

this analysis shows that R-CNN can scale to thousands of object classes without resorting to approximate techniques

### training

#### supervised pre-training

using image-level annotations only (bounding- box labels are not available for this data)

#### domain-specific fine-tuning

continue CNN training using only warped region proposals

the last layer is randomly initialized (N + 1)-way classification —— where N is the number of object classes, plus 1 for background

We treat all region proposals with ≥ 0.5 IoU overlap with a ground-truth box as positives for that box’s class and the rest as negatives

learning rate 0.001

In each SGD iteration, we uniformly sample 32 positive windows (over all classes) and 96 background windows to construct a mini-batch of size 128

#### object category classifier

Question:

• how to label a region that partially overlaps ground-truth?

• We resolve this issue with an IoU overlap threshold
• negatives: proposal regions which IoU below threshold are defined as negatives
• positives: defined simply to be the ground-truth bounding boxes for each class
• threshold is 0.3 selected by a grid search
• proposals that fall into the grey zone (more than 0.3 IoU overlap, but are not ground truth) are ignored
##### Positive vs. negative examples and softmax

Question:

• Why are positive and negative examples defined differently for fine-tuning the CNN versus training the object detection SVMs?

• fine-tuning data is limited
• proposals with overlap between 0.5 and 1, expands the number of positive examples by approximately 30x

Question:

• Why, after fine-tuning, train SVMs at all?

• definition of positive examples used in fine-tuning does not emphasize precise localization

## Visualization, Ablation and Modes of Error

### visualizing learned features

We propose a simple (and complementary) non-parametric method that directly shows what the network learned

• compute the unit’s activations on a large set of held-out region proposals
• sort the proposals from highest to lowest activation
• perform non- maximum suppression

### ablation studies

Performance layer-by-layer

#### without fine-tuning

features from fc7 generalize worse than features from fc6

#### with fine-tuning

The boost from fine-tuning is much larger for fc6 and fc7 than for pool5

#### comparison to recent feature learning method

All R-CNN variants strongly outperform the three DPM baselines

### network architectures

the choice of architecture has a large effect on R-CNN detection performance

### detection error analysis

we applied the excellent detection analysis tool from Hoiem et al.23

### bounding-box regression

we train a linear regression model to predict a new detection window given the pool5 features for a selective search region proposal

The input to our training algorithm is a set of N training pairs ${(P^i, G^i)}_{i = 1,\dots ,N}$ where $P^i = (P^i_x, P^i_y, P^i_w, P^i_h)$ specifies the pixel coordinates of the center of proposal $p^i$’s bounding box together with $P^i$’s width and height in pixels

ground-truth bounding box G is specified in the same way

our goal is to learn a transformation that maps a proposed box P to a ground-truth box G

we parameterize the transformation in terms of four functions $d_x(P), d_y(P), d_w(P), d_h(P)$

we can transform an input proposal P into a predicted ground-truth box $\hat G$ by applying the transformation:

• $\hat G_x = P_wd_x(P) + P_x$
• $\hat G_y = P_hd_y(P) + P_y$
• $\hat G_w = P_wexp(d_w(P))$
• $\hat G_h = P_hexp(d_h(P))$

we learn $w_*$ by optimizing the regularized least square objective:

where $\phi_5$ is pool5 features of proposal P and $t$ is calculated by ground-truth value

subtle issues:

• regularization is important($\lambda = 1000$)
• care must be taken when selecting which training pairs (P, G) to use(only use IoU >= 0.6)

we found that iterating does not improve results

skip

## The ILSVRC2013 Detection Dataset

less homogeneous than PASCAL VOC

### dataset overview

train (395,918), val (20,121), and test (40,152)

our general strategy is to rely heavily on the val set and use some of the train images as an auxiliary source of positive examples

### region proposals

we followed the same region proposal approach that was used for detection on PASCAL

### training data

we formed a set of images and boxes that includes all selective search and ground-truth boxes from val1 together with up to N ground-truth boxes per class from train

An initial experiment indicated that mining negatives from all of val1, versus a 5000 image subset (roughly half of it), resulted in only a 0.5 percentage point drop in mAP, while cutting SVM training time in half

The bounding-box regressors were trained on val1

### validation and evaluation

we validated data on the val2 set

### ablation study

essentially no difference between N = 500 and N = 1000

### relationship to OverFeat

OverFeat can be seen (roughly) as a special case of R-CNN

• selective search vs multi-scale pyramid
• per-class bounding-box regressors vs single bounding-box regressor

It is worth noting that OverFeat has a significant speed advantage over R-CNN(9x faster)

## Semantic Segmentation

### CNN features for segmentation

we evaluate three strategies for computing features on CPMC regions:

• (full) ignores the region’s shape and computes CNN features directly on the warped window
• (fg) computes CNN features only on a region’s foreground mask
• (full+fg) simply concatenates the full and fg features

### results on VOC 2011

• layer fc6 always outperforms fc7
• the context provided by the full features is highly informative even given the fg features

## Conclusion

two insights:

• The first is to apply high-capacity convolutional neural networks to bottom-up region proposals in order to localize and segment objects
• The second is a paradigm for training large CNNs when labeled training data is scarce.

## Reference

23. D. Hoiem, Y. Chodpathumwan, and Q. Dai. Diagnosing error in object detectors. In ECCV. 2012.