## 总结

- 本文针对CNN对局部信息的计算提出了一种新的操作结构，用来捕获全局信息之间的关系
- 所谓non-local计算就是当计算某一点的响应结果时，将全局响应值带权求和
- 使用non-local有以下几个好处:
- 计算某一点的输出时考虑该点与全局的相关性
- 计算不复杂的同时可以涨点
- non-local操作保持原输入输出的尺寸不变，可以方便的加到网络中去

- 通用公式：$y_i= \frac {1} {C(x)}\sum_{\forall j}f(x_i, x_j)g(x_j)$
- 通常g函数就选为1*1卷积
- f函数一版选择Embedded Gaussian高斯函数，选择其他几种实际效果差距不大
- 特别说明一点当f函数选择为Embedded Gaussian时，公式就等同于self-attention的形式，说明self-attention是non-local的一种特例

- 在non-local的具体实现上有一些改动用来加速计算
- 计算相关性矩阵的时候将channel数目减半
- 计算相关性的时候不，枚举计算全局的点，可以通过pooling做降采样

- 训练的细节
- 使用image-net的模型作为初始化
- 只在non-local的最后输出加BN，其余地方都不加

- 调参试验
- 关于f函数的选择？对结果影响不大
- 把non-local块加在哪里?对结果影响不大，加在最后一层有轻微掉点
- 加non-local块的数量？加的越多越好
- 全局时间/全局空间/全局时空?将non-local单独应用于时间/空间均有提升，不过都不如应用于全局时空提升大
- C2D+non-local与C3D作对比？C2D+non-local好
- 将non-local应用于C3D?也可以获得提升
- 加长时间维度？可以取得更好的效果

## Abstract

In this paper we present non-local operations as a generic family of building blocks for capturing long-range dependencies

## Intorduction

Repeating local operations has several limitations:

- it is computationally inefficient
- it causes optimization difficulties that need to be carefully addressed
- these challenges make multihop dependency modeling difficult

A non-local operation computes the response at a position as a weighted sum of the features at all positions in the input feature maps

There are several advantages of using non-local operations:

- In contrast to the progressive behavior of recurrent and convolutional operations, non-local operations capture long-range dependencies directly by computing interactions between any two positions, regardless of their positional distance
- As we show in experiments, non-local operations are efficient and achieve their best results even with only a few layers
- Finally, our non-local operations maintain the variable input sizes and can be easily combined with other operations

## Related Work

skip-over

## Non-local Neural Networks

### Formulation

We define a generic non-local operation in deep neural networks as:

- Here i is the index of an output position
- j is the index that enumerates all possible positions
- x is the input signal
- y is the output signal of the same size as x
- A pairwise function f computes a scalar (representing relationship such as affinity) between i and all j
- The unary function g computes a representation of the input signal at the position j
- The response is normalized by a factor C(x)

### Instantiations

Next we describe several versions of f and g

Interestingly our non-local models are not sensitive to these choices

For simplicity, we only consider g in the form of a linear embedding

Next we discuss choices for the pairwise function f

#### Gaussian

#### Embedded Gaussian

Here $\theta(x_i) = W_\theta x_i$ and $\phi(x_j) = W_\phi x_j$ are two embeddings

We note that the self-attention module for a given i, $\frac 1 {C(x)}f(x_i, x_j)$ becomes the softmax computation.

Therefore self-attention is a special case of non-local operations in Embedded Gaussian version.

The attentional behavior is not essential in the applications we study

#### Dot puoduct

In this case, we set normalization factor as $C(x) = N$ where N is the number of positions in x, rather than the sum of f, because it simplifies gradient computation

#### Concatenation

### Non-local block

We define a non-local block as:

where $y_i$ is output of non-local operation and $+x_i$ denotes a residual connection

We further adopt the following implementations that make it more efficient

#### Implementation of Non-local blocks

We set the number of channels represented by $W_g, W_\theta,W_\phi$ to be half of the number of channels in x

We can change x to a subsampling version $\hat x$(e.g. by pooling)

## Video Classification Models

First we describe our baseline network architectures for this task, and then extend them into 3D ConvNets and our proposed non-local nets

#### 2D ConvNet baseline(C2D)

The only operation involving the temporal domain are the pooling layers

#### Inflated 3D ConvNet (I3D)

2D k×k kernel can be inflated as a 3D t×k×k kernel

Initialize: each of the t planes in the t×k×k kernel is initialized by the pre-trained k×k weights, rescaled by 1/t

#### Non-local network

We insert non-local blocks into C2D or I3D to turn them into non-local nets

The implementation details are described in the next section

### Implementation Details

#### Traning

- pre-trained on ImageNet
- fine-tune our models using 32-frame input clips
- These clips are formed by randomly cropping out 64 consecutive frames from the original full-length video and then dropping every other frame
- We add a BN layer right after the last 1×1×1 layer that represents $W_z$ ; we do not add BN to other layers in a non-local block

#### Inference

For the temporal domain, in our practice we sample 10 clips evenly from a full-length video and compute the softmax scores on them individually. The final prediction is the averaged softmax scores of all clips

## Experiments on Video Classification

### Experiments on Kinetics

#### Instantiations

Interestingly, the embedded Gaussian, dot-product, and concatenation versions perform similarly

Our experiments show that the attentional (softmax) behavior of this module is not the key to the improvement in our applications; instead, it is more likely that the non-local behavior is important, and it is insensitive to the instantiations

In the rest of this paper, we use the embedded Gaussian version by default

#### Which stage to add non-local blocks?

The improvement is similar

#### Going deeper with non-local blocks

This table shows the result of more non-local blocks

More non-local blocks in general lead to better results

#### Non-local in spacetime

This table we study the effect of non-local blocks applied along space, time or spacetime

both the space-only and time-only versions improve over the C2D baseline, but are inferior to the spacetime version

#### Non-local net vs. 3D ConvNet

This comparison shows that our method can be more effective than 3D convolutions when used alone

#### Non-local 3D ConvNet

This table shows the results of inserting 5 blocks into I3D models

Shows that non-local operations and 3D convolutions are complementary

#### Longer sequences

Finally we investigate the generality of our models on longer input videos. We use input clips consisting of 128 consecutive frames without subsampling

We also find that our NL I3D can maintain its gain over the I3D counterparts

#### Comparisons with state-of-the-art results

### Experiments on Charades

## Extension: Experiments on COCO

### Object detection and instance segmentation

We modify the Mask R-CNN backbone by adding one non-local block

This comparison suggests that non-local dependency has not been sufficiently captured by existing models despite increased depth/capacity

#### Keypoint detection

## Conclusion

skip-over