
Non-local Neural Networks

Summary

  • This paper addresses the locality of CNN computation by proposing a new operation that captures relationships across the whole input (global, long-range dependencies)
  • A non-local computation forms the response at a position as a weighted sum of the values at all positions
  • Non-local operations have several advantages:
    • The output at a position takes that position's correlation with the whole input into account
    • The computation is cheap while still improving accuracy
    • A non-local operation keeps the input and output sizes unchanged, so it can be conveniently inserted into existing networks
  • Generic formula: $y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j) g(x_j)$
    • The function g is usually chosen as a 1×1 convolution
    • The function f is generally chosen as the Embedded Gaussian; in practice the other choices perform about the same
    • Notably, when f is the Embedded Gaussian, the formula matches the self-attention form, so self-attention is a special case of the non-local operation
  • The concrete implementation makes a few modifications to speed up the computation:
    • The number of channels used when computing the affinity matrix is halved
    • Rather than enumerating every global position when computing affinities, the input can be downsampled by pooling
  • Training details:
    • An ImageNet-pretrained model is used for initialization
    • BN is added only after the final output of the non-local block, nowhere else
  • Ablation experiments:
    • Choice of f? Little effect on the results
    • Where to add the non-local block? Little effect, though adding it at the last stage gives a slight drop
    • Number of non-local blocks? The more, the better
    • Global time / global space / global spacetime? Applying non-local along time or space alone already helps, but neither gains as much as the full spacetime version
    • C2D + non-local vs. C3D? C2D + non-local is better
    • Applying non-local to C3D? Also yields gains
    • Longer temporal extent? Gives better results

Abstract

In this paper we present non-local operations as a generic family of building blocks for capturing long-range dependencies

Introduction

Repeating local operations has several limitations:

  • it is computationally inefficient
  • it causes optimization difficulties that need to be carefully addressed
  • these challenges make multi-hop dependency modeling difficult

A non-local operation computes the response at a position as a weighted sum of the features at all positions in the input feature maps

4f4ad7a9bb78a03a26159c614df6bb5d.png

There are several advantages of using non-local operations:

  • In contrast to the progressive behavior of recurrent and convolutional operations, non-local operations capture long-range dependencies directly by computing interactions between any two positions, regardless of their positional distance
  • As we show in experiments, non-local operations are efficient and achieve their best results even with only a few layers
  • Finally, our non-local operations maintain the variable input sizes and can be easily combined with other operations

skip-over

Non-local Neural Networks

Formulation

We define a generic non-local operation in deep neural networks as:

$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j) g(x_j)$

  • Here i is the index of an output position
  • j is the index that enumerates all possible positions
  • x is the input signal
  • y is the output signal of the same size as x
  • A pairwise function f computes a scalar (representing relationship such as affinity) between i and all j
  • The unary function g computes a representation of the input signal at the position j
  • The response is normalized by a factor C(x)
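
A minimal sketch of this generic operation for a flattened input, assuming a PyTorch-style layout (all names here are illustrative, not from the paper's code):

```python
import torch

def non_local(x, f, g):
    """Generic non-local operation: y_i = 1/C(x) * sum_j f(x_i, x_j) g(x_j).

    x: (N, C) tensor holding C-channel features at N positions.
    f: pairwise function, (N, C) x (N, C) -> (N, N) affinity matrix.
    g: unary function, (N, C) -> (N, C') per-position representations.
    """
    affinity = f(x, x)                        # entry [i, j] holds f(x_i, x_j)
    c_x = affinity.sum(dim=1, keepdim=True)   # C(x): sum of f over all j
    return (affinity / c_x) @ g(x)            # weighted sum over positions j

# Gaussian instantiation: f(x_i, x_j) = exp(x_i^T x_j), g = identity
x = torch.randn(10, 16) * 0.1
y = non_local(x, lambda a, b: torch.exp(a @ b.T), lambda v: v)  # same shape as x
```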

Instantiations

Next we describe several versions of f and g

Interestingly our non-local models are not sensitive to these choices

9280fc6fea94b8f8d99102f5755356df.png

For simplicity, we only consider g in the form of a linear embedding

Next we discuss choices for the pairwise function f

Gaussian

$f(x_i, x_j) = e^{x_i^T x_j}$, with $C(x) = \sum_{\forall j} f(x_i, x_j)$

Embedded Gaussian

$f(x_i, x_j) = e^{\theta(x_i)^T \phi(x_j)}$

Here $\theta(x_i) = W_\theta x_i$ and $\phi(x_j) = W_\phi x_j$ are two embeddings

We note that, for a given $i$, $\frac{1}{C(x)} f(x_i, x_j)$ becomes the softmax computation along the dimension $j$.
Therefore self-attention is a special case of non-local operations in the Embedded Gaussian version.
The attentional behavior, however, is not essential in the applications we study
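
A quick numeric check of this equivalence (a toy sketch; tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

# With the Embedded Gaussian f and C(x) = sum_j f(x_i, x_j), the normalized
# weights 1/C(x) * f(x_i, x_j) are exactly a softmax over j.
theta_x = torch.randn(10, 64) * 0.1   # theta(x) = W_theta x, one row per position
phi_x = torch.randn(10, 64) * 0.1     # phi(x)   = W_phi x

logits = theta_x @ phi_x.T            # (N, N) scores theta(x_i)^T phi(x_j)
manual = torch.exp(logits) / torch.exp(logits).sum(dim=1, keepdim=True)
assert torch.allclose(manual, F.softmax(logits, dim=1))  # self-attention weights
```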

Dot product

$f(x_i, x_j) = \theta(x_i)^T \phi(x_j)$

In this case, we set the normalization factor to $C(x) = N$, where N is the number of positions in x, rather than the sum of f, because it simplifies gradient computation
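
The same toy setup with the dot-product f and the fixed $C(x) = N$ (again an illustrative sketch):

```python
import torch

# Dot-product instantiation: f(x_i, x_j) = theta(x_i)^T phi(x_j), normalized
# by C(x) = N (the number of positions) rather than by sum_j f.
theta_x, phi_x, g_x = (torch.randn(10, 64) for _ in range(3))
n = theta_x.shape[0]                   # N = 10 positions
y = ((theta_x @ phi_x.T) / n) @ g_x    # y_i = 1/N * sum_j f(x_i, x_j) g(x_j)
```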

Concatenation

$f(x_i, x_j) = \text{ReLU}(w_f^T [\theta(x_i), \phi(x_j)])$, where $[\cdot, \cdot]$ denotes concatenation and $w_f$ projects the concatenated vector to a scalar; here $C(x) = N$

Non-local block

We define a non-local block as:

$z_i = W_z y_i + x_i$

where $y_i$ is the output of the non-local operation given above and $+x_i$ denotes a residual connection

74d2295d3c10e97d08ee2f692fc65fa8.png

We further adopt the following implementations that make it more efficient

Implementation of Non-local blocks

We set the number of channels represented by $W_g$, $W_\theta$, $W_\phi$ to be half of the number of channels in x

We can change x to a subsampled version $\hat{x}$ (e.g., by pooling)
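
Putting the block definition and both efficiency tricks together, an embedded Gaussian non-local block for 2D feature maps might look like the following (an illustrative PyTorch re-implementation under my reading of the paper, not the authors' code; the video version would replace Conv2d with Conv3d):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock2D(nn.Module):
    """Embedded Gaussian non-local block: z = W_z y + x (residual)."""

    def __init__(self, channels, subsample=True):
        super().__init__()
        inter = channels // 2  # channel halving for theta / phi / g
        self.theta = nn.Conv2d(channels, inter, kernel_size=1)
        self.phi = nn.Conv2d(channels, inter, kernel_size=1)
        self.g = nn.Conv2d(channels, inter, kernel_size=1)
        self.w_z = nn.Conv2d(inter, channels, kernel_size=1)
        self.bn = nn.BatchNorm2d(channels)  # BN only after W_z
        self.subsample = subsample          # pool phi / g to cut pairwise cost

    def forward(self, x):
        n, c, h, w = x.shape
        theta = self.theta(x).flatten(2).transpose(1, 2)   # (n, hw, c/2)
        phi, g = self.phi(x), self.g(x)
        if self.subsample:                                 # \hat{x}: 2x2 max pool
            phi, g = F.max_pool2d(phi, 2), F.max_pool2d(g, 2)
        phi = phi.flatten(2)                               # (n, c/2, hw')
        g = g.flatten(2).transpose(1, 2)                   # (n, hw', c/2)
        attn = F.softmax(theta @ phi, dim=-1)              # f / C(x): softmax over j
        y = (attn @ g).transpose(1, 2).reshape(n, c // 2, h, w)
        return self.bn(self.w_z(y)) + x                    # residual connection
```

Because the output shape matches the input, a block like `NonLocalBlock2D(1024)` can be dropped between any two stages of a backbone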

Video Classification Models

First we describe our baseline network architectures for this task, and then extend them into 3D ConvNets and our proposed non-local nets

2D ConvNet baseline (C2D)

6c621fec65df85f8072abc9027428aa4.png

The only operations involving the temporal domain are the pooling layers

Inflated 3D ConvNet (I3D)

A 2D k×k kernel can be inflated into a 3D t×k×k kernel

Initialization: each of the t planes in the t×k×k kernel is initialized with the pre-trained k×k weights, rescaled by 1/t
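
A sketch of this inflation under the stated initialization (the function name is illustrative):

```python
import torch

def inflate_conv_weight(w2d, t):
    """Inflate a pre-trained 2D conv weight of shape (out, in, k, k) into a
    3D weight of shape (out, in, t, k, k): copy the k x k weights into each
    of the t temporal planes and rescale by 1/t, so a video of repeated
    identical frames yields the same response as the original 2D network."""
    return w2d.unsqueeze(2).repeat(1, 1, t, 1, 1) / t
```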

Non-local network

We insert non-local blocks into C2D or I3D to turn them into non-local nets

The implementation details are described in the next section

Implementation Details

Training

  • pre-trained on ImageNet
  • fine-tune our models using 32-frame input clips
  • These clips are formed by randomly cropping 64 consecutive frames from the original full-length video and then dropping every other frame (see the sketch after this list)
  • We add a BN layer right after the last 1×1×1 layer that represents $W_z$; we do not add BN to other layers in a non-local block
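
The clip construction can be sketched as follows (assuming `frames` is a decoded frame sequence; names are illustrative):

```python
import random

def sample_training_clip(frames, span=64, stride=2):
    """Randomly crop `span` consecutive frames from the full-length video,
    then drop every other frame (stride 2), giving a 32-frame clip."""
    start = random.randint(0, len(frames) - span)
    return frames[start : start + span : stride]
```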

Inference

For the temporal domain, in our practice we sample 10 clips evenly from a full-length video and compute the softmax scores on them individually. The final prediction is the averaged softmax scores of all clips
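
A sketch of this clip-level evaluation (assuming `model` returns per-clip logits and `clips` is a list of clip tensors; names illustrative):

```python
import torch
import torch.nn.functional as F

def predict_video(model, clips):
    """Score each evenly sampled clip independently (the paper uses 10 clips
    per video) and average the per-clip softmax scores for the prediction."""
    scores = [F.softmax(model(clip.unsqueeze(0)), dim=1) for clip in clips]
    return torch.stack(scores).mean(dim=0)
```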

Experiments on Video Classification

Experiments on Kinetics

806541b0a703a4a7edd147032ca11198.png

Instantiations

Interestingly, the embedded Gaussian, dot-product, and concatenation versions perform similarly

Our experiments show that the attentional (softmax) behavior of this module is not the key to the improvement in our applications; instead, it is more likely that the non-local behavior is important, and it is insensitive to the instantiations

In the rest of this paper, we use the embedded Gaussian version by default

Which stage to add non-local blocks?

56366a357571bb165af1433fa17578ae.png

The improvement is similar when the block is added to an earlier stage; adding it to the last stage gives a slightly smaller gain

Going deeper with non-local blocks

This table shows the results of adding more non-local blocks

3626927c2230cb6903235d604d628d7a.png

More non-local blocks in general lead to better results

Non-local in spacetime

In this table we study the effect of non-local blocks applied along space, time, or spacetime

a62de33b7a70fcabedf424abb381afaa.png

Both the space-only and time-only versions improve over the C2D baseline, but both are inferior to the spacetime version

Non-local net vs. 3D ConvNet

852cf2a7219a56a16db4458dc2402c02.png

This comparison shows that our method can be more effective than 3D convolutions when used alone

Non-local 3D ConvNet

This table shows the results of inserting 5 blocks into I3D models

70a8c2fef66bf44dc79d23b1258702d8.png

This shows that non-local operations and 3D convolutions are complementary

Longer sequences

Finally we investigate the generality of our models on longer input videos. We use input clips consisting of 128 consecutive frames without subsampling

b424af3af564c49b53378f4f42e438f1.png

We also find that our NL I3D can maintain its gain over the I3D counterparts

Comparisons with state-of-the-art results

9e511584c90c9c80210415af9398affe.png

Experiments on Charades

327398c6f4211d3b524b8b7cb5b874ed.png

Extension: Experiments on COCO

Object detection and instance segmentation

We modify the Mask R-CNN backbone by adding one non-local block

13d88324c791732f8944fd0cc80f94f4.png

This comparison suggests that non-local dependency has not been sufficiently captured by existing models despite increased depth/capacity

Keypoint detection

e1b72eb200580bb51b51d153f1b4fa0e.png

Conclusion

skip-over