CNN Architectures for Large-Scale Audio Classification

Summary

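The paper proposes feeding spectrogram features to a CNN in place of the earlier pipeline of extracting MFCCs and passing them to a classifier, with fairly good results.
The part of most interest here is how a piece of audio is turned into model input:
- Split the audio into 960 ms segments
- Within each 960 ms segment, apply a short-time Fourier transform with a 25 ms window every 10 ms
- Map the resulting spectrogram onto 64 mel-spaced frequency bins
- Take the log of each bin
- This yields a 96 × 64 two-dimensional patch, which is then processed by a CNN
The rest of the paper compares several models and dataset sizes; broadly, deeper models do better and larger datasets do better.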

Abstract

The paper finds that analogs of the CNNs used in image classification do well on the audio classification task

Introduction

Our labels apply to entire videos without any changes in time, so we have yet to try recurrent models

By computing log-mel spectrograms of multiple frames, we create 2D image-like patches to present to the classifiers

Dataset

YouTube-100M
4.6 min per video on average
1 or more topic identifiers from 30871 labels
5 labels per video on average

Experimental Framework

Training

Sample generation:

The audio is divided into non-overlapping 960 ms frames
The 960 ms frames are decomposed with a short-time Fourier transform applying 25 ms windows every 10 ms
The resulting spectrogram is integrated into 64 mel-spaced frequency bins
The magnitude of each bin is log-transformed after adding a small offset to avoid numerical issues
This gives log-mel spectrogram patches of 96 × 64 bins that form the input to all classifiers
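The steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code. The 16 kHz sample rate, the Hann window, the FFT size of 512, the 125 to 7500 Hz mel range, and the 0.01 log offset are all assumptions, since the notes do not give them. The sketch also computes the STFT over the whole signal and then groups 96 consecutive 10 ms frames into a patch, because windowing each 960 ms chunk separately would give fewer than 96 frames without padding.

```python
import numpy as np

SAMPLE_RATE = 16000   # assumption: not stated in these notes
WINDOW_MS, HOP_MS = 25, 10
FFT_SIZE = 512        # assumption: next power of two above the 400-sample window
NUM_MEL_BINS = 64
PATCH_FRAMES = 96     # 96 STFT frames at a 10 ms hop cover 960 ms
LOG_OFFSET = 0.01     # assumption: the value of the "small offset" is not given


def hz_to_mel(hz):
    """Convert frequency in Hz to the mel scale (HTK-style formula)."""
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)


def mel_filterbank(num_bins, fft_size, sample_rate, fmin=125.0, fmax=7500.0):
    """Triangular filters spaced evenly in mel; fmin and fmax are assumptions."""
    fft_mels = hz_to_mel(np.linspace(0.0, sample_rate / 2.0, fft_size // 2 + 1))
    edges = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), num_bins + 2)
    bank = np.zeros((fft_size // 2 + 1, num_bins))
    for i in range(num_bins):
        lower, center, upper = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_mels - lower) / (center - lower)
        falling = (upper - fft_mels) / (upper - center)
        bank[:, i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def log_mel_patches(waveform):
    """Return an array of shape (num_patches, 96, 64) from a mono waveform."""
    win = SAMPLE_RATE * WINDOW_MS // 1000
    hop = SAMPLE_RATE * HOP_MS // 1000
    num_frames = max(0, 1 + (len(waveform) - win) // hop)
    # Gather overlapping windows with index arithmetic: (num_frames, win).
    idx = np.arange(win)[None, :] + hop * np.arange(num_frames)[:, None]
    frames = waveform[idx] * np.hanning(win)
    magnitude = np.abs(np.fft.rfft(frames, n=FFT_SIZE, axis=1))
    mel = magnitude @ mel_filterbank(NUM_MEL_BINS, FFT_SIZE, SAMPLE_RATE)
    log_mel = np.log(mel + LOG_OFFSET)
    # Group consecutive frames into non-overlapping 96-frame patches.
    num_patches = num_frames // PATCH_FRAMES
    usable = log_mel[: num_patches * PATCH_FRAMES]
    return usable.reshape(num_patches, PATCH_FRAMES, NUM_MEL_BINS)
```

Calling `log_mel_patches` on three seconds of audio at the assumed sample rate gives three patches; trailing frames that do not fill a whole patch are dropped.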

Evaluation

We passed each 960 ms frame from each evaluation video through the classifier. We then averaged the classifier output scores across all segments in a video.
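A minimal sketch of this video-level scoring, assuming a hypothetical `classifier` callable that maps one 96 × 64 patch to a vector of per-class scores (the name and interface are illustrative, not from the paper):

```python
import numpy as np


def video_level_scores(patches, classifier):
    """Average per-patch class scores over all patches of one video.

    `classifier` is a hypothetical callable: (96, 64) patch -> (num_classes,) scores.
    """
    per_patch = np.stack([classifier(patch) for patch in patches])
    return per_patch.mean(axis=0)
```

Averaging keeps the output a single score per class, which matches the video-level labels.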

Architectures

Fully Connected

Our best performing model had N = 3 layers, M = 1000 units
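As a rough illustration of that configuration, the sketch below builds a network with N = 3 hidden layers of M = 1000 units on the flattened 96 × 64 patch. The ReLU activations, the sigmoid output layer, the weight initialization, and the function names are assumptions made for the example; the notes only give N and M.

```python
import numpy as np


def init_mlp(input_dim, hidden_units, num_layers, num_classes, seed=0):
    """Create (weight, bias) pairs for num_layers hidden layers plus an output layer."""
    rng = np.random.default_rng(seed)
    dims = [input_dim] + [hidden_units] * num_layers + [num_classes]
    return [
        (rng.standard_normal((a, b)) * np.sqrt(2.0 / a), np.zeros(b))
        for a, b in zip(dims[:-1], dims[1:])
    ]


def mlp_forward(params, patches):
    """Flatten each patch, apply ReLU hidden layers, then a sigmoid output (assumed)."""
    h = patches.reshape(len(patches), -1)
    for w, b in params[:-1]:
        h = np.maximum(0.0, h @ w + b)
    w, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))


# N = 3 layers of M = 1000 units; 10 output classes is an arbitrary demo value.
params = init_mlp(96 * 64, 1000, 3, 10)
```

A sigmoid per class is used here because each video can carry several labels at once.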

AlexNet

Because our inputs are 96 × 64, we use a stride of 2 × 1 so that the number of activations is similar after the initial layer
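The stride choice can be checked with the standard convolution output-size formula. The numbers below assume the original AlexNet first layer (an 11 × 11 kernel with stride 4 on a 224 × 224 image) and no padding; those values are not stated in these notes.

```python
def conv_output_size(input_size, kernel, stride, padding=0):
    """Output length of a convolution along one axis."""
    return (input_size + 2 * padding - kernel) // stride + 1


def first_layer_grid(height, width, kernel, stride_h, stride_w):
    """Spatial size of the first conv layer's output (no padding assumed)."""
    return (
        conv_output_size(height, kernel, stride_h),
        conv_output_size(width, kernel, stride_w),
    )


image_grid = first_layer_grid(224, 224, 11, 4, 4)  # assumed original AlexNet setup
audio_grid = first_layer_grid(96, 64, 11, 2, 1)    # 96 x 64 patch with stride 2 x 1
```

Under these assumptions the image case gives a 54 × 54 grid (2916 positions) and the audio case gives 43 × 54 (2322 positions), so the two are of similar size.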

VGG

We tried another variant that reduced the initial strides (as we did with AlexNet), but found that not modifying the strides resulted in faster training and better performance

Inception V3

We modified the Inception V3 by removing the first four layers of the stem, up to and including the MaxPool, as well as removing the auxiliary network

We tried including the stem and removing the first stride of 2 and MaxPool but found that it performed worse than the variant with the truncated stem

ResNet-50

We modified ResNet-50 by removing the stride of 2 from the first 7×7 convolution

Experiments

Architecture Comparison

Inception and ResNet achieve the best performance

Label Set Size

Using a 400 label subset of the 30K labels, we investigated how training with different subsets of classes can affect performance

Training Set Size

Having a very large training set available allows us to investigate how training set size affects performance

AED with the Audio Set Dataset

The log-mel baseline achieves a balanced mAP of 0.137 and AUC of 0.904

Conclusions

skip over