Event Title

Sound Classification with Convolutional Neural Networks

Presenter Information

Pedro Ribeiro, Oberlin College

Location

Science Center A254

Start Date

10-27-2017 3:00 PM

End Date

10-27-2017 4:20 PM

Research Program

Massachusetts Institute of Technology (MIT) Summer Research Program (MSRP) BIO funded by NSF

Abstract

In recent years, convolutional neural networks (CNNs) have reached human-like performance on real-world tasks. When trained on image, speech, and music classification tasks, CNNs achieve state-of-the-art performance and, in some cases, replicate properties of biological sensory systems. Currently, most labeled audio databases are restricted either in the scope of their labels (i.e., containing only speech or music) or in size. As a result, networks are typically trained only on speech or music tasks, limiting the extent to which they can be compared to the auditory system as a whole. Here, we use Google AudioSet, a newly released collection of multi-label sound clips drawn from YouTube, to train a convolutional neural network on a broad audio classification task. AudioSet contains over 2 million audio samples, each annotated with up to 15 of 527 possible labels. Our network consists of five hierarchical convolutional layers with local response normalization, followed by pooling after the first, second, and fifth layers. The model was implemented in TensorFlow and optimized with the Adam optimizer and a cross-entropy loss function. We explored changes to the optimization and architecture, such as varying the learning rate, filter sizes, and pooling type. Notably, replacing max pooling with weighted average pooling using a Hanning window did not decrease performance on the task and, for some learning rates, improved it. Confusion patterns revealed implicit knowledge of sound category structure (for instance, the trained networks confused genres of music). Future work will further explore architecture and hyperparameter optimization, as well as training on new tasks, such as predicting the number of labels, using the same dataset and architecture. Additionally, we will compare the network's performance and classification errors to human behavior on a similar task and synthesize sounds from the hidden layers.
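
To make the setup above concrete, the following is a minimal TensorFlow/Keras sketch of a five-layer CNN with local response normalization, pooling after the first, second, and fifth convolutional layers, a 527-way multi-label sigmoid output trained with Adam and a cross-entropy loss, and a Hanning-window weighted average pooling operation that can stand in for max pooling. The filter counts, kernel sizes, input spectrogram shape, and the hann_pool helper are illustrative assumptions, not the project's actual configuration.

import numpy as np
import tensorflow as tf

def hann_pool(x, pool_size=3, stride=2):
    # Weighted average pooling with a 2-D Hanning window (an assumed
    # implementation of the pooling variant described in the abstract).
    win = np.hanning(pool_size + 2)[1:-1]        # drop the zero endpoints
    kernel_2d = np.outer(win, win)
    kernel_2d /= kernel_2d.sum()                 # normalize so it averages
    channels = int(x.shape[-1])
    kernel = np.tile(kernel_2d[:, :, None, None], (1, 1, channels, 1))
    kernel = tf.constant(kernel, dtype=x.dtype)
    return tf.nn.depthwise_conv2d(x, kernel,
                                  strides=[1, stride, stride, 1],
                                  padding="SAME")

def build_model(input_shape=(256, 256, 1), n_labels=527):
    # Five conv layers with local response normalization; pooling after
    # layers 1, 2, and 5; sigmoid outputs over 527 labels.
    lrn = tf.keras.layers.Lambda(tf.nn.local_response_normalization)
    pool = lambda t: tf.keras.layers.Lambda(hann_pool)(t)  # or MaxPooling2D
    inp = tf.keras.Input(shape=input_shape)
    x = tf.keras.layers.Conv2D(64, 7, strides=2, padding="same", activation="relu")(inp)
    x = pool(lrn(x))
    x = tf.keras.layers.Conv2D(128, 5, padding="same", activation="relu")(x)
    x = pool(lrn(x))
    x = tf.keras.layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    x = pool(x)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    out = tf.keras.layers.Dense(n_labels, activation="sigmoid")(x)  # multi-label output
    model = tf.keras.Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="binary_crossentropy")   # per-label cross-entropy
    return model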

Notes

Session I, Panel 4 - Sound | Science
Moderator: Joseph Lubben, Associate Professor of Music Theory

Major

Computer Science

Project Mentor(s)

Jenelle Feather and Josh McDermott, MIT

Document Type

Presentation
