
Image processing and forward propagation using binary representations, and robust audio analysis using deep learning

The work presented in this thesis consists of three main topics:
document segmentation and classification into text and score,
efficient computation with binary representations, and deep learning
architectures for polyphonic music transcription and classification.

In the case of musical documents, an important
problem is separating text from musical score by detecting the
corresponding bounding boxes. A new algorithm is
proposed for pixel-wise classification of digital documents into musical
score and text, based on a bag-of-visual-words approach and
random forest classification. A robust technique for identifying
bounding boxes of text and musical score from the pixel-wise
classification is also proposed.
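
As a rough, hypothetical illustration of this kind of pipeline (not the algorithm described in the thesis), the sketch below builds a visual vocabulary from densely sampled patches of a synthetic page, describes each sampled pixel by a histogram of visual words in its neighbourhood, and classifies it with a random forest. The patch and window sizes, vocabulary size, and toy labels are all assumptions.

```python
# Illustrative sketch only: bag-of-visual-words + random forest
# pixel classifier on a synthetic page; all parameters are assumptions.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.ensemble import RandomForestClassifier

def dense_patches(img, size=9, stride=8):
    """Densely sample square patches and their centre coordinates."""
    half = size // 2
    patches, centres = [], []
    for y in range(half, img.shape[0] - half, stride):
        for x in range(half, img.shape[1] - half, stride):
            patches.append(img[y - half:y + half + 1,
                               x - half:x + half + 1].ravel())
            centres.append((y, x))
    return np.asarray(patches, dtype=np.float32), centres

rng = np.random.default_rng(0)
page = (rng.random((128, 128)) > 0.9).astype(np.float32)  # toy binary page
labels = np.zeros((128, 128), dtype=int)
labels[64:] = 1                       # pretend: top half text, bottom half score

patches, centres = dense_patches(page)

# 1) Visual vocabulary: quantise raw patches into k visual words.
kmeans = MiniBatchKMeans(n_clusters=32, random_state=0).fit(patches)
words = kmeans.predict(patches)

# 2) Per-pixel feature: normalised histogram of nearby visual words.
def word_histogram(i, radius=32, k=32):
    cy, cx = centres[i]
    hist = np.zeros(k)
    for j, (y, x) in enumerate(centres):
        if abs(y - cy) <= radius and abs(x - cx) <= radius:
            hist[words[j]] += 1
    return hist / max(hist.sum(), 1)

X = np.array([word_histogram(i) for i in range(len(centres))])
y = np.array([labels[r, c] for r, c in centres])

# 3) Random forest assigns each sampled pixel to {text, score}.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```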

For efficient processing of learned models, we turn our attention to
binary representations. When dealing with binary data, bit-packing and
bit-wise computation can reduce computational time and memory
requirements considerably. Efficiency is a key factor when processing
large-scale datasets and in industrial applications.
We propose a bit-packed representation for binary images that encodes
both pixels and square neighborhoods, and design SPmat, an optimized
framework for binary image processing, around it.
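
As a minimal sketch of why this pays off, the example below stores a binary image at one bit per pixel with NumPy's packbits and derives, for every pixel, a 9-bit code of its 3x3 neighborhood so that simple morphological tests become bit-wise comparisons. The data layout here is an assumption for illustration only and is not the SPmat representation itself.

```python
# Illustrative sketch only: bit-packed storage plus per-pixel 3x3
# neighbourhood codes; this is NOT the SPmat data layout.
from itertools import product
import numpy as np

rng = np.random.default_rng(0)
img = (rng.random((64, 64)) > 0.5).astype(np.uint8)   # binary image, 1 byte/pixel

# 1 bit per pixel storage: 8x smaller than the uint8 version.
packed = np.packbits(img, axis=1)
print(img.nbytes, "bytes ->", packed.nbytes, "bytes")

# Encode every pixel's 3x3 neighbourhood as a 9-bit code by OR-ing
# shifted copies of the zero-padded image: bit k holds the k-th neighbour.
pad = np.pad(img, 1)
code = np.zeros(img.shape, dtype=np.uint16)
for bit, (dy, dx) in enumerate(product((-1, 0, 1), repeat=2)):
    window = pad[1 + dy:1 + dy + img.shape[0], 1 + dx:1 + dx + img.shape[1]]
    code |= window.astype(np.uint16) << bit

# Morphology now reduces to bit-wise tests on the codes.
erosion = (code == 0x1FF).astype(np.uint8)    # all nine neighbourhood bits set
dilation = (code != 0).astype(np.uint8)       # at least one bit set
print("eroded:", int(erosion.sum()), "dilated:", int(dilation.sum()))
```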

Bit-packing and bit-wise computation can also be used for efficient
forward propagation in deep neural networks. Quantized deep neural
networks have recently been proposed with the goal of improving
computational time and memory requirements while maintaining
classification performance as much as possible. A particular type of
quantized neural network is the binary neural network, in which
the weights and activations are constrained to $-1$ and $+1$. In this
thesis, we describe and evaluate Espresso, a novel optimized framework
for fast inference of binary neural networks that takes advantage of
bit-packing and bit-wise computations. Espresso is self-contained,
written in C/CUDA, and provides optimized implementations of all the
building blocks needed to perform forward propagation.
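
Espresso itself is implemented in C/CUDA; purely as an illustration of the underlying trick, the NumPy sketch below packs $\pm 1$ vectors into bit strings and computes their dot products with XOR and popcount, using the standard identity dot = n - 2 * popcount(a XOR b). The layer sizes and function names are illustrative assumptions, not Espresso's API.

```python
# Illustrative sketch of binary forward propagation via bit-packing;
# Espresso's actual kernels are written in C/CUDA.
import numpy as np

def pack_sign(v):
    """Pack a {-1, +1} vector into bits: +1 -> 1, -1 -> 0."""
    return np.packbits((v > 0).astype(np.uint8))

def binary_dot(pa, pb, n):
    """dot(a, b) for +-1 vectors of length n from their packed forms:
    matching bits contribute +1, differing bits -1, so
    dot = n - 2 * popcount(a XOR b)."""
    diff = np.unpackbits(pa ^ pb).sum()     # popcount of the XOR
    return n - 2 * int(diff)

rng = np.random.default_rng(0)
n = 256                                     # multiple of 8: no padding bits
x = rng.choice([-1, 1], size=n)
W = rng.choice([-1, 1], size=(16, n))       # 16 binary "neurons"

xp = pack_sign(x)
Wp = np.stack([pack_sign(row) for row in W])

# Forward pass of one binarized fully connected layer + sign activation.
pre = np.array([binary_dot(wp, xp, n) for wp in Wp])
out = np.sign(pre)
assert np.array_equal(pre, W @ x)           # matches the float dot product
print(pre[:5], out[:5])
```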

Deep neural networks have achieved state-of-the-art results and
outperformed traditional machine learning methods in many applications,
such as computer vision, speech recognition, and machine translation.
However, in the case of music information retrieval (MIR) and audio
analysis, shallow neural networks are commonly used, and the
effectiveness of deep and very deep architectures for MIR and audio
tasks has not been explored in detail. It is also not clear what the
best input representation is for a particular task. We therefore
investigate deep neural networks for the following audio analysis
tasks: polyphonic music transcription, musical genre classification,
and urban sound classification. We analyze the performance of common
classification network architectures using different input
representations, paying specific attention to residual networks. We
also evaluate the robustness of these models to degraded audio
using different combinations of training and testing data. Through
experimental evaluation, we show that residual networks provide
consistent performance improvements when analyzing degraded audio
across different representations and tasks. Finally, we present a
convolutional architecture based on U-Net that improves the polyphonic
music transcription performance of different baseline transcription
networks.
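
As a small, self-contained illustration of the residual idea on a spectrogram-like input (the concrete architectures, input representations, and the U-Net-based transcription model evaluated in the thesis differ), the PyTorch sketch below applies two residual blocks with identity skip connections to a random tensor shaped like a batch of log-mel spectrograms.

```python
# Illustrative sketch only: a residual block on spectrogram-shaped input,
# not one of the architectures evaluated in the thesis.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return F.relu(x + y)                # identity skip connection

# Stand-in for a (batch, 1, mel bins, frames) log-mel spectrogram.
spec = torch.randn(4, 1, 96, 128)
net = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), ResBlock(16), ResBlock(16))
print(net(spec).shape)                      # torch.Size([4, 16, 96, 128])
```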

Identifier: oai:union.ndltd.org:uvic.ca/oai:dspace.library.uvic.ca:1828/10653
Date: 15 March 2019
Creators: Pedersoli, Fabrizio
Contributors: Tzanetakis, George
Source Sets: University of Victoria
Language: English
Detected Language: English
Type: Thesis
Format: application/pdf
Rights: Available to the World Wide Web