This dissertation addresses the problem of automatically discovering structure in music from audio signals by introducing novel approaches and proposing perceptually enhanced evaluations. First, the problem of music structure analysis is reviewed from the perspectives of music information retrieval (MIR) and music perception and cognition (MPC), including a discussion of the limitations and current challenges in both disciplines. As part of this review, the existing methods for evaluating the outputs of algorithms that discover musical structure are discussed, and a transparent, open-source software package called mir_eval, which contains implementations of these evaluation metrics, is introduced.

Then, four MIR algorithms are presented: one to compress music recordings into audible summaries, another to discover musical patterns in an audio signal, and two to identify the large-scale, non-overlapping segments of a musical piece. After discussing these techniques, and given the differences in how listeners perceive musical structure, more MPC-oriented approaches are considered in order to obtain perceptually relevant evaluations for music segmentation. A methodology to automatically identify the tracks that are most difficult for machines to annotate is presented, so that these tracks can be included in the design of a human study that collects multiple human annotations. To select these tracks, a novel open-source framework called the Music Structure Analysis Framework (MSAF) is introduced; it contains the most relevant music segmentation algorithms and uses mir_eval to evaluate them transparently. Moreover, MSAF makes use of the JSON Annotated Music Specification (JAMS), a new format that holds multiple annotations for several tasks in a single file, simplifying both dataset design and the analysis of agreement across different human references.

The human study used to collect these additional annotations, which are stored in JAMS files, is then described: five new annotations are gathered for each of fifty tracks. These annotations are analyzed, confirming the problem of relying on ground-truth datasets with a single annotator per track, given the high degree of disagreement among annotators on the most challenging tracks. To alleviate this, the annotations are merged to produce a more robust human reference annotation. Lastly, the F-measure of the standard hit-rate metric for evaluating music segmentation is analyzed for the case in which additional annotations are unavailable, and it is shown, via multiple human studies, that precision seems more perceptually relevant than recall.
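As an illustration of the kind of transparent evaluation mir_eval provides, the following minimal sketch computes the hit-rate boundary metrics for a pair of segmentations. The interval values are invented for the example; the call shown is mir_eval's public segment-detection function with its conventional 0.5-second tolerance window.

    import numpy as np
    import mir_eval

    # Reference and estimated segmentations as (start, end) intervals in
    # seconds (values invented for this example).
    ref_intervals = np.array([[0.0, 15.0], [15.0, 40.0], [40.0, 60.0]])
    est_intervals = np.array([[0.0, 14.6], [14.6, 41.2], [41.2, 60.0]])

    # Hit-rate boundary retrieval: precision, recall, and F-measure, where an
    # estimated boundary counts as a hit if it falls within 0.5 seconds of a
    # reference boundary.
    P, R, F = mir_eval.segment.detection(ref_intervals, est_intervals,
                                         window=0.5)
    print(P, R, F)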
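For reference, the hit-rate metric computed above combines boundary precision P and recall R through the standard F-measure. In the general F-beta form (standard information-retrieval notation, not specific to this dissertation), a beta below 1 weights precision more heavily, which is consistent with the perceptual finding summarized in the abstract:

    F_\beta = (1 + \beta^2)\,\frac{P \cdot R}{\beta^2 P + R},
    \qquad
    F_1 = \frac{2 P R}{P + R}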
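A minimal sketch of how MSAF and JAMS fit together, assuming current versions of both Python packages; the file paths and the choice of the "foote" boundary algorithm are hypothetical. msaf.process runs a segmentation algorithm on an audio file, and jams.load retrieves the possibly multiple reference annotations stored in a single JAMS file.

    import msaf
    import jams

    # Run one of MSAF's boundary algorithms on an audio file
    # (hypothetical path and algorithm choice).
    boundaries, labels = msaf.process("audio/track.mp3", boundaries_id="foote")

    # A single JAMS file can hold several human references for the same track,
    # which is what enables the agreement analysis described above.
    jam = jams.load("references/track.jams")
    segment_anns = jam.search(namespace="segment_open")
    print(len(segment_anns), "segment annotations found")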
Identifier | oai:union.ndltd.org:PROQUEST/oai:pqdtoai.proquest.com:3705329 |
Date | 22 August 2015 |
Creators | Nieto, Oriol |
Publisher | New York University |
Source Sets | ProQuest.com |
Language | English |
Detected Language | English |
Type | thesis |