Music And Speech Analysis Using The 'Bach' Scale Filter-Bank
Ananthakrishnan, G
The aim of this thesis is to define a perceptual scale for the ‘Time-Frequency’ analysis of music signals. The equal tempered ‘Bach’ scale is a suitable choice, since it covers most genres of music and the error is distributed equally across the semi-tones. However, it may be necessary to allow a tolerance of around 50 cents, or half a Bach-scale interval, so that the intervals can accommodate other common intonation schemes. The thesis formulates the Bach scale filter-bank as a time-varying model and makes a comparative study with other commonly used perceptual scales. Two applications of the Bach scale filter-bank are also proposed, namely automated segmentation of speech signals and transcription of the singing voice for query-by-humming applications.
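For concreteness, a minimal sketch of how such a filter-bank can be laid out follows: center frequencies placed on the 12-tone equal tempered (Bach) grid, f_k = f_ref * 2**(k/12), with each filter spanning roughly plus or minus 50 cents. The Gaussian filter shape, the reference frequency, and the frequency range are illustrative assumptions, not the design used in the thesis.

import numpy as np

def bach_scale_filterbank(f_ref=440.0, low=55.0, high=8000.0,
                          sr=16000, n_fft=1024):
    """Sketch of a 'Bach' (12-tone equal tempered) scale filter-bank.

    Center frequencies follow f_k = f_ref * 2**(k/12); each filter is
    given a width of one semitone (+/- 50 cents).  Filter shape and
    parameter values are illustrative assumptions.
    """
    # Semitone indices whose centers lie inside [low, high]
    k_lo = int(np.ceil(12 * np.log2(low / f_ref)))
    k_hi = int(np.floor(12 * np.log2(high / f_ref)))
    centers = f_ref * 2.0 ** (np.arange(k_lo, k_hi + 1) / 12.0)

    freqs = np.linspace(0, sr / 2, n_fft // 2 + 1)   # FFT bin frequencies
    bank = np.zeros((len(centers), len(freqs)))
    for i, fc in enumerate(centers):
        # One semitone of total bandwidth: 50 cents on either side
        bw = fc * (2 ** (1 / 24) - 2 ** (-1 / 24))
        bank[i] = np.exp(-0.5 * ((freqs - fc) / (bw / 2)) ** 2)
    return centers, bank

centers, bank = bach_scale_filterbank()
print(len(centers), "filters from", centers[0], "to", centers[-1], "Hz")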
Even though this filter-bank is motivated by music, it can also be applied to speech. A method for automatically segmenting continuous speech into phonetic units is proposed. The proposed method achieves around 82% accuracy on the English database and 85% on the Hindi database, an improvement of around 2-3% over other popular methods in the literature. Interestingly, the Bach scale filters perform better than filters designed for other common perceptual scales, such as the Mel and Bark scales.
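The abstract does not spell out the segmentation algorithm itself; the sketch below shows one common surrogate, assuming phone boundaries are detected as peaks in a spectral-change (flux) curve computed from the Bach-scale filter-bank energies. The frame size, hop size, and threshold are hypothetical parameters, and the function names are illustrative.

import numpy as np

def segment_boundaries(signal, bank, sr=16000, frame=400, hop=160,
                       threshold=0.6):
    """Hypothetical phone-boundary detector on Bach-scale energies."""
    n_fft = 2 * (bank.shape[1] - 1)
    frames = []
    for start in range(0, len(signal) - frame, hop):
        spec = np.abs(np.fft.rfft(signal[start:start + frame] *
                                  np.hanning(frame), n_fft))
        frames.append(np.log(bank @ spec + 1e-8))   # log filter-bank energies
    E = np.array(frames)

    # Spectral change between successive frames, normalised to [0, 1]
    flux = np.linalg.norm(np.diff(E, axis=0), axis=1)
    flux = (flux - flux.min()) / (np.ptp(flux) + 1e-8)

    # Local maxima above the threshold are candidate phone boundaries
    peaks = [i for i in range(1, len(flux) - 1)
             if flux[i] > threshold and flux[i] >= flux[i - 1]
             and flux[i] >= flux[i + 1]]
    return [p * hop / sr for p in peaks]   # boundary times in seconds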
‘Musical transcription’ refers to the process of converting a musical rendering or performance into a set of symbols or notations. A query in a ‘query-by-humming’ system can be made in several ways, such as singing with words, singing with arbitrary syllables, or whistling. Two algorithms are suggested for annotating a query, designed to be fairly robust across these forms of input. The first is a frequency selection based method, which works by selecting the most likely frequency components at each time instant. The second works by finding time-connected contours of high energy in the ‘Time-Frequency’ plane of the input signal. The contour-based algorithm gives better instantaneous pitch estimates, with an error of around 10-15%, while the frequency selection method results in an error of around 12-20%.
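As a rough illustration of the frequency selection idea, the sketch below picks the strongest Bach-scale channel in each frame and records its center frequency as the instantaneous pitch estimate. The thesis algorithms are more elaborate (likelihood-based component selection and time-connected contour tracking); this only shows how frame-wise estimates map onto the Bach grid, and the frame and hop sizes are assumed values.

import numpy as np

def transcribe(signal, centers, bank, sr=16000, frame=800, hop=400):
    """Pick the strongest Bach-scale channel per frame (illustrative)."""
    n_fft = 2 * (bank.shape[1] - 1)
    notes = []
    for start in range(0, len(signal) - frame, hop):
        spec = np.abs(np.fft.rfft(signal[start:start + frame] *
                                  np.hanning(frame), n_fft))
        energies = bank @ spec
        k = int(np.argmax(energies))            # most likely channel
        notes.append((start / sr, centers[k]))  # (time in s, pitch in Hz)
    return notes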
A song rendered by two different people will differ in several respects: absolute pitch, rate of rendering, timbre due to voice quality, and inaccuracies. The thesis discusses a method to quantify the distance between two renderings of a musical piece. The distance function has been evaluated by searching for a particular song in a database of 315 items, made up of songs sung by both male and female singers and whistled queries. Around 90% of the time, the correct song is found among the top five choices returned.
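The abstract does not name the distance function; one common choice, sketched below under that assumption, is dynamic time warping over pitch contours expressed in semitones with the median removed, so that differences in absolute pitch and rate of rendering are tolerated. The normalisation and cost are illustrative, not the thesis measure.

import numpy as np

def rendering_distance(pitch_a, pitch_b):
    """Hypothetical DTW distance between two pitch contours (Hz)."""
    def to_relative_semitones(hz):
        st = 12 * np.log2(np.asarray(hz, dtype=float))
        return st - np.median(st)               # remove absolute key

    a, b = to_relative_semitones(pitch_a), to_relative_semitones(pitch_b)

    # Standard DTW accumulated-cost recursion
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[len(a), len(b)] / (len(a) + len(b))  # length-normalised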
Thus, the Bach scale has been proposed as a suitable scale for representing the perception of music. It has been explored in two applications, namely automated segmentation of speech and transcription of singing voices. Using the transcription obtained, a measure of the distance between renderings of musical pieces has also been suggested.