Global ETD Search

Return to search

Nouveaux points de vue sur la classification hiérarchique et normalisation linguistique pour la segmentation et le regroupement en locuteurs / New insights into hierarchical clustering and linguistic normalization for speaker diarization

Face au volume croissant de données audio et multimédia, les technologies liées à l'indexation de données et à l'analyse de contenu ont suscité beaucoup d'intérêt dans la communauté scientifique. Parmi celles-ci, la segmentation et le regroupement en locuteurs, répondant ainsi à la question 'Qui parle quand ?' a émergé comme une technique de pointe dans la communauté de traitement de la parole. D'importants progrès ont été réalisés dans le domaine ces dernières années principalement menés par les évaluations internationales du NIST. Tout au long de ces évaluations, deux approches se sont démarquées : l'une est bottom-up et l'autre top-down. L'ensemble des systèmes les plus performants ces dernières années furent essentiellement des systèmes types bottom-up, cependant nous expliquons dans cette thèse que l'approche top-down comporte elle aussi certains avantages. En effet, dans un premier temps, nous montrons qu'après avoir introduit une nouvelle composante de purification des clusters dans l'approche top-down, nous obtenons des performances comparables à celles de l'approche bottom-up. De plus, en étudiant en détails les deux types d'approches nous montrons que celles-ci se comportent différemment face à la discrimination des locuteurs et la robustesse face à la composante lexicale. Ces différences sont alors exploitées au travers d'un nouveau système combinant les deux approches. Enfin, nous présentons une nouvelle technologie capable de limiter l'influence de la composante lexicale, source potentielle d'artefacts dans le regroupement et la segmentation en locuteurs. Notre nouvelle approche se nomme Phone Adaptive Training par analogie au Speaker Adaptive Training / The ever-expanding volume of available audio and multimedia data has elevated technologies related to content indexing and structuring to the forefront of research. Speaker diarization, commonly referred to as the `who spoke when?' task, is one such example and has emerged as a prominent, core enabling technology in the wider speech processing research community. Speaker diarization involves the detection of speaker turns within an audio document (segmentation) and the grouping together of all same-speaker segments (clustering). Much progress has been made in the field over recent years partly spearheaded by the NIST Rich Transcription evaluations focus on meeting domain, in the proceedings of which are found two general approaches: top-down and bottom-up. Even though the best performing systems over recent years have all been bottom-up approaches we show in this thesis that the top-down approach is not without significant merit. Indeed we first introduce a new purification component leading to competitive performance to the bottom-up approach. Moreover, while investigating the two diarization approaches more thoroughly we show that they behave differently in discriminating between individual speakers and in normalizing unwanted acoustic variation, i.e.\ that which does not pertain to different speakers. This difference of behaviours leads to a new top-down/bottom-up system combination outperforming the respective baseline system. Finally, we introduce a new technology able to limit the influence of linguistic effects, responsible for biasing the convergence of the diarization system. Our novel approach is referred to as Phone Adaptive Training (PAT).

http://www.theses.fr/2012ENST0019/document

Partitionnement des données

Segmentation

Data partitioning

Segmentation

Identifer	oai:union.ndltd.org:theses.fr/2012ENST0019
Date	02 May 2012
Creators	Bozonnet, Simon
Contributors	Paris, ENST, Evans, Nicholas W. D., Merialdo, Bernard
Source Sets	Dépôt national des thèses électroniques françaises
Language	English, French
Detected Language	English
Type	Electronic Thesis or Dissertation, Text, StillImage

Page generated in 0.0023 seconds

Nouveaux points de vue sur la classification hiérarchique et normalisation linguistique pour la segmentation et le regroupement en locuteurs / New insights into hierarchical clustering and linguistic normalization for speaker diarization

Description

Links & Downloads

Tags

Additional Fields