  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
71

Establishing Effective Techniques for Increasing Deep Neural Networks Inference Speed

Sunesson, Albin January 2017 (has links)
A recent trend in deep learning research is to build ever deeper networks (i.e. to increase the number of layers) to solve real-world classification/optimization problems. This introduces challenges for applications with latency constraints. The problem arises from the amount of computation that must be performed for each evaluation, and it is addressed by reducing inference time. In this study we analyze two different methods for speeding up the evaluation of deep neural networks. The first method reduces the number of weights in a convolutional layer by decomposing its convolutional kernel. The second method lets samples exit a network through early exit branches when their classification is already certain. Both methods were evaluated on several network architectures with consistent results. Convolutional kernel decomposition shows a 20-70% speed-up with no more than a 1% loss in classification accuracy in the evaluated setups. Early exit branches show up to a 300% speed-up with no loss in classification accuracy when evaluated on CPUs.
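The second method, early-exit branches, can be sketched in a few lines of NumPy. The two-stage network, random weights, and the 0.9 confidence threshold below are illustrative assumptions, not the architectures evaluated in the thesis:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

def early_exit_forward(x, w1, w_exit, w2, threshold=0.9):
    """Classify x, leaving through the early branch when the
    intermediate classifier is confident enough to skip the
    remaining (expensive) layers."""
    h = np.maximum(x @ w1, 0.0)        # first stage + ReLU
    p_early = softmax(h @ w_exit)      # cheap early-exit head
    if p_early.max() >= threshold:
        return p_early, "early"
    p_final = softmax(h @ w2)          # full-depth head
    return p_final, "final"

rng = np.random.default_rng(0)
x = rng.normal(size=4)
w1 = rng.normal(size=(4, 8))
w_exit = rng.normal(size=(8, 3))
w2 = rng.normal(size=(8, 3))
probs, route = early_exit_forward(x, w1, w_exit, w2)
```

Confident samples pay only for the first stage, which is where the CPU speed-up comes from.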
72

SEISMIC IMAGE SUPER RESOLUTION

PEDRO FERREIRA ALVES PINTO 06 December 2022 (has links)
Super resolution (SR) is a topic of notable importance in many knowledge domains, such as the medical, monitoring, and security areas. The use of deep neural networks to solve this task is extremely recent in the seismic field, with few references, the first of which were published less than two years ago. However, the literature presents a wide range of methods that use neural networks for the super resolution of natural images. With this in mind, the objective of this work is to explore such approaches applied to synthetic seismic reservoir data. For this, chronologically significant models from the literature were employed and compared with a classic interpolation method and with existing seismic image super resolution models. These models are SRCNN, RDN, the Deep Image Prior approach, and SAN. The results show that the PSNR obtained by architectures designed for the seismic domain is 38.23 dB, while the best of the proposed architectures reaches 38.62 dB, showing the progress that such models bring to the seismic field.
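The PSNR figures quoted above can be computed with a short NumPy routine; the `data_range=1.0` default below assumes images normalized to [0, 1]:

```python
import numpy as np

def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB; higher means the
    super-resolved estimate is closer to the reference."""
    mse = np.mean((np.asarray(reference, float) - np.asarray(estimate, float)) ** 2)
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)

ref = np.zeros((8, 8))
noisy = ref + 0.1            # uniform 0.1 error -> MSE = 0.01
value = psnr(ref, noisy)     # 10 * log10(1 / 0.01) = 20 dB
```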
73

Deep Neural Networks for Improved Terminal Voltage and State-of-Charge Estimation of Lithium-Ion Batteries for Traction Applications

Goncalves Vidal, Carlos Jose January 2020 (has links)
The growing interest in more electrified vehicles has been pushing industry and academia to pursue new and more accurate ways to estimate the State-of-Charge (SOC) of xEV batteries. The battery system still represents one of the many technical barriers that need to be eliminated or reduced to enable the proliferation of more xEVs in the market, which in turn can help reduce CO2 emissions. Battery modelling and SOC estimation of Lithium-ion (Li-ion) batteries over a wide temperature range, including negative temperatures, have been a challenge for many engineers. For SOC estimation, several model configurations and approaches were developed and tested as results of this work, including non-recurrent networks such as feedforward deep neural networks (FNN) and recurrent networks based on long short-term memory cells (LSTM-RNN). The approaches considerably improve on the accuracy of the previous state of the art. They expand the application to five different Li-ion cells over a wide temperature range, achieving a Root Mean Square Error as low as 0.66% at -10°C using an FNN approach and 0.90% using an LSTM-RNN. Therefore, the deep neural networks developed in this work can increase the potential for xEV applications, especially where accuracy at negative temperatures is essential. For Li-ion modelling, a cell model using an LSTM-RNN (LSTM-VM) was developed for the first time to estimate the battery cell terminal voltage, and is compared against a gated recurrent unit approach (GRU-VM) and a third-order Equivalent Circuit Model based on Thévenin's theorem (ECM). The models were extensively compared for different Li-ion cells over a wide range of temperature conditions. The LSTM-VM was shown to be more accurate than the two other benchmarks, achieving a 43 mV Root Mean Square Error at -20°C, a third of the error of the ECM under the same conditions, although the difference between the LSTM-VM and the GRU-VM is not that large. Finally, throughout the work, several methods to improve robustness, accuracy and training time have been introduced, including Transfer Learning applied to the development of SOC estimation models, showing great potential to reduce the amount of data necessary to train LSTM-RNNs as well as to improve their accuracy. / Thesis / Doctor of Philosophy (PhD) / For electric vehicle State-of-Charge (SOC) estimation, several model configurations and approaches were developed and tested as results of this work, including non-recurrent networks such as feedforward deep neural networks (FNN) and recurrent networks based on long short-term memory cells (LSTM-RNN). The approaches considerably improve on the accuracy of the previous state of the art. They expand the application to five different Li-ion cells over a wide temperature range, achieving a Root Mean Square Error as low as 0.66% at -10°C using an FNN approach and 0.90% using an LSTM-RNN. Therefore, the deep neural networks developed in this work can increase the potential for xEV applications, especially where accuracy at negative temperatures is essential. For Li-ion modelling, a cell model using an LSTM-RNN (LSTM-VM) was developed for the first time to estimate the battery cell terminal voltage, and is compared against a gated recurrent unit approach (GRU-VM) and a third-order Equivalent Circuit Model based on Thévenin's theorem (ECM). The models were extensively compared for different Li-ion cells over a wide range of temperature conditions. The LSTM-VM was shown to be more accurate than the two other benchmarks, achieving a 43 mV Root Mean Square Error at -20°C, a third of the error of the ECM under the same conditions, although the difference between the LSTM-VM and the GRU-VM is not that large.
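A minimal sketch of an FNN-style SOC estimator follows. The three input features (voltage, current, temperature) and the untrained random weights are illustrative assumptions, not the trained networks from the thesis; the sigmoid output simply keeps the estimate in the physical [0, 1] range:

```python
import numpy as np

def fnn_soc(features, W1, b1, W2, b2):
    """One forward pass of a small feedforward SOC estimator."""
    h = np.tanh(features @ W1 + b1)      # hidden layer
    z = h @ W2 + b2                      # scalar logit
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid -> SOC in [0, 1]

rng = np.random.default_rng(1)
x = np.array([3.7, -1.2, -10.0])   # voltage (V), current (A), temperature (°C)
W1, b1 = rng.normal(size=(3, 16)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16,)), rng.normal()
soc = fnn_soc(x, W1, b1, W2, b2)
```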
74

A Deep Learning-based Dynamic Demand Response Framework

Haque, Ashraful 02 September 2021 (has links)
The electric power grid is evolving in terms of generation, transmission and distribution network architecture. On the generation side, distributed energy resources (DER) are participating at a much larger scale. Transmission and distribution networks are transforming from a centralized architecture to a decentralized one. Residential and commercial buildings are now considered active elements of the electric grid which can participate in grid operation through applications such as Demand Response (DR). DR is an application through which electric power consumption during peak demand periods can be curtailed. DR applications ensure an economic and stable operation of the electric grid by eliminating grid stress conditions. In addition, DR can be utilized as a mechanism to increase the participation of green electricity in an electric grid. DR applications, in general, are passive in nature. During peak demand periods, the common practice is to shut down the operation of pre-selected electrical equipment, e.g. heating, ventilation and air conditioning (HVAC) systems and lights, to reduce power consumption. This approach, however, is not optimal and does not take any user preference into consideration. Furthermore, it does not provide any information on demand flexibility beforehand. Under the broad concept of grid modernization, the focus is now on applications of data analytics in grid operation to ensure an economic, stable and resilient operation of the electric grid. The work presented here utilizes data analytics to transform the DR application from a static, look-up-based reactive function into a dynamic, context-aware proactive solution. The dynamic demand response framework presented in this dissertation performs three major functionalities: electrical load forecasting, electrical load disaggregation and peak load reduction during DR periods.
The building-level electrical load forecast quantifies the required peak load reduction during DR periods. The electrical load disaggregation provides equipment-level power consumption, which quantifies the available building-level demand flexibility. The peak load reduction methodology provides optimal HVAC setpoints and lighting brightness during DR periods to reduce the peak demand of a building. The control scheme takes user preference and context into consideration. A detailed methodology with relevant case studies regarding the design of the network architecture of a deep learning algorithm for electrical load forecasting and load disaggregation is presented. A case study on peak load reduction through HVAC setpoint and brightness adjustment is also presented. To ensure the scalability and interoperability of the proposed framework, a layer-based software architecture to replicate the framework within a cloud environment is demonstrated. / Doctor of Philosophy / The modern power grid, known as the smart grid, is transforming how electricity is generated, transmitted and distributed across the US. In a legacy power grid, the utilities are the suppliers and the residential or commercial buildings are the consumers of electricity. However, the smart grid considers these buildings as active grid elements which can contribute to the economic, stable and resilient operation of an electric grid. Demand Response (DR) is a grid application that reduces electrical power consumption during peak demand periods. The objective of a DR application is to reduce stress conditions on the electric grid. The current DR practice is to shut down pre-selected electrical equipment, e.g. HVAC and lights, during peak demand periods. However, this approach is static, pre-fixed and does not consider any consumer preference. The proposed framework in this dissertation transforms the DR application from a look-up-based function into a dynamic, context-aware solution.
The proposed dynamic demand response framework performs three major functionalities: electrical load forecasting, electrical load disaggregation and peak load reduction. The electrical load forecasting quantifies building-level power consumption that needs to be curtailed during the DR periods. The electrical load disaggregation quantifies demand flexibility through equipment-level power consumption disaggregation. The peak load reduction methodology provides actionable intelligence that can be utilized to reduce the peak demand during DR periods. The work leverages functionalities of a deep learning algorithm to increase forecasting accuracy. An interoperable and scalable software implementation is presented to allow integration of the framework with existing energy management systems.
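The first link in that chain, turning a building-level load forecast into a required curtailment per interval, can be sketched as follows. The kW values and the DR limit are hypothetical numbers for illustration only:

```python
import numpy as np

def required_curtailment(load_forecast_kw, dr_limit_kw):
    """Given a building-level load forecast for the DR window,
    return how much demand must be shed in each interval to stay
    under the DR limit (zero where the forecast is already below it)."""
    forecast = np.asarray(load_forecast_kw, dtype=float)
    return np.maximum(forecast - dr_limit_kw, 0.0)

forecast = [80.0, 95.0, 120.0, 110.0, 90.0]   # hypothetical hourly kW forecast
shed = required_curtailment(forecast, dr_limit_kw=100.0)
```

The disaggregation step would then check whether the available equipment-level flexibility covers each interval's `shed` value.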
75

Deep Learning for Ordinary Differential Equations and Predictive Uncertainty

Yijia Liu (17984911) 19 April 2024 (has links)
Deep neural networks (DNNs) have demonstrated outstanding performance in numerous tasks such as image recognition and natural language processing. However, in dynamic systems modeling, the tasks of estimating and uncovering the potentially nonlinear structure of systems represented by ordinary differential equations (ODEs) pose a significant challenge. In this dissertation, we employ DNNs to enable precise and efficient parameter estimation of dynamic systems. In addition, we introduce a highly flexible neural ODE model to capture both nonlinear and sparse dependent relations among multiple functional processes. Nonetheless, DNNs are susceptible to overfitting and often struggle to accurately assess predictive uncertainty despite their widespread success across various AI domains. The challenge of defining meaningful priors for DNN weights and characterizing predictive uncertainty persists. In this dissertation, we present a novel neural adaptive empirical Bayes framework with a new class of prior distributions to address weight uncertainty.

In the first part, we propose a precise and efficient approach utilizing DNNs for estimation and inference of ODEs given noisy data. The DNNs are employed directly as a nonparametric proxy for the true solution of the ODEs, eliminating the need for numerical integration and resulting in significant computational time savings. We develop a gradient descent algorithm to estimate both the DNN solution and the parameters of the ODEs by optimizing a fidelity-penalized likelihood loss function. This ensures that the derivatives of the DNN estimator conform to the system of ODEs. Our method is particularly effective in scenarios where only a set of variables transformed from the system components by a given function are observed. We establish the convergence rate of the DNN estimator and demonstrate that the derivatives of the DNN solution asymptotically satisfy the ODEs determined by the inferred parameters. Simulations and real data analysis of COVID-19 daily cases are conducted to show the superior performance of our method in terms of accuracy of parameter estimates, system recovery, and computational speed.

In the second part, we present a novel sparse neural ODE model to characterize flexible relations among multiple functional processes. This model represents the latent states of the functions using a set of ODEs and models the dynamic changes of these states utilizing a DNN with a specially designed architecture and sparsity-inducing regularization. Our new model is able to capture both nonlinear and sparse dependent relations among multivariate functions. We develop an efficient optimization algorithm to estimate the unknown weights for the DNN under the sparsity constraint. Furthermore, we establish both algorithmic convergence and selection consistency, providing theoretical guarantees for the proposed method. We illustrate the efficacy of the method through simulation studies and a gene regulatory network example.

In the third part, we introduce a class of implicit generative priors to facilitate Bayesian modeling and inference. These priors are derived through a nonlinear transformation of a known low-dimensional distribution, allowing us to handle complex data distributions and capture the underlying manifold structure effectively. Our framework combines variational inference with a gradient ascent algorithm, which serves to select the hyperparameters and approximate the posterior distribution. Theoretical justification is established through both posterior and classification consistency. We demonstrate the practical applications of our framework through extensive simulation examples and real-world datasets. Our experimental results highlight the superiority of the proposed framework over existing methods, such as sparse variational Bayesian and generative models, in terms of prediction accuracy and uncertainty quantification.
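The fidelity-penalized idea of the first part can be illustrated on the toy ODE u'(t) = -θu(t): a data-fit term plus a penalty on the ODE residual of the surrogate solution. Here a numerical gradient stands in for the derivative of the DNN surrogate, so this is a simplified sketch, not the dissertation's estimator:

```python
import numpy as np

def fidelity_penalized_loss(t, y_obs, u_vals, theta, lam=1.0):
    """Data-fit term plus a penalty forcing the surrogate's
    derivative to satisfy u'(t) = -theta * u(t)."""
    u_t = np.gradient(u_vals, t)                  # numerical derivative of surrogate
    fit = np.mean((u_vals - y_obs) ** 2)          # fidelity to the noisy data
    ode = np.mean((u_t + theta * u_vals) ** 2)    # ODE residual penalty
    return fit + lam * ode

t = np.linspace(0.0, 1.0, 201)
u_true = np.exp(-t)                  # exact solution for theta = 1
loss_good = fidelity_penalized_loss(t, u_true, u_true, theta=1.0)
loss_bad = fidelity_penalized_loss(t, u_true, u_true, theta=5.0)
```

Minimizing such a loss jointly over the surrogate and θ recovers parameters without numerically integrating the ODE.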
76

On the use of $\alpha$-stable random variables in Bayesian bridge regression, neural networks and kernel processes

Jorge E Loria (18423207) 23 April 2024 (has links)
The first chapter considers the l_α regularized linear regression, also termed Bridge regression. For α ∈ (0, 1), Bridge regression enjoys several statistical properties of interest such as sparsity and near-unbiasedness of the estimates (Fan & Li, 2001). However, the main difficulty lies in the non-convex nature of the penalty for these values of α, which makes the optimization procedure challenging; usually it is only possible to find a local optimum. To address this issue, Polson et al. (2013) took a sampling-based fully Bayesian approach to this problem, using the correspondence between the Bridge penalty and a power exponential prior on the regression coefficients. However, their sampling procedure relies on Markov chain Monte Carlo (MCMC) techniques, which are inherently sequential and not scalable to large problem dimensions. Cross-validation approaches are similarly computation-intensive. To this end, our contribution is a novel non-iterative method to fit a Bridge regression model. The main contribution lies in an explicit formula for Stein's unbiased risk estimate for the out-of-sample prediction risk of Bridge regression, which can then be optimized to select the desired tuning parameters, allowing us to completely bypass MCMC as well as computation-intensive cross-validation approaches. Our procedure yields results in a fraction of the computational time of iterative schemes, without any appreciable loss in statistical performance.

Next, we build upon the classical and influential work of Neal (1996), who proved that the infinite-width scaling limit of a Bayesian neural network with one hidden layer is a Gaussian process when the network weights have bounded prior variance. Neal's result has been extended to networks with multiple hidden layers and to convolutional neural networks, also with Gaussian process scaling limits. The tractable properties of Gaussian processes then allow straightforward posterior inference and uncertainty quantification, considerably simplifying the study of the limit process compared to a network of finite width. Neural network weights with unbounded variance, however, pose unique challenges. In this case, the classical central limit theorem breaks down, and it is well known that the scaling limit is an α-stable process under suitable conditions. However, the current literature is primarily limited to forward simulations under these processes, and the problem of posterior inference under such a scaling limit remains largely unaddressed, unlike in the Gaussian process case. To this end, our contribution is an interpretable and computationally efficient procedure for posterior inference, using a conditionally Gaussian representation that allows full use of the Gaussian process machinery for tractable posterior inference and uncertainty quantification in the non-Gaussian regime.

Finally, we extend the previous chapter by considering a natural extension to deep neural networks through kernel processes. Kernel processes (Aitchison et al., 2021) generalize the notion proved by Neal (1996) to deeper networks by describing the non-linear transformation in each layer as a covariance matrix (kernel) of a Gaussian process. In this way, each successive layer transforms the covariance matrix of the previous layer by a covariance function. However, the covariance obtained by this process loses any possibility of representation learning, since the covariance matrix is deterministic. To address this, Aitchison et al. (2021) proposed deep kernel processes using Wishart and inverse Wishart matrices for each layer in deep neural networks. Nevertheless, the approach they propose requires a process that does not emerge as the limit of a classic neural network structure. We introduce α-stable kernel processes (α-KP) for learning posterior stochastic covariances in each layer. Our results show that our method performs much better than the approach proposed by Aitchison et al. (2021) on both simulated data and the benchmark Boston dataset.
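Forward simulation of the symmetric α-stable variables underlying these scaling limits is commonly done with the Chambers-Mallows-Stuck method. The sketch below is that generic sampler, not the thesis's posterior-inference procedure:

```python
import numpy as np

def symmetric_stable(alpha, size, rng):
    """Chambers-Mallows-Stuck sampler for symmetric alpha-stable
    random variables (0 < alpha <= 2); alpha < 2 gives heavy tails
    with infinite variance, alpha = 1 is the Cauchy case."""
    U = rng.uniform(-np.pi / 2, np.pi / 2, size)  # uniform angle
    W = rng.exponential(1.0, size)                # unit exponential
    if alpha == 1.0:
        return np.tan(U)                          # Cauchy special case
    return (np.sin(alpha * U) / np.cos(U) ** (1.0 / alpha)
            * (np.cos(U - alpha * U) / W) ** ((1.0 - alpha) / alpha))

rng = np.random.default_rng(2)
samples = symmetric_stable(1.5, size=10_000, rng=rng)
```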
77

Deep neural networks for source separation and noise-robust speech recognition

Aditya Arie Nugraha 05 December 2017 (has links)
This thesis addresses the problem of multichannel audio source separation by exploiting deep neural networks (DNNs). We build upon the classical expectation-maximization (EM) based source separation framework employing a multichannel Gaussian model, in which the sources are characterized by their power spectral densities and their spatial covariance matrices. We explore and optimize the use of DNNs for estimating these spectral and spatial parameters. Employing the estimated source parameters, we then derive a time-varying multichannel Wiener filter for the separation of each source. We extensively study the impact of various design choices for the spectral and spatial DNNs. We consider different cost functions, time-frequency representations, architectures, and training data sizes. These cost functions notably include a newly proposed task-oriented signal-to-distortion ratio cost function for spectral DNNs. Furthermore, we present a weighted spatial parameter estimation formula, which generalizes the corresponding exact EM formulation. On a singing-voice separation task, our systems perform remarkably close to the current state-of-the-art method and provide up to 2 dB improvement of the source-to-interference ratio. On a speech enhancement task, our systems outperform the state-of-the-art GEV-BAN beamformer, with 14%, 7%, and 1% relative word error rate improvements on 6-channel, 4-channel, and 2-channel data, respectively.
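For a single time-frequency bin, the multichannel Wiener filter built from the estimated parameters can be sketched as follows. The toy covariances below stand in for the DNN-estimated quantities and are purely illustrative:

```python
import numpy as np

def multichannel_wiener(cov_source, cov_others, x):
    """Apply the multichannel Wiener filter
    W = Sigma_s (Sigma_s + Sigma_others)^-1 to the mixture vector x
    of one time-frequency bin, returning the source estimate."""
    W = cov_source @ np.linalg.inv(cov_source + cov_others)
    return W @ x

x = np.array([1.0 + 1.0j, 0.5 - 0.2j])        # 2-channel STFT mixture bin
cov_s = np.eye(2, dtype=complex)               # toy source covariance
cov_n = 1e-12 * np.eye(2, dtype=complex)       # near-zero interference
est = multichannel_wiener(cov_s, cov_n, x)     # filter is ~identity here
```

With vanishing interference the filter reduces to the identity, which is the expected sanity check; in practice both covariances vary over time and frequency.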
78

Text/image joint approaches for multimodal understanding of documents

Delecraz, Sébastien 10 December 2018 (has links)
The human faculties of understanding are essentially multimodal. To understand the world around them, human beings fuse the information coming from all of their sensory receptors. Most of the documents used in automatic information processing contain multimodal information, for example text and images in textual documents or images and sound in video documents; however, the processing applied to them is most often monomodal. The aim of this thesis is to propose joint processes applying mainly to text and image for the processing of multimodal documents, through two studies: one on multimodal fusion for speaker role recognition in television broadcasts, the other on the complementarity of modalities for a linguistic analysis task on corpora of images with captions. In the first study, we are interested in the analysis of audiovisual documents from television news channels. We propose an approach that uses deep neural networks to build a joint multimodal representation and to fuse the modalities. In the second part of this thesis, we are interested in approaches that use several sources of multimodal information for a monomodal natural language processing task, in order to study their complementarity. We propose a complete system for correcting prepositional attachments using visual information, trained on a multimodal corpus of images with captions.
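In its simplest form, a joint multimodal representation is early fusion: concatenate the per-modality embeddings and classify the result. The sketch below is a bare-bones stand-in for the deep fusion networks used in the thesis; the embedding sizes, random weights, and three-class output are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fuse_and_classify(text_emb, image_emb, W):
    """Early fusion: concatenate the modality embeddings into a
    joint representation, then apply a linear classifier."""
    joint = np.concatenate([text_emb, image_emb])
    return softmax(W @ joint)

rng = np.random.default_rng(3)
text_emb = rng.normal(size=5)      # e.g. a sentence embedding
image_emb = rng.normal(size=4)     # e.g. a CNN image feature
W = rng.normal(size=(3, 9))        # 3 hypothetical speaker-role classes
probs = fuse_and_classify(text_emb, image_emb, W)
```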
79

Structural priors in deep neural networks

Ioannou, Yani Andrew January 2018 (has links)
Deep learning has in recent years come to dominate the previously separate fields of research in machine learning, computer vision, natural language understanding and speech recognition. Despite breakthroughs in training deep networks, there remains a lack of understanding of both the optimization and structure of deep networks. The approach advocated by many researchers in the field has been to train monolithic networks with excess complexity and strong regularization --- an approach that leaves much to be desired in terms of efficiency. Instead we propose that carefully designing networks in consideration of our prior knowledge of the task and learned representation can improve the memory and compute efficiency of state-of-the-art networks, and even improve generalization --- what we propose to denote as structural priors. We present two such novel structural priors for convolutional neural networks, and evaluate them in state-of-the-art image classification CNN architectures. The first of these methods proposes to exploit our knowledge of the low-rank nature of most filters learned for natural images by structuring a deep network to learn a collection of mostly small, low-rank filters. The second addresses the filter/channel extents of convolutional filters, by learning filters with limited channel extents. The size of these channel-wise basis filters increases with the depth of the model, giving a novel sparse connection structure that resembles a tree root. Both methods are found to improve the generalization of these architectures while also decreasing the size and increasing the efficiency of their training and test-time computation. Finally, we present work towards conditional computation in deep neural networks, moving towards a method of automatically learning structural priors in deep networks.
We propose a new discriminative learning model, conditional networks, that jointly exploit the accurate representation learning capabilities of deep neural networks with the efficient conditional computation of decision trees. Conditional networks yield smaller models, and offer test-time flexibility in the trade-off of computation vs. accuracy.
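The low-rank prior can be illustrated by factorizing a 2-D convolution kernel into a vertical and a horizontal 1-D filter via rank-1 SVD truncation, replacing k² weights with 2k. This generic factorization is a sketch of the idea, not the exact scheme of the thesis:

```python
import numpy as np

def rank1_factorize(kernel):
    """Best rank-1 factorization of a 2-D convolution kernel:
    returns 1-D vertical and horizontal filters whose outer
    product approximates the kernel (exactly, if it has rank 1)."""
    U, s, Vt = np.linalg.svd(kernel)
    scale = np.sqrt(s[0])
    return U[:, 0] * scale, Vt[0] * scale

# A separable (rank-1) kernel: the 3x3 box blur is outer(ones, ones) / 9.
box = np.full((3, 3), 1.0 / 9.0)
v, h = rank1_factorize(box)
recon = np.outer(v, h)   # reconstructs the box kernel exactly
```

Convolving with `v` then `h` costs 2k multiplies per output pixel instead of k², which is where the memory and compute savings come from.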
80

Suprasegmental representations for the modeling of fundamental frequency in statistical parametric speech synthesis

Fonseca De Sam Bento Ribeiro, Manuel January 2018 (has links)
Statistical parametric speech synthesis (SPSS) has seen improvements over recent years, especially in terms of intelligibility. Synthetic speech is often clear and understandable, but it can also be bland and monotonous. Proper generation of natural speech prosody is still a largely unsolved problem. This is relevant especially in the context of expressive audiobook speech synthesis, where speech is expected to be fluid and captivating. In general, prosody can be seen as a layer that is superimposed on the segmental (phone) sequence. Listeners can perceive the same melody or rhythm in different utterances, and the same segmental sequence can be uttered with a different prosodic layer to convey a different message. For this reason, prosody is commonly accepted to be inherently suprasegmental. It is governed by longer units within the utterance (e.g. syllables, words, phrases) and beyond the utterance (e.g. discourse). However, common techniques for the modeling of speech prosody - and speech in general - operate mainly on very short intervals, either at the state or frame level, in both hidden Markov model (HMM) and deep neural network (DNN) based speech synthesis. This thesis presents contributions supporting the claim that stronger representations of suprasegmental variation are essential for the natural generation of fundamental frequency for statistical parametric speech synthesis. We conceptualize the problem by dividing it into three sub-problems: (1) representations of acoustic signals, (2) representations of linguistic contexts, and (3) the mapping of one representation to another. The contributions of this thesis provide novel methods and insights relating to these three sub-problems. In terms of sub-problem 1, we propose a multi-level representation of f0 using the continuous wavelet transform and the discrete cosine transform, as well as a wavelet-based decomposition strategy that is linguistically and perceptually motivated. 
In terms of sub-problem 2, we investigate additional linguistic features such as text-derived word embeddings and syllable bag-of-phones and we propose a novel method for learning word vector representations based on acoustic counts. Finally, considering sub-problem 3, insights are given regarding hierarchical models such as parallel and cascaded deep neural networks.
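As a toy stand-in for the wavelet-based multi-level representation, an F0 contour can be split into a slow suprasegmental trend and a fast local residual with a simple moving average; the synthetic contour below is illustrative, and the two bands sum back to the original exactly:

```python
import numpy as np

def two_band_f0(f0, win=9):
    """Split an F0 contour into a slow phrase-level trend
    (moving average) and a fast local residual."""
    kernel = np.ones(win) / win
    coarse = np.convolve(f0, kernel, mode="same")  # slow trend
    fine = f0 - coarse                             # local residual
    return coarse, fine

t = np.linspace(0.0, 1.0, 100)
f0 = 120 + 20 * np.sin(2 * np.pi * t) + 3 * np.sin(40 * np.pi * t)  # Hz, synthetic
coarse, fine = two_band_f0(f0)
```

A wavelet decomposition generalizes this two-band split to several scales aligned with syllables, words, and phrases.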
