1 |
An Entropy Estimate of Written Language and Twitter Language: A Comparison between English and Swedish
Juhlin, Sanna, January 2017
The purpose of this study is to estimate and compare the entropy and redundancy of written English and Swedish. We also investigate and compare the entropy and redundancy of Twitter language. This is done by extracting sequences of n consecutive characters, called n-grams, and calculating their frequencies. No precise values are obtained, because the amount of text is finite whereas the entropy is defined in the limit as the text length tends towards infinity. However, we do obtain results for n = 1,...,6, and the results show that written Swedish has higher entropy than written English and that the redundancy is lower for Swedish. When comparing Twitter with the standard languages, we find that the entropy is higher and the redundancy is lower for Twitter.
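The n-gram procedure described above can be illustrated with a minimal sketch. It assumes the classical Shannon-style estimate, in which the block entropy H_n is computed from empirical n-gram frequencies, the conditional entropy is F_n = H_n - H_(n-1), and redundancy is 1 - F_n / log2(alphabet size). Whether the thesis uses exactly this estimator is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter


def block_entropy(text, n):
    """Shannon entropy (in bits) of the empirical n-gram distribution."""
    if n == 0:
        return 0.0  # convention: H_0 = 0, so that F_1 = H_1
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        raise ValueError("text is shorter than n")
    counts = Counter(grams)
    total = len(grams)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def conditional_entropy(text, n):
    """F_n = H_n - H_(n-1): entropy of a character given the n-1 before it."""
    return block_entropy(text, n) - block_entropy(text, n - 1)


def redundancy(text, n):
    """1 - F_n / log2(alphabet size), using the alphabet observed in text."""
    alphabet_size = len(set(text))
    if alphabet_size < 2:
        raise ValueError("need at least two distinct characters")
    return 1.0 - conditional_entropy(text, n) / math.log2(alphabet_size)
```

On short texts the estimate of F_n becomes unreliable as n grows, because most n-grams are seen only once or not at all. This is the finite-text limitation the abstract mentions.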
|
2 |
Spatially Adaptive Analysis and Segmentation of Polarimetric SAR Data
Wang, Wei, January 2017
In recent years, Polarimetric Synthetic Aperture Radar (PolSAR) has been one of the most important instruments for earth observation, and is increasingly used in various remote sensing applications. Statistical modelling and scattering analysis are the two main approaches to PolSAR data interpretation, and have been intensively investigated in the past two decades. Moreover, spatial analysis has been applied to PolSAR data and found to yield more accurate interpretation results. This thesis focuses on extracting typical spatial information, i.e., edges and regions, by exploring the statistical characteristics of PolSAR data. Existing spatial analysis methods are mainly based on the complex Wishart distribution, which characterizes well the inherent statistical features in homogeneous areas. However, non-Gaussian models can give a better representation of the PolSAR statistics, and therefore have the potential to improve the performance of spatial analysis, especially in heterogeneous areas. In addition, traditional fixed-shape windows cannot accurately estimate the distribution parameters in some complicated areas, leading to the loss of fine spatial details. Furthermore, many of the existing methods are not spatially adaptive, so the results are promising in some areas but unsatisfactory in others. Therefore, this thesis is dedicated to extracting spatial information by applying non-Gaussian statistical models and spatially adaptive strategies. The specific objectives of the thesis are: (1) to develop a reliable edge detection method, (2) to develop a spatially adaptive superpixel generation method, and (3) to investigate a new framework for region-based segmentation. Automatic edge detection plays a fundamental role in spatial analysis, whereas the performance of classical PolSAR edge detection methods is limited by the fixed-shape windows.
Paper 1 investigates an enhanced edge detection method using the proposed directional span-driven adaptive (DSDA) window. The DSDA window has variable sizes and flexible shapes, and can overcome the limitation of fixed-shape windows by adaptively selecting homogeneous samples. The spherically invariant random vector (SIRV) product model is adopted to characterize the PolSAR data, and a span ratio is combined with the SIRV distance to highlight the dissimilarity measure. The experimental results demonstrate that the proposed method can detect not only the obvious edges, but also the tiny and inconspicuous edges in heterogeneous areas. Edge detection and region segmentation are two important aspects of spatial analysis. For region segmentation, Paper 2 presents an adaptive PolSAR superpixel generation method based on the simple linear iterative clustering (SLIC) framework. In the k-means clustering procedure, multiple cues including polarimetric, spatial, and texture information are considered to measure the distance. Since a constant weighting factor balancing spectral similarity and spatial proximity may cause over- or under-segmentation of superpixels in different areas, the proposed method sets the factor adaptively based on a homogeneity analysis. As a result, in heterogeneous areas the spectral similarity carries more weight than the spatial constraint, generating superpixels that better preserve local details and fine structures. Paper 3 investigates another PolSAR superpixel generation method, which takes a global optimization approach using the entropy rate method. The distance between neighbouring pixels is calculated based on their corresponding DSDA regions. In addition, the SIRV distance and the Wishart distance are combined. Therefore, the proposed method makes good use of the entropy rate framework, and also incorporates the merits of the SIRV distance and the Wishart distance.
The superpixels are generated in a homogeneity-adaptive manner, resulting in a smooth representation of the land covers in homogeneous areas and well-preserved details in heterogeneous areas.
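The adaptive weighting idea described for Paper 2 can be sketched as follows. This is an illustration only, not the thesis's formulation: the homogeneity measure (1 / (1 + CV^2) of the local span), the linear mapping to a weight, the default bounds, and all function names are assumptions. Only the general SLIC-style form of the combined distance is standard.

```python
import math


def homogeneity(span_values):
    """Map local span samples to (0, 1]; 1 means perfectly homogeneous.

    Assumption: homogeneity = 1 / (1 + CV^2), where CV is the coefficient
    of variation of the span inside a local window.
    """
    n = len(span_values)
    mean = sum(span_values) / n
    if mean == 0:
        return 1.0
    variance = sum((s - mean) ** 2 for s in span_values) / n
    return 1.0 / (1.0 + variance / mean ** 2)


def adaptive_weight(h, m_min=1.0, m_max=10.0):
    """Assumed linear mapping: low weight on spatial proximity where the
    area is heterogeneous (h near 0), high weight where it is homogeneous."""
    return m_min + (m_max - m_min) * h


def slic_distance(feature_dist, pixel, centre, grid_step, m):
    """SLIC-style combined distance between a pixel and a cluster centre.

    feature_dist stands in for the polarimetric/texture dissimilarity;
    m is the weighting factor on the normalized spatial distance.
    """
    spatial = math.hypot(pixel[0] - centre[0], pixel[1] - centre[1])
    return math.sqrt(feature_dist ** 2 + (m * spatial / grid_step) ** 2)
```

With a small weight in heterogeneous areas, cluster membership is driven mainly by feature similarity, so superpixel boundaries can follow fine structures. A large weight in homogeneous areas favours compact, regular superpixels.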
|
3 |
Probabilistic Sequence Models with Speech and Language Applications
Henter, Gustav Eje, January 2013
Series data, sequences of measured values, are ubiquitous. Whenever observations are made along a path in space or time, a data sequence results. To comprehend nature and shape it to our will, or to make informed decisions based on what we know, we need methods to make sense of such data. Of particular interest are probabilistic descriptions, which enable us to represent uncertainty and random variation inherent to the world around us. This thesis presents and expands upon some tools for creating probabilistic models of sequences, with an eye towards applications involving speech and language. Modelling speech and language is not only of use for creating listening, reading, talking, and writing machines (for instance, allowing human-friendly interfaces to future computational intelligences and smart devices of today), but probabilistic models may also ultimately tell us something about ourselves and the world we occupy. The central theme of the thesis is the creation of new or improved models more appropriate for our intended applications, by weakening limiting and questionable assumptions made by standard modelling techniques. One contribution of this thesis examines causal-state splitting reconstruction (CSSR), an algorithm for learning discrete-valued sequence models whose states are minimal sufficient statistics for prediction. Unlike many traditional techniques, CSSR does not require the number of process states to be specified a priori, but builds a pattern vocabulary from data alone, making it applicable to language acquisition and the identification of stochastic grammars. A paper in the thesis shows that CSSR handles noise and errors expected in natural data poorly, but that the learner can be extended in a simple manner to yield more robust and stable results also in the presence of corruptions. Even when the complexities of language are put aside, challenges remain.
The seemingly simple task of accurately describing human speech signals, so that natural synthetic speech can be generated, has proved difficult, as humans are highly attuned to what speech should sound like. Two papers in the thesis therefore study nonparametric techniques suitable for improved acoustic modelling of speech for synthesis applications. Each of the two papers targets a known-incorrect assumption of established methods, based on the hypothesis that nonparametric techniques can better represent and recreate essential characteristics of natural speech. In the first paper of the pair, Gaussian process dynamical models (GPDMs), nonlinear, continuous state-space dynamical models based on Gaussian processes, are shown to better replicate voiced speech, without traditional dynamical features or assumptions that cepstral parameters follow linear autoregressive processes. Additional dimensions of the state-space are able to represent other salient signal aspects such as prosodic variation. The second paper, meanwhile, introduces KDE-HMMs, asymptotically-consistent Markov models for continuous-valued data based on kernel density estimation, that additionally have been extended with a fixed-cardinality discrete hidden state. This construction is shown to provide improved probabilistic descriptions of nonlinear time series, compared to reference models from different paradigms. The hidden state can be used to control process output, making KDE-HMMs compelling as a probabilistic alternative to hybrid speech-synthesis approaches. A final paper of the thesis discusses how models can be improved even when one is restricted to a fundamentally imperfect model class. Minimum entropy rate simplification (MERS), an information-theoretic scheme for postprocessing models for generative applications involving both speech and text, is introduced. MERS reduces the entropy rate of a model while remaining as close as possible to the starting model. 
This is shown to produce simplified models that concentrate on the most common and characteristic behaviours, and provides a continuum of simplifications between the original model and zero-entropy, completely predictable output. As the tails of fitted distributions may be inflated by noise or empirical variability that a model has failed to capture, MERS's ability to concentrate on high-probability output is also demonstrated to be useful for denoising models trained on disturbed data. / ACORNS: Acquisition of Communication and Recognition Skills / LISTA – The Listening Talker
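The quantity that MERS reduces, the entropy rate, can be made concrete for the simplest case of a first-order Markov chain. The sketch below computes only the entropy rate H = -sum_i pi_i sum_j P_ij log2 P_ij. It does not implement MERS itself. The function names are hypothetical, and the power iteration assumes a chain whose stationary distribution is reached from a uniform start.

```python
import math


def stationary_distribution(P, iterations=1000):
    """Approximate the stationary distribution of a row-stochastic matrix P
    by repeatedly applying pi <- pi P, starting from the uniform vector."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def entropy_rate(P):
    """Entropy rate in bits per symbol of a stationary Markov chain."""
    pi = stationary_distribution(P)
    return -sum(
        pi[i] * p * math.log2(p)
        for i, row in enumerate(P)
        for p in row
        if p > 0  # terms with zero probability contribute nothing
    )
```

A chain with uniform transitions between two states has an entropy rate of one bit per symbol, while a chain whose transitions are all deterministic has an entropy rate of zero. These correspond to the two ends of the continuum of simplifications mentioned in the abstract.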
|