11 |
Intégration de ressources lexicales riches dans un analyseur syntaxique probabiliste / Integration of lexical resources in a probabilistic parser
Sigogne, Anthony 03 December 2012
This thesis focuses on the integration of French lexical and syntactic resources into two fundamental tasks of Natural Language Processing (NLP): probabilistic part-of-speech tagging and probabilistic parsing. For French, many lexical and syntactic resources are available, created either by automatic processes or by linguists. A number of experiments have shown the value of using such resources in tagging and parsing, since they can significantly improve system performance. In this thesis, we use these resources to address two problems, described briefly below: data sparseness and the automatic segmentation of texts.
Thanks to increasingly sophisticated parsing algorithms, parsing accuracy keeps rising for many languages, including French. However, several problems are inherent in the mathematical formalisms used to model the task statistically (grammars, discriminative models, ...). Data sparseness is one of them, and is mainly caused by the small size of the annotated corpora available for the language. Data sparseness is the difficulty of estimating the probability of syntactic phenomena that appear in the texts to be analyzed but are rare in, or absent from, the corpus used to train the parsers. Moreover, it has been shown that sparseness is partly a lexical problem: the richer the morphology of a language, the sparser the lexicons built from a treebank of that language. Our first problem is therefore to mitigate the negative impact of lexical data sparseness on parsing performance. To this end, we investigated a method called word clustering, which consists in grouping the words of the corpus and of the texts into clusters. These clusters reduce the number of unknown words, and therefore the number of rare or unknown lexicon-related syntactic phenomena in the texts to be analyzed. Our goal is to propose word clustering methods based on syntactic information from French lexicons, and to observe their impact on parser accuracy.
Furthermore, most evaluations of probabilistic tagging and parsing have been performed with a perfect segmentation of the text, identical to that of the evaluated corpus. In real applications, however, the segmentation of a text is rarely available, and automatic segmentation tools fall short of producing a high-quality segmentation because of the presence of many multi-word units (compound words, named entities, ...). In this thesis, we focus on continuous multi-word units that form lexical units to which a part-of-speech tag can be assigned, and which we call compound words. For example, cordon bleu is a compound noun and tout à fait is a compound adverb. The task of identifying compound words can be treated as a text segmentation task. Our second problem is therefore the automatic segmentation of French texts and its impact on the performance of automatic processes. To address it, we studied an approach that couples, within a single probabilistic model, the recognition of compound words with another task, which in our case is either parsing or tagging. Compound word recognition is thus performed within the probabilistic process rather than in a preliminary phase. Our goal is to propose innovative strategies for integrating compound word resources into two probabilistic processes that combine tagging or parsing with text segmentation.
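The effect of word clustering on unknown words, as described in the abstract above, can be illustrated with a minimal sketch. The toy vocabulary, the cluster labels, and the helper names (`word_to_cluster`, `to_cluster`, `unknown_rate`) are hypothetical and are not taken from the thesis; the sketch only shows why replacing inflected forms by a shared class can lower the share of tokens never seen in training.

```python
# Hypothetical toy data: a tiny training vocabulary and a test sentence.
train_tokens = ["mange", "mangeons", "pomme", "la", "le", "chat"]
test_tokens = ["le", "chat", "mangera", "la", "pommes"]

# Hypothetical cluster lexicon: inflected forms share one class label.
word_to_cluster = {
    "mange": "V_manger", "mangeons": "V_manger", "mangera": "V_manger",
    "pomme": "N_pomme", "pommes": "N_pomme",
}


def to_cluster(token):
    """Replace a token by its cluster label, or keep it if it has none."""
    return word_to_cluster.get(token, token)


def unknown_rate(train, test):
    """Fraction of test tokens that never occur in the training tokens."""
    known = set(train)
    return sum(1 for t in test if t not in known) / len(test)


raw_rate = unknown_rate(train_tokens, test_tokens)
clustered_rate = unknown_rate(
    [to_cluster(t) for t in train_tokens],
    [to_cluster(t) for t in test_tokens],
)
print(raw_rate, clustered_rate)
```

On this toy input, two of the five test tokens are unknown before clustering and none after, because the unseen forms fall into classes that were observed in training.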
|
12 |
Bayesian and Information-Theoretic Learning of High Dimensional Data
Chen, Minhua January 2012
The concept of sparseness is harnessed to learn a low-dimensional representation of high-dimensional data. This sparseness assumption is exploited in multiple ways. In the Bayesian Elastic Net, a small number of correlated features are identified for the response variable. In the sparse Factor Analysis for biomarker trajectories, the high-dimensional gene expression data is reduced to a small number of latent factors, each with a prototypical dynamic trajectory. In the Bayesian Graphical LASSO, the inverse covariance matrix of the data distribution is assumed to be sparse, inducing a sparsely connected Gaussian graph. In the nonparametric Mixture of Factor Analyzers, the covariance matrices in the Gaussian Mixture Model are forced to be low-rank, which is closely related to the concept of block sparsity.
Finally, in the information-theoretic projection design, a linear projection matrix is explicitly sought for information-preserving dimensionality reduction. All the methods mentioned above prove to be effective in learning both simulated and real high-dimensional datasets. / Dissertation
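The link between a sparse inverse covariance matrix and a sparsely connected Gaussian graph, mentioned in the abstract above, can be sketched as follows. The 4-variable matrix and the function name `graph_edges` are hypothetical illustrations, not the dissertation's model: the sketch only reads the graph off a given precision matrix, using the standard fact that a zero off-diagonal entry means the two variables are conditionally independent given the others.

```python
# Hypothetical precision (inverse covariance) matrix with a chain structure:
# each variable is linked only to its immediate neighbours.
precision = [
    [2.0, -1.0, 0.0, 0.0],
    [-1.0, 2.0, -1.0, 0.0],
    [0.0, -1.0, 2.0, -1.0],
    [0.0, 0.0, -1.0, 2.0],
]


def graph_edges(theta, tol=1e-8):
    """Return the undirected edges (i, j), i < j, with a nonzero entry theta[i][j]."""
    n = len(theta)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if abs(theta[i][j]) > tol
    ]


edges = graph_edges(precision)
print(edges)
```

Only three of the six possible pairs are connected here, so the graph is a chain rather than a fully connected graph.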
|
13 |
Curvelet imaging and processing : an overview
Herrmann, Felix J. January 2004
In this paper, an overview is given of the application of directional basis functions, known as Curvelets/Contourlets, to various aspects of seismic processing and imaging. Key concepts in the approach are (i) the use of basis functions that localize in both domains (e.g. space and angle); (ii) non-linear estimation, which corresponds to localized muting on the coefficients, possibly supplemented by constrained optimization; and (iii) invariance of the basis functions under the imaging operators. We discuss applications that include multiple and ground-roll removal, sparseness-constrained least-squares migration, and the computation of 4-D difference cubes.
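One common reading of "muting on the coefficients" is hard thresholding, where small transform coefficients are set to zero and large ones are kept. The sketch below assumes that reading; the function name `hard_threshold` and the sample values are hypothetical, and no curvelet transform is computed here, only the muting step applied to a given list of coefficients.

```python
def hard_threshold(coefficients, threshold):
    """Zero every coefficient whose magnitude does not exceed the threshold."""
    return [c if abs(c) > threshold else 0.0 for c in coefficients]


# Hypothetical coefficients: two strong events and two weak, noise-like values.
muted = hard_threshold([0.1, -2.0, 0.5, 3.0], 1.0)
print(muted)
```

Because each coefficient of a localized basis function covers a limited region, zeroing it removes energy only in that region, which is what makes the muting localized.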
|
14 |
Probabilistic Diagnostic Model for Handling Classifier Degradation in Machine Learning
Gustavo A. Valencia-Zapata (8082655) 04 December 2019
Several studies point out different causes of performance degradation in supervised machine learning. Problems such as class imbalance, class overlap, small disjuncts, noisy labels, and sparseness limit the accuracy of classification algorithms. Even though a number of approaches, in the form of either a methodology or an algorithm, try to minimize performance degradation, they have been isolated efforts with limited scope. This research consists of three main parts. First, a novel probabilistic diagnostic model based on identifying signs and symptoms of each problem is presented. Second, the behavior and performance of several supervised algorithms are studied when training sets have such problems, so that the success of treatments can be predicted across classifiers. Finally, a probabilistic sampling technique based on training set diagnosis is proposed for avoiding classifier degradation.
|
15 |
An optimised QPSK-based receiver structure for possibly sparse data transmission over narrowband and wideband communication systems
Schoeman, Johan P. 24 August 2010
In this dissertation, an in-depth study was conducted into the design, implementation and evaluation of a QPSK-based receiver structure for application in a UMTS WCDMA environment. The novelty of this work lies in the specific receiver architecture, which aims to optimise BER performance when possibly sparse data streams are transmitted. This scenario is a real possibility according to Verdú et al. [1] and Hagenauer et al. [2–6]. A novel receiver structure was conceptualised, developed and evaluated in both narrowband and wideband scenarios, where it was found to outperform conventional receivers when a sparse data stream is transmitted.
To reach the main conclusions of this study, it was necessary to develop a realistic simulation platform. The developed platform is capable of simulating a communication system that meets the physical layer requirements of the UMTS WCDMA standard. The platform can also perform narrowband simulations. A flexible channel emulator was developed that may be configured to simulate AWGN channel conditions, frequency non-selective fading (either Rayleigh or Rician, with a configurable LOS component and Doppler spread), or a full multipath scenario in which each path has a configurable LOS component, Doppler spread, path gain and path delay. It is therefore even possible to simulate a complex, yet realistic, COST207-TU channel model. The platform is also capable of simulating MUI. Each interfering user has a unique and independent multipath fading channel, while sharing the same bandwidth. Finally, the entire platform executes all simulations in baseband for improved simulation times.
The research outputs of this work are summarised below:
- A parameter, the sparseness measure, was defined in order to quantify the level by which a data stream differs from an equiprobable data stream.
- A novel source model was proposed and developed to simulate data streams with a specified amount of sparseness.
- An introductory investigation was undertaken to determine the effect of simple FEC techniques on the sparseness of an encoded data stream.
- Novel receiver structures for both narrowband and wideband systems were proposed, developed and evaluated for systems where possibly sparse data streams may be transmitted.
- Analytic expressions were derived to take the effect of sparseness into account in communication systems, including expressions for the joint PDF of a BPSK branch, the optimal decision region of a detector in AWGN conditions, and the BER performance of a communication system employing the proposed optimal receiver in both AWGN and flat fading channel conditions.
- Numerous BER performance curves were obtained comparing the proposed receiver structure with conventional receivers in a variety of channel conditions, including AWGN, frequency non-selective fading and a multipath COST207-TU channel environment, as well as in the presence of MUI.
Dissertation (MEng)--University of Pretoria, 2010. / Electrical, Electronic and Computer Engineering / unrestricted
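Why a sparse (non-equiprobable) data stream changes the optimal decision region in AWGN can be illustrated with the textbook maximum a posteriori (MAP) rule for antipodal signalling. This is a minimal sketch under stated assumptions, not the expressions derived in the dissertation: bit 1 is assumed to be sent as +A and bit 0 as -A, the noise is zero-mean Gaussian with standard deviation sigma, and the names `map_threshold`, `p_one`, `amplitude` and `noise_std` are hypothetical. Under those assumptions the detector decides bit 1 when the received sample exceeds (sigma^2 / (2A)) * ln(P(0) / P(1)).

```python
import math


def map_threshold(p_one, amplitude=1.0, noise_std=1.0):
    """MAP decision threshold for antipodal symbols +A (bit 1) and -A (bit 0).

    Assumes 0 < p_one < 1. The detector decides bit 1 when the received
    sample is greater than the returned value.
    """
    p_zero = 1.0 - p_one
    return (noise_std ** 2) / (2.0 * amplitude) * math.log(p_zero / p_one)


equiprobable = map_threshold(0.5)   # symmetric case: threshold at zero
sparse = map_threshold(0.1)         # bit 1 is rare: threshold moves towards +A
print(equiprobable, sparse)
```

With equal priors the threshold sits at zero, as in a conventional receiver; when ones are rare, the threshold shifts towards +A, so that more evidence is required before deciding that the rare symbol was sent.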
|