31

Distributed and Multiphase Inference in Theory and Practice: Principles, Modeling, and Computation for High-Throughput Science

Blocker, Alexander Weaver 18 September 2013 (has links)
The rise of high-throughput scientific experimentation and data collection has introduced new classes of statistical and computational challenges. The technologies driving this data explosion are subject to complex new forms of measurement error, requiring sophisticated statistical approaches. Simultaneously, statistical computing must adapt to larger volumes of data and new computational environments, particularly parallel and distributed settings. This dissertation presents several computational and theoretical contributions to these challenges. In chapter 1, we consider the problem of estimating the genome-wide distribution of nucleosome positions from paired-end sequencing data. We develop a modeling approach based on nonparametric templates that controls for variability due to enzymatic digestion, and use it to construct a calibrated Bayesian method to detect local concentrations of nucleosome positions. Inference is carried out via a distributed Hamiltonian Monte Carlo (HMC) algorithm whose complexity scales linearly with the length of the genome being analyzed. We provide MPI-based implementations of the proposed methods, both stand-alone and on Amazon EC2, which can deliver inferences on an entire S. cerevisiae genome in less than one hour. In chapter 2, we present a method for absolute quantitation from LC-MS/MS proteomics experiments, built on a Bayesian model for the non-ignorable missing data mechanism induced by this technology, which combines censoring and truncation in an unusual way. We provide a scalable MCMC sampler for inference in this setting, enabling full-proteome analyses in cluster computing environments. A set of simulation studies and actual experiments demonstrates the approach's validity and utility. We close in chapter 3 by proposing a theoretical framework for the analysis of preprocessing under the banner of multiphase inference. Preprocessing forms an oft-neglected foundation for a wide range of statistical and scientific analyses. We provide initial theoretical foundations for this area, including distributed preprocessing, building upon previous work in multiple imputation, and demonstrate that multiphase inferences can, in some cases, even surpass standard single-phase estimators in efficiency and robustness. Our work suggests several paths for further research into the statistical principles underlying preprocessing. / Statistics
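For reference, here is a minimal single-chain HMC step in Python: the leapfrog integrator and Metropolis correction below are the textbook ones, while the thesis's actual sampler distributes this computation across genome segments via MPI, which is not reproduced here. The callables `log_prob` and `grad_log_prob` and the step parameters are assumptions supplied by the user.

```python
import numpy as np

def hmc_step(theta, log_prob, grad_log_prob, step_size=0.1, n_leapfrog=20, rng=None):
    """One HMC transition: leapfrog integration plus Metropolis correction."""
    rng = rng or np.random.default_rng()
    p = rng.standard_normal(theta.shape)                 # fresh Gaussian momentum
    theta_new, p_new = theta.copy(), p.copy()
    p_new += 0.5 * step_size * grad_log_prob(theta_new)  # initial half step
    for _ in range(n_leapfrog - 1):
        theta_new += step_size * p_new
        p_new += step_size * grad_log_prob(theta_new)
    theta_new += step_size * p_new
    p_new += 0.5 * step_size * grad_log_prob(theta_new)  # final half step
    # Accept/reject on the joint (position, momentum) log-density.
    log_alpha = (log_prob(theta_new) - 0.5 * p_new @ p_new) \
              - (log_prob(theta)     - 0.5 * p @ p)
    return theta_new if np.log(rng.uniform()) < log_alpha else theta
```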
32

Morphosyntactic Corpora and Tools for Persian

Seraji, Mojgan January 2015 (has links)
This thesis presents open source resources in the form of annotated corpora and modules for automatic morphosyntactic processing and analysis of Persian texts. More specifically, the resources consist of an improved part-of-speech tagged corpus and a dependency treebank, as well as tools for text normalization, sentence segmentation, tokenization, part-of-speech tagging, and dependency parsing for Persian. In developing these resources and tools, two key requirements are observed: compatibility and reuse. The compatibility requirement has two parts. First, the tools in the pipeline should be compatible with each other, in the sense that the output of one tool matches the input requirements of the next. Second, the tools should be compatible with the annotated corpora and deliver the same analysis found in them. The reuse requirement means that all the components in the pipeline are developed by reusing resources, standard methods, and open source state-of-the-art tools, which is necessary to make the project feasible. Given these requirements, the thesis investigates two main research questions. The first is how we can develop morphologically and syntactically annotated corpora and tools while satisfying the requirements of compatibility and reuse. The approach taken is to accept the tokenization variations in the corpora in order to achieve robustness. The tokenization variations in Persian texts are related to the orthographic variation in writing fixed expressions, as well as various types of affixes and clitics. Since these variations are inherent properties of Persian texts, it is important that the tools in the pipeline can handle them; therefore, they should not be trained on idealized data. The second question concerns how accurately we can perform morphological and syntactic analysis for Persian by adapting and applying existing tools to the annotated corpora. The experimental evaluation of the tools shows that the sentence segmenter and tokenizer achieve an F-score close to 100%, the tagger has an accuracy of nearly 97.5%, and the parser achieves a best labeled accuracy of over 82% (with unlabeled accuracy close to 87%).
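To make the compatibility requirement concrete, here is a schematic Python sketch of the pipeline contract the abstract describes: each stage's output is exactly the next stage's input. The function names and the specific normalization rules are illustrative assumptions, not the thesis's actual tools.

```python
def normalize(text: str) -> str:
    # Unify Arabic code points with their Persian counterparts (Yeh, Kaf),
    # one common Persian normalization step; the real tool does much more.
    return text.replace("\u064A", "\u06CC").replace("\u0643", "\u06A9")

def segment(text: str) -> list[str]:
    # Naive split on the full stop; the thesis's segmenter is trained,
    # not rule-based like this stand-in.
    return [s.strip() + "." for s in text.split(".") if s.strip()]

def tokenize(sentence: str) -> list[str]:
    # Whitespace tokenization; the thesis deliberately preserves the
    # clitic/affix tokenization variation found in real Persian text.
    return sentence.split()

def process(text: str) -> list[list[str]]:
    # The compatibility contract: each stage consumes the previous output.
    return [tokenize(s) for s in segment(normalize(text))]
```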
33

Preprocessing and Reduction for Semidefinite Programming via Facial Reduction: Theory and Practice

Cheung, Yuen-Lam 05 November 2013 (has links)
Semidefinite programming is a powerful modeling tool for a wide range of optimization and feasibility problems. Its prevalent use in practice relies on the fact that a (nearly) optimal solution of a semidefinite program can be obtained efficiently in both theory and practice, provided that the semidefinite program and its dual satisfy the Slater condition. This thesis focuses on the situation where the Slater condition (i.e., the existence of positive definite feasible solutions) does not hold for a given semidefinite program; this failure often occurs in structured semidefinite programs derived from various applications. We study the use of the facial reduction technique, originally proposed as a theoretical procedure by Borwein and Wolkowicz, as a preprocessing step for semidefinite programs. Facial reduction can be used either in an algorithmic or a theoretical sense, depending on whether the structure of the semidefinite program is known a priori. The main contribution of this thesis is threefold. First, we study the numerical issues in the implementation of facial reduction as an algorithm on semidefinite programs, and argue that each step of the facial reduction algorithm is backward stable. Second, we illustrate the theoretical importance of the facial reduction procedure in the sensitivity analysis of semidefinite programs. Finally, we illustrate the use of the facial reduction technique on several classes of structured semidefinite programs, in particular the side-chain positioning problem in protein folding.
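For orientation, one facial reduction step can be sketched in standard SDP notation (the notation here is an assumption, since the abstract fixes none): when the Slater condition fails, a certificate direction exposes a proper face containing the whole feasible region, and the problem is rewritten over that smaller face.

```latex
% Primal SDP: min <C,X> s.t. A(X) = b, X psd. If no feasible X is positive
% definite (Slater fails), there exists a nonzero y with
\[
\mathcal{A}^{*}(y) \succeq 0, \qquad \langle b, y \rangle = 0,
\qquad \mathcal{A}^{*}(y) \neq 0 .
\]
% Every feasible X then satisfies <X, A*(y)> = <A(X), y> = <b, y> = 0; since
% both matrices are psd, the range of X lies in the nullspace of A*(y), so
\[
X = V W V^{T}, \qquad W \succeq 0,
\]
% where the columns of V span that nullspace. Substituting X = V W V^T yields
% a strictly smaller SDP; repeating until no such y exists restores Slater.
```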
34

[en] AN AUTOMATIC PREPROCESSING FOR TEXT MINING IN PORTUGUESE: A COMPUTER-AIDED APPROACH / [pt] UMA ABORDAGEM DE PRÉ-PROCESSAMENTO AUTOMÁTICO PARA MINERAÇÃO DE TEXTOS EM PORTUGUÊS: SOB O ENFOQUE DA INTELIGENCIA COMPUTACIONAL

CHRISTIAN NUNES ARANHA 25 June 2007 (has links)
This doctoral thesis proposes a new preprocessing model for text mining in Portuguese, using computational intelligence techniques based on existing concepts such as neural networks, dynamical systems, and multidimensional statistics. Its goal is therefore to innovate in the preprocessing phase of text mining by proposing an automatic model for enriching textual data. The approach extends the traditional bag-of-words model, which has a more statistical emphasis, to a bag-of-lexemes model that makes greater use of the text's linguistic content in a more computational approach, yielding more efficient results. The work is complemented by the development and implementation of a text preprocessing system that automates this phase of the proposed text mining process. Although the main object of the thesis is the preprocessing stage, every step of the text mining process is covered in overview, so as to provide the basic theory needed to understand the process as a whole. Beyond presenting the theory of each stage individually, a complete run of the process is executed (data collection, indexing, preprocessing, mining, and postprocessing), using well-established models from the literature, implemented during this work, for the other stages. Finally, functionalities and applications are shown, such as document classification, information extraction, and natural language interface (NLI).
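A toy contrast between the two representations the abstract names, sketched in Python. The lemma table here is a stand-in assumption; the thesis's bag-of-lexemes model rests on a full linguistic enrichment pipeline, not a lookup table.

```python
from collections import Counter

# Assumed toy entries: three surface forms of the Portuguese verb "correr".
LEMMAS = {"corre": "correr", "correu": "correr", "correndo": "correr"}

def bag_of_words(tokens):
    return Counter(tokens)

def bag_of_lexemes(tokens):
    return Counter(LEMMAS.get(t, t) for t in tokens)

tokens = "ele correu e ela corre correndo".split()
print(bag_of_words(tokens))    # each surface form counted separately
print(bag_of_lexemes(tokens))  # the three forms collapse to one lexeme, "correr"
```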
35

Pré-processamento de dados na identificação de processos industriais. / Pre-processing data in the identification of industrial processes.

Oscar Wilfredo Rodríguez Rodríguez 01 December 2014 (has links)
This work studies the different stages of data preprocessing in system identification: filtering, normalization, and sampling. The main goal is to condition the empirical data measured by the instruments of industrial processes so that, when these data are used in system identification, mathematical models can be obtained that represent the dynamics of the real process as closely as possible. The preprocessing techniques are implemented in MATLAB R2012b and tested on the flow pilot plant installed in the Industrial Process Control Laboratory of the Department of Telecommunications and Control Engineering at the Polytechnic School of USP, as well as on simulated industrial process plants whose mathematical models are known a priori. Finally, the performance of the preprocessing stages and their influence on the fit index of the model to the real system, obtained via cross-validation, are analyzed and compared. The model parameters are obtained for infinite-step-ahead prediction.
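A hedged sketch of the three stages named above (filtering, normalization, sampling) applied to an input/output record before identification; the filter order, cutoff, and scaling choices are illustrative assumptions, not the thesis's tuned settings.

```python
import numpy as np
from scipy import signal

def preprocess(u, y, fs, cutoff_hz=5.0, order=4, decimation=1):
    """Condition measured input u and output y (sampled at fs Hz)."""
    # 1. Filtering: zero-phase low-pass to suppress measurement noise.
    b, a = signal.butter(order, cutoff_hz / (fs / 2.0), btype="low")
    u, y = signal.filtfilt(b, a, u), signal.filtfilt(b, a, y)
    # 2. Normalization: remove operating-point offsets, scale to unit std.
    u = (u - u.mean()) / u.std()
    y = (y - y.mean()) / y.std()
    # 3. Sampling: decimate if the original rate oversamples the dynamics.
    if decimation > 1:
        u, y = signal.decimate(u, decimation), signal.decimate(y, decimation)
    return u, y
```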
36

On an automatically parallel generation technique for tetrahedral meshes

Globisch, G. 30 October 1998 (has links) (PDF)
In order to prepare modern finite element analysis, a program for the efficient parallel generation of tetrahedral meshes in a wide class of three-dimensional domains having a generalized cylindric shape is presented. The applied mesh generation strategy is based on the decomposition of a 2D reference domain into simply connected subdomains; by means of their triangulations, the tetrahedral layers are built up in parallel. Adaptive grid control as well as nodal renumbering algorithms are involved. Several examples are included in the paper to demonstrate both the program's capabilities and its handling.
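The layer-building idea lends itself to a small sketch: extrude each triangle of the 2D reference triangulation one layer upward and split the resulting prism into three tetrahedra. The decomposition below is one standard scheme, offered purely as an illustration; the paper's algorithm additionally handles adaptive grid control and renumbering.

```python
def prism_to_tets(bottom, top):
    """Split the prism with bottom face (a, b, c) and top face (A, B, C)
    into three tetrahedra. Consistent diagonals on faces shared between
    neighboring prisms require a global vertex-ordering convention,
    omitted here for brevity."""
    a, b, c = bottom
    A, B, C = top
    return [(a, b, c, A), (b, c, A, B), (c, A, B, C)]

def extrude_layer(triangles, layer, nodes_per_layer):
    """Extrude every triangle of the 2D triangulation one layer upward;
    node k of layer L has global index k + L * nodes_per_layer. Each
    triangle is independent, so layers can be built in parallel."""
    tets = []
    for tri in triangles:
        bottom = tuple(v + layer * nodes_per_layer for v in tri)
        top = tuple(v + (layer + 1) * nodes_per_layer for v in tri)
        tets.extend(prism_to_tets(bottom, top))
    return tets
```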
37

Modeling and solving university timetabling / Modélisation et résolution de problèmes d’emploi du temps d’universités

Arbaoui, Taha 10 December 2014 (has links)
This thesis investigates university timetabling problems, which practitioners face every year. We propose new lower bounds, heuristic approaches, and mixed integer and constraint programming models to solve them, addressing both the exam timetabling problem and the student scheduling problem. We investigate new methods and formulations and compare them to existing approaches. For exam timetabling, we propose an improvement to an existing mixed integer programming model that makes it possible to obtain optimal solutions. Next, lower bounds, a more compact reformulation of the constraints, and a constraint programming model are proposed. For the exam timetabling problem at the Université de Technologie de Compiègne, we design a memetic approach. Finally, we present a new formulation for the student scheduling problem and investigate its performance on a set of real-world instances.
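For concreteness, the hard core of an exam timetabling model can be written as a small integer program. This toy version is an assumption for illustration, since real models in the thesis also carry rooms, capacities, and soft spread penalties; here x(e,t) = 1 iff exam e is assigned to period t, and K is the set of exam pairs sharing at least one student.

```latex
\begin{align*}
& \sum_{t \in T} x_{e,t} = 1 && \forall e \in E
    && \text{(each exam gets exactly one period)} \\
& x_{e,t} + x_{f,t} \le 1 && \forall (e,f) \in K,\; \forall t \in T
    && \text{(conflicting exams never share a period)} \\
& x_{e,t} \in \{0,1\} && \forall e \in E,\; \forall t \in T
\end{align*}
```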
38

Framework pro předzpracování dopravních dat pro zjištění semantických míst / Trajectory Data Preprocessing Framework for Discovering Semantic Locations

Ostroukh, Anna January 2018 (has links)
The aim of this thesis is to survey existing approaches to preprocessing trajectory data, with a focus on discovering semantic trajectories, and to design and develop a framework that integrates GPS sensor trajectory data with semantics. The problem with analyzing raw trajectories is that it is not as informative as analyzing trajectories that carry meaningful context. After a study of various approaches and algorithms, the thesis proceeds to the design and development of a framework that discovers semantic locations using a density-based clustering method applied to stop points in trajectories. The design and implementation of the framework were evaluated on publicly available datasets containing raw GPS records.
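A hedged sketch of the two steps the abstract names: extract stop (stay) points from a raw GPS trace, then cluster them with a density-based method (DBSCAN here, as a representative choice). The thresholds are illustrative assumptions, and the framework's semantic-labeling layer is not reproduced.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def haversine(p, q):
    """Great-circle distance in meters between (lat, lon, t) records."""
    R = 6371000.0
    la1, lo1, la2, lo2 = map(np.radians, (p[0], p[1], q[0], q[1]))
    a = (np.sin((la2 - la1) / 2) ** 2
         + np.cos(la1) * np.cos(la2) * np.sin((lo2 - lo1) / 2) ** 2)
    return 2 * R * np.arcsin(np.sqrt(a))

def stay_points(trace, dist_m=200.0, min_dur_s=20 * 60):
    """trace: list of (lat, lon, t_seconds). Returns the mean position of
    every segment that stays within dist_m of its anchor for min_dur_s."""
    stays, i, n = [], 0, len(trace)
    while i < n:
        j = i + 1
        while j < n and haversine(trace[i], trace[j]) < dist_m:
            j += 1
        if trace[j - 1][2] - trace[i][2] >= min_dur_s:
            seg = np.array([(p[0], p[1]) for p in trace[i:j]])
            stays.append(seg.mean(axis=0))
        i = j
    return np.array(stays)

# Candidate semantic locations = dense clusters of stay points.
# eps is in degrees only for brevity; projecting to meters is preferable.
# labels = DBSCAN(eps=0.001, min_samples=5).fit_predict(stay_points(trace))
```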
39

Predikce povahy spamových krátkých textů textovým klasifikátorem / Machine Learning Text Classifier for Short Texts Category Prediction

Drápela, Karel January 2018 (has links)
This thesis deals with the categorization of short spam texts from SMS messages. The first part summarizes current methods for text classification and is followed by a description of several commonly used classifiers. The following chapters describe the test data analysis, the program implementation, and the results. The program is able to predict text categories based on a predefined set of classes and also to estimate classification accuracy on training data. For the two category sets that I designed, the classifier reached accuracies of 82% and 92%. Both preprocessing and feature selection had a positive impact on the resulting accuracy. It is possible to improve the accuracy further by removing the portion of samples that are difficult to classify: at 80% recall, accuracy increases by 8-10%.
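A minimal sketch of the setup described above: TF-IDF features, a linear classifier, and a confidence threshold that abstains on hard samples (the accuracy-for-recall trade mentioned in the abstract). The toy corpus, classifier choice, and threshold are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Assumed toy corpus; the thesis works with real SMS spam data.
train_texts = ["WIN a FREE prize now", "are we still on for lunch",
               "URGENT claim your reward", "see you at the meeting"]
train_labels = ["spam", "ham", "spam", "ham"]
test_texts = ["FREE entry, claim your prize", "meeting moved to 3pm"]

clf = make_pipeline(TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)

# Abstain when the winning class probability is below a threshold; accuracy
# on the remaining samples rises at the cost of recall.
for text, p in zip(test_texts, clf.predict_proba(test_texts)):
    label = clf.classes_[p.argmax()]
    verdict = label if p.max() >= 0.6 else "abstain (hard sample)"
    print(f"{text!r} -> {verdict}")
```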
40

How can a module for sentiment analysis be designed to classify tweets about covid19 / Hur kan man designa en modul inom sentimentanalys för att klassificera tweets om covid19

Ly, Denny, Saad Abdul Malik, Tamara January 2021 (has links)
Sentiment analysis of text is currently receiving increasing attention from different organizations, for a variety of reasons. Emotion mining (sentiment analysis) is an interesting subject to explore, and the research question is thus: how can a module for sentiment analysis be designed to classify tweets about Covid-19? The dataset used for this project was taken from Kaggle and preprocessed with methods such as bag-of-words and term frequency-inverse document frequency. The models are based on the following algorithms: KNN, SVM, DT, and NB; some models also combine machine learning with a lexicon. The experiments showed that the lexicon method, with an accuracy of 87%, exceeded both the machine learning methods implemented in this thesis and the experiments done by the ML community on Kaggle. This implies that the traditional lexicon approach is still a fit choice in the sentiment analysis field.
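Since the lexicon method came out ahead, here is a minimal sketch of that approach: sum word polarities from a sentiment lexicon, with a crude negation flip, and threshold the total. The tiny lexicon and the negation rule are illustrative assumptions, not the thesis's actual resources.

```python
# Assumed toy lexicon; real lexicons hold thousands of scored terms plus
# handling for emoticons, intensifiers, and hashtags.
LEXICON = {"good": 1, "great": 2, "safe": 1, "bad": -1, "fear": -2, "sick": -2}
NEGATORS = {"not", "no", "never"}

def lexicon_sentiment(tweet: str) -> str:
    score, negate = 0, False
    for word in tweet.lower().split():
        if word in NEGATORS:          # flip the polarity of the next word
            negate = True
            continue
        s = LEXICON.get(word, 0)
        score += -s if negate else s
        negate = False
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(lexicon_sentiment("not safe to travel during covid19"))  # -> negative
```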
