301

Standardization of textual data for comprehensive job market analysis / Normalisation textuelle pour une analyse exhaustive du marché de l'emploi

Malherbe, Emmanuel 18 November 2016 (has links)
With so many job adverts and candidate profiles available online, e-recruitment constitutes a rich object of study. All of this information is, however, textual data, which from a computational point of view is unstructured. The large number and heterogeneity of recruitment websites also mean that there is a profusion of vocabularies and nomenclatures. To make these data easier to work with, Multiposting, a French company specializing in e-recruitment tools, supported this thesis, notably by providing millions of digital resumes and job adverts aggregated from public sources. One of the difficulties when dealing with this type of raw textual data is grasping the concepts behind the words, which only humans can readily understand; deducing such structured attributes from raw text is the problem tackled in this thesis under the name of standardization. The aim of standardization is to create a unified process providing values in a nomenclature. A nomenclature is by definition a finite set of meaningful concepts, so the attributes resulting from standardization form a single structured representation of the information, translating every document into a common language and allowing the whole dataset to be aggregated in an exploitable, comprehensible format. Several questions are raised, however: are the websites' local structures usable for a unified standardization? Which structure of nomenclature is best suited for standardization, and how can it be leveraged? Is it possible to automatically build such a nomenclature from scratch, or to manage the standardization process without one?
To illustrate the various obstacles of standardization, the examples we study include inferring the skills or the professional category of a job advert, or the level of education of a candidate profile. One of the challenges of e-recruitment is that the concepts evolve continuously, so standardization must keep up with job market trends. In light of this, we propose a set of machine learning models that require minimal supervision and adapt easily to evolving nomenclatures. The questions raised found partial answers in case-based reasoning, semi-supervised learning-to-rank, latent variable models, and in leveraging Open Data, the semantic web, and social media. The models proposed were tested on real-world data before being implemented in an industrial environment. The resulting standardization is at the core of SmartSearch, a project that provides a comprehensive analysis of the job market.
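The normalization this abstract describes can be illustrated in miniature: the sketch below maps free-text job titles onto a toy nomenclature with difflib fuzzy matching. This is only a stand-in for the thesis's statistical models, and the nomenclature values are invented.

```python
import difflib

# Toy nomenclature of job categories (hypothetical values, for illustration only).
NOMENCLATURE = ["software engineer", "data scientist", "sales manager", "accountant"]

def normalize(raw_title, nomenclature=NOMENCLATURE, cutoff=0.6):
    """Map a free-text job title to the closest concept in the nomenclature,
    or None when no concept is close enough."""
    matches = difflib.get_close_matches(raw_title.lower(), nomenclature,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Heterogeneous titles from different job boards collapse to one concept.
print(normalize("Sr. Software Engineer"))  # "software engineer"
print(normalize("softwre engineer"))       # "software engineer" (typo-tolerant)
```

A real system would replace the string-similarity cutoff with learned models, but the interface is the same: raw text in, nomenclature value (or nothing) out.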
302

Benchmarking authorship attribution techniques using over a thousand books by fifty Victorian era novelists

Gungor, Abdulmecit 03 April 2018 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / Authorship attribution (AA) is the process of identifying the author of a given text; from a machine learning perspective, it can be seen as a classification problem. The literature offers many classification methods, each built on some form of feature extraction. In this thesis, we explore representation techniques such as Word2Vec and paragraph2vec, along with other useful feature selection and extraction techniques, paired with different classifiers. We performed experiments on novels extracted from the GDELT database using features such as bag of words and n-grams, as well as newer techniques like Word2Vec. To improve our success rate, we combined several useful features, among them a diversity measure of the text, bag of words, bigrams, and words that English and American authors spell differently. A support vector machine classifier of the nu-SVC type is observed to give the best success rates on the stacked feature set. The main purpose of this work is to lay the foundations of feature extraction techniques in AA: lexical, character-level, syntactic, semantic, and application-specific features. We also aim to offer a new data resource to the authorship attribution research community and to demonstrate how it can be used to extract features for any kind of AA problem. The dataset we introduce consists of works by Victorian era authors, and the main feature extraction techniques are shown with exemplary code snippets for audiences in different knowledge domains. Feature extraction approaches and their implementation with different classifiers are presented simply enough to serve as a first step into AA. Some of the feature extraction techniques introduced here are also meant to be used in other NLP tasks, such as sentiment analysis with Word2Vec or text summarization.
Using the introduced NLP tasks and feature extraction techniques, one can begin applying them to our dataset. We also introduce several ways to put the extracted features to work, such as stacking engineered features with different classifiers, or using Word2Vec to create sentence-level vectors.
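As a minimal illustration of the classification view of AA described above, the sketch below builds bag-of-words profiles and attributes an unknown text to the nearest profile by cosine similarity. The toy corpora and author names are invented; the thesis's actual stacked features and nu-SVC classifier are far richer.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words feature vector: lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def attribute(unknown, profiles):
    """Assign the unknown text to the author whose profile is most similar."""
    scores = {author: cosine(bow(unknown), bow(corpus))
              for author, corpus in profiles.items()}
    return max(scores, key=scores.get)

# Tiny toy corpora (invented sentences, not actual Victorian novels).
profiles = {
    "author_a": "the fog rolled over the moor and the moor was silent",
    "author_b": "she danced at the ball and the ball was splendid",
}
print(attribute("the silent moor lay under fog", profiles))  # "author_a"
```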
303

Biomedical Literature Mining and Knowledge Discovery of Phenotyping Definitions

Binkheder, Samar Hussein 07 1900 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / Phenotyping definitions are essential for cohort identification when conducting clinical research, but they become an obstacle when they are not readily available. Developing new definitions manually requires expert involvement that is labor-intensive, time-consuming, and unscalable. Moreover, automated approaches rely mostly on electronic health record data that suffer from bias, confounding, and incompleteness. Few efforts have utilized text-mining and data-driven approaches to automate the extraction and literature-based knowledge discovery of phenotyping definitions and to support their scalability. In this dissertation, we propose a text-mining pipeline combining rule-based and machine-learning methods to automate the retrieval, classification, and extraction of phenotyping-definition information from the literature. To achieve this, we first developed an annotation guideline with ten dimensions for annotating sentences with evidence of phenotyping-definition modalities, such as phenotypes and laboratory tests. Two annotators manually annotated a corpus of sentences (n=3,971) extracted from the methods sections of full-text observational studies (n=86). Percent agreement and Kappa statistics showed high inter-annotator agreement on sentence-level annotations. Second, we constructed two validated text classifiers from our annotated corpora: one at the abstract level and one at the full-text sentence level. We applied the abstract-level classifier to a large-scale biomedical corpus of over 20 million abstracts published between 1975 and 2018 to identify positive abstracts (n=459,406). After retrieving their full texts (n=120,868), we extracted sentences from the methods sections and used the sentence-level classifier to extract positive sentences (n=2,745,416). Third, we performed literature-based discovery on the positively classified sentences.
Lexicon-based methods were used to recognize medical concepts in these sentences (n=19,423). Co-occurrence and association methods were used to identify and rank phenotype candidates associated with a phenotype of interest. We derived 12,616,465 associations from our large-scale corpus. These literature-based associations and the large-scale corpus contribute to building new data-driven phenotyping definitions and to expanding existing definitions with minimal expert involvement.
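The co-occurrence and association step can be sketched with a pointwise mutual information (PMI) score over sentence-level concept sets. The concepts and sentences below are invented, and PMI is only one of several association measures such a pipeline might use.

```python
import math

# Toy classified sentences with recognized concepts (invented for illustration).
sentences = [
    {"diabetes", "hba1c", "metformin"},
    {"diabetes", "hba1c"},
    {"diabetes", "obesity"},
    {"hypertension", "obesity"},
]

def pmi(target, candidate, sents):
    """Pointwise mutual information between two concepts over a sentence corpus."""
    n = len(sents)
    p_t = sum(target in s for s in sents) / n
    p_c = sum(candidate in s for s in sents) / n
    p_tc = sum(target in s and candidate in s for s in sents) / n
    return math.log2(p_tc / (p_t * p_c)) if p_tc else float("-inf")

# Rank candidates by association with the phenotype of interest.
candidates = ["hba1c", "metformin", "obesity"]
ranking = sorted(candidates, key=lambda c: pmi("diabetes", c, sentences),
                 reverse=True)
print(ranking)  # "obesity" ranks last: it co-occurs more with hypertension
```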
304

Bridging Text Mining and Bayesian Networks

Raghuram, Sandeep Mudabail 09 March 2011 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / After the initial network is constructed using an expert's knowledge of the domain, a Bayesian network needs to be updated as new data is observed. Literature mining is a very important source of this new data. In this work, we explore what kind of data needs to be extracted with a view to updating Bayesian networks, which existing technologies can be useful in achieving some of these goals, and what research is required to accomplish the remaining requirements. This thesis specifically deals with utilizing causal associations and experimental results obtained from literature mining. However, these associations and numerical results cannot be directly integrated into the Bayesian network. The source of the literature and the perceived quality of the research need to be factored into the integration process, just as a human reading the literature would factor them in. This thesis presents a general methodology for updating a Bayesian network with the mined data. The methodology consists of solutions to some of the issues surrounding the task of integrating the causal associations with the Bayesian network, and demonstrates the idea with a semiautomated software system.
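As a loose illustration of weighting mined evidence by source quality, the sketch below blends findings into pseudo-counts for a single conditional probability. This is a deliberately simplistic scheme of my own, not the thesis's integration methodology; all numbers are invented.

```python
def update_cpt(prior_counts, mined_associations):
    """Blend mined causal-association evidence into pseudo-counts, weighting
    each finding by a quality score for its source. Returns the updated
    P(effect | cause)."""
    pos, neg = prior_counts  # expert-elicited pseudo-counts: (effect, no effect)
    for supports_link, quality in mined_associations:
        if supports_link:
            pos += quality  # a high-quality positive finding adds more weight
        else:
            neg += quality
    return pos / (pos + neg)

# Expert prior: P(effect | cause) = 4 / (4 + 6) = 0.4 before mining.
# Mined findings: (supports_link, source-quality weight in [0, 1]).
evidence = [(True, 0.9), (True, 0.5), (False, 0.2)]
posterior = update_cpt((4.0, 6.0), evidence)
print(round(posterior, 3))  # mostly-supportive evidence nudges 0.4 upward
```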
305

Biomedical Literature Mining with Transitive Closure and Maximum Network Flow

Hoblitzell, Andrew P. 15 May 2011 (has links)
This thesis examines biomedical text mining with an application in bone biology. A special thanks is extended to Anita Park and Mark Jaeger from the Purdue University Graduate School Office, who were invaluable in the formatting of the thesis. IUPUI and every other university would be fortunate to have staff who respond in such a timely, courteous, and professional manner. / Indiana University-Purdue University Indianapolis (IUPUI) / The biological literature is a huge and constantly growing source of information that biologists may consult for information about their field, but the vast amount of data can become overwhelming. Medline, which makes a great amount of biological journal data available online, makes the development of automated text mining systems, and hence "data-driven discovery," possible. This thesis examines current work in the field of text mining of the biological literature, and then aims to mine documents pertaining to bone biology. The documents are retrieved from PubMed, and direct associations between the terms are then computed. Potentially novel transitive associations among biological objects are then discovered using the transitive closure algorithm and the maximum flow algorithm. The thesis discusses in detail the extraction of biological objects from the collected documents, along with the co-occurrence-based text mining algorithm, the transitive closure algorithm, and the maximum network flow algorithm that were run to extract the potentially novel biological associations. Generated hypotheses (novel associations) were assigned significance scores for further validation by an expert bone biologist. Extending the work to hypergraphs for enhanced meaning and accuracy is also examined in the thesis.
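The transitive closure step can be sketched directly: given direct associations mined from documents, Warshall's algorithm exposes indirect links as candidate hypotheses. The term names below are invented.

```python
def transitive_closure(terms, direct):
    """Warshall's algorithm over a set of direct term associations."""
    reach = set(direct)
    for k in terms:
        for i in terms:
            for j in terms:
                if (i, k) in reach and (k, j) in reach:
                    reach.add((i, j))
    return reach

terms = ["gene_x", "protein_y", "bone_density"]
direct = {("gene_x", "protein_y"), ("protein_y", "bone_density")}
closure = transitive_closure(terms, direct)

# Candidate hypotheses: associations never stated directly in any document.
novel = closure - direct
print(novel)  # {("gene_x", "bone_density")}
```

In the thesis these candidate links are then scored for significance and handed to a domain expert, rather than reported as-is.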
306

Unstructured to Actionable: Extracting wind event impact data for enhanced infrastructure resilience

Pham, An Huy 28 August 2023 (has links)
The United States experiences more extreme wind events than any other country, owing to its extensive coastlines, central regions prone to tornadoes, and a varied climate that together create a wide array of wind phenomena. Despite advanced meteorological forecasts, these events continue to have significant impacts on infrastructure due to the knowledge gap between hazard prediction and tangible impact. Consequently, disaster managers are increasingly interested in understanding the impacts of past wind events, which can assist in formulating strategies to enhance community resilience. However, this data is often unstructured and embedded in various agency documents, making it challenging to access and use effectively. It is therefore important to investigate approaches that can distinguish and extract impact data from non-essential information. This research explores methods to identify, extract, and summarize sentences containing impact data. The significance of this study lies in addressing the scarcity of historical impact data related to structural and community damage, given that such information is dispersed across multiple briefings and damage reports. The research has two main objectives. The first is to extract sentences providing information on infrastructure or community damage. This task uses zero-shot text classification with the large version of the Bidirectional and Auto-Regressive Transformers model (BART-large) pre-trained on the Multi-Genre Natural Language Inference (MNLI) dataset. The model identifies impact sentences by evaluating entailment probabilities against user-defined impact keywords. This method addresses the absence of manually labeled data and establishes a framework applicable to a variety of reports. The second objective transforms the extracted data into easily digestible summaries.
This is achieved by using a BART-large model fine-tuned on the Cable News Network (CNN) / Daily Mail dataset to generate abstractive summaries, making it easier to understand the key points of the extracted impact data. The approach is versatile, given its dependence on user-defined keywords, and can adapt to different disasters, including tornadoes, hurricanes, earthquakes, floods, and more. A case study demonstrates the methodology, specifically examining the Hurricane Ian impact data found in the Structural Extreme Events Reconnaissance (StEER) damage report. / Master of Science / The U.S. sees more severe windstorms than any other country. These storms can cause significant damage despite warnings and alerts generated by weather forecast systems up to 72 hours before a storm hits. One challenge is ineffective communication between emergency managers and at-risk communities, which can hinder timely evacuation and preparation. Additionally, data about past storm damage is often mixed with non-actionable information in many different reports, making it difficult to use that data to improve future warnings and readiness. This study tries to solve this problem by finding ways to identify, extract, and summarize information about damage caused by windstorms; it is an important step toward using historical data to prepare for future events. Two main objectives guide this research. The first is to extract the sentences in these reports that provide information on damage to buildings, infrastructure, or communities. We use a machine learning model to sort sentences into two groups: those that contain useful information and those that do not. The second objective is to transform this extracted data into easily digestible summaries, using the same machine learning model trained in a different way.
As a result, critical data can be presented in a more user-friendly and effective format, enhancing its usefulness to disaster managers.
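The sentence-extraction objective can be sketched as follows. The actual pipeline scores each sentence with entailment probabilities from a BART-large model trained on MNLI; the snippet below substitutes a crude keyword-overlap score so it runs without the model, and the keywords and report sentences are invented.

```python
def impact_score(sentence, impact_keywords):
    """Crude stand-in for the entailment probability a zero-shot model would
    assign: the fraction of impact keywords matched in the sentence."""
    words = set(sentence.lower().split())
    hits = sum(any(k in w for w in words) for k in impact_keywords)
    return hits / len(impact_keywords)

def extract_impact_sentences(report, keywords, threshold=0.25):
    """Keep only the sentences that look like impact data."""
    return [s for s in report if impact_score(s, keywords) >= threshold]

# User-defined impact keywords, as in the thesis's framework (values invented).
keywords = ["damage", "collapsed", "roof", "destroyed"]
report = [
    "The storm made landfall at 3 a.m. local time.",
    "Several roofs collapsed and two bridges were destroyed.",
    "Officials will hold a press briefing tomorrow.",
]
impact = extract_impact_sentences(report, keywords)
print(impact)  # only the damage sentence survives
```

Swapping `impact_score` for a model-based entailment score leaves the surrounding extraction logic unchanged, which is what makes the keyword-driven framework adaptable across disaster types.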
307

Structured Topic Models: Jointly Modeling Words and Their Accompanying Modalities

Wang, Xuerui 01 May 2009 (has links)
The abundance of data in the information age poses an immense challenge: how do we perform large-scale inference to understand and utilize this overwhelming amount of information? Such techniques are of tremendous intellectual significance and practical impact. As part of this grand challenge, the goal of my Ph.D. thesis is to develop effective and efficient statistical topic models for massive text collections by incorporating extra information from modalities beyond the text itself. Text documents are not just text; different kinds of additional information are naturally interleaved with them. Most previous work, however, pays attention to only one modality at a time and ignores the others. In my thesis, I present a series of probabilistic topic models showing how we can bridge multiple modalities of information, in a unified fashion, for various tasks. Interestingly, joint inference over multiple modalities leads to many findings that cannot be discovered from one modality alone, as briefly illustrated below. Email is pervasive nowadays. Much previous work in natural language processing modeled text using latent topics while ignoring social networks; social network research, on the other hand, mainly dealt with the existence of links between entities without taking into consideration the language content or topics on those links. The author-recipient-topic (ART) model, by contrast, steers the discovery of topics according to the relationships between people, and learns topic distributions based on the direction-sensitive messages sent between entities. However, the ART model does not explicitly identify groups formed by entities in the network, and previous work in social network analysis ignores the fact that different groupings arise for different topics.
The group-topic (GT) model, a probabilistic generative model of entity relationships and textual attributes, simultaneously discovers groups among the entities and topics in the corresponding text. Many large datasets do not have static latent structures; they are dynamic. The topics-over-time (TOT) model explicitly models time as an observed continuous variable. This allows TOT to capture long-range dependencies in time and helps avoid a Markov model's risk of inappropriately dividing a topic in two when there is a brief gap in its appearance. Treating time as a continuous variable also avoids the difficulties of discretization. Most topic models, including all of the above, rely on the bag-of-words assumption; however, word order and phrases are often critical to capturing the meaning of text. The topical n-grams (TNG) model discovers topics as well as meaningful topical phrases simultaneously. In summary, we believe these models are clear evidence that we can better understand and utilize massive text collections when additional modalities are considered and modeled jointly with the text.
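The motivation for the topical n-grams model, that the bag-of-words assumption splits meaningful phrases apart, can be shown with a toy phrase-candidate pass. The documents below are invented, and this is not the TNG model itself, which learns phrases and topics jointly.

```python
from collections import Counter

def frequent_bigrams(docs, min_count=2):
    """Count adjacent word pairs; frequent pairs are phrase candidates that a
    bag-of-words model would split into independent tokens."""
    counts = Counter()
    for doc in docs:
        words = doc.lower().split()
        counts.update(zip(words, words[1:]))
    return {bg: c for bg, c in counts.items() if c >= min_count}

docs = [
    "the white house issued a statement",
    "reporters gathered outside the white house",
    "a white wall surrounded the house",
]
phrases = frequent_bigrams(docs)
print(phrases)  # ("white", "house") surfaces as a recurring unit
```

Note that "white" and "house" also occur separately in the third document; a bag-of-words model cannot tell those uses apart, which is exactly the gap TNG addresses.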
308

Development of High-Efficiency Single-Crystal Perovskite Solar Cells Guided by Text-Based Data-Driven Insights

Alsalloum, Abdullah Yousef 11 1900 (has links)
Of the emerging photovoltaic technologies, perovskite solar cells (PSCs) are arguably among the most promising candidates for commercialization. Worldwide interest has prompted researchers to produce tens of thousands of studies on the topic, making PSCs one of the most active research topics of the past decade. Unfortunately, the rapid output of a substantial number of publications has made the traditional literature review process and research plans cumbersome tasks for both the novice and expert. In this dissertation, a data-driven analysis utilizing a novel text mining and natural language processing pipeline is applied on the perovskite literature to help decipher the field, uncover emerging research trends, and delineate an experimental research plan of action for this dissertation. The analysis led to the selection and exploration of two experimental projects on single-crystal PSCs, which are devices based on micrometers-thick grain-boundary-free monocrystalline films with long charge carrier diffusion lengths and enhanced light absorption (relative to polycrystalline films). First, a low-temperature crystallization approach is devised to improve the quality of methylammonium lead iodide (MAPbI3) single-crystal films, leading to devices with markedly enhanced open-circuit voltages (1.15 V vs 1.08 V for controls) and power conversion efficiencies (PCEs) of up to 21.9%, among the highest reported for MAPbI3-based devices. Second, mixed-cation formamidinium (FA)-based single-crystal PSCs are successfully fabricated with PCEs of up to 22.8% and short-circuit current values exceeding 26 mA cm-2, achieved by a significant expansion of the external quantum efficiency band edge, which extends past that of the state-of-the-art polycrystalline FAPbI3-based solar cells by about 50 meV — only 60 meV larger than that of the top-performing photovoltaic material, GaAs. 
These figures of merit not only set new record values for SC-PSCs, but also showcase the potential of adopting data-driven techniques to guide the research process of a data-rich field.
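The trend-detection idea behind the text-mining analysis can be sketched as a term's share of abstracts per publication year. The records below are invented, and the dissertation's actual NLP pipeline is far more involved.

```python
from collections import defaultdict

def topic_trend(records, term):
    """Share of abstracts per year that mention a term: a crude proxy for the
    emerging-trend signal a literature-mining analysis looks for."""
    total, hits = defaultdict(int), defaultdict(int)
    for year, abstract in records:
        total[year] += 1
        hits[year] += term in abstract.lower()  # bool adds as 0 or 1
    return {y: hits[y] / total[y] for y in sorted(total)}

# Invented toy records: (publication year, abstract text).
records = [
    (2016, "perovskite thin films for solar cells"),
    (2016, "dye-sensitized solar cells revisited"),
    (2019, "single-crystal perovskite devices"),
    (2019, "perovskite stability under humidity"),
]
trend = topic_trend(records, "perovskite")
print(trend)  # a rising share suggests an emerging research direction
```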
309

Deciding Polarity of Opinions over Multi-Aspect Customer Reviews

Kayaalp, Naime F. January 2014 (has links)
No description available.
310

Using and Improving Computational Cognitive Models for Graph-Based Semantic Learning and Representation from Unstructured Text with Applications

Ali, Ismael Ali 26 April 2018 (has links)
No description available.
