41

A probabilistic and incremental model for online classification of documents : DV-INBC

Rodrigues, Thiago Fredes January 2016 (has links)
Recently, the fields of Data Mining and Machine Learning have seen a rapid increase in the creation and availability of data repositories, mainly due to the rapid creation of such data in social networks. A large part of these data consists of text documents, and the information stored in them can range from descriptions of user profiles to common textual topics such as politics, sports and science, information that is very useful for many applications. Moreover, since much of this data is created in streams, scalable and on-line algorithms are desired, because tasks like the organization and exploration of large document collections would benefit from them. In this thesis an incremental, on-line and probabilistic model for document classification is presented as an effort to tackle this problem. The algorithm is called DV-INBC and is an extension of the INBC algorithm. The two main characteristics of DV-INBC are: only a single scan over the data is necessary to create a model of it, and the data vocabulary need not be known a priori. Therefore, little knowledge about the data stream is needed. To assess its performance, tests using well-known datasets are presented.
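
The core idea is easiest to see in miniature. Below is a minimal, hypothetical sketch (not the thesis's actual algorithm or notation) of a single-pass, open-vocabulary naive Bayes text classifier: counts are updated as each document streams by and the vocabulary grows on the fly, so neither a second pass over the data nor a predefined lexicon is needed. Class and method names are illustrative.

```python
from collections import defaultdict
import math

class IncrementalTextNB:
    """Single-pass, incremental multinomial naive Bayes with an open vocabulary.

    Illustrative sketch only: DV-INBC itself is more elaborate, but it shares
    these two properties -- one scan over the training stream and no fixed
    vocabulary known in advance.
    """
    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # Laplace smoothing
        self.class_docs = defaultdict(int)      # documents seen per class
        self.word_counts = defaultdict(lambda: defaultdict(int))  # class -> word -> count
        self.class_tokens = defaultdict(int)    # total tokens per class
        self.vocab = set()                      # grows as new words arrive

    def partial_fit(self, tokens, label):
        """Update counts from one document; earlier documents are never revisited."""
        self.class_docs[label] += 1
        for w in tokens:
            self.word_counts[label][w] += 1
            self.class_tokens[label] += 1
            self.vocab.add(w)

    def predict(self, tokens):
        """Return the class with the highest (smoothed) log posterior."""
        total_docs = sum(self.class_docs.values())
        best, best_score = None, float("-inf")
        for c, n_docs in self.class_docs.items():
            score = math.log(n_docs / total_docs)
            denom = self.class_tokens[c] + self.alpha * len(self.vocab)
            for w in tokens:
                score += math.log((self.word_counts[c][w] + self.alpha) / denom)
            if score > best_score:
                best, best_score = c, score
        return best

# Usage: feed the stream once, updating as documents arrive.
clf = IncrementalTextNB()
clf.partial_fit("the match ended in a draw".split(), "sports")
clf.partial_fit("parliament passed the new bill".split(), "politics")
print(clf.predict("the bill was debated in parliament".split()))
```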
42

Fast Inference for Interactive Models of Text

Lund, Jeffrey A 01 September 2015 (has links)
Probabilistic models of text are a useful tool for enabling the analysis of large collections of digital text. For example, Latent Dirichlet Allocation can quickly produce topical summaries of large collections of text documents. Many important use cases of such models include human interaction during the inference process. For example, the Interactive Topic Model extends Latent Dirichlet Allocation to incorporate human expertise during inference in order to produce topics which are better suited to individual user needs. However, interactive use cases of probabilistic models of text introduce new constraints on inference: the inference procedure must not only be accurate, but also fast enough to facilitate human interaction. If the inference is too slow, the human interaction is harmed and the interactive aspect of the probabilistic model becomes less useful. Unfortunately, the most popular inference algorithms in use today either require strong approximations which can degrade the quality of some models, or require time-consuming sampling. We explore the use of Iterated Conditional Modes, an algorithm which obtains locally optimal maximum a posteriori estimates, as an alternative to popular inference algorithms such as Gibbs sampling or mean field variational inference. Iterated Conditional Modes is not only fast enough to facilitate human interaction, but can produce better maximum a posteriori estimates than sampling. We demonstrate its superior performance on a wide variety of models. First, we apply a DP Mixture of Multinomials model to the problem of web search result clustering, and show that we not only outperform previous methods in clustering quality but also achieve interactive runtimes when performing inference with Iterated Conditional Modes. We then apply Iterated Conditional Modes to the Interactive Topic Model. Not only is Iterated Conditional Modes much faster than the previously published Gibbs sampler, but it also incorporates human feedback during inference more effectively, as measured by accuracy on a classification task using the resultant topic model. Finally, we use Iterated Conditional Modes with MomResp, a model used to aggregate noisy crowdsourced annotations. Compared with Gibbs sampling, Iterated Conditional Modes is better able to recover ground-truth labels from simulated noisy annotations, and it runs orders of magnitude faster.
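
As a rough illustration of how Iterated Conditional Modes differs from Gibbs sampling, the toy sketch below clusters documents under a finite mixture of multinomials: each document is reassigned to the cluster that maximizes its conditional posterior given all other assignments, rather than sampling from that conditional. This is only an assumed, simplified analogue of the setting above (the thesis uses a DP mixture and other models), with invented parameter names.

```python
import numpy as np

def icm_cluster(counts, k, alpha=1.0, beta=0.01, iters=50, seed=0):
    """Iterated Conditional Modes for a finite mixture of multinomials.

    counts: (n_docs, vocab) term-count matrix. Each step sets a document's
    cluster to the argmax of its (approximate) conditional posterior given
    all other assignments -- a hill-climbing analogue of collapsed Gibbs
    sampling, which would sample from that conditional instead of maximizing.
    """
    rng = np.random.default_rng(seed)
    n, v = counts.shape
    z = rng.integers(k, size=n)                    # random initial assignments
    cluster_word = np.zeros((k, v)); cluster_size = np.zeros(k)
    for d in range(n):
        cluster_word[z[d]] += counts[d]; cluster_size[z[d]] += 1

    for _ in range(iters):
        changed = False
        for d in range(n):
            # remove document d from its current cluster
            cluster_word[z[d]] -= counts[d]; cluster_size[z[d]] -= 1
            # smoothed per-cluster word log-probabilities
            log_theta = np.log(cluster_word + beta) - \
                        np.log(cluster_word.sum(axis=1, keepdims=True) + beta * v)
            scores = np.log(cluster_size + alpha) + log_theta @ counts[d]
            new_z = int(np.argmax(scores))         # maximize instead of sampling
            changed |= (new_z != z[d])
            z[d] = new_z
            cluster_word[new_z] += counts[d]; cluster_size[new_z] += 1
        if not changed:                            # local MAP optimum reached
            break
    return z

# Example: three tiny documents over a 4-word vocabulary
docs = np.array([[3, 0, 1, 0], [2, 1, 0, 0], [0, 0, 2, 3]])
print(icm_cluster(docs, k=2))
```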
43

A Framework for Evaluating Recommender Systems

Bean, Michael Gabriel 01 December 2016 (has links)
Prior research on text collections of religious documents has demonstrated that viable recommender systems in the area are lacking, if not non-existent, for some datasets. For example, both www.LDS.org and scriptures.byu.edu are websites designed for religious use. Although they provide users with the ability to search for documents based on keywords, they do not provide the ability to discover documents based on similarity. Consequently, these systems would greatly benefit from a recommender system. This work provides a framework for evaluating recommender systems that is flexible enough for use with either website. Such a framework identifies the best recommender system, giving users another way to explore and discover documents related to their current interests, given a starting document. The framework created for this thesis, RelRec, is attractive because it compares two different recommender systems. Documents are considered relevant if they are among the nearest neighbors, where "nearest" is defined by a particular system's similarity formula. We use RelRec to compare the output of two particular recommender systems on our selected data collection. RelRec shows that the LDA recommender outperforms the TF-IDF recommender in terms of coverage, making it preferable for LDS-based document collections.
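
The comparison RelRec performs can be approximated as follows. This sketch is not RelRec itself: it simply builds a TF-IDF recommender and an LDA recommender over the same toy corpus and returns each system's nearest neighbors for a starting document, using cosine similarity as an assumed stand-in for each system's similarity formula.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

def nearest_neighbors(matrix, doc_idx, n=2):
    """Indices of the n documents most similar to doc_idx (excluding itself)."""
    sims = cosine_similarity(matrix[doc_idx].reshape(1, -1), matrix).ravel()
    order = np.argsort(-sims)
    return [int(i) for i in order if i != doc_idx][:n]

# Toy stand-in corpus; a real run would use the full document collection.
docs = [
    "faith prayer scripture worship doctrine",
    "scripture doctrine prayer covenant",
    "basketball game score team season",
    "team season playoffs game coach",
]

# TF-IDF recommender: similarity in weighted term space
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()

# LDA recommender: similarity in topic space
counts = CountVectorizer(stop_words="english").fit_transform(docs)
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

print("TF-IDF neighbors of doc 0:", nearest_neighbors(tfidf, 0))
print("LDA neighbors of doc 0:   ", nearest_neighbors(topics, 0))
```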
44

Predicting labor market competition and employee mobility — a machine learning approach

Liu, Yuanyang 01 August 2019 (has links)
Applying data analytics for talent acquisition and retention has been identified as one of the most urgent challenges facing HR leaders around the world; however, it is also one of the challenges that firms are least prepared to tackle. Our research strives to narrow this capability gap between the urgency and the readiness of data-driven human resource management. First, we predict interfirm competitors for human capital in the labor market utilizing the rich information contained in over 89,000 LinkedIn users' profiles. Using employee migrations across firms, we derive and analyze a human capital flow network. We leverage this network to extract global cues about interfirm human capital overlap through structural equivalence and community classification. The online employee profiles also provide rich data on the explicit knowledge base of firms and allow us to measure interfirm human capital overlap in terms of the similarity of their employees' skills. We validate our proposed human capital overlap metrics in a predictive analytics framework using future employee migrations as an indicator of labor market competition. The results show that our proposed metrics have superior predictive power over conventional firm-level economic and human resource measures. Second, we estimate the effect of skilled immigrants on native U.S. workers' turnover probability. We apply unsupervised machine learning to categorize employees' self-reported skills and find that skilled immigrants disproportionately specialize in IT, whereas native workers predominantly focus on management and analyst skills. Utilizing the randomness of the H-1B visa lottery system and a 2SLS design, we find that a 1 percentage point increase in a firm's proportion of skilled immigrant employees leads to a decrease of 0.69 percentage points in a native employee's turnover risk. However, this beneficial crowding-in effect varies for native workers with different skills. Our methodology highlights the need to account for a multifaceted view of skilled immigration's effect on native workers. Finally, we propose a set of features and models that can effectively predict future employee turnover outcomes. Our predictive models can provide significant utility to managers by identifying the individuals with the highest turnover risk.
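
A simplified, hypothetical sketch of the first step is shown below: building a weighted human-capital flow network from employee migrations and measuring interfirm skill overlap as cosine similarity between firms' skill-count profiles. The firm names, skill counts, and the exact overlap metric are invented for illustration and are not the metrics validated in the dissertation.

```python
from collections import Counter
from math import sqrt
import networkx as nx

# Hypothetical migration records: (from_firm, to_firm) for each observed job change
migrations = [("AcmeCorp", "Globex"), ("Globex", "Initech"), ("AcmeCorp", "Globex")]

# Directed, weighted human-capital flow network: edge weight = migration volume
G = nx.DiGraph()
for src, dst in migrations:
    w = G.get_edge_data(src, dst, default={"weight": 0})["weight"]
    G.add_edge(src, dst, weight=w + 1)

# Hypothetical firm skill profiles aggregated from employees' listed skills
skills = {
    "AcmeCorp": Counter({"python": 40, "sql": 25, "marketing": 5}),
    "Globex":   Counter({"python": 30, "sql": 20, "sales": 10}),
}

def skill_overlap(a, b):
    """Cosine similarity between two firms' skill-count vectors."""
    dot = sum(a[s] * b[s] for s in set(a) | set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm

print(G["AcmeCorp"]["Globex"]["weight"])                              # migration volume
print(round(skill_overlap(skills["AcmeCorp"], skills["Globex"]), 3))  # skill overlap
```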
45

Automating an Engine to Extract Educational Priorities for Workforce City Innovation

Hobbs, Madison 01 January 2019 (has links)
This thesis is grounded in my work done through the Harvey Mudd College Clinic Program as Project Manager of the PilotCity Clinic Team. PilotCity is a startup whose mission is to transform small to mid-sized cities into centers of innovation by introducing employer partnerships and work-based learning to high school classrooms. The team was tasked with developing software and algorithms to automate PilotCity's programming and to extract educational insights from unstructured data sources like websites, syllabi, resumes, and more. The team helped engineer a web application to expand and support PilotCity's user base, designed a recommender system to automate the process of matching employers to high school classrooms, and packaged a topic modeling module to extract educational priorities from more complex data such as syllabi, course handbooks, or other educational text. Finally, the team explored automatically generating supplementary course resources using insights from topic models. This thesis details the team's process from beginning to final deliverables, including the methods, implementation, results, challenges, future directions, and impact of the project.
46

Cooperative Semantic Information Processing for Literature-Based Biomedical Knowledge Discovery

Yu, Zhiguo 01 January 2013 (has links)
Given that data is increasing exponentially every day, extracting and understanding the information, themes and relationships in large collections of documents is increasingly important to researchers in many areas. In this thesis, we present a cooperative semantic information processing system to help biomedical researchers understand and discover knowledge in the large numbers of titles and abstracts returned by PubMed queries. Our system is based on a prevalent technique, topic modeling, an unsupervised machine learning approach for discovering the set of semantic themes in a large set of documents. In addition, we apply a natural language processing technique to transform the “bag-of-words” assumption of topic models into a “bag-of-important-phrases” assumption, and we build an interactive visualization tool using a modified, open-source Topic Browser. Finally, we conduct two experiments to evaluate the approach. The first evaluates whether the “bag-of-important-phrases” approach is better at identifying semantic themes than the standard “bag-of-words” approach. This is an empirical study in which human subjects evaluate the quality of the resulting topics using a standard “word intrusion test” to determine whether subjects can identify a word (or phrase) that does not belong in the topic. The second is a qualitative empirical study that evaluates how well the system helps biomedical researchers explore a set of documents to discover previously hidden semantic themes and connections. The methodology for this study has been successfully used to evaluate other knowledge-discovery tools in biomedicine.
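
A rough sketch of the shift from “bag-of-words” to “bag-of-important-phrases” is shown below. The thesis extracts important phrases with a natural language processing pipeline; here, as a stand-in, frequent collocations are simply merged into single phrase tokens before topic modeling, using gensim on a toy corpus.

```python
from gensim.corpora import Dictionary
from gensim.models import Phrases, LdaModel

# Toy stand-in for PubMed titles/abstracts, already tokenized
docs = [
    "insulin resistance in type two diabetes patients".split(),
    "type two diabetes and insulin signaling pathways".split(),
    "gene expression profiling of breast cancer tumors".split(),
    "breast cancer gene expression and tumor suppressor genes".split(),
]

# Merge frequent collocations into single tokens (e.g. "breast_cancer"),
# approximating a bag-of-important-phrases representation.
bigrams = Phrases(docs, min_count=2, threshold=1.0)
phrase_docs = [bigrams[d] for d in docs]

dictionary = Dictionary(phrase_docs)
corpus = [dictionary.doc2bow(d) for d in phrase_docs]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=0)

# Print each topic's top tokens; phrases now appear as single units.
for topic_id, words in lda.show_topics(num_topics=2, num_words=5, formatted=False):
    print(topic_id, [w for w, _ in words])
```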
47

Nonparametric Bayesian Models for Joint Analysis of Imagery and Text

Li, Lingbo January 2014 (has links)
It has become increasingly important to develop statistical models to manage large-scale, high-dimensional image data. This thesis presents novel hierarchical nonparametric Bayesian models for the joint analysis of imagery and text, and it consists of two main parts. The first part is based on single-image processing. We first present a spatially dependent model for simultaneous image segmentation and interpretation. Given a corrupted image, by imposing spatial inter-relationships within the imagery, the model not only improves reconstruction performance but also yields smooth segmentation. We then develop an online variational Bayesian algorithm for dictionary learning to process large-scale datasets, based on online stochastic optimization with a natural-gradient step. We show that the dictionary is learned simultaneously with image reconstruction on large natural images containing tens of millions of pixels. The second part applies dictionary learning to the joint analysis of multiple images and text in order to infer relationships among images. We show that feature extraction and image organization with annotation (when available) can be integrated by unifying dictionary learning and hierarchical topic modeling. We present image organization in both "flat" and hierarchical constructions. Compared with traditional algorithms, in which feature extraction is separated from model learning, our algorithms not only fit the datasets better but also provide richer and more interpretable structures of the images. / Dissertation
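
The online dictionary-learning idea can be sketched with off-the-shelf tools. The example below uses scikit-learn's mini-batch dictionary learning on patches of a random stand-in image rather than the thesis's variational Bayesian algorithm with natural-gradient updates; it only illustrates why per-batch (online) updates make very large images tractable.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.feature_extraction.image import extract_patches_2d

# Random stand-in for a large natural image; the thesis processes images
# with tens of millions of pixels, which is feasible only with online updates.
rng = np.random.default_rng(0)
image = rng.random((256, 256))

patches = extract_patches_2d(image, (8, 8), max_patches=2000, random_state=0)
X = patches.reshape(len(patches), -1)
X -= X.mean(axis=1, keepdims=True)              # remove per-patch DC component

# Mini-batch (online) updates: each step touches only a small batch of patches,
# analogous in spirit to stochastic variational updates on streaming data.
dico = MiniBatchDictionaryLearning(n_components=64, batch_size=200, random_state=0)
codes = dico.fit_transform(X)                   # sparse codes for each patch
reconstruction = codes @ dico.components_       # approximate patch reconstruction
print(reconstruction.shape)                     # (2000, 64)
```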
50

When Does it Mean? Detecting Semantic Change in Historical Texts

Hengchen, Simon 06 December 2017 (has links)
Contrary to what has been done to date in the hybrid field of natural language processing (NLP), this doctoral thesis holds that the approach developed here makes it possible to semi-automatically detect semantic changes in digitised, OCRed, historical corpora. We define the term semi-automatic as “making use of an advanced tool whilst remaining in control of key decisions regarding the processing of the corpus”. While the tool utilised (topic modelling, and more precisely Latent Dirichlet Allocation, LDA) is not unknown in NLP or computational historical semantics, where it is already mobilised to follow a priori selected words and try to detect when these words change meaning, it has never been used, to the best of our knowledge, to detect which words change in a humanistically relevant way. In other terms, our method does not study a word in context to gather information on that specific word, but the whole context (which we consider a witness to a potential evolution of reality) to gather more contextual information on one or several particular semantic-shift candidates. In order to detect these semantic changes, we use the algorithm to create lexical fields: groups of words that together define a subject to which they all relate. By comparing lexical fields over different time periods of the same corpus (that is, by adopting a diachronic approach), we try to determine whether words appear over time. We posit that if a word starts to be used in a certain context at a certain time, it is a likely candidate for semantic change. Of course, the method developed here and illustrated by a case study applies to a certain context: that of digitised, OCRed, historical archives in Dutch. Nevertheless, this doctoral work also describes the advantages and disadvantages of the algorithm and postulates, on the basis of this evaluation, that the method is applicable to other fields, under other conditions. By carrying out a critical evaluation of the tools available and used, this doctoral thesis invites the community to reproduce the method, whilst pointing out obvious limitations of the approach and proposing ways to solve them. / Doctorate in Information and Communication
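
The diachronic comparison at the heart of the method can be caricatured as follows: train a topic model (a proxy for a lexical field) on each time slice of the corpus and flag words that newly join a field in the later slice as semantic-change candidates. The two toy time slices and all parameters below are invented for illustration; the thesis works with OCRed Dutch archives and adds substantial historical interpretation.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def topic_top_words(tokenized_docs, num_topics=1, topn=10, seed=0):
    """Train LDA on one time slice and return each topic's top words as a set."""
    d = Dictionary(tokenized_docs)
    bow = [d.doc2bow(doc) for doc in tokenized_docs]
    lda = LdaModel(bow, num_topics=num_topics, id2word=d, passes=10, random_state=seed)
    return [set(w for w, _ in lda.show_topic(t, topn=topn)) for t in range(num_topics)]

# Toy tokenized documents from two periods of the "same" corpus
slice_a = [["gay", "cheerful", "party", "music"], ["cheerful", "bright", "gay", "dance"]]
slice_b = [["gay", "rights", "community", "march"], ["rights", "gay", "activism", "community"]]

fields_a = topic_top_words(slice_a)
fields_b = topic_top_words(slice_b)

# Words that join a lexical field only in the later slice are flagged as
# semantic-change candidates; a historian then inspects them in context.
for later in fields_b:
    closest_earlier = max(fields_a, key=lambda earlier: len(earlier & later))
    print("new in field:", later - closest_earlier)
```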
