41

A Framework for Evaluating Recommender Systems

Bean, Michael Gabriel 01 December 2016 (has links)
Prior research on text collections of religious documents has demonstrated that viable recommender systems in the area are lacking, if not non-existent, for some datasets. For example, both www.LDS.org and scriptures.byu.edu are websites designed for religious use. Although they provide users with the ability to search for documents based on keywords, they do not provide the ability to discover documents based on similarity. Consequently, these systems would greatly benefit from a recommender system. This work provides a framework for evaluating recommender systems that is flexible enough for use with either website. Such a framework would identify the best recommender system, giving users another way to explore and discover documents related to their current interests, given a starting document. The framework created for this thesis, RelRec, is attractive because it compares two different recommender systems. Documents are considered relevant if they are among the nearest neighbors, where "nearest" is defined by a particular system's similarity formula. We use RelRec to compare the output of two particular recommender systems on our selected data collection. RelRec shows that the LDA recommender outperforms the TF-IDF recommender in terms of coverage, making it preferable for LDS-based document collections.
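As an illustration of the kind of head-to-head comparison the abstract describes, the sketch below builds a TF-IDF recommender and an LDA recommender with scikit-learn and scores each by a simple coverage measure (the fraction of documents that ever appear in another document's top-k list). The corpus, k, and the coverage definition are assumptions for illustration, not RelRec's actual formulas.

```python
# Hypothetical sketch: compare a TF-IDF and an LDA recommender by "coverage",
# read here as the fraction of documents that ever appear in another
# document's top-k neighbor list. Corpus and k are placeholders.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "faith hope and charity in the early church",
    "covenant temple and ordinance in scripture",
    "history of the pioneers and the westward trek",
]
k = 2

def top_k_neighbors(doc_vectors, k):
    """Return, for each document, the indices of its k most similar documents."""
    sims = cosine_similarity(doc_vectors)
    np.fill_diagonal(sims, -np.inf)          # never recommend a document to itself
    return np.argsort(-sims, axis=1)[:, :k]

def coverage(neighbors, n_docs):
    """Fraction of documents recommended at least once."""
    return len(np.unique(neighbors)) / n_docs

# TF-IDF recommender: similarity in sparse term space
tfidf = TfidfVectorizer().fit_transform(docs)
tfidf_cov = coverage(top_k_neighbors(tfidf, k), len(docs))

# LDA recommender: similarity in topic space
counts = CountVectorizer().fit_transform(docs)
theta = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)
lda_cov = coverage(top_k_neighbors(theta, k), len(docs))

print(f"TF-IDF coverage: {tfidf_cov:.2f}, LDA coverage: {lda_cov:.2f}")
```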
42

Predicting labor market competition and employee mobility — a machine learning approach

Liu, Yuanyang 01 August 2019 (has links)
Applying data analytics for talent acquisition and retention has been identified as one of the most urgent challenges facing HR leaders around the world; however, it is also one of the challenges that firms are least prepared to tackle. Our research strives to narrow this capability gap between the urgency and the readiness of data-driven human resource management. First, we predict interfirm competitors for human capital in the labor market utilizing the rich information contained in over 89,000 LinkedIn users' profiles. Using employee migrations across firms, we derive and analyze a human capital flow network. We leverage this network to extract global cues about interfirm human capital overlap through structural equivalence and community classification. The online employee profiles also provide rich data on the explicit knowledge base of firms and allow us to measure the interfirm human capital overlap in terms of similarity in their employees' skills. We validate our proposed human capital overlap metrics in a predictive analytics framework using future employee migrations as an indicator of labor market competition. The results show that our proposed metrics have superior predictive power over conventional firm-level economic and human resource measures. Second, we estimate the effect of skilled immigrants on native U.S. workers' turnover probability. We apply unsupervised machine learning to categorize employees' self-reported skills and find that skilled immigrants disproportionately specialize in IT. In contrast, the native workers predominantly focus on management and analyst skills. Utilizing the randomness in the H-1B visa lottery system and a 2SLS design, we find that a 1 percentage point increase in a firm's proportion of skilled immigrant employees leads to a decrease of 0.69 percentage points in a native employee's turnover risk. However, this beneficial crowding-in effect varies for native workers with different skills. Our methodology highlights the need to account for a multifaceted view of skilled immigration's effect on native workers. Finally, we propose a set of features and models that effectively predict future employee turnover outcomes. Our predictive models can provide significant utility to managers by identifying individuals with the highest turnover risks.
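The network construction described above can be sketched roughly as follows; the firms, migration records, and skill vectors are invented, and the structural-equivalence and skill-overlap scores shown are simple stand-ins for the thesis's metrics (the 2SLS analysis is not covered here).

```python
# Illustrative sketch (not the thesis code): build a directed firm-to-firm
# "human capital flow" network from employee moves, then score interfirm
# overlap two ways: structural equivalence of flow patterns and cosine
# similarity of aggregated employee skills. All data below is made up.
import numpy as np
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

# (from_firm, to_firm) employee migrations
moves = [("A", "B"), ("A", "B"), ("B", "C"), ("A", "C"), ("C", "B")]
G = nx.DiGraph()
for src, dst in moves:
    if G.has_edge(src, dst):
        G[src][dst]["weight"] += 1
    else:
        G.add_edge(src, dst, weight=1)

firms = sorted(G.nodes())
A = nx.to_numpy_array(G, nodelist=firms, weight="weight")

# Structural equivalence: firms are similar if they send and receive talent
# to and from the same partners (compare concatenated out- and in-flow rows).
profiles = np.hstack([A, A.T])
struct_sim = cosine_similarity(profiles)

# Skill overlap: cosine similarity of firm-level skill count vectors.
skills = {"A": [10, 2, 0], "B": [8, 3, 1], "C": [1, 0, 9]}  # e.g. [IT, analyst, management]
skill_sim = cosine_similarity(np.array([skills[f] for f in firms]))

for i, f in enumerate(firms):
    for j, g in enumerate(firms):
        if i < j:
            print(f"{f}-{g}: structural={struct_sim[i, j]:.2f}, skills={skill_sim[i, j]:.2f}")
```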
43

Automating an Engine to Extract Educational Priorities for Workforce City Innovation

Hobbs, Madison 01 January 2019 (has links)
This thesis is grounded in my work done through the Harvey Mudd College Clinic Program as Project Manager of the PilotCity Clinic Team. PilotCity is a startup whose mission is to transform small to mid-sized cities into centers of innovation by introducing employer partnerships and work-based learning to high school classrooms. The team was tasked with developing software and algorithms to automate PilotCity's programming and to extract educational insights from unstructured data sources like websites, syllabi, resumes, and more. The team helped engineer a web application to expand and facilitate PilotCity's usership, designed a recommender system to automate the process of matching employers to high school classrooms, and packaged a topic modeling module to extract educational priorities from more complex data such as syllabi, course handbooks, or other educational text data. Finally, the team explored automatically generating supplementary course resources using insights from topic models. This thesis will detail the team's process from beginning to final deliverables including the methods, implementation, results, challenges, future directions, and impact of the project.
44

Cooperative Semantic Information Processing for Literature-Based Biomedical Knowledge Discovery

Yu, Zhiguo 01 January 2013 (has links)
Given that data is increasing exponentially every day, extracting and understanding the information, themes, and relationships in large collections of documents is more and more important to researchers in many areas. In this paper, we present a cooperative semantic information processing system to help biomedical researchers understand and discover knowledge in large numbers of titles and abstracts from PubMed query results. Our system is based on a prevalent technique, topic modeling, which is an unsupervised machine learning approach for discovering the set of semantic themes in a large set of documents. In addition, we apply a natural language processing technique to transform the “bag-of-words” assumption of topic models into a “bag-of-important-phrases” assumption and build an interactive visualization tool using a modified, open-source Topic Browser. Finally, we conduct two experiments to evaluate the approach. The first evaluates whether the “bag-of-important-phrases” approach is better at identifying semantic themes than the standard “bag-of-words” approach. This is an empirical study in which human subjects evaluate the quality of the resulting topics using a standard “word intrusion test” to determine whether subjects can identify a word (or phrase) that does not belong in the topic. The second is a qualitative empirical study that evaluates how well the system helps biomedical researchers explore a set of documents to discover previously hidden semantic themes and connections. The methodology for this study has been successfully used to evaluate other knowledge-discovery tools in biomedicine.
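A minimal sketch of the "bag-of-important-phrases" idea, assuming gensim's collocation detector as the phrase-extraction step (the thesis's own NLP pipeline may differ): frequent multi-word phrases are merged into single tokens before topic modeling, so topics are expressed over phrases rather than isolated words.

```python
# Sketch: merge frequent collocations into single tokens, then fit LDA over
# the phrase-augmented documents. Corpus and parameters are placeholders.
from gensim.models import Phrases, LdaModel
from gensim.models.phrases import Phraser
from gensim.corpora import Dictionary

abstracts = [
    "breast cancer gene expression analysis".split(),
    "gene expression in breast cancer patients".split(),
    "protein folding and gene expression".split(),
]

# Detect collocations ("important phrases") from co-occurrence statistics.
phrases = Phraser(Phrases(abstracts, min_count=1, threshold=1))
phrase_docs = [phrases[doc] for doc in abstracts]   # tokens like "breast_cancer"

dictionary = Dictionary(phrase_docs)
corpus = [dictionary.doc2bow(doc) for doc in phrase_docs]

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0, passes=10)
for topic_id, terms in lda.show_topics(num_topics=2, num_words=5, formatted=False):
    print(topic_id, [t for t, _ in terms])
```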
45

Nonparametric Bayesian Models for Joint Analysis of Imagery and Text

Li, Lingbo January 2014 (has links)
It has become increasingly important to develop statistical models to manage large-scale, high-dimensional image data. This thesis presents novel hierarchical nonparametric Bayesian models for joint analysis of imagery and text. The thesis consists of two main parts. The first part is based on single-image processing. We first present a spatially dependent model for simultaneous image segmentation and interpretation. Given a corrupted image, by imposing spatial inter-relationships within the imagery, the model not only improves reconstruction performance but also yields smooth segmentation. We then develop an online variational Bayesian algorithm for dictionary learning to process large-scale datasets, based on online stochastic optimization with a natural gradient step. We show that the dictionary is learned simultaneously with image reconstruction on large natural images containing tens of millions of pixels. The second part applies dictionary learning to the joint analysis of multiple images and text to infer relationships among images. We show that feature extraction and image organization with annotation (when available) can be integrated by unifying dictionary learning and hierarchical topic modeling. We present image organization in both "flat" and hierarchical constructions. Compared with traditional algorithms in which feature extraction is separated from model learning, our algorithms not only better fit the datasets but also provide richer and more interpretable structures of the images. / Dissertation
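For the online dictionary-learning step, scikit-learn's mini-batch dictionary learning is a rough non-Bayesian analogue of the stochastic variational approach described above; the sketch below uses a synthetic image and placeholder parameters rather than the thesis's nonparametric model.

```python
# Rough analogue (not the thesis model): fit a sparse-coding dictionary with
# stochastic mini-batch updates so a large set of image patches never needs
# to be held in memory at once.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.feature_extraction.image import extract_patches_2d

rng = np.random.RandomState(0)
image = rng.rand(64, 64)                      # placeholder for a large natural image
patches = extract_patches_2d(image, (8, 8), max_patches=2000, random_state=0)
X = patches.reshape(len(patches), -1)
X -= X.mean(axis=1, keepdims=True)            # remove per-patch DC component

dico = MiniBatchDictionaryLearning(n_components=50, batch_size=64,
                                   transform_algorithm="omp",
                                   transform_n_nonzero_coefs=5, random_state=0)
codes = dico.fit(X).transform(X)              # sparse codes for each patch
reconstruction = codes @ dico.components_
print("mean reconstruction error:", np.mean((X - reconstruction) ** 2))
```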
46

A probabilistic and incremental model for online classification of documents : DV-INBC

Rodrigues, Thiago Fredes January 2016 (has links)
Recently the fields of Data Mining and Machine Learning have seen a rapid increase in the creation and availability of data repositories, driven mainly by the rapid creation of such data in social networks. A large part of this data consists of text documents, and the information stored in these texts can range from descriptions of user profiles to common topics such as politics, sports, and science, information very useful for many applications. Moreover, since much of this data is created in streams, scalable and on-line algorithms are desirable, because tasks like the organization and exploration of large document collections would benefit from them. In this thesis an incremental, on-line, and probabilistic model for document classification is presented as an effort to tackle this problem. The algorithm is called DV-INBC and is an extension of the INBC algorithm. The two main characteristics of DV-INBC are: only a single scan over the data is necessary to create a model of it, and the data vocabulary need not be known a priori. Therefore, little knowledge about the data stream is needed. To assess its performance, tests using well-known datasets are presented.
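A simplified analogue of the single-pass idea, not the published DV-INBC algorithm: an incremental multinomial naive Bayes classifier whose vocabulary grows as unseen words arrive in the stream, updated one document at a time.

```python
# Simplified analogue (not DV-INBC itself): one-pass, vocabulary grows online.
import math
from collections import defaultdict

class IncrementalNB:
    def __init__(self):
        self.class_docs = defaultdict(int)                 # documents seen per class
        self.word_counts = defaultdict(lambda: defaultdict(int))
        self.class_totals = defaultdict(int)               # total tokens per class
        self.vocab = set()

    def partial_fit(self, tokens, label):
        self.class_docs[label] += 1
        for w in tokens:
            self.vocab.add(w)                              # vocabulary grows online
            self.word_counts[label][w] += 1
            self.class_totals[label] += 1

    def predict(self, tokens):
        n_docs = sum(self.class_docs.values())
        best, best_lp = None, -math.inf
        for c in self.class_docs:
            lp = math.log(self.class_docs[c] / n_docs)
            denom = self.class_totals[c] + len(self.vocab)  # Laplace smoothing
            for w in tokens:
                lp += math.log((self.word_counts[c][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = c, lp
        return best

clf = IncrementalNB()
clf.partial_fit("goal match referee".split(), "sports")
clf.partial_fit("election senate vote".split(), "politics")
print(clf.predict("referee goal today".split()))
```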
48

When Does it Mean? Detecting Semantic Change in Historical Texts

Hengchen, Simon 06 December 2017 (has links)
Contrary to what has been done to date in the hybrid field of natural language processing (NLP), this doctoral thesis holds that the new approach developed below makes it possible to semi-automatically detect semantic changes in digitised, OCRed, historical corpora. We define the term semi-automatic as "making use of an advanced tool whilst remaining in control of key decisions regarding the processing of the corpus". Although the tool utilised, topic modelling, and more precisely Latent Dirichlet Allocation (LDA), is not unknown in NLP or computational historical semantics, where it has already been mobilised to follow a priori selected words and to detect when these words change meaning, it has never been used, to the best of our knowledge, to detect which words change in a humanistically relevant way. In other words, our method does not study a word in context to gather information on that specific word, but the whole context, which we consider a witness to a potential evolution of reality, to gather more contextual information on one or several particular semantic-shift candidates. In order to detect these semantic changes, we use the algorithm to create lexical fields: groups of words that together define a subject to which they all relate. By comparing lexical fields over different time periods of the same corpus (that is, by adopting a diachronic approach), we try to determine whether new words appear in these fields over time. We argue that if a word starts to be used in a certain context at a certain time, it is a likely candidate for semantic change. Of course, the method developed here and illustrated by a case study applies to a certain context: that of digitised, OCRed, historical archives in Dutch. Nevertheless, this doctoral work also describes the advantages and disadvantages of the algorithm and postulates, on the basis of this evaluation, that the method is applicable to other fields under other conditions. By carrying out a critical evaluation of the tools available and used, this doctoral thesis invites the community to reproduce the method, whilst pointing out the approach's limitations and proposing ways to address them. / Doctorate in Information and Communication
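A heavily compressed sketch of the diachronic workflow, using scikit-learn and invented two-period toy corpora: fit a topic model per time slice, treat each topic's top words as a lexical field, and flag words that newly enter a field in the later slice as semantic-change candidates. The real method's corpora, parameters, and candidate criteria are more involved than shown here.

```python
# Toy diachronic comparison of lexical fields across two time slices.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def lexical_fields(docs, n_topics=3, top_n=10):
    """Fit LDA and return each topic's top words as a set (a 'lexical field')."""
    vec = CountVectorizer()
    X = vec.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(X)
    terms = vec.get_feature_names_out()
    return [{terms[i] for i in comp.argsort()[-top_n:]} for comp in lda.components_]

period_1 = ["the ship sailed with a full cargo of grain", "harbour cargo grain trade"]
period_2 = ["the telegraph carried the news of the cargo", "telegraph wire news office"]

fields_1 = lexical_fields(period_1)
fields_2 = lexical_fields(period_2)

# Words present in the later slice's fields but absent from all earlier fields.
old_vocab = set().union(*fields_1)
candidates = set().union(*fields_2) - old_vocab
print("possible change candidates:", sorted(candidates))
```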
49

EXPLORING PSEUDO-TOPIC-MODELING FOR CREATING AUTOMATED DISTANT-ANNOTATION SYSTEMS

Sommers, Alexander Mitchell 01 September 2021 (has links)
We explore the use of a pseudo-topic-model that imitates Latent Dirichlet Allocation (LDA), based on our original relevance metric, as a tool to facilitate distant annotation of short documents (often one to two sentences or less). Our exploration manifests as annotating tweets for emotions, this being the use case currently of interest to us, but we believe the method could be extended to any multi-class labeling task over documents of similar length. Tweets are gathered via the Twitter API using "track" terms thought likely to capture tweets with a greater chance of exhibiting each emotional class, 3,000 tweets for each of 26 topics anticipated to elicit emotional discourse. Our pseudo-topic-model is used to produce relevance-ranked vocabularies for each corpus of tweets, and these are used to distribute emotional annotations to those tweets not manually annotated, magnifying the number of annotated tweets by a factor of 29. The vector labels the annotators produce for the topics are cascaded out to the tweets via three different schemes, which are compared for performance by proxy through the competition of bidirectional LSTMs trained using the tweets labeled at a distance. An SVM and two emotionally annotated vocabularies are also tested on each task to provide context and comparison.
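A toy illustration of distant annotation via relevance-ranked vocabularies; the over-representation score used here is a simple stand-in, not the thesis's original relevance metric, and the emotion corpora are invented.

```python
# Rank each emotion corpus's vocabulary by how over-represented a term is
# there, then label unannotated texts by overlap with the ranked vocabularies.
from collections import Counter

corpora = {
    "joy":   ["what a wonderful happy day", "so happy and grateful today"],
    "anger": ["this is outrageous and unfair", "furious about the unfair delay"],
}

def relevance_ranked_vocab(corpora, top_n=5):
    totals = Counter(w for docs in corpora.values() for d in docs for w in d.split())
    ranked = {}
    for label, docs in corpora.items():
        counts = Counter(w for d in docs for w in d.split())
        # over-representation of the term in this corpus vs. the whole collection
        scores = {w: c / totals[w] for w, c in counts.items()}
        ranked[label] = [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:top_n]]
    return ranked

vocabs = relevance_ranked_vocab(corpora)

def distant_label(text, vocabs):
    words = set(text.split())
    return max(vocabs, key=lambda label: len(words & set(vocabs[label])))

print(distant_label("grateful for such a wonderful surprise", vocabs))
```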
50

As the Need Presents Itself: Social Identity Theory and Signaling in Online Crowdfunding Campaigns

Hamilton, Scott J 12 1900 (has links)
As social interactions increasingly move exclusively online, there is a need for research on the role of identity and social identity in online platforms. Drawing on Symbolic Interactionist approaches to identity, namely Social Identity Theory and Identity Theory, as well as Signaling Theory, this study argues that actors selectively use religious language to signal their credentials to an audience for the purpose of garnering prosocial behavior in the form of donations to their fundraising campaign. Using latent semantic analysis topic models to analyze the self-presentations of crowdfunding campaigners on GoFundMe.com, this study found evidence of signaling to a religious identity online as well as a significant difference in the presentation of need between campaigns originating in areas with high reported religiosity and campaigns from areas of low religiosity. Compared to other campaigns, those engaging in religious signaling received significantly more donations. I suggest that the strategic choice of religious topics in online crowdfunding is an example of low-cost identity signaling and provides insight into how signaling happens online and the potential outcomes resulting from this cultural work.
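A schematic example of latent semantic analysis over campaign texts, with invented documents; which latent topic reads as "religious language" is a manual judgment, and the study's actual measures and modeling choices may differ.

```python
# LSA as truncated SVD over TF-IDF, with documents scored on each latent topic.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

campaigns = [
    "please pray for our family god bless you all",
    "blessed and grateful asking for prayers and faith",
    "medical bills after surgery any help appreciated",
    "car repair costs help needed to get to work",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(campaigns)

svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(X)             # per-document topic scores

terms = tfidf.get_feature_names_out()
for k, comp in enumerate(svd.components_):
    top = [terms[i] for i in comp.argsort()[-4:][::-1]]
    print(f"topic {k}: {top}")
print("document-topic scores:\n", doc_topics.round(2))
```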
