Global ETD Search

1	Bayesian nonparametric models for name disambiguation and supervised learning Dai, Andrew Mingbo January 2013 This thesis presents new Bayesian nonparametric models, and approaches for their development, for the problems of name disambiguation and supervised learning. Bayesian nonparametric methods form an increasingly popular approach for solving problems that demand a high degree of model flexibility. However, this field is relatively new, and there are many areas that need further investigation. Previous work on Bayesian nonparametrics has fully explored neither the problems of entity disambiguation and supervised learning nor the advantages of nested hierarchical models. Entity disambiguation is a widely encountered problem in which different references need to be linked to a real underlying entity. This problem is often unsupervised, as there is no previously known information about the entities. In addition, effective use of Bayesian nonparametrics offers a new approach to tackling supervised problems, which are frequently encountered. The main original contribution of this thesis is a set of new structured Dirichlet process mixture models for name disambiguation and supervised learning that can also have a wide range of applications. These models use techniques from Bayesian statistics, including hierarchical and nested Dirichlet processes, generalised linear models, Markov chain Monte Carlo methods and optimisation techniques such as BFGS. The new models have tangible advantages over existing methods in the field, as shown by experiments on real-world datasets including citation databases and classification and regression datasets. I develop the unsupervised author-topic space model for author disambiguation, which uses free text to perform disambiguation, unlike traditional author disambiguation approaches. The model incorporates a name variant model that is based on a nonparametric Dirichlet language model.
The model handles novel unseen name variants and can model the unknown authors of the text of the documents. Through this, the model can disambiguate authors with no prior knowledge of the number of true authors in the dataset. In addition, it can do this when the authors have identical names. I use a model for nesting Dirichlet processes named the hybrid NDP-HDP. This model allows Dirichlet processes to be clustered together and adds an additional level of structure to the hierarchical Dirichlet process. I also develop a new hierarchical extension to the hybrid NDP-HDP. I develop this model into the grouped author-topic model for the entity disambiguation task. The grouped author-topic model uses clusters to model the co-occurrence of entities in documents, which can be interpreted as research groups. Since this model does not require entities to be linked to specific words in a document, it overcomes the problems of some existing author-topic models. The model incorporates a new method for modelling name variants, so that domain-specific name variant models can be used. Lastly, I develop extensions to supervised latent Dirichlet allocation, a type of supervised topic model. The keyword-supervised LDA model predicts document responses more accurately by modelling the effect of individual words and their contexts directly. The supervised HDP model offers more model flexibility by using Bayesian nonparametrics for supervised learning. These models are evaluated on a number of classification and regression problems, and the results show that they outperform existing supervised topic modelling approaches. The models can also be extended to use information similar to that used by the previous models, incorporating additional information such as entities and document titles to improve prediction. 519.5
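The abstract above notes that the models can disambiguate authors without knowing the number of true authors in advance. A minimal sketch of the property that makes this possible in Dirichlet process mixtures is the Chinese restaurant process, in which each new item joins an existing cluster with probability proportional to its size or opens a new cluster with probability proportional to a concentration parameter. This is only an illustration of that basic mechanism, not the thesis's hierarchical or nested models; the function name `sample_crp` and the parameter `alpha` are assumptions chosen for the example.

```python
import random


def sample_crp(n_items, alpha, seed=0):
    """Sample a partition of n_items from a Chinese restaurant process.

    alpha (> 0) is the concentration parameter: larger values make
    new clusters more likely. Returns (assignments, sizes).
    """
    rng = random.Random(seed)
    assignments = []  # assignments[i] = cluster index of item i
    sizes = []        # sizes[k] = number of items currently in cluster k
    for _ in range(n_items):
        # Existing cluster k has weight sizes[k]; a new cluster has weight alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)   # open a new cluster
        else:
            sizes[k] += 1     # join an existing cluster
        assignments.append(k)
    return assignments, sizes


if __name__ == "__main__":
    assignments, sizes = sample_crp(n_items=50, alpha=1.0, seed=1)
    print("number of clusters:", len(sizes))
    print("cluster sizes:", sizes)
```

The number of clusters is an outcome of sampling rather than an input, which is the sense in which the number of authors need not be fixed beforehand.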
2	Desambiguação de autores em bibliotecas digitais utilizando redes sociais e programação genética / Author name disambiguation in digital libraries using social networks and genetic programming Levin, Felipe Hoppe January 2010 Digital libraries have become an important source of information for scientific communities. However, because they gather data from different sources, the problem of duplicate and ambiguous information about author names arises. Traditional methods of name disambiguation use syntactic attribute information. Recently, however, the use of relationship networks, which provide semantic information, has been studied in data disambiguation. In author name disambiguation, co-authorship relations can be used to create a social network, which in turn can be used to improve author name disambiguation methods. This dissertation presents a study of the impact of adding social network analysis to author name disambiguation methods based on syntactic attribute information. We present a machine learning approach based on Genetic Programming and use it to evaluate the impact of social network analysis on author name disambiguation. Through experiments using subsets of real digital libraries, we show that the use of social network analysis significantly improves the quality of results. We also demonstrate that match functions created by our Genetic Programming approach are able to compete with state-of-the-art methods. Databases; Social agents; Name disambiguation; Relationship analysis; Social networks; Genetic programming; Match functions; Digital libraries
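To make the idea in the abstract above concrete, the following sketch combines a syntactic attribute similarity (string similarity between author names) with a simple social-network signal (overlap between co-author sets) in one match score. It is a hand-written stand-in under stated assumptions: in the dissertation the match functions are evolved by Genetic Programming, whereas the weighted sum, the Jaccard overlap, the weights, and all names here (`match_score`, `name_similarity`, `coauthor_overlap`) are hypothetical choices made only for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Syntactic similarity between two author name strings, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def coauthor_overlap(coauthors_a, coauthors_b):
    """Jaccard overlap between two co-author sets, in [0, 1]."""
    a, b = set(coauthors_a), set(coauthors_b)
    if not a and not b:
        return 0.0  # no co-authorship evidence either way
    return len(a & b) / len(a | b)


def match_score(ref_a, ref_b, w_name=0.5, w_social=0.5):
    """Weighted combination of attribute and co-authorship evidence.

    The weights are illustrative assumptions, not values from the source.
    """
    return (w_name * name_similarity(ref_a["name"], ref_b["name"])
            + w_social * coauthor_overlap(ref_a["coauthors"], ref_b["coauthors"]))


if __name__ == "__main__":
    r1 = {"name": "J. Smith", "coauthors": ["A. Lee", "B. Kumar"]}
    r2 = {"name": "John Smith", "coauthors": ["A. Lee", "C. Diaz"]}
    r3 = {"name": "John Smith", "coauthors": ["D. Wong"]}
    # r2 and r3 have the same name; only the shared co-author separates them.
    print(match_score(r1, r2), match_score(r1, r3))
```

With identical name strings, the two candidate references differ only in the co-authorship term, which is the kind of additional evidence the abstract attributes to social network analysis.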
5	Entity-Centric Text Mining for Historical Documents Coll Ardanuy, Maria 07 July 2017 No description available. 510; digital humanities; text mining; toponym disambiguation; person name disambiguation; historical text mining; Informatik (PPN619939052)
6	Um estudo comparativo entre abordagens supervisionadas para a resolução de referências a autores / A comparative study of supervised approaches for author reference resolution CANUTO, Sérgio Daniel Carvalho 25 August 2011 In this work we investigate two classes of supervised solutions for the problem of author name disambiguation, that is, deciding whether two or more author references (author names) correspond to the same person. We refer to the approaches of the first class as relational based on attributes (RBA) solutions. These approaches use similarity measures based on the textual attributes of the two references being compared, or on the attributes of other references connected to them by co-authorship. The other class of approaches uses information on semantic relationships among entities, in addition to attribute-based similarity measures, to decide whether two references refer to the same author. We refer to the approaches of this class as relational based on entities (RBE) solutions. We present a supervised version of the RBE approach based on the work introduced by Bhattacharya and Getoor [7]. In experiments on two real collections and one artificial collection, our RBE solution showed statistically significant gains in efficacy over all the other methods studied. However, the gains over the RBA methods tested are only marginal. On the other hand, the execution time of both the training and testing phases of the RBE methods is notably greater than that of the RBA methods. As far as we know, no similar study has been reported in the literature, and we consider the results relevant because they encourage research on enhancing RBA solutions. Entity resolution; Author name disambiguation
7	Improving Artist Content Matching with Stacking: A comparison of meta-level learners for stacked generalization Magnússon, Fannar January 2018 Automatically assigning incoming tracks and albums from multiple sources to artist entities at a digital rights management company, where no universal artist identifier is available and artist names can be ambiguous, is a challenging problem. In this work we propose to use stacked generalization to combine the predictions of heterogeneous classifiers for improved quality of artist content matching on two datasets from a digital rights management company. We compare the performance of a non-linear meta-level learner with that of a linear meta-level learner for the stacked generalization on the two datasets, as well as on eight additional datasets to see how well our results generalize. We conduct experiments and evaluate how the different meta-level learners perform, using as meta-features either the base learners' class probabilities or a combination of the base learners' class probabilities and the original input features. Our results indicate that stacking with a non-linear meta-level learner can improve predictions on the artist chooser problem. Furthermore, our results indicate that with a linear meta-level learner, using the base learners' class probabilities as meta-features works best, while with a non-linear meta-level learner, using a combination of the base learners' class probabilities and the original input features as meta-features works best. Among all the evaluated stacking approaches, stacking with a non-linear meta-level learner, using a combination of the base learners' class probabilities and the original input features as meta-features, performs best in our experiments over the ten evaluation datasets. Electrical engineering and electronics; Computer and Information Sciences
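The abstract above contrasts two ways of building meta-features for stacked generalization: the base learners' class probabilities alone, or those probabilities combined with the original input features. The sketch below shows only that construction step, assuming the base learners' probabilities have already been computed; the function name `build_meta_features` and the toy numbers are hypothetical and not taken from the thesis.

```python
def build_meta_features(base_probas, original_features=None):
    """Build one meta-feature row per sample for a meta-level learner.

    base_probas: list with one entry per base learner; each entry is a list
        of per-sample class-probability lists.
    original_features: optional list of per-sample feature lists. When given,
        the original features are appended after the class probabilities.
    """
    n_samples = len(base_probas[0])
    rows = []
    for i in range(n_samples):
        row = []
        for probas in base_probas:
            row.extend(probas[i])              # class probabilities of each base learner
        if original_features is not None:
            row.extend(original_features[i])   # optionally add the original inputs
        rows.append(row)
    return rows


if __name__ == "__main__":
    learner_a = [[0.9, 0.1], [0.2, 0.8]]   # toy probabilities, two samples
    learner_b = [[0.7, 0.3], [0.4, 0.6]]
    X = [[1.0, 5.0], [2.0, 6.0]]           # toy original features
    print(build_meta_features([learner_a, learner_b]))
    print(build_meta_features([learner_a, learner_b], X))
```

The meta-level learner, linear or non-linear, is then trained on these rows. In stacked generalization the base learners' probabilities used for training the meta-level learner are normally produced out-of-fold, for example by cross-validation, so that the meta-level learner does not see predictions made on data the base learners were fitted on.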
