Global ETD Search

1	Topic Modeling for Heterogeneous Digital Libraries: Tailored Approaches Using Large Language Models
Dasu, Pradyumna Upendra, 10 January 2025
Digital libraries hold vast and diverse content, and electronic theses and dissertations (ETDs) are among the most diverse. ETDs span multiple disciplines and include unique terminology, which makes it challenging to achieve clear and coherent topic representations. Existing topic modeling techniques often struggle with such heterogeneous collections, leaving a gap in providing interpretable and meaningful topic labels. This thesis addresses these challenges through a three-step framework designed to improve topic modeling outcomes for ETD metadata. First, we developed a custom preprocessing pipeline to enhance data quality and ensure consistency in text analysis. Second, we applied and optimized multiple topic modeling techniques to uncover latent themes, including LDA, ProdLDA, NeuralLDA, Contextualized Topic Models, and BERTopic. Finally, we integrated Large Language Models (LLMs), such as GPT-4, using prompt engineering to augment traditional topic models, refining and interpreting their outputs without replacing them. The framework was tested on a large corpus of ETD metadata, following preliminary testing on a small subset. Quantitative metrics and user studies were used to evaluate performance, focusing on the clarity, accuracy, and relevance of the generated topics. The results demonstrated significant improvements in topic coherence and interpretability, with user study participants highlighting the value of the enhanced representations. These findings underscore the potential of combining customized preprocessing, advanced topic modeling, and LLM-driven refinements to better represent themes in complex collections like ETDs, providing a foundation for downstream tasks such as searching, browsing, and recommendation.
Master of Science
Digital libraries store vast amounts of information, including books, research papers, and electronic theses and dissertations (ETDs). ETDs are incredibly diverse, covering most academic fields and using highly specialized language. This diversity makes it challenging to create clear and meaningful summaries of the main themes within these collections. Our study addresses this challenge by developing a three-step framework and applying it to ETDs. First, we cleaned and standardized the data to make it easier to analyze. Second, we used advanced techniques to uncover patterns and group similar topics together. Finally, we improved these topics using powerful tools like GPT-4, which helped make the themes more precise, more accurate, and easier to interpret. We tested this framework on both a small and a large collection of ETDs. Quantitative evaluations and user feedback together showed that our methods significantly improved how well the topics represented the content. This work lays the foundation for more effective future tools to help people search, explore, and navigate large collections of academic works.
Keywords: Topic Modeling; Natural Language Processing; Large Language Models; Electronic Theses and Dissertations; Digital Libraries; Information Storage and Retrieval; Artificial Intelligence; Search and Recommendation
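As a rough illustration of the three-step framework described in this abstract, the sketch below chains a toy preprocessing step, a stand-in for topic extraction, and the construction of a labeling prompt for an LLM. Everything here is an assumption made for illustration: the stop-word list, the function names, the prompt wording, and the use of simple term frequency in place of LDA or BERTopic are all hypothetical and are not taken from the thesis. No LLM is actually called.

```python
from __future__ import annotations

import re
from collections import Counter

# Hypothetical stop-word list; the thesis's actual preprocessing rules are not given.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "is", "on", "with"}


def preprocess(text: str) -> list[str]:
    """Step 1 (sketch): lowercase, keep alphabetic tokens, drop stop words and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]


def top_terms(docs: list[str], k: int = 5) -> list[str]:
    """Step 2 (stand-in): the k most frequent terms across the documents.

    A real pipeline would use a topic model such as LDA or BERTopic here;
    plain term frequency is only a placeholder to keep the sketch self-contained.
    """
    counts = Counter(tok for doc in docs for tok in preprocess(doc))
    return [term for term, _ in counts.most_common(k)]


def build_label_prompt(terms: list[str]) -> str:
    """Step 3 (sketch): build a prompt asking an LLM to label a topic.

    The wording is hypothetical; sending it to a model is left to the caller.
    """
    return (
        "The following keywords come from one topic in a collection of "
        "theses and dissertations: " + ", ".join(terms) + ".\n"
        "Suggest a short, human-readable label for this topic."
    )


if __name__ == "__main__":
    abstracts = ["Topic models for topic labels", "Neural topic models"]
    print(build_label_prompt(top_terms(abstracts, k=2)))
```

The point of the design is that the LLM only refines the output of an earlier step, which mirrors the abstract's statement that LLMs augment the traditional topic models rather than replace them.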
2	Recommandation diversifiée et distribuée pour les données scientifiques / Diversified and Distributed Recommendation for Scientific Data
Servajean, Maximilien, 16 December 2014
In many fields, new technologies for information acquisition and measurement (e.g. automated phenotyping greenhouses) have led to a phenomenal creation of data. In particular, we focus on two real use cases: plant observations in botany and phenotyping data in biology. Our contributions can, however, be generalized to Web data. In addition to their huge volume, the data are also distributed. Each user stores their data on many heterogeneous sites (e.g. personal computers, servers, cloud) and wants to be able to share them. In both use cases, collaborative solutions that include distributed search and recommendation techniques would benefit users.
Thus, the overall objective of this work is to define a set of techniques that enable the sharing and discovery of data in a heterogeneous distributed environment through search and recommendation approaches. Search and recommendation present users with results, or recommendations, that are relevant both to the queries they submit and to their profiles. Diversification techniques give users results with better novelty while avoiding redundant and repetitive content. By introducing a distance between the results presented to the user, diversity makes it possible to return a broader set of relevant items.
However, few works exploit profile diversity, which takes into account the users who share each item. In this work, we show that in some scenarios, considering profile diversity yields a substantial increase in result quality: surveys show that in more than 75% of cases, users prefer profile diversity to content diversity.
Additionally, two approaches are possible for addressing the problems of data distribution among heterogeneous sites. The first, P2P networks, establishes links between peers (nodes of the network), creating an overlay network in which the peers directly connected to a given peer p are known as its neighbors. This overlay is used to process the queries submitted by each peer. However, in state-of-the-art solutions, the redundancy of peer profiles across the various neighborhoods limits the capacity of the system to retrieve relevant items on the network for the queries submitted by users. In this work, we show that introducing diversity in the computation of the neighborhood, by increasing coverage, yields a large gain in quality. When diversity is taken into account, each peer in a given neighborhood has a higher probability of returning results that differ from those of the other peers in the neighborhood for a given keyword query. Whenever a query is submitted by a peer, our approach can retrieve up to three times more relevant items than state-of-the-art solutions.
The second category of approaches is called multi-site. Generally, in state-of-the-art multi-site solutions, the sites are homogeneous and consist of large data centers. In our context, we propose an approach that enables sharing among heterogeneous sites, such as small research team servers, personal computers, or large sites in the cloud. A prototype bringing together all contributions has been developed, with two versions, one for each of the use cases considered in this thesis.
Keywords: Search and recommendation; Profile diversity; Top-K; Peer-To-Peer; Multi-Sites; Gossip
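The idea of introducing a distance between returned results can be illustrated with a small greedy re-ranking sketch. This is not the thesis's algorithm: it is a generic greedy scheme in the style of maximal marginal relevance, and the Jaccard distance over keyword sets, the `trade_off` parameter, and all names are assumptions chosen for illustration.

```python
from __future__ import annotations


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """Distance between two keyword sets: 0.0 if identical, 1.0 if disjoint."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def diversify(
    items: dict[str, set[str]],
    relevance: dict[str, float],
    k: int,
    trade_off: float = 0.5,
) -> list[str]:
    """Greedily pick k items, balancing relevance against distance to items already picked.

    trade_off = 1.0 ranks by relevance only; lower values favor diversity.
    """
    selected: list[str] = []
    candidates = set(items)

    def score(name: str) -> float:
        if not selected:
            return relevance[name]
        # Distance to the closest already-selected item.
        min_dist = min(jaccard_distance(items[name], items[s]) for s in selected)
        return trade_off * relevance[name] + (1 - trade_off) * min_dist

    while candidates and len(selected) < k:
        # Sorting makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    items = {"a": {"rose", "red"}, "b": {"rose", "red"}, "c": {"oak", "tree"}}
    relevance = {"a": 0.9, "b": 0.8, "c": 0.5}
    print(diversify(items, relevance, k=2))
```

In this toy data, items "a" and "b" have identical keywords, so with the default trade-off the second pick is the less relevant but different item "c". The same scheme could in principle be applied to user profiles instead of item content, or to choosing the neighbors of a peer, but how the thesis actually does this is not specified in the abstract.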
