Spelling suggestions: "subject:"detector embedding"" "subject:"colector embedding""
1 |
Comparing Pso-Based Clustering Over Contextual Vector Embeddings to Modern Topic ModelingMiles, Samuel 05 1900 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / Efficient topic modeling is needed to support applications that aim at identifying main themes from a collection of documents. In this thesis, a reduced vector embedding representation and particle swarm optimization (PSO) are combined to develop a topic modeling strategy that is able to identify representative themes from a large collection of documents. Documents are encoded using a reduced, contextual vector embedding from a general-purpose pre-trained language model (sBERT). A modified PSO algorithm (pPSO) that tracks particle fitness on a dimension-by-dimension basis is then applied to these embeddings to create clusters of related documents. The proposed methodology is demonstrated on three datasets across different domains. The first dataset consists of posts from the online health forum r/Cancer. The second dataset is a collection of NY Times abstracts and is used to compare
|
2 |
Learning deep embeddings by learning to rankHe, Kun 05 February 2019 (has links)
We study the problem of embedding high-dimensional visual data into low-dimensional vector representations. This is an important component in many computer vision applications involving nearest neighbor retrieval, as embedding techniques not only perform dimensionality reduction, but can also capture task-specific semantic similarities. In this thesis, we use deep neural networks to learn vector embeddings, and develop a gradient-based optimization framework that is capable of optimizing ranking-based retrieval performance metrics, such as the widely used Average Precision (AP) and Normalized Discounted Cumulative Gain (NDCG). Our framework is applied in three applications.
First, we study Supervised Hashing, which is concerned with learning compact binary vector embeddings for fast retrieval, and propose two novel solutions. The first solution optimizes Mutual Information as a surrogate ranking objective, while the other directly optimizes AP and NDCG, based on the discovery of their closed-form expressions for discrete Hamming distances. These optimization problems are NP-hard, therefore we derive their continuous relaxations to enable gradient-based optimization with neural networks. Our solutions establish the state-of-the-art on several image retrieval benchmarks.
Next, we learn deep neural networks to extract Local Feature Descriptors from image patches. Local features are used universally in low-level computer vision tasks that involve sparse feature matching, such as image registration and 3D reconstruction, and their matching is a nearest neighbor retrieval problem. We leverage our AP optimization technique to learn both binary and real-valued descriptors for local image patches. Compared to competing approaches, our solution eliminates complex heuristics, and performs more accurately in the tasks of patch verification, patch retrieval, and image matching.
Lastly, we tackle Deep Metric Learning, the general problem of learning real-valued vector embeddings using deep neural networks. We propose a learning to rank solution through optimizing a novel quantization-based approximation of AP. For downstream tasks such as retrieval and clustering, we demonstrate promising results on standard benchmarks, especially in the few-shot learning scenario, where the number of labeled examples per class is limited.
|
3 |
Information extraction and mapping for KG construction with learned concepts from scientic documents : Experimentation with relations data for development of concept learnerMalik, Muhammad Hamza January 2020 (has links)
Systematic review of research manuscripts is a common procedure in which research studies pertaining a particular field or domain are classified and structured in a methodological way. This process involves, between other steps, an extensive review and consolidation of scientific metrics and attributes of the manuscripts, such as citations, type or venue of publication. The extraction and mapping of relevant publication data, evidently, is a very laborious task if performed manually. Automation of such systematic mapping steps intend to reduce the human effort required and therefore can potentially reduce the time required for this process.The objective of this thesis is to automate the data extraction and mapping steps when systematically reviewing studies. The manual process is replaced by novel graph modelling techniques for effective knowledge representation, as well as novel machine learning techniques that aim to learn these representations. This eventually automates this process by characterising the publications on the basis of certain sub-properties and qualities that give the reviewer a quick high-level overview of each research study. The final model is a concept learner that predicts these sub-properties which in addition addresses the inherent concept-drift of novel manuscripts over time. Different models were developed and explored in this research study for the development of concept learner.Results show that: (1) Graph reasoning techniques which leverage the expressive power in modern graph databases are very effective in capturing the extracted knowledge in a so-called knowledge graph, which allows us to form concepts that can be learned using standard machine learning techniques like logistic regression, decision trees and neural networks etc. (2) Neural network models and ensemble models outperformed other standard machine learning techniques like logistic regression and decision trees based on the evaluation metrics. (3) The concept learner is able to detect and avoid concept drift by retraining the model. / Systematisk granskning av forskningsmanuskript är en vanlig procedur där forskningsstudier inom ett visst område klassificeras och struktureras på ett metodologiskt sätt. Denna process innefattar en omfattande granskning och sammanförande av vetenskapliga mätvärden och attribut för manuskriptet, såsom citat, typ av manuskript eller publiceringsplats. Framställning och kartläggning av relevant publikationsdata är uppenbarligen en mycket mödosam uppgift om den utförs manuellt. Avsikten med automatiseringen av processen för denna typ av systematisk kartläggning är att minska den mänskliga ansträngningen, och den tid som krävs kan på så sätt minskas. Syftet med denna avhandling är att automatisera datautvinning och stegen för kartläggning vid systematisk granskning av studier. Den manuella processen ersätts av avancerade grafmodelleringstekniker för effektiv kunskapsrepresentation, liksom avancerade maskininlärningstekniker som syftar till att lära maskinen dessa representationer. Detta automatiserar så småningom denna process genom att karakterisera publikationerna beserat på vissa subjektiva egenskaper och kvaliter som ger granskaren en snabb god översikt över varje forskningsstudie. Den slutliga modellen är ett inlärningskoncept som förutsäger dessa subjektiva egenskaper och dessutom behandlar den inneboende konceptuella driften i manuskriptet över tiden. Olika modeller utvecklades och undersöktes i denna forskningsstudie för utvecklingen av inlärningskonceptet. Resultaten visar att: (1) Diagrammatiskt resonerande som uttnytjar moderna grafdatabaser är mycket effektiva för att fånga den framställda kunskapen i en så kallad kunskapsgraf, och gör det möjligt att vidareutveckla koncept som kan läras med hjälp av standard tekniker för maskininlärning. (2) Neurala nätverksmodeller och ensemblemodeller överträffade andra standard maskininlärningstekniker baserat på utvärderingsvärdena. (3) Inlärningskonceptet kan detektera och undvika konceptuell drift baserat på F1-poäng och omlärning av algoritmen.
|
4 |
COMPARING PSO-BASED CLUSTERING OVER CONTEXTUAL VECTOR EMBEDDINGS TO MODERN TOPIC MODELINGSamuel Jacob Miles (12462660) 26 April 2022 (has links)
<p>Efficient topic modeling is needed to support applications that aim at identifying main themes from a collection of documents. In this thesis, a reduced vector embedding representation and particle swarm optimization (PSO) are combined to develop a topic modeling strategy that is able to identify representative themes from a large collection of documents. Documents are encoded using a reduced, contextual vector embedding from a general-purpose pre-trained language model (sBERT). A modified PSO algorithm (pPSO) that tracks particle fitness on a dimension-by-dimension basis is then applied to these embeddings to create clusters of related documents. The proposed methodology is demonstrated on three datasets across different domains. The first dataset consists of posts from the online health forum r/Cancer. The second dataset is a collection of NY Times abstracts and is used to compare</p>
<p>the proposed model to LDA. The third is a standard benchmark dataset for topic modeling which consists of a collection of messages posted to 20 different news groups. It is used to compare state-of-the-art generative document models (i.e., ETM and NVDM) to pPSO. The results show that pPSO is able to produce interpretable clusters. Moreover, pPSO is able to capture both common topics as well as emergent topics. The topic coherence of pPSO is comparable to that of ETM and its topic diversity is comparable to NVDM. The assignment parity of pPSO on a document completion task exceeded 90% for the 20News-Groups dataset. This rate drops to approximately 30% when pPSO is applied to the same Skip-Gram embedding derived from a limited, corpus specific vocabulary which is used by ETM and NVDM.</p>
|
Page generated in 0.0708 seconds