121 |
Cluster selection for Clustered Federated Learning using Min-wise Independent Permutations and Word Embeddings / Kluster selektion för Klustrad Federerad Inlärning med användning av “Min-wise” Oberoende Permutations och Ordinbäddningar
Raveen Bandara Harasgama, Pulasthi January 2022
Federated learning is a widely established modern machine learning methodology where training is done directly on the client device with local client data and the local training results are shared to compute a global model. Federated learning emerged as a result of data ownership and the privacy concerns of traditional machine learning methodologies where data is collected and trained at a central location. However, in a distributed data environment, the training suffers significantly when the client data is not identically distributed. Hence, clustered federated learning was proposed where similar clients are clustered and trained independently to form specialized cluster models which are then used to compute a global model. In this approach, the cluster selection for clustered federated learning is a major factor that affects the effectiveness of the global model. This research presents two approaches for client clustering using local client data for clustered federated learning while preserving data privacy. The two proposed approaches use min-wise independent permutations to compute client signatures using text and word embeddings. These client signatures are then used as a representation of client data to cluster clients using agglomerative hierarchical clustering. Unlike previously proposed clustering methods, the two presented approaches do not use model updates, provide a better privacy-preserving mechanism and have a lower communication overhead. With extensive experimentation, we show that the proposed approaches outperform the random clustering approach. Finally, we present a client clustering methodology that can be utilized in a practical clustered federated learning environment. / Federerad inlärning är en etablerad och modern maskininlärnings metod. Träningen är utförd direkt på klientenheten med lokal klient data. Sen är dem lokala träningsresultat delad för att beräkna en global modell. 
Federerad inlärning har utvecklats på grund av dataägarskap- och dataintegritetsproblem vid traditionella maskininlärnings metoder. Dessa metoder samlar och tränar data på en central enhet. I den här metoden är kluster selektionen en viktig faktor som påverkar effektiviteten av den globala modellen. Detta forskningsarbete presenterar två metoder för klient klustring med hjälp av lokala klientdata för federerad inlärning samtidigt tar metoderna hänsyn på dataintegritet. Metoderna använder “min-wise” oberoende permutations och förtränade (“text och word”) inbäddningar. Dessa klientsignaturer används som en klientdata representation för att klustrar klienter med hjälp av agglomerativ hierarkisk klustring. Till skillnad från tidigare klustringsmetoder använder de två presenterade metoderna inte modelluppdateringar. Detta ger en bättre sekretessbevarande mekanism och har lägre kommunikationskostnader. De två presenterade metoderna överträffar den slumpmässiga klustringsmetoden genom omfattande experiment och analys. Till slut presenterar vi en klientklustermetodik som kan användas i en praktisk klustrad federerad inlärningsmiljö.
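The client-signature idea in the abstract above can be illustrated with a minimal MinHash sketch. It is not the thesis's implementation: the word-shingle representation, the seeded SHA-1 hashes standing in for min-wise independent permutations, and the function names `shingles`, `minhash_signature` and `estimated_jaccard` are all assumptions made for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of word k-grams (shingles) in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(items, num_hashes=64):
    """Build a signature by keeping the minimum hash value per seeded hash.

    Seeded SHA-1 hashes are used here as a stand-in for min-wise
    independent permutations (an assumption of this sketch).
    """
    signature = []
    for seed in range(num_hashes):
        prefix = str(seed).encode()
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(prefix + b":" + item.encode()).digest()[:8], "big"
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Hypothetical client texts; only the signatures would leave the client.
client_a = minhash_signature(shingles("the weather is sunny and warm today"))
client_b = minhash_signature(shingles("stock prices fell sharply after the report"))
```

Signatures of this kind could then be fed to a hierarchical clustering routine using `1 - estimated_jaccard(...)` as the distance, so that raw client text never has to be shared.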
122 |
A Bridge between Graph Neural Networks and Transformers: Positional Encodings as Node Embeddings
Manu, Bright Kwaku 01 December 2023
Graph Neural Networks and Transformers are powerful frameworks for machine learning tasks. Although they evolved separately in different fields, recent research has revealed similarities and links between them. This work focuses on bridging the gap between GNNs and Transformers by offering a uniform framework that highlights their similarities and distinctions. We compute positional encodings and identify the key properties that allow them to serve as node embeddings. We found that the properties of expressiveness, efficiency and interpretability were achieved in the process. We saw that it is possible to use positional encodings as node embeddings, which can be used for machine learning tasks such as node classification, graph classification, and link prediction. We discuss some challenges and provide future directions.
123 |
Extending a Text Classifier to Multiple Languages / Utöka en textklassificeringsmodell till flera språk
Byström, Albin January 2021
This thesis explores the possibility of extending monolingual and bilingual text classifiers to multiple languages. Two different language models are explored: language-aligned word embeddings and a transformer model. The goal was to take a classifier based on Swedish and English samples and extend it to Danish, German, and Finnish samples. The results show that extending a text classifier by word embedding alignment or by fine-tuning a multilingual transformer model is possible, but with varying accuracy depending on the language. / Denna avhandling undersöker möjligheten att utvidga enspråkiga och tvåspråkiga textklassificatorer till flera språk. Två olika språkmodeller utforskas, justeras ordinbäddningar och en transformatormodell. Målet var att ta en klassificerare baserad på svenska och engelska texter och utvidga den till danska, tyska och finska texter. Resultatet visar att det är möjligt att utöka en textklassificering med ordinbäddning eller genom att finjustera en flerspråkig transformatormodell, men träffsäkerheten varierar beroende på språk.
124 |
Readability: Man and Machine : Using readability metrics to predict results from unsupervised sentiment analysis / Läsbarhet: Människa och maskin : Användning av läsbarhetsmått för att förutsäga resultaten från oövervakad sentimentanalys
Larsson, Martin, Ljungberg, Samuel January 2021
Readability metrics assess the ease with which human beings read and understand written texts. With the advent of machine learning techniques that allow computers to also analyse text, this provides an interesting opportunity to investigate whether readability metrics can be used to inform on the ease with which machines understand texts. To that end, the specific machine analysed in this paper uses word embeddings to conduct unsupervised sentiment analysis. This specification minimises the need for labelling and human intervention, thus relying heavily on the machine instead of the human. Across two different datasets, sentiment predictions are made using Google’s Word2Vec word embedding algorithm, and are evaluated to produce a dichotomous output variable per sentiment. This variable, representing whether a prediction is correct or not, is then used as the dependent variable in a logistic regression with 17 readability metrics as independent variables. The resulting model has high explanatory power and the effects of readability metrics on the results from the sentiment analysis are mostly statistically significant. However, metrics affect sentiment classification in the two datasets differently, indicating that the metrics are expressions of linguistic behaviour unique to the datasets. The implication of the findings is that readability metrics could be used directly in sentiment classification models to improve modelling accuracy. Moreover, the results also indicate that machines are able to pick up on information that human beings do not pick up on, for instance that certain words are associated with more positive or negative sentiments. / Läsbarhetsmått bedömer hur lätt eller svårt det är för människor att läsa och förstå skrivna texter. Eftersom nya maskininlärningstekniker har utvecklats kan datorer numera också analysera texter. 
Därför är en intressant infallsvinkel huruvida läsbarhetsmåtten också kan användas för att bedöma hur lätt eller svårt det är för maskiner att förstå texter. Mot denna bakgrund använder den specifika maskinen i denna uppsats ordinbäddningar i syfte att utföra oövervakad sentimentanalys. Således minimeras behovet av etikettering och mänsklig handpåläggning, vilket resulterar i en mer djupgående analys av maskinen istället för människan. I två olika dataset jämförs rätt svar mot sentimentförutsägelser från Googles ordinbäddnings-algoritm Word2Vec för att producera en binär utdatavariabel per sentiment. Denna variabel, som representerar om en förutsägelse är korrekt eller inte, används sedan som beroende variabel i en logistisk regression med 17 olika läsbarhetsmått som oberoende variabler. Den resulterande modellen har högt förklaringsvärde och effekterna av läsbarhetsmåtten på resultaten från sentimentanalysen är mestadels statistiskt signifikanta. Emellertid är effekten på klassificeringen beroende på dataset, vilket indikerar att läsbarhetsmåtten ger uttryck för olika lingvistiska beteenden som är unika till datamängderna. Implikationen av resultaten är att läsbarhetsmåtten kan användas direkt i modeller som utför sentimentanalys för att förbättra deras prediktionsförmåga. Dessutom indikerar resultaten också att maskiner kan plocka upp på information som människor inte kan, exempelvis att vissa ord är associerade med positiva eller negativa sentiment.
125 |
Identifying New Fault Types Using Transformer Embeddings
Karlsson, Mikael January 2021
Continuous integration, delivery and deployment involve many automated tests, some of which may fail, leading to faulty software. Similar faults may occur in different stages of the software production lifecycle, and it is necessary to identify similar faults and cluster them into fault types in order to minimize troubleshooting time. Pretrained transformer-based language models have been shown to achieve state-of-the-art results in many natural language processing tasks, such as measuring semantic textual similarity. This thesis investigates whether it is possible to cluster and identify new fault types by using a transformer-based model to create context-aware vector representations of fault records, which consist of numerical data and logs with domain-specific technical terms. The clusters created were compared against the clusters created by an existing system, where log files are grouped by manually specified filters. Relying on already existing fault types with associated log data, this thesis shows that it is possible to fine-tune a transformer-based model for a classification task in order to improve the quality of text embeddings. The embeddings are clustered using density-based and hierarchical clustering algorithms with cosine distance. The results show that it is possible to cluster log data and get results comparable to those of the existing manual system, where the cluster similarity was assessed with V-measure and Adjusted Rand Index. / Kontinuerlig integration består automatiserade tester där det finns risk för att några misslyckas vilket kan leda till felaktig programvara. Liknande fel kan uppstå under olika faser av en programvarans livscykel och det är viktigt att identifiera och gruppera olika feltyper för att optimera felsökningsprocessen. Det har bevisats att språkmodeller baserade på transformatorarkitekturen kan uppnå höga resultat i många uppgifter inom språkteknologi, inklusive att mäta semantisk likhet mellan två texter. 
Detta arbete undersöker om det är möjligt att gruppera och identifiera nya feltyper genom att använda en transformatorbaserad språkmodell för att skapa numeriska vektorer av loggtext, som består av domänspecifika tekniska termer och numerisk data. Klustren jämförs mot redan existerande grupperingar som skapats av ett befintligt system där feltyper identifieras med manuellt skrivna filter. Det här arbetet visar att det går att förbättra vektorrepresenationerna skapade av en språkmodell baserad på transformatorarkitekturen genom att tilläggsträna modellen för en klassificeringsuppgift. Vektorerna grupperas med hjälp av densitetsbaserade och hierarkiska klusteralgoritmer. Resultaten visar att det är möjligt att skapa vektorer av logg-texter med hjälp av en transformatorbaserad språkmodell och få jämförbara resultat som ett befintligt manuellt system, när klustren evaluerades med V-måttet och Adjusted Rand Index.
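The abstract above compares two clusterings with the Adjusted Rand Index. As a rough illustration of what that score measures, here is a minimal pure-Python sketch of the standard pair-counting formula. The function name `adjusted_rand_index` and the toy labels are assumptions for illustration, not taken from the thesis, and the sketch assumes at least two labeled items.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Pair-counting Adjusted Rand Index between two labelings.

    Assumes both label lists have the same length, with at least two items.
    """
    n = len(labels_a)
    # Contingency counts: how many items share each (cluster_a, cluster_b) pair.
    cell_counts = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical fault-type labels from a manual system and from a clustering.
manual = [0, 0, 1, 1]
clustered = [1, 1, 0, 0]
score = adjusted_rand_index(manual, clustered)
```

The score does not depend on how clusters are named, only on which records are grouped together, which is why it suits a comparison between manually filtered groups and automatically found clusters.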
126 |
Finding Street Gang Member Profiles on Twitter
Balasuriya, Lakshika January 2017
No description available.
127 |
Exploring Language Descriptions through Vector Space Models
Aleksandrova, Anastasiia January 2024
The abundance of natural languages and the complexities involved in describing their structures pose significant challenges for modern linguists, not only in documentation but also in the systematic organization of knowledge. Computational linguistics tools hold promise in comprehending the “big picture”, provided existing grammars are digitized and made available for analysis using state-of-the-art language models. Extensive efforts have been made by an international team of linguists to compile such a knowledge base, resulting in the DReaM corpus – a comprehensive dataset comprising tens of thousands of digital books containing multilingual language descriptions. However, there remains a lack of tools that facilitate understanding of concise language structures and uncovering overlooked topics and dialects. This thesis represents a small step towards elucidating the broader picture by utilizing a subset of the DReaM corpus as a vector space capable of capturing genetic ties among described languages. To achieve this, we explore several encoding algorithms in conjunction with various segmentation strategies and vector summarization approaches for generating both monolingual and cross-lingual feature representations of selected grammars in English and Russian. Our newly proposed sentence-facets TF-IDF model shows promise in unsupervised generation of monolingual representations, conveying sufficient signal to differentiate historical linguistic relations among 484 languages from 26 language families based on their descriptions. However, the construction of a cross-lingual vector space necessitates further exploration of advanced technologies.
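For readers unfamiliar with TF-IDF, the following is a minimal sketch of the plain, textbook weighting. It is not the sentence-facets TF-IDF model proposed in the thesis, whose details the abstract does not give. The whitespace tokenizer, the function name `tfidf_vectors` and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse {term: weight} dict per document.

    Uses relative term frequency and idf = log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical snippets standing in for two grammar descriptions.
vectors = tfidf_vectors(["verb order verb", "noun order"])
```

Terms that occur in every description, such as "order" here, receive a weight of zero, so the vectors emphasize what distinguishes one description from another.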
128 |
Continuous Appearance for Material Textures with Neural Rendering : Using multiscale embeddings for efficient rendering of material textures at any scale in 3D engines. / Kontinuerligt Utseende för Materialtexturer med Neural Rendering : Användning av flerskaliga inbäddningar för effektiv rendering av materialtexturer i alla skalor i 3D-motorer.
de Oliveira, Louis January 2024
Neural Rendering has recently shown potential for real-time applications such as video games. However, current state-of-the-art Neural Rendering approaches still suffer from a high memory footprint and often require multiple inferences of large neural networks to produce a properly filtered output. This cost associated with filtering the output of Neural Rendering models makes real-time multiscale rendering difficult. In this work, we propose a neural architecture based on multiscale embeddings that takes advantage of current rasterization pipelines to produce a filtered output in a single evaluation, allowing for a continuous appearance through scale using a very small neural network. The model is trained directly on a filtered signal in order to learn a continuous representation of the material instead of relying on a post-processing step. The proposed architecture enables efficient sampling on GPU both in texel position and in level of detail, and closely reproduces material textures while drastically reducing their memory footprint. The results show that this approach is a viable candidate for integration in rendering pipelines, as it can be inferred efficiently in regular fragment shaders and on consumer-level hardware, inducing less than 1 millisecond of overhead compared to traditional pipelines while producing an output of similar quality with a 33% reduction in memory footprint. The model also produces a smooth reconstruction through scale, free of artifacts and visual discontinuities that would typically be observed for an unfiltered output. / Neural rendering har på senare år visat potential i realtidsapplikationer som t ex inom dataspel. Dessvärre begränsas dagens state-of-the-art metoder inom neural rendering av hög minnesanvändning och kräver ofta att multipla inferenser görs av relativt stora neuronnät för att skapa adekvat filtrerade resultat. Det är därför svårt att direkt tillämpa neural rendering i spelutveckling. 
I detta arbete föreslås en neural arkitektur som baserar sig på multiscale embeddings som tar tillvara på egenskaperna hos dagens renderingspipelines för att producera adekvat filtrerade resultat med endast en inferens, vilket möjliggör kontinuerliga utseendeegenskaper genom skalning med ett mycket litet neuronnät. Modellen tränas direkt på en filtrerad signal för att lära en kontinuerlig representation av materialet istället för att behöva ett separat post-processingsteg. Den föreslagna arkitekturen möjliggör effektiv sampling på GPU både i texelposition och level of detail, och reproducerar materialtexturerna väl, samtidigt som den reducerar minnesanvändningen drastiskt. Resultaten visar att denna metod är en gångbar kandidat för integration i en renderingspipeline, eftersom den kan inferreras effektivt i en vanlig fragmentsshader på konsumenthårdvara med under en millisekunds tidstillägg jämfört med en traditionell pipeline utan avkall på kvalitet med 33% lägre minnesanvändning. Modellen producerar också en slät rekonstruktion genom skalning, fri från artefakter och visuella diskontinuiteter som annars ofta syns i ett ofiltrerat resultat.
129 |
Discovering Implant Terms in Medical Records
Jerdhaf, Oskar January 2021
Implant terms are terms like "pacemaker" which indicate the presence of artifacts in the body of a human. These implant terms are key to determining if a patient can safely undergo Magnetic Resonance Imaging (MRI). However, identifying these terms in medical records is time-consuming, laborious and expensive, but necessary for taking the correct precautions before an MRI scan. Automating this process is of great interest to radiologists as it ideally saves time, prevents mistakes and as a result saves lives. The electronic medical records (EMR) contain the documented medical history of a patient, including any implants or objects that an individual would have inside their body. Information about such objects and implants is of great interest when determining if and how a patient can be scanned using MRI. This information is unfortunately not easily extracted through automatic means. The sparse presence of these terms and the unusual structure of medical records compared to most written text make the extraction very difficult to automate using simple means. By leveraging the recent advancements in Artificial Intelligence (AI), this thesis explores the ability to identify and extract such terms automatically in Swedish EMRs. For the task of identifying implant terms in medical records a generally trained Swedish Bidirectional Encoder Representations from Transformers (BERT) model is used, which is then fine-tuned on Swedish medical records. Using this model a variety of approaches are explored, namely BERT-KDTree, BERT-BallTree, Cosine Brute Force and unsupervised NER. The results show that BERT-KDTree and BERT-BallTree are the most rewarding methods. Results from both methods have been evaluated by domain experts and appear promising for such an early stage, given the difficulty of the task. 
The evaluation of BERT-BallTree shows that multiple methods of extraction may be preferable as they provide different but still useful terms. Cosine brute force is deemed to be an unrealistic approach due to computational and memory requirements. The NER approach was deemed too impractical and laborious to justify for this study, yet is potentially useful if not more suitable given a different set of conditions and goals. While there is much to be explored and improved, these experiments are a clear indication that automatic identification of implant terms is possible, as a large number of implant terms were successfully discovered using automated means.
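The "Cosine Brute Force" approach named above can be pictured as comparing a query embedding against every candidate embedding. The sketch below is a minimal illustration under stated assumptions: the two-dimensional vectors are made up rather than produced by BERT, and `cosine_similarity` and `nearest_terms` are hypothetical helper names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_terms(query, vocabulary, top_k=2):
    """Score every term against the query and return the top_k most similar."""
    scored = [(cosine_similarity(query, vec), term) for term, vec in vocabulary.items()]
    scored.sort(reverse=True)
    return [term for _, term in scored[:top_k]]


# Made-up toy embeddings; real ones would come from the fine-tuned model.
vocabulary = {
    "pacemaker": [0.9, 0.1],
    "stent": [0.8, 0.3],
    "headache": [0.1, 0.9],
}
candidates = nearest_terms([1.0, 0.0], vocabulary)
```

Every query touches the whole vocabulary, so the cost grows linearly with its size. That is consistent with the abstract's remark on computational and memory requirements, and it is the motivation for tree-based indexes such as KD-trees and ball trees.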
130 |
Learning representations for Information Retrieval
Sordoni, Alessandro 03 1900
La recherche d'informations s'intéresse, entre autres, à répondre à des questions comme: est-ce qu'un document est pertinent à une requête ?
Est-ce que deux requêtes ou deux documents sont similaires ? Comment la similarité entre deux requêtes ou documents peut être utilisée pour améliorer
l'estimation de la pertinence ? Pour donner réponse à ces questions, il est nécessaire d'associer chaque document et requête à des représentations interprétables
par ordinateur. Une fois ces représentations estimées, la similarité peut correspondre, par exemple, à une distance ou une divergence qui opère dans l'espace de représentation.
On admet généralement que la qualité d'une représentation a un impact direct sur l'erreur d'estimation par rapport à la vraie pertinence, jugée par un humain.
Estimer de bonnes représentations des documents et des requêtes a longtemps été un problème central de la recherche d'informations.
Le but de cette thèse est de proposer des nouvelles méthodes pour estimer les représentations des documents et des requêtes, la relation de pertinence entre eux et ainsi modestement avancer l'état de l'art du domaine.
Nous présentons quatre articles publiés dans des conférences internationales et un article publié dans un forum d'évaluation. Les deux premiers articles concernent des méthodes qui créent l'espace de représentation selon une connaissance à priori sur les caractéristiques qui sont importantes pour la tâche à accomplir. Ceux-ci nous amènent à présenter un nouveau modèle de recherche d'informations qui diffère des modèles existants sur le plan théorique et de l'efficacité expérimentale. Les deux derniers articles marquent un changement fondamental dans l'approche de construction des représentations. Ils bénéficient notamment de l'intérêt de recherche dont les techniques d'apprentissage profond par réseaux de neurones, ou deep learning, ont fait récemment l'objet. Ces modèles d'apprentissage élicitent automatiquement les caractéristiques importantes pour la tâche demandée à partir d'une quantité importante de données. Nous nous intéressons à la modélisation des relations sémantiques entre documents et requêtes ainsi qu'entre deux ou plusieurs requêtes. Ces derniers articles marquent les premières applications de l'apprentissage de représentations par réseaux de neurones à la recherche d'informations. Les modèles proposés ont aussi produit une performance améliorée sur des collections de test standard. Nos travaux nous mènent à la conclusion générale suivante: la performance en recherche d'informations pourrait drastiquement être améliorée en se basant sur les approches d'apprentissage de représentations. / Information retrieval is generally concerned with answering questions such as: is this document relevant to this query?
How similar are two queries or two documents?
How can query and document similarity be used to enhance relevance estimation?
In order to answer these questions, it is necessary to access computational representations of documents and queries.
For example, similarities between documents and queries may correspond to a distance or a divergence defined on the representation space.
It is generally assumed that the quality of the representation has a direct impact on the bias with respect to the true similarity, estimated by means of human intervention.
Building useful representations for documents and queries has always been central to information retrieval research.
The goal of this thesis is to provide new ways of estimating such representations and the relevance relationship between them.
We present four articles that have been published in international conferences and one published in an information retrieval evaluation
forum. The first two articles can be categorized as feature engineering approaches, which transduce a priori knowledge about the domain into the features of the representation.
We present a novel retrieval model that compares favorably to existing models in terms of both theoretical originality and experimental effectiveness.
The remaining two articles mark a significant change in our vision and originate from the widespread interest in deep learning research that took place during the time they were written.
Therefore, they naturally belong to the category of representation learning approaches, also known as feature learning. Unlike previous approaches, the learning model discovers on its own the most important features for the task at hand, given a considerable amount of labeled data. We propose to model the semantic relationships between documents and queries and between queries themselves.
The models presented have also shown improved effectiveness on standard test collections. These last articles are amongst the first applications of representation learning with neural networks for information retrieval. This line of research leads to the following observation: future improvements in information retrieval effectiveness have to rely on representation learning techniques instead of manually defining the representation space.