41

On the dimension of the square of a compact metric space X of dimension n and the set of embeddings of X into R^2n

Melo, Givanildo Donizeti de [UNESP] 23 March 2016 (has links)
Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) / In this work we study the following result: given a compact metric space X of dimension n, the subspace consisting of all embeddings of X into R^2n is dense in the space of all continuous maps of X into R^2n if and only if dim(X × X) < 2n. The proof presented is the one given by J. Krasinkiewicz and S. Spiez.
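Stated compactly, the theorem whose Krasinkiewicz–Spiez proof the dissertation presents reads:

```latex
\begin{theorem}[Krasinkiewicz--Spiez]
Let $X$ be a compact metric space with $\dim X = n$. Then the set of
embeddings $\operatorname{Emb}(X, \mathbb{R}^{2n})$ is dense in the
space $C(X, \mathbb{R}^{2n})$ of continuous maps (with the uniform
metric) if and only if
\[
  \dim (X \times X) < 2n .
\]
\end{theorem}
```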
42

Predicting Gene Functions and Phenotypes by combining Deep Learning and Ontologies

Kulmanov, Maxat 08 April 2020 (has links)
The amount of available protein sequences is rapidly increasing, mainly as a consequence of the development and application of high-throughput sequencing technologies in the life sciences. It is a key question in the life sciences to identify the functions of proteins, and furthermore to identify the phenotypes that may be associated with a loss (or gain) of function in these proteins. Protein functions are generally determined experimentally, and it is clear that experimental determination of protein functions will not scale to the current, and rapidly increasing, amount of available protein sequences (over 300 million). Furthermore, identifying phenotypes resulting from loss of function is even more challenging, as the phenotype is modified by whole-organism interactions and environmental variables. It is clear that accurate computational prediction of protein functions and loss-of-function phenotypes would be of significant value both to academic research and to the biotechnology industry. We developed and expanded novel methods for representation learning, predicting protein functions and their loss-of-function phenotypes. We use deep neural network algorithms and combine them with symbolic inference into neural-symbolic algorithms. Our work significantly improves previously developed methods for predicting protein functions through methodological advances in machine learning, incorporation of broader data types that may be predictive of functions, and improved systems for neural-symbolic integration. The methods we developed are generic and can be applied to other domains in which similar types of structured and unstructured information exist. In the future, our methods can be applied to the prediction of protein function for metagenomic samples in order to evaluate the potential for discovery of novel proteins of industrial value. Our methods can also be applied to the prediction of loss-of-function phenotypes in human genetics, with the results incorporated into a variant prioritization tool for diagnosing patients with Mendelian disorders.
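One ingredient that neural-symbolic methods of this kind typically need is making raw network outputs consistent with the ontology hierarchy. Below is a minimal sketch of that step under a true-path-rule assumption (the function names and toy GO terms are illustrative, not taken from the thesis):

```python
from collections import defaultdict, deque

def topological_order(parents):
    """Order terms so that every term comes before its parents
    (children are processed first along child -> parent edges)."""
    indegree = defaultdict(int)   # number of unprocessed children per term
    edges = defaultdict(list)    # term -> its parents
    terms = set(parents)
    for term, ps in parents.items():
        for p in ps:
            edges[term].append(p)
            indegree[p] += 1
            terms.add(p)
    queue = deque(t for t in terms if indegree[t] == 0)   # leaves first
    order = []
    while queue:
        t = queue.popleft()
        order.append(t)
        for p in edges[t]:
            indegree[p] -= 1
            if indegree[p] == 0:
                queue.append(p)
    return order

def propagate_scores(scores, parents):
    """True-path rule: a term's score becomes at least the maximum score
    of any of its descendants, so predictions respect the ontology DAG."""
    consistent = dict(scores)
    for term in topological_order(parents):
        for parent in parents.get(term, []):
            consistent[parent] = max(consistent.get(parent, 0.0),
                                     consistent.get(term, 0.0))
    return consistent

# Toy usage: two functions under one root term.
parents = {"GO:f1": ["GO:root"], "GO:f2": ["GO:root"], "GO:root": []}
raw = {"GO:f1": 0.9, "GO:f2": 0.2, "GO:root": 0.5}
print(propagate_scores(raw, parents))   # GO:root is lifted to 0.9
```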
43

Towards Learning Compact Visual Embeddings using Deep Neural Networks

January 2019 (has links)
Feature embeddings differ from raw features in the sense that the former obey certain properties, like a notion of similarity/dissimilarity, in their embedding space. word2vec is a preeminent example in this direction, where similarity in the embedding space is measured in terms of cosine similarity. Such language embedding models have seen numerous applications in both the language and vision communities, as they capture the information in a modality (the English language) efficiently. Inspired by these language models, this work focuses on learning embedding spaces for two visual computing tasks: (1) image hashing and (2) zero-shot learning. The training set was used to learn embedding spaces over which similarity/dissimilarity is measured using several distance metrics, such as Hamming, Euclidean, and cosine distances. While the above-mentioned language models learn generic word embeddings, in this work task-specific embeddings were learnt, which can be used for image retrieval and classification separately. Image hashing is the task of mapping images to binary codes such that some notion of user-defined similarity is preserved. The first part of this work focuses on designing a new framework that uses the hash-tags associated with web images to learn the binary codes. Such codes can be used in several applications, like image retrieval and image classification. Further, this framework requires no labelled data, making it very inexpensive. Results show that the proposed approach surpasses state-of-the-art approaches by a significant margin. Zero-shot classification is the task of classifying a test sample into a new class that was not seen during training. This is possible by establishing a relationship between the training and the testing classes using auxiliary information. In the second part of this thesis, a framework is designed that trains using hand-crafted attribute vectors and word vectors but doesn't require the expensive attribute vectors at test time. More specifically, an intermediate space is learnt between the word vector space and the image feature space using the hand-crafted attribute vectors. Preliminary results on two zero-shot classification datasets show that this is a promising direction to explore. / Dissertation/Thesis / Masters Thesis Computer Engineering 2019
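Once binary codes have been learnt, retrieval reduces to Hamming-distance ranking, which is cheap to compute with array operations. A small illustrative sketch (the codes here are random stand-ins binarized by sign, not the thesis's learned codes):

```python
import numpy as np

rng = np.random.default_rng(0)

def binarize(embeddings):
    """Sign-binarize real-valued embeddings into 0/1 codes."""
    return (embeddings > 0).astype(np.uint8)          # shape (n, bits)

# Stand-ins for learned 64-bit codes over a database of 10,000 images.
database = binarize(rng.standard_normal((10_000, 64)))
query = binarize(rng.standard_normal((1, 64)))

# Hamming distance = number of differing bits between codes.
distances = np.count_nonzero(database != query, axis=1)
top10 = np.argsort(distances)[:10]                    # nearest codes
print(top10, distances[top10])
```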
44

Neural Enhancement Strategies for Robust Speech Processing

Nawar, Mohamed Nabih Ali Mohamed 10 March 2023 (has links)
In real-world scenarios, speech signals are often contaminated with environmental noise and reverberation, which degrade speech quality and intelligibility. Lately, the development of deep learning algorithms has marked milestones in speech-based research fields, e.g. speech recognition, spoken language understanding, etc. As one of the crucial topics in the speech processing research area, speech enhancement aims to restore clean speech signals from noisy signals. In recent decades, many conventional statistics-based speech enhancement algorithms have been proposed. However, the performance of these approaches is limited in non-stationary noisy conditions. The rise of deep learning-based approaches for speech enhancement has led to revolutionary advances in performance. In this context, speech enhancement is formulated as a supervised learning problem, which tackles the open challenges left by conventional speech enhancement approaches. In general, deep learning speech enhancement approaches are categorized into frequency-domain and time-domain approaches. In particular, we experiment with the performance of the Wave-U-Net model, a solid and superior time-domain approach for speech enhancement. First, we attempt to improve the performance of back-end speech-based classification tasks in noisy conditions. In detail, we propose a pipeline that integrates the Wave-U-Net (later modified into the Dilated Encoder Wave-U-Net) as a pre-processing stage for noise elimination, followed by a temporal convolution network (TCN) for the intent classification task. Both models are trained independently of each other. Reported experimental results show that the modified Wave-U-Net model not only improves speech quality and intelligibility, measured in terms of the PESQ and STOI metrics, but also improves back-end classification accuracy. Later, it was observed that the disjoint training approach often introduces signal distortion in the output of the speech enhancement module and can thus deteriorate back-end performance. Motivated by this, we introduce a set of fully time-domain joint training pipelines that combine the Wave-U-Net model with the TCN intent classifier. The difference between these architectures lies in the interconnections between the front-end and the back-end. All architectures are trained with a loss function that combines the MSE loss for the front-end with the cross-entropy loss for the classification task. Based on our observations, we claim that the joint-training architecture that equally balances both components' contributions yields the best classification accuracy. Lately, the release of large-scale pre-trained feature extraction models has considerably simplified the development of speech classification and recognition algorithms. However, environmental noise and reverberation still negatively affect performance, making robustness in noisy conditions mandatory in real-world applications. One way to mitigate the noise effect is to integrate a speech enhancement front-end that removes artifacts from the desired speech signals. Unlike state-of-the-art enhancement approaches that operate either on speech spectrograms or directly on time-domain signals, we study how enhancement can be applied directly to speech embeddings extracted using the Wav2Vec and WavLM models.
We investigate a variety of training approaches, considering different flavors of joint and disjoint training of the speech enhancement front-end and the classification/recognition back-end. We perform exhaustive experiments on the Fluent Speech Commands and Google Speech Commands datasets, contaminated with noises from the Microsoft Scalable Noisy Speech Dataset, as well as on LibriSpeech, contaminated with noises from the MUSAN dataset, considering intent classification, keyword spotting, and speech recognition tasks, respectively. Results show that enhancing the speech embedding is a viable and computationally effective approach, and provide insights about the most promising training approaches.
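The joint-training objective described above, a front-end reconstruction term combined with a back-end classification term, can be sketched as follows (the weighting parameter alpha is an assumption introduced here for illustration; the abstract reports that an equal balance between the two components worked best):

```python
import torch
import torch.nn.functional as F

def joint_loss(enhanced, clean, logits, intent_labels, alpha=0.5):
    """Combined loss for a jointly trained enhancement + classification
    pipeline: MSE between enhanced and clean waveforms (front-end) plus
    cross-entropy on intent predictions (back-end).
    alpha=0.5 reflects the equal balance the abstract reports works best."""
    mse = F.mse_loss(enhanced, clean)            # speech-enhancement loss
    ce = F.cross_entropy(logits, intent_labels)  # intent-classification loss
    return alpha * mse + (1.0 - alpha) * ce
```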
45

Improving Document Clustering by Refining Overlapping Cluster Regions

Upadhye, Akshata Rajendra January 2022 (has links)
No description available.
46

“Embed, embed! There’s knocking at the gate.”

Burghardt, Manuel, Liebl, Bernhard 30 May 2024 (has links)
The detection of intertextual references in text corpora is a digital humanities topic that has gained a lot of attention in recent years. While intertextuality, from a literary studies perspective, describes the phenomenon of one text being present in another text, the computational problem at hand is the task of text similarity detection, and more concretely, semantic similarity detection. In this notebook, we introduce the Vectorian as a framework to build queries through word embeddings such as fastText and GloVe. We evaluate the influence of computing document similarity through alignments such as Waterman-Smith-Beyer and two variants of Word Mover's Distance. We also investigate the performance of state-of-the-art sentence embeddings like Siamese BERT networks for the task, both as document embeddings and as contextual token embeddings. Overall, we find that Waterman-Smith-Beyer with fastText offers highly competitive performance. The notebook can also be used to upload new data for performing custom search queries.
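Alignment-based document similarity of the kind evaluated here can be sketched as a dynamic program over token-embedding similarities. Below is a simplified global alignment with a linear gap penalty (Waterman-Smith-Beyer proper allows general gap cost functions; the random vectors and penalty value are placeholders, not the Vectorian's actual configuration):

```python
import numpy as np

def alignment_similarity(emb_a, emb_b, gap=0.2):
    """Global alignment score between two token sequences given their
    L2-normalised embeddings; match reward = cosine similarity."""
    n, m = len(emb_a), len(emb_b)
    sim = emb_a @ emb_b.T                     # pairwise cosine similarities
    dp = np.zeros((n + 1, m + 1))
    dp[:, 0] = -gap * np.arange(n + 1)        # leading gaps in b
    dp[0, :] = -gap * np.arange(m + 1)        # leading gaps in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i, j] = max(dp[i-1, j-1] + sim[i-1, j-1],  # align tokens
                           dp[i-1, j] - gap,              # gap in b
                           dp[i, j-1] - gap)              # gap in a
    return dp[n, m]

# Toy usage with random unit vectors standing in for fastText embeddings.
rng = np.random.default_rng(1)
a = rng.standard_normal((5, 50)); a /= np.linalg.norm(a, axis=1, keepdims=True)
b = rng.standard_normal((7, 50)); b /= np.linalg.norm(b, axis=1, keepdims=True)
print(alignment_similarity(a, b))
```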
47

TimeLink: Visualizing Diachronic Word Embeddings and Topics

Williams, Lemara Faith 11 June 2024 (has links)
The task of analyzing a collection of documents generated over time is daunting. A natural way to ease the task is by summarizing the documents into the topics that exist within them. The temporal aspect of topics can frame relevance based on when topics are introduced and when they stop being mentioned. It creates trends and patterns that can be traced through individual key terms taken from the corpus. If trends are being established, there must be a way to visualize them through those key terms. Creating a visual system to support this analysis can help users quickly gain insights from the data, significantly easing the burden of the original analysis technique. However, creating a visual system for terms is not easy. Work has been done to develop word embeddings, allowing researchers to treat words as numeric vectors. This makes it possible to create simple charts based on word embeddings, like scatter plots. However, such methods are inefficient: they lose effectiveness with multiple time slices and overlapping points. A visualization method that addresses these problems, while also visualizing diachronic word embeddings in an interesting way with added semantic meaning, is hard to find. These problems are managed through TimeLink. TimeLink is proposed as a dashboard system to help users gain insights from the movement of diachronic word embeddings. It comprises a Sankey diagram showing the path of a selected key term to a cluster in each time period. This local cluster is also mapped to a global topic based on the original corpus of documents from which the key terms are drawn. On the dashboard, different tools are given to users to aid in a focused analysis, such as filtering key terms and emphasizing specific clusters. TimeLink provides insightful visualizations focused on temporal word embeddings while maintaining the insights provided by global topic evolution, advancing our understanding of how topics evolve over time. / Master of Science / The task of analyzing documents collected over time is daunting. Grouping documents into topics can help frame relevancy based on when topics are introduced and when they fade. The creation of topics also enables the ability to visualize trends and patterns. Creating a visual system to support this analysis can help users quickly gain insights from the data, significantly easing the burden of the original analysis technique of browsing individual documents. A visualization system for this analysis typically focuses on the terms that affect established topics. Some visualization methods, like scatter plots, implement this but can be inefficient, losing effectiveness as more data is introduced. TimeLink is proposed as a dashboard system to aid users in drawing insights from the development of terms over time. In addition to addressing problems in other visualizations, it visualizes the movement of terms intuitively and adds semantic meaning. TimeLink provides insightful visualizations focused on the movement of terms while maintaining the insights provided by global topic evolution, advancing our understanding of how topics evolve over time.
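A common preprocessing step in diachronic-embedding work (the thesis may use a different pipeline; this is a sketch of the standard technique, not TimeLink's code) is aligning each time slice's embedding space to a reference slice with orthogonal Procrustes, so that a term's movement across slices is comparable:

```python
import numpy as np

def procrustes_align(source, reference):
    """Rotate `source` embeddings onto `reference` (rows = shared vocab).
    Orthogonal Procrustes: W = argmin over orthogonal W of
    ||source @ W - reference||_F, given by U @ V^T from the SVD of
    source^T @ reference."""
    u, _, vt = np.linalg.svd(source.T @ reference)
    return source @ (u @ vt)

# Toy usage: two "time slices" over the same 100-word vocabulary.
rng = np.random.default_rng(2)
slice_2010 = rng.standard_normal((100, 50))
slice_2020 = rng.standard_normal((100, 50))
aligned_2020 = procrustes_align(slice_2020, slice_2010)
```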
48

Extractive Text Summarization of Greek News Articles Based on Sentence-Clusters

Kantzola, Evangelia January 2020 (has links)
This thesis introduces an extractive summarization system for Greek news articles based on sentence clustering. Its main purpose is to evaluate the impact of three different types of text representation, Word2Vec embeddings, TF-IDF, and LASER embeddings, on the summarization task. Taking these techniques into account, we build three different versions of the initial summarizer. Moreover, we create a new corpus of gold-standard summaries against which to evaluate the system summaries. The new collection of reference summaries is merged with part of the MultiLing Pilot 2011 corpus to constitute our main dataset. We perform both automatic and human evaluation. Our automatic ROUGE results suggest that System A, which employs averaged Word2Vec vectors to create sentence embeddings, outperforms the other two systems by yielding higher ROUGE-L F-scores. Contrary to our initial hypotheses, System C, using LASER embeddings, fails to surpass even the Word2Vec embeddings method, at times showing weak sentence representations. With regard to the scores obtained in the manual evaluation task, we observe that System A, using averaged Word2Vec vectors, and System C, with LASER embeddings, tend to produce more coherent and adequate summaries than System B, which employs TF-IDF. Furthermore, the majority of system summaries are rated very highly with respect to non-redundancy. Overall, System A, utilizing averaged Word2Vec embeddings, performs quite successfully according to both evaluations.
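The best-performing configuration, averaged Word2Vec sentence embeddings followed by sentence clustering, can be sketched as below (a minimal illustration assuming a pre-trained word-vector dictionary and a placeholder cluster count; the real system operates on Greek news text):

```python
import numpy as np
from sklearn.cluster import KMeans

def summarize(sentences, word_vectors, n_clusters=3):
    """Extractive summary: embed each sentence as the average of its
    word vectors, cluster the sentences, then pick the sentence nearest
    each cluster centroid as a summary sentence."""
    def embed(sentence):
        # Assumes every sentence contains at least one in-vocabulary word.
        vecs = [word_vectors[w] for w in sentence.split() if w in word_vectors]
        return np.mean(vecs, axis=0)

    emb = np.stack([embed(s) for s in sentences])
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(emb)
    summary = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(emb[members] - km.cluster_centers_[c], axis=1)
        summary.append(sentences[members[np.argmin(dists)]])
    return summary
```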
49

Leveraging Degree of Isomorphism to Improve Cross-Lingual Embedding Space for Low-Resource Languages

Bhowmik, Kowshik January 2022 (has links)
No description available.
50

[en] A FAST AND SPACE-ECONOMICAL APPROACH TO WORD MOVER'S DISTANCE

MATHEUS TELLES WERNER 02 April 2020 (has links)
[en] The Word Mover's Distance (WMD) proposed in Kusner et al. [ICML, 2015] is a distance between documents that takes advantage of semantic relations among words captured by their word embeddings. This distance proved to be quite effective, obtaining state-of-the-art error rates for classification tasks, but also impracticable for large collections or documents, because it needs to solve a transportation problem on a complete bipartite graph for each pair of documents. Under assumptions that are supported by empirical properties of the distances between word embeddings, we simplify WMD so as to obtain a new distance whose computation requires the solution of a max-flow problem in a sparse graph, which can be solved much faster than the transportation problem in a dense graph. Our experiments show that we can obtain a performance gain of up to three orders of magnitude over WMD while maintaining the same error rates in document classification tasks.
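For reference, the original WMD that this work simplifies is the dense transportation problem below; a minimal version can be solved with scipy's linear-programming routine (the embeddings and uniform word weights are random stand-ins, and this is the expensive baseline formulation, not the thesis's sparse max-flow variant):

```python
import numpy as np
from scipy.optimize import linprog

def wmd(emb_a, weights_a, emb_b, weights_b):
    """Word Mover's Distance: minimum cost of moving the word mass of
    document A onto document B, with cost = Euclidean distance between
    word embeddings. Solved as a dense transportation LP, which is what
    makes the original formulation expensive for large documents."""
    n, m = len(weights_a), len(weights_b)
    cost = np.linalg.norm(emb_a[:, None, :] - emb_b[None, :, :], axis=2)
    # Flow variables T[i, j] >= 0, flattened row-major into n*m entries.
    a_eq, b_eq = [], []
    for i in range(n):                        # each source word ships its mass
        row = np.zeros(n * m); row[i * m:(i + 1) * m] = 1
        a_eq.append(row); b_eq.append(weights_a[i])
    for j in range(m):                        # each target word receives its mass
        col = np.zeros(n * m); col[j::m] = 1
        a_eq.append(col); b_eq.append(weights_b[j])
    res = linprog(cost.ravel(), A_eq=np.array(a_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun

# Toy usage: two documents with uniform word weights summing to 1.
rng = np.random.default_rng(3)
d = wmd(rng.standard_normal((4, 50)), np.full(4, 0.25),
        rng.standard_normal((5, 50)), np.full(5, 0.2))
print(d)
```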
