21

A Latent Dirichlet Allocation/N-gram Composite Language Model

Kulhanek, Raymond Daniel 08 November 2013 (has links)
No description available.
22

Detection and Correction of Inconsistencies in the Multilingual Treebank HamleDT

Mašek, Jan January 2015 (has links)
We studied the treebanks included in HamleDT and partially unified their label sets. We then used a method based on variation n-grams to automatically detect errors in morphological and dependency annotation, and used the output of a part-of-speech tagger / dependency parser trained on each treebank to correct the detected errors. The performance of both error detection and error correction on both annotation levels was manually evaluated on randomly selected samples of suspected errors from several treebanks.
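The variation n-gram idea lends itself to a compact illustration: identical word contexts whose target word carries different labels in different occurrences are suspected annotation errors. A minimal sketch, not the thesis code — the function name and the toy POS-tagged data are invented for illustration:

```python
from collections import defaultdict

def variation_ngrams(tagged_sents, n=3):
    """Find n-grams whose middle token is tagged inconsistently
    across the corpus -- candidate annotation errors."""
    occurrences = defaultdict(set)
    for sent in tagged_sents:  # sent: list of (word, tag) pairs
        for i in range(len(sent) - n + 1):
            window = sent[i:i + n]
            words = tuple(w for w, _ in window)
            nucleus_tag = window[n // 2][1]  # tag of the middle word
            occurrences[words].add(nucleus_tag)
    # identical contexts with diverging tags are flagged for inspection
    return {ngram: tags for ngram, tags in occurrences.items() if len(tags) > 1}

sents = [[("the", "DT"), ("old", "JJ"), ("man", "NN")],
         [("the", "DT"), ("old", "NN"), ("man", "VB")]]
print(variation_ngrams(sents))  # {('the', 'old', 'man'): {'JJ', 'NN'}}
```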
23

Text document plagiarism detector

Kořínek, Lukáš January 2021 (has links)
This diploma thesis surveys available methods of plagiarism detection and then covers the design and implementation of such a detector. The primary aim is to detect plagiarism within academic works and theses issued at BUT. The detector uses sophisticated preprocessing algorithms to store documents in its own corpus (document database). The comparison algorithms are designed for parallel execution on graphics processing units: they compare a single subject document against all other documents in the corpus in the shortest possible time, enabling near real-time detection while maintaining acceptable output quality.
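The abstract does not spell out the comparison algorithm; as a rough illustration of the general corpus-vs-document approach (word-shingle fingerprints intersected against each stored document), one might sketch it as below. This is an assumption-laden stand-in, not the thesis's GPU implementation:

```python
def shingles(text, k=5):
    """Set of overlapping word k-grams ('shingles') used as a fingerprint."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(subject, candidate, k=5):
    """Jaccard overlap of shingle sets -- a crude plagiarism score in [0, 1]."""
    a, b = shingles(subject, k), shingles(candidate, k)
    return len(a & b) / len(a | b) if a | b else 0.0

# hypothetical corpus: document id -> preprocessed text
corpus = {"thesis-001": "some stored thesis text ...",
          "thesis-002": "another stored text ..."}
scores = {doc_id: similarity("suspect document text ...", text)
          for doc_id, text in corpus.items()}
```

In the real detector this all-pairs scan is the part mapped onto the GPU, since each corpus document can be scored independently.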
24

The Use of Corpus and Network Analysis in Teaching Engineering EAP Phrases

Maria J Pritchett (8635236) 16 April 2020 (has links)
This dissertation comprises three interlinked studies that pilot new methods for combining corpus linguistics and semantic network analysis (SNA) to understand and teach academic language. Findings indicate that this approach leads to a deeper understanding of technical writing and offers an exciting new avenue for writing curriculum.

The first phase is a corpus study of fixed and variable formulaic language (n-grams and p-frames) in academic engineering writing. The results were analyzed functionally, semantically, and rhetorically. In contrast to previous n-gram analyses, the p-frame analysis found that variable phrases are often participant-oriented and communicate author stance.

The second phase combined corpus and network analysis tools to create educational materials, highlighting several elements of successful design. The final phase tested the materials in two classes with fifteen graduate students, finding evidence for the value of this novel approach.
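P-frames — n-grams with one open slot, such as "the * of the" — can be extracted mechanically. A minimal sketch under that standard definition, with invented names and toy data rather than the dissertation's actual pipeline:

```python
from collections import defaultdict

def p_frames(tokens, n=4):
    """Collect n-gram frames with one variable slot, together with
    the filler words observed in that slot."""
    frames = defaultdict(set)
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        for slot in range(n):
            frame = tuple(w if j != slot else "*" for j, w in enumerate(gram))
            frames[frame].add(gram[slot])
    # frames attested with several distinct fillers are variable formulaic phrases
    return {f: fillers for f, fillers in frames.items() if len(fillers) > 1}

text = "the result of the test and the result of the survey".split()
print(p_frames(text))  # e.g. ('the', 'result', 'of', '*') with its fillers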
25

An Evaluation of Machine Learning Approaches for Hierarchical Malware Classification

Roth, Robin, Lundblad, Martin January 2019 (has links)
With the ever-growing threat of new malware, increasing in both number and complexity, the need for better automatic detection and classification of malware is growing as well. The signature-based approaches used by several anti-virus companies struggle with the increasing amount of polymorphic malware, which changes minor aspects of its code in order to remain undetected. Malware classification using machine learning has been used to address this issue in previous research. In the proposed work, different hierarchical machine learning approaches are implemented to conduct three experiments. The methods utilise a hierarchical structure in various ways to obtain better classification performance, and a selection of hierarchical levels and machine learning models is used in the experiments to evaluate how the results are affected. A data set is created, containing over 90,000 labelled malware samples. The proposed work also includes a labelling method that can be helpful for researchers in malware classification who need labels for a created data set. The feature vector used contains 500 n-gram features and 3,521 Import Address Table features. The experiments test four machine learning models and three different numbers of hierarchical levels, using stratified 5-fold cross-validation to reduce bias and variance in the results. The best classification approach, using Random Forest (RF) as the machine learning model with four hierarchical levels, achieves an hF-score of 0.858228. To allow comparison with other related work, a pure-flat classification accuracy was also generated; the highest generated accuracy score was 0.8512816, which was not the highest compared to other related work.
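The hF-score quoted above is the standard hierarchical F-measure: predicted and true labels are expanded with their ancestors in the class hierarchy before computing precision and recall. A minimal sketch with a toy two-level malware hierarchy (labels and structure invented for illustration):

```python
def with_ancestors(label, parent):
    """Expand a label to the set {label} plus all its ancestors."""
    out = set()
    while label is not None:
        out.add(label)
        label = parent.get(label)
    return out

def h_fscore(pairs, parent):
    """Hierarchical F-measure over (true, predicted) label pairs."""
    overlap = pred_total = true_total = 0
    for true, pred in pairs:
        t, p = with_ancestors(true, parent), with_ancestors(pred, parent)
        overlap += len(t & p)
        pred_total += len(p)
        true_total += len(t)
    h_prec, h_rec = overlap / pred_total, overlap / true_total
    return 2 * h_prec * h_rec / (h_prec + h_rec)

# toy hierarchy: 'banker' and 'spy' are both children of 'trojan'
parent = {"banker": "trojan", "spy": "trojan", "trojan": None, "worm": None}
# confusing 'banker' with 'spy' is only half wrong: the 'trojan' ancestor matches
print(h_fscore([("banker", "spy"), ("worm", "worm")], parent))  # ~0.667
```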
26

Design of the detector bias circuits and baseline voltage holding circuit of the LabPET II

Panier, Sylvain January 2014 (has links)
In the past, the collaboration between the Centre d'Imagerie Médicale de Sherbrooke (CIMS) and the Groupe de Recherche en Appareillage Médicale de Sherbrooke (GRAMS) led to the development of the LabPET scanner, the first commercial Positron Emission Tomography (PET) scanner to use avalanche photodiodes (APDs) as detectors. Since then, this collaboration has continued to evolve the scanner in order to improve this imaging modality and to add computed tomography (CT), so expectations for the next generation of the scanner are high. This new generation, the LabPET II, will have the two modalities natively integrated, sharing the same detection chain. The scanner will be equipped with new detectors organized in arrays of 64 crystals of 1.1 by 1.1 mm². This new array, paired with its two arrays of 32 APDs, has demonstrated its ability to deliver sub-millimetre spatial resolution. The new detection module should therefore allow the LabPET II to be the first commercial bimodal (PET/CT) scanner to reach sub-millimetre resolution, coming a little closer to the ultimate spatial resolution in PET while providing good anatomical localization thanks to the addition of rudimentary CT imaging. Reaching these goals required full integration of the front-end electronics. In previous versions, only the charge preamplifiers and shaping filters were integrated; in this new version, all of the analog electronics as well as the digitization and communication links must be integrated. To that end, the Time-over-Threshold (ToT) technique was preferred over the solution used in the LabPET I, which required one analog-to-digital converter per channel. The trade-off of this solution is the obligation to hold the baseline voltage at a fixed value common to all channels. The APD bias circuit also had to be integrated into the ASIC, since it occupied a great deal of space on the front-end electronics board of the LabPET I. This thesis describes the design, integration, and testing of these two circuits of the system. They have demonstrated their effectiveness while occupying very little area in the application-specific integrated circuit (ASIC) of the detection module. Based on the bibliographic sources surveyed, the LabPET II detection module should be among those with the highest channel density (about 45 per square centimetre) and the only one combining low-noise analog, digital, and high-voltage (~450 V) electronics. The realization of this new generation should allow the CIMS/GRAMS partnership to reaffirm its leading position in the field by improving the imaging tools available to preclinical medicine researchers.
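The Time-over-Threshold technique mentioned above replaces per-channel ADCs by timing how long the shaped pulse stays above a comparator threshold; that duration grows monotonically with pulse amplitude, which is why the baseline must be held fixed across channels. A minimal numerical sketch of the principle — a generic shaped-pulse model, not the LabPET II electronics:

```python
import numpy as np

def time_over_threshold(amplitude, threshold, tau=1.0, dt=0.001):
    """Duration a shaped pulse a*(t/tau)*exp(1 - t/tau) spends above threshold."""
    t = np.arange(dt, 20 * tau, dt)
    pulse = amplitude * (t / tau) * np.exp(1 - t / tau)  # peaks at `amplitude`
    return np.count_nonzero(pulse > threshold) * dt

# ToT grows monotonically with amplitude -> usable as an energy measure
for a in (1.0, 2.0, 4.0):
    print(a, round(time_over_threshold(a, threshold=0.5), 3))
```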
27

Cross-domain sentiment classification using grams derived from syntax trees and an adapted naive Bayes approach

Cheeti, Srilaxmi January 1900 (has links)
Master of Science / Department of Computing and Information Sciences / Doina Caragea / There is an increasing amount of user-generated information in online documents, including user opinions on various topics and products such as movies, DVDs, kitchen appliances, etc. To make use of such opinions, it is useful to identify the polarity of the opinion, in other words, to perform sentiment classification. The goal of sentiment classification is to classify a given text/document as positive, negative, or neutral based on the words present in the document. Supervised learning approaches have been successfully used for sentiment classification in domains that are rich in labeled data. Some of these approaches make use of features such as unigrams, bigrams, sentiment words, adjective words, syntax trees (or variations of trees obtained using pruning strategies), etc. However, for some domains the amount of labeled data can be relatively small, and we cannot train an accurate classifier using the supervised learning approach. Therefore, it is useful to study domain adaptation techniques that can transfer knowledge from a source domain that has labeled data to a target domain that has little or no labeled data, but a large amount of unlabeled data. We address this problem in the context of product reviews, specifically reviews of movies, DVDs, and kitchen appliances. Our approach uses an Adapted Naive Bayes classifier (ANB) on top of the Expectation Maximization (EM) algorithm to predict the sentiment of a sentence. We use grams derived from complete syntax trees or from syntax subtrees as features when training the ANB classifier. More precisely, we extract grams from syntax trees corresponding to sentences in either the source or target domains. To be able to transfer knowledge from source to target, we identify generalized features (grams) using the frequently co-occurring entropy (FCE) method, and represent the source instances using these generalized features. The target instances are represented with all grams occurring in the target, or with a reduced gram set obtained by removing infrequent grams. We experiment with different types of grams in a supervised framework in order to identify the most predictive types of gram, and further use those grams in the domain adaptation framework. Experimental results on several cross-domain tasks show that domain adaptation approaches that combine source and target data (a small amount of labeled and some unlabeled data) can help learn classifiers for the target that are better than those learned from the labeled target data alone.
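The ANB/EM combination builds on the classic semi-supervised Naive Bayes recipe: train on labeled source data, then alternate between soft-labeling the unlabeled target data (E-step) and retraining on the weighted union (M-step). A minimal scikit-learn sketch of that underlying loop — a stand-in for the thesis's adapted variant, with invented names and without the FCE feature generalization:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_naive_bayes(X_labeled, y_labeled, X_unlabeled, iters=10):
    """Semi-supervised Naive Bayes via EM over unlabeled target data."""
    clf = MultinomialNB().fit(X_labeled, y_labeled)
    for _ in range(iters):
        # E-step: soft class posteriors for the unlabeled documents
        post = clf.predict_proba(X_unlabeled)
        # M-step: refit on labeled data plus fractionally weighted unlabeled data
        n_classes = post.shape[1]
        X_all = np.vstack([X_labeled] + [X_unlabeled] * n_classes)
        y_all = np.concatenate(
            [y_labeled] + [np.full(X_unlabeled.shape[0], c) for c in clf.classes_])
        w_all = np.concatenate(
            [np.ones(X_labeled.shape[0])] + [post[:, i] for i in range(n_classes)])
        clf = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
    return clf
```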
28

Techniques and mechanisms for clustering users and texts for personalized content access on the World Wide Web

Τσόγκας, Βασίλειος 16 April 2015 (has links)
With the reality of ever-increasing information sources on the internet, both in size and in indexed content, it becomes necessary to have mechanisms that assist users in getting the information they need at the moment they need it. The delivery of content personalized to user needs is deemed a necessity nowadays, given the combinatorial explosion of information visible in every corner of the world wide web. Effective and swift solutions are needed to deal with this information overload, and they are achievable only through analysis of the problems at hand and the application of modern mathematical and computational methods. This Ph.D. dissertation aims at the design, development, and finally the evaluation of mechanisms and novel algorithms from the areas of information retrieval, natural language processing, and machine learning that provide a high level of filtering of internet information to the end user. More precisely, through the various stages of information processing, techniques and mechanisms are developed that gather, index, filter, and return textual content originating from the world wide web and suited to user tastes — techniques and mechanisms that aim to deliver information services above and beyond the established norms of today's web. The kernel of this dissertation is the development of a clustering mechanism that operates both on news articles and on web users. Within this context, classical clustering algorithms were studied and evaluated for the case of news articles in order to estimate how effective each algorithm is in this domain, giving a clear choice of which algorithm to extend. As a second phase, a clustering algorithm for news articles and user profiles was implemented that exploits an external knowledge base, WordNet, and is adapted to the diversity and quick churn of news articles originating from the web. Another central goal of this work is the modeling of the browsing behavior of ordinary users and the automatic evaluation of those behaviors, with the tangible benefit of predicting the preferences users will express in the future. User modeling has direct application to information personalization through the prediction of user preferences. Accordingly, a personalization algorithm was implemented that takes into consideration a plethora of parameters that indirectly reveal user preferences.
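WordNet-assisted clustering of news articles typically augments surface word overlap with concept-level similarity from the WordNet graph. A minimal sketch using NLTK's WordNet interface — an illustration of the general idea under that assumption, not the dissertation's actual algorithm:

```python
from itertools import product
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def word_similarity(w1, w2):
    """Best path similarity over all synset pairs of two words (0 if none)."""
    scores = [s1.path_similarity(s2) or 0.0
              for s1, s2 in product(wn.synsets(w1), wn.synsets(w2))]
    return max(scores, default=0.0)

def article_similarity(keywords_a, keywords_b):
    """Average best-match WordNet similarity between two keyword sets,
    usable as the distance function inside a clustering algorithm."""
    if not keywords_a or not keywords_b:
        return 0.0
    return sum(max(word_similarity(a, b) for b in keywords_b)
               for a in keywords_a) / len(keywords_a)

print(article_similarity(["election", "vote"], ["ballot", "poll"]))
```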
30

Recurrent neural network language models for automatic speech recognition

Gangireddy, Siva Reddy January 2017 (has links)
The goal of this thesis is to advance the use of recurrent neural network language models (RNNLMs) for large vocabulary continuous speech recognition (LVCSR). RNNLMs are currently state-of-the-art and have been shown to consistently reduce the word error rates (WERs) of LVCSR tasks compared to other language models. In this thesis we propose various advances to RNNLMs: improved learning procedures, enhanced context, and adaptation. We learn better parameters through a novel pre-training approach and enhance the context using prosodic and syntactic features. We present a pre-training method for RNNLMs in which the output weights of a feed-forward neural network language model (NNLM) are shared with the RNNLM. This is accomplished by first fine-tuning the weights of the NNLM, which are then used to initialise the output weights of an RNNLM with the same number of hidden units. To investigate the effectiveness of the proposed pre-training method, we carried out text-based experiments on the Penn Treebank Wall Street Journal data and ASR experiments on the TED lectures data. Across the experiments, we observe small but significant improvements in perplexity (PPL) and ASR WER. Next, we present unsupervised adaptation of RNNLMs. We adapt the RNNLMs to a target domain (topic, genre, or television programme) at test time using ASR transcripts from first-pass recognition. We investigate two approaches: in the first, the forward-propagating hidden activations are scaled using learning hidden unit contributions (LHUC); in the second, we adapt all parameters of the RNNLM. We evaluated the adapted RNNLMs by reporting WERs on multi-genre broadcast speech data, observing small (on average 0.1% absolute) but significant improvements in WER compared to a strong unadapted RNNLM. Finally, we present the context-enhancement of RNNLMs using prosodic and syntactic features. The prosody features were computed from the acoustics of the context words, and the syntactic features from the surface form of the words in the context. We trained the RNNLMs with word duration, pause duration, final phone duration, syllable duration, syllable F0, part-of-speech tag, and Combinatory Categorial Grammar (CCG) supertag features. The proposed context-enhanced RNNLMs were evaluated by reporting PPL and WER on two speech recognition tasks, Switchboard and TED lectures. We observed substantial improvements in PPL (5% to 15% relative) and small but significant improvements in WER (0.1% to 0.5% absolute).
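LHUC adaptation, referenced above, learns one scaling factor per hidden unit on adaptation data while the original network weights stay frozen. A minimal PyTorch sketch of the scaling step — module and parameter names are invented for illustration, not taken from the thesis:

```python
import torch
import torch.nn as nn

class LHUC(nn.Module):
    """Per-unit scaling of hidden activations; only `r` is trained
    during adaptation, the underlying network stays frozen."""
    def __init__(self, hidden_size):
        super().__init__()
        self.r = nn.Parameter(torch.zeros(hidden_size))

    def forward(self, hidden):
        # amplitude in (0, 2): 2*sigmoid(r); identity at initialisation (r=0)
        return hidden * 2.0 * torch.sigmoid(self.r)

hidden = torch.randn(1, 8)          # e.g. RNN hidden state for one time step
lhuc = LHUC(hidden_size=8)
assert torch.allclose(lhuc(hidden), hidden)  # no-op before adaptation
```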
