About: The Global ETD Search service is a free service for researchers to find electronic theses and dissertations, provided by the Networked Digital Library of Theses and Dissertations (NDLTD). Metadata is collected from universities around the world; institutions that wish to be added can find details on the NDLTD website.
21

Detektor plagiátů textových dokumentů / Text document plagiarism detector

Kořínek, Lukáš January 2021 (has links)
This diploma thesis is concerned with research on available methods of plagiarism detection and with the design and implementation of such a detector. The primary aim is to detect plagiarism within academic works and theses issued at BUT. The detector uses sophisticated preprocessing algorithms to store documents in its own corpus (document database). The implemented comparison algorithms are designed for parallel execution on graphics processing units and compare a single subject document against all other documents in the corpus in the shortest time possible, enabling near real-time detection while maintaining acceptable output quality.
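The abstract does not specify the comparison algorithm; a common baseline for corpus-wide plagiarism detection is word n-gram fingerprinting with set overlap. The sketch below is illustrative only (the function names and the 5-gram size are assumptions, not the thesis's actual design):

```python
# Hypothetical baseline for corpus-based plagiarism detection:
# word 5-gram fingerprints compared with Jaccard similarity.

def fingerprints(text: str, n: int = 5) -> set[int]:
    """Hash every word n-gram of a normalized document."""
    words = text.lower().split()
    return {hash(tuple(words[i:i + n])) for i in range(len(words) - n + 1)}

def jaccard(a: set[int], b: set[int]) -> float:
    """Overlap of two fingerprint sets; 1.0 means identical n-gram content."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def most_similar(subject: str, corpus: dict[str, str], n: int = 5):
    """Rank corpus documents by n-gram overlap with the subject document."""
    subj = fingerprints(subject, n)
    scores = {doc_id: jaccard(subj, fingerprints(text, n))
              for doc_id, text in corpus.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

In a real detector the per-document fingerprint sets would be precomputed at indexing time, which is what makes an all-corpus comparison of one subject document feasible.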
22

The Use of Corpus and Network Analysis in Teaching Engineering EAP Phrases

Maria J Pritchett (8635236) 16 April 2020 (has links)
This dissertation comprises three interlinked studies that pilot new methods for combining corpus linguistics and semantic network analysis (SNA) to understand and teach academic language. Findings indicate that this approach leads to a deeper understanding of technical writing and offers an exciting new avenue for writing curricula.

The first phase is a corpus study of fixed and variable formulaic language (n-grams and p-frames) in academic engineering writing. The results were analyzed functionally, semantically, and rhetorically. In contrast to previous n-gram analyses, the p-frame analysis found that variable phrases are often participant-oriented and communicate author stance.

The second phase combined corpus and network analysis tools to create educational materials, highlighting several elements of successful design. The final phase tested the materials in two classes with fifteen graduate students, finding evidence for the value of this novel approach.
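As a rough illustration of the two phrase types the first phase contrasts, fixed n-grams can be collapsed into p-frames by allowing one variable slot. This is a hedged sketch; the function names and the 4-gram size are invented, not taken from the dissertation:

```python
# Toy derivation of p-frames (n-grams with one variable slot) from
# fixed n-gram counts; '*' marks the variable slot.
from collections import Counter

def ngrams(tokens, n=4):
    """All contiguous word n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def pframes(gram_counts: Counter) -> Counter:
    """Collapse n-grams that differ in exactly one position into p-frames."""
    frames = Counter()
    for gram, count in gram_counts.items():
        for i in range(len(gram)):
            frames[gram[:i] + ("*",) + gram[i + 1:]] += count
    return frames
```

For example, "as shown in figure" and "as shown in table" are two distinct 4-grams but a single p-frame, "as shown in *", which is why p-frame analysis surfaces variable phrases that fixed n-gram counts miss.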
23

An Evaluation of Machine Learning Approaches for Hierarchical Malware Classification

Roth, Robin, Lundblad, Martin January 2019 (has links)
With the threat of new malware growing in both number and complexity, the need for better automatic detection and classification of malware is increasing. The signature-based approaches used by many anti-virus companies struggle with the rising amount of polymorphic malware, which changes minor aspects of its code to remain undetected. Previous research has applied machine learning to malware classification to address this issue. In the proposed work, different hierarchical machine learning approaches are implemented to conduct three experiments. The methods exploit a hierarchical class structure in various ways to obtain better classification performance, and a selection of hierarchical levels and machine learning models is evaluated for its effect on the results. A data set containing over 90,000 labelled malware samples is created, together with a labelling method that may help researchers in malware classification who need labels for their own data sets. The feature vector contains 500 n-gram features and 3,521 Import Address Table features. The experiments test four machine learning models and three different numbers of hierarchical levels, using stratified 5-fold cross-validation to reduce bias and variance in the results. The best configuration, Random Forest (RF) with four hierarchical levels, achieves the highest hF-score, 0.858228. To allow comparison with related work, pure flat classification accuracy was also computed; the highest accuracy obtained, 0.8512816, does not surpass the best results reported in related work.
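The reported hF-score can be illustrated with the standard hierarchical precision/recall construction, in which each label is expanded to its ancestor set before computing F1. A sketch under the assumption of slash-delimited label paths (the taxonomy format here is hypothetical, not the thesis's):

```python
# Hierarchical F-measure: expand each true and predicted label to its
# ancestor set, then compute micro-averaged precision/recall/F1.

def ancestors(label: str) -> set[str]:
    """'trojan/banker/zeus' -> {'trojan', 'trojan/banker', 'trojan/banker/zeus'}"""
    parts = label.split("/")
    return {"/".join(parts[:i + 1]) for i in range(len(parts))}

def h_fscore(true_labels, pred_labels) -> float:
    """hF over paired label paths; assumes at least one prediction."""
    tp = fp = fn = 0
    for t, p in zip(true_labels, pred_labels):
        ts, ps = ancestors(t), ancestors(p)
        tp += len(ts & ps)
        fp += len(ps - ts)
        fn += len(ts - ps)
    hp = tp / (tp + fp)  # hierarchical precision
    hr = tp / (tp + fn)  # hierarchical recall
    return 2 * hp * hr / (hp + hr)
```

The appeal of this metric for malware taxonomies is that a prediction landing in the right family but the wrong variant still earns partial credit for the shared ancestors.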
24

Τεχνικές και μηχανισμοί συσταδοποίησης χρηστών και κειμένων για την προσωποποιημένη πρόσβαση περιεχομένου στον Παγκόσμιο Ιστό / Techniques and mechanisms for clustering users and texts for personalized content access on the World Wide Web

Τσόγκας, Βασίλειος 16 April 2015 (has links)
With the enormous and ever-growing text sources on the internet, mechanisms that help users obtain quick answers to their queries have become a necessity. Delivering content personalised to user needs is essential given the combinatorial explosion of information visible in every corner of the web. Immediate and effective solutions are needed to tame this information chaos, solutions achievable only through analysis of the problems and the application of modern mathematical and computational methods. This doctoral dissertation aims at the design, development, and, finally, evaluation of mechanisms and novel algorithms from the areas of information retrieval, natural language processing, and machine learning that provide a high level of filtering of internet information for the end user. More precisely, across the various stages of information processing, techniques and mechanisms are developed that gather, index, filter, and return to users textual content originating from the web, aiming to provide information services beyond the established norms of the current internet. The core of the dissertation is the development of a clustering mechanism for both texts and web users. In this context, classical clustering algorithms were studied and evaluated on news articles to estimate the effectiveness of each within this domain, guiding the choice of which algorithm to extend.
In a second phase, a clustering algorithm for news articles was implemented that exploits an external knowledge base, WordNet, and is adapted to the requirements of news articles originating from the web, notably their diversity and quick churn. Another central goal of this work is to model the browsing behaviour of ordinary users and to evaluate these behaviours automatically, with the tangible benefit of predicting the preferences users will express in the future. User modelling has direct application to the personalisation of information through the prediction of user preferences. Accordingly, a personalisation algorithm was implemented that takes into account a plethora of parameters that indirectly reveal user preferences.
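As a toy illustration of the kind of news-article clustering the dissertation evaluates, the sketch below greedily groups documents by term-vector cosine similarity. It omits the WordNet-based enrichment described above; the threshold value and the compare-to-seed strategy are illustrative assumptions:

```python
# Greedy single-pass clustering of documents by cosine similarity of
# raw term-frequency vectors (a simple baseline, not the thesis's method).
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(texts, threshold=0.2):
    vecs = [Counter(t.lower().split()) for t in texts]
    clusters: list[list[int]] = []
    for i, v in enumerate(vecs):
        for c in clusters:
            if cosine(v, vecs[c[0]]) >= threshold:  # compare to cluster seed
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

A WordNet-aware variant would map near-synonyms ("stock"/"share") to common concepts before vectorising, which is precisely the gap an external knowledge base fills for short, vocabulary-diverse news items.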
25

Recurrent neural network language models for automatic speech recognition

Gangireddy, Siva Reddy January 2017 (has links)
The goal of this thesis is to advance the use of recurrent neural network language models (RNNLMs) for large-vocabulary continuous speech recognition (LVCSR). RNNLMs are currently state-of-the-art and have been shown to consistently reduce the word error rates (WERs) of LVCSR tasks compared to other language models. This thesis proposes several advances to RNNLMs: improved learning procedures, context enhancement, and adaptation. We learn better parameters through a novel pre-training approach and enhance the context using prosodic and syntactic features. We present a pre-training method for RNNLMs in which the output weights of a feed-forward neural network language model (NNLM) are shared with the RNNLM: the weights of the NNLM are first fine-tuned and then used to initialise the output weights of an RNNLM with the same number of hidden units. To investigate the effectiveness of the proposed pre-training method, we carried out text-based experiments on the Penn Treebank Wall Street Journal data and ASR experiments on the TED lectures data. Across the experiments, we observe small but significant improvements in perplexity (PPL) and ASR WER. Next, we present unsupervised adaptation of RNNLMs. We adapt the RNNLMs to a target domain (topic, genre, or television programme) at test time using ASR transcripts from first-pass recognition, investigating two approaches: in the first, the forward-propagating hidden activations are scaled (learning hidden unit contributions, LHUC); in the second, we adapt all parameters of the RNNLM. We evaluate the adapted RNNLMs by reporting WERs on multi-genre broadcast speech data, observing small (on average 0.1% absolute) but significant improvements in WER compared to a strong unadapted RNNLM. Finally, we present the context enhancement of RNNLMs using prosodic and syntactic features.
The prosody features were computed from the acoustics of the context words and the syntactic features were from the surface form of the words in the context. We trained the RNNLMs with word duration, pause duration, final phone duration, syllable duration, syllable F0, part-of-speech tag and Combinatory Categorial Grammar (CCG) supertag features. The proposed context-enhanced RNNLMs were evaluated by reporting PPL and WER on two speech recognition tasks, Switchboard and TED lectures. We observed substantial improvements in PPL (5% to 15% relative) and small but significant improvements in WER (0.1% to 0.5% absolute).
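Two of the quantities discussed above can be sketched compactly: perplexity over a held-out sequence, and the LHUC re-parameterisation that rescales each hidden unit. The 2·sigmoid amplitude is the form commonly used for LHUC; treat the code as a hedged illustration, not the thesis's implementation:

```python
# Perplexity from per-token log-probabilities, and LHUC-style scaling
# of hidden activations by a learned per-unit amplitude in (0, 2).
from math import exp, log

def perplexity(logprobs: list[float]) -> float:
    """PPL = exp of the negative mean log-probability of the sequence."""
    return exp(-sum(logprobs) / len(logprobs))

def lhuc_scale(hidden: list[float], alpha: list[float]) -> list[float]:
    """Rescale each hidden activation by r = 2*sigmoid(alpha); alpha is
    the small per-unit parameter vector learned on adaptation data."""
    return [h * 2.0 / (1.0 + exp(-a)) for h, a in zip(hidden, alpha)]
```

With alpha at zero the scale is exactly 1 (no adaptation), which is why LHUC is a cheap, low-risk way to adapt: only one extra parameter per hidden unit is learned from the first-pass transcripts.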
26

[pt] ANOTAÇÃO MORFOSSINTÁTICA A PARTIR DO CONTEXTO MORFOLÓGICO / [en] MORPHOSYNTACTIC ANNOTATION BASED ON MORPHOLOGICAL CONTEXT

EDUARDO DE JESUS COELHO REIS 20 December 2016 (has links)
Part-of-speech tagging is one of the primary stages in natural language processing, providing useful features for performing tasks of higher complexity. Word-level representations have been widely adopted, either through a conventional sparse codification, such as bag-of-words, or through a distributed representation, like the sophisticated word-embedding models used to describe syntactic and semantic information. A central issue with these codifications is the lack of morphological aspects. In addition, while recent taggers reach per-token accuracies around 97 percent, under a per-sentence metric even good taggers show modest accuracies, scoring around 55-57 percent. In this work, we demonstrate how to use n-grams to automatically derive sparse morphological features for text processing. This representation allows neural networks to perform POS tagging from character-level input. Additionally, we introduce a regularisation strategy capable of selecting specific features for each layer unit. Regarding n-gram selection, embedding this regularisation in our models produces two variants: the first shares the globally selected n-grams among all units of a layer, whereas the second performs an individual selection for each unit, so that each unit is sensitive only to the n-grams that best stimulate it. Using the proposed approach, we generate a large number of features representing relevant morphosyntactic properties derived from character-level input. Our POS tagger achieves an accuracy of 96.67 percent on the Mac-Morpho corpus for Portuguese.
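A minimal sketch of deriving sparse morphological features from character n-grams, in the spirit of the approach described above. The boundary-padding characters and the hashed dimensionality are illustrative assumptions, not the thesis's configuration:

```python
# Character n-grams as sparse morphological features for a token;
# '^' and '$' mark word boundaries so prefixes/suffixes are distinct.

def char_ngrams(word: str, n_values=(2, 3)) -> set[str]:
    """Character n-grams of a boundary-padded token."""
    padded = f"^{word}$"
    return {padded[i:i + n]
            for n in n_values
            for i in range(len(padded) - n + 1)}

def sparse_vector(word: str, dim: int = 1024) -> set[int]:
    """Indices of active dimensions in a hashed binary feature vector."""
    return {hash(g) % dim for g in char_ngrams(word)}
```

Morphologically related forms share active features: Portuguese gerunds like "correndo" and "comendo" both activate suffix n-grams such as "ndo" and "do$", which is exactly the signal word-level bag-of-words representations discard.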
27

N-gramy v mluveném projevu českých a rodilých mluvčích angličtiny / N-grams in the speech of Czech and native speakers of English

Zvěřinová, Simona January 2016 (has links)
The diploma thesis is concerned with the analysis of recurrent word-combinations in the speech of advanced Czech speakers of English and native speakers of English. The data for the analysis are extracted from two corpora: the learner corpus LINDSEI and the native-speaker corpus LOCNEC. The aim of the thesis is to compare the two groups of speakers, determine differences in their use of recurrent word-combinations, and relate the findings to previous studies involving speakers of other languages. The quantitative analysis, performed on a sample of 50 speakers from each corpus, uses frequency data to compare the two groups in terms of how many types of word-combinations they use and how frequently they do so. The qualitative analysis, performed on a sample of 15 speakers from each corpus, determines functional differences; four categories of word-combinations are identified. In the conclusion, the quantitative and qualitative findings are compared to previous research involving speakers of other languages. Keywords: spoken language, learner language, n-grams, n-gram analysis, recurrent word-combinations, lexical bundles, learner corpus
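The quantitative step, counting recurrent word-combinations above a frequency threshold, can be sketched as follows; the 3-gram size, the threshold, and the toy transcripts are invented for illustration:

```python
# Recurrent word-combinations: n-grams occurring at least `min_count`
# times across a group of speaker transcripts.
from collections import Counter

def recurrent_combinations(transcripts: list[str], n=3, min_count=2):
    counts = Counter()
    for t in transcripts:
        toks = t.lower().split()
        counts.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return {g: c for g, c in counts.items() if c >= min_count}
```

Running this separately over the learner and native-speaker samples, and normalising counts per speaker, would yield the type and frequency comparison the thesis describes.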
28

Reprezentace textu a její vliv na kategorizaci / Representation of Text and Its Influence on Categorization

Šabatka, Ondřej January 2010 (has links)
The thesis deals with machine processing of textual data. The theoretical part describes issues related to natural language processing and introduces different ways of pre-processing and representing text. It also focuses on the use of N-grams as features for document representation and describes algorithms for extracting them, followed by an outline of the classification methods used. In the practical part, an application for pre-processing and for creating different representations of textual data is designed and implemented. The experiments analyse the influence of these representations on the accuracy of the classification algorithms.
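As an illustration of such an experimental pipeline, the sketch below feeds word-bigram features into a small multinomial Naive Bayes classifier. It is a generic baseline, not the representation or classifier the thesis evaluates; the add-one smoothing and the unnormalised class priors are illustrative choices:

```python
# Word bigrams as document features, classified with multinomial
# Naive Bayes (add-one smoothing; unnormalised priors are fine here
# because the missing log-normaliser is the same constant for all classes).
from collections import Counter
from math import log

def bigrams(text: str):
    toks = text.lower().split()
    return [" ".join(toks[i:i + 2]) for i in range(len(toks) - 1)]

class NaiveBayes:
    def fit(self, docs, labels):
        self.classes = set(labels)
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for d, y in zip(docs, labels):
            self.counts[y].update(bigrams(d))
        self.vocab = {g for c in self.counts.values() for g in c}
        return self

    def predict(self, doc):
        def score(c):
            total = sum(self.counts[c].values()) + len(self.vocab)
            s = log(self.prior[c])
            for g in bigrams(doc):
                s += log((self.counts[c][g] + 1) / total)
            return s
        return max(self.classes, key=score)
```

Swapping `bigrams` for unigrams, character n-grams, or stemmed tokens is exactly the kind of representation change whose effect on accuracy the thesis measures.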
29

The research on chinese text multi-label classification / Avancée en classification multi-labels de textes en langue chinoise / 中文文本多标签分类研究

Wei, Zhihua 07 May 2010 (has links)
Text classification (TC), an important field in information technology, has many valuable applications. Faced with a sea of information resources, the objects of TC have become more complicated and diverse, and research into effective, practical TC technology is challenging. More and more researchers consider multi-label TC better suited to many applications. Building on a large body of algorithms for single-label and multi-label TC, this thesis analyses the difficulties and problems of multi-label TC and Chinese text representation, and proposes algorithms that address, from different angles, the high dimensionality of the feature space, the sparse distribution of text representations, and the poor performance of multi-label classifiers. Focusing on the dimensionality "disaster" that arises when Chinese texts are represented with n-grams, a two-step feature selection algorithm is constructed, combining the filtering of rare features within each class with the selection of discriminative features across classes. Moreover, the proper value of n, feature-weighting strategies, and correlations among features are examined through extensive experiments, contributing useful conclusions to research on n-gram representation of Chinese text. In view of a disadvantage of the Latent Dirichlet Allocation (LDA) model, namely the arbitrary revision of variables during smoothing, a new smoothing strategy based on Tolerance Rough Sets (TRS) is put forward: it first constructs tolerance classes over the global vocabulary and then assigns values to out-of-vocabulary (OOV) words in each class according to those tolerance classes. To improve multi-label classifier performance and reduce computational complexity, a new LDA-based TC method is applied to Chinese text representation: topics are extracted statistically from the texts, which are then represented by topic vectors. It shows competitive performance on both English and Chinese corpora. To further enhance classifier performance in multi-label TC, a compound classification framework is proposed. It partitions the text space by computing upper and lower approximations, decomposing a multi-label TC problem into several single-label TCs and several multi-label TCs with fewer labels than the original problem: an unknown text is classified by a single-label classifier when it falls into the lower approximation of some class, and by the corresponding multi-label classifier otherwise. Finally, an application system, TJ-MLWC (Tongji Multi-label Web Classifier), was designed. It calls search-engine results directly and classifies them in real time using an improved Naïve Bayes classifier, making browsing more convenient: users can immediately locate the texts they are interested in from the class information given by TJ-MLWC. / The thesis centres on text classification, a rapidly expanding field with many current and potential applications. Its main contributions concern two points. First, the specificities of coding and automatically processing the Chinese language: words may consist of one, two, or three characters; there is no typographic separation between words; and many word orders are possible within a sentence, all of which leads to difficult ambiguity problems. Coding in n-grams (sequences of n = 1, 2, or 3 characters) is particularly well suited to Chinese, because it is fast and requires neither prior dictionary-based word recognition nor word segmentation. Second, multi-label classification, in which each individual may be assigned to one or more classes. 
For texts, the classes sought correspond to topics, and a single text may be attached to one or several topics. The multi-label approach is more general: a patient may suffer from several pathologies; a company may be active in several industrial or service sectors. The thesis analyses these problems and proposes solutions, first for single-label classifiers and then for multi-label ones. Among the difficulties are the definition of the variables characterising the texts, their large number, the handling of sparse tables (many zeros in the matrix crossing texts and descriptors), and the relatively poor performance of the usual multi-class classifiers. / Text classification is an important and practically valuable research field in information science. As the content handled by text classification grows more complex and diverse and classification targets multiply, developing effective TC technology that meets real application needs is a challenging task, and research on multi-label classification has arisen in response. Building on an analysis of many single-label and multi-label classification algorithms, and targeting the problems of high-dimensional features, data sparseness, and the high complexity yet low accuracy of multi-label classification, this thesis applies rough set theory from several angles and proposes corresponding algorithms. For the curse of dimensionality that arises when n-grams serve as features of Chinese text, a two-step feature selection method is proposed, combining the removal of class-rare features with cross-class feature selection; extensive experiments on large-scale Chinese corpora examine the choice of n, feature weighting, and feature correlation, yielding useful conclusions. For the low efficiency and high cost of classifying with high-dimensional features, a multi-label classification algorithm based on the LDA model is proposed, using LDA-extracted topics as features to build an efficient classifier; under the PT3 multi-label transformation, it performs well on both Chinese and English datasets, on a par with the best accepted multi-label methods. For the arbitrariness of LDA's existing smoothing strategies, a smoothing strategy based on tolerance rough sets is proposed: tolerance classes of words are constructed over the global vocabulary, and out-of-vocabulary words in each document class receive smoothed values according to their tolerance classes; experiments on balanced and imbalanced Chinese and English corpora show that it significantly improves LDA's classification performance, especially on imbalanced corpora. For the high complexity and low accuracy of multi-label classification, a compound multi-label classification framework based on variable-precision rough sets is proposed; it partitions the feature space and decomposes a multi-label problem into several binary single-label problems and several multi-label problems with fewer labels: a text falling into a class's lower approximation is judged directly by a simple single-label classifier, while a text in the boundary region is classified by the multi-label classifier for that region. Experiments show clear gains in both accuracy and efficiency. Finally, a multi-label web search-result visualisation system (MLWC) was designed and implemented; it calls the results returned by a search engine directly and classifies them in real time with an improved Naïve Bayes multi-label algorithm, letting users quickly locate the search results that interest them.
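The two-step feature selection described above can be sketched as follows. The concrete thresholds (minimum within-class document frequency, cross-class skew ratio) are invented here; the thesis tunes its own criteria:

```python
# Two-step n-gram feature selection: (1) drop features rare within every
# class; (2) keep features whose class-conditional document frequencies
# are skewed toward one class.
from collections import Counter

def two_step_select(class_docs: dict[str, list[set[str]]],
                    min_in_class=2, min_skew=2.0):
    # Step 1: per-class document frequency; discard class-rare features.
    df = {c: Counter(f for d in docs for f in d)
          for c, docs in class_docs.items()}
    candidates = {f for c in df for f, n in df[c].items() if n >= min_in_class}
    # Step 2: keep features whose top-class frequency dominates the rest.
    selected = set()
    for f in candidates:
        freqs = sorted((df[c][f] for c in df), reverse=True)
        runner_up = freqs[1] if len(freqs) > 1 else 0
        if freqs[0] >= min_skew * max(runner_up, 1):
            selected.add(f)
    return selected
```

The first step controls the dimensionality explosion of character n-grams; the second keeps only features that actually discriminate between classes, which is the combination the thesis reports as effective for Chinese text.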
