171 |
Duplicate Detection and Text Classification on Simplified Technical English / Dublettdetektion och textklassificering på Förenklad Teknisk Engelska
Lund, Max January 2019
This thesis investigates the most effective way of classifying text labels and clustering duplicate texts in technical documentation written in Simplified Technical English. Pre-trained transformer language models (BERT) were tested against traditional methods, such as tf-idf with cosine-similarity kNN and SVMs, on the classification task. For detecting duplicate texts, vector representations from pre-trained transformer and LSTM models were tested against tf-idf using the density-based clustering algorithms DBSCAN and HDBSCAN. The results show that traditional methods are comparable to pre-trained models for classification, and that using tf-idf vectors with a low distance threshold in DBSCAN is preferable for duplicate detection.
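The duplicate-detection idea (tf-idf vectors plus a low cosine-distance threshold) can be illustrated with a minimal sketch. It is not the thesis implementation: the whitespace tokenizer, the smoothed idf formula, the `eps` value, and the function names are all assumptions, and the grouping step links any two texts within `eps` of each other (connected components), which is a simplification of DBSCAN rather than DBSCAN itself.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed idf (an assumption), so terms found in every text keep a weight.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def duplicate_groups(docs, eps=0.1):
    """Group texts whose pairwise cosine distance is at most eps."""
    vectors = tfidf_vectors(docs)
    parent = list(range(len(docs)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine_distance(vectors[i], vectors[j]) <= eps:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    # Only groups with at least two members are reported as duplicates.
    return [members for members in groups.values() if len(members) > 1]


docs = [
    "Remove the bolt from the panel",
    "remove the bolt from the panel",
    "Install the new filter",
]
print(duplicate_groups(docs))
```

A low `eps` keeps only near-identical texts together; raising it trades precision for recall.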
|
172 |
Aprendizado semissupervisionado multidescrição em classificação de textos / Multi-view semi-supervised learning in text classification
Braga, Ígor Assis 23 April 2010
Algoritmos de aprendizado semissupervisionado aprendem a partir de uma combinação de dados rotulados e não rotulados. Assim, eles podem ser aplicados em domínios em que poucos exemplos rotulados e uma vasta quantidade de exemplos não rotulados estão disponíveis. Além disso, os algoritmos semissupervisionados podem atingir um desempenho superior aos algoritmos supervisionados treinados nos mesmos poucos exemplos rotulados. Uma poderosa abordagem ao aprendizado semissupervisionado, denominada aprendizado multidescrição, pode ser usada sempre que os exemplos de treinamento são descritos por dois ou mais conjuntos de atributos disjuntos. A classificação de textos é um domínio de aplicação no qual algoritmos semissupervisionados vêm obtendo sucesso. No entanto, o aprendizado semissupervisionado multidescrição ainda não foi bem explorado nesse domínio dadas as diversas maneiras possíveis de se descrever bases de textos. O objetivo neste trabalho é analisar o desempenho de algoritmos semissupervisionados multidescrição na classificação de textos, usando unigramas e bigramas para compor duas descrições distintas de documentos textuais. Assim, é considerado inicialmente o difundido algoritmo multidescrição CO-TRAINING, para o qual são propostas modificações a fim de se tratar o problema dos pontos de contenção. É também proposto o algoritmo COAL, o qual pode melhorar ainda mais o algoritmo CO-TRAINING pela incorporação de aprendizado ativo como uma maneira de tratar pontos de contenção. Uma ampla avaliação experimental desses algoritmos foi conduzida em bases de textos reais. Os resultados mostram que o algoritmo COAL, usando unigramas como uma descrição das bases textuais e bigramas como uma outra descrição, atinge um desempenho significativamente melhor que um algoritmo semissupervisionado monodescrição. 
Levando em consideração os bons resultados obtidos por COAL, conclui-se que o uso de unigramas e bigramas como duas descrições distintas de bases de textos pode ser bastante compensador / Semi-supervised learning algorithms learn from a combination of both labeled and unlabeled data. Thus, they can be applied in domains where few labeled examples and a vast amount of unlabeled examples are available. Furthermore, semi-supervised learning algorithms may achieve a better performance than supervised learning algorithms trained on the same few labeled examples. A powerful approach to semi-supervised learning, called multi-view learning, can be used whenever the training examples are described by two or more disjoint sets of attributes. Text classification is a domain in which semi-supervised learning algorithms have shown some success. However, multi-view semi-supervised learning has not yet been well explored in this domain despite the possibility of describing textual documents in a myriad of ways. The aim of this work is to analyze the effectiveness of multi-view semi-supervised learning in text classification using unigrams and bigrams as two distinct descriptions of text documents. To this end, we initially consider the widely adopted CO-TRAINING multi-view algorithm and propose some modifications to it in order to deal with the problem of contention points. We also propose the COAL algorithm, which further improves CO-TRAINING by incorporating active learning as a way of dealing with contention points. A thorough experimental evaluation of these algorithms was conducted on real text data sets. The results show that the COAL algorithm, using unigrams as one description of text documents and bigrams as another description, achieves significantly better performance than a single-view semi-supervised algorithm. 
Taking into account the good results obtained by COAL, we conclude that the use of unigrams and bigrams as two distinct descriptions of text documents can be very effective.
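As an illustration of the two descriptions used above, the following sketch builds a unigram view and a bigram view of the same document. The whitespace tokenizer and the function names are assumptions. The helper `split_by_agreement` is also hypothetical: it treats unlabeled examples on which two view-specific classifiers disagree as candidate contention points, which is one plausible reading and not necessarily the definition used in the thesis.

```python
from collections import Counter


def unigram_view(text):
    """First description: counts of single tokens."""
    return Counter(text.lower().split())


def bigram_view(text):
    """Second description: counts of adjacent token pairs."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def split_by_agreement(predictions_a, predictions_b):
    """Split example indices into those where the two views agree or disagree.

    Disagreements are treated here, as an assumption, as candidate
    contention points that could be sent to a human annotator.
    """
    agreed, contested = [], []
    for index, (a, b) in enumerate(zip(predictions_a, predictions_b)):
        (agreed if a == b else contested).append(index)
    return agreed, contested


text = "New York is new"
print(unigram_view(text))
print(bigram_view(text))
print(split_by_agreement(["spam", "ham", "ham"], ["spam", "spam", "ham"]))
```

In a co-training loop, each view would train its own classifier, and the agreed examples would be candidates for automatic labeling.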
|
173 |
對使用者評論之情感分析研究-以Google Play市集為例 / Research into App user opinions with Sentimental Analysis on the Google Play market
林育龍, Lin, Yu Long Unknown Date
全球智慧型手機的出貨量持續提升,且熱門市集的App下載次數紛紛突破500億次。而在iOS和Android手機App市集中,App的評價和評論對App在市集的排序有很大的影響;對於App開發者而言,透過評論確實可掌握使用者的需求,並在產生抱怨前能快速反應避免危機。然而,每日多達上百篇的評論,透過人力逐篇查看,不止耗費時間,更無法整合性的瞭解使用者的需求與問題。
文字情感分析通常會使用監督式或非監督式的方法分析文字評論,其中監督式方法被證實透過簡單的文件量化方法就可達到很高的正確率。但監督式方法有無法預期未知趨勢的限制,且需要進行耗費人力的文章類別標注工作。
本研究透過情感傾向和熱門關注議題兩個面向來分析App評論,提出一個混合非監督式與監督式的中文情感分析方法。我們先透過非監督式方法標注評論類別,並作視覺化整理呈現,最後再用監督式方法建立分類模型,並驗證其效果。
在實驗結果中,利用中文詞彙網路所建立的情感詞集,確實可用來判斷評論的正反情緒,唯判斷負面評論效果不佳需作改善。在議題擷取方面,嘗試使用兩種不同分群方法,其中使用NPMI衡量字詞間關係強度,再配合社群網路分析的Concor方法結果有不錯的成效。最後在使用監督式學習的分類結果中,情感傾向的分類正確率達到87%,關注議題的分類正確率達到96%,皆有不錯表現。
本研究利用中文詞彙網路與社會網路分析,來發展一個非監督式的中文類別判斷方法,並建立一個中文情感分析的範例。另外透過建立全面性的視覺化報告來瞭解使用者的正反回饋意見,並可透過分類模型來掌握新評論的內容,以提供App開發者在市場上之競爭智慧。 / While smartphone shipments continue to grow, the number of app downloads from the popular app markets has already exceeded 50 billion. On the Apple App Store and Google Play, ratings and reviews strongly influence how an app ranks in the market. App developers can learn users' needs from reviews and respond quickly before complaints escalate, but the hundreds of reviews written every day are time-consuming to read one by one and difficult to collate into an overall picture.
Sentiment analysis research encompasses supervised and unsupervised methods for analyzing review text. Supervised learning has been shown to reach high accuracy with simple document representations, but it cannot anticipate unknown trends, and the class labels must be assigned manually.
We concentrate on two aspects, namely Sentiment Orientation and Popular Topic, and propose a Chinese sentiment analysis method that combines unsupervised and supervised learning. First, we use unsupervised learning to label every review and produce visual reports. Second, we employ supervised learning to build a classification model and verify the result.
In the experiment, Chinese WordNet is used to build a sentiment lexicon for determining a review's sentiment orientation, but the results show that it is weak at identifying negative reviews. In the topic extraction phase, we apply two clustering methods to extract Popular Topic classes; using NPMI to measure the strength of association between words, combined with the Concor method from social network analysis, gives good results. In the supervised learning phase, the classification accuracy is 87% for Sentiment Orientation and 96% for Popular Topic.
In this research, we develop an unsupervised method for determining review classes by means of Chinese WordNet and social network analysis, and provide a worked example of Chinese sentiment analysis. We also build a comprehensive visual report for understanding users' positive and negative feedback, and use the classification model to interpret new comments. Together, these provide app developers with competitive intelligence in the app market.
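For readers unfamiliar with NPMI (normalized pointwise mutual information), the measure used above for word association, a minimal sketch follows. It uses the standard definition NPMI(x, y) = ln(p(x, y) / (p(x) p(y))) / (-ln p(x, y)) and estimates probabilities from document-level co-occurrence. The estimation scheme, the function name, and the toy reviews are assumptions, and the Concor clustering step is not shown.

```python
import math
from collections import Counter
from itertools import combinations


def npmi_scores(documents):
    """Return NPMI for every word pair that co-occurs in at least one document.

    `documents` is a list of token lists. Probabilities are estimated as the
    fraction of documents containing a word or a word pair.
    """
    n_docs = len(documents)
    word_count = Counter()
    pair_count = Counter()
    for tokens in documents:
        words = sorted(set(tokens))
        word_count.update(words)
        pair_count.update(combinations(words, 2))

    scores = {}
    for (x, y), count in pair_count.items():
        p_xy = count / n_docs
        if p_xy == 1.0:
            # Both words occur in every document: the ratio is 0/0, so skip.
            continue
        p_x = word_count[x] / n_docs
        p_y = word_count[y] / n_docs
        pmi = math.log(p_xy / (p_x * p_y))
        scores[(x, y)] = pmi / -math.log(p_xy)
    return scores


reviews = [
    ["app", "crash"],
    ["app", "crash"],
    ["app", "slow"],
    ["game", "fun"],
]
for pair, score in sorted(npmi_scores(reviews).items()):
    print(pair, round(score, 3))
```

Scores range from -1 (never together) through 0 (independent) to 1 (always together); pairs that never co-occur are simply absent from the result here.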
|
174 |
基於語意框架之讀者情緒偵測研究 / Semantic Frame-based Approach for Reader-Emotion Detection
陳聖傑, Chen, Cen Chieh Unknown Date
過往對於情緒分析的研究顯少聚焦在讀者情緒,往往著眼於筆者情緒之研究。讀者情緒是指讀者閱讀文章後產生之情緒感受。然而相同一篇文章可能會引起讀者多種情緒反應,甚至產生與筆者迥異之情緒感受,也突顯其讀者情緒分析存在更複雜的問題。本研究之目的在於辨識讀者閱讀文章後之切確情緒,而文件分類的方法能有效地應用於讀者情緒偵測的研究,除了能辨識出正確的讀者情緒之外,並且能保留讀者情緒文件之相關內容。然而,目前的資訊檢索系統仍缺乏對隱含情緒之文件有效的辨識能力,特別是對於讀者情緒的辨識。除此之外,基於機器學習的方法難以讓人類理解,也很難查明辨識失敗的原因,進而無法了解何種文章引發讀者切確的情緒感受。有鑑於此,本研究提出一套基於語意框架(frame-based approach, FBA)之讀者情緒偵測研究的方法,FBA能模擬人類閱讀文章的方式外,並且可以有效地建構讀者情緒之基礎知識,以形成讀者情緒的知識庫。FBA具備高自動化抽取語意概念的基礎知識,除了利用語法結構的特徵,我們進一步考量周邊語境和語義關聯,將相似的知識整合成具有鑑別力之語意框架,並且透過序列比對(sequence alignment)的方式進行讀者情緒文件之匹配。經實驗結果顯示證明,本研究方法能有效地運用於讀者情緒偵測之相關研究。 / Previous studies on emotion classification mainly focus on the writer's emotional state. By contrast, this research emphasizes emotion detection from the readers' perspective. The classification of documents into reader-emotion categories can be applied in several ways; one application is to retain only the documents that evoke desired emotions, enabling users to retrieve documents that contain relevant content and at the same time instill the proper emotions. However, current IR systems lack the ability to discern emotion within texts, and reader-emotion detection has yet to achieve comparable performance. Moreover, previous machine learning-based approaches are generally not human-understandable, so it is difficult to pinpoint the reasons for recognition failures and to understand what emotions articles trigger in their readers.
We propose a flexible semantic frame-based approach (FBA) for reader-emotion detection that simulates this process in human perception. FBA is a highly automated process that incorporates various knowledge sources to learn, from raw text, semantic frames that characterize an emotion and are comprehensible to humans. The generated frames are used to predict readers' emotions through an alignment-based matching algorithm that allows a semantic frame to be partially matched through a statistical scoring scheme. Experimental results demonstrate that our approach can effectively detect readers' emotions by exploiting the syntactic structures and semantic associations in the context, and that it outperforms well-known statistical text classification methods and the state-of-the-art reader-emotion detection method.
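The idea of partially matching a frame against a text through sequence alignment can be sketched as follows. This is not the FBA scoring scheme: it is a generic local alignment in the style of Smith-Waterman, with fixed `match`, `mismatch`, and `gap` values standing in, as assumptions, for the statistical scores described above. The example frame and sentence are invented for illustration.

```python
def alignment_score(frame, tokens, match=2.0, mismatch=-1.0, gap=-0.5):
    """Best local-alignment score between a frame and a token sequence.

    A frame slot that equals a token earns `match`; skipping a slot or a
    token costs `gap`, so a frame can still score well when the text
    contains extra words between the matching slots.
    """
    rows, cols = len(frame) + 1, len(tokens) + 1
    table = [[0.0] * cols for _ in range(rows)]
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if frame[i - 1] == tokens[j - 1] else mismatch
            table[i][j] = max(
                0.0,                      # start a new local alignment
                table[i - 1][j - 1] + step,  # align slot i with token j
                table[i - 1][j] + gap,    # skip a frame slot
                table[i][j - 1] + gap,    # skip a token
            )
            best = max(best, table[i][j])
    return best


frame = ["government", "raise", "tax"]
sentence = ["the", "government", "will", "raise", "tax"]
print(alignment_score(frame, sentence))  # partial match with one skipped token
print(alignment_score(frame, frame))     # exact match
```

Dividing the score by the exact-match score of the frame gives a value between 0 and 1 that could be thresholded to decide whether the frame fires.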
|
175 |
De l'usage de la sémantique dans la classification supervisée de textes : application au domaine médical / On the use of semantics in supervised text classification : application in the medical domain
Albitar, Shereen 12 December 2013
Cette thèse porte sur l’impact de l’usage de la sémantique dans le processus de la classification supervisée de textes. Cet impact est évalué au travers d’une étude expérimentale sur des documents issus du domaine médical et en utilisant UMLS (Unified Medical Language System) en tant que ressource sémantique. Cette évaluation est faite selon quatre scénarii expérimentaux d’ajout de sémantique à plusieurs niveaux du processus de classification. Le premier scénario correspond à la conceptualisation où le texte est enrichi avant indexation par des concepts correspondant dans UMLS ; le deuxième et le troisième scénario concernent l’enrichissement des vecteurs représentant les textes après indexation dans un sac de concepts (BOC – bag of concepts) par des concepts similaires. Enfin le dernier scénario utilise la sémantique au niveau de la prédiction des classes, où les concepts ainsi que les relations entre eux, sont impliqués dans la prise de décision. Le premier scénario est testé en utilisant trois des méthodes de classification: Rocchio, NB et SVM. Les trois autres scénarii sont uniquement testés en utilisant Rocchio qui est le mieux à même d’accueillir les modifications nécessaires. Au travers de ces différentes expérimentations nous avons tout d’abord montré que des améliorations significatives pouvaient être obtenues avec la conceptualisation du texte avant l’indexation. Ensuite, à partir de représentations vectorielles conceptualisées, nous avons constaté des améliorations plus modérées avec d’une part l’enrichissement sémantique de cette représentation vectorielle après indexation, et d’autre part l’usage de mesures de similarité sémantique en prédiction. / The main interest of this research is the effect of using semantics in the process of supervised text classification. This effect is evaluated through an experimental study on documents related to the medical domain using the UMLS (Unified Medical Language System) as a semantic resource. 
This evaluation follows four scenarios involving semantics at different steps of the classification process: the first scenario incorporates the conceptualization step, where text is enriched with corresponding concepts from UMLS; the second and third scenarios concern enriching vectors that represent text as a Bag of Concepts (BOC) with similar concepts; the last scenario uses semantics during class prediction, where concepts as well as the relations between them are involved in decision making. We test the first scenario using three popular classification techniques: Rocchio, NB, and SVM. We choose Rocchio for the other scenarios because it is the easiest to extend with semantics. Experimental results demonstrate a significant improvement in classification performance when conceptualization is applied before indexing. More moderate improvements are reported when the conceptualized text representation is combined with semantic enrichment after indexing or with text-to-text semantic similarity measures for prediction.
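To make the Rocchio setting concrete, here is a minimal sketch of a centroid-based classifier over bag-of-concepts vectors. It shows only the simplest form (one centroid per class, cosine similarity for prediction, no negative-prototype term) and does not reproduce the semantic enrichment studied in the thesis. The concept identifiers such as `concept:insulin` are hypothetical placeholders, not real UMLS identifiers.

```python
import math
from collections import Counter, defaultdict


def centroid(vectors):
    """Average a list of sparse count vectors into one centroid."""
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return {key: value / len(vectors) for key, value in total.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def train_rocchio(labeled_docs):
    """Build one centroid per class from (concept list, label) pairs."""
    by_class = defaultdict(list)
    for concepts, label in labeled_docs:
        by_class[label].append(Counter(concepts))
    return {label: centroid(vectors) for label, vectors in by_class.items()}


def predict(centroids, concepts):
    """Return the class whose centroid is most similar to the document."""
    document = Counter(concepts)
    return max(centroids, key=lambda label: cosine(centroids[label], document))


# Hypothetical concept identifiers, used only for illustration.
training = [
    (["concept:diabetes", "concept:insulin"], "endocrine"),
    (["concept:insulin", "concept:glucose"], "endocrine"),
    (["concept:fracture", "concept:bone"], "orthopedic"),
]
model = train_rocchio(training)
print(predict(model, ["concept:glucose", "concept:insulin"]))
print(predict(model, ["concept:bone"]))
```

Because the model is just a set of centroids, enrichment with related concepts can be applied to the centroids or to the document vector without changing the classifier itself, which may be one reason such a model is easy to extend.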
|
177 |
VGCN-BERT : augmenting BERT with graph embedding for text classification : application to offensive language detection
Lu, Zhibin 05 1900
Le discours haineux est un problème sérieux sur les média sociaux. Dans ce mémoire, nous étudions le problème de détection automatique du langage haineux sur réseaux sociaux. Nous traitons ce problème comme un problème de classification de textes.
La classification de textes a fait un grand progrès ces dernières années grâce aux techniques d’apprentissage profond. En particulier, les modèles utilisant un mécanisme d’attention tel que BERT se sont révélés capables de capturer les informations contextuelles contenues dans une phrase ou un texte. Cependant, leur capacité à saisir l’information globale sur le vocabulaire d’une langue dans une application spécifique est plus limitée.
Récemment, un nouveau type de réseau de neurones, appelé Graph Convolutional Network (GCN), émerge. Il intègre les informations des voisins en manipulant un graphique global pour prendre en compte les informations globales, et il a obtenu de bons résultats dans de nombreuses tâches, y compris la classification de textes.
Par conséquent, notre motivation dans ce mémoire est de concevoir une méthode qui peut combiner à la fois les avantages du modèle BERT, qui excelle en capturant des informations locales, et le modèle GCN, qui fournit les informations globale du langage.
Néanmoins, le GCN traditionnel est un modèle d'apprentissage transductif, qui effectue une opération convolutionnelle sur un graphe composé d'éléments à traiter dans les tâches (c'est-à-dire un graphe de documents) et ne peut pas être appliqué à un nouveau document qui ne fait pas partie du graphe pendant l'entraînement. Dans ce mémoire, nous proposons d'abord un nouveau modèle GCN de vocabulaire (VGCN), qui transforme la convolution au niveau du document du modèle GCN traditionnel en convolution au niveau du mot en utilisant les co-occurrences de mots. En ce faisant, nous transformons le mode d'apprentissage transductif en mode inductif, qui peut être appliqué à un nouveau document.
Ensuite, nous proposons le modèle Interactive-VGCN-BERT qui combine notre modèle VGCN avec BERT. Dans ce modèle, les informations locales captées par BERT sont combinées avec les informations globales captées par VGCN. De plus, les informations locales et les informations globales interagissent à travers différentes couches de BERT, ce qui leur permet d'influencer mutuellement et de construire ensemble une représentation finale pour la classification. Via ces interactions, les informations de langue globales peuvent aider à distinguer des mots ambigus ou à comprendre des expressions peu claires, améliorant ainsi les performances des tâches de classification de textes.
Pour évaluer l'efficacité de notre modèle Interactive-VGCN-BERT, nous menons des expériences sur plusieurs ensembles de données de différents types -- non seulement sur le langage haineux, mais aussi sur la détection de grammaticalité et les commentaires sur les films. Les résultats expérimentaux montrent que le modèle Interactive-VGCN-BERT surpasse tous les autres modèles tels que Vanilla-VGCN-BERT, BERT, Bi-LSTM, MLP, GCN et ainsi de suite. En particulier, nous observons que VGCN peut effectivement fournir des informations utiles pour aider à comprendre un texte haiteux implicit quand il est intégré avec BERT, ce qui vérifie notre intuition au début de cette étude. / Hate speech is a serious problem on social media. In this thesis, we investigate the problem of automatic detection of hate speech on social media. We cast it as a text classification problem.
With the development of deep learning, text classification has made great progress in recent years. In particular, models using an attention mechanism, such as BERT, have shown great capability in capturing the local contextual information within a sentence or document. Although they capture local connections between words in a sentence, their ability to capture certain application-dependent global information and long-range semantic dependencies is limited.
Recently, a new type of neural network, called the Graph Convolutional Network (GCN), has attracted much attention. It provides an effective mechanism to take into account the global information via the convolutional operation on a global graph and has achieved good results in many tasks including text classification.
In this thesis, we propose a method that combines the advantages of both the BERT model, which excels at exploiting the local information in a text, and the GCN model, which provides application-dependent global language information.
However, the traditional GCN is a transductive learning model: it performs a convolutional operation on a graph composed of task entities (i.e., a document graph) and cannot be applied directly to a new document. In this thesis, we first propose a novel Vocabulary GCN model (VGCN), which transforms the document-level convolution of the traditional GCN model into a word-level convolution using a word graph created from word co-occurrences. In this way, we change the training of the GCN from the transductive learning mode to the inductive learning mode, so that it can be applied to new documents.
Secondly, we propose an Interactive-VGCN-BERT model that combines our VGCN model with BERT. In this model, local information, including dependencies between words in a sentence, is captured by BERT, while global information reflecting the relations between words in a language (e.g., related words) is captured by VGCN.
In addition, local information and global information interact through the different layers of BERT, allowing them to influence each other and jointly build a final representation for classification. In so doing, the global language information can help distinguish ambiguous words or understand unclear expressions, thereby improving the performance of text classification tasks.
To evaluate the effectiveness of our Interactive-VGCN-BERT model, we conduct experiments on several datasets of different types (hate language detection, movie reviews, and grammaticality) and compare it with several state-of-the-art baseline models. Experimental results show that our Interactive-VGCN-BERT outperforms all other models, such as Vanilla-VGCN-BERT, BERT, Bi-LSTM, MLP, and GCN. In particular, we have found that VGCN can indeed help understand a text when it is integrated with BERT, confirming our intuition to combine the two mechanisms.
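The word-level graph convolution mentioned above can be illustrated with a generic single GCN layer over a small word graph. This sketch is not the VGCN architecture: it applies the widely used propagation rule ReLU(D^(-1/2) (A + I) D^(-1/2) X W) in plain Python, and the binary co-occurrence adjacency, the identity features, and the identity weights are toy assumptions.

```python
import math


def normalized_adjacency(adjacency):
    """Add self-loops and apply symmetric degree normalization."""
    n = len(adjacency)
    with_loops = [
        [adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    degree = [sum(row) for row in with_loops]
    return [
        [with_loops[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
        for i in range(n)
    ]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def graph_conv(adjacency, features, weights):
    """One graph-convolution layer followed by ReLU."""
    hidden = matmul(matmul(normalized_adjacency(adjacency), features), weights)
    return [[max(0.0, value) for value in row] for row in hidden]


# Toy word graph: words 0 and 1 co-occur, word 2 is isolated.
adjacency = [
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
output = graph_conv(adjacency, identity, identity)
for row in output:
    print(row)
```

After one layer, the representations of the two co-occurring words are mixed while the isolated word is unchanged. Because the graph is built over the vocabulary rather than over documents, the same layer can be applied to the words of a document unseen during training.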
|
178 |
Web mining for social network analysis
Elhaddad, Mohamed Kamel Abdelsalam 09 August 2021
Undoubtedly, the rapid development of information systems and the widespread use of electronic means and social networks have played a significant role in accelerating the pace of events worldwide, for example in the 2012 Gaza conflict (the 8-day war), in the pro-secessionist rebellion during the 2013-2014 conflict in Eastern Ukraine, in the 2016 US presidential elections, and in connection with the COVID-19 pandemic since the beginning of 2020. As the volume of data shared daily on various social networking platforms in different languages grows quickly, techniques for classifying this huge amount of data automatically, promptly, and correctly are needed.
Of the many social networking platforms, Twitter is one of the most used by netizens. It allows its users to communicate, share their opinions, and express their emotions (sentiments) in the form of short blogs, easily and at no cost. Moreover, unlike other social networking platforms, Twitter allows research institutions to access its public and historical data, upon request and under control. Therefore, many organizations at different levels (e.g., governmental, commercial) seek to benefit from the analysis and classification of shared tweets in many application domains, for example sentiment analysis, to determine users' polarity from the content of their shared text, and misleading information detection, to ensure the legitimacy and credibility of the shared information. To attain this objective, one can apply numerous data representation, preprocessing, and natural language processing techniques, as well as machine and deep learning algorithms. Existing approaches face several challenges and limitations, including the handling of tweets in multiple languages, the choice of features to include in the feature vector, and the assignment of representative and descriptive weights to these features for different mining tasks. In addition, existing performance evaluation metrics are limited in their ability to fully assess the developed classification systems.
In this dissertation, two novel frameworks are introduced: the first efficiently analyzes and classifies bilingual (Arabic and English) textual content of social networks, while the second evaluates the performance of binary classification algorithms. The first framework is designed with: (1) an approach for handling tweets written in Arabic and English, which can be extended to cover data written in more languages and from other social networking platforms; (2) effective data preparation and preprocessing techniques; (3) a novel feature selection technique that allows different types of features (content-dependent, context-dependent, and domain-dependent) to be used; and (4) a novel feature extraction technique that assigns weights to the linguistic features based on how representative they are of the classes they belong to. The proposed framework is employed in sentiment analysis and misleading information detection. Its performance is compared to state-of-the-art classification approaches on 11 benchmark datasets comprising both Arabic and English textual content, demonstrating considerable improvement across the performance evaluation metrics. The framework is then used in a real-life case study to detect misleading information surrounding the spread of COVID-19.
In the second framework, a new multidimensional classification assessment score (MCAS) is introduced. MCAS can determine how good a classification algorithm is when dealing with binary classification problems. It takes into consideration the effect of misclassification errors on the probability of correctly detecting instances from both classes. Moreover, it should be valid regardless of the size of the dataset and of whether the dataset has a balanced or unbalanced distribution of instances over the classes. An empirical and practical analysis is conducted on both synthetic and real-life datasets to compare the behavior of the proposed metric against commonly used ones. The analysis reveals that the new measure can distinguish the performance of different classification techniques. Furthermore, it allows a class-based assessment of classification algorithms, that is, an assessment of how well the algorithm handles data from each class separately. This is useful when correctly classifying instances from one class is more important than correctly classifying instances from the other, such as in COVID-19 testing, where the detection of positive patients is much more important than that of negative ones. / Graduate
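The abstract does not give the MCAS formula, so it is not reproduced here. The following sketch only illustrates the motivation behind a class-based assessment: on an unbalanced dataset, overall accuracy can look high while one class is never detected. It computes accuracy alongside the standard per-class rates (sensitivity and specificity); the function name and the toy data are assumptions.

```python
def class_rates(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity, and specificity for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)

    def ratio(numerator, denominator):
        # Report 0.0 when a class has no instances, to avoid division by zero.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),  # share of positives detected
        "specificity": ratio(tn, tn + fp),  # share of negatives detected
    }


# Toy unbalanced dataset: 5 positives, 95 negatives, and a classifier
# that always predicts the negative class.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
print(class_rates(y_true, y_pred))
```

Here the accuracy is 0.95 even though no positive instance is found, which is the kind of situation a class-aware score is meant to expose.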
|
179 |
Filtrování spamových zpráv pomocí metod umělé inteligence / Email spam filtering using artificial intelligence
Safonov, Yehor January 2020
In the modern world, email is the most widely used technology for exchanging messages between users. Its popularity and rapid growth rest on three pillars: free availability, efficiency, and intuitiveness of information exchange. All three are significant advantages in the provision of communication services. On the other hand, the growing popularity of email poses considerable security risks and turns it into a universal tool for spreading unsolicited content. Potential attacks may target either specific endpoints or whole computer infrastructures. Although traditional spam-filtering techniques achieve high accuracy, they often fail to keep up with the rapid growth and evolution of spam techniques. These approaches suffer from overfitting, convergence to poor local minima, inefficiency in processing high-dimensional data, and long-term maintainability issues. One of the main goals of this master's thesis is to develop and train deep neural networks using the latest machine learning techniques to solve the text-based spam classification problem, which belongs to the Natural Language Processing (NLP) domain. From a theoretical point of view, the thesis focuses on email communication with an emphasis on spam filtering. Subsequent parts cover machine learning and artificial neural networks, discussing their principles of operation and basic properties. The theoretical part also covers possible ways of applying the described techniques to text analysis and NLP tasks. A key aspect of the study is a detailed comparison of current machine learning methods, their specifics, and their accuracy when applied to spam filtering. The practical part begins with the processing of the email dataset.
This phase was divided into five stages in order to preserve the key features of the raw data and increase the final quality of the dataset. The resulting dataset was used for training, testing, and validation of the chosen deep neural networks. The selected models, ULMFiT, BERT, and XLNet, were successfully implemented. The thesis describes the final data adaptation, the training process of the neural networks, and their testing and validation. Finally, the implemented models are compared using a confusion matrix, and possible improvements and a concise conclusion are outlined.
|
180 |
Job dissatisfaction detection through progress note
Wu, Jiechen 11 1900
La détection d'insatisfaction basée sur les notes de progression rédigées par des soignants de la santé domestique attire de plus en plus d'attention en tant que méthode de sondage, ce qui aidera à réduire le taux de rotation du personnel soignant. Nous proposons d'étudier la détection d'insatisfaction du soignant comme un problème de classification binaire (le soignant est susceptible de quitter ou pas).
Dans ce mémoire, les données réelles de six mois recueillies à partir de deux agences de soins à domicile sont utilisées. Après avoir montré la nature des données et le prétraitement des données, trois tâches de classification avec des granularités d'échantillonnage différentes (par note, par période et par soignant) sont conçues et abordées. Différentes combinaisons d'hyper-paramètres d'étiquetage sont soigneusement testées. Différentes méthodes de découpage sont couvertes pour montrer les limites des performances théoriques des modèles. L'aire sous la courbe ROC est utilisée pour évaluer les limites des approches mises en place que nous aurons mis en place. Les 6 ensembles d'attributs textuels et statistiques sont comparées. Enfin, les caractéristiques importantes des résultats sont analysées manuellement et automatiquement.
Nous montrons que les modèles fonctionnent mieux "par note" et "par période" que "par soignant" en termes de classification des notes. L'analyse manuelle montre que les modèles capturent les facteurs d'insatisfaction bien qu'il y en ait assez peu. L'analyse automatique n'exprime cependant aucune information utile. / Dissatisfaction detection based on the progress notes written by home health caregivers is drawing increasing attention as a probing method that can help lower the caregiver turnover rate. We propose to study caregiver dissatisfaction detection as a binary classification problem (the caregiver is likely to "leave" or "stay").
In this master's thesis, six months of real data collected from two home care agencies are used. After describing the nature of the data and its preprocessing, three classification tasks with different sample granularities (note-wise, period-wise, and employee-wise) are designed and tackled. Different combinations of labeling hyper-parameters are tested thoroughly. Different split methods are covered to show the theoretical performance boundaries of the models. Area under the ROC curve (AUC) scores are reported to assess each model. The performance of six sets of textual and statistical features is compared. Lastly, the important features from the results are analyzed manually and automatically.
We show that the models work better note-wise and period-wise than employee-wise in terms of classifying the notes. The manual analysis shows that the models capture the dissatisfaction factors, although there are quite few of them. The automatic analysis does not reveal any useful information.
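Since the models above are assessed by the area under the ROC curve, a short sketch of how that quantity can be computed may help. It uses the rank interpretation of AUC: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name, the label encoding (1 for "leave", 0 for "stay"), and the toy scores are assumptions.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    `labels` holds 1 for the positive class and 0 for the negative class.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = caregiver left, 0 = caregiver stayed.
labels = [1, 0, 1, 0]
scores = [0.9, 0.1, 0.4, 0.6]
print(roc_auc(labels, scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, independently of the decision threshold and of the class balance.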
|