131

Expansão de recursos para análise de sentimentos usando aprendizado semi-supervisionado / Extending sentiment analysis resources using semi-supervised learning

Henrico Bertini Brum 23 March 2018
The large volume of data available on the Internet can be an excellent source of new resources for several Natural Language Processing tasks, such as Sentiment Analysis. Unfortunately, annotating new corpora is costly, involving everything from financial investment to lengthy revision processes. Our work proposes a semi-supervised labeling approach, i.e. the automatic annotation of a large unlabeled corpus starting from a manually annotated dataset. To this end, we introduce TweetSentBR, a corpus of tweets in the TV show domain with three-class annotation and partial reviews by up to seven annotators. The corpus is an important linguistic resource for Brazilian Portuguese and is among the largest annotated corpora for polarity classification in the literature.
Beyond the manual annotation, we implemented a semi-supervised learning framework that uses the labeled data and iteratively expands it with unlabeled data. The TweetSentBR corpus, containing 15,000 annotated tweets, is thereby expanded roughly eightfold. For the expansion, classification models were trained with six polarity classifiers, and different parameters and representations were evaluated in order to obtain a reliable corpus. We ran experiments generating an expanded corpus for each classifier, both for three-class (positive, neutral and negative) and for binary classification. We evaluated the generated corpora on a held-out set, comparing the F-measure obtained when training on the manually annotated corpus versus the semi-automatically annotated ones. The semi-supervised corpus with the best results for three-class classification reached an average F-measure of 62.14%, surpassing the average obtained with the manually annotated corpus (61.02%). For binary classification, the best expanded corpus reached an average F-measure of 83.11%, surpassing the manually annotated corpus (79.80%). Furthermore, we simulated our expansion on labeled corpora from the literature, measuring how accurate the semi-automatically assigned labels are; our best result was the expansion of a product review corpus, which reached an F-measure of 93.15% on binary data. Finally, we compared our semi-supervised framework against a corpus from the literature obtained through distant supervision, and ours outperformed it on cross-domain binary polarity classification.
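The iterative expansion described above is essentially a self-training loop. A minimal sketch of that idea is shown below, assuming a TF-IDF representation, a logistic regression classifier, and a fixed confidence threshold; none of these specifics come from the thesis itself.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def self_train(seed_texts, seed_labels, pool_texts, rounds=5, threshold=0.9):
    """Iteratively label an unlabeled pool with a classifier trained on a seed corpus."""
    texts, labels, pool = list(seed_texts), list(seed_labels), list(pool_texts)
    for _ in range(rounds):
        vec = TfidfVectorizer().fit(texts + pool)
        clf = LogisticRegression(max_iter=1000).fit(vec.transform(texts), labels)
        if not pool:
            break
        probs = clf.predict_proba(vec.transform(pool))
        confident = np.where(probs.max(axis=1) >= threshold)[0]
        if len(confident) == 0:
            break
        # Move confidently predicted documents from the pool into the training set.
        for i in sorted(confident, reverse=True):
            labels.append(clf.classes_[probs[i].argmax()])
            texts.append(pool.pop(i))
    return texts, labels
```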
132

Coronavirus public sentiment analysis with BERT deep learning

Ling, Jintao January 2020
Microblogging has become a central platform where people express their thoughts and opinions toward public events in China. With the sudden outbreak of the coronavirus, related posts are usually followed immediately by a burst in microblog volume, which provides a great opportunity to explore public sentiment about the events. In this context, sentiment analysis is helpful to explore how the coronavirus affects public opinion. Deep learning has become a very popular technique for sentiment analysis. This thesis uses Bidirectional Encoder Representations from Transformers (BERT), a pre-trained unsupervised language representation model based on deep learning, to generate initial token embeddings that are further fine-tuned, together with a neural network model, on a supervised corpus, yielding a sentiment classifier. We utilize data recently made available by the government of Beijing containing 1 million blog posts from January 1 to February 20, 2020. The model developed in this thesis can also be used to track sentiment variation in Weibo microblog data in the future. In the final stage, the variation of public sentiment is analyzed and presented with visualization charts showing how public sentiment evolved as the coronavirus outbreak developed in China. The results on the labeled data and on all data are compared in order to explore how thoughts and opinions evolve over time. The results show significant growth of negative sentiment on January 20, when a lockdown started in Wuhan, after which the growth becomes slower. Around February 7, when Dr. Wenliang Li died, negative sentiment reached its peak.
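As an illustration of the classifier construction described above, the sketch below wires a pre-trained Chinese BERT checkpoint to a sequence-classification head using the Hugging Face transformers library. The model name, the three-class setup, and the example posts are assumptions for illustration; the classification head still needs to be fine-tuned on the labeled Weibo corpus before its predictions are meaningful.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Assumed checkpoint and label count; the thesis's exact configuration may differ.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=3)

def classify(posts):
    # Tokenize a batch of microblog posts and return predicted class ids
    # (only meaningful after the model has been fine-tuned on labeled data).
    batch = tokenizer(posts, padding=True, truncation=True, max_length=128,
                      return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits
    return logits.argmax(dim=-1).tolist()

print(classify(["今天心情很好", "疫情太可怕了"]))
```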
133

Analýza recenzí výrobků / Analysis of Product Reviews

Klocok, Andrej January 2020
Online store customers generate vast amounts of product and service information through reviews, which are an important source of feedback. This thesis deals with the creation of a system for the analysis of product and shop reviews in the Czech language. It describes current methods of sentiment analysis and builds on existing solutions. The resulting system implements automatic data download and indexing, followed by sentiment analysis and text summarization in the form of clustering similar sentences based on vector representations of the text. A graphical user interface in the form of a web page is also included. A review dataset with a total of more than six million reviews was created during the semester, along with an interface for easy data export.
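A rough sketch of the summarization step described above (clustering similar review sentences by their vector representations and keeping one representative per cluster) is shown below. The TF-IDF vectors, the k-means algorithm, and the cluster count are illustrative assumptions; the actual system may use different representations and settings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def summarize(sentences, n_clusters=5):
    """Pick one representative sentence per cluster of similar review sentences."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(sentences)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    summary = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # The sentence closest to the cluster centroid serves as the representative.
        dists = np.linalg.norm(X[members].toarray() - km.cluster_centers_[c], axis=1)
        summary.append(sentences[members[dists.argmin()]])
    return summary
```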
134

EXPLORING PSEUDO-TOPIC-MODELING FOR CREATING AUTOMATED DISTANT-ANNOTATION SYSTEMS

Sommers, Alexander Mitchell 01 September 2021
We explore the use of a pseudo-topic-model that imitates Latent Dirichlet Allocation (LDA) and is based on our original relevance metric, as a tool to facilitate distant annotation of short documents (often one to two sentences or less). Our exploration manifests as annotating tweets for emotions, this being the current use-case of interest to us, but we believe the method could be extended to any multi-class labeling task on documents of similar length. Tweets are gathered via the Twitter API using "track" terms thought likely to capture tweets with a greater chance of exhibiting each emotional class: 3,000 tweets for each of 26 topics anticipated to elicit emotional discourse. Our pseudo-topic-model is used to produce relevance-ranked vocabularies for each corpus of tweets, and these are used to distribute emotional annotations to those tweets not manually annotated, magnifying the number of annotated tweets by a factor of 29. The vector labels the annotators produce for the topics are cascaded out to the tweets via three different schemes, which are compared for performance by proxy through the competition of bidirectional LSTMs trained on the distantly labeled tweets. An SVM and two emotionally annotated vocabularies are also tested on each task to provide context and comparison.
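To make the cascading step concrete, the sketch below distributes a topic-level label vector to unlabeled tweets whose tokens overlap a topic's ranked vocabulary. The thesis's original relevance metric is not reproduced here; plain term frequency and a fixed overlap threshold stand in as placeholders, so every name and parameter below is an assumption.

```python
from collections import Counter

def top_vocabulary(topic_tweets, k=50):
    # Rank a topic's terms; the thesis uses its own relevance metric, so raw
    # term frequency is only a stand-in here.
    counts = Counter(w.lower() for tweet in topic_tweets for w in tweet.split())
    return {w for w, _ in counts.most_common(k)}

def distant_label(tweet, topic_vocab, topic_label_vector, min_overlap=3):
    # Cascade the annotators' topic-level emotion vector onto an unlabeled tweet
    # if it shares enough vocabulary with the topic.
    overlap = len(set(tweet.lower().split()) & topic_vocab)
    return topic_label_vector if overlap >= min_overlap else None
```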
135

How can a module for sentiment analysis be designed to classify tweets about covid19 / Hur kan man designa en modul inom sentimentanalys för att klassificera tweets om covid19

Ly, Denny, Saad Abdul Malik, Tamara January 2021
Sentiment analysis of text has been receiving increasing attention from different entities for a variety of reasons. Emotion mining (sentiment analysis) is a very interesting subject to explore; the research question is therefore how a module for sentiment analysis can be designed to classify tweets about Covid-19. The dataset used for this project was taken from Kaggle and preprocessed with various methods such as Bag of Words and term frequency-inverse document frequency (TF-IDF). The models are based on the following algorithms: KNN, SVM, DT, and NB. Some models also combine machine learning with a lexicon approach. The outcome of the experiment showed that the lexicon method, with an accuracy of 87%, exceeded the machine learning methods implemented in this thesis as well as the experiments done by the ML community on Kaggle. This implies that the traditional lexicon approach is still a fitting choice in the sentiment analysis field.
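The machine learning side of the comparison described above can be sketched as follows, assuming scikit-learn, a TF-IDF representation, and default hyperparameters; the actual preprocessing, split, and tuning in the thesis may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

def compare_models(tweets, labels):
    """Train KNN, SVM, DT and NB on TF-IDF features and report test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        tweets, labels, test_size=0.2, random_state=0)
    vec = TfidfVectorizer(min_df=2)
    Xtr, Xte = vec.fit_transform(X_train), vec.transform(X_test)
    models = {
        "KNN": KNeighborsClassifier(),
        "SVM": LinearSVC(),
        "DT": DecisionTreeClassifier(),
        "NB": MultinomialNB(),
    }
    return {name: accuracy_score(y_test, m.fit(Xtr, y_train).predict(Xte))
            for name, m in models.items()}
```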
136

Sentiment analysis of movie reviews in Chinese

Zhang, Jun January 2020
Sentiment analysis aims at figuring out the opinions of users towards a certain service or product. In this research, the aim is to classify the sentiments of users based on the comments they have posted on the Douban movie website. In this thesis, I try two different ways to classify the sentiments: the first classifies comments into five rating classes from 1 to 5, and the second classifies comments into three classes: negative, neutral and positive. For the latter, ratings of 1 and 2 are grouped as negative, ratings of 3 as neutral, and ratings of 4 and 5 as positive. First, Term Frequency-Inverse Document Frequency (TF-IDF) is used as the feature extraction technique for the machine learning algorithms. Chi-square and Mutual Information are used for feature selection. The selected features are fed into different machine learning methods: Logistic Regression, Linear SVC, an SGD classifier and Multinomial Naive Bayes. The performance of models with feature selection is compared with the performance of models without feature selection, for 5-class as well as 3-class classification. In addition, fastText and Skip-Gram are used as embedding methods for the deep learning algorithms LSTM and BiLSTM; fastText is also used both as an embedding method and as a classifier. The aim is to compare different machine learning and deep learning algorithms using different vectorization methods to see which model performs best on both 5-class and 3-class classification. The two classification strategies are also compared with each other in terms of error analysis, in order to identify the similarities and differences of the misclassifications made by the two strategies.
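Two of the steps described above, the 5-star to 3-class mapping and chi-square feature selection over TF-IDF features, are sketched below with scikit-learn. The value of k and the choice of Logistic Regression as the downstream classifier are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def to_three_classes(rating):
    # Ratings 1-2 -> negative, 3 -> neutral, 4-5 -> positive.
    return "negative" if rating <= 2 else ("neutral" if rating == 3 else "positive")

three_class_clf = make_pipeline(
    TfidfVectorizer(),
    SelectKBest(chi2, k=5000),   # keep the 5,000 terms most associated with the labels
    LogisticRegression(max_iter=1000),
)
# Usage (hypothetical variables): three_class_clf.fit(comments, [to_three_classes(r) for r in ratings])
```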
137

Genre and Domain Dependencies in Sentiment Analysis

Remus, Robert 23 April 2015
Genre and domain influence an author's style of writing and therefore a text's characteristics. Natural language processing is prone to such variations in textual characteristics: it is said to be genre and domain dependent. This thesis investigates genre and domain dependencies in sentiment analysis. Its goal is to support the development of robust sentiment analysis approaches that work well and in a predictable manner under different conditions, i.e. for different genres and domains. Initially, we show that a prototypical approach to sentiment analysis -- viz. a supervised machine learning model based on word n-gram features -- performs differently on gold standards that originate from differing genres and domains, but performs similarly on gold standards that originate from resembling genres and domains. We show that these gold standards differ in certain textual characteristics, viz. their domain complexity. We find a strong linear relation between our approach's accuracy on a particular gold standard and its domain complexity, which we then use to estimate our approach's accuracy. Subsequently, we use certain textual characteristics -- viz. domain complexity, domain similarity, and readability -- in a variety of applications. Domain complexity and domain similarity measures are used to determine parameter settings in two tasks. Domain complexity guides us in model selection for in-domain polarity classification, viz. in decisions regarding word n-gram model order and word n-gram feature selection. Domain complexity and domain similarity guide us in domain adaptation. We propose a novel domain adaptation scheme and apply it to cross-domain polarity classification in semi- and unsupervised domain adaptation scenarios. Readability is used for feature engineering. We propose to adopt readability gradings, readability indicators as well as word and syntax distributions as features for subjectivity classification. Moreover, we generalize a framework for modeling and representing negation in machine learning-based sentiment analysis. This framework is applied to in-domain and cross-domain polarity classification. We investigate the relation between implicit and explicit negation modeling, the influence of negation scope detection methods, and the efficiency of the framework in different domains. Finally, we carry out a case study in which we transfer the core methods of our thesis -- viz. domain complexity-based accuracy estimation, domain complexity-based model selection, and negation modeling -- to a gold standard that originates from a genre and domain hitherto not used in this thesis.
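As an illustration of the kind of corpus-level measure the thesis works with, the sketch below computes a simple domain similarity score as one minus the Jensen-Shannon distance between two corpora's unigram distributions. This is only a generic example of such a measure; it is not claimed to be the thesis's own definition of domain similarity or complexity.

```python
from collections import Counter
import numpy as np
from scipy.spatial.distance import jensenshannon

def unigram_distribution(texts, vocab):
    # Smoothed unigram distribution of a corpus over a shared vocabulary.
    counts = Counter(w for t in texts for w in t.lower().split())
    freqs = np.array([counts[w] + 1 for w in vocab], dtype=float)  # add-one smoothing
    return freqs / freqs.sum()

def domain_similarity(corpus_a, corpus_b):
    # 1.0 means identical unigram distributions; values near 0 mean very different domains.
    vocab = sorted({w for t in corpus_a + corpus_b for w in t.lower().split()})
    p = unigram_distribution(corpus_a, vocab)
    q = unigram_distribution(corpus_b, vocab)
    return 1.0 - jensenshannon(p, q, base=2)
```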
138

Logické spojky v postojové analýze / Logical connectives in sentiment analysis

Přikrylová, Katrin January 2015
No description available.
139

Umělé neuronové sítě a jejich využití při analýze textových dat / Artificial neural networks and their application in text analysis

Jankovič, Radovan January 2016
This thesis is devoted to the area of sentiment analysis. Its goal is to discuss and compare various methods applicable to sentiment classification of short texts. In analyzing the described techniques, we focus on the context of social networks. This type of media has recently become a source of vast amounts of data, and the demand for its automatic processing is high. Interesting results have been obtained for clustering used in combination with supervised learning and for convolution, which is primarily used for image data.
140

Predicting the Movement Direction of OMXS30 Stock Index Using XGBoost and Sentiment Analysis

Elena, Podasca January 2021
Background. Stock market prediction is an active yet challenging research area. A lot of effort has been put in by both academia and practitioners to produce accurate stock market prediction models, in an attempt to meet investment objectives. Tree-based ensemble machine learning methods such as XGBoost have proven successful in practice. At the same time, there is a growing trend to incorporate multiple data sources in prediction models, such as historical prices and text, in order to achieve superior forecasting performance. However, most applications and research have so far focused on the American or Asian stock markets, while the Swedish stock market has not been studied extensively from the perspective of hybrid models using both price- and text-derived features. Objectives. The purpose of this thesis is to investigate whether augmenting a numerical dataset based on historical prices with sentiment features extracted from financial news improves classification performance when predicting the daily price trend of the Swedish stock market index, OMXS30. Methods. A dataset of 3,517 samples between 2006 and 2020 was collected from two sources, historical prices and financial news. XGBoost was used as the classifier, and four different metrics were employed for model performance comparison given three complementary datasets: the dataset containing only the sentiment feature, the dataset with only price-derived features, and the dataset augmented with the sentiment feature extracted from financial news. Results. Results show that XGBoost performs well in classifying the daily trend of OMXS30 given historical price features, achieving an accuracy of 73% on the test set. A small improvement across all metrics is recorded on the test set when augmenting the numerical dataset with sentiment features extracted from financial news. Conclusions. XGBoost is a powerful ensemble method for stock market prediction, reflected in a satisfactory classification performance on the daily movement direction of OMXS30. However, augmenting the numerical input set with sentiment features extracted from text did not have a strong impact on classification performance in this case, as the improvements across all employed metrics were small.
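A minimal sketch of the augmented-feature setup described above is given below, assuming a pandas DataFrame of price-derived features plus a daily news sentiment score and a binary up/down target. The column names, hyperparameters, and chronological 80/20 split are assumptions, not the thesis's exact configuration.

```python
import pandas as pd
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

def train_trend_classifier(df: pd.DataFrame):
    # Assumed columns: lagged price features, a daily 'sentiment' score from news,
    # and a binary 'target' (1 = OMXS30 closes up the next day, 0 = down).
    features = [c for c in df.columns if c != "target"]
    split = int(len(df) * 0.8)                 # chronological split, no shuffling
    train, test = df.iloc[:split], df.iloc[split:]
    model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05)
    model.fit(train[features], train["target"])
    acc = accuracy_score(test["target"], model.predict(test[features]))
    return model, acc
```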
