Global ETD Search

81	Multi-Label Text Classification with Transfer Learning for Policy Documents : The Case of the Sustainable Development Goals Rodríguez Medina, Samuel January 2019 (has links) We created and analyzed a text classification dataset from freely-available web documents from the United Nation's Sustainable Development Goals. We then used it to train and compare different multi-label text classifiers with the aim of exploring the alternatives for methods that facilitate the search of information of this type of documents. We explored the effectiveness of deep learning and transfer learning in text classification by fine-tuning different pre-trained language representations — Word2Vec, GloVe, ELMo, ULMFiT and BERT. We also compared these approaches against a baseline of more traditional algorithms without using transfer learning. More specifically, we used multinomial Naive Bayes, logistic regression, k-nearest neighbors and Support Vector Machines. We then analyzed the results of our experiments quantitatively and qualitatively. The best results in terms of micro-averaged F1 scores and AUROC are obtained by BERT. However, it is also interesting that the second best classifier in terms of micro-averaged F1 scores is the Support Vector Machines, closely followed by the logistic regression classifier, which both have the advantage of being less computationally expensive than BERT. The results also show a close relation between our dataset size and the effectiveness of the classifiers. machine learning deep neural networks transfer learning text classification sustainable development goals sdgs
82	MaSTA: a text-based machine learning approach for systems-of-systems in the big data context / MaSTA: uma abordagem de aprendizado de máquina orientado a textos para sistemas-de-sistemas no contexto de big data Bianchi, Thiago 11 April 2019 (has links) Systems-of-systems (SoS) have gained a very important status in industry and academia as an answer to the growing complexity of software-intensive systems. SoS are particular in the sense that their capabilities transcend the mere sum of the capacities of their diverse independent constituents. In parallel, the current growth in the amount of data collected in different formats is impressive and imposes a considerable challenge for researchers and professionals, characterizing hence the Big Data context. In this scenario, Machine Learning techniques have been increasingly explored to analyze and extract relevant knowledge from such data. SoS have also generated a large amount of data and text information and, in many situations, users of SoS need to manually register unstructured, critical texts, e.g., work orders and service requests, and also need to map them to structured information. Besides that, these are repetitive, time-/effort-consuming, and even error-prone tasks. The main objective of this Thesis is to present MaSTA, an approach composed of an innovative classification method to infer classifiers from large textual collections and an evaluation method that measures the reliability and performance levels of such classifiers. To evaluate the effectiveness of MaSTA, we conducted an experiment with a commercial SoS used by large companies that provided us four datasets containing near one million records related with three classification tasks. As a result, this experiment indicated that MaSTA is capable of automatically classifying the documents and also improve the user assertiveness by reducing the list of possible classifications. Moreover, this experiment indicated that MaSTA is a scalable solution for the Big Data scenarios in which document collections have hundreds of thousands (even millions) of documents, even produced by different constituents of an SoS. / Sistemas-de-sistemas (SoS) conquistaram um status muito importante na indústria e na academia como uma resposta à crescente complexidade dos sistemas intensivos de software. SoS são particulares no sentido de que suas capacidades transcendem a mera soma das capacidades de seus diversos constituintes independentes. Paralelamente, o crescimento atual na quantidade de dados coletados em diferentes formatos é impressionante e impõe um desafio considerável para pesquisadores e profissionais, caracterizando consequentemente o contexto de Big Data. Nesse cenário, técnicas de Aprendizado de Máquina têm sido cada vez mais exploradas para analisar e extrair conhecimento relevante de tais dados. SoS também têm gerado uma grande quantidade de dados e informações de texto e, em muitas situações, os usuários do SoS precisam registrar manualmente textos críticos não estruturados, por exemplo, ordens de serviço e solicitações de serviço, e também precisam mapeá-los para informações estruturadas. Além disso, essas tarefas são repetitivas, demoradas, e até mesmo propensas a erros. O principal objetivo desta Tese é apresentar o MaSTA, uma abordagem composta por um método de classificação inovador para inferir classificadores a partir de grandes coleções de texto e um método de avaliação que mensura os níveis de confiabilidade e desempenho desses classificadores. Para avaliar a eficácia do MaSTA, nós conduzimos um experimento com um SoS comercial utilizado por grandes empresas que nos forneceram quatro conjuntos de dados contendo quase um milhão de registros relacionados com três tarefas de classificação. Como resultado, esse experimento indicou que o MaSTA é capaz de classificar automaticamente os documentos e também melhorar a assertividade do usuário através da redução da lista de possíveis classificações. Além disso, esse experimento indicou que o MaSTA é uma solução escalável para os cenários de Big Data, nos quais as coleções de documentos têm centenas de milhares (até milhões) de documentos, até mesmo produzidos por diferentes constituintes de um SoS. Aprendizado de máquina Big Data Big Data Classificação de texto Machine learning Naive Bayes Naive Bayes Sistema-de-sistemas System-of-systems Text classification
83	Multi-scale analysis of languages and knowledge through complex networks / Análise multi-escala de línguas e conecimento por meio de redes complexas Arruda, Henrique Ferraz de 24 January 2019 (has links) There any many different aspects in natural languages and their related dynamics that have been studied. In the case of languages, some quantitative analyses have been done by using stochastic models. Furthermore, natural languages can be understood as complex systems. Thus, there is a possibility to use set of tools development to analyse complex networks, which are computationally represented by graphs, also to analyse natural languages. Furthermore, these tools can be used to represent and analyse some related dynamics taking place on the networks. Observe that knowledge is intrinsically related to language, because language is the vehicle used by humans beings to transmit dicoveries, and the language itself is also a type of knowledge. This thesis is divided into two types of analyses: (i) texts and (II) dynamical aspects. In the first part, we proposed networks representations of text in different scales analyses, starting from the analysis of writing style considering word adjacency networks (co-occurence) to understand local patterns of words, to a mesoscopic representation, which is created from chunks of text and grasps information of the unfolding of the story. In the second part, we considered the structure and dynamics related to knowledge and language, in this case, starting from the larger scale, in which we studied the connectivity between applied and theoretical physics. In the following, we simulated the knowledge acquisition by researchers in a multi-agent dynamics and an intelligent machine that solves problems, which is represented by a network. At the smallest considered scale, we simulate the transmission of networks. This transmission considers the data as a series of organized symbols that is obtained from a dynamics. In order to improve the speed of transmission, the series can be compacted. For that, we considered the information theory and Huffman code. The proposed network-based approaches were found to be suitable to deal with the employed analysis for all of the tested scales. / Existem diversos aspectos das linguagens naturais e de dinâmicas relacionadas que estão sendo estudadas. No caso das línguas, algumas análises quantitativas foram feitas usando modelos estocásticos. Ademais, linguagens naturais podem ser entendidas como sistemas complexos. Para analisar linguagens naturais, existe a possibilidade de utilizar o conjunto de ferramentas que já foram desenvolvidas para analisar redes complexas, que são representadas computacionalmente. Além disso, tais ferramentas podem ser utilizadas para representar e analisar algumas dinâmicas relacionadas a redes complexas. Observe que o conhecimento está intrinsecamente relacionado à linguagem, pois a linguagem é o veículo usado para transmitir novas descobertas, sendo que a própria linguagem também é um tipo de conhecimento. Esta tese é dividida em dois tipos de análise : (i) textos e (ii) aspectos dinâmicos. Na primeira parte foram propostas representações de redes de texto em diferentes escalas de análise. A partir da análise do estilo de escrita, considerando redes de adjacência de palavras (co-ocorrência) para entender padrões locais de palavras, até uma representação mesoscópica, que é criada a partir de pedaços de texto e que representa informações do texto de acordo com o desenrolar da história. Na segunda parte, foram consideradas a estrutura e dinâmica relacionadas ao conhecimento e à linguagem. Neste caso, partiu-se da escala maior, com a qual estudamos a conectividade entre física aplicada e física teórica. A seguir, simulou-se a aquisição de conhecimento por pesquisadores em uma dinâmica multi-agente e uma máquina inteligente que resolve problemas, que é representada por uma rede. Como a menor escala considerada, foi simulada a transmissão de redes. Essa transmissão considera os dados como uma série de símbolos organizados que são obtidos a partir de uma dinâmica. Para melhorar a velocidade de transmissão, a série pode ser compactada. Para tanto, foi utilizada a teoria da informação e o código de Huffman. As propostas de abordagens baseadas em rede foram consideradas adequadas para lidar com a análise empregada, em todas as escalas testadas. Classificação de textos Complex networks Dinâmicas relacionadas ao conhecimento Dynamics related to knowledge Mineração de textos Redes complexas Text classification Text mining
84	Utility of Feedback Given by Students During Courses Atkisson, Michael Alton 01 July 2017 (has links) This two-article dissertation summarizes the end-of-course survey and formative feedback literatures, as well as proposes actionability as a useful construct in the analysis of feedback from students captured in real-time during their courses. The present inquiry grew out of my work as the founder of DropThought Education, a Division of DropThought. DropThought Education was a student feedback system that helped instructional designers, instructors, and educational systems to use feedback from students to improve learning and student experience. To find out whether the DropThought style of feedback was more effective than other forms of capturing and analyzing student feedback, I needed to (1) examine the formative feedback literature and (2) test DropThought style feedback against traditional feedback forms. The method and theory proposed demonstrates that feedback from students can be specific and actionable when captured in the moment at students' activity level, in their own words. Application of the real-time feedback approach are relevant to practitioners and researchers alike, whether an instructor looking to improve her class activities, or a learning scientist carrying out interventionist, design-based research. formative feedback end-of-course feedback real-time feedback DropThought hierarchical generalized linear model text classification Educational Psychology
85	Text Classificaton In Turkish Marketing Domain And Context-sensitive Ad Distribution Engin, Melih 01 February 2009 (has links) (PDF) Online advertising has a continuously increasing popularity. Target audience of this new advertising method is huge. Additionally, there is another rapidly growing and crowded group related to internet advertising that consists of web publishers. Contextual advertising systems make it easier for publishers to present online ads on their web sites, since these online marketing systems automatically divert ads to web sites with related contents. Web publishers join ad networks and gain revenue by enabling ads to be displayed on their sites. Therefore, the accuracy of automated ad systems in determining ad-context relevance is crucial. In this thesis we construct a method for semantic classification of web site contexts in Turkish language and develop an ad serving system to display context related ads on web documents. The classification method uses both semantic and statistical techniques. The method is supervised, and therefore, needs processed sample data for learning classification rules. Therefore, we generate a Turkish marketing dataset and use it in our classification approaches. We form successful classification methods using different feature spaces and support vector machine configurations. Our results present a good comparison between these methods.
86	Collocation Segmentation for Text Chunking / Teksto skaidymas pastoviųjų junginių segmentais Daudaravičius, Vidas 04 February 2013 (has links) Segmentation is a widely used paradigm in text processing. Rule-based, statistical and hybrid methods are employed to perform the segmentation. This dissertation introduces a new type of segmentation - collocation segmentation - and a new method to perform it, and applies them to three different text processing tasks. In lexicography, collocation segmentation makes possible the use of large corpora to evaluate the usage and importance of terminology over time. Text categorization results can be improved using collocation segmentation. The study shows that collocation segmentation, without any other language resources, achieves better results than the widely used n-gram techniques together with POS (Part-of-Speech) processing tools. Also, the preprocessing of data with collocation segmentation and subsequent integration of these segments into a Statistical Machine Translation system improves the translation results. Diverse word combinability measures variously influence the final collocation segmentation and, thus, the translation results. The new collocation segmentation method is simple, efficient and applicable to language processing for diverse applications. / Teksto skaidymo įvairaus tipo segmentais metodai yra plačiai naudojami teksto apdorojimui. Segmentuojant naudojami tiek statistiniai, tiek formalieji metodai. Disertacijoje pristatomas naujas segmentavimo tipas ir metodas - segmentavimas pastoviaisiais junginiais - ir pateikiami taikymai įvairiose teksto apdorojimo srityse. Taikant pastoviųjų junginių segmentavimą leksikografijoje atskleidžiama, kaip objektyviai ir greitai galima analizuoti labai didelius tekstų archyvus aptinkant vartojamą terminiją ir šių automatiškai identifikuotų terminų svarbumą ir kaitą laiko tėkmėje. Ši analizė leidžia greitai nustatyti svarbius metodologinius pokyčius mokslinių tyrimų istorijoje ir nustatyti pastarojo meto aktualias tyrimų sritis. Tekstų klasifikavimo taikyme atskleidžiama, kaip taikant segmentavimą pastoviaisiais junginiais galima pagerinti tekstų klasifikavimo rezultatus. Taip pat, pasitelkiant segmentavimą pastoviaisiais junginiais, atskleidžiama, kad nežymiai galima pagerinti statistinio mašininio vertimo kokybę, ir atskleidžiama įvairių žodžių junglumo įverčių įtaka segmentavimui pastoviaisiais junginiais. Naujas teksto skaidymo pastoviaisiais junginiais metodas atskleidžia naujas galimybes gerinti teksto apdorojimo rezultatus įvairiuose taikymuose ir įvairiose kalbose. Informatics Collocation segmentation Multi-word Terminology Machine translation Text classification Pastovieji junginiai Daugiažodžiai junginiai Terminologija Mašininis vertimas Tekstų klasifikavimas
87	Teksto skaidymas pastoviųjų junginių segmentais / Collocation segmentation for text chunking Daudaravičius, Vidas 04 February 2013 (has links) Teksto skaidymo įvairaus tipo segmentais metodai yra plačiai naudojami teksto apdorojimui. Segmentuojant naudojami tiek statistiniai, tiek formalieji metodai. Disertacijoje pristatomas naujas segmentavimo tipas ir metodas - segmentavimas pastoviaisiais junginiais - ir pateikiami taikymai įvairiose teksto apdorojimo srityse. Taikant pastoviųjų junginių segmentavimą leksikografijoje atskleidžiama, kaip objektyviai ir greitai galima analizuoti labai didelius tekstų archyvus aptinkant vartojamą terminiją ir šių automatiškai identifikuotų terminų svarbumą ir kaitą laiko tėkmėje. Ši analizė leidžia greitai nustatyti svarbius metodologinius pokyčius mokslinių tyrimų istorijoje ir nustatyti pastarojo meto aktualias tyrimų sritis. Tekstų klasifikavimo taikyme atskleidžiama, kaip taikant segmentavimą pastoviaisiais junginiais galima pagerinti tekstų klasifikavimo rezultatus. Taip pat, pasitelkiant segmentavimą pastoviaisiais junginiais, atskleidžiama, kad nežymiai galima pagerinti statistinio mašininio vertimo kokybę, ir atskleidžiama įvairių žodžių junglumo įverčių įtaka segmentavimui pastoviaisiais junginiais. Naujas teksto skaidymo pastoviaisiais junginiais metodas atskleidžia naujas galimybes gerinti teksto apdorojimo rezultatus įvairiuose taikymuose ir įvairiose kalbose. / Segmentation is a widely used paradigm in text processing. Rule-based, statistical and hybrid methods are employed to perform the segmentation. This dissertation introduces a new type of segmentation - collocation segmentation - and a new method to perform it, and applies them to three different text processing tasks. In lexicography, collocation segmentation makes possible the use of large corpora to evaluate the usage and importance of terminology over time. Text categorization results can be improved using collocation segmentation. The study shows that collocation segmentation, without any other language resources, achieves better results than the widely used n-gram techniques together with POS (Part-of-Speech) processing tools. Also, the preprocessing of data with collocation segmentation and subsequent integration of these segments into a Statistical Machine Translation system improves the translation results. Diverse word combinability measures variously influence the final collocation segmentation and, thus, the translation results. The new collocation segmentation method is simple, efficient and applicable to language processing for diverse applications. Informatics Pastovieji junginiai Daugiažodžiai junginiai Terminologija Mašininis vertimas Tekstų klasifikavimas Collocation segmentation Multi-word Terminology Machine translation Text classification
88	A data mining approach to ontology learning for automatic content-related question-answering in MOOCs Shatnawi, Safwan January 2016 (has links) The advent of Massive Open Online Courses (MOOCs) allows massive volume of registrants to enrol in these MOOCs. This research aims to offer MOOCs registrants with automatic content related feedback to fulfil their cognitive needs. A framework is proposed which consists of three modules which are the subject ontology learning module, the short text classification module, and the question answering module. Unlike previous research, to identify relevant concepts for ontology learning a regular expression parser approach is used. Also, the relevant concepts are extracted from unstructured documents. To build the concept hierarchy, a frequent pattern mining approach is used which is guided by a heuristic function to ensure that sibling concepts are at the same level in the hierarchy. As this process does not require specific lexical or syntactic information, it can be applied to any subject. To validate the approach, the resulting ontology is used in a question-answering system which analyses students' content-related questions and generates answers for them. Textbook end of chapter questions/answers are used to validate the question-answering system. The resulting ontology is compared vs. the use of Text2Onto for the question-answering system, and it achieved favourable results. Finally, different indexing approaches based on a subject's ontology are investigated when classifying short text in MOOCs forum discussion data; the investigated indexing approaches are: unigram-based, concept-based and hierarchical concept indexing. The experimental results show that the ontology-based feature indexing approaches outperform the unigram-based indexing approach. Experiments are done in binary classification and multiple labels classification settings . The results are consistent and show that hierarchical concept indexing outperforms both concept-based and unigram-based indexing. The BAGGING and random forests classifiers achieved the best result among the tested classifiers. 371.33
89	[en] USING MACHINE LEARNING TO BUILD A TOOL THAT HELPS COMMENTS MODERATION / [pt] UTILIZANDO APRENDIZADO DE MÁQUINA PARA CONSTRUÇÃO DE UMA FERRAMENTA DE APOIO A MODERAÇÃO DE COMENTÁRIOS SILVANO NOGUEIRA BUBACK 05 March 2012 (has links) [pt] Uma das mudanças trazidas pela Web 2.0 é a maior participação dos usuários na produção do conteúdo, através de opiniões em redes sociais ou comentários nos próprios sites de produtos e serviços. Estes comentários são muito valiosos para seus sites pois fornecem feedback e incentivam a participação e divulgação do conteúdo. Porém excessos podem ocorrer através de comentários com palavrões indesejados ou spam. Enquanto para alguns sites a própria moderação da comunidade é suficiente, para outros as mensagens indesejadas podem comprometer o serviço. Para auxiliar na moderação dos comentários foi construída uma ferramenta que utiliza técnicas de aprendizado de máquina para auxiliar o moderador. Para testar os resultados, dois corpora de comentários produzidos na Globo.com foram utilizados, o primeiro com 657.405 comentários postados diretamente no site, e outro com 451.209 mensagens capturadas do Twitter. Nossos experimentos mostraram que o melhor resultado é obtido quando se separa o aprendizado dos comentários de acordo com o tema sobre o qual está sendo comentado. / [en] One of the main changes brought by Web 2.0 is the increase of user participation in content generation mainly in social networks and comments in news and service sites. These comments are valuable to the sites because they bring feedback and motivate other people to participate and to spread the content. On the other hand these comments also bring some kind of abuse as bad words and spam. While for some sites their own community moderation is enough, for others this impropriate content may compromise its content. In order to help theses sites, a tool that uses machine learning techniques was built to mediate comments. As a test to compare results, two datasets captured from Globo.com were used: the first one with 657.405 comments posted through its site and the second with 451.209 messages captured from Twitter. Our experiments show that best result is achieved when comment learning is done according to the subject that is being commented. [pt] CLASSIFICACAO DE TEXTOS [en] TEXT CLASSIFICATION [pt] PROCESSAMENTO DA LINGUAGEM NATURAL [en] NATURAL LANGUAGE PROCESSING [pt] SVM [en] SVM [pt] BOOSTING [en] BOOSTING
90	Uma investigação de aspectos da classificação de tópicos para textos curtos Oliveira, Ewerton Lopes Silva de 23 February 2015 (has links) Submitted by Clebson Anjos (clebson.leandro54@gmail.com) on 2016-02-15T17:35:03Z No. of bitstreams: 1 arquivototal.pdf: 1768771 bytes, checksum: 5e8df60284fb114853ef61923cb2ec0d (MD5) / Made available in DSpace on 2016-02-15T17:35:03Z (GMT). No. of bitstreams: 1 arquivototal.pdf: 1768771 bytes, checksum: 5e8df60284fb114853ef61923cb2ec0d (MD5) Previous issue date: 2015-02-23 / Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES / In recent years a large number of scientific research has stimulated the use of web data as inputs for the epidemiological surveillance and knowledge discovery/mining related to public health in general. In order to make use of social media content, especially tweets, some approaches proposed before transform a content identification problem to a text classification problem, following the supervised learning scenario. However, during this process, some limitations attributed to the representation of messages as well as the extraction of attributes arise. From this, the present research is aimed to investigate the performance impact in the short social messages classification task using a continuous expansion of the training set approach with support of a measure of confidence in the predictions made. At the same time, the survey also aimed to evaluate alternatives for consideration and extraction of terms used for the classification in order to reduce dependencies on term-frequency based metrics. Restricted to the binary classification of tweets related to health events and written in English, the results showed a 9% improvement in F1, compared to the baseline used, showing that the action of expanding the classifier increases the performance, even in the case of short message classification task for health concerns. For the term weighting objective, the main contribution obtained is the ability to automatically indentify high discriminative terms in the dataset, without suffering limitations regarding term-frequency. This may, for example, be able to help build more robust and dynamic classification processes which make use of lists of specific terms for indexing contents on external database ( textit background knowledge). Overall, the results can benefit, by the improvement of the discussed hypotheses, the emergence of more robust applications in the field of surveillance, control and decision making to real health events (epidemiology, health campaigns, etc.), through the task of classifying short social messages. / Nos últimos anos um grande número de pesquisas científicas fomentou o uso de informações da web como insumos para a vigilância epidemiológica e descoberta/mineração de conhecimentos relacionados a saúde pública em geral. Ao fazerem uso de conteúdo das mídias sociais, principalmente tweets, as abordagens propostas transformam o problema de identificação de conteúdo em um problema de classificação de texto, seguindo o cenário de aprendizagem supervisionada. Neste processo, algumas limitações atribuídas à representação das mensagens, atualização de modelo assim como a extração de atributos discriminativos, surgem. Partido disso, a presente pesquisa propõe investigar o impacto no desempenho de classificação de mensagens sociais curtas através da expansão contínua do conjunto de treinamento tendo como referência a medida de confiança nas predições realizadas. Paralelamente, a pesquisa também teve como objetivo avaliar alternativas para ponderação e extração de termos utilizados para a classificação, de modo a reduzir a dependência em métricas baseadas em frequência de termos. Restringindo-se à classificação binária de tweets relacionados a eventos de saúde e escritos em língua inglesa, os resultados obtidos revelaram uma melhoria de F1 de 9%, em relação a linha de base utilizada, evidenciando que a ação de expandir o classificador eleva o desempenho de classificação, também para o caso da classificação de mensagens curtas em domínio de saúde. Sobre a ponderação de termos, tem-se que a principal contribuição obtida, está na capacidade de levantar termos característicos do conjunto de dados e suas classes de interesse automaticamente, sem sofrer com limitações de frequência de termos, o que pode, por exemplo, ser capaz de ajudar a construir processos de classificação mais robustos e dinâmicos ao qual façam uso de listas de termos específicos para indexação em consultas à bancos de dados externos (background knowledge). No geral, os resultados apresentados podem beneficiar, pelo aprimoramento das hipóteses levantadas, o surgimento de aplicações mais robustas no campo da vigilância, controle e contrapartida à eventos reais de saúde (epidemiologia, campanhas de saúde, etc.), por meio da tarefa de classificação de mensagens sociais curtas. Aprendizagem de máquina Classificação de texto Mensagens sociais Text classification Social messages Machine learning

Search results