171 |
Sistema de recomendação de objeto de aprendizagem baseado em postagens extraídas do ambiente virtual de aprendizagem / Learning object recommendation system based on posts extracted from the virtual learning environment. Silva, Reinaldo de Jesus da. January 2016.
Discussion forums are one of the interaction tools used in virtual learning environments (VLEs). This research proposes a computational system for recommending Learning Objects (LOs) based on the posts made in the forums of a Virtual Learning Environment (VLE). The methodology was qualitative research of the descriptive and explanatory types. The system identifies keywords in the forums of a VLE; uses these keywords as evidence of users' interests; ranks (assigns weights to) the most relevant words (hot topics); submits them to a search mechanism (a repository; in this work, web search engines were used for testing purposes); and presents the search results to the users. The contributions of this system to the participants in this research are: automatic recommendation of LOs to students and teachers; application of data mining to an educational management system; a text mining technique based on the TF*PDF (Term Frequency * Proportional Document Frequency) algorithm; and integration of the VLE with a digital repository. To validate the LO recommendation system in a VLE, a prototype was developed and tested with a sample of twenty-five students and five teachers from two classes of the Database Modeling and User Interfaces and Computational Systems courses of the Computer Engineering program at the State University of Maranhão. The study reported in this thesis focuses on LO recommendation in the forums of a VLE. The evaluation and validation carried out with the prototype, involving teachers and students, showed that the proposed Web Services recommendation system (RECOAWS) meets expectations and can support teachers and students in their pedagogical activities within the forums.
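The abstract above names TF*PDF (Term Frequency * Proportional Document Frequency) as the weighting scheme used to rank forum keywords. As a rough illustration of that idea rather than the author's implementation, the sketch below scores each term by its normalized frequency multiplied by the exponential of the proportion of posts containing it, which is the usual single-channel form of TF*PDF; the whitespace tokenizer and the toy list of posts are assumptions.

# Rough sketch of TF*PDF term weighting over a set of forum posts.
# Assumptions: a single "channel" of posts and whitespace tokenization;
# the thesis's actual preprocessing and weighting details may differ.
import math
from collections import Counter

def tf_pdf(posts):
    term_freq = Counter()        # total occurrences of each term
    doc_freq = Counter()         # number of posts containing each term
    for post in posts:
        tokens = post.lower().split()
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    n_posts = len(posts)
    norm = math.sqrt(sum(f * f for f in term_freq.values()))   # normalize raw frequencies
    return {t: (term_freq[t] / norm) * math.exp(doc_freq[t] / n_posts)
            for t in term_freq}

posts = ["normalization in database modeling", "foreign key and primary key",
         "database normalization examples"]
hot_topics = sorted(tf_pdf(posts).items(), key=lambda kv: kv[1], reverse=True)[:5]

The highest-scoring terms (the hot topics) would then be submitted as queries to a repository or search engine, as the system described above does.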
|
172 |
Aprendizado não supervisionado de hierarquias de tópicos a partir de coleções textuais dinâmicas / Unsupervised learning of topic hierarchies from dynamic text collections. Ricardo Marcondes Marcacini. 19 May 2011.
The need to extract useful and novel knowledge from large volumes of textual data has increasingly motivated research on Text Mining methods. Among the existing methods, initiatives for organizing knowledge through topic hierarchies stand out: the knowledge implicit in the texts is represented as topics and subtopics, and each topic contains documents related to the same theme. Topic hierarchies play an important role in information retrieval, especially in exploratory search tasks, since they allow the knowledge of interest to be analyzed at several levels of granularity and support interactive exploration of large document collections. Hierarchical clustering methods have been used to support the construction of topic hierarchies, since they organize textual collections into clusters and subclusters, in an unsupervised manner, based on the similarities among documents. However, most hierarchical clustering methods are not suitable for scenarios involving dynamic text collections, in which frequent updates of the clustering are required. Clustering methods that meet the requirements of dynamic scenarios must process new documents as soon as they are added to the collection, performing the clustering incrementally. This work therefore explores the use of incremental clustering methods for the unsupervised learning of topic hierarchies in dynamic text collections. Incremental clustering is applied to build and update a condensed representation of the texts, which maintains a summary of the main features of the data. Hierarchical clustering algorithms can then be applied to the condensed representations, organizing the textual collection more efficiently. Three incremental clustering strategies from the literature were evaluated experimentally, and an alternative strategy more appropriate for topic hierarchies was proposed. The results indicate that topic hierarchies built with incremental clustering have quality close to that of hierarchies built by non-incremental methods, with a significant reduction in computational cost.
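As a rough sketch of the pipeline described above (incremental clustering maintains a condensed summary of the collection, and a hierarchical method is then run over that summary), the snippet below uses scikit-learn's Birch as a stand-in incremental summarizer and builds a hierarchy over its subcluster centers. This is an illustration under assumed tools and parameters, not one of the strategies evaluated in the dissertation.

# Sketch: maintain a condensed summary of a growing text stream with an
# incremental clusterer, then build a hierarchy over the summary.
# Assumptions: scikit-learn/scipy are available and Birch stands in for the
# incremental strategies actually evaluated in the dissertation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import Birch
from scipy.cluster.hierarchy import linkage

batch_1 = ["first stream of documents about clustering", "more documents about topic hierarchies"]
batch_2 = ["newly arrived documents about incremental clustering"]

vectorizer = TfidfVectorizer()
X1 = vectorizer.fit_transform(batch_1)

summarizer = Birch(threshold=0.5, n_clusters=None)      # condensed representation (CF tree)
summarizer.partial_fit(X1)                              # initial batch
summarizer.partial_fit(vectorizer.transform(batch_2))   # incremental update with new documents

# Hierarchical clustering runs over the (much smaller) set of subcluster
# centers rather than over every document, which is where the efficiency gain comes from.
centers = summarizer.subcluster_centers_
tree = linkage(centers, method="average", metric="cosine")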
|
173 |
Mineração textual e produção de fanfictions: processos desencadeadores de oportunidades de letramento no ensino de língua estrangeira / Text mining and fanfiction production: processes that trigger literacy opportunities in foreign language teaching. Barcellos, Patrícia da Silva Campelo Costa. January 2013.
This doctoral thesis investigates how literacy in a foreign language (FL) may be supported by the use of a digital resource that can assist the processes of reading and text production. The research is based on the studies of Feldman and Sanger (2006) on text mining and on the work of Black (2007, 2009) on incorporating a textual genre characteristic of the Internet (fanfiction) into language learning. Using a text mining tool (Sobek), which extracts the most recurrent terms in a text, the participants of this study created narratives in digital media. The twelve students who participated in the research used Sobek to mediate the production of stories known as fanfictions, in which new plots are created from cultural elements already established in the media. The participants were six undergraduate Languages students and six students of an extension course, both groups at the Federal University of Rio Grande do Sul (UFRGS). In the proposed task, each learner read a fanfiction of his or her choice published on the web and used the mining tool to build graphs with the most recurrent terms of the story. During this process, the student had the opportunity to make associations between expressions from the text, forming in Sobek a network image (graph) representing terms characteristic of this textual genre (such as the use of past verb tenses and of adjectives to characterize characters and setting). This graph was then passed on to a peer, who started his or her own writing process based on this representative image of the text. The data analysis showed that the digital tool supported text production in the FL, and the associated literacy practice, since the authors relied on the mining resource to create their fanfiction narratives.
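Sobek is a specific tool, but the underlying operation the study relies on (extracting the most recurrent terms of a text and linking terms that co-occur into a graph) can be sketched generically as follows. The tokenization, frequency cut-off, and co-occurrence window below are illustrative assumptions, not Sobek's actual parameters.

# Sketch of the kind of term graph the mining tool produces: frequent terms as
# nodes, with weighted edges between terms that co-occur within a small window.
# Assumptions: whitespace tokenization, a frequency cut-off of 3, and a
# 5-word co-occurrence window; these are illustrative, not Sobek's settings.
from collections import Counter
from itertools import combinations
import networkx as nx

def term_graph(text, min_freq=3, window=5):
    tokens = [t.strip(".,!?;:\"'").lower() for t in text.split()]
    freq = Counter(tokens)
    frequent = {t for t, c in freq.items() if c >= min_freq and len(t) > 3}
    graph = nx.Graph()
    graph.add_nodes_from(frequent)
    for i in range(len(tokens) - window + 1):
        window_terms = set(tokens[i:i + window]) & frequent
        for a, b in combinations(sorted(window_terms), 2):
            w = graph.get_edge_data(a, b, {"weight": 0})["weight"]
            graph.add_edge(a, b, weight=w + 1)   # strengthen the link on each co-occurrence
    return graph

# g = term_graph(open("fanfiction.txt", encoding="utf-8").read())  # hypothetical input file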
|
174 |
Systematic Analysis of the Factors Contributing to the Variation and Change of the Microbiome. January 2018.
Understanding changes and trends in biomedical knowledge is crucial for individuals, groups, and institutions, as biomedicine improves people's lives, supports national economies, and facilitates innovation. However, as knowledge changes, what evidence illustrates that change? In the case of the microbiome, a multi-dimensional concept from biomedicine, there are significant increases in publications, citations, funding, collaborations, and other explanatory variables or contextual factors. What is observed in the microbiome, as in any historical evolution of a scientific field, is that these contextual changes are related to changes in knowledge; what is not understood is how to measure and track those changes in knowledge. This investigation highlights how contextual factors from the language and social context of the microbiome are related to changes in the usage, meaning, and scientific knowledge of the microbiome. Two interconnected studies that integrate qualitative and quantitative evidence to examine this variation and change are presented. First, the concepts microbiome, metagenome, and metabolome are compared to determine the boundaries of the microbiome concept in relation to concepts whose boundaries have been cited as overlapping. A corpus of publications for each concept is presented, with a focus on how to create, collect, curate, and analyze large data collections. This study concludes with suggestions on how to analyze biomedical concepts using a hybrid approach that combines results from the larger language context with individual words. Second, the results of a systematic review describing the variation and change of microbiome research, funding, and knowledge are examined. A corpus of approximately 28,000 articles on the microbiome is characterized, and a spectrum of microbiome interpretations is suggested based on differences related to context. The collective results suggest that the microbiome is a concept distinct from the metagenome and metabolome, and that the variation and change of the microbiome concept were influenced by contextual factors. These results provide insight into how concepts with extensive resources behave within biomedicine and suggest that the microbiome may be representative of conceptual change, or a preview of new dynamics within science that are expected in the future. Doctoral Dissertation, Biology, 2018.
|
175 |
Classifying textual fast food restaurant reviews quantitatively using text mining and supervised machine learning algorithms. Wright, Lindsey. 01 May 2018.
Companies continually seek to improve their business model through feedback and customer satisfaction surveys. Social media provides additional opportunities for this exploration into the mind of the customer. By extracting customer feedback from social media platforms, companies may increase the sample size of their feedback and remove bias often found in questionnaires, resulting in better informed decision making. However, relying on personnel to analyze the thousands of relevant social media posts is financially expensive and time consuming. Thus, our study aims to establish a method for extracting business intelligence from social media content by structuring opinionated textual data using text mining and classifying these reviews by the degree of customer satisfaction. By quantifying textual reviews, companies may perform statistical analysis to extract insight from the data as well as effectively address concerns. Specifically, we analyze a subset of 56,000 Yelp reviews of fast food restaurants and attempt to predict a quantitative value reflecting the overall opinion of each review. We compare two predictive modeling techniques, bagged Decision Trees and Random Forest Classifiers. To simplify the problem, we train our model to accurately classify strongly negative and strongly positive reviews (1 and 5 stars). In addition, we identify the drivers behind strongly positive or negative reviews, allowing businesses to understand their strengths and weaknesses. This method provides companies an efficient and cost-effective way to process and understand customer satisfaction as it is discussed on social media.
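The abstract does not include the study's code; as a minimal sketch of the classification setup it describes (textual reviews vectorized and fed to a tree-ensemble classifier restricted to 1-star and 5-star reviews), the example below assumes scikit-learn, a TF-IDF representation, and a hypothetical reviews.csv with text and stars columns. The actual features and model settings used in the thesis may differ.

# Minimal sketch of the review-polarity classification described above.
# Assumptions: scikit-learn and pandas are available, and the data lives in a
# hypothetical reviews.csv with "text" and "stars" columns.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

reviews = pd.read_csv("reviews.csv")                  # hypothetical file
subset = reviews[reviews["stars"].isin([1, 5])]       # keep only strongly negative/positive reviews
X_train, X_test, y_train, y_test = train_test_split(
    subset["text"], subset["stars"], test_size=0.2, random_state=42)

model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", max_features=20000)),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
])
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))

Inspecting the highest-importance features of the fitted forest is one simple way to surface the drivers behind strongly positive or negative reviews mentioned above.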
|
176 |
Modeling words for online sexual behavior surveillance and clinical text information extraction. Fries, Jason Alan. 01 July 2015.
How do we model the meaning of words? In domains like information retrieval, words have classically been modeled as discrete entities using 1-of-n encoding, a representation that elides most of a word's syntactic and semantic structure. Recent research, however, has begun exploring more robust representations called word embeddings. Embeddings model words as a parameterized function mapping into an n-dimensional continuous space and implicitly encode a number of interesting semantic and syntactic properties. This dissertation examines two application areas where existing, state-of-the-art terminology modeling improves the task of information extraction (IE) -- the process of transforming unstructured data into structured form. We show that a large amount of word meaning can be learned directly from very large document collections.
First, we explore the feasibility of mining sexual health behavior data directly from the unstructured text of online "hookup" requests. The Internet has fundamentally changed how individuals locate sexual partners. The rise of dating websites, location-aware smartphone apps like Grindr and Tinder that facilitate casual sexual encounters ("hookups"), as well as changing trends in sexual health practices all speak to the shifting cultural dynamics surrounding sex in the digital age. These shifts also coincide with an increase in the incidence rate of sexually transmitted infections (STIs) in subpopulations such as young adults, racial and ethnic minorities, and men who have sex with men (MSM). The reasons for these increases and their possible connections to Internet cultural dynamics are not completely understood. What is apparent, however, is that sexual encounters negotiated online complicate many traditional public health intervention strategies such as contact tracing and partner notification. These circumstances underline the need to examine online sexual communities using computational tools and techniques -- as is done with other social networks -- to provide new insight and direction for public health surveillance and intervention programs.
One of the central challenges in this task is constructing lexical resources that reflect how people actually discuss and negotiate sex online. Using a 2.5-year collection of over 130 million Craigslist ads (a large venue for MSM casual sexual encounters), we discuss computational methods for automatically learning terminology characterizing risk behaviors in the MSM community. These approaches range from keyword-based dictionaries and topic modeling to semi-supervised methods using word embeddings for query expansion and sequence labeling. These methods allow us to gather information similar (in part) to that collected by public health risk assessment surveys, but automatically aggregated directly from communities of interest, in near real time, and at high geographic resolution. We then address the methodological limitations of this work, as well as the fundamental validation challenges posed by the lack of large-scale sexual behavior survey data and the limited availability of STI surveillance data.
Finally, leveraging work on terminology modeling in Craigslist, we present new research exploring representation learning using 7 years of University of Iowa Hospitals and Clinics (UIHC) clinical notes. Using medication names as an example, we show that modeling a low-dimensional representation of a medication's neighboring words, i.e., a word embedding, encodes a large amount of non-obvious semantic information. Embeddings, for example, implicitly capture a large degree of the hierarchical structure of drug families as well as encode relational attributes of words, such as generic and brand names of medications. These representations -- learned in a completely unsupervised fashion -- can then be used as features in other machine learning tasks. We show that incorporating clinical word embeddings in a benchmark classification task of medication labeling leads to a 5.4% increase in F1-score over a baseline of random initialization and a 1.9% increase over just using non-UIHC training data. This research suggests clinical word embeddings could be shared for use at other institutions and in other IE tasks.
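The embeddings in this dissertation were trained on Craigslist ads and UIHC clinical notes, which cannot be reproduced here. As a generic sketch of the representation-learning step it describes, the snippet below trains a word2vec model with gensim on an assumed iterable of tokenized documents and queries the nearest neighbors of a term, the operation underlying the query expansion and drug-name relations discussed above; the corpus, parameters, and example term are placeholders.

# Generic sketch of learning word embeddings from a corpus and querying
# neighbors for term/query expansion. Assumes gensim is installed and that
# `sentences` is an iterable of tokenized documents (clinical notes, ads, etc.);
# corpus, parameters, and the example query term are all placeholders.
from gensim.models import Word2Vec

sentences = [
    ["patient", "started", "on", "metformin", "500", "mg", "daily"],
    ["continue", "lisinopril", "and", "metformin", "for", "now"],
    # ... millions of tokenized documents in the real setting
]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1, epochs=10)

# Terms close to a seed word can be used to expand queries or build lexicons.
print(model.wv.most_similar("metformin", topn=10))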
|
177 |
Computational methods for mining health communications in web 2.0. Bhattacharya, Sanmitra. 01 May 2014.
Data from social media platforms are being actively mined for trends and patterns of interest. Problems such as sentiment analysis and prediction of election outcomes have become tremendously popular due to the unprecedented availability of social interactivity data of different types. In this thesis we address two problems that have been relatively unexplored. The first problem relates to mining beliefs, in particular health beliefs, and their surveillance using social media. The second problem relates to the investigation of factors associated with the engagement of U.S. Federal Health Agencies via Twitter and Facebook.
In addressing the first problem we propose a novel computational framework for belief surveillance. This framework can be used for 1) surveillance of any given belief in the form of a probe, and 2) automatically harvesting health-related probes. We present our estimates of support, opposition, and doubt for these probes, some of which represent true information (in the sense that they are supported by scientific evidence), while others represent false information and the rest represent debatable propositions. We show, for example, that the levels of support for false and debatable probes are surprisingly high. We also study the scientific novelty of these probes and find that some of the harvested probes with sparse scientific evidence may indicate novel hypotheses. We also show the suitability of off-the-shelf classifiers for belief surveillance. We find these classifiers are quite generalizable and can be used to classify newly harvested probes. Finally, we show the ability to harvest and track probes over time. Although our work is focused on health care, the approach is broadly applicable to other domains as well.
For the second problem, our specific goals are to study factors associated with the amount and duration of engagement of organizations. We use negative binomial hurdle regression models and Cox proportional hazards survival models for these analyses. For Twitter, the hurdle analysis shows that the presence of user mentions is positively associated with the amount of engagement, while negative sentiment has an inverse association. The content of tweets is equally important for engagement. The survival analyses indicate that engagement duration is positively associated with follower count. For Facebook, both the hurdle and survival analyses show that the number of page likes and positive sentiment are correlated with higher and more prolonged engagement, while a few content types are negatively correlated with engagement. We also find patterns of engagement that are consistent across Twitter and Facebook.
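The engagement models named above (negative binomial hurdle regression and Cox proportional hazards survival analysis) can be outlined roughly as follows. The hurdle model is approximated as a logistic model for zero versus non-zero engagement plus a negative binomial GLM on the positive counts, and all column names are hypothetical, so this is a sketch of the modeling approach rather than the study's actual code or covariates.

# Sketch of the two analyses described above, with hypothetical columns:
#   retweets (count outcome), followers, user_mention, neg_sentiment (predictors),
#   duration and observed for the survival analysis.
# The hurdle model is approximated as logistic (any engagement?) plus a negative
# binomial GLM on positive counts; a true hurdle model is jointly specified.
import pandas as pd
import statsmodels.api as sm
from lifelines import CoxPHFitter

df = pd.read_csv("tweets.csv")                       # hypothetical data file
X = sm.add_constant(df[["followers", "user_mention", "neg_sentiment"]])

# Stage 1: does the tweet receive any engagement at all?
zero_part = sm.Logit((df["retweets"] > 0).astype(int), X).fit()

# Stage 2: how much engagement, given there is some? (negative binomial)
pos = df["retweets"] > 0
count_part = sm.GLM(df.loc[pos, "retweets"], X[pos],
                    family=sm.families.NegativeBinomial()).fit()

# Duration of engagement: Cox proportional hazards
cph = CoxPHFitter()
cph.fit(df[["duration", "observed", "followers", "neg_sentiment"]],
        duration_col="duration", event_col="observed")
print(cph.summary)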
|
178 |
Mining for evidence in enterprise corpora. Almquist, Brian Alan. 01 May 2011.
The primary research aim of this dissertation is to identify the strategies that best meet the information retrieval needs expressed in the "e-discovery" scenario. This task calls for a high-recall system that, in response to a request for all available documents relevant to a legal complaint, effectively prioritizes documents from an enterprise document collection in order of likelihood of relevance. High-recall information retrieval strategies, such as those employed for e-discovery and patent or medical literature searches, incur high costs when relevant documents are missed, but they also carry high document review costs.
Our approaches parallel the evaluation opportunities afforded by the TREC Legal Track. Within the ad hoc framework, we propose an approach that includes query field selection, techniques for mitigating OCR error, term weighting strategies, query language reduction, pseudo-relevance feedback using document metadata and terms extracted from documents, merging result sets, and biasing results to favor documents responsive to lawyer-negotiated queries. We conduct several experiments to identify effective parameters for each of these strategies.
Within the relevance feedback framework, we use an active learning approach informed by signals from collected prior relevance judgments and ranking data. We train a classifier to prioritize the unjudged documents retrieved using different ad hoc information retrieval techniques applied to the same topic. We demonstrate significant improvements over heuristic rank aggregation strategies when choosing from a relatively small pool of documents. With a larger pool of documents, we validate the effectiveness of the merging strategy as a means to increase recall, but find that sparseness of judgment data prevents effective ranking by the classifier-based ranker.
We conclude our research by optimizing the classifier-based ranker and applying it to other high-recall datasets. Our concluding experiments consider the potential benefits of modifying the merged runs using methods drawn from social choice models. We find that this technique, Local Kemenization, is hampered by the large number of documents and the minimal number of result sets contributing to the ranked list. This two-stage approach to high-recall information retrieval tasks continues to offer a rich set of questions for future research.
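One recurring element in this work is merging several ranked result sets for the same topic. The dissertation's own merging and Local Kemenization procedures are not reproduced here; as a hedged stand-in, the sketch below uses reciprocal rank fusion, a common heuristic for aggregating ranked lists, simply to make the shape of the operation concrete.

# Illustration only: reciprocal rank fusion as a stand-in for merging several
# per-topic ranked retrieval runs into one list. This is NOT the dissertation's
# merging or Local Kemenization method; it just shows the general operation.
from collections import defaultdict

def fuse_runs(runs, k=60):
    """runs: list of ranked lists of document ids (best first)."""
    scores = defaultdict(float)
    for run in runs:
        for rank, doc_id in enumerate(run, start=1):
            scores[doc_id] += 1.0 / (k + rank)   # documents ranked high in any run rise
    return sorted(scores, key=scores.get, reverse=True)

run_a = ["doc3", "doc1", "doc7"]   # e.g., a query-field-selection run
run_b = ["doc1", "doc9", "doc3"]   # e.g., a pseudo-relevance-feedback run
merged = fuse_runs([run_a, run_b])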
|
179 |
Improving the performance of Hierarchical Hidden Markov Models on Information Extraction tasks. Chou, Lin-Yi. January 2006.
This thesis presents novel methods for creating and improving hierarchical hidden Markov models. The work centers on transforming a traditional tree-structured hierarchical hidden Markov model (HHMM) into an equivalent model that reuses repeated sub-trees. This process temporarily breaks the tree-structure constraint in order to leverage the benefits of combining repeated sub-trees. These benefits include a lower cost of testing and increased accuracy of the final model, and thus greater overall performance. The result is called a merged and simplified hierarchical hidden Markov model (MSHHMM). The thesis goes on to detail four techniques for improving the performance of MSHHMMs, in terms of accuracy and computational cost, when applied to information extraction tasks. Briefly, these techniques are: a new formula for calculating the approximate probability of previously unseen events; pattern generalisation to transform observations, thus increasing testing speed and prediction accuracy; restructuring states to focus on state transitions; and an automated flattening technique for reducing the complexity of HHMMs. The basic model and the four improvements are evaluated by applying them to the well-known information extraction tasks of Reference Tagging and Text Chunking. In both tasks, MSHHMMs show consistently good performance across varying sizes of training data. In the case of Reference Tagging, the accuracy of the MSHHMM is comparable to that of other methods. However, when the volume of training data is limited, MSHHMMs maintain high accuracy whereas other methods show a significant decrease. These accuracy gains were achieved without any significant increase in processing time. For the Text Chunking task, the accuracy of the MSHHMM was again comparable to that of other methods, but the other methods incurred much higher processing delays. The results of these practical experiments demonstrate the benefits of the new method: increased accuracy, lower computational cost, and better overall performance.
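The core transformation described above, collapsing repeated sub-trees of a tree-structured model so that they are stored and evaluated once, can be illustrated with a generic tree-hashing sketch. This captures only the structural reuse idea, not the MSHHMM construction itself, and the node representation is an assumption.

# Structural sketch of sub-tree merging: identical sub-trees are detected via a
# canonical key and replaced by a single shared instance. This only illustrates
# the reuse idea behind MSHHMMs; it omits all HMM parameters and probabilities.
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def merge_repeated_subtrees(root, seen=None):
    if seen is None:
        seen = {}
    root.children = [merge_repeated_subtrees(c, seen) for c in root.children]
    key = (root.label, tuple(id(c) for c in root.children))  # canonical form after merging children
    return seen.setdefault(key, root)                        # reuse an identical sub-tree if one exists

# Two branches ending in the same "B(C)" sub-structure end up sharing one object:
tree = Node("root", [Node("A", [Node("B", [Node("C")])]),
                     Node("D", [Node("B", [Node("C")])])])
tree = merge_repeated_subtrees(tree)
assert tree.children[0].children[0] is tree.children[1].children[0]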
|
180 |
Improving scalability and accuracy of text mining in grid environment. Zhai, Yuzheng. January 2009.
Advances in technologies such as massive storage devices and high-speed internet have led to an enormous increase in the volume of documents available in electronic form. These documents represent information in a complex and rich manner that cannot be analysed using conventional statistical data mining methods. Consequently, text mining has developed as a growing technology for discovering knowledge from textual data and managing textual information. Processing and analysing textual information can yield valuable and important information, yet these tasks also require an enormous amount of computational resources due to the sheer size of the available data. Therefore, it is important to enhance the existing methodologies to achieve better scalability, efficiency and accuracy. / The emerging Grid technology shows promising results in solving the scalability problem by splitting the work of text clustering algorithms into a number of jobs, each executed separately and simultaneously on different computing resources. This allows for a substantial decrease in processing time while maintaining a similar level of quality. / To improve the quality of the text clustering results, a new document encoding method is introduced that takes into consideration the semantic similarities of words. In this way, documents that are similar in content are more likely to be grouped together. / One of the ultimate goals of text mining is to help us gain insight into a problem and to assist in decision making together with other sources of information. Hence, we tested the effectiveness of incorporating text mining methods in the context of stock market prediction. This is achieved by integrating the outcomes obtained from text mining with those from data mining, which results in a more accurate forecast than using either method alone.
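As a rough local analogue of the grid workflow described above (splitting the clustering work into independent jobs and combining their outputs), the sketch below partitions a document collection, clusters each partition in a separate process, and then clusters the returned centroids. A process pool stands in for grid nodes, and the placeholder corpus and all parameters are illustrative assumptions.

# Local stand-in for the grid workflow: cluster document partitions in parallel
# jobs, then cluster the partial centroids to obtain a global organization.
# A multiprocessing pool replaces real grid nodes; parameters are illustrative.
from multiprocessing import Pool
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np

def cluster_partition(X_part, n_clusters=5):
    # one "grid job": cluster its share of the collection, return only the centroids
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_part)
    return km.cluster_centers_

if __name__ == "__main__":
    documents = [f"document number {i} about topic {i % 7}" for i in range(400)]  # placeholder corpus
    X = TfidfVectorizer().fit_transform(documents)        # shared encoding for all jobs
    parts = [X[np.arange(i, X.shape[0], 4)] for i in range(4)]   # one partition per job
    with Pool(processes=4) as pool:
        centroid_sets = pool.map(cluster_partition, parts)
    # Second stage: organize the partial centroids into the global clustering.
    merged = KMeans(n_clusters=7, n_init=10, random_state=0).fit(np.vstack(centroid_sets))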
|