Spelling suggestions: "subject:"[een] LABEL PROPAGATION"" "subject:"[enn] LABEL PROPAGATION""
1 |
Using Social Media Websites to Support Scenario-Based Design of Assistive TechnologyYu, Xing 01 1900 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / Having representative users, who have the targeted disability, in accessibility
studies is vital to the validity of research findings. Although it is a widely accepted tenet
in the HCI community, many barriers and difficulties make it very resource-demanding
for accessibility researchers to recruit representative users. As a result, researchers recruit
non-representative users, who do not have the targeted disability, instead of
representative users in accessibility studies. Although such an approach has been widely
justified, evidence showed that findings derived from non-representative users could be
biased and even misleading. To address this problem, researchers have come up with
different solutions such as building pools of users to recruit from. But still, the data is not
widely available and needs a lot of effort and resource to build and maintain.
On the other hand, online social media websites have become popular in the last
decade. Many online communities have emerged that allow online users to discuss
health-related subjects, exchange useful information, or provide emotional support. A
large amount of data accumulated in such online communities have gained attention from
researchers in the healthcare domain. And many researches have been done based on data
from social media websites to better understand health problems to improve the wellbeing
of people.
Despite the increasing popularity, the value of data from social media websites for
accessibility research remains untapped. Hence, my work aims to create methods that
could extract valuable information from data collected on social media websites for accessibility practitioners to support their design process. First, I investigate methods that
enable researchers to effectively collect representative data from social media websites.
More specifically, I look into machine learning approaches that could allow researchers
to automatically identify online users who have disabilities (representative users).
Second, I investigate methods that could extract useful information from user-generated
free-text using techniques drawn from the information extraction domain. Last, I explore
how such information should be visualized and presented for designers to support the
scenario-based design process in accessibility studies.
|
2 |
Android malware detection using network-based approachesAlfs, Emily January 1900 (has links)
Master of Science / Department of Mathematics / Nathan Albin / This thesis is focused on the use of networks to identify potentially malicious Android applications. There are many techniques that determine if an application is malicious, and they are ever-changing. Techniques to identify malicious applications must be robust as the schemes of creating malicious applications are changing as well. We propose the use of a network-based approach that is potentially effective at separating malicious from benign apps, given a small and noisy training set.
The applications in our data set come from the Google Play Store and have been scanned for malicious behavior using Virus Total to produce a ground truth dataset. The apps in the resulting dataset have been represented as binary feature vectors (where the features represent permissions, intent actions, discriminative APIs, obfuscation signatures, and native code signatures). We use the feature vectors corresponding to apps to build a weighted network that captures the \closeness" between applications. We propagate labels, benign or malicious, from the labeled applications that form the training set to unlabeled applications (which we aim to label), and evaluate the effectiveness of the proposed approach in terms of precision, recall and F1-measure.
We outline the algorithms for propagating labels that were used in our research and discuss the fine tuning of hyper-parameters. We compare our results to known supervised learning algorithms, such as k-nearest-neighbors and Naive Bayes, that can be used to learn classifiers from the training labeled data and subsequently use the classifiers to label the unlabeled test data. We discuss potential improvements on our methods and ways to further this research.
|
3 |
Benchmarking Methods For Predicting Phenotype Gene AssociationsTyagi, Tanya 16 September 2020 (has links)
Assigning human genes to diseases and related phenotypes is an important topic in modern genomics. Human Phenotype Ontology (HPO) is a standardized vocabulary of phenotypic abnormalities that occur in human diseases. Computational methods such as label-propagation and supervised-learning address challenges posed by traditional approaches such as manual curation to link genes to phenotypes in the HPO. It is only in recent years that computational methods have been applied in a network-based approach for predicting genes to disease-related phenotypes. In this thesis, we present an extensive benchmarking of various computational methods for the task of network-based gene classification. These methods are evaluated on multiple protein interaction networks and feature representations. We empirically evaluate the performance of multiple prediction tasks using two evaluation experiments: cross-fold validation and the more stringent temporal holdout. We demonstrate that all of the prediction methods considered in our benchmarking analysis have similar performance, with each of the methods outperforming a random predictor. / Master of Science / For many years biologists have been working towards studying diseases, characterizing dis- ease history and identifying what factors and genetic variants lead to diseases. Such studies are critical to working towards the advanced prognosis of diseases and being able to iden- tify targeted treatment plans to cure diseases. An important characteristic of diseases is that they can be expressed by a set of phenotypes. Phenotypes are defined as observable characteristics or traits of an organism, such as height and the color of the eyes and hair. In the context of diseases, the phenotypes that describe diseases are referred to as clinical phenotypes, with some examples being short stature, abnormal hair pattern, etc. Biologists have identified the importance of deep phenotyping, which is defined as a concise analysis that gathers information about diseases and their observed traits in humans, in finding genetic variants underlying human diseases. We make use of the Human Phenotype Ontology (HPO), a standardized vocabulary of phenotypic abnormalities that occur in human diseases. The HPO provides relationships between phenotypes as well as associations between phenotypes and genes. In our study, we perform a systematic benchmarking to evaluate different types of computational approaches for the task of phenotype-gene prediction, across multiple molecular networks using various feature representations and for multiple evaluation strategies.
|
4 |
Pós-processamento de regras de associação via redes e propagação de rótulos / Post-processing association rules using networks and label propagationPadua, Renan de 27 February 2015 (has links)
Dentre as técnicas de mineração existentes encontra-se a associação, responsável por identificar relações que ocorrem no conjunto de dados. Embora a associação seja uma das técnicas mais utilizadas, a quantidade de padrões extraídos pode vir a sobrecarregar o usuário de tal maneira que encontrar algo interessante dentre a imensidão de padrões obtidos passa a ser um novo desafio. Para solucionar esse problema, uma grande parte dos trabalhos relacionados à associação está voltada a etapa de pós-processamento. Esses trabalhos geralmente propõem abordagens de pós-processamento que visam, segundo determinada estratégia, facilitar a busca pelos padrões interessantes ao domínio. Nos últimos anos, essas abordagens têm incluído no processo o conhecimento e/ou interesse do usuário sobre o domínio. Contudo, nas abordagens atualmente existentes, o usuário deve, por meio de algum formalismo descrever explicitamente seu conhecimento e/ou interesse, requerendo do usuário um tempo considerável, podendo levar, inclusive, a especificações incompletas e/ou incorretas. Além disso, na maioria das vezes, o usuário não tem ideia do que é provavelmente interessante, nem a partir de quais relações iniciar a busca. Nota-se, portanto, que um dos desafios dessas abordagens é considerar o conhecimento e/ou interesse do usuário. Além disso, é necessário considerar também o número de regras que o usuário analisará. A análise de regras feita por um especialista é custosa e, na maioria dos casos, o usuário quer explorar as regras geradas sem limitar a exploração ao conhecimento que ele já possui. Portanto, é importante que o usuário avalie o menor número de regras possível e, com base nessa avaliação, abordagens de pós-processamento consigam o auxiliar na busca pelas regras que ele poderá considerar interessante. Para tanto, é proposto neste trabalho que o pós-processamento seja tratado como um problema de classificação semissupervisionada transdutiva, uma vez que permite que o usuário rotule, considerando classes pré-definidas (por exemplo, \"Interessante\" ou \"Não Interessante\"), apenas algumas regras do conjunto a ser explorado para que todas as outras regras sejam automaticamente rotuladas. Além disso, por meio da definição dos rótulos de algumas regras, é possível capturar implicitamente o conhecimento e/ou interesse do usuário sobre o domínio. Para tanto, é necessário que as regras sejam modeladas de maneira a permitir: (a) selecionar as regras a serem rotuladas pelo usuário a fim de capturar implicitamente seu conhecimento e/ou interesse; (b) propagar os rótulos das regras já classificadas pelo usuário a todas as outras regras não rotuladas. Desse modo, neste trabalho, as regras foram modeladas via redes, uma vez que: (i) uma vasta quantidade de medidas de exploração de redes pode ser utilizada, em conjunto com as informações fornecidas pelo usuário, a fim de viabilizar o item (a); (ii) algoritmos de propagação de rótulos podem ser utilizados a fim de viabilizar o item (b). Diante do apresentado, ressalta-se que as contribuições deste trabalho estão na capacidade de se extrair o conhecimento e/ou interesse do usuário de acordo com as características da base de dados e direcionar sua exploração sem a necessidade de se definir previamente o que será explorado. Além disso, os resultados obtidos demonstram a capacidade da PARLP em direcionar o usuário para o conhecimento considerado interessante, reduzindo, para tanto, a quantidade de regras a serem exploradas. Por fim, este trabalho contribui também para demonstrar que é possível tratar o pós-processamento de regras de associação como um problema de propagação de rótulos. / One of the existing data mining techniques is association rules, responsible for identifying relationships that occur in the data set. Although the association rule is one of the most widely used techniques, the amount of extracted patterns can overload the user in such a way that finding interesting patterns among the large amount of obtained patterns becomes a challenge. To solve this problem, a large part of the association-related work is focused on the post-processing step. These works generally propose a post-processing approaches that, according to a certain strategy, aims facilitating the search for interesting patterns. Nowadays, approaches have included the user knowledge in the domain and / or interests on the process. However, in the current existing approaches, the user knowledge and/or interest must be explicitly described by some formalism, requiring a considerable time and may even lead to incomplete and / or incorrect specifications. In addition, the user has no idea what probably is interesting or which patterns to begin the searching. Notice that one of the challenges of these approaches is to consider the knowledge and / or user interest. In addition, consider the number of rules the user will examine is necessary. The analysis of the rules by an expert is expensive and, in most cases, the user wants to explore the rules generated without limiting exploration to the knowledge he already has. Therefore, the user evaluate the fewest amount of rules possible is important and, based on this assessment, the post-processing approaches be able to assist in the search for the rules that he may consider interesting. So, in this work is proposed that the post-processing is treated as a transductive semi supervised classification problem, since it allows the user to label some rules based on two predefined classes (e.g. \"interesting\"or \"not interesting\"), in a way that just a small amount of the rule set needs to be explored and all other association rules are automatically labeled. Furthermore, you can implicitly capture the knowledge and / or user interest in the domain by labeling some rules. Thus, the rules need to be modeled to allow: (a) select the rules to be labeled by the user to implicitly capture their knowledge and / or interest; (b) propagate the rules\' labels classified by the user to all not labeled rules. To do so, the rules were modeled via networks in this work, due to: (i) a large amount of network measures can be used in conjunction with the information provided by the user, to make item (a) possible; (ii) label propagation algorithms can be used in order to make item (b) possible. Therefore, we highlight that the contributions of this work are the ability to extract knowledge and / or user interest according to database characteristics and direct the user exploration without previously defining what will be explored. In addition, the results demonstrate that the proposed approach is able to direct the user to the knowledge considered interesting, reducing the amount of rules to be explored. Finally, this work also contributes to demonstrate that treat the post-processing of association rules as a problem of propagation of labels is possible.
|
5 |
Classificação automática de textos por meio de aprendizado de máquina baseado em redes / Text automatic classification through machine learning based on networksRossi, Rafael Geraldeli 26 October 2015 (has links)
Nos dias atuais há uma quantidade massiva de dados textuais sendo produzida e armazenada diariamente na forma de e-mails, relatórios, artigos e postagens em redes sociais ou blogs. Processar, organizar ou gerenciar essa grande quantidade de dados textuais manualmente exige um grande esforço humano, sendo muitas vezes impossível de ser realizado. Além disso, há conhecimento embutido nos dados textuais, e analisar e extrair conhecimento de forma manual também torna-se inviável devido à grande quantidade de textos. Com isso, técnicas computacionais que requerem pouca intervenção humana e que permitem a organização, gerenciamento e extração de conhecimento de grandes quantidades de textos têm ganhado destaque nos últimos anos e vêm sendo aplicadas tanto na academia quanto em empresas e organizações. Dentre as técnicas, destaca-se a classificação automática de textos, cujo objetivo é atribuir rótulos (identificadores de categorias pré-definidos) à documentos textuais ou porções de texto. Uma forma viável de realizar a classificação automática de textos é por meio de algoritmos de aprendizado de máquina, que são capazes de aprender, generalizar, ou ainda extrair padrões das classes das coleções com base no conteúdo e rótulos de documentos textuais. O aprendizado de máquina para a tarefa de classificação automática pode ser de 3 tipos: (i) indutivo supervisionado, que considera apenas documentos rotulados para induzir um modelo de classificação e classificar novos documentos; (ii) transdutivo semissupervisionado, que classifica documentos não rotulados de uma coleção com base em documentos rotulados; e (iii) indutivo semissupervisionado, que considera documentos rotulados e não rotulados para induzir um modelo de classificação e utiliza esse modelo para classificar novos documentos. Independente do tipo, é necessário que as coleções de documentos textuais estejam representadas em um formato estruturado para os algoritmos de aprendizado de máquina. Normalmente os documentos são representados em um modelo espaço-vetorial, no qual cada documento é representado por um vetor, e cada posição desse vetor corresponde a um termo ou atributo da coleção de documentos. Algoritmos baseados no modelo espaço-vetorial consideram que tanto os documentos quanto os termos ou atributos são independentes, o que pode degradar a qualidade da classificação. Uma alternativa à representação no modelo espaço-vetorial é a representação em redes, que permite modelar relações entre entidades de uma coleção de textos, como documento e termos. Esse tipo de representação permite extrair padrões das classes que dificilmente são extraídos por algoritmos baseados no modelo espaço-vetorial, permitindo assim aumentar a performance de classificação. Além disso, a representação em redes permite representar coleções de textos utilizando diferentes tipos de objetos bem como diferentes tipos de relações, o que permite capturar diferentes características das coleções. Entretanto, observa-se na literatura alguns desafios para que se possam combinar algoritmos de aprendizado de máquina e representações de coleções de textos em redes para realizar efetivamente a classificação automática de textos. Os principais desafios abordados neste projeto de doutorado são (i) o desenvolvimento de representações em redes que possam ser geradas eficientemente e que também permitam realizar um aprendizado de maneira eficiente; (ii) redes que considerem diferentes tipos de objetos e relações; (iii) representações em redes de coleções de textos de diferentes línguas e domínios; e (iv) algoritmos de aprendizado de máquina eficientes e que façam um melhor uso das representações em redes para aumentar a qualidade da classificação automática. Neste projeto de doutorado foram propostos e desenvolvidos métodos para gerar redes que representem coleções de textos, independente de domínio e idioma, considerando diferentes tipos de objetos e relações entre esses objetos. Também foram propostos e desenvolvidos algoritmos de aprendizado de máquina indutivo supervisionado, indutivo semissupervisionado e transdutivo semissupervisionado, uma vez que não foram encontrados na literatura algoritmos para lidar com determinados tipos de relações, além de sanar a deficiência dos algoritmos existentes em relação à performance e/ou tempo de classificação. É apresentado nesta tese (i) uma extensa avaliação empírica demonstrando o benefício do uso das representações em redes para a classificação de textos em relação ao modelo espaço-vetorial, (ii) o impacto da combinação de diferentes tipos de relações em uma única rede e (iii) que os algoritmos propostos baseados em redes são capazes de superar a performance de classificação de algoritmos tradicionais e estado da arte tanto considerando algoritmos de aprendizado supervisionado quanto semissupervisionado. As soluções propostas nesta tese demonstraram ser úteis e aconselháveis para serem utilizadas em diversas aplicações que envolvam classificação de textos de diferentes domínios, diferentes características ou para diferentes quantidades de documentos rotulados. / A massive amount of textual data, such as e-mails, reports, articles and posts in social networks or blogs, has been generated and stored on a daily basis. The manual processing, organization and management of this huge amount of texts require a considerable human effort and sometimes these tasks are impossible to carry out in practice. Besides, the manual extraction of knowledge embedded in textual data is also unfeasible due to the large amount of texts. Thus, computational techniques which require little human intervention and allow the organization, management and knowledge extraction from large amounts of texts have gained attention in the last years and have been applied in academia, companies and organizations. The tasks mentioned above can be carried out through text automatic classification, in which labels (identifiers of predefined categories) are assigned to texts or portions of texts. A viable way to perform text automatic classification is through machine learning algorithms, which are able to learn, generalize or extract patterns from classes of text collections based on the content and labels of the texts. There are three types of machine learning algorithms for automatic classification: (i) inductive supervised, in which only labeled documents are considered to induce a classification model and this model are used to classify new documents; (ii) transductive semi-supervised, in which all known unlabeled documents are classified based on some labeled documents; and (iii) inductive semi-supervised, in which labeled and unlabeled documents are considered to induce a classification model in order to classify new documents. Regardless of the learning algorithm type, the texts of a collection must be represented in a structured format to be interpreted by the algorithms. Usually, the texts are represented in a vector space model, in which each text is represented by a vector and each dimension of the vector corresponds to a term or feature of the text collection. Algorithms based on vector space model consider that texts, terms or features are independent and this assumption can degrade the classification performance. Networks can be used as an alternative to vector space model representations. Networks allow the representations of relations among the entities of a text collection, such as documents and terms. This type of representation allows the extraction patterns which are not extracted by algorithms based on vector-space model. Moreover, text collections can be represented by networks composed of different types of entities and relations, which provide the extraction of different patterns from the texts. However, there are some challenges to be solved in order to allow the combination of machine learning algorithms and network-based representations to perform text automatic classification in an efficient way. The main challenges addressed in this doctoral project are (i) the development of network-based representations efficiently generated which also allows an efficient learning; (ii) the development of networks which represent different types of entities and relations; (iii) the development of networks which can represent texts written in different languages and about different domains; and (iv) the development of efficient learning algorithms which make a better use of the network-based representations and increase the classification performance. In this doctoral project we proposed and developed methods to represent text collections into networks considering different types of entities and relations and also allowing the representation of texts written in any language or from any domain. We also proposed and developed supervised inductive, semi-supervised transductive and semi-supervised inductive learning algorithms to interpret and learn from the proposed network-based representations since there were no algorithms to handle certain types of relations considered in this thesis. Besides, the proposed algorithms also attempt to obtain a higher classification performance and a faster classification than the existing network-based algorithms. In this doctoral thesis we present (i) an extensive empirical evaluation demonstrating the benefits about the use of network-based representations for text classification, (ii) the impact of the combination of different types of relations in a single network and (iii) that the proposed network-based algorithms are able to surpass the classification performance of traditional and state-of-the-art algorithms considering both supervised and semi-supervised learning. The solutions proposed in this doctoral project have proved to be advisable to be used in many applications involving classification of texts from different domains, areas, characteristics or considering different numbers of labeled documents.
|
6 |
Using social network information in recommender systemsSudan, Nikita Maple 30 September 2011 (has links)
Recommender Systems are used to select online information relevant
to a given user. Traditional (memory based) recommenders explore the user-item rating matrix and make recommendations based on users who have rated similarly or items that have been rated similarly. With the growing popularity of social networks, recommender systems can benefit from combining history of user preferences with information from the social/trust network of users. This thesis explores two techniques of combining user-item rating history with trust network information to make better user-item rating predictions. The first
approach (SCOAL [5]) simultaneously co-clusters and learns separate models
for each co-cluster. The co-clustering is based on the user features as well as
the rating history. This captures the intuition that certain groups of users have similar preferences for certain groups of items. The grouping of certain users is affected by the similarity in the rating behavior and the trust network.
The second graph-based label propagation approach (MAD [27]) works in a transductive setting and propagates ratings of user-item pairs directly on the
user social graph. We evaluate both approaches on two large public data-sets from Epinions.com and Flixster.com.
The thesis is amongst the first to explore the role of distrust in rating prediction. Since distrust is not as transitive as trust i.e. an enemy's enemy need not be an enemy or a friend, distrust can't directly replace trust in trust
propagation approaches. By using a low dimensional representation of the original trust network in SCOAL, we use distrust as it is and don't propagate it. Using SCOAL, we can pin-point the groups of users and the groups of
items that have the same preference model. Both SCOAL and MAD are able to seamlessly integrate side information such as item-subject and item-author
information into the trust based rating prediction model. / text
|
7 |
Weakly supervised part-of-speech tagging for Chinese using label propagationDing, Weiwei, 1985- 02 February 2012 (has links)
Part-of-speech (POS) tagging is one of the most fundamental and crucial tasks in Natural Language Processing. Chinese POS tagging is challenging because it also involves word segmentation. In this report, research will be focused on how to improve unsupervised Part-of-Speech (POS) tagging using Hidden Markov Models and the Expectation Maximization parameter estimation approach (EM-HMM). The traditional EM-HMM system uses a dictionary, which is used to constrain possible tag sequences and initialize the model parameters. This is a very crude initialization: the emission
parameters are set uniformly in accordance with the tag dictionary. To improve this, word alignments can be used. Word alignments are the word-level translation correspondent pairs generated from parallel text between two languages. In this report, Chinese-English word alignment is used. The performance is expected to be better, as these two tasks are complementary to each other. The dictionary provides information on word types, while word alignment provides information on word tokens. However, it is found to be of limited benefit.
In this report, another method is proposed. To improve the dictionary coverage and get better POS distribution, Modified Adsorption, a label propagation algorithm is used. We construct a graph connecting word tokens to feature types (such as word unigrams and bigrams) and connecting those tokens to information from knowledge sources, such as a small tag dictionary, Wiktionary, and word alignments. The core idea is to use a small amount of supervision, in the form of a tag dictionary and acquire POS distributions for each word (both known and unknown) and provide this as an improved initialization for EM learning for HMM. We find this strategy to work very well, especially when we have a small tag dictionary. Label propagation provides a better initialization for the EM-HMM method, because it greatly increases the coverage of the dictionary. In addition, label propagation is quite flexible to incorporate many kinds of knowledge. However, results also show that some resources, such as the word alignments, are not easily exploited with label propagation. / text
|
8 |
Classifying Everyday Activity Through Label Propagation With Sparse Training DataJanuary 2013 (has links)
abstract: We solve the problem of activity verification in the context of sustainability. Activity verification is the process of proving the user assertions pertaining to a certain activity performed by the user. Our motivation lies in incentivizing the user for engaging in sustainable activities like taking public transport or recycling. Such incentivization schemes require the system to verify the claim made by the user. The system verifies these claims by analyzing the supporting evidence captured by the user while performing the activity. The proliferation of portable smart-phones in the past few years has provided us with a ubiquitous and relatively cheap platform, having multiple sensors like accelerometer, gyroscope, microphone etc. to capture this evidence data in-situ. In this research, we investigate the supervised and semi-supervised learning techniques for activity verification. Both these techniques make use the data set constructed using the evidence submitted by the user. Supervised learning makes use of annotated evidence data to build a function to predict the class labels of the unlabeled data points. The evidence data captured can be either unimodal or multimodal in nature. We use the accelerometer data as evidence for transportation mode verification and image data as evidence for recycling verification. After training the system, we achieve maximum accuracy of 94% when classifying the transport mode and 81% when detecting recycle activity. In the case of recycle verification, we could improve the classification accuracy by asking the user for more evidence. We present some techniques to ask the user for the next best piece of evidence that maximizes the probability of classification. Using these techniques for detecting recycle activity, the accuracy increases to 93%. The major disadvantage of using supervised models is that it requires extensive annotated training data, which expensive to collect. Due to the limited training data, we look at the graph based inductive semi-supervised learning methods to propagate the labels among the unlabeled samples. In the semi-supervised approach, we represent each instance in the data set as a node in the graph. Since it is a complete graph, edges interconnect these nodes, with each edge having some weight representing the similarity between the points. We propagate the labels in this graph, based on the proximity of the data points to the labeled nodes. We estimate the performance of these algorithms by measuring how close the probability distribution of the data after label propagation is to the probability distribution of the ground truth data. Since labeling has a cost associated with it, in this thesis we propose two algorithms that help us in selecting minimum number of labeled points to propagate the labels accurately. Our proposed algorithm achieves a maximum of 73% increase in performance when compared to the baseline algorithm. / Dissertation/Thesis / M.S. Computer Science 2013
|
9 |
Pós-processamento de regras de associação via redes e propagação de rótulos / Post-processing association rules using networks and label propagationRenan de Padua 27 February 2015 (has links)
Dentre as técnicas de mineração existentes encontra-se a associação, responsável por identificar relações que ocorrem no conjunto de dados. Embora a associação seja uma das técnicas mais utilizadas, a quantidade de padrões extraídos pode vir a sobrecarregar o usuário de tal maneira que encontrar algo interessante dentre a imensidão de padrões obtidos passa a ser um novo desafio. Para solucionar esse problema, uma grande parte dos trabalhos relacionados à associação está voltada a etapa de pós-processamento. Esses trabalhos geralmente propõem abordagens de pós-processamento que visam, segundo determinada estratégia, facilitar a busca pelos padrões interessantes ao domínio. Nos últimos anos, essas abordagens têm incluído no processo o conhecimento e/ou interesse do usuário sobre o domínio. Contudo, nas abordagens atualmente existentes, o usuário deve, por meio de algum formalismo descrever explicitamente seu conhecimento e/ou interesse, requerendo do usuário um tempo considerável, podendo levar, inclusive, a especificações incompletas e/ou incorretas. Além disso, na maioria das vezes, o usuário não tem ideia do que é provavelmente interessante, nem a partir de quais relações iniciar a busca. Nota-se, portanto, que um dos desafios dessas abordagens é considerar o conhecimento e/ou interesse do usuário. Além disso, é necessário considerar também o número de regras que o usuário analisará. A análise de regras feita por um especialista é custosa e, na maioria dos casos, o usuário quer explorar as regras geradas sem limitar a exploração ao conhecimento que ele já possui. Portanto, é importante que o usuário avalie o menor número de regras possível e, com base nessa avaliação, abordagens de pós-processamento consigam o auxiliar na busca pelas regras que ele poderá considerar interessante. Para tanto, é proposto neste trabalho que o pós-processamento seja tratado como um problema de classificação semissupervisionada transdutiva, uma vez que permite que o usuário rotule, considerando classes pré-definidas (por exemplo, \"Interessante\" ou \"Não Interessante\"), apenas algumas regras do conjunto a ser explorado para que todas as outras regras sejam automaticamente rotuladas. Além disso, por meio da definição dos rótulos de algumas regras, é possível capturar implicitamente o conhecimento e/ou interesse do usuário sobre o domínio. Para tanto, é necessário que as regras sejam modeladas de maneira a permitir: (a) selecionar as regras a serem rotuladas pelo usuário a fim de capturar implicitamente seu conhecimento e/ou interesse; (b) propagar os rótulos das regras já classificadas pelo usuário a todas as outras regras não rotuladas. Desse modo, neste trabalho, as regras foram modeladas via redes, uma vez que: (i) uma vasta quantidade de medidas de exploração de redes pode ser utilizada, em conjunto com as informações fornecidas pelo usuário, a fim de viabilizar o item (a); (ii) algoritmos de propagação de rótulos podem ser utilizados a fim de viabilizar o item (b). Diante do apresentado, ressalta-se que as contribuições deste trabalho estão na capacidade de se extrair o conhecimento e/ou interesse do usuário de acordo com as características da base de dados e direcionar sua exploração sem a necessidade de se definir previamente o que será explorado. Além disso, os resultados obtidos demonstram a capacidade da PARLP em direcionar o usuário para o conhecimento considerado interessante, reduzindo, para tanto, a quantidade de regras a serem exploradas. Por fim, este trabalho contribui também para demonstrar que é possível tratar o pós-processamento de regras de associação como um problema de propagação de rótulos. / One of the existing data mining techniques is association rules, responsible for identifying relationships that occur in the data set. Although the association rule is one of the most widely used techniques, the amount of extracted patterns can overload the user in such a way that finding interesting patterns among the large amount of obtained patterns becomes a challenge. To solve this problem, a large part of the association-related work is focused on the post-processing step. These works generally propose a post-processing approaches that, according to a certain strategy, aims facilitating the search for interesting patterns. Nowadays, approaches have included the user knowledge in the domain and / or interests on the process. However, in the current existing approaches, the user knowledge and/or interest must be explicitly described by some formalism, requiring a considerable time and may even lead to incomplete and / or incorrect specifications. In addition, the user has no idea what probably is interesting or which patterns to begin the searching. Notice that one of the challenges of these approaches is to consider the knowledge and / or user interest. In addition, consider the number of rules the user will examine is necessary. The analysis of the rules by an expert is expensive and, in most cases, the user wants to explore the rules generated without limiting exploration to the knowledge he already has. Therefore, the user evaluate the fewest amount of rules possible is important and, based on this assessment, the post-processing approaches be able to assist in the search for the rules that he may consider interesting. So, in this work is proposed that the post-processing is treated as a transductive semi supervised classification problem, since it allows the user to label some rules based on two predefined classes (e.g. \"interesting\"or \"not interesting\"), in a way that just a small amount of the rule set needs to be explored and all other association rules are automatically labeled. Furthermore, you can implicitly capture the knowledge and / or user interest in the domain by labeling some rules. Thus, the rules need to be modeled to allow: (a) select the rules to be labeled by the user to implicitly capture their knowledge and / or interest; (b) propagate the rules\' labels classified by the user to all not labeled rules. To do so, the rules were modeled via networks in this work, due to: (i) a large amount of network measures can be used in conjunction with the information provided by the user, to make item (a) possible; (ii) label propagation algorithms can be used in order to make item (b) possible. Therefore, we highlight that the contributions of this work are the ability to extract knowledge and / or user interest according to database characteristics and direct the user exploration without previously defining what will be explored. In addition, the results demonstrate that the proposed approach is able to direct the user to the knowledge considered interesting, reducing the amount of rules to be explored. Finally, this work also contributes to demonstrate that treat the post-processing of association rules as a problem of propagation of labels is possible.
|
10 |
Classificação automática de textos por meio de aprendizado de máquina baseado em redes / Text automatic classification through machine learning based on networksRafael Geraldeli Rossi 26 October 2015 (has links)
Nos dias atuais há uma quantidade massiva de dados textuais sendo produzida e armazenada diariamente na forma de e-mails, relatórios, artigos e postagens em redes sociais ou blogs. Processar, organizar ou gerenciar essa grande quantidade de dados textuais manualmente exige um grande esforço humano, sendo muitas vezes impossível de ser realizado. Além disso, há conhecimento embutido nos dados textuais, e analisar e extrair conhecimento de forma manual também torna-se inviável devido à grande quantidade de textos. Com isso, técnicas computacionais que requerem pouca intervenção humana e que permitem a organização, gerenciamento e extração de conhecimento de grandes quantidades de textos têm ganhado destaque nos últimos anos e vêm sendo aplicadas tanto na academia quanto em empresas e organizações. Dentre as técnicas, destaca-se a classificação automática de textos, cujo objetivo é atribuir rótulos (identificadores de categorias pré-definidos) à documentos textuais ou porções de texto. Uma forma viável de realizar a classificação automática de textos é por meio de algoritmos de aprendizado de máquina, que são capazes de aprender, generalizar, ou ainda extrair padrões das classes das coleções com base no conteúdo e rótulos de documentos textuais. O aprendizado de máquina para a tarefa de classificação automática pode ser de 3 tipos: (i) indutivo supervisionado, que considera apenas documentos rotulados para induzir um modelo de classificação e classificar novos documentos; (ii) transdutivo semissupervisionado, que classifica documentos não rotulados de uma coleção com base em documentos rotulados; e (iii) indutivo semissupervisionado, que considera documentos rotulados e não rotulados para induzir um modelo de classificação e utiliza esse modelo para classificar novos documentos. Independente do tipo, é necessário que as coleções de documentos textuais estejam representadas em um formato estruturado para os algoritmos de aprendizado de máquina. Normalmente os documentos são representados em um modelo espaço-vetorial, no qual cada documento é representado por um vetor, e cada posição desse vetor corresponde a um termo ou atributo da coleção de documentos. Algoritmos baseados no modelo espaço-vetorial consideram que tanto os documentos quanto os termos ou atributos são independentes, o que pode degradar a qualidade da classificação. Uma alternativa à representação no modelo espaço-vetorial é a representação em redes, que permite modelar relações entre entidades de uma coleção de textos, como documento e termos. Esse tipo de representação permite extrair padrões das classes que dificilmente são extraídos por algoritmos baseados no modelo espaço-vetorial, permitindo assim aumentar a performance de classificação. Além disso, a representação em redes permite representar coleções de textos utilizando diferentes tipos de objetos bem como diferentes tipos de relações, o que permite capturar diferentes características das coleções. Entretanto, observa-se na literatura alguns desafios para que se possam combinar algoritmos de aprendizado de máquina e representações de coleções de textos em redes para realizar efetivamente a classificação automática de textos. Os principais desafios abordados neste projeto de doutorado são (i) o desenvolvimento de representações em redes que possam ser geradas eficientemente e que também permitam realizar um aprendizado de maneira eficiente; (ii) redes que considerem diferentes tipos de objetos e relações; (iii) representações em redes de coleções de textos de diferentes línguas e domínios; e (iv) algoritmos de aprendizado de máquina eficientes e que façam um melhor uso das representações em redes para aumentar a qualidade da classificação automática. Neste projeto de doutorado foram propostos e desenvolvidos métodos para gerar redes que representem coleções de textos, independente de domínio e idioma, considerando diferentes tipos de objetos e relações entre esses objetos. Também foram propostos e desenvolvidos algoritmos de aprendizado de máquina indutivo supervisionado, indutivo semissupervisionado e transdutivo semissupervisionado, uma vez que não foram encontrados na literatura algoritmos para lidar com determinados tipos de relações, além de sanar a deficiência dos algoritmos existentes em relação à performance e/ou tempo de classificação. É apresentado nesta tese (i) uma extensa avaliação empírica demonstrando o benefício do uso das representações em redes para a classificação de textos em relação ao modelo espaço-vetorial, (ii) o impacto da combinação de diferentes tipos de relações em uma única rede e (iii) que os algoritmos propostos baseados em redes são capazes de superar a performance de classificação de algoritmos tradicionais e estado da arte tanto considerando algoritmos de aprendizado supervisionado quanto semissupervisionado. As soluções propostas nesta tese demonstraram ser úteis e aconselháveis para serem utilizadas em diversas aplicações que envolvam classificação de textos de diferentes domínios, diferentes características ou para diferentes quantidades de documentos rotulados. / A massive amount of textual data, such as e-mails, reports, articles and posts in social networks or blogs, has been generated and stored on a daily basis. The manual processing, organization and management of this huge amount of texts require a considerable human effort and sometimes these tasks are impossible to carry out in practice. Besides, the manual extraction of knowledge embedded in textual data is also unfeasible due to the large amount of texts. Thus, computational techniques which require little human intervention and allow the organization, management and knowledge extraction from large amounts of texts have gained attention in the last years and have been applied in academia, companies and organizations. The tasks mentioned above can be carried out through text automatic classification, in which labels (identifiers of predefined categories) are assigned to texts or portions of texts. A viable way to perform text automatic classification is through machine learning algorithms, which are able to learn, generalize or extract patterns from classes of text collections based on the content and labels of the texts. There are three types of machine learning algorithms for automatic classification: (i) inductive supervised, in which only labeled documents are considered to induce a classification model and this model are used to classify new documents; (ii) transductive semi-supervised, in which all known unlabeled documents are classified based on some labeled documents; and (iii) inductive semi-supervised, in which labeled and unlabeled documents are considered to induce a classification model in order to classify new documents. Regardless of the learning algorithm type, the texts of a collection must be represented in a structured format to be interpreted by the algorithms. Usually, the texts are represented in a vector space model, in which each text is represented by a vector and each dimension of the vector corresponds to a term or feature of the text collection. Algorithms based on vector space model consider that texts, terms or features are independent and this assumption can degrade the classification performance. Networks can be used as an alternative to vector space model representations. Networks allow the representations of relations among the entities of a text collection, such as documents and terms. This type of representation allows the extraction patterns which are not extracted by algorithms based on vector-space model. Moreover, text collections can be represented by networks composed of different types of entities and relations, which provide the extraction of different patterns from the texts. However, there are some challenges to be solved in order to allow the combination of machine learning algorithms and network-based representations to perform text automatic classification in an efficient way. The main challenges addressed in this doctoral project are (i) the development of network-based representations efficiently generated which also allows an efficient learning; (ii) the development of networks which represent different types of entities and relations; (iii) the development of networks which can represent texts written in different languages and about different domains; and (iv) the development of efficient learning algorithms which make a better use of the network-based representations and increase the classification performance. In this doctoral project we proposed and developed methods to represent text collections into networks considering different types of entities and relations and also allowing the representation of texts written in any language or from any domain. We also proposed and developed supervised inductive, semi-supervised transductive and semi-supervised inductive learning algorithms to interpret and learn from the proposed network-based representations since there were no algorithms to handle certain types of relations considered in this thesis. Besides, the proposed algorithms also attempt to obtain a higher classification performance and a faster classification than the existing network-based algorithms. In this doctoral thesis we present (i) an extensive empirical evaluation demonstrating the benefits about the use of network-based representations for text classification, (ii) the impact of the combination of different types of relations in a single network and (iii) that the proposed network-based algorithms are able to surpass the classification performance of traditional and state-of-the-art algorithms considering both supervised and semi-supervised learning. The solutions proposed in this doctoral project have proved to be advisable to be used in many applications involving classification of texts from different domains, areas, characteristics or considering different numbers of labeled documents.
|
Page generated in 0.0626 seconds