1 |
Análise de viés em notícias na língua portuguesa / Bias analysis on newswire in Portuguese
Arruda, Gabriel Domingos de. 02 December 2015
The project described here proposes a model for analysing bias in news texts, aiming to identify the bias of media outlets toward political entities. Three types of bias are analysed: selection bias, which measures how often an entity is referenced by a media outlet; coverage bias, which measures how much prominence is given to the entity; and assertion bias, which assesses whether the entity is reported on positively or negatively. To this end, a corpus was built from news systematically extracted from 5 news producers and manually classified by polarity and target entity. Machine-learning-based sentiment analysis techniques were validated using this corpus. A methodology for bias identification was created, applying the concept of outliers to indicator metrics. Using the proposed methodology, bias was analysed on the corpus for the candidates for governor of the state of São Paulo and for the presidency; all three types of bias were identified in two news producers.
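As a rough illustration of treating bias as an outlier in an indicator metric, the sketch below flags outlets whose share of articles mentioning an entity is unusual compared with the other outlets. The abstract does not say which metric definitions or outlier rule the thesis uses, so the share metric, Tukey's IQR rule, the function names and the counts are all assumptions made for this example.

```python
from statistics import quantiles

def selection_share(mentions):
    """Map each outlet to the fraction of its articles that mention the entity.

    `mentions` maps outlet -> (articles mentioning the entity, total articles).
    """
    return {outlet: hits / total for outlet, (hits, total) in mentions.items()}

def iqr_outliers(values, k=1.5):
    """Return the keys whose value lies outside [Q1 - k*IQR, Q3 + k*IQR].

    Tukey's rule is an assumption here, not the rule used in the thesis.
    """
    q1, _, q3 = quantiles(values.values(), n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return sorted(o for o, v in values.items() if v < low or v > high)

# Hypothetical counts for five outlets (invented for illustration).
mentions = {
    "A": (120, 1000),
    "B": (110, 1000),
    "C": (130, 1000),
    "D": (125, 1000),
    "E": (400, 1000),
}
flagged = iqr_outliers(selection_share(mentions))
print(flagged)
```

With only five outlets the quartiles are coarse, so a rule like this is best read as a screening step rather than a statistical test.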
|
3 |
Cooperative Clustering Model and Its Applications
Kashef, Rasha. January 2008
Data clustering plays an important role in many disciplines, including data mining, machine learning, bioinformatics, and pattern recognition, where there is a need to learn the inherent grouping structure of data in an unsupervised manner. Many clustering approaches have been proposed in the literature, with different quality/complexity tradeoffs. Each clustering algorithm works well in its own domain, and none is optimal for all datasets of different properties, sizes, structures, and distributions. Challenges in data clustering include identifying the proper number of clusters, scalability of the clustering approach, robustness to noise, tackling distributed datasets, and handling clusters of different configurations. This thesis addresses some of these challenges through cooperation between multiple clustering approaches.
We introduce a Cooperative Clustering (CC) model that involves multiple clustering techniques. The goal of the cooperative model is to increase the homogeneity of objects within clusters through cooperation, using two data structures: a cooperative contingency graph and a histogram representation of pair-wise similarities. The two data structures are designed to find the matching sub-clusters between different clusterings and to obtain the final set of cooperative clusters through a merging process. Obtaining the co-occurring objects from the different clusterings enables the cooperative model to group objects based on agreement among the invoked clustering techniques. In addition, merging this set of sub-clusters using histograms offers a new way of grouping objects into more homogeneous clusters. The cooperative model is consistent, reusable, and scalable in terms of the number of adopted clustering approaches.
To deal with noisy data, a novel Cooperative Clustering Outliers Detection (CCOD) algorithm is implemented, applying the cooperation methodology to improve the detection of outliers in data. The new detection approach is designed in four phases: (1) Global Non-cooperative Clustering, (2) Cooperative Clustering, (3) Possible Outliers Detection, and finally (4) Candidate Outliers Detection. The detection of outliers proceeds bottom-up.
The thesis also addresses cooperative clustering in distributed Peer-to-Peer (P2P) networks. Mining large and inherently distributed datasets poses many challenges, one of which is extracting a global model that summarizes the clustering solutions generated at all nodes, so that the clustering quality of the distributed dataset can be interpreted as if it were located at one node. We developed a distributed cooperative model and architecture that work on a two-tier super-peer P2P network. The model is called Distributed Cooperative Clustering in Super-peer P2P Networks (DCCP2P). It aims at producing one clustering solution across the whole network. It specifically addresses scalability with network size, and consequently the distributed clustering complexity, by modeling the distributed clustering problem as two layers of peer neighborhoods and super-peers. Summarization of the global distributed clusters is achieved through a distributed version of the cooperative clustering model.
Three clustering algorithms, k-means (KM), Bisecting k-means (BKM) and Partitioning Around Medoids (PAM), are invoked in the cooperative model. Results on various gene expression and text document datasets with different properties, configurations and degrees of outliers reveal that: (i) the cooperative clustering model achieves significant improvement in the quality of the clustering solutions compared to the non-cooperative individual approaches; (ii) the cooperative detection algorithm discovers the nonconforming objects in data with better accuracy than contemporary approaches; and (iii) the distributed cooperative model attains quality equal to or better than that of the centralized approach and achieves decent speedup as the number of nodes increases. The distributed model offers a high degree of flexibility, scalability, and interpretability for large distributed repositories. Achieving the same results using current methodologies requires pooling the data at one central location first, which is sometimes not feasible.
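The idea of grouping objects on which several clusterings agree can be sketched in a few lines. This is only a minimal illustration of finding co-occurring objects across label assignments; it is not the thesis's cooperative contingency graph or its histogram-based merging, and the function name and the example labels are assumptions.

```python
from collections import defaultdict

def agreement_subclusters(*labelings):
    """Group object indices that share the same label in every clustering.

    Each labeling is a sequence of cluster labels, one per object, and all
    labelings must cover the same objects in the same order.
    """
    groups = defaultdict(list)
    for idx, labels in enumerate(zip(*labelings)):
        # The tuple of labels identifies one intersection of clusters.
        groups[labels].append(idx)
    return sorted(groups.values())

# Hypothetical labels from two clusterings of six objects.
labels_km = [0, 0, 0, 1, 1, 1]
labels_pam = [0, 0, 1, 1, 1, 1]
subclusters = agreement_subclusters(labels_km, labels_pam)
print(subclusters)  # [[0, 1], [2], [3, 4, 5]]
```

Object 2 ends up alone because the two clusterings disagree about it; a merging step such as the one described in the abstract would then decide which larger cluster it joins.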
|
5 |
The Detection of Outlying Fire Service’s Reports
Krasuski, Adam; Wasilewski, Piotr. 28 May 2013
We present a methodology for improving the detection of outlying Fire Service reports based on domain knowledge and dialogue with Fire & Rescue domain experts. An outlying report is considered an element that differs significantly from the remaining data. Outliers are defined and searched for on the basis of domain knowledge and dialogue with experts. We face the problem of reducing high data dimensionality without losing the specificity and real complexity of the reported incidents. We solve this problem by introducing a knowledge-based generalization level that mediates between the analysed data and the experts' domain knowledge. In the methodology we use Formal Concept Analysis methods both to generate appropriate categories from data and as tools supporting communication with domain experts. We conducted two experiments to find two types of outliers, in which outlier detection was supported by domain experts.
|
6 |
Détection de données aberrantes appliquée à la localisation GPS / Outliers detection applied to GPS localization
Zair, Salim. 07 October 2016
In this thesis, we address the problem of detecting erroneous GPS measurements. In urban areas, acquisitions are heavily degraded by multipath phenomena, that is, multiple reflections of the signals before they reach the receiver antenna. In forests, multiple obstacles block the satellite signals, which reduces measurement redundancy. While the algorithms embedded in GPS receivers detect at most one erroneous measurement per epoch, the hypothesis of a single error at a time no longer holds when data from different navigation systems are combined. The detection and management of erroneous data (faulty, aberrant or outlying, depending on the terminology) has become a major issue in autonomous navigation and robust localization applications, and poses a new technological challenge.
The main contribution of this thesis is an algorithm for detecting outlying pseudo-range measurements based on a contrario modeling. Two criteria based on the expected number of false alarms (NFA) are used to measure the consistency of a set of measurements under a noise model assumption.
Our second contribution is the introduction of Doppler measurements into the localization process. We extend outlier detection to both pseudo-range and Doppler measurements, and we propose a coupling with either the SIR particle filter or the Rao-Blackwellized particle filter, which allows the velocity to be estimated analytically.
Our third contribution is an evidential approach to the detection of outliers in the pseudo-ranges. Inspired by RANSAC, we choose, among the possible combinations of observations, the most compatible one according to a measure of consistency or inconsistency. An evidential filtering step takes the previous solution into account.
The proposed approaches outperform standard methods and demonstrate the benefit of removing outliers from the localization process.
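The step of choosing the most compatible combination of observations can be illustrated with a toy example. The sketch below works on redundant scalar measurements of one quantity rather than on GNSS pseudo-ranges, uses the population standard deviation as a stand-in inconsistency measure, and searches all combinations exhaustively; none of this is the thesis's a contrario or evidential criterion, and the function names, subset size and tolerance are assumptions.

```python
from itertools import combinations
from statistics import mean, pstdev

def most_consistent_subset(measurements, size):
    """Return the indices of the `size`-element subset with the smallest spread."""
    best = min(
        combinations(range(len(measurements)), size),
        key=lambda idx: pstdev(measurements[i] for i in idx),
    )
    return list(best)

def flag_outliers(measurements, size, tol):
    """Flag measurements farther than `tol` from the consensus estimate."""
    inliers = most_consistent_subset(measurements, size)
    estimate = mean(measurements[i] for i in inliers)
    return [i for i, m in enumerate(measurements) if abs(m - estimate) > tol]

# Hypothetical readings of the same quantity, two of them corrupted.
readings = [10.1, 9.9, 10.0, 14.5, 10.2, 3.0]
outlier_indices = flag_outliers(readings, size=4, tol=1.0)
print(outlier_indices)  # [3, 5]
```

Unlike a detector that assumes a single fault, this kind of search tolerates several simultaneous errors, at the price of a number of combinations that grows quickly with the number of measurements.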
|
7 |
The Detection of Outlying Fire Service’s Reports: FCA Driven Analytics
Krasuski, Adam; Wasilewski, Piotr. 28 May 2013
We present a methodology for improving the detection of outlying Fire Service reports based on domain knowledge and dialogue with Fire & Rescue domain experts. An outlying report is considered an element that differs significantly from the remaining data. Outliers are defined and searched for on the basis of domain knowledge and dialogue with experts. We face the problem of reducing high data dimensionality without losing the specificity and real complexity of the reported incidents. We solve this problem by introducing a knowledge-based generalization level that mediates between the analysed data and the experts' domain knowledge. In the methodology we use Formal Concept Analysis methods both to generate appropriate categories from data and as tools supporting communication with domain experts. We conducted two experiments to find two types of outliers, in which outlier detection was supported by domain experts.
|
8 |
Restrições da correlação nos testes de germinação de sementes e emergência de plântulas / Restrictions of the correlation in the tests of seed germination and seedling emergence
Cursino, Celso. 27 December 2006
Pearson's correlation coefficient r is used to compare scientific experiments. In seed technology it is used to compare the results of procedures that measure vigour. When correlation results predicted on the basis of similar conditions fail to materialise, Pearson's correlation faces criticism, attributed mainly to two causes. The first is statistical: the use of Pearson's correlation is subject to prescriptions that are not always observed, perhaps because they are not understood as assumptions. It requires naturally associated metric variables with a bivariate normal distribution, pairing, homoscedasticity, a rectilinear scatter cloud, and detection of outliers. To these are added practical observations: the correlation may be valid only over a restricted range of the data series; value ranges must be created to rate it from "low" to "high"; graphical analysis is needed; and significance must be interpreted correctly, among others. The second cause is biological variation due to various factors external and internal to the seeds, which sometimes serves as support for conclusions of interest to the researcher. To identify the applicability of correlations and the causes of odd values of r, we compared existing data from the Seed Analysis Laboratory of ICIAG at the Universidade Federal de Uberlândia-MG, accelerated-ageing germination tests carried out in the laboratory under optimal conditions of repeatability, field seedling emergence tests, and other simulated variables. Odd results occurred. The usual scatter plot of X against Y satisfactorily shows the correlation of naturally associated variables when n is large. However, if the covariance is not so obvious, the scatter Y=f(X) is not sufficient to show a simultaneous increase or decrease of the variables. With an alternative method that plots each variable against an auxiliary variable Z with the same n elements as X and Y, we could study the behaviour of each variable individually. The graphical method made it possible to rate correlations as valid or not according to the similarity of the variables, comparable to homoscedasticity; to check the influence of outliers for small or large n; to identify groupings of outliers in a dissident range; and to show the effect of treatments. In the cases analysed, we concluded that, when comparing seed vigour using laboratory results alone, when relating them to field results, and among simulated data, inconsistent correlation results predominate because, among other reasons, the prescriptions in the literature are not followed. The magnitude of the distortions due to statistical causes left no room for measuring the effects of variation in the biological condition of the seeds, temporal changes related to handling, or edaphoclimatic changes.
Keywords: 1. Failure in correlations 2. Correlation reliability
Mestre em Agronomia (Master in Agronomy)
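The influence of a single outlier on r when n is small can be checked with a short calculation. The sketch below computes Pearson's r from its definition and compares a small dataset with and without one extreme point; the data are invented for illustration and do not come from the dissertation.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Five points lying on a perfectly decreasing line (invented data).
x = [1, 2, 3, 4, 5]
y = [5, 4, 3, 2, 1]
r_clean = pearson_r(x, y)

# The same points plus one extreme observation at (20, 20).
r_with_outlier = pearson_r(x + [20], y + [20])

print(round(r_clean, 2), round(r_with_outlier, 2))
```

Here one added point turns a perfect negative correlation into a strongly positive one (about 0.92), which is why inspecting the scatter plot before interpreting r matters most for small samples.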
|