41 |
Proposta de ajustamento para melhoria da confiabilidade e precisão dos pontos de rede geodésicas para fins topográficos locais / Adjustment proposal for improving the reliability and precision of geodetic network points for local topographic purposes
Santos, Antonio José Prado Martins, 06 March 2006 (has links)
Em levantamentos geodésicos planialtimétricos se faz necessário conhecer a qualidade das coordenadas estimadas de acordo com o tipo de aplicação a que se destinam. Este trabalho mostra de modo didático o estudo das teorias de análise de qualidade de rede GPS, baseando-se nas teorias de confiabilidade de rede propostas por Baarda, em 1968. As hipóteses estatísticas são fundamentais para a elaboração dos testes de detecção de erros grosseiros (outliers), que constituem a base para a análise da confiabilidade de rede. Neste trabalho são propostas três estratégias, desenvolvidas em MathCAD, para a análise da qualidade do ajustamento. Os resultados obtidos foram comparados com os dos programas comerciais Ski-Pro e Ashtech Solution, e também validados por medidas de campo feitas com estação total. As três estratégias propostas, aplicadas à rede em estudo implantada no Campus II da USP, apresentaram bons resultados. / Planialtimetric geodetic surveys require knowledge of the quality of the estimated coordinates, according to the application for which they are intended. This work presents, in a didactic way, a study of quality-analysis theories for GPS networks, based on the network reliability theory proposed by Baarda in 1968. Statistical hypothesis tests are fundamental to the development of outlier (gross error) detection tests, which form the basis of network reliability analysis. Three strategies, implemented in MathCAD, are proposed for analyzing the quality of the adjustment. The results were compared with those of two commercial programs, Ski-Pro and Ashtech Solution, and were also validated against field measurements made with a total station. The three proposed strategies, applied to a pilot network located at Campus II of USP, gave good results.
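A minimal sketch of the core of the reliability theory cited above, Baarda's w-test (data snooping) applied to the residuals of a least-squares adjustment. The design matrix A, observation vector l and weight matrix P are placeholders for an actual network adjustment, and the MathCAD strategies of the thesis are not reproduced here; the significance level alpha0 = 0.001 gives the usual critical value of about 3.29.

    import numpy as np
    from scipy import stats

    def baarda_w_test(A, l, P, sigma0=1.0, alpha0=0.001):
        """Least-squares adjustment followed by Baarda's w-test on each observation (sketch)."""
        N = A.T @ P @ A                                      # normal equations matrix
        x = np.linalg.solve(N, A.T @ P @ l)                  # estimated parameters
        v = A @ x - l                                        # residuals
        Qvv = np.linalg.inv(P) - A @ np.linalg.inv(N) @ A.T  # cofactor matrix of the residuals
        w = v / (sigma0 * np.sqrt(np.diag(Qvv)))             # standardized residuals
        k = stats.norm.ppf(1 - alpha0 / 2)                   # two-sided critical value (~3.29)
        return w, np.abs(w) > k                              # True marks a suspected outlier

An observation flagged by the test would be examined or down-weighted and the adjustment repeated, which is how the detection step feeds back into the reliability and precision of the network points.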
|
43 |
Detecção de outliers em séries espaço-temporais: análise de precipitação em Minas Gerais / Outlier Detection in Space-Time Series: Analysis of Rainfall in Minas Gerais
Silva, Alyne Neves, 24 July 2012 (has links)
Fundação de Amparo a Pesquisa do Estado de Minas Gerais / Time series are sometimes influenced by interrupting events, such as strikes or the outbreak of wars. These interruptions give rise to atypical observations, or outliers, which directly affect the homogeneity of the series and lead to erroneous inferences and interpretations about the variable under study; such observations are very common in climatological data. With the aim of detecting outliers in precipitation time series, this study sought to establish an outlier detection method. To that end, ARIMA modelling was combined with one of the classical geostatistical procedures, self-validation (cross-validation). The proposed criterion compares the residuals of the time series analysis with confidence intervals built from the self-validation residuals. Time series of mean monthly precipitation per rainy day from 43 rainfall stations in the state of Minas Gerais, covering the years 2000 to 2005, were analyzed. The analysis procedure ranges from describing the periodicity through the periodogram to obtaining the self-validation, based on semivariogram models estimated by ordinary least squares and maximum likelihood. For the period under study, 165 outliers were detected, spread among the 43 rainfall stations. The Fazenda Campo Grande station, located in the municipality of Passa Tempo, recorded the highest number of outliers, 45 in total. Given these results, the proposed method was considered very efficient in detecting outliers and, consequently, in analyzing the homogeneity of the observations. / Séries temporais são algumas vezes influenciadas por interrupções de eventos, tais como greves, eclosão de guerras, entre outras. Estas interrupções originam observações atípicas ou outliers que influenciam diretamente na homogeneidade da série, ocasionando interpretações e inferências errôneas da variável sob estudo, sendo muito comuns em dados climatológicos. Assim, com o interesse de detectar outliers em séries temporais de precipitação, o presente trabalho teve por objetivo estabelecer um método de detecção de outliers. Para tal, realizou-se a junção da modelagem ARIMA e de uma das metodologias clássicas de geoestatística, a autovalidação. O critério proposto compara os resíduos da análise de séries temporais com intervalos de confiança dos resíduos da autovalidação. Foram analisadas séries temporais da precipitação média mensal por dias chuvosos de 43 estações pluviométricas localizadas no estado de Minas Gerais, entre os anos de 2000 a 2005. Os procedimentos de análise vão da descrição da periodicidade por meio do periodograma até a obtenção da autovalidação, a partir da estimação dos modelos de semivariograma pelos métodos de mínimos quadrados ordinários e máxima verossimilhança. Pelos resultados, para o período sob estudo, foram detectados 165 outliers, espalhados entre as 43 estações pluviométricas. A estação Fazenda Campo Grande, localizada no município de Passa Tempo, foi a estação em que se registrou o maior número de outliers, 45 no total. Conforme os resultados obtidos, considerou-se o método proposto muito eficiente na detecção de outliers e, consequentemente, na análise da homogeneidade das observações.
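A minimal sketch of the comparison criterion described above, assuming the monthly series of one station is available as a 1-D array and that the geostatistical cross-validation residuals for that station have been computed in a separate step (here they are just an input array). The ARIMA order, the normal-theory interval, and the confidence level are illustrative choices, not the thesis' exact settings.

    import numpy as np
    from scipy.stats import norm
    from statsmodels.tsa.arima.model import ARIMA

    def flag_outliers(series, cv_residuals, order=(1, 0, 1), alpha=0.05):
        """Flag time points whose ARIMA residual falls outside a confidence
        interval built from the cross-validation residuals (sketch)."""
        ts_resid = ARIMA(series, order=order).fit().resid
        m = np.mean(cv_residuals)
        s = np.std(cv_residuals, ddof=1)
        z = norm.ppf(1 - alpha / 2)
        lower, upper = m - z * s, m + z * s        # interval derived from CV residuals
        return np.where((ts_resid < lower) | (ts_resid > upper))[0]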
|
44 |
Controle de qualidade de análises de solos da rede ROLAS - RS/SC e procedimentos estatísticos alternativos / Quality control of soil analyses from a network of laboratories and alternative statistical procedures
Griebeler, Gustavo, 15 February 2012 (has links)
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / Soil chemical analyses must be accurate to avoid errors in lime and fertilizer recommendations. The quality control program of the ROLAS-RS/SC network evaluates accuracy by the distance, in standard deviations, of a result from the median of the four soil samples analyzed monthly over a year. This form of evaluation requires the data sets to follow a normal distribution, so that the median can be taken as an estimate of the true value, and to be free of outliers, because outliers change the standard deviation and, consequently, the accuracy. The mathematical procedure used to compute accuracy may also allow asterisks assigned to inaccurate results to be offset by accurate ones. In addition, the criteria for cancelling asterisks may be at odds with the uncertainty associated with the analytical methods. Therefore, the objective of this work was to check the normality of the data sets, identify outliers, evaluate the accuracy calculation procedure, and quantify analytically the uncertainty associated with the extraction and determination methods for P and K, in order to verify how these aspects may affect laboratory accuracy. The Lilliefors test was applied to check normality, and outliers were identified with the quartile test. Procedures that evaluate accuracy after adjusting the data sets to normality by eliminating outliers were tested, as well as replacing the median by the mean as the central reference and computing accuracy for each analyzed attribute instead of an annual average accuracy. Repeated analyses of P and K were carried out to determine the intrinsic variability of the methods. Only 59% of the data sets followed a normal distribution, indicating that 41% of the analyzed attributes were evaluated in disagreement with the statistical assumptions. When outliers were removed, the share of data sets with normal distribution increased to 75%, which reduced the number of laboratories reaching the minimum accuracy of 85% required by the network. For data sets with normal distribution, the mean proved a better estimate of the true value than the median. Data sets outside the expected analytical range should be eliminated, while those within it whose amplitude is less than 1.5 interquartile distances should not be excluded from the accuracy calculation. The procedure that calculates an annual average accuracy hides attributes less accurate than the required minimum. The intrinsic variability of the analytical methods indicates that the asterisk cancellation criteria for P should be reassessed, whereas those for K seem appropriate. Studies on the variability of each analytical method are needed. / Os resultados de análises químicas de amostras de solo devem apresentar exatidão satisfatória para que não sejam induzidos erros de recomendação da adubação e calagem. O controle de qualidade dos laboratórios da rede ROLAS-RS/SC avalia a exatidão média anual dos resultados, sendo a mesma estabelecida em função do afastamento em desvios padrão de um resultado em relação à mediana das análises de cada uma das amostras de solo realizadas ao longo do ano.
Nesse sistema, é necessário que os dados apresentem distribuição normal para que a mediana seja assumida como estimativa do valor central das amostras de solos, associada à ausência de outliers, pois estes alteram a magnitude do desvio padrão e, consequentemente, a exatidão. O procedimento matemático de cálculo da exatidão média também pode permitir que os asteriscos recebidos em atributos de análise inexatos sejam contrabalanceados pelos atributos exatos. Além disso, os critérios de cancelamento de asteriscos podem estar em desacordo com a incerteza analítica associada aos métodos. Portanto, o objetivo do trabalho foi testar a distribuição normal dos resultados do controle de qualidade da rede ROLAS-RS/SC, identificar a presença de outliers, avaliar o procedimento de cálculo da exatidão e quantificar analiticamente a incerteza associada aos métodos de extração e determinação de P e K para verificar como estes fatores podem afetar a exatidão dos laboratórios. O teste de Lilliefors foi aplicado para avaliar a normalidade e os outliers foram identificados pelo método dos quartis. Procedimentos para avaliar a exatidão pelo ajuste à distribuição normal através da eliminação dos outliers foram testados, além da substituição da mediana pela média e avaliação do procedimento de cálculo da exatidão por atributo de análise das amostras de solo ao invés da exatidão média de todos os atributos. Repetições das análises de P e K foram realizadas para verificar a variabilidade intrínseca dos métodos. Somente 59% dos dados apresentaram distribuição normal, indicando que 41% das análises foram avaliadas em desacordo com os pressupostos estatísticos. A exclusão dos outliers elevou de 59 para 75% o número de análises com distribuição normal, o que tornou o método de avaliação mais rígido, uma vez que diminuiu o número de laboratórios com a exatidão mínima exigida de 85%. Em dados com distribuição normal, a média parece estimar melhor o valor verdadeiro que a mediana. Dados sem distribuição normal por ultrapassarem as faixas de trabalho devem ser eliminados, enquanto aqueles enquadrados dentro das faixas de trabalho e que apresentam amplitude inferior a 1,5 vezes a distância interquartílica não devem ser excluídos para o cálculo da exatidão. O procedimento que calcula a exatidão média anual oculta atributos com exatidão inferior ao mínimo preconizado pelo sistema. A variabilidade intrínseca associada aos métodos de análise indica que os critérios de cancelamento de asteriscos do P devem ser reavaliados, enquanto os critérios referentes ao K parecem estar adequados; contudo, devem-se aprofundar os estudos sobre a variabilidade de cada método analítico.
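A minimal sketch of the two screening steps named in this abstract, the Lilliefors normality test and the quartile (interquartile range) outlier test, assuming the monthly results for one analyzed attribute are available as a NumPy array. The function name and the 1.5 interquartile-range fences are illustrative, not the network's exact working ranges.

    import numpy as np
    from statsmodels.stats.diagnostic import lilliefors

    def screen_attribute(results, alpha=0.05):
        """Check normality (Lilliefors) and flag outliers by the quartile test (sketch)."""
        stat, pvalue = lilliefors(results, dist='norm')
        is_normal = pvalue > alpha
        q1, q3 = np.percentile(results, [25, 75])
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # quartile-based fences
        outliers = results[(results < lower) | (results > upper)]
        return is_normal, outliers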
|
45 |
A Note on the Size of the ADF Test with Additive Outliers and Fractional Errors. A Reappraisal about the (Non)Stationarity of the Latin-American Inflation Series / Una nota sobre el tamaño del Test ADF con outliers aditivos y errores fraccionales. Una re-evaluación de la (no) estacionariedad de las series de inflación latinoamericanas
Rodríguez, Gabriel; Ramírez, Dionisio, 10 April 2018 (has links)
This note analyzes the empirical size of the augmented Dickey-Fuller (ADF) statistic proposed by Perron and Rodríguez (2003) when the errors are fractional. This ADF statistic relies on a search procedure for additive outliers, named td, based on the first differences of the data. Simulations show that the empirical size of the ADF test is not affected by fractional errors, confirming the claim of Perron and Rodríguez (2003) that the td procedure is robust to departures from the unit root framework. In particular, the results show low sensitivity of the size of the ADF statistic with respect to the fractional parameter (d). However, as expected, when there is strong negative moving-average or negative autoregressive autocorrelation, the ADF statistic is oversized. These difficulties disappear when the sample size increases (from T = 100 to T = 200). An empirical application to eight quarterly Latin American inflation series is also provided, showing the importance of including dummy variables for the detected additive outliers. / En esta nota se analiza el tamaño empírico del estadístico Dickey y Fuller aumentado (ADF), propuesto por Perron y Rodríguez (2003), cuando los errores son fraccionales. Este estadístico se basa en un procedimiento de búsqueda de valores atípicos aditivos basado en las primeras diferencias de los datos denominado td. Las simulaciones muestran que el tamaño empírico del estadístico ADF no es afectado por los errores fraccionales confirmando el argumento de Perron y Rodríguez (2003) que el procedimiento td es robusto a las desviaciones del marco de raíz unitaria. En particular, los resultados muestran una baja sensibilidad del tamaño del estadístico ADF respecto al parámetro fraccional (d). Sin embargo, como es de esperar, cuando hay una fuerte autocorrelación negativa de tipo promedio móvil o autocorrelación autorregresiva negativa, el estadístico ADF tiene un tamaño exacto mayor que el nominal. Estas dificultades desaparecen cuando aumenta la muestra (a partir de T = 100 a T = 200). La aplicación empírica a ocho series de inflación latinoamericana trimestral proporciona evidencia de la importancia de tener en cuenta las variables ficticias para controlar por los outliers aditivos detectados.
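A rough illustration of the general idea behind combining a unit-root test with additive-outlier detection: candidate outlier dates are located from unusually large first differences, and the detected dates could then enter the test regression as dummy variables. This is only a generic sketch under simplified assumptions, not the exact td procedure of Perron and Rodríguez (2003); the threshold, the ADF options and the artificial series are illustrative.

    import numpy as np
    from statsmodels.tsa.stattools import adfuller

    def detect_additive_outliers(y, c=3.0):
        """Flag dates whose first difference is unusually large (sketch of an
        additive-outlier search based on first differences)."""
        d = np.diff(y)
        z = (d - d.mean()) / d.std(ddof=1)
        return np.where(np.abs(z) > c)[0] + 1    # index back in the original series

    y = np.cumsum(np.random.default_rng(0).normal(size=200))  # artificial random-walk series
    y[120] += 8.0                                              # inject one additive outlier
    ao_dates = detect_additive_outliers(y)
    stat, pvalue, *_ = adfuller(y, regression='c', autolag='AIC')
    print(ao_dates, stat, pvalue)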
|
46 |
Alguns métodos robustos para detectar outliers multivariados / Some robust methods to detect multivariate outliers
Fabíola Rocha de Santana Giroldo, 07 March 2008 (has links)
Observações atípicas ou outliers estão quase sempre presentes em qualquer conjunto de dados, seja ele grande ou pequeno. Isso pode ocorrer por erro no armazenamento dos dados ou por existirem realmente alguns pontos diferentes dos demais. A presença desses pontos pode causar distorções nos resultados de modelos e estimativas. Por isso, a sua detecção é muito importante e deve ser feita antes do início de uma análise mais profunda dos dados. Após esse diagnóstico, pode-se tomar uma decisão a respeito dos pontos atípicos. Uma possibilidade é corrigi-los caso tenha ocorrido erro na transcrição dos dados. Caso sejam pontos válidos, eles devem ser tratados de forma diferente dos demais, seja com uma ponderação, seja com uma análise especial. Nos casos univariado e bivariado, o outlier pode ser detectado analisando-se o gráfico de dispersão que mostra o comportamento de cada observação do conjunto de dados de interesse. Se houver pontos distantes da massa de dados, eles devem ser considerados atípicos. No caso multivariado, a detecção por meio de gráficos torna-se um pouco mais complexa porque a análise deveria ser feita observando-se duas variáveis por vez, o que tornaria o processo longo e pouco confiável, pois um ponto pode ser atípico com relação a algumas variáveis e não ser com relação a outras, o que faria com que o resultado ficasse mascarado. Neste trabalho, alguns métodos robustos para detecção de outliers em dados multivariados são apresentados. A aplicação de cada um dos métodos é feita para um exemplo. Além disso, os métodos são comparados de acordo com o resultado que cada um apresentar para o exemplo em questão e via simulação. / Unusual observations, or outliers, are frequent in any data set, whether large or small. They may arise from errors in data recording or because some points really differ from the rest. The presence of such points can distort the results of models and estimates, so their detection is very important and should be done before any deeper analysis of the data. After this diagnostic, a decision can be made about the atypical points. One possibility is to correct them if an error occurred when the data were transcribed. If they are valid points, they should be treated differently from the others, either through weighting or through a special analysis. In the univariate and bivariate cases, an outlier can be detected by analyzing the scatter plot that shows the behavior of each observation in the data set of interest; points far from the bulk of the data are considered atypical. In the multivariate case, detection through graphics becomes more complex, because the analysis would have to examine two variables at a time, making the process long and unreliable: an observation may be atypical with respect to some variables and not with respect to others, which can mask the result. In this work, some robust methods for detecting multivariate outliers are presented. Each method is applied to an example, and the methods are compared based on their results for this example and via simulation.
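The abstract does not name the specific robust methods compared, so the sketch below shows one common approach of this kind only as an illustration: robust Mahalanobis distances based on the Minimum Covariance Determinant (MCD) estimator, with a chi-square quantile as cutoff. The function name, the cutoff level and the use of scikit-learn are assumptions, not the thesis' own implementation.

    import numpy as np
    from scipy.stats import chi2
    from sklearn.covariance import MinCovDet

    def robust_outliers(X, alpha=0.025):
        """Flag multivariate outliers via robust (MCD-based) Mahalanobis distances (sketch)."""
        mcd = MinCovDet(random_state=0).fit(X)      # robust location and scatter
        d2 = mcd.mahalanobis(X)                     # squared robust distances
        cutoff = chi2.ppf(1 - alpha, df=X.shape[1]) # chi-square quantile cutoff
        return np.where(d2 > cutoff)[0]

Unlike the classical mean and covariance, the robust estimates are not pulled toward the outliers themselves, which is why this kind of method avoids the masking effect mentioned above.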
|
47 |
Exploration Framework For Detecting Outliers In Data Streams
Sean, Viseth, 27 April 2016 (has links)
Current real-world applications are generating large volumes of data that are often continuously updated over time. Detecting outliers on such evolving datasets requires the result to be continuously updated as well, and the response time is critical for these time-sensitive applications. This is challenging. First, the algorithms are complex; even mining outliers once from a static dataset is already very expensive. Second, users need to specify input parameters to approach the true outliers. Because the number of parameter settings is large, an online trial-and-error approach would be not only impractical and expensive but also tedious for analysts. Worse yet, since the dataset is changing, the best parameters need to be updated in response to user exploration requests. Overall, the large number of parameter settings and the evolving nature of the data make efficiently mining outliers from dynamic datasets very challenging. Thus, in this thesis, we design an exploration framework for detecting outliers in data streams, called EFO, which enables analysts to continuously explore anomalies in dynamic datasets. EFO is a continuous, lightweight preprocessing framework. It embraces two optimization principles, namely "best life expectancy" and "minimal trial," to compress evolving datasets into a knowledge-rich abstraction of important interrelationships among the data. An incremental sorting technique is also used to leverage the almost-ordered lists in this framework. The knowledge abstraction generated by EFO supports not only traditional outlier detection requests but also novel outlier exploration operations on evolving datasets. Our experimental study, conducted on two real datasets, demonstrates that EFO outperforms the state-of-the-art technique in terms of CPU processing cost when varying stream volume, velocity and outlier rate.
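EFO itself is not described here in enough detail to reproduce, so the sketch below shows only the generic problem it targets: distance-based outlier detection over a sliding window, where a point is flagged if it has fewer than k neighbors within radius r among the most recent points. The window size, radius and neighbor count are exactly the kind of parameters whose tuning the framework is meant to ease; all names and values are illustrative.

    import numpy as np
    from collections import deque

    def stream_outliers(stream, window=200, r=1.0, k=5):
        """Sliding-window, distance-based outlier detection (generic sketch):
        flag a point if fewer than k current-window neighbors lie within radius r."""
        win = deque(maxlen=window)
        flagged = []
        for t, x in enumerate(stream):
            x = np.asarray(x, dtype=float)
            if len(win) >= k:
                dists = np.linalg.norm(np.vstack(win) - x, axis=1)
                if np.sum(dists <= r) < k:
                    flagged.append(t)
            win.append(x)
        return flagged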
|
48 |
Detection of outliers in failure data
Gallup, Donald Robert, January 2011 (has links)
Typescript (photocopy). / Digitized by Kansas Correctional Industries
|
49 |
Cooperative Clustering Model and Its Applications
Kashef, Rasha, January 2008 (has links)
Data clustering plays an important role in many disciplines, including data mining, machine learning, bioinformatics, pattern recognition, and other fields where there is a need to learn the inherent grouping structure of data in an unsupervised manner. Many clustering approaches have been proposed in the literature, with different quality/complexity tradeoffs. Each clustering algorithm works in its own domain space, and no single algorithm is optimal for all datasets with different properties, sizes, structures, and distributions. Challenges in data clustering include identifying the proper number of clusters, scalability of the clustering approach, robustness to noise, tackling distributed datasets, and handling clusters of different configurations. This thesis addresses some of these challenges through cooperation between multiple clustering approaches.
We introduce a Cooperative Clustering (CC) model that involves multiple clustering techniques. The goal of the cooperative model is to increase the homogeneity of objects within clusters through cooperation, by developing two data structures: a cooperative contingency graph and a histogram representation of pair-wise similarities. The two data structures are designed to find the matching sub-clusters between different clusterings and to obtain the final set of cooperative clusters through a merging process. Obtaining the co-occurring objects from the different clusterings enables the cooperative model to group objects based on multiple agreement between the invoked clustering techniques. In addition, merging this set of sub-clusters using histograms offers a new way of grouping objects into more homogeneous clusters. The cooperative model is consistent, reusable, and scalable in terms of the number of adopted clustering approaches.
In order to deal with noisy data, a novel Cooperative Clustering Outliers Detection (CCOD) algorithm is implemented, applying the cooperation methodology for better detection of outliers in data. The new detection approach is designed in four phases: (1) global non-cooperative clustering, (2) cooperative clustering, (3) possible outliers detection, and finally (4) candidate outliers detection. The detection of outliers proceeds in a bottom-up scenario.
The thesis also addresses cooperative clustering in distributed Peer-to-Peer (P2P) networks. Mining large and inherently distributed datasets poses many challenges, one of which is the extraction of a global model as a summary of the clustering solutions generated from all nodes, so that the clustering quality of the distributed dataset can be interpreted as if the data were located at a single node. We developed a distributed cooperative model and architecture that work on a two-tier super-peer P2P network, called Distributed Cooperative Clustering in Super-peer P2P Networks (DCCP2P). This model aims at producing one clustering solution across the whole network. It specifically addresses scalability of network size, and consequently the complexity of distributed clustering, by modeling the distributed clustering problem as two layers of peer neighborhoods and super-peers. Summarization of the global distributed clusters is achieved through a distributed version of the cooperative clustering model.
Three clustering algorithms, k-means (KM), Bisecting k-means (BKM) and Partitioning Around Medoids (PAM), are invoked in the cooperative model. Results on various gene expression and text document datasets with different properties, configurations and degrees of outliers reveal that: (i) the cooperative clustering model achieves significant improvement in the quality of the clustering solutions compared to the non-cooperative individual approaches; (ii) the cooperative detection algorithm discovers the nonconforming objects in the data with better accuracy than contemporary approaches; and (iii) the distributed cooperative model attains the same quality as the centralized approach, or even better, and achieves decent speedup as the number of nodes increases. The distributed model offers a high degree of flexibility, scalability, and interpretability for large distributed repositories. Achieving the same results with current methodologies requires pooling the data first at one central location, which is sometimes not feasible.
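The CC model's contingency-graph and histogram merging procedure is not reproduced here; the sketch below only illustrates the underlying idea of combining several clusterings through pairwise co-occurrence agreement, using repeated k-means runs and a co-association matrix. The ensemble members, the linkage choice and all parameter values are illustrative assumptions.

    import numpy as np
    from sklearn.cluster import KMeans
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    def co_association_clustering(X, n_clusters=3, seeds=(0, 1, 2)):
        """Combine several k-means runs through a co-association matrix and cut the
        resulting dendrogram (an illustrative ensemble sketch, not the CC model itself)."""
        n = X.shape[0]
        co = np.zeros((n, n))
        for s in seeds:
            labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=s).fit_predict(X)
            co += (labels[:, None] == labels[None, :]).astype(float)
        co /= len(seeds)                           # fraction of runs agreeing on each pair
        dist = squareform(1.0 - co, checks=False)  # condensed distance matrix
        Z = linkage(dist, method='average')
        return fcluster(Z, t=n_clusters, criterion='maxclust')

Pairs of objects that co-occur in most runs end up with small ensemble distances and are merged early, which mirrors the multiple-agreement principle described in the abstract.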
|