81 |
Optimisation of a Diagnostic Test for a Truck Engine / Optimering av ett diagnostest för en lastbilsmotor. Haraldsson, Petter, January 2002
Diagnostic systems are becoming increasingly important in the field of vehicle systems, largely because new rules and regulations force manufacturers of heavy-duty trucks to monitor the emission process in their engines throughout the lifetime of the truck. To do this, a diagnostic system has to be implemented that continuously monitors the process and checks that the emission thresholds set by the government are not exceeded. There is also a demand that this system be reliable, i.e., that it produce neither false alarms nor missed detections. One way of producing such a system is to use a model-based diagnosis system in which thresholds are set to decide whether the system is faulty or not. There are several difficulties involved in this. Firstly, there is no way of knowing whether the logged signals are corrupt or not, because faults in these signals are precisely what should be detected. Secondly, because of the strict reliability demands, the thresholds have to be set where there is a very low probability of observing values during normal driving. In this thesis a methodology is proposed for setting thresholds in a diagnosis system for an experimental test engine at Scania. Measurement data have been logged over 20 hours of effective driving on two individuals of the same engine. It is shown that the result is improved significantly by using this method and that the threshold can be set so that smaller faults in the system can be reliably detected.
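As a rough illustration of the thresholding principle described in this abstract, the sketch below (Python) places a test threshold at a high empirical quantile of residuals logged during fault-free driving, so that values exceeding it are very unlikely under normal operation. The residual data and the alarm level alpha are invented stand-ins, not values from the thesis.

```python
import numpy as np

# Stand-in for residuals of a model-based diagnosis test, logged during
# fault-free driving (the thesis used ~20 h of Scania engine data).
rng = np.random.default_rng(0)
fault_free_residuals = np.abs(rng.normal(0.0, 1.0, size=50_000))

# Assumed per-test false-alarm probability (a design choice).
alpha = 1e-4

# Place the threshold at the (1 - alpha) empirical quantile, so an alarm
# is raised only where fault-free values are very unlikely.
threshold = np.quantile(fault_free_residuals, 1.0 - alpha)

def diagnose(residual: float) -> bool:
    """Return True if the diagnostic test signals a fault."""
    return abs(residual) > threshold

print(f"threshold = {threshold:.3f}")
print(diagnose(3.0), diagnose(6.0))  # expected: False, True
```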
|
82 |
Factores que se asocian con el bajo peso del recién nacido (Factors associated with low birth weight). Corasma Uñurucu, Vilma Yovanna, January 2002
Low birth weight is considered a general indicator of health in developing countries; hence the importance of identifying the factors with the greatest influence on low birth weight. We use logistic regression analysis to classify newborns into two groups: low birth weight and normal birth weight. The study is based on all patients seen at the Instituto Materno Perinatal during the months of January to June of the current year. The work is not limited to estimating the parameters of the model; it also covers the validation of assumptions, evaluation of the goodness of fit of the model, residual analysis, detection of influential observations and, finally, evaluation of the predictive capacity of the proposed model.
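A minimal sketch of the classification approach described above, using logistic regression to separate low from normal birth weight. The predictors and simulated data are hypothetical placeholders; the study's actual covariates come from Instituto Materno Perinatal records.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Hypothetical covariates standing in for the study's clinical variables.
rng = np.random.default_rng(1)
n = 500
gestational_weeks = rng.normal(38, 2, n)
prenatal_visits = rng.poisson(6, n).astype(float)
maternal_age = rng.normal(27, 6, n)
X = np.column_stack([gestational_weeks, prenatal_visits, maternal_age])

# Simulated outcome: low birth weight (< 2500 g) made more likely by
# shorter gestation (1 = low birth weight, 0 = normal weight).
p = 1.0 / (1.0 + np.exp(0.9 * (gestational_weeks - 36)))
y = rng.binomial(1, p)

model = LogisticRegression(max_iter=1000).fit(X, y)
print(classification_report(y, model.predict(X)))
```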
|
84 |
A Framework for Participatory Sensing Systems. Mendez Chaves, Diego, 01 January 2012
Participatory sensing (PS) systems are an emerging sensing paradigm based on the cooperative participation of cellular users. Due to the spatio-temporal granularity that a PS system can provide, it is now possible to detect and analyze events that occur at different scales, at low cost. While PS systems present interesting characteristics, they also create new problems. Since the measuring devices are cheap and in the hands of the users, PS systems face several design challenges related to the poor accuracy and high failure rate of the sensors, the possibility of malicious users tampering with the data, the violation of the privacy of the users, the need for methods to encourage user participation, and the effective visualization of the data. This dissertation presents four main contributions to solve some of these challenges.
This dissertation presents a framework to guide the design and implementation of PS applications considering all these aspects. The framework consists of five modules: sample size determination, data collection, data verification, data visualization, and density map generation. The remaining contributions map one-to-one to three of these modules: data verification, data visualization and density maps.
Data verification, in the context of PS, is the process of detecting and removing spatial outliers so that the variables of interest can be properly reconstructed. A new algorithm for spatial outlier detection and removal is proposed, implemented, and tested. This hybrid neighborhood-aware algorithm considers the uneven spatial density of the users, the number of malicious users, the level of conspiracy, and inaccurate and malfunctioning sensors. The experimental results show that the proposed algorithm performs as well as the best estimator while reducing the execution time considerably.
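The dissertation's hybrid algorithm is not reproduced here, but the sketch below illustrates the basic neighborhood-aware idea: a reading is suspect when it deviates strongly from its spatial neighbors. The k-nearest-neighbor median comparison and the MAD-based cutoff are illustrative choices, not the algorithm itself.

```python
import numpy as np
from scipy.spatial import cKDTree

def spatial_outliers(coords, values, k=8, z_cut=3.0):
    """Flag readings that deviate strongly from their spatial neighbors.

    A reading is an outlier if its difference from the median of its k
    nearest neighbors exceeds z_cut robust (MAD-based) standard deviations.
    """
    tree = cKDTree(coords)
    # k + 1 because each point is its own nearest neighbor.
    _, idx = tree.query(coords, k=k + 1)
    neigh_median = np.median(values[idx[:, 1:]], axis=1)
    diff = values - neigh_median
    mad = np.median(np.abs(diff - np.median(diff)))
    robust_z = 0.6745 * (diff - np.median(diff)) / mad
    return np.abs(robust_z) > z_cut

# Usage: 200 participants, one tampered/faulty sensor at index 0.
rng = np.random.default_rng(2)
coords = rng.uniform(0, 10, size=(200, 2))
values = np.sin(coords[:, 0]) + rng.normal(0, 0.05, 200)
values[0] += 5.0  # corrupted reading
print(np.flatnonzero(spatial_outliers(coords, values)))
```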
The problem of data visualization in the context of PS applications is also of special interest. The characteristics of a typical PS application imply the generation of multivariate time-space series with many gaps in time and space. Considering this, a new method is presented based on the kriging technique along with Principal Component Analysis and Independent Component Analysis. Additionally, a new technique to interpolate data in time and space is proposed, which is more appropriate for PS systems. The results indicate that the accuracy of the estimates improves with the amount of data, i.e., one variable, multiple variables, and space and time data. The results also clearly show the advantage of a PS system over a traditional measuring system in terms of the precision and spatial resolution of the information provided to the users.
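As a sketch of gap-filling by kriging, the snippet below uses Gaussian-process regression with an RBF kernel (the machine-learning equivalent of kriging) to interpolate sparse, noisy readings onto a grid; the PCA/ICA multivariate extension described above is not shown. All data are synthetic stand-ins.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Sparse, noisy PS readings of one variable (e.g., a pollutant level).
rng = np.random.default_rng(3)
X_obs = rng.uniform(0, 10, size=(60, 2))  # sample locations
y_obs = np.sin(X_obs[:, 0]) + 0.1 * X_obs[:, 1] + rng.normal(0, 0.05, 60)

# Simple kriging-like GP: RBF spatial correlation plus a noise term.
gp = GaussianProcessRegressor(kernel=1.0 * RBF(1.0) + WhiteKernel(0.01),
                              normalize_y=True).fit(X_obs, y_obs)

# Interpolate onto a regular grid; the predictive std quantifies the
# uncertainty left by the gaps in space.
gx, gy = np.meshgrid(np.linspace(0, 10, 25), np.linspace(0, 10, 25))
grid = np.column_stack([gx.ravel(), gy.ravel()])
mean, std = gp.predict(grid, return_std=True)
print(mean.shape, std.max())
```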
One key challenge in PS systems is determining the locations, and number, of users from which to obtain samples, so that the variables of interest can be accurately represented with a low number of participants. To address this challenge, the use of density maps is proposed, a technique based on the current estimates of the variable. The density maps are then utilized by the incentive mechanism to encourage the participation of those users indicated in the map. The experimental results show how the density maps greatly improve the quality of the estimates while maintaining a stable and low total number of users in the system.
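A loose sketch of the density-map idea, under the simplifying assumption that "need" is just the inverse of current sample coverage; the dissertation's maps are based on the current estimates of the variable, which is more refined than this.

```python
import numpy as np

# Current participant locations (synthetic stand-ins).
rng = np.random.default_rng(4)
samples = rng.normal(5, 2, size=(300, 2)).clip(0, 10)

# Coverage map: 2-D histogram of where samples already exist.
H, xedges, yedges = np.histogram2d(samples[:, 0], samples[:, 1],
                                   bins=10, range=[[0, 10], [0, 10]])

# "Need" map: under-sampled cells score highest; an incentive mechanism
# would recruit users located in the top-scoring cells.
need = H.max() - H
flat = np.argsort(need.ravel())[::-1][:5]
rows, cols = np.unravel_index(flat, need.shape)
print("cells to recruit in (row, col):", list(zip(rows.tolist(), cols.tolist())))
```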
P-Sense, a PS system to monitor pollution levels, has been implemented and tested, and is used as a validation example for all the contributions presented here. P-Sense integrates gas and environmental sensors with a cell phone, in order to monitor air quality levels.
|
85 |
Identifikation av icke-representativa svar i frågeundersökningar genom detektion av multivariata avvikare (Identification of non-representative survey responses through detection of multivariate outliers). Galvenius, Hugo, January 2014
To United Minds, large-scale surveys are an important offering to clients, not least the public opinion poll Väljarbarometern. A risk associated with surveys is satisficing: sub-optimal response behaviour that impairs the possibility of correctly describing the sampled population through the results. The purpose of this study is to identify, through the use of multivariate outlier detection methods, those observations assumed to be non-representative of the population. The possibility of categorizing responses generated through satisficing as outliers is investigated. With regard to the character of the Väljarbarometern dataset, three existing algorithms are adapted to detect these outliers. A number of randomly generated observations are also added to the data, all of which are correctly labelled as outliers by every algorithm. The anomaly scores generated by each algorithm are compared, with the conclusion that the Otey algorithm is the most effective for the purpose, above all because it takes into account correlation between variables. A plausible cut-off value for outliers, and the separation between non-representative and representative outliers, are discussed. The resulting recommendation is to handle observations labelled as outliers through respondent follow-up or, if that is not possible, through downweighting inversely proportional to the anomaly scores.
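The Otey algorithm itself handles mixed categorical data via infrequent itemsets; the sketch below captures only its spirit, scoring each respondent by how many of their answer pairs are rare across the sample, so that correlation between questions is taken into account. The dataset, threshold sigma and scoring rule are all illustrative assumptions.

```python
import numpy as np
from itertools import combinations
from collections import Counter

def pairwise_rarity_scores(data, sigma=0.05):
    """Score respondents by their number of rare answer pairs.

    data: (n_respondents, n_questions) array of categorical answer codes.
    A pair of answers is "rare" if its joint support across respondents is
    below sigma; counting pairs (2-itemsets) lets the score reflect
    correlation between questions, in the spirit of the Otey algorithm.
    """
    n, m = data.shape
    pair_counts = Counter()
    for row in data:
        for i, j in combinations(range(m), 2):
            pair_counts[(i, row[i], j, row[j])] += 1
    scores = np.zeros(n)
    for r, row in enumerate(data):
        for i, j in combinations(range(m), 2):
            if pair_counts[(i, row[i], j, row[j])] / n < sigma:
                scores[r] += 1.0
    return scores

# Usage: 1000 respondents, 5 questions with correlated answers;
# respondent 0 answers haphazardly and should score highest.
rng = np.random.default_rng(5)
data = np.tile(rng.integers(0, 3, size=(1, 5)), (1000, 1))
data = data + rng.integers(0, 2, size=(1000, 5))  # correlated answers
data[0] = rng.integers(0, 6, size=5)              # anomalous respondent
scores = pairwise_rarity_scores(data)
print(scores[0], scores[1:].max())
```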
|
87 |
Robust second-order least squares estimation for linear regression models. Chen, Xin, 10 November 2010
The second-order least squares estimator (SLSE), proposed by Wang (2003), is asymptotically more efficient than the least squares estimator (LSE) if the third moment of the error distribution is nonzero. However, it is not robust against outliers. In this paper we propose two robust second-order least squares estimators (RSLSE) for linear regression models, RSLSE-I and RSLSE-II, where RSLSE-I is robust against X-outliers and RSLSE-II is robust against both X-outliers and Y-outliers. The basic idea is to choose proper weight matrices, which give a zero weight to an outlier. The RSLSEs are asymptotically normally distributed and are highly efficient with a high breakdown point. Moreover, we compare the RSLSEs with the LSE, the SLSE and the robust MM-estimator through simulation studies and real data examples. The results show that they perform very well and are competitive with other robust regression estimators.
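A sketch of the weighting idea described above: assign zero weight to X-outliers (flagged by robust Mahalanobis distances on the design space) and to Y-outliers (flagged by large robust standardized residuals). This illustrates the weight-matrix construction only; it is neither Wang's SLSE objective nor the paper's exact RSLSE.

```python
import numpy as np
from scipy import stats
from sklearn.covariance import MinCovDet

def robust_weights(X, y, chi2_q=0.975):
    """0/1 weights that drop X-outliers and Y-outliers (illustrative)."""
    n, p = X.shape
    # X-outliers: robust squared Mahalanobis distances via MCD.
    mcd = MinCovDet(random_state=0).fit(X)
    d2 = mcd.mahalanobis(X)
    w_x = d2 <= stats.chi2.ppf(chi2_q, df=p)

    # Y-outliers: standardized residuals from a fit on the clean-X rows,
    # scaled robustly with the MAD.
    beta = np.linalg.lstsq(X[w_x], y[w_x], rcond=None)[0]
    resid = y - X @ beta
    s = 1.4826 * np.median(np.abs(resid - np.median(resid)))
    w_y = np.abs(resid) <= 3 * s
    return (w_x & w_y).astype(float)

# Usage on synthetic data with one X-outlier and one Y-outlier.
rng = np.random.default_rng(6)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(0, 0.5, 100)
X[0] = [10.0, 10.0]  # X-outlier
y[1] += 15.0         # Y-outlier
w = robust_weights(X, y)
print("zero-weight rows:", np.flatnonzero(w == 0))
```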
|
88 |
Robust principal component analysis biplots. Wedlake, Ryan Stuart, 2008
Thesis (MSc (Mathematical Statistics))--University of Stellenbosch, 2008. / In this study several procedures for finding robust principal components (RPCs) for low and high dimensional data sets are investigated in parallel with robust principal component analysis (RPCA) biplots. These RPCA biplots will be used for the simultaneous visualisation of the observations and variables in the subspace spanned by the RPCs. Chapter 1 contains: a brief overview of the difficulties that are encountered when graphically investigating patterns and relationships in multidimensional data and why PCA can be used to circumvent these difficulties; the objectives of this study; a summary of the work done in order to meet these objectives; certain results in matrix algebra that are needed throughout this study. In Chapter 2 the derivation of the classic sample principal components (SPCs) is first discussed in detail since they are the "building blocks" of classic principal component analysis (CPCA) biplots. Secondly, the traditional CPCA biplot of Gabriel (1971) is reviewed. Thirdly, modifications to this biplot using the new philosophy of Gower & Hand (1996) are given attention. Reasons why this modified biplot has several advantages over the traditional biplot, some of which are aesthetical in nature, are given. Lastly, changes that can be made to the Gower & Hand (1996) PCA biplot to optimally visualise the correlations between the variables are discussed.
Because the SPCs determine the position of the observations as well as the orientation of the arrows (traditional biplot) or axes (Gower & Hand biplot) in the PCA biplot subspace, it is useful to accompany the biplot display with estimates of the standard errors of the SPCs as an indication of the stability of the biplot. Firstly, a computer-intensive statistical technique called the bootstrap is discussed, which is used to calculate the standard errors of the SPCs without making underlying distributional assumptions (see the sketch after this abstract). Secondly, the influence of outliers on bootstrap results is investigated. Lastly, a robust form of the bootstrap is briefly discussed for calculating standard error estimates that remain stable with or without the presence of outliers in the sample. All the preceding topics are the subject matter of Chapter 3.

In Chapter 4, reasons why a PC analysis should be made robust in the presence of outliers are firstly discussed. Secondly, different types of outliers are discussed. Thirdly, a method for identifying influential observations and a method for identifying outlying observations are investigated. Lastly, different methods for constructing robust estimates of location and dispersion for the observations receive attention. These robust estimates are used in numerical procedures that calculate RPCs.

In Chapter 5, an overview of some of the procedures that are used to calculate RPCs for lower and higher dimensional data sets is firstly given. Secondly, two numerical procedures that can be used to calculate RPCs for lower dimensional data sets are discussed and compared in detail. Details and examples of robust versions of the Gower & Hand (1996) PCA biplot that can be constructed using these RPCs are also provided.

In Chapter 6, five numerical procedures for calculating RPCs for higher dimensional data sets are discussed in detail. Once RPCs have been obtained using these methods, they are used to construct robust versions of the Gower & Hand (1996) PCA biplot. Details and examples of these robust PCA biplots are also provided.

An extensive software library has been developed so that the biplot methodology discussed in this study can be used in practice. The functions in this library are given in an appendix at the end of this study. The library is applied to data sets from various fields so that the merit of the theory developed in this study can be visually appraised.
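A minimal sketch, in Python, of the bootstrap standard-error computation for principal components discussed in Chapter 3, assuming simple row resampling and only the first component; the sign-alignment step reflects that PCs are defined only up to sign. This is an illustration, not the study's software library.

```python
import numpy as np

def bootstrap_pc_se(X, n_boot=500, seed=0):
    """Bootstrap standard errors of the first principal component's
    loadings, without distributional assumptions."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    def first_pc(data):
        data = data - data.mean(axis=0)
        _, _, vt = np.linalg.svd(data, full_matrices=False)
        return vt[0]

    ref = first_pc(X)
    pcs = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        pc = first_pc(X[rng.integers(0, n, n)])
        # PCs are sign-ambiguous; align each replicate with the reference.
        pcs[b] = pc if pc @ ref >= 0 else -pc
    return pcs.std(axis=0, ddof=1)

# Usage on synthetic 3-variable data.
rng = np.random.default_rng(7)
X = rng.multivariate_normal([0, 0, 0],
                            [[3, 1, 0], [1, 2, 0], [0, 0, 1]], size=200)
print(bootstrap_pc_se(X))
```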
|
89 |
Análise de "outliers" para o controle do risco de evasão tributária do ICMSBittencourt Neto, Sérgio Augusto Pará 03 July 2018 (has links)
Master's dissertation, Universidade de Brasília, Instituto de Ciências Exatas, Departamento de Ciência da Computação, 2018. / This dissertation presents the combined application of selected statistical models and data mining methods to outlier analysis of Electronic Invoice (Nota Fiscal Eletrônica) and Electronic Fiscal Book data, supporting the investigation of new forms of ICMS tax evasion. Three approaches are combined: 1. the mathematical programming method of Data Envelopment Analysis (DEA), to single out companies whose tax collection performance is inefficient relative to their economic segment and to select suspect taxpayers for investigation; 2. time series analysis of the fiscal data used in computing the tax (graphical comparison of actual values against the corresponding book entries, boxplots, decomposition into trend and seasonal components, and the Holt-Winters exponential smoothing model), to detect anomalous time periods (outliers); and 3. further descriptive statistical techniques (frequency distribution plots), probabilistic techniques (Chebyshev's inequality and the Newcomb-Benford law) and K-Means clustering applied to the selected taxpayers' fiscal information, to identify suspect book entries and fiscal documents. A computational tool written in R (RStudio platform) is proposed to extract the data from the Federal District Revenue database (ORACLE), process the information by applying the designated models and methods and, finally, present the results in analytical dashboards that facilitate and optimize audit work. Identifying anomalous circumstances through a systematic treatment of the data thus makes the scheduling of tax audits more efficient.
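As an illustration of one of the screening checks named above, the sketch below tests the first digits of synthetic invoice amounts against the Newcomb-Benford law with a chi-square statistic; the dissertation's actual pipeline runs in R against the Revenue database, so this Python sketch is only a stand-in.

```python
import numpy as np
from scipy import stats

def benford_first_digit_test(amounts):
    """Chi-square goodness-of-fit of leading digits against Benford's law."""
    amounts = np.asarray(amounts, dtype=float)
    amounts = amounts[amounts > 0]
    # Leading digit: shift each amount into [1, 10) and truncate.
    first = (amounts / 10.0 ** np.floor(np.log10(amounts))).astype(int)
    first = np.clip(first, 1, 9)  # guard against floating-point edge cases
    observed = np.bincount(first, minlength=10)[1:10]
    expected = len(amounts) * np.log10(1.0 + 1.0 / np.arange(1, 10))
    return stats.chisquare(observed, expected)

# Usage: log-uniform amounts approximately follow Benford's law.
rng = np.random.default_rng(8)
invoices = 10.0 ** rng.uniform(0, 4, size=5000)
print(benford_first_digit_test(invoices))
```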
|
90 |
Proposta de um novo método para o planejamento de redes geodésicas (Proposal of a new method for the design of geodetic networks). Klein, Ivandro, January 2014
The aim of this work is to develop and propose a new method for the design of geodetic networks. The design (planning or pre-analysis) of a geodetic network consists of planning (or optimizing) the network so that it meets pre-established quality criteria set according to the project objectives, such as accuracy, reliability and cost. In the method proposed here, the criteria to be considered at the design stage are the minimum acceptable levels of reliability and homogeneity of the observations; the positional accuracy of the points, considering both the effects of precision and the (possible) effects of bias, at a given confidence level; the maximum allowable number of undetected outliers; and the minimum power of the test of the Data Snooping procedure (DS) in the n-dimensional scenario, i.e., considering all observations (tested individually). According to the classifications found in the literature, the method proposed here is a combined design, solved by means of a trial-and-error approach, and presents some novel aspects in its design criteria.

To demonstrate its practical application, a numerical example of the design of a GNSS (Global Navigation Satellite System) network is presented and described. The results obtained after processing the data of the GNSS network agreed with the values estimated at the design stage, i.e., the method proposed here performed satisfactorily in practice. Moreover, it was also investigated how the pre-established criteria, the geometry/configuration of the geodetic network and the initial precision/correlation of the observations can influence the results obtained at the design stage when following the proposed method. Among other findings, these experiments showed that all the design criteria of the proposed method are intrinsically interrelated: for example, low redundancy leads to a relatively higher value for the precision component and consequently a relatively lower value for the bias component (keeping the final accuracy constant), which also leads to a significantly lower minimum power of the test in both the one-dimensional and n-dimensional scenarios.
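A sketch of two standard design-stage reliability quantities underlying the criteria above: local redundancy numbers and the minimal detectable bias (MDB) of each observation for the Data Snooping test, following Baarda-style reliability theory. The toy network, the standard deviations and the non-centrality parameter are assumed values, not the thesis's GNSS example.

```python
import numpy as np

def reliability(A, sigma, delta0=4.13):
    """Redundancy numbers and MDBs for a network with design matrix A and
    observation standard deviations sigma. delta0 is the non-centrality
    parameter implied by the chosen significance level and power
    (4.13 corresponds roughly to alpha0 = 0.1% and power 80%).
    """
    P = np.diag(1.0 / sigma**2)                  # weight matrix
    N_inv = np.linalg.inv(A.T @ P @ A)
    Qv = np.linalg.inv(P) - A @ N_inv @ A.T      # cofactor matrix of residuals
    r = np.diag(Qv @ P)                          # local redundancy numbers
    mdb = delta0 * sigma / np.sqrt(r)            # minimal detectable biases
    return r, mdb

# Toy leveling-style network: 4 observations, 2 unknown heights.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, 1.0],
              [1.0, -1.0]])
sigma = np.array([0.003, 0.003, 0.005, 0.005])   # metres, assumed
r, mdb = reliability(A, sigma)
print("redundancy:", r.round(2), "MDB (m):", mdb.round(4))
```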
|