131

Les oubliés de la recommandation sociale / The forgotten users of social recommendation

Gras, Benjamin 18 January 2018 (has links)
A recommender system aims at providing relevant resources to a user, known as the active user. To produce its recommendations, the system exploits the information it has collected about the active user or about the resources. Collaborative filtering (CF) is a widely used recommendation approach that exploits the preferences users express on resources. CF rests on the assumption that preferences are consistent across users, which allows a user's preferences to be inferred from those of other users. In a CF-based recommender system, at least one community of users has to share the preferences of the active user for the system to provide him with high-quality recommendations. Let us define a specific preference as a preference that is not shared by any group of users. A user with several specific preferences will likely be poorly served by a classical CF approach; this is the problem of Grey Sheep Users (GSU). In this thesis, I address three separate questions. 1) What is a specific preference? I answer by proposing associated hypotheses that I validate experimentally. 2) How can GSU be identified in preference data? This identification matters in order to anticipate the low-quality recommendations these users will receive. I propose numerical indicators that identify GSU in a social recommendation dataset; these indicators significantly outperform the state of the art and isolate the users whose recommendation quality is lowest. 3) How can GSU be modelled to improve the quality of the recommendations they receive? I propose recommendation approaches, inspired by machine learning, dedicated to modelling GSU and improving the quality of the recommendations provided to them.
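The thesis's own grey-sheep indicators are not reproduced in this listing; the sketch below is only a minimal illustration, on assumed toy data, of the user-based collaborative-filtering setting the abstract describes: a user whose ratings correlate with no neighbourhood gets a low mean top-k similarity, which is the kind of signal a grey-sheep detector builds on. The function names (pearson_sim, grey_sheep_score) and the rating matrix are illustrative, not taken from the thesis.

```python
import numpy as np

def pearson_sim(u, v):
    """Pearson correlation over co-rated items; 0 if fewer than two common ratings."""
    mask = ~np.isnan(u) & ~np.isnan(v)
    if mask.sum() < 2:
        return 0.0
    a, b = u[mask], v[mask]
    if a.std() == 0 or b.std() == 0:
        return 0.0
    return float(np.corrcoef(a, b)[0, 1])

def grey_sheep_score(ratings, user, k=2):
    """Mean similarity to the k most similar users; a low score suggests a grey sheep."""
    sims = [pearson_sim(ratings[user], ratings[v])
            for v in range(ratings.shape[0]) if v != user]
    return float(np.mean(sorted(sims, reverse=True)[:k]))

# Toy user x item rating matrix (np.nan = unrated); user 3 agrees with nobody.
R = np.array([
    [5, 4, np.nan, 1, 2],
    [5, 5, 1, np.nan, 1],
    [4, 5, 2, 1, np.nan],
    [1, np.nan, 5, 5, 4],
])
for u in range(R.shape[0]):
    print(u, round(grey_sheep_score(R, u), 3))
```

On this toy matrix the last user's score is strongly negative, the rough analogue of a grey sheep that no neighbourhood-based CF prediction can serve well.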
132

Critérios robustos de seleção de modelos de regressão e identificação de pontos aberrantes / Robust model selection criteria in regression and outliers identification

Guirado, Alia Garrudo 08 March 2019 (has links)
Robust regression arises as an alternative to least-squares fitting when the errors are contaminated by outliers or there is some evidence that the model assumptions are violated. Classical regression offers well-known model selection criteria and diagnostic measures. The objective of this work is to present the main robust model selection criteria and outlier detection measures, and to analyse and compare their performance under different scenarios in order to determine which of them are best suited to particular situations. Cross-validation criteria based on Monte Carlo simulation and the Bayesian Information Criterion are known to perform well in model identification; this work confirms that fact and shows that their robust alternatives also stand out in this respect. Residual analysis is a strong tool for model diagnostics, and the work shows that classical residual analysis applied to a robust linear regression fit, as well as the analysis of the observation weights, are efficient outlier detection measures. The criteria and measures analysed were applied to a data set from the Meteorological Station of the Institute of Astronomy, Geophysics and Atmospheric Sciences of the University of São Paulo to detect which meteorological variables influence the daily minimum temperature over the whole year, and a model was fitted that identifies the days associated with the arrival of frontal systems.
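As a rough companion to the abstract, here is a generic robust-regression sketch (Huber M-estimation fitted by iteratively reweighted least squares), not the dissertation's criteria: the final IRLS weights play exactly the role described above, since observations that end up with small weights are the outlier candidates. The helper name huber_irls, the tuning constant c = 1.345 and the toy data are assumptions for illustration only.

```python
import numpy as np

def huber_irls(X, y, c=1.345, n_iter=50):
    """Huber M-estimation via iteratively reweighted least squares.
    Returns coefficients and final weights; small weights point to outliers."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]          # ordinary LS start
    w = np.ones(len(y))
    for _ in range(n_iter):
        r = y - X @ beta
        scale = 1.4826 * np.median(np.abs(r - np.median(r))) + 1e-12  # MAD scale
        u = np.abs(r) / scale
        w = np.minimum(1.0, c / np.maximum(u, 1e-12))    # Huber weights
        W = np.diag(w)
        beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta, w

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 40)
y = 2.0 + 0.5 * x + rng.normal(0, 0.3, 40)
y[:3] += 8                                               # three gross errors
X = np.column_stack([np.ones_like(x), x])
beta, w = huber_irls(X, y)
print("coefficients:", np.round(beta, 2))                # close to (2.0, 0.5)
print("low-weight (suspect) observations:", np.where(w < 0.5)[0])
```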
133

Controle de qualidade no ajustamento de observações geodésicas / Quality control in the adjustment of geodetic observations

Klein, Ivandro January 2012 (has links)
After the adjustment of observations by the least squares method (LSM), non-random errors in the observations can be detected and identified by means of statistical tests. Reliability theory uses appropriate measures to quantify the minimal detectable bias in an observation and its influence on the adjusted parameters if it goes undetected. Conventional reliability theory was developed for conventional testing procedures, such as data snooping, which assume that only one observation at a time is contaminated by gross errors. Generalized reliability measures have recently been developed for statistical tests that assume the simultaneous existence of multiple erroneous observations (outliers). Other approaches to the quality control of the adjustment, alternatives to these statistical tests, have also been proposed recently, such as the QUAD method (Quasi-Accurate Detection of outliers). The goal of this research is to study the quality control of the adjustment of geodetic observations through experiments on a GPS (Global Positioning System) network, using both conventional methods and the current state of the art. Comparative studies were therefore carried out between conventional reliability measures and generalized reliability measures for two simultaneous outliers, as well as between the data snooping procedure and statistical tests for identifying multiple outliers. It was also investigated how the variances and covariances of the observations, and the geometry/configuration of the GPS network under study, can influence the reliability measures in both the conventional and the generalized approach. Finally, the QUAD method was compared with the statistical tests for identifying outliers.
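The following sketch illustrates one conventional procedure named in the abstract, Baarda's data snooping for a single outlier, on an assumed toy levelling-type network; the generalized multi-outlier tests and the QUAD method studied in the thesis are not reproduced. The critical value k = 3.29 (two-sided normal test at alpha = 0.001), the observation values and the a priori sigma are illustrative choices.

```python
import numpy as np

def data_snooping(A, l, P, sigma0, k=3.29):
    """One round of Baarda's data snooping for the Gauss-Markov model l = A x + v.
    Returns the adjusted parameters, the standardized residuals (w-tests) and the
    index of the most suspect observation (or None); in practice the test is
    repeated after removing that observation and re-adjusting."""
    N = A.T @ P @ A
    x_hat = np.linalg.solve(N, A.T @ P @ l)
    v = A @ x_hat - l                                     # residuals
    Qvv = np.linalg.inv(P) - A @ np.linalg.solve(N, A.T)  # cofactor matrix of residuals
    w = np.abs(v) / (sigma0 * np.sqrt(np.diag(Qvv)))
    i = int(np.argmax(w))
    return x_hat, w, (i if w[i] > k else None)

# Toy levelling-style network: five observations of two unknown heights.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
l = np.array([10.02, 20.01, 9.98, 10.00, 20.25])     # last value carries a gross error
P = np.eye(5)                                        # equal weights
x_hat, w, suspect = data_snooping(A, l, P, sigma0=0.02)
print("estimates:", np.round(x_hat, 3))
print("w-tests:", np.round(w, 2), "-> suspect observation:", suspect)
```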
134

Fast and Scalable Outlier Detection with Metric Access Methods / Detecção Rápida e Escalável de Casos de Exceção com Métodos de Acesso Métrico

Bispo Junior, Altamir Gomes 25 July 2019 (has links)
It is well known that the existing theoretical models for outlier detection make assumptions that may not reflect the true nature of outliers in every real application. This dissertation describes an empirical study of unsupervised outlier detection using 8 state-of-the-art algorithms and 8 datasets drawn from a variety of real-world tasks of practical relevance, such as spotting cyberattacks, clinical pathologies and abnormalities occurring in nature. We discuss the results obtained, pointing out the strengths and weaknesses of each technique from the application specialist's point of view, a shift from the designer-based point of view that is commonly adopted. Many of the techniques had unfeasibly high runtime requirements or failed to spot what the specialists consider outliers in their own data. To tackle this issue, we propose MetricABOD: a novel ABOD-based algorithm that makes the analysis up to thousands of times faster while still being, on average, 26% more accurate than the most accurate related work. This improvement makes outlier detection practical in many real-world applications for which the existing methods show unstable accuracy or unfeasible runtime requirements. Finally, we studied two collections of text data to show that MetricABOD also works for data with no attribute dimensions, that is, purely metric data.
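MetricABOD itself is the thesis's contribution and is not reproduced here; as context, the sketch below is a naive O(n^3) implementation of the classical angle-based outlier factor (ABOF) that such methods aim to accelerate. The weighting follows the usual ABOD formulation; the toy data and function name are assumptions.

```python
import numpy as np
from itertools import combinations

def abof(points, i):
    """Angle-based outlier factor of point i: the distance-weighted variance of the
    angles spanned by all pairs of other points. Points on the rim of the data cloud
    see a narrow range of angles, so outliers get a *low* ABOF."""
    p = points[i]
    others = np.delete(points, i, axis=0)
    vals, wts = [], []
    for a, b in combinations(range(len(others)), 2):
        pa, pb = others[a] - p, others[b] - p
        na, nb = np.linalg.norm(pa), np.linalg.norm(pb)
        if na == 0.0 or nb == 0.0:
            continue
        vals.append(pa @ pb / (na**2 * nb**2))
        wts.append(1.0 / (na * nb))
    vals, wts = np.array(vals), np.array(wts)
    m = np.average(vals, weights=wts)
    return float(np.average((vals - m) ** 2, weights=wts))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), [[8.0, 8.0]]])    # one obvious outlier at index 30
scores = np.array([abof(X, i) for i in range(len(X))])
print("most outlying index:", int(np.argmin(scores)))       # lowest ABOF
```

The cubic cost of this baseline is what makes accelerated, metric-access-method variants attractive on larger datasets.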
135

多維列聯表離群細格的偵測研究 / Identification of Outlying Cells in Cross-Classified Tables

陳佩妘, Chen, Pei-Yun Unknown Date (has links)
When fitting a loglinear model to a contingency table, a significant goodness-of-fit statistic can result from the existence of a few outlying cells. Since a simpler model is easier to interpret and conveys more easily understood information about a table than a complicated one, we would like to identify those outliers so that a simpler model fits the given data set. In this research, a modification of Shih's [1995] procedure is provided, and the revised method is applicable to any type of model related to three-way tables. Data sets are simulated to compare the outliers detected by the procedures of Simonoff [1988] and the BMDP program 4F with our proposed method. Based on the simulation results, the revised procedure outperforms the other two procedures most of the time.
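As a simplified illustration of outlying-cell detection (the thesis works with three-way tables and a modified version of Shih's procedure, which is not reproduced), the sketch below computes Haberman-style adjusted residuals under the independence loglinear model for a two-way table; cells with large absolute residuals are the outlying-cell candidates. The toy table and the rough 2-3 threshold are assumptions.

```python
import numpy as np

def adjusted_residuals(table):
    """Haberman's adjusted residuals under the independence loglinear model;
    cells with |residual| well above 2-3 are candidate outlying cells."""
    n = table.sum()
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row @ col / n
    return (table - expected) / np.sqrt(expected * (1 - row / n) * (1 - col / n))

obs = np.array([[25.0, 30.0, 28.0],
                [22.0, 27.0, 60.0],    # one inflated cell
                [24.0, 29.0, 27.0]])
print(np.round(adjusted_residuals(obs), 2))
```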
136

Robust Control Charts

Cetinyurek, Aysun 01 January 2007 (has links) (PDF)
ROBUST CONTROL CHARTS. Çetinyürek, Aysun, M.Sc., Department of Statistics. Supervisor: Dr. Barış Sürücü. Co-Supervisor: Assoc. Prof. Dr. Birdal Senoglu. December 2006, 82 pages. Control charts are one of the most commonly used tools in statistical process control. A prominent feature of statistical process control is the Shewhart control chart, which depends on the assumption of normality; however, violations of the underlying normality assumption are common in practice. For this reason, control charts for symmetric distributions, covering both long- and short-tailed cases, are constructed using least squares estimators and the robust estimators modified maximum likelihood, trim, MAD and wave. To evaluate the performance of the charts under the assumed distribution and to investigate their robustness properties, the probability of plotting outside the control limits is calculated via the Monte Carlo simulation technique.
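A minimal sketch of the idea behind robust control charts, assuming a plain Shewhart-type chart on subgroup means: classical limits built from the mean and standard deviation are inflated by contaminated subgroups, while median/MAD-based limits are not. The specific robust estimators studied in the thesis (modified maximum likelihood, trim, wave) are not implemented here; the MAD variant is only a stand-in, and the simulated data are assumptions.

```python
import numpy as np

def control_limits(subgroups, robust=True):
    """3-sigma Shewhart-type limits for subgroup means. The classical version uses the
    mean and standard deviation of the subgroup means; the robust variant uses the
    median and a MAD-based scale, so contaminated subgroups inflate the limits less."""
    means = subgroups.mean(axis=1)
    if robust:
        center = np.median(means)
        scale = 1.4826 * np.median(np.abs(means - center))
    else:
        center = means.mean()
        scale = means.std(ddof=1)
    return center - 3 * scale, center + 3 * scale

rng = np.random.default_rng(2)
data = rng.normal(10, 1, (25, 5))
data[[3, 9, 15, 21]] += 3                      # four shifted (out-of-control) subgroups
means = data.mean(axis=1)
for robust in (False, True):
    lcl, ucl = control_limits(data, robust=robust)
    signals = np.where((means < lcl) | (means > ucl))[0]
    print("robust" if robust else "classical",
          "limits:", np.round([lcl, ucl], 2), "signals:", signals)
```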
137

Second-order Least Squares Estimation in Generalized Linear Mixed Models

Li, He 06 April 2011 (has links)
Maximum likelihood is a ubiquitous method for estimating generalized linear mixed models (GLMMs). However, it entails computational difficulties and relies on the normality assumption for the random effects. We propose a second-order least squares (SLS) estimator based on the first two marginal moments of the response variables. The proposed estimator is computationally feasible and requires fewer distributional assumptions than the maximum likelihood estimator. To overcome the numerical difficulty of minimizing an objective function that involves multiple integrals, a simulation-based SLS estimator is proposed. We show that the SLS estimators are consistent and asymptotically normally distributed under fairly general conditions in the GLMM framework. Missing data are almost inevitable in longitudinal studies, and problems arise if the missing-data mechanism is related to the response process. This thesis extends the proposed estimators to handle response data missing at random, either by adapting the inverse probability weighting method or by applying the multiple imputation approach. In practice, some covariates are not directly observed but are measured with error. It is well known that simply substituting a proxy variable for the unobserved covariate in the model generally leads to biased and inconsistent estimates. We propose the instrumental variable method for consistent estimation of GLMMs with covariate measurement error. The proposed approach needs no parametric assumption on the distribution of the unknown covariates, which makes it less restrictive than methods that either rely on a parametric distribution of the covariates or estimate that distribution from extra information. In the presence of outliers in the data, there is a concern that the SLS estimators may be vulnerable because they use second-order moments. We investigate the robustness of the SLS estimators using their influence functions and show that the proposed estimators have a bounded influence function and a redescending property, so they are robust to outliers. The finite-sample performance and properties of the SLS estimators are studied and compared with other popular estimators in the literature through simulation studies and real-world data examples.
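The thesis develops SLS for generalized linear mixed models; the sketch below only shows the core second-order least squares idea on an assumed simple linear model, matching the first two conditional moments with an identity weight matrix and a generic optimizer. The parameter names, starting values and simulated data are illustrative assumptions, not the thesis's estimator.

```python
import numpy as np
from scipy.optimize import minimize

def sls_objective(theta, x, y):
    """Second-order least squares criterion for y = b0 + b1*x + e with Var(e) = s2:
    residuals of the first two conditional moments, E[y|x] and E[y^2|x], are matched
    simultaneously (identity weighting, for simplicity)."""
    b0, b1, s2 = theta
    mu = b0 + b1 * x
    r1 = y - mu                       # first-moment residual
    r2 = y**2 - (mu**2 + s2)          # second-moment residual
    return np.sum(r1**2 + r2**2)

rng = np.random.default_rng(3)
x = rng.uniform(0, 5, 200)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, 200)
fit = minimize(sls_objective, x0=np.array([0.0, 1.0, 1.0]), args=(x, y),
               method="Nelder-Mead")
print("b0, b1, sigma^2:", np.round(fit.x, 2))   # expect roughly 1.0, 2.0, 0.25
```

Note that the variance parameter is estimated jointly with the regression coefficients, which is precisely what exposes the criterion to second-order moments and motivates the influence-function analysis mentioned above.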
138

Extraction De Motifs Séquentiels Dans Des Données Multidimensionelles / Sequential Pattern Mining in Multidimensional Data

Plantevit, Marc 15 July 2008 (has links) (PDF)
Since its introduction, sequential pattern mining has become a major data mining technique with numerous applications (consumer behaviour analysis, bioinformatics, security, music, etc.). Sequential patterns reveal correlations between events according to the chronology of their occurrence, and many algorithms exist to extract them. However, these proposals take only a single analysis dimension into account (e.g. the product in market-basket applications), whereas most real data are multidimensional by nature. In this manuscript, we define multidimensional sequential patterns in order to take into account the specific features of multidimensional databases (multiple dimensions, hierarchies, aggregated values), and we define algorithms that extract multidimensional sequential patterns while accounting for these features. Experiments on synthetic and real data are reported and show the interest of our proposals. We also address the extraction of atypical temporal behaviours in multidimensional data. We show that an atypical behaviour can have several interpretations (fact or knowledge) and, for each interpretation, we propose a method for extracting such behaviours. These methods are also validated by experiments on real data.
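A minimal single-dimension sketch of the support counting that sequential pattern mining is built on (the multidimensional extension with hierarchies and aggregated values, which is the thesis's contribution, is not reproduced): a pattern is an ordered list of itemsets, and its support is the fraction of data sequences containing it in order. The toy purchase sequences are assumptions.

```python
def contains(sequence, pattern):
    """True if `pattern` (an ordered list of itemsets) occurs in `sequence`:
    each itemset of the pattern must be included in a later transaction."""
    i = 0
    for transaction in sequence:
        if i < len(pattern) and pattern[i] <= transaction:   # subset test
            i += 1
    return i == len(pattern)

def support(db, pattern):
    """Fraction of data sequences that contain the pattern."""
    return sum(contains(seq, pattern) for seq in db) / len(db)

# Toy customer-purchase sequences (one analysis dimension: the product).
db = [
    [{"bread"}, {"milk", "beer"}, {"diapers"}],
    [{"bread", "milk"}, {"diapers"}],
    [{"milk"}, {"bread"}, {"diapers", "beer"}],
]
print(support(db, [{"bread"}, {"diapers"}]))   # 1.0: bread then diapers in every sequence
print(support(db, [{"milk"}, {"bread"}]))      # 1/3: only the third sequence
```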
139

Two statistical problems related to credit scoring / Tanja de la Rey.

De la Rey, Tanja January 2007 (has links)
This thesis focuses on two statistical problems related to credit scoring. In credit scoring of individuals, two classes are distinguished, namely low and high risk individuals (the so-called "good" and "bad" risk classes). Firstly, we suggest a measure which may be used to study the nature of a classifier for distinguishing between the two risk classes. Secondly, we derive a new method, DOUW (detecting outliers using weights), which may be used to fit logistic regression models robustly and to detect outliers. In the first problem, the focus is on a measure which may be used to study the nature of a classifier. This measure transforms a random variable so that it has the same distribution as another random variable. Assuming a linear form of this measure, three methods for estimating the parameters (slope and intercept) and for constructing confidence bands are developed and compared by means of a Monte Carlo study. The application of these estimators is illustrated on a number of datasets. We also construct statistical hypothesis tests for this linearity assumption. In the second problem, the focus is on providing a robust logistic regression fit and the identification of outliers. It is well known that maximum likelihood estimators of logistic regression parameters are adversely affected by outliers. We propose a robust approach, called DOUW, that also serves as an outlier detection procedure. The approach is based on associating high and low weights with the observations as a result of the likelihood maximization; it turns out that the outliers are those observations to which low weights are assigned. The procedure depends on two tuning constants, and a simulation study is presented to show the effects of these constants on the performance of the proposed methodology. The results are presented for four benchmark datasets as well as a large new dataset from the application area of retail marketing campaign analysis. In the last chapter we apply the techniques developed in this thesis to a practical credit scoring dataset. We show that the DOUW method improves the classifier performance and that the measure developed to study the nature of a classifier is useful in a credit scoring context and may be used for assessing whether the distribution of the good and the bad risk individuals is from the same translation-scale family. / Thesis (Ph.D. (Risk Analysis))--North-West University, Potchefstroom Campus, 2008.
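DOUW itself is not reproduced in this listing; as a rough illustration of the outlier issue in logistic regression that it addresses, the sketch below fits an ordinary (near-unpenalized) logistic regression with scikit-learn and ranks observations by their deviance residuals, a standard diagnostic in which mislabelled or atypical cases tend to surface. The simulated data, flipped labels and the helper name deviance_residuals are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def deviance_residuals(model, X, y):
    """Signed deviance residuals of a fitted binary logistic regression; observations
    with unusually large absolute values are candidate outliers (e.g. mislabelled cases)."""
    p = np.clip(model.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)
    dev = -2 * (y * np.log(p) + (1 - y) * np.log(1 - p))
    return np.sign(y - p) * np.sqrt(dev)

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = (rng.uniform(size=500) < 1 / (1 + np.exp(-X @ true_beta))).astype(int)
y[:5] = 1 - y[:5]                              # flip a few labels to create bad records
clf = LogisticRegression(C=1e6).fit(X, y)      # near-unpenalized maximum likelihood fit
r = deviance_residuals(clf, X, y)
print("largest |residual| indices:", np.argsort(-np.abs(r))[:5])
```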
