181

A Content Based Movie Recommendation System Empowered By Collaborative Missing Data Prediction

Karaman, Hilal 01 July 2010
The evolution of the Internet has brought us into a world that presents a huge number of information items such as music, movies, books, and web pages of varying quality. Faced with this huge universe of items, people become confused, and the question "Which one should I choose?" arises in their minds. Recommendation systems address this problem by filtering a specific type of information with an information filtering technique that attempts to present the items most likely to interest the user. A variety of information filtering techniques have been proposed for performing recommendations; content-based and collaborative techniques are the most commonly used approaches in recommendation systems. This thesis introduces ReMovender, a content-based movie recommendation system empowered by collaborative missing data prediction. The distinctive point of this study lies in the methodology used to correlate the users in the system with one another and in the use of the content information of movies. ReMovender lets users rate movies on a scale from one to five. Using these ratings, it finds similarities among the users in a collaborative manner to predict the missing rating data. For the content-based part, a set of movie features is used to correlate the movies and produce recommendations for the users.
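The collaborative part of this approach can be illustrated with a minimal sketch (not ReMovender's actual algorithm; the function name and the similarity choice are illustrative assumptions): missing ratings are predicted from a similarity-weighted average of other users' mean-centered ratings.

```python
import numpy as np

def predict_missing_ratings(R):
    """Fill missing entries of a user-item rating matrix (NaN = missing)
    with similarity-weighted averages of other users' ratings."""
    mask = ~np.isnan(R)
    means = np.array([R[u, mask[u]].mean() for u in range(R.shape[0])])
    centered = np.where(mask, R - means[:, None], 0.0)
    # Pearson-style similarity between users on mean-centered ratings
    norms = np.linalg.norm(centered, axis=1)
    sim = centered @ centered.T / np.outer(norms, norms).clip(min=1e-12)
    np.fill_diagonal(sim, 0.0)
    filled = R.copy()
    for u, i in zip(*np.where(~mask)):
        raters = mask[:, i]                     # users who rated item i
        w = sim[u, raters]
        if np.abs(w).sum() > 0:
            filled[u, i] = means[u] + w @ centered[raters, i] / np.abs(w).sum()
        else:
            filled[u, i] = means[u]             # fall back to user mean
    return np.clip(filled, 1, 5)                # keep the one-to-five scale
```

A user whose known ratings agree with another user's inherits that user's opinion of the unrated movie.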
182

Investigation of probabilistic principal component analysis compared to proper orthogonal decomposition methods for basis extraction and missing data estimation

Lee, Kyunghoon 21 May 2010
The identification of flow characteristics and the reduction of high-dimensional simulation data have capitalized on an orthogonal basis achieved by proper orthogonal decomposition (POD), also known as principal component analysis (PCA) or the Karhunen-Loeve transform (KLT). In the realm of aerospace engineering, an orthogonal basis is versatile for diverse applications, especially those associated with reduced-order modeling (ROM), such as a low-dimensional turbulence model, an unsteady aerodynamic model for aeroelasticity and flow control, and a steady aerodynamic model for airfoil shape design. When a given data set lacks part of its data, POD must adopt a least-squares formulation, leading to gappy POD, which uses a gappy norm: a variant of the L2 norm that deals only with the known data. Although gappy POD was originally devised to restore marred images, its application has spread to aerospace engineering because various engineering problems can be reformulated as missing data estimation problems that exploit gappy POD. Similar to POD, gappy POD has a broad range of applications such as optimal flow sensor placement, experimental and numerical flow data assimilation, and impaired particle image velocimetry (PIV) data restoration. Apart from POD and gappy POD, both of which are deterministic formulations, probabilistic principal component analysis (PPCA), a probabilistic generalization of PCA, has been used in the pattern recognition field for speech recognition and in oceanography for empirical orthogonal functions in the presence of missing data. In formulation, PPCA presumes a linear latent variable model relating an observed variable to a latent variable, which is inferred from the observed variable through a linear mapping called the factor loading.
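The gappy-norm idea can be sketched as follows (a minimal illustration, not the thesis code): with a POD basis Phi already in hand, the expansion coefficients of an incomplete snapshot are fitted by least squares over the known entries only, and the gaps are filled from the reconstruction.

```python
import numpy as np

def gappy_pod_restore(x, mask, Phi):
    """Restore missing entries of snapshot x (mask=True where known)
    by a least-squares fit of the POD basis Phi to the known entries."""
    # Solve min_a || x - Phi a || over known entries only (the gappy norm)
    a, *_ = np.linalg.lstsq(Phi[mask], x[mask], rcond=None)
    x_rec = x.copy()
    x_rec[~mask] = Phi[~mask] @ a   # fill the gaps from the reconstruction
    return x_rec
```

When the snapshot truly lies in the span of Phi and enough entries are known, the restoration is exact.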
To evaluate the maximum likelihood estimates (MLEs) of PPCA parameters such as the factor loading, PPCA can invoke an expectation-maximization (EM) algorithm, yielding an EM algorithm for PPCA (EM-PCA). By virtue of the EM algorithm, the EM-PCA is capable of not only extracting a basis but also restoring missing data through its iterations, whether the given data are intact or not. Therefore, the EM-PCA can potentially substitute for both POD and gappy POD, provided its accuracy and efficiency are comparable to theirs. To examine the benefits of the EM-PCA for aerospace engineering applications, this thesis qualitatively and quantitatively scrutinizes the EM-PCA alongside both POD and gappy POD using high-dimensional simulation data. The qualitative investigation shows that the theoretical relationship between POD and PPCA is transparent: the factor-loading MLE of PPCA, evaluated by the EM-PCA, corresponds to an orthogonal basis obtained by POD. By contrast, the analytical connection between gappy POD and the EM-PCA is nebulous because the two approximate missing data differently, reflecting antithetical formulation perspectives: gappy POD solves a least-squares problem, whereas the EM-PCA relies on the expectation of the observation probability model. To juxtapose gappy POD and the EM-PCA, this research proposes a unifying least-squares perspective that embraces the two disparate algorithms within a generalized least-squares framework. The unifying perspective reveals that both methods address similar least-squares problems, but their formulations contain dissimilar bases and norms. Furthermore, this research delves into the ramifications of the different bases and norms, which ultimately characterize the traits of both methods.
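A minimal sketch of the EM iteration for basis extraction (in the simplified noise-free limit of PPCA; not the thesis implementation) alternates between estimating latent coordinates and updating the basis, converging to the principal subspace:

```python
import numpy as np

def em_pca(X, q, iters=200):
    """EM algorithm for PCA (noise-free limit of PPCA): return an
    orthonormal basis Q spanning the top-q principal subspace of X (d x n)."""
    rng = np.random.default_rng(0)
    X = X - X.mean(axis=1, keepdims=True)        # center the data
    W = rng.standard_normal((X.shape[0], q))     # random initial basis
    for _ in range(iters):
        Z = np.linalg.solve(W.T @ W, W.T @ X)    # E-step: latent coordinates
        W = X @ Z.T @ np.linalg.inv(Z @ Z.T)     # M-step: update the basis
    Q, _ = np.linalg.qr(W)                       # orthonormalize the span
    return Q
```

The span of the returned basis matches the span of the leading singular vectors of the centered data, which is how the EM-PCA relates to a POD basis.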
To this end, two hybrid algorithms of gappy POD and the EM-PCA are devised and compared to the original algorithms for a qualitative illustration of the different basis and norm effects. Ultimately, the norm, which reflects a curve-fitting method, is found to affect estimation error reduction more significantly than the basis for two example test data sets: one missing data at only a single snapshot and the other missing data across all snapshots. From a numerical performance standpoint, the EM-PCA is computationally less efficient than POD for intact data since it suffers from the slow convergence inherited from the EM algorithm. For incomplete data, this thesis quantitatively shows that the number of data-missing snapshots predetermines whether the EM-PCA or gappy POD outperforms the other, because of the computational cost of the coefficient evaluation that results from the norm selection. For instance, gappy POD demands computational effort in proportion to the number of data-missing snapshots as a consequence of the gappy norm, whereas the computational cost of the EM-PCA is invariant to the number of data-missing snapshots thanks to the L2 norm. In general, the higher the number of data-missing snapshots, the wider the gap between the computational costs of gappy POD and the EM-PCA. Based on the numerical experiments reported in this thesis, the following criterion is recommended for choosing between gappy POD and the EM-PCA on computational-efficiency grounds: gappy POD for an incomplete data set containing a few data-missing snapshots, and the EM-PCA for an incomplete data set involving many data-missing snapshots. Finally, the EM-PCA is applied to two aerospace applications in comparison to gappy POD as a proof of concept: one with an emphasis on basis extraction and the other with a focus on missing data reconstruction for an incomplete data set with scattered missing data.
The first application exploits the EM-PCA to efficiently construct reduced-order models of engine deck responses obtained by the numerical propulsion system simulation (NPSS), some of whose results are absent due to failed analyses caused by numerical instability. Model-prediction tests validate that engine performance metrics estimated by the reduced-order NPSS model exhibit good agreement with those directly obtained by NPSS. Similarly, the second application illustrates that the EM-PCA is significantly more cost-effective than gappy POD at repairing spurious PIV measurements obtained from acoustically-excited, bluff-body jet flow experiments: the EM-PCA reduces computational cost by a factor of 8 to 19 compared to gappy POD while generating the same restoration results. All in all, through comprehensive theoretical and numerical investigation, this research establishes that the EM-PCA is an efficient alternative to gappy POD for an incomplete data set with missing data spread across the entire data set.
183

Essays on Innovation, Patents, and Econometrics

Entezarkheir, Mahdiyeh January 2010
This thesis investigates the impact of fragmentation in the ownership of complementary patents, or patent thickets, on firms' market value. The question is motivated by the increase in patent ownership fragmentation following the pro-patent shifts in the US since 1982. The first chapter uses panel data on patenting US manufacturing firms from 1979 to 1996 and estimates the impact of patent thickets on firms' market value. I find that patent thickets lower firms' market value, and that firms with a large patent portfolio experience a smaller negative effect from their thickets. Moreover, no systematic difference exists in the impact of patent thickets on firms' market value over time. The second chapter extends this analysis to account for the indirect impacts of patent thickets on firms' market value, which arise through the effects of patent thickets on firms' R&D and patenting activities. Using panel data on US manufacturing firms from 1979 to 1996, I estimate the impact of patent thickets on market value, R&D, and patenting, as well as the impacts of R&D and patenting on market value. Employing these estimates, I determine the direct, indirect, and total impacts of patent thickets on market value. I find that patent thickets decrease firms' market value, holding firms' R&D and patenting activities constant. I find no evidence of a change in R&D due to patent thickets. However, there is evidence of defensive patenting (an increase in patenting attributed to thickets), which helps to reduce the direct negative impact of patent thickets on market value. The data sets used in Chapters 1 and 2 have a number of missing observations on regressors. The most commonly used methods for managing missing observations are the listwise deletion (complete case) method and the indicator method. Studies on the statistical properties of these methods suggest a smaller bias for the listwise deletion method.
Employing Monte Carlo simulations, Chapter 3 examines the properties of these methods, and finds that in some cases the listwise deletion estimates have larger biases than indicator estimates. This finding suggests that interpreting estimates arrived at with either approach requires caution.
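The two methods compared in Chapter 3 can be sketched on simulated data (an illustrative setup, not the chapter's actual design): with MCAR missingness in a covariate that is correlated with another regressor, listwise deletion stays nearly unbiased, while the indicator method (fill the missing covariate with zero and add a missingness dummy) biases the coefficient of the fully observed regressor.

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho = 20000, 0.7
x1 = rng.standard_normal(n)
x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal(n)
y = 1 + x1 + x2 + rng.standard_normal(n)
miss = rng.random(n) < 0.3                 # MCAR missingness in x2

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Listwise deletion: keep complete cases only
keep = ~miss
b_lw = ols(np.column_stack([np.ones(keep.sum()), x1[keep], x2[keep]]), y[keep])

# Indicator method: fill missing x2 with 0 and add a missingness dummy
x2_f = np.where(miss, 0.0, x2)
b_ind = ols(np.column_stack([np.ones(n), x1, x2_f, miss.astype(float)]), y)

print(abs(b_lw[1] - 1), abs(b_ind[1] - 1))  # bias on the x1 coefficient
```

The bias appears here because the omitted part of x2 is absorbed by x1 in the rows where x2 is filled in, illustrating why neither method dominates in general.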
184

Repairing event logs using stochastic process models

Rogge-Solti, Andreas, Mans, Ronny S., van der Aalst, Wil M. P., Weske, Mathias January 2013
Companies strive to improve their business processes in order to remain competitive. Process mining aims to infer meaningful insights from process-related data and has attracted the attention of practitioners, tool vendors, and researchers in recent years. Traditionally, event logs are assumed to describe the as-is situation. But this is not necessarily the case in environments where logging may be compromised due to manual logging. For example, hospital staff may need to manually enter information regarding a patient's treatment. As a result, events or timestamps may be missing or incorrect. In this paper, we make use of process knowledge captured in process models and provide a method to repair missing events in the logs, thereby facilitating the analysis of incomplete logs. We realize the repair by combining stochastic Petri nets, alignments, and Bayesian networks. We evaluate the results using both synthetic data and real event data from a Dutch hospital.
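The idea of exploiting model knowledge to repair a log can be reduced to a toy sketch (far simpler than the stochastic-Petri-net method of the paper; activity names and probabilities are invented for illustration): given the transition probabilities of a process model, the most likely single missing activity between two observed events maximizes the product of the incoming and outgoing transition probabilities.

```python
# Transition probabilities of a toy treatment process (from, to) -> prob
P = {
    ("register", "triage"): 0.9, ("register", "treat"): 0.1,
    ("triage", "treat"): 0.8,    ("triage", "discharge"): 0.2,
    ("treat", "discharge"): 1.0,
}
activities = {"register", "triage", "treat", "discharge"}

def repair_gap(prev, nxt):
    """Insert the single most likely missing activity between two
    observed events, scored by the product of transition probabilities."""
    return max(activities,
               key=lambda a: P.get((prev, a), 0.0) * P.get((a, nxt), 0.0))

print(repair_gap("register", "discharge"))
```

The real method additionally uses timestamps (via the stochastic model and Bayesian networks) and alignments to decide where repairs belong in longer traces.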
185

Design, maintenance and methodology for analysing longitudinal social surveys, including applications

Domrow, Nathan Craig January 2007
This thesis describes the design, maintenance and statistical analysis involved in undertaking a longitudinal survey. A longitudinal survey (or study) obtains observations or responses from individuals at several points in time over a defined period. This enables the direct study of changes in an individual's response over time. In particular, it distinguishes an individual's change over time from the baseline differences among individuals within the initial panel (or cohort), which is not possible in a cross-sectional study. As such, longitudinal surveys give correlated responses within individuals, and therefore require different considerations for sample design, selection, and analysis than standard cross-sectional studies. This thesis examines the methodology for analysing social surveys, most of which consist largely of categorical variables. It outlines the process of sample design and selection, interviewing, and analysis for a longitudinal study, with emphasis on the categorical response data typical of a survey. Included are examples relating to the Goodna Longitudinal Survey and the Longitudinal Survey of Immigrants to Australia (LSIA); the analysis in this thesis also uses data collected from these surveys. The Goodna Longitudinal Survey was conducted by the Queensland Office of Economic and Statistical Research (a portfolio office within Queensland Treasury) and began in 2002. It ran for two years, over which two waves of responses were collected.
186

Management of missing data in boosting cascades: application to face detection

Bouges, Pierre 06 December 2012
This thesis was carried out in the ISPR group (ImageS, Perception systems and Robotics) of the Institut Pascal, within the ComSee team (Computers that See). The research is part of the Bio Rafale project, initiated by the Clermont-Ferrand company Vesalis in 2008 and funded by OSEO. Its goal is to improve security in stadiums through the identification of banned fans. The applications of this work concern face detection, which is the first step in the project's processing chain. The most efficient detectors use a cascade of boosted classifiers. The term cascade refers to a sequential succession of several classifiers, while boosting refers to a set of machine learning algorithms that linearly combine several weak classifiers. The detector selected for this thesis also uses a cascade of boosted classifiers. Training such a cascade requires a training database and an image descriptor; here, covariance matrices are used as the image feature. The operating conditions of an object detector are fixed by its training stage. One of our contributions is to adapt a detector to operating conditions not anticipated during training. The proposed adaptations lead to a classification problem with missing data. A probabilistic formulation of the cascade structure is then used to incorporate the uncertainty introduced by the missing data. This formulation involves the estimation of a posteriori probabilities and the computation of new rejection thresholds at each level of the modified cascade. For these two problems, several solutions are proposed and extensive tests are carried out to determine the best configuration. Finally, the approach is applied to the detection of turned or occluded faces using only an upright frontal face detector; detecting turned faces requires a 3D geometric model to adjust the positions of the sub-windows associated with the weak classifiers.
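The probabilistic handling of missing features in a cascade can be sketched minimally (not the detector of the thesis; weights, thresholds, and feature means here are illustrative assumptions): each level computes a weighted score, a missing feature contributes its expected value instead, and the sample must pass every level's threshold.

```python
import numpy as np

def cascade_decide(x, stages, feat_means):
    """Run a boosted cascade on a sample with possibly missing features
    (NaN). A missing feature's contribution is replaced by its expected
    value before each stage threshold is applied."""
    for weights, threshold in stages:
        score = 0.0
        for i, w in enumerate(weights):
            v = feat_means[i] if np.isnan(x[i]) else x[i]
            score += w * v              # expectation over missing features
        if score < threshold:
            return False                # rejected at this cascade level
    return True                         # accepted by every level
```

In the thesis, the replacement is done probabilistically (posterior estimation) and the per-level thresholds are recomputed for the modified cascade; this sketch keeps only the marginalization idea.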
187

Multiple Imputation for Two-Level Hierarchical Models with Categorical Variables and Missing at Random Data

January 2016
Accurate data analysis and interpretation of results may be influenced by many potential factors. The factors of interest in the current work are the chosen analysis model(s), the presence of missing data, and the type(s) of data collected. If the analysis models used a) do not accurately capture the structure of relationships in the data, such as clustered/hierarchical data, b) do not allow or control for missing values present in the data, or c) do not accurately compensate for different data types, such as categorical data, then the assumptions associated with the model have not been met and the results of the analysis may be inaccurate. In the presence of clustered/nested data, hierarchical linear modeling or multilevel modeling (MLM; Raudenbush & Bryk, 2002) can predict outcomes for each level of analysis and across multiple levels (accounting for relationships between levels), providing a significant advantage over single-level analyses. When multilevel data contain missingness, multilevel multiple imputation (MLMI) techniques may be used to model both the missingness and the clustered nature of the data; with categorical multilevel data with missingness, categorical MLMI must be used. Two such routines for MLMI with continuous and categorical data were explored with missing at random (MAR) data: a formal Bayesian imputation and analysis routine in JAGS (R/JAGS) and a common MLM procedure of imputation via Bayesian estimation in BLImP with frequentist analysis of the multilevel model in Mplus (BLImP/Mplus). Manipulated variables included intraclass correlations, the number of clusters, and the rate of missingness. Results showed that with continuous data, R/JAGS returned more accurate parameter estimates than BLImP/Mplus for almost all parameters of interest across levels of the manipulated variables.
Both R/JAGS and BLImP/Mplus encountered convergence issues and returned inaccurate parameter estimates when imputing and analyzing dichotomous data. Follow-up studies showed that JAGS and BLImP returned similar imputed datasets, but the choice of analysis software for MLM impacted the recovery of accurate parameter estimates. Implications of these findings and recommendations for further research are discussed. / Doctoral Dissertation, Educational Psychology, 2016
188

Statistical Methods for Regression Models with Missing Data

Nekvinda, Matěj January 2018
The aim of this thesis is to describe and further develop estimation strategies for data obtained by stratified sampling. Estimation of the mean and of a linear regression model are discussed. The possible inclusion of auxiliary variables in the estimation is examined. The auxiliary variables can be transformed rather than used in their original form; a transformation minimizing the asymptotic variance of the resulting estimator is provided. The estimator using an approach from this thesis is compared to the doubly robust estimator and shown to be asymptotically equivalent.
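The benefit of accounting for a stratified design when estimating a mean can be shown in a small sketch (an illustrative setup, not the estimators developed in the thesis): with very unequal sampling fractions, the naive pooled mean is badly biased, while the design-weighted stratified mean is not.

```python
import numpy as np

rng = np.random.default_rng(7)
N = np.array([8000, 2000])          # stratum population sizes
mu = [0.0, 5.0]                     # stratum means
n = [100, 400]                      # stratum 2 is heavily oversampled
samples = [rng.normal(mu[h], 1.0, size=n[h]) for h in range(2)]

naive = np.concatenate(samples).mean()   # ignores the sampling design
strat = float(np.dot(N / N.sum(), [s.mean() for s in samples]))

print(naive, strat)  # true population mean: 0.8*0 + 0.2*5 = 1.0
```

Weighting each stratum mean by its population share removes the bias introduced by the unequal sampling fractions; the thesis goes further by bringing auxiliary variables (suitably transformed) into such estimators.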
189

Survival models with cure fraction and missing covariates

Fonseca, Renata Santana 06 March 2009
In this work we study the survival cure rate model proposed by Yakovlev (1993), considered in a competing risks setting. Covariates are introduced for modeling the cure rate, and some covariates are allowed to have missing values. We consider only the cases in which the missing covariates are categorical, and implement the EM algorithm via the method of weights for maximum likelihood estimation. We present a Monte Carlo simulation experiment to compare the properties of the estimators based on this method with those of the estimators under the complete-case scenario. In this experiment we also evaluate the impact on the parameter estimates of increasing the proportion of immune and censored individuals among the non-immune ones. We illustrate the proposed methodology with a real data set on the time until graduation for the undergraduate Statistics course of the Universidade Federal do Rio Grande do Norte.
190

Optimization of the data analysis of the MICROSCOPE space mission for the test of the Equivalence Principle and other applications

Baghi, Quentin 12 October 2016
The Equivalence Principle (EP) is a cornerstone of General Relativity, and it is called into question by attempts to build more comprehensive theories in fundamental physics, such as string theories. The MICROSCOPE space mission aims at testing this principle through the universality of free fall, with a target precision of 10^-15, two orders of magnitude better than current ground experiments. The satellite carries two electrostatic accelerometers on board, each one including two test masses. The masses of the test accelerometer are made of different materials, whereas the masses of the reference accelerometer have the same composition. The objective is to monitor the free fall of the test masses in the gravitational field of the Earth by measuring their differential acceleration with an expected precision of 10^-12 m s^-2 Hz^-1/2 in the bandwidth of interest. An EP violation would result in a characteristic periodic difference between the two accelerations. However, various perturbations are also measured because of the high sensitivity of the instrument. Some of them are well defined, e.g. gravitational and inertial gradient disturbances, but others are unmodeled, such as random noise and acceleration peaks due to the satellite environment, which can lead to saturations in the measurement or to data gaps. This experimental context requires the development of suitable tools for the data analysis, applicable in the general framework of linear regression analysis of time series. We first study the statistical detection and estimation of unknown harmonic disturbances in a least-squares framework, in the presence of a colored noise of unknown PSD. We show that with this technique the projection of the harmonic disturbances onto the EP violation signal can be kept at an acceptable level. Secondly, we analyze the impact of data unavailability on the performance of the EP test. We show that under the worst-case before-flight hypothesis on the frequency of data interruptions (almost 300 gaps of 0.5 second per orbit), the uncertainty of ordinary least squares is increased by a factor of 35 to 60. To counterbalance this effect, a linear regression method based on an autoregressive estimation of the noise is developed, which allows a proper decorrelation of the available observations without direct computation and inversion of the covariance matrix. The variance of the resulting estimator is close to the optimal value, allowing us to perform the EP test at the expected level even in the case of very frequent data interruptions. In addition, we implement a method to characterize the noise PSD more accurately when data are missing, with no prior model on the noise. The approach is based on a modified expectation-maximization (EM) algorithm with a smoothness assumption on the PSD, using statistical imputation of the missing data. We obtain a PSD estimate with an error of less than 10^-12 m s^-2 Hz^-1/2. Finally, we widen the applications of the data analysis by studying the feasibility of measuring the Earth's gravitational gradient with MICROSCOPE data. We assess the ability of this setup to decipher the large-scale geometry of the geopotential. By simulating the signals obtained from different models of the Earth's deep mantle, and comparing them to the expected noise level, we show that their features can be distinguished.
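The decorrelation idea can be sketched for AR(1) noise (a simplified stand-in for the autoregressive method of the thesis, with the AR coefficient assumed known rather than estimated): differencing y_k - phi*y_{k-1} over consecutive available samples whitens the noise, so ordinary least squares on the transformed system is nearly efficient without ever building or inverting a covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
n, phi = 5000, 0.9
t = np.arange(n)
X = np.column_stack([np.ones(n), np.sin(2 * np.pi * t / 100)])
beta_true = np.array([0.5, 2.0])
e = np.zeros(n)                          # AR(1) colored noise
for k in range(1, n):
    e[k] = phi * e[k - 1] + 0.1 * rng.standard_normal()
y = X @ beta_true + e
avail = rng.random(n) > 0.1              # 10% of samples lost in gaps

# Whiten only across consecutive available pairs: the innovations
# y_k - phi*y_{k-1} are white, so no covariance matrix is needed.
pair = avail[1:] & avail[:-1]
Xw = (X[1:] - phi * X[:-1])[pair]
yw = (y[1:] - phi * y[:-1])[pair]
beta_hat = np.linalg.lstsq(Xw, yw, rcond=None)[0]
print(beta_hat)
```

The mission context adds the estimation of the autoregressive model itself (of higher order) from the incomplete data, but the gap-tolerant whitening step is the same in spirit.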
