151 |
Essays on Innovation, Patents, and Econometrics. Entezarkheir, Mahdiyeh. January 2010.
This thesis investigates the impact of fragmentation in the ownership of complementary patents, or patent thickets, on firms' market value. This question is motivated by the increase in patent ownership fragmentation following the pro-patent shifts in the US since 1982. The first chapter uses panel data on patenting US manufacturing firms from 1979 to 1996 and estimates the impact of patent thickets on firms' market value. I find that patent thickets lower firms' market value, and that firms with large patent portfolios experience a smaller negative effect from their thickets. Moreover, no systematic difference exists in the impact of patent thickets on firms' market value over time. The second chapter extends this analysis to account for the indirect impacts of patent thickets on firms' market value. These indirect effects arise through the effects of patent thickets on firms' R&D and patenting activities. Using panel data on US manufacturing firms from 1979 to 1996, I estimate the impact of patent thickets on market value, R&D, and patenting, as well as the impacts of R&D and patenting on market value. Employing these estimates, I determine the direct, indirect, and total impacts of patent thickets on market value. I find that patent thickets decrease firms' market value when I hold firms' R&D and patenting activities constant. I find no evidence of a change in R&D due to patent thickets. However, there is evidence of defensive patenting (an increase in patenting attributed to thickets), which helps to reduce the direct negative impact of patent thickets on market value.
The data sets used in Chapters 1 and 2 have a number of missing observations on regressors. The commonly used methods to manage missing observations are the listwise deletion (complete case) and the indicator methods. Studies on the statistical properties of these methods suggest a smaller bias using the listwise deletion method. Employing Monte Carlo simulations, Chapter 3 examines the properties of these methods, and finds that in some cases the listwise deletion estimates have larger biases than indicator estimates. This finding suggests that interpreting estimates arrived at with either approach requires caution.
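The listwise-deletion versus indicator-method comparison described above can be sketched with a small Monte Carlo simulation. The data-generating process, MAR mechanism, and sample sizes below are illustrative assumptions, not the thesis's actual simulation design:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n=2000):
    # Two correlated regressors; x2 is missing at random, more often when x1 is large
    x1 = rng.normal(size=n)
    x2 = 0.5 * x1 + rng.normal(size=n)
    y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)
    miss = rng.random(n) < 1.0 / (1.0 + np.exp(-x1))
    return x1, x2, y, miss

def ols_slope_x1(X, y):
    # OLS via least squares; return the coefficient on x1 (second column)
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

def one_replication():
    x1, x2, y, miss = simulate()
    keep = ~miss
    # Listwise deletion: drop rows with a missing regressor
    b_listwise = ols_slope_x1(
        np.column_stack([np.ones(keep.sum()), x1[keep], x2[keep]]), y[keep])
    # Indicator method: zero-fill the missing regressor, add a missingness dummy
    x2_filled = np.where(miss, 0.0, x2)
    b_indicator = ols_slope_x1(
        np.column_stack([np.ones(len(y)), x1, x2_filled, miss.astype(float)]), y)
    return b_listwise, b_indicator

estimates = np.array([one_replication() for _ in range(200)])
bias_listwise = estimates[:, 0].mean() - 2.0   # true coefficient on x1 is 2.0
bias_indicator = estimates[:, 1].mean() - 2.0
print(f"listwise bias: {bias_listwise:+.3f}, indicator bias: {bias_indicator:+.3f}")
```

In this particular setup, where missingness depends only on an observed regressor, listwise deletion stays close to unbiased while the indicator method does not; as the chapter notes, the ranking can differ in other designs.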
|
152 |
Multisensor Segmentation-based Noise Suppression for Intelligibility Improvement in MELP Coders. Demiroglu, Cenk. 18 January 2006.
This thesis investigates the use of an auxiliary sensor, the GEMS device, for improving the quality of noisy speech and for designing noise preprocessors for MELP speech coders. Use of auxiliary sensors for noise-robust
ASR applications is also investigated to develop speech enhancement algorithms that use acoustic-phonetic
properties of the speech signal.
A Bayesian risk minimization framework is developed that can incorporate the acoustic-phonetic properties
of speech sounds and knowledge of human auditory perception into the speech enhancement framework. Two noise suppression
systems are presented using the ideas developed in the mathematical framework. In the first system, an aharmonic
comb filter is proposed for voiced speech where low-energy frequencies are severely suppressed while
high-energy frequencies are suppressed mildly. The proposed
system outperformed an MMSE estimator in subjective listening tests and in DRT intelligibility tests for MELP-coded noisy speech.
The effect of aharmonic
comb filtering on the linear predictive coding (LPC) parameters is analyzed using a missing data approach.
Suppressing the low-energy frequencies without any modification of the high-energy frequencies is shown to
improve the LPC spectrum using the Itakura-Saito distance measure.
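The energy-dependent suppression idea above (severe attenuation of low-energy frequencies, mild attenuation of high-energy ones) can be sketched per frame as follows. The gain values and the relative-threshold rule are illustrative assumptions, not the thesis's actual filter design:

```python
import numpy as np

def energy_dependent_suppression(frame, mild_db=-3.0, severe_db=-25.0, rel_thresh=0.1):
    """Attenuate low-energy frequency bins severely and high-energy bins mildly."""
    spectrum = np.fft.rfft(frame)
    power = np.abs(spectrum) ** 2
    threshold = rel_thresh * power.max()
    gain_db = np.where(power >= threshold, mild_db, severe_db)
    gain = 10.0 ** (gain_db / 20.0)
    return np.fft.irfft(spectrum * gain, n=len(frame))
```

For a voiced frame, the harmonic peaks carry most of the energy, so they pass with only mild attenuation while the noise-dominated bins between harmonics are strongly suppressed.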
The second system combines the aharmonic comb filter with the acoustic-phonetic properties of speech
to improve the intelligibility of the MELP-coded noisy speech.
The noisy speech signal is segmented into broad-level sound classes using a multi-sensor automatic
segmentation/classification tool, and each sound class is enhanced differently based on its
acoustic-phonetic properties. The proposed system is shown to outperform both the MELPe noise preprocessor
and the aharmonic comb filter in intelligibility tests when used in concatenation with the MELP coder.
Since the second noise suppression system uses an automatic segmentation/classification algorithm, exploiting the GEMS signal in an automatic
segmentation/classification task is also addressed using an ASR
approach. Current ASR engines can segment and classify speech utterances
in a single pass; however, they are sensitive to ambient noise.
Features that are extracted from the GEMS signal can be fused with the noisy MFCC features
to improve the noise-robustness of the ASR system. In the first phase, a voicing
feature is extracted from the clean speech signal and fused with the MFCC features.
The actual GEMS signal could not be used in this phase because of insufficient sensor data to train the ASR system.
Tests are done using the Aurora2 noisy digits database. The speech-based voicing
feature is found to be effective at around 10 dB but, below 10 dB, the effectiveness rapidly drops with decreasing SNR
because of the severe distortions in the speech-based features at these SNRs. Hence, a novel system is proposed that treats the
MFCC features in a speech frame as missing data if the global SNR is below 10 dB and the speech frame is
unvoiced. If the global SNR is above 10 dB or the speech frame is voiced, both the MFCC features and the voicing feature are used. The proposed
system is shown to outperform some of the popular noise-robust techniques at all SNRs.
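The frame-level decision rule described above can be written down directly; the function name and the dictionary layout are illustrative:

```python
def select_features(mfcc, voicing, global_snr_db, is_voiced, snr_threshold_db=10.0):
    """Treat the MFCC features of a frame as missing data when the global SNR is
    below the threshold and the frame is unvoiced; otherwise use both feature sets."""
    if global_snr_db < snr_threshold_db and not is_voiced:
        return {"mfcc": None, "voicing": voicing}  # MFCCs marked as missing
    return {"mfcc": mfcc, "voicing": voicing}
```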
In the second phase, a new isolated monosyllable database is prepared that contains both speech and GEMS data. ASR experiments conducted
for clean speech showed that the GEMS-based feature, when fused with the MFCC features, decreases the performance.
The reason for this unexpected result is found to be partly related to some of the GEMS data being severely noisy.
Non-acoustic sensor noise exists in all GEMS data, but severe noise occurs only rarely. A missing
data technique is proposed to alleviate the effects of severely noisy sensor data. The GEMS-based feature is treated as missing data
when it is detected to be severely noisy. The combined features are shown to outperform the MFCC features for clean
speech when the missing data technique is applied.
|
153 |
A Content Based Movie Recommendation System Empowered By Collaborative Missing Data Prediction. Karaman, Hilal. 01 July 2010.
The evolution of the Internet has brought us into a world that offers a huge number of information items, such as music, movies, books, and web pages, of varying quality. Faced with this huge universe of items, people get confused, and the question "Which one should I choose?" arises in their minds. Recommendation systems address this problem by filtering a specific type of information with an information filtering technique that attempts to present items likely to interest the user. A variety of information filtering techniques have been proposed for performing recommendations; content-based and collaborative techniques are the most commonly used approaches in recommendation systems. This thesis work introduces ReMovender, a content-based movie recommendation system that is empowered by collaborative missing data prediction. The distinctive point of this study lies in the methodology used to correlate the users in the system with one another and in the usage of the content information of movies. ReMovender lets users rate movies on a scale from one to five. By using these ratings, it finds similarities among the users in a collaborative manner to predict the missing rating data. As for the content-based part, a set of movie features is used to correlate the movies and produce recommendations for the users.
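A minimal sketch of the collaborative part, predicting a missing rating from similar users' ratings on a one-to-five scale. The Pearson similarity and mean-centering scheme are common collaborative-filtering choices, assumed here rather than taken from the thesis:

```python
import numpy as np

def predict_rating(R, user, item):
    """Predict the missing rating R[user, item] (0 = missing, scale 1-5) from
    other users' ratings, weighted by Pearson similarity over co-rated movies."""
    rated_u = R[user] > 0
    num, den = 0.0, 0.0
    for v in range(R.shape[0]):
        if v == user or R[v, item] == 0:
            continue
        common = rated_u & (R[v] > 0)
        if common.sum() < 2:
            continue
        a, b = R[user, common].astype(float), R[v, common].astype(float)
        if a.std() == 0 or b.std() == 0:
            continue  # Pearson similarity is undefined for constant ratings
        sim = np.corrcoef(a, b)[0, 1]
        mean_v = R[v][R[v] > 0].mean()
        num += sim * (R[v, item] - mean_v)  # deviation from v's own mean
        den += abs(sim)
    mean_u = R[user][rated_u].mean()
    return float(mean_u) if den == 0 else float(np.clip(mean_u + num / den, 1.0, 5.0))
```

Given a small rating matrix where users 1 and 2 rate similarly to user 0, a missing rating for user 0 is pulled toward those neighbors' ratings for that movie.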
|
154 |
Investigation of probabilistic principal component analysis compared to proper orthogonal decomposition methods for basis extraction and missing data estimation. Lee, Kyunghoon. 21 May 2010.
The identification of flow characteristics and the reduction of high-dimensional simulation data have capitalized on an orthogonal basis achieved by proper orthogonal decomposition (POD), also known as principal component analysis (PCA) or the Karhunen-Loeve transform (KLT). In the realm of aerospace engineering, an orthogonal basis is versatile for diverse applications, especially those associated with reduced-order modeling (ROM), such as a low-dimensional turbulence model, an unsteady aerodynamic model for aeroelasticity and flow control, and a steady aerodynamic model for airfoil shape design. When a given data set lacks parts of its data, POD must adopt a least-squares formulation, leading to gappy POD, which uses a gappy norm, a variant of the L2 norm that deals with only the known data. Although gappy POD was originally devised to restore marred images, its application has spread to aerospace engineering for the following reason: various engineering problems can be reformulated as missing data estimation to exploit gappy POD. Similar to POD, gappy POD has a broad range of applications such as optimal flow sensor placement, experimental and numerical flow data assimilation, and impaired particle image velocimetry (PIV) data restoration.
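The gappy POD reconstruction step itself is compact: given an orthogonal basis and a mask of known entries, the expansion coefficients are fit by least squares over the known entries only. This is a generic sketch of the method, not code from the thesis:

```python
import numpy as np

def gappy_pod_restore(u_gappy, mask, Phi):
    """Restore missing entries of u_gappy (valid where mask is True) by a
    least-squares fit of the POD basis Phi to the known entries only."""
    coeffs, *_ = np.linalg.lstsq(Phi[mask], u_gappy[mask], rcond=None)
    u_restored = u_gappy.copy()
    u_restored[~mask] = (Phi @ coeffs)[~mask]  # fill gaps from the basis expansion
    return u_restored
```

When the true snapshot lies in the span of the basis and enough entries are known, the restoration is exact up to numerical precision.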
Apart from POD and gappy POD, both of which are deterministic formulations, probabilistic principal component analysis (PPCA), a probabilistic generalization of PCA, has been used in the pattern recognition field for speech recognition and in the oceanography area for empirical orthogonal functions in the presence of missing data. In formulation, PPCA presumes a linear latent variable model relating an observed variable with a latent variable that is inferred only from an observed variable through a linear mapping called factor-loading. To evaluate the maximum likelihood estimates (MLEs) of PPCA parameters such as a factor-loading, PPCA can invoke an expectation-maximization (EM) algorithm, yielding an EM algorithm for PPCA (EM-PCA). By virtue of the EM algorithm, the EM-PCA is capable of not only extracting a basis but also restoring missing data through iterations whether the given data are intact or not. Therefore, the EM-PCA can potentially substitute for both POD and gappy POD inasmuch as its accuracy and efficiency are comparable to those of POD and gappy POD. In order to examine the benefits of the EM-PCA for aerospace engineering applications, this thesis attempts to qualitatively and quantitatively scrutinize the EM-PCA alongside both POD and gappy POD using high-dimensional simulation data.
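The iterate-and-restore behavior of the EM-PCA can be illustrated with the simplest EM-style update for the principal subspace. This is a simplified sketch; the thesis's algorithm works with the full PPCA likelihood, including the noise variance, which is omitted here:

```python
import numpy as np

def em_pca(X, n_components, n_iter=200, seed=0):
    """EM-style principal-subspace extraction that restores missing entries
    (NaN) by re-imputing them from the current reconstruction each iteration."""
    rng = np.random.default_rng(seed)
    X = X.astype(float).copy()
    missing = np.isnan(X)
    X = np.where(missing, np.nanmean(X, axis=0), X)  # initialize with column means
    W = rng.normal(size=(X.shape[1], n_components))
    for _ in range(n_iter):
        mu = X.mean(axis=0)
        Xc = X - mu
        Z = Xc @ W @ np.linalg.inv(W.T @ W)      # E-step: latent coordinates
        W = Xc.T @ Z @ np.linalg.inv(Z.T @ Z)    # M-step: update the basis
        X[missing] = (mu + Z @ W.T)[missing]     # restore missing entries
    return W, X
```

Because the per-iteration cost does not depend on which entries are missing, the run time of this scheme is insensitive to the number of data-missing snapshots, which is the behavior the thesis reports for the EM-PCA.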
In pursuing qualitative investigations, the theoretical relationship between POD and PPCA is transparent such that the factor-loading MLE of PPCA, evaluated by the EM-PCA, pertains to an orthogonal basis obtained by POD. By contrast, the analytical connection between gappy POD and the EM-PCA is nebulous because they distinctively approximate missing data due to their antithetical formulation perspectives: gappy POD solves a least-squares problem whereas the EM-PCA relies on the expectation of the observation probability model. To juxtapose both gappy POD and the EM-PCA, this research proposes a unifying least-squares perspective that embraces the two disparate algorithms within a generalized least-squares framework. As a result, the unifying perspective reveals that both methods address similar least-squares problems; however, their formulations contain dissimilar bases and norms. Furthermore, this research delves into the ramifications of the different bases and norms that will eventually characterize the traits of both methods. To this end, two hybrid algorithms of gappy POD and the EM-PCA are devised and compared to the original algorithms for a qualitative illustration of the different basis and norm effects. After all, a norm reflecting a curve-fitting method is found to more significantly affect estimation error reduction than a basis for two example test data sets: one is absent of data only at a single snapshot and the other misses data across all the snapshots.
From a numerical performance aspect, the EM-PCA is computationally less efficient than POD for intact data since it suffers from slow convergence inherited from the EM algorithm. For incomplete data, this thesis quantitatively found that the number of data-missing snapshots predetermines whether the EM-PCA or gappy POD outperforms the other because of the computational cost of a coefficient evaluation, resulting from a norm selection. For instance, gappy POD demands laborious computational effort in proportion to the number of data-missing snapshots as a consequence of the gappy norm. In contrast, the computational cost of the EM-PCA is invariant to the number of data-missing snapshots thanks to the L2 norm. In general, the higher the number of data-missing snapshots, the wider the gap between the computational cost of gappy POD and the EM-PCA. Based on the numerical experiments reported in this thesis, the following criterion is recommended regarding the selection between gappy POD and the EM-PCA for computational efficiency: gappy POD for an incomplete data set containing a few data-missing snapshots and the EM-PCA for an incomplete data set involving multiple data-missing snapshots.
Last, the EM-PCA is applied to two aerospace applications in comparison to gappy POD as a proof of concept: one with an emphasis on basis extraction and the other with a focus on missing data reconstruction for a given incomplete data set with scattered missing data.
The first application exploits the EM-PCA to efficiently construct reduced-order models of engine deck responses obtained by the numerical propulsion system simulation (NPSS), some of whose results are absent due to failed analyses caused by numerical instability.
Model-prediction tests validate that engine performance metrics estimated by the reduced-order NPSS model exhibit considerably good agreement with those directly obtained by NPSS. Similarly, the second application illustrates that the EM-PCA is significantly more cost-effective than gappy POD at repairing spurious PIV measurements obtained from acoustically excited, bluff-body jet flow experiments. The EM-PCA reduces computational cost by factors of 8 to 19 compared to gappy POD while generating the same restoration results as those evaluated by gappy POD. All in all, through comprehensive theoretical and numerical investigation, this research establishes that the EM-PCA is an efficient alternative to gappy POD for an incomplete data set whose missing data are scattered over the entire data set.
|
156 |
Repairing event logs using stochastic process models. Rogge-Solti, Andreas; Mans, Ronny S.; van der Aalst, Wil M. P.; Weske, Mathias. January 2013.
Companies strive to improve their business processes in order to remain competitive. Process mining aims to infer meaningful insights from process-related data and has attracted the attention of practitioners, tool vendors, and researchers in recent years. Traditionally, event logs are assumed to describe the as-is situation. But this is not necessarily the case in environments where logging may be compromised by manual recording. For example, hospital staff may need to manually enter information regarding a patient's treatment. As a result, events or timestamps may be missing or incorrect.
In this paper, we make use of process knowledge captured in process models and provide a method to repair missing events in the logs. This way, we facilitate the analysis of incomplete logs. We realize the repair by combining stochastic Petri nets, alignments, and Bayesian networks. We evaluate the results using both synthetic data and real event data from a Dutch hospital.
|
157 |
Design, maintenance and methodology for analysing longitudinal social surveys, including applications. Domrow, Nathan Craig. January 2007.
This thesis describes the design, maintenance and statistical analysis involved in undertaking a longitudinal survey. A longitudinal survey (or study) obtains observations or responses from individuals at several times over a defined period. This enables the direct study of changes in an individual's response over time. In particular, it distinguishes an individual's change over time from the baseline differences among individuals within the initial panel (or cohort), which is not possible in a cross-sectional study. As such, longitudinal surveys give correlated responses within individuals. Longitudinal studies therefore require different considerations for sample design, selection and analysis from standard cross-sectional studies. This thesis looks at the methodology for analysing social surveys. Most social surveys comprise variables described as categorical variables. This thesis outlines the process of sample design and selection, interviewing and analysis for a longitudinal study. Emphasis is given to categorical response data typical of a survey. Included in this thesis are examples relating to the Goodna Longitudinal Survey and the Longitudinal Survey of Immigrants to Australia (LSIA). Analysis in this thesis also utilises data collected from these surveys. The Goodna Longitudinal Survey was conducted by the Queensland Office of Economic and Statistical Research (a portfolio office within Queensland Treasury) and began in 2002. It ran for two years, during which two waves of responses were collected.
|
158 |
Gestion de données manquantes dans des cascades de boosting : application à la détection de visages / Management of missing data in boosting cascades: application to face detection. Bouges, Pierre. 06 December 2012.
This thesis was carried out in the ISPR group (ImageS, Perception systems and Robotics) of the Institut Pascal, within the ComSee team (Computers that See). The research is part of a project called Bio Rafale, created by the company Vesalis in 2008 and funded by OSEO. Its goal is to improve security in stadiums by identifying banned fans. The applications of this work deal with face detection, which is the first step in the processing chain of the project. The most efficient detectors use a cascade of boosted classifiers. The term cascade refers to a sequential succession of several classifiers, and the term boosting refers to a set of learning algorithms that linearly combine several weak classifiers. The detector selected for this thesis also uses a cascade of boosted classifiers. Training such a cascade requires a training database and an image feature; here, covariance matrices are used as the image feature. The conditions of use of an object detector are fixed by its training stage, and one of our contributions is to adapt an object detector to conditions of use not anticipated during training. The proposed adaptations lead to a problem of classification with missing data. A probabilistic formulation of the cascade structure is then used to incorporate the uncertainty introduced by the missing data. This formulation involves the estimation of a posteriori probabilities and the computation of new rejection thresholds at each level of the modified cascade. For these two problems, several solutions are proposed, and extensive tests are done to find the best configuration. Finally, our solution is applied to the detection of turned or occluded faces using only an upright face detector. Detecting turned faces requires the use of a 3D geometric model to adjust the positions of the subwindows associated with the weak classifiers.
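The idea of softening hard cascade decisions when data are missing can be illustrated with a toy sketch. In a real system the pass probability of a stage with missing data would come from the estimated a posteriori probabilities rather than from the fixed constant assumed here:

```python
def cascade_decision(stage_scores, thresholds, p_pass_if_missing=0.5):
    """Evaluate a cascade of classifiers when some stage scores are missing (None).
    An observed score passes or fails its threshold; a missing score contributes
    an uncertain pass probability instead of a hard decision."""
    p_face = 1.0
    for score, threshold in zip(stage_scores, thresholds):
        p_face *= p_pass_if_missing if score is None else float(score >= threshold)
        if p_face == 0.0:
            break  # a failed stage rejects the window outright
    return p_face
```

A window with all stages observed yields a hard 0/1 decision, while a window with missing stage scores yields an intermediate probability that can be compared against a new rejection threshold.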
|
159 |
Multiple Imputation for Two-Level Hierarchical Models with Categorical Variables and Missing at Random Data. January 2016.
Accurate data analysis and interpretation of results may be influenced by many potential factors. The factors of interest in the current work are the chosen analysis model(s), the presence of missing data, and the type(s) of data collected. If analysis models are used that a) do not accurately capture the structure of relationships in the data, such as clustered/hierarchical data, b) do not allow or control for missing values present in the data, or c) do not accurately compensate for different data types, such as categorical data, then the assumptions associated with the model have not been met and the results of the analysis may be inaccurate. In the presence of clustered/nested data, hierarchical linear modeling or multilevel modeling (MLM; Raudenbush & Bryk, 2002) has the ability to predict outcomes for each level of analysis and across multiple levels (accounting for relationships between levels), providing a significant advantage over single-level analyses. When multilevel data contain missingness, multilevel multiple imputation (MLMI) techniques may be used to model both the missingness and the clustered nature of the data. With categorical multilevel data with missingness, categorical MLMI must be used. Two such routines for MLMI with continuous and categorical data were explored with missing at random (MAR) data: a formal Bayesian imputation and analysis routine in JAGS (R/JAGS) and a common MLM procedure of imputation via Bayesian estimation in BLImP with frequentist analysis of the multilevel model in Mplus (BLImP/Mplus). Manipulated variables included intraclass correlations, number of clusters, and the rate of missingness. Results showed that with continuous data, R/JAGS returned more accurate parameter estimates than BLImP/Mplus for almost all parameters of interest across levels of the manipulated variables.
Both R/JAGS and BLImP/Mplus encountered convergence issues and returned inaccurate parameter estimates when imputing and analyzing dichotomous data. Follow-up studies showed that JAGS and BLImP returned similar imputed datasets but the choice of analysis software for MLM impacted the recovery of accurate parameter estimates. Implications of these findings and recommendations for further research will be discussed. / Dissertation/Thesis / Doctoral Dissertation Educational Psychology 2016
|
160 |
Statistické metody pro regresní modely s chybějícími daty / Statistical Methods for Regression Models With Missing Data. Nekvinda, Matěj. January 2018.
The aim of this thesis is to describe and further develop estimation strategies for data obtained by stratified sampling. Estimation of the mean and linear regression model are discussed. The possible inclusion of auxiliary variables in the estimation is examined. The auxiliary variables can be transformed rather than used in their original form. A transformation minimizing the asymptotic variance of the resulting estimator is provided. The estimator using an approach from this thesis is compared to the doubly robust estimator and shown to be asymptotically equivalent.
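The basic stratified estimator of the mean underlying this setting can be sketched as follows. Stratum weights are assumed known; the thesis's contribution concerns transformed auxiliary variables, which this sketch omits:

```python
import numpy as np

def stratified_mean(stratum_samples, stratum_weights):
    """Estimate a population mean from per-stratum samples, weighting each
    stratum's sample mean by its known population share."""
    means = np.array([np.mean(s) for s in stratum_samples])
    weights = np.asarray(stratum_weights, dtype=float)
    return float(weights @ means)  # sum of weight * stratum mean
```

With two strata whose population shares are 0.3 and 0.7 and whose true means are 0 and 10, the estimator recovers the population mean of 7.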
|