161

The impact of missing data imputation on HCC survival prediction: Exploring the combination of missing data imputation with data-level methods such as clustering and oversampling

Abdul Jalil, Walid, Dalla Torre, Kvin January 2018 (has links)
The area of data imputation, which is the process of replacing missing data with substituted values, has been covered quite extensively in recent years. The literature on the practical impact of data imputation, however, remains scarce. This thesis explores the impact of some of the state-of-the-art data imputation methods on HCC survival prediction and classification in combination with data-level methods such as oversampling. More specifically, it explores imputation methods for mixed-type datasets and their impact on a particular HCC dataset. Previous research has shown that the newer, more sophisticated imputation methods outperform simpler ones when evaluated with normalized root mean square error (NRMSE). Contrary to intuition, however, the results of this study show that when combined with other data-level methods such as clustering and oversampling, the differences in imputation performance do not always impact classification in any meaningful way. This might be explained by the noise that is introduced when generating synthetic data points in the oversampling process. The results also show that one of the more sophisticated imputation methods, namely MICE, is highly dependent on prior assumptions about the underlying distributions of the dataset. When those assumptions are incorrect, the imputation method performs poorly and has a considerable negative impact on classification.
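The pipeline described above — impute first, then oversample before classification — can be sketched in Python. The following is a minimal illustration, assuming scikit-learn and imbalanced-learn are available; the synthetic data, feature count, and random-forest classifier are placeholders rather than the thesis's actual HCC dataset or models, and IterativeImputer stands in for a MICE-style imputer.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X_complete = rng.normal(size=(300, 10))        # placeholder for HCC features
y = (rng.random(300) < 0.2).astype(int)        # imbalanced "survival" labels
mask = rng.random(X_complete.shape) < 0.15     # inject 15% missingness
X = np.where(mask, np.nan, X_complete)

# MICE-style imputation: each feature regressed on the others, iteratively.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imp = imputer.fit_transform(X)

# NRMSE of the imputation itself, the criterion mentioned in the abstract.
err = (X_imp - X_complete)[mask]
nrmse = np.sqrt(np.mean(err ** 2)) / (X_complete.max() - X_complete.min())
print(f"NRMSE: {nrmse:.3f}")

# Oversample the minority class on the (imputed) training portion only.
X_tr, X_te, y_tr, y_te = train_test_split(X_imp, y, stratify=y, random_state=0)
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

clf = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)
print("F1 on held-out data:", f1_score(y_te, clf.predict(X_te)))
```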
162

The Single Imputation Technique in the Gaussian Mixture Model Framework

Aisyah, Binti M.J. January 2018 (has links)
Missing data is a common issue in data analysis. Numerous techniques have been proposed to deal with the missing data problem. Imputation is the most popular strategy for handling missing data: it is the process of replacing missing values with plausible values. The two imputation techniques most frequently cited in the literature are single imputation and multiple imputation. Multiple imputation, often regarded as the gold-standard imputation technique, was proposed by Rubin in 1987 to address missing data. However, inconsistency is the major problem with multiple imputation. Single imputation is less popular in missing data research due to issues of bias and reduced variability. One solution to improve the single imputation technique is based on the basic regression model: the main idea is that a residual is added to address bias and variability. The residual is drawn under a normality assumption with mean 0 and variance equal to the residual variance. Although newer single imputation methods, such as the stochastic regression model and hot deck imputation, might be able to improve the variability and bias issues, single imputation techniques still suffer from uncertainty that may lead to underestimation of the R-squared or the standard error in the analysis results. The research reported in this thesis provides two imputation solutions for the single imputation technique. In the first imputation procedure, the wild bootstrap is proposed to improve the treatment of uncertainty in the residual variance of the regression model. In the second solution, predictive mean matching (PMM) is enhanced: the regression model takes the main role in generating the recipient values, while the donors are taken from the observed values. The missing values are then imputed by randomly drawing one of the observations in the donor pool. The size of the donor pool is significant in determining the quality of the imputed values. A fixed donor pool size has been employed in many existing works on PMM imputation, but it might not be appropriate in certain circumstances, such as when the data distribution has high-density regions. Instead of using a fixed donor pool size, the proposed method applies a radius-based solution to determine the size of the donor pool. Both proposed imputation procedures are combined with the Gaussian mixture model framework to preserve the original data distribution. The results reported in the thesis, from experiments on benchmark and artificial data sets, confirm improvements for further data analysis. The proposed approaches are therefore worthwhile candidates for further investigation and experiments.
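The radius-based donor pool idea can be illustrated with a simplified, single-variable sketch. The function below is a hypothetical helper, not the thesis's full GMM-based procedure: it fits a linear regression on the observed cases and draws each imputed value from the donors whose predicted values fall within a chosen radius of the recipient's prediction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def pmm_radius_impute(X, y, radius, rng=None):
    """Impute missing entries of y from covariates X using PMM, where the
    donors are all observed cases whose predicted value lies within
    `radius` of the recipient's predicted value."""
    rng = rng or np.random.default_rng()
    y = y.copy()
    obs = ~np.isnan(y)
    model = LinearRegression().fit(X[obs], y[obs])
    pred_obs = model.predict(X[obs])       # predictions for potential donors
    pred_mis = model.predict(X[~obs])      # predictions for recipients
    donors = y[obs]
    imputed = np.empty(pred_mis.shape)
    for i, p in enumerate(pred_mis):
        pool = donors[np.abs(pred_obs - p) <= radius]
        if pool.size == 0:                 # fall back to the nearest donor
            pool = donors[[np.argmin(np.abs(pred_obs - p))]]
        imputed[i] = rng.choice(pool)      # draw one observed value
    y[~obs] = imputed
    return y

# Toy usage with synthetic data (placeholder, not the thesis's data sets).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=500)
y[rng.random(500) < 0.2] = np.nan
y_filled = pmm_radius_impute(X, y, radius=0.3, rng=rng)
```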
163

AI-enabled modeling and monitoring of data-rich advanced manufacturing systems

Mamun, Abdullah Al 08 August 2023 (has links) (PDF)
The infrastructure of cyber-physical systems (CPS) is based on a meta-concept of cybermanufacturing systems (CMS) that synchronizes the Industrial Internet of Things (IIoT), cloud computing, Industrial Control Systems (ICSs), and big data analytics in manufacturing operations. Artificial intelligence (AI) can be incorporated to make intelligent decisions in the day-to-day operations of CMS. Cyberattack spaces in AI-based cybermanufacturing operations pose significant challenges, including unauthorized modification of systems, loss of historical data, destructive malware, and software malfunctioning. However, a cybersecurity framework can be implemented to prevent unauthorized access, theft, damage, or other harmful attacks on electronic equipment, networks, and sensitive data. The five main cybersecurity framework steps are divided into procedures and countermeasure efforts: identify, protect, detect, respond, and recover. Given the major challenges in AI-enabled cybermanufacturing systems, three research objectives are proposed in this dissertation by incorporating cybersecurity frameworks. The first research objective addresses the in-situ additive manufacturing (AM) process authentication problem using high-volume video streaming data. A side-channel monitoring approach based on an in-situ optical imaging system is established, and a tensor-based layer-wise texture descriptor is constructed to describe the observed printing path. Subsequently, multilinear principal component analysis (MPCA) is leveraged to reduce the dimension of the tensor-based texture descriptor, and low-dimensional features can be extracted for detecting attack-induced alterations. The second research objective seeks to address high-volume data stream problems in multi-channel sensor fusion for diverse bearing fault diagnosis. This second approach proposes a new multi-channel sensor fusion method that integrates acoustic and vibration signals with different sampling rates and limited training data. The frequency-domain tensor is decomposed by MPCA, resulting in low-dimensional process features for diverse bearing fault diagnosis with a neural network classifier. Building on the second proposed method, the third research objective addresses the recovery of multi-channel sensing signals when a substantial amount of data is missing due to sensor malfunction or transmission issues. This study leverages a fully Bayesian CANDECOMP/PARAFAC (FBCP) factorization method that captures the multi-linear interaction (channels × signals) among latent factors of the sensor signals and imputes missing entries based on the observed signals.
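For the third objective, the general idea of recovering missing tensor entries from a low-rank CP factorization can be sketched with a plain alternating-least-squares loop. This is only an EM-style illustration of CANDECOMP/PARAFAC imputation, not the fully Bayesian FBCP method used in the dissertation; the tensor shape, rank, and iteration count below are assumptions.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product of U (J x R) and V (K x R) -> (J*K x R)."""
    return (U[:, None, :] * V[None, :, :]).reshape(-1, U.shape[1])

def cp_impute(X, observed, rank=3, n_iter=50, seed=0):
    """Impute missing entries (observed == False) of a 3-way tensor X."""
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    A, B, C = [rng.normal(size=(d, rank)) for d in (I, J, K)]
    X_fill = np.where(observed, X, np.nanmean(X[observed]))  # initial fill
    for _ in range(n_iter):
        # Standard CP-ALS sweeps on the currently filled tensor.
        A = (X_fill.reshape(I, J * K) @ khatri_rao(B, C)
             @ np.linalg.pinv((B.T @ B) * (C.T @ C)))
        B = (X_fill.transpose(1, 0, 2).reshape(J, I * K) @ khatri_rao(A, C)
             @ np.linalg.pinv((A.T @ A) * (C.T @ C)))
        C = (X_fill.transpose(2, 0, 1).reshape(K, I * J) @ khatri_rao(A, B)
             @ np.linalg.pinv((A.T @ A) * (B.T @ B)))
        recon = (khatri_rao(B, C) @ A.T).T.reshape(I, J, K)
        # E-step: keep observed values, replace missing ones with the model.
        X_fill = np.where(observed, X, recon)
    return X_fill

# Toy usage: 8 channels x 64 time steps x 30 segments with 20% missing.
rng = np.random.default_rng(1)
true = np.einsum('ir,jr,kr->ijk', rng.normal(size=(8, 3)),
                 rng.normal(size=(64, 3)), rng.normal(size=(30, 3)))
observed = rng.random(true.shape) > 0.2
imputed = cp_impute(np.where(observed, true, np.nan), observed)
```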
164

Constructing Gender Indices Using Exploratory Factor Analysis

Annersten, Gilbert January 2023 (has links)
No description available.
165

New Technique for Imputing Missing Item Responses for an Ordinal Variable: Using Tennessee Youth Risk Behavior Survey as an Example.

Ahmed, Andaleeb Abrar 15 December 2007 (has links) (PDF)
Surveys ordinarily ask questions on an ordinal scale and often result in missing data. We suggest a regression-based technique for imputing missing ordinal data. A multilevel cumulative logit model was used under the assumption that observed responses of certain key variables can serve as covariates in predicting missing item responses of an ordinal variable. Individual predicted probabilities at each response level were obtained. Average individual predicted probabilities for each response level were used to randomly impute the missing responses using a uniform distribution. Finally, likelihood ratio chi-square statistics were used to compare the imputed and observed distributions. Two other forms of multiple imputation algorithms were performed for comparison. The performance of our imputation technique was comparable to that of the two established algorithms. Our method, being simpler, does not involve any complex algorithms and, with further research, can potentially be used as an imputation technique for missing ordinal variables.
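A single-level simplification of this procedure (the thesis uses a multilevel model fitted to a real survey) can be sketched with statsmodels' cumulative logit model: fit on complete cases, predict per-level probabilities for the incomplete cases, average them, and draw imputed responses from that distribution. The data, column names, and response levels below are placeholders.

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
latent = 1.2 * df.x1 - 0.8 * df.x2 + rng.logistic(size=n)
df["item"] = pd.cut(latent, bins=[-np.inf, -1, 0, 1, np.inf], labels=[1, 2, 3, 4])
df.loc[rng.random(n) < 0.2, "item"] = np.nan            # missing item responses

obs = df["item"].notna()
# Cumulative logit fit on complete cases (single-level simplification).
model = OrderedModel(df.loc[obs, "item"], df.loc[obs, ["x1", "x2"]], distr="logit")
res = model.fit(method="bfgs", disp=False)

# Predicted probabilities per response level for the incomplete cases,
# averaged and then used to draw imputed levels at random.
probs = np.asarray(res.predict(df.loc[~obs, ["x1", "x2"]]))
avg = probs.mean(axis=0)
df.loc[~obs, "item"] = rng.choice([1, 2, 3, 4], size=(~obs).sum(), p=avg / avg.sum())
```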
166

Support Vector Machines for Classification and Imputation

Rogers, Spencer David 16 May 2012 (has links) (PDF)
Support vector machines (SVMs) are a powerful tool for classification problems. SVMs have only been developed in the last 20 years with the availability of cheap and abundant computing power. SVMs are a non-statistical approach and make no assumptions about the distribution of the data. Here support vector machines are applied to a classic data set from the machine learning literature and the out-of-sample misclassification rates are compared to other classification methods. Finally, an algorithm for using support vector machines to address the difficulty in imputing missing categorical data is proposed and its performance is demonstrated under three different scenarios using data from the 1997 National Labor Survey.
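The imputation idea can be sketched directly: treat the incomplete categorical variable as a classification target, train an SVM on the complete cases, and predict the missing categories. The synthetic data and kernel settings below are placeholders, not the 1997 National Labor Survey setup.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                      # fully observed covariates
cat = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # categorical variable to impute
missing = rng.random(500) < 0.25                   # 25% of its values are missing

# Train on complete cases, then fill the missing categories with predictions.
svm = SVC(kernel="rbf", C=1.0).fit(X[~missing], cat[~missing])
cat_imputed = cat.copy()
cat_imputed[missing] = svm.predict(X[missing])
```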
167

Advancing the Effectiveness of Non-Linear Dimensionality Reduction Techniques

Gashler, Michael S. 18 May 2012 (has links) (PDF)
Data that is represented with high dimensionality presents a computational complexity challenge for many existing algorithms. Limiting dimensionality by discarding attributes is sometimes a poor solution to this problem because significant high-level concepts may be encoded in the data across many or all of the attributes. Non-linear dimensionality reduction (NLDR) techniques have been successful with many problems at minimizing dimensionality while preserving intrinsic high-level concepts that are encoded with varying combinations of attributes. Unfortunately, many challenges remain with existing NLDR techniques, including excessive computational requirements, an inability to benefit from prior knowledge, and an inability to handle certain difficult conditions that occur in data from many real-world problems. Further, certain practical factors have limited advancement in NLDR, such as a lack of clarity regarding suitable applications for NLDR and a general unavailability of efficient implementations of complex algorithms. This dissertation presents a collection of papers that advance the state of NLDR in each of these areas. Contributions of this dissertation include:
• An NLDR algorithm, called Manifold Sculpting, that optimizes its solution using graduated optimization. This approach enables it to obtain better results than methods that only optimize an approximate problem. Additionally, Manifold Sculpting can benefit from prior knowledge about the problem.
• An intelligent neighbor-finding technique called SAFFRON that improves the breadth of problems that existing NLDR techniques can handle.
• A neighborhood refinement technique called CycleCut that further increases the robustness of existing NLDR techniques, and that can work in conjunction with SAFFRON to solve difficult problems.
• Demonstrations of specific applications for NLDR techniques, including the estimation of state within dynamical systems, training of recurrent neural networks, and imputing missing values in data.
• An open source toolkit containing each of the techniques described in this dissertation, as well as several existing NLDR algorithms, and other useful machine learning methods.
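Manifold Sculpting, SAFFRON, and CycleCut are the dissertation's own contributions and are not reproduced here; as a generic point of reference, a standard NLDR baseline such as scikit-learn's Isomap can be run on a classic manifold data set as follows.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# 3-D "swiss roll" data whose intrinsic structure is 2-dimensional.
X, _ = make_swiss_roll(n_samples=1500, random_state=0)
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(embedding.shape)   # (1500, 2): the low-dimensional representation
```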
168

Spatial Allocation, Imputation, and Sampling Methods for Timber Product Output Data

Brown, John 10 November 2009 (has links)
Data from the 2001 and 2003 timber product output (TPO) studies for Georgia were explored to determine new methods for handling missing data and to find suitable sampling estimators. Mean roundwood volume receipts per mill for the year 2003 were calculated using the methods developed by Rubin (1987). Mean receipts per mill ranged from 4.4 to 14.2 million ft3. The mean value of 9.3 million ft3 did not statistically differ from the NONMISS, SINGLE1, and SINGLE2 reference means (p = .68, .75, and .76, respectively). Fourteen estimators were investigated to explore sampling approaches, with the estimators being of several mean types (simple random sample, ratio, stratified sample, and combined ratio) and employing two methods for stratification (the Dalenius-Hodges (DH) square root of frequency method and a cluster analysis method). Relative efficiency (RE) improved when the number of groups increased and when employing a ratio estimator, particularly a combined ratio. Neither the DH method nor the cluster analysis method performed better than the other. Six bound sizes (1, 5, 10, 15, 20, and 25 percent) were considered for deriving sample sizes for the total volume of roundwood. The minimum achievable bound size was found to be 10 percent of the total receipts volume for the DH method using a two-group stratification. This was true for both the stratified and combined ratio estimators. In addition, for the stratified and combined ratio estimators, only the DH-method stratifications were able to reach a 10 percent bound on the total (6 of the 12 stratified estimators). The remaining six stratified estimators were able to achieve a 20 percent bound on the total. Finally, nonlinear repeated measures models were developed to spatially allocate mill receipts to surrounding counties in the event that only a mill's total receipt volume is obtained. A Gompertz model with a power spatial covariance was found to perform best when using road distances from the mills to either county center type (geographic or forest mass). These models utilized the cumulative frequency of mill receipts as the response variable, with cumulative frequencies based on distance from the mill to the county. / Ph. D.
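The Dalenius-Hodges stratification referred to above follows the cumulative square-root-of-frequency rule, which can be sketched as follows; the bin count, number of strata, and synthetic receipt volumes are placeholder assumptions, not the Georgia TPO data.

```python
import numpy as np

def dalenius_hodges_breaks(values, n_strata, n_bins=50):
    """Return stratum boundaries from the cumulative sqrt(frequency) rule."""
    freq, edges = np.histogram(values, bins=n_bins)
    cum_sqrt_f = np.cumsum(np.sqrt(freq))
    targets = cum_sqrt_f[-1] * np.arange(1, n_strata) / n_strata
    # Boundary = upper edge of the bin where cum sqrt(F) crosses each target.
    idx = np.searchsorted(cum_sqrt_f, targets)
    return edges[idx + 1]

rng = np.random.default_rng(0)
receipts = rng.lognormal(mean=2.0, sigma=1.0, size=200)   # mill receipt volumes
breaks = dalenius_hodges_breaks(receipts, n_strata=2)
strata = np.digitize(receipts, breaks)                    # stratum label per mill
```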
169

Partial least squares structural equation modelling with incomplete data. An investigation of the impact of imputation methods.

Mohd Jamil, J.B. January 2012 (has links)
Despite considerable advances in missing data imputation methods over the last three decades, the problem of missing data remains largely unsolved. Many techniques have emerged in the literature as candidate solutions. These techniques can be categorised into two classes: statistical methods of data imputation and computational intelligence methods of data imputation. Due to the longstanding use of statistical methods in handling missing data problems, it has taken quite some time for computational intelligence methods to gain attention, even though these methods offer comparable accuracy to other approaches. The merits of both classes have been discussed at length in the literature, but only a limited number of studies make a significant comparison between them. This thesis contributes to knowledge by, firstly, conducting a comprehensive comparison of standard statistical methods of data imputation, namely mean substitution (MS), regression imputation (RI), expectation maximization (EM), tree imputation (TI) and multiple imputation (MI), on missing completely at random (MCAR) data sets. Secondly, this study compares the efficacy of these methods with a computational intelligence method of data imputation, namely a neural network (NN), on missing not at random (MNAR) data sets. The significance of the differences in the methods' performance is presented. Thirdly, a novel procedure for handling missing data is presented. A hybrid combination of each of these statistical methods with a NN, known here as the post-processing procedure, was adopted to approximate MNAR data sets. Simulation studies for each of these imputation approaches have been conducted to assess the impact of missing values on partial least squares structural equation modelling (PLS-SEM), based on the estimated accuracy of both structural and measurement parameters. The best method for dealing with particular missing data mechanisms is identified. Several significant insights were deduced from the simulation results. It was found that, for the MCAR problem, when using statistical methods of data imputation, MI performs better than the other methods for all percentages of missing data. Another unique contribution is found when comparing the results before and after the NN post-processing procedure. This improvement in accuracy may result from the neural network's ability to derive meaning from the imputed data set produced by the statistical methods. Based on these results, the NN post-processing procedure is capable of assisting MS in producing significant improvements in the accuracy of the approximated values. This is a promising result, as MS is the weakest method in this study. This evidence is also informative, as MS is often used as the default method available to users of PLS-SEM software. / Minister of Higher Education Malaysia and University Utara Malaysia
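One plausible reading of the post-processing procedure (the sketch below is an assumption, not necessarily the thesis's exact algorithm) is to start from a mean-substituted data set and then refine each incomplete column with a small neural network trained on the cases where that column is observed.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def nn_postprocess(X):
    """X: 2-D array with NaNs. Returns a copy imputed by MS + NN refinement."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    X_ms = np.where(miss, np.nanmean(X, axis=0), X)       # mean substitution
    X_out = X_ms.copy()
    for j in range(X.shape[1]):
        if not miss[:, j].any():
            continue
        others = np.delete(X_ms, j, axis=1)                # remaining columns
        net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
        net.fit(others[~miss[:, j]], X_ms[~miss[:, j], j]) # train on observed cases
        X_out[miss[:, j], j] = net.predict(others[miss[:, j]])
    return X_out

# Toy usage with synthetic data (placeholder, not the thesis's simulations).
rng = np.random.default_rng(0)
data = rng.normal(size=(300, 6))
data[rng.random(data.shape) < 0.1] = np.nan
completed = nn_postprocess(data)
```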
170

The Impact of Data Imputation Methodologies on Knowledge Discovery

Brown, Marvin Lane 26 November 2008 (has links)
No description available.
