251

Developing Predictive Models for Lung Tumor Analysis

Basu, Satrajit 01 January 2012 (has links)
A CT-scan of the lungs has become ubiquitous as a thoracic diagnostic tool. Using CT-scan images to develop predictive models for tumor type and patient survival time in Non-Small Cell Lung Cancer (NSCLC) would therefore provide a novel approach to non-invasive tumor analysis, offering an alternative to histopathological techniques such as needle biopsy. Two major tumor analysis problems were addressed in the course of this study: tumor type classification and survival time prediction. CT-scan images of 109 patients with NSCLC were used. The first problem involved classifying tumors into the two major classes of non-small cell lung tumors, Adenocarcinoma and Squamous-cell Carcinoma, each constituting 30% of all lung tumors. In a first-of-its-kind investigation, a large group of 2D and 3D image features hypothesized to be useful was evaluated for effectiveness in classifying the tumors. Classifiers including decision trees and support vector machines (SVM) were used along with feature selection techniques (wrappers and relief-F) to build models for tumor classification. Results show that, over the large feature space of both 2D and 3D features, tumor classes can be predicted with over 63% accuracy, suggesting the new features may be of help. The accuracy achieved using 2D and 3D features is similar, with 3D features easier to use. The tumor classification study was then extended by introducing the Bronchioalveolar Carcinoma (BAC) tumor type. Following the hypothesis that BAC is substantially different from the other NSCLC tumor types, a two-class problem was created in which BAC was differentiated from the other two tumor types; to reduce the three-class problem to a two-class one, misclassifications between Adenocarcinoma and Squamous-cell Carcinoma were ignored. Using the same prediction models and only 3D image features, tumor classes were predicted with around 77% accuracy. The final study involved predicting two-year survival in patients suffering from NSCLC. Using a subset of the image features and a handful of clinical features, predictive models were developed for 95 NSCLC patients using support vector machine, naive Bayes, and decision tree classifiers. Using the Area Under the Curve (AUC) as a performance metric, different models were developed and analyzed for their effectiveness in predicting survival time. A novel feature selection method that groups features based on a correlation measure is proposed in this work, along with feature space reduction using principal component analysis. The parameters of the support vector machine were tuned using grid search. A model based on a combination of image and clinical features achieved the best performance, an AUC of 0.69, using dimensionality reduction by principal component analysis together with grid search to tune the SVM parameters. The study showed the effectiveness of a predominantly image-based feature space in predicting survival time. A comparison of the models from the different classifiers also indicates that SVMs consistently outperformed or matched the other two classifiers on this data.
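The best survival model above combines PCA-based dimensionality reduction with a grid-searched SVM evaluated by AUC. A minimal sketch of such a pipeline, assuming scikit-learn and using randomly generated placeholder data in place of the study's actual image and clinical features:

```python
# Sketch: PCA + grid-searched SVM scored by AUC, as in the survival study.
# X and y below are hypothetical placeholders, not the thesis data.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(95, 40))       # placeholder: 95 patients, 40 features
y = rng.integers(0, 2, size=95)     # placeholder: survived two years?

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),  # dimensionality reduction
    ("svm", SVC(kernel="rbf")),
])
grid = GridSearchCV(
    pipe,
    {"svm__C": [0.1, 1, 10, 100], "svm__gamma": [1e-3, 1e-2, 1e-1]},
    scoring="roc_auc",              # AUC as the performance metric
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
grid.fit(X, y)
print("best CV AUC:", grid.best_score_, "params:", grid.best_params_)
```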
252

Interactive Object Retrieval using Interpretable Visual Models

Rebai, Ahmed 18 May 2011 (has links) (PDF)
This thesis is an attempt to improve visual object retrieval by allowing users to interact with the system. Our solution lies in constructing an interactive system that allows users to define their own visual concept from a concise set of visual patches given as input. These patches, which represent the most informative clues for a given visual category, are trained beforehand with a supervised learning algorithm in a discriminative manner. Then, in order to specialize their models, users can give feedback on the model itself by choosing and weighting the patches they are confident of. The real challenge is how to generate concise and visually interpretable models. Our contribution rests on two points. First, in contrast to state-of-the-art approaches that use bags of words, we propose embedding local visual features without any quantization, which means that each component of the high-dimensional feature vectors used to describe an image is associated with a unique and precisely localized image patch. Second, we suggest using regularization constraints in the loss function of our classifier to favor sparsity in the models produced. Sparsity is indeed preferable for concision (a reduced number of patches in the model) as well as for decreasing prediction time. To meet these objectives, we developed a multiple-instance learning scheme using a modified version of the BLasso algorithm. BLasso is a boosting-like procedure that behaves in the same way as Lasso (Least Absolute Shrinkage and Selection Operator): it efficiently regularizes the loss function with an additive L1-constraint by alternating between forward and backward steps at each iteration. The method proposed here is generic in the sense that it can be used with any local features or feature sets representing the content of an image region.
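For illustration, a toy sketch of the forward/backward regularization idea behind BLasso, here for squared loss on ordinary feature vectors rather than the thesis's multiple-instance variant; the step size, penalty, and tolerance values are arbitrary assumptions:

```python
# Toy BLasso-style coordinate procedure: a backward step removes eps of
# weight from an active coordinate when that lowers the L1-penalized
# loss; otherwise a forward step adds eps along the steepest coordinate.
import numpy as np

def blasso_sketch(X, y, eps=0.01, lam=0.1, n_iter=500, xi=1e-6):
    n, d = X.shape
    beta = np.zeros(d)

    def penalized_loss(b):
        r = y - X @ b
        return 0.5 * np.sum(r**2) / n + lam * np.sum(np.abs(b))

    for _ in range(n_iter):
        # Backward step: try shrinking an active coordinate toward zero.
        best_j, best_loss = None, penalized_loss(beta)
        for j in np.flatnonzero(beta):
            trial = beta.copy()
            trial[j] -= eps * np.sign(beta[j])
            loss = penalized_loss(trial)
            if loss < best_loss - xi:
                best_j, best_loss = j, loss
        if best_j is not None:
            beta[best_j] -= eps * np.sign(beta[best_j])
            continue
        # Forward step: move eps against the steepest gradient coordinate
        # of the unpenalized loss.
        grad = -X.T @ (y - X @ beta) / n
        j = int(np.argmax(np.abs(grad)))
        beta[j] -= eps * np.sign(grad[j])
    return beta
```

Since every update is a multiple of eps, the backward step lands exactly on zero when it deactivates a coordinate, which is what produces the sparse, concise models the thesis aims for.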
253

Non-Destructive VIS/NIR Reflectance Spectrometry for Red Wine Grape Analysis

Fadock, Michael 04 August 2011 (has links)
A novel non-destructive method of grape berry analysis is presented that uses reflected light to predict berry composition. The reflectance spectrum was collected using a diode array spectrometer (350 to 850 nm) over the 2009 and 2010 growing seasons. Partial least squares (PLS) regression and support vector machine regression (SVMR) generated calibrations between reflected light and composition for five berry components: total soluble solids (°Brix), titratable acidity (TA), pH, total phenols, and anthocyanins. Standard methods of analysis for the components were employed and characterized for error. Decomposition of the reflectance data was performed by principal component analysis (PCA) and independent component analysis (ICA). Regression models were constructed using 10×10-fold cross-validated PLS and SVM models subject to smoothing, differentiation, and normalization pretreatments. All generated models were validated on the alternate season using two model selection strategies: minimum root mean squared error of prediction (RMSEP) and the "oneSE" heuristic. PCA/ICA decomposition demonstrated consistent features in the long VIS wavelengths and the NIR region, consistent across seasons. 2009 was generally more variable, possibly due to cold weather effects. RMSEP and R² statistics indicate that the PLS °Brix, pH, and TA models predicted well for 2009 and 2010; SVM was marginally better. The R² values of the PLS °Brix, pH, and TA models for 2009 and 2010, respectively, were 0.84, 0.58, 0.56 and 0.89, 0.81, 0.58. The 2010 °Brix models were suitable for rough screening. Optimal pretreatments were Savitzky-Golay (SG) smoothing and relative normalization. Anthocyanins were well predicted in 2009 (R² 0.65) but not in 2010 (R² 0.15). Phenols were not well predicted in either year (R² 0.15-0.25). Validation demonstrated that the °Brix, pH, and TA models from 2009 transferred to 2010 with fair results (R² 0.70, 0.72, 0.31). Models generated using 2010 reflectance data could not predict the 2009 data. It is hypothesized that weather events present in 2009 but not in 2010 allowed the forward calibration transfer and prevented the reverse transfer. Heuristic selection was superior to minimum RMSEP for transfer, indicating some overfitting in the minimum-RMSEP models. The results demonstrate a reflectance-composition relationship in the VIS-NIR region for °Brix, pH, and TA that warrants additional study and the development of further calibrations.
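A minimal sketch of one such calibration, assuming scikit-learn and SciPy: Savitzky-Golay smoothing as the pretreatment, followed by cross-validated PLS regression of °Brix on the spectra. The arrays here are random placeholders, not the thesis data, and the smoothing window and component range are assumptions:

```python
# Sketch: SG-smoothed reflectance spectra -> cross-validated PLS
# calibration for a berry component (here, degrees Brix).
import numpy as np
from scipy.signal import savgol_filter
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(1)
spectra = rng.normal(size=(120, 500))         # placeholder: 350-850 nm spectra
brix = rng.normal(loc=20, scale=2, size=120)  # placeholder reference values

# Pretreatment: Savitzky-Golay smoothing along each spectrum.
X = savgol_filter(spectra, window_length=11, polyorder=2, axis=1)

# Cross-validate the number of PLS components, scored by RMSE.
pls = GridSearchCV(
    PLSRegression(),
    {"n_components": list(range(2, 16))},
    scoring="neg_root_mean_squared_error",
    cv=KFold(n_splits=10, shuffle=True, random_state=1),
)
pls.fit(X, brix)
print("components:", pls.best_params_, "CV RMSE:", -pls.best_score_)
```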
254

Application of continuous wavelet analysis to hyperspectral data for the characterization of vegetation

Cheng, Tao Unknown Date
No description available.
255

Classification in high dimensional feature spaces / by H.O. van Dyk

Van Dyk, Hendrik Oostewald January 2009 (has links)
In this dissertation we developed theoretical models to analyse Gaussian and multinomial distributions. The analysis focuses on classification in high-dimensional feature spaces and provides a basis for dealing with issues such as data sparsity and feature selection (for Gaussian and multinomial distributions, two frequently used models for high-dimensional applications). A Naïve Bayesian philosophy is followed to deal with issues associated with the curse of dimensionality. The core treatment of the Gaussian and multinomial models consists of finding analytical expressions for classification error performance. Exact analytical expressions were found for calculating the error rates of binary-class systems with Gaussian features of arbitrary dimensionality and any type of quadratic decision boundary (except for degenerate paraboloidal boundaries). Similarly, computationally inexpensive (and approximate) analytical error rate expressions were derived for classifiers with multinomial models. Additional issues with regard to the curse of dimensionality that are specific to multinomial models (feature sparsity) were dealt with and tested on a text-based language identification problem covering all eleven official languages of South Africa. / Thesis (M.Ing. (Computer Engineering))--North-West University, Potchefstroom Campus, 2009.
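As a concrete instance of such an analytical error rate: for two equal-prior Gaussian classes sharing a covariance matrix (a linear decision boundary, the simplest special case of the quadratic boundaries treated in the dissertation), the Bayes error has the closed form Φ(−Δ/2), where Δ is the Mahalanobis distance between the class means. A short sketch:

```python
# Closed-form Bayes error for two equal-prior Gaussians with a shared
# covariance: P(error) = Phi(-Delta / 2), Delta = Mahalanobis distance.
import numpy as np
from scipy.stats import norm

def bayes_error_equal_cov(mu0, mu1, cov):
    diff = np.asarray(mu1) - np.asarray(mu0)
    delta = np.sqrt(diff @ np.linalg.solve(cov, diff))
    return norm.cdf(-delta / 2.0)

mu0, mu1 = np.zeros(3), np.array([1.0, 1.0, 0.5])
cov = np.eye(3)
print(bayes_error_equal_cov(mu0, mu1, cov))  # exact, no simulation needed
```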
257

Rule-based Models of Transcriptional Regulation and Complex Diseases : Applications and Development

Bornelöv, Susanne January 2014 (has links)
As we gain increased understanding of genetic disorders and gene regulation, more focus has turned towards complex interactions. Combinations of genes, or of genes and environmental factors, have been suggested to explain the missing heritability behind complex diseases. Furthermore, gene activation and splicing seem to be governed by a complex machinery of histone modification (HM), transcription factor (TF), and DNA sequence signals. This thesis aimed to apply and develop multivariate machine learning methods for such biological problems. Monte Carlo feature selection was combined with rule-based classification to identify interactions between HMs and to study the interplay of factors important for asthma and allergy. First, publicly available ChIP-seq data for 38 HMs were studied (Paper I). We trained a classifier to predict exon inclusion levels from the HM signals, identified HMs important for splicing, and showed that splicing could be predicted from the HM patterns. Next, we applied a similar methodology to data from two large birth cohorts describing asthma and allergy in children (Paper II). We identified genetic and environmental factors important for allergic diseases, confirming earlier results, and found candidate gene-gene and gene-environment interactions. In order to interpret and present the classifiers, we developed Ciruvis, a web-based tool for network visualization of classification rules (Paper III). We applied Ciruvis to classifiers trained on both simulated and real data and compared our tool to another classification-based methodology for interaction detection. Finally, we continued the earlier study on epigenetics by analyzing HM and TF signals in genes with or without evidence of bidirectional transcription (Paper IV). We identified several HMs and TFs whose signals differ between unidirectional and bidirectional genes. Among these, the CTCF TF was shown to have a well-positioned peak 60-80 bp upstream of the transcription start site in unidirectional genes.
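A toy sketch of the Monte Carlo feature selection idea used in Papers I-II: train many decision trees on random feature subsets and data splits, then rank features by their accumulated, accuracy-weighted importance. This is a simplification of the published MCFS procedure, for illustration only; the run counts and subset size are assumptions:

```python
# Sketch: Monte Carlo feature selection by repeated subsampling of
# features and data, accumulating accuracy-weighted tree importances.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

def mcfs_sketch(X, y, n_runs=500, subset_size=10, rng_seed=0):
    rng = np.random.default_rng(rng_seed)
    n_features = X.shape[1]
    scores = np.zeros(n_features)
    for _ in range(n_runs):
        feats = rng.choice(n_features, size=subset_size, replace=False)
        Xtr, Xte, ytr, yte = train_test_split(
            X[:, feats], y, test_size=0.3,
            random_state=int(rng.integers(0, 1_000_000)))
        tree = DecisionTreeClassifier().fit(Xtr, ytr)
        # Weight each feature's importance by the tree's test accuracy.
        scores[feats] += tree.score(Xte, yte) * tree.feature_importances_
    return scores  # higher = more consistently informative across runs
```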
258

Greedy Representative Selection for Unsupervised Data Analysis

Helwa, Ahmed Khairy Farahat January 2012 (has links)
In recent years, the advance of information and communication technologies has allowed the storage and transfer of massive amounts of data. The availability of this overwhelming amount of data stimulates a growing need for fast and accurate algorithms to discover the useful information hidden in it. This need is even more acute for unsupervised data, which lacks information about the categories of the different instances. This dissertation addresses a crucial problem in unsupervised data analysis: the selection of representative instances and/or features from the data. The problem can be generally defined as the selection of the most representative columns of a data matrix, formally known as the Column Subset Selection (CSS) problem. Algorithms for column subset selection can be used directly for data analysis or as a pre-processing step to enhance other data mining algorithms, such as clustering. The contributions of this dissertation can be summarized as follows. First, a fast and accurate algorithm is proposed to greedily select a subset of columns of a data matrix such that the reconstruction error of the matrix based on the selected columns is minimized. The algorithm is based on a novel recursive formula for calculating the reconstruction error, which allows the development of time- and memory-efficient algorithms for greedy column subset selection. Experiments on real data sets demonstrate the effectiveness and efficiency of the proposed algorithms compared with state-of-the-art methods for column subset selection. Second, a kernel-based algorithm is presented for column subset selection. The algorithm greedily selects representative columns using information about their pairwise similarities. It can also calculate a Nyström approximation for a large kernel matrix based on the selected columns. Compared with different Nyström methods, the greedy Nyström method has been empirically shown to achieve significant improvements in approximating kernel matrices, with minimal overhead in run time. Third, two algorithms are proposed for fast approximate k-means and spectral clustering. These algorithms employ the greedy column subset selection method to embed all data points in the subspace of a few representative points, where the clustering is performed. The approximate algorithms run much faster than their exact counterparts while achieving comparable clustering performance. Fourth, a fast and accurate greedy algorithm for unsupervised feature selection is proposed, as an application of the greedy column subset selection method presented in this dissertation. The features are greedily selected such that the reconstruction error of the data matrix is minimized. Experiments on benchmark data sets show that the greedy algorithm outperforms state-of-the-art methods for unsupervised feature selection in the clustering task. Finally, the dissertation studies the connection between the column subset selection problem and other related problems in statistical data analysis, and presents a unified framework that allows the greedy algorithms presented in this dissertation to solve different related problems.
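A naive sketch of greedy column subset selection as described above: at each step, add the column that most reduces the Frobenius-norm reconstruction error of the matrix projected onto the selected columns. This version recomputes the projection from scratch at every step; the dissertation's recursive formula is precisely what removes that cost:

```python
# Sketch: greedy CSS minimizing || A - P_C A ||_F, where P_C projects
# onto the span of the selected columns C. O(k * n * cost(pinv)) here;
# the recursive-formula variant in the dissertation avoids the repeated
# pseudo-inverse computations.
import numpy as np

def greedy_css(A, k):
    n_cols = A.shape[1]
    selected = []
    for _ in range(k):
        best_j, best_err = None, np.inf
        for j in range(n_cols):
            if j in selected:
                continue
            C = A[:, selected + [j]]
            P = C @ np.linalg.pinv(C)              # projector onto span(C)
            err = np.linalg.norm(A - P @ A, "fro")  # reconstruction error
            if err < best_err:
                best_j, best_err = j, err
        selected.append(best_j)
    return selected
```

The same routine doubles as unsupervised feature selection (the fourth contribution above): selecting columns of the data matrix is selecting features.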
259

Optimum Polarization States & their Role in UWB Radar Identification of Targets

Faisal Aldhubaib Unknown Date (has links)
Although the use of polarimetry techniques for recognition of military and civilian targets is well established in the narrowband context, it is not yet fully established in a broadband sense, in contrast to planetary research. Combining polarimetry with broadband technology to form a robust signature and feature set is the main theme of this thesis. This is important, as basing the feature set on multiple types of signatures can increase the accuracy of the recognition process. In this thesis, the concept of radar target recognition based upon a polarization signature in a broadband context is examined. A proper UWB radar signal can excite the target's dominant resonances and, consequently, reveal information about the target's principal dimensions, while diversity in the polarization domain reveals information about the target's shape. The target dimensions are used to classify the target, and information about its shape is then used to identify it. Fused together and inferred from the target's characteristic polarization states, the polarization information at dominant resonant frequencies was verified to have both a physical interpretation and attributes (see section 3.4.3) related to the target's symmetry, linearity, and orientation. In addition, this type of information can detect the presence of major scattering mechanisms, such as the strong specular reflection from the flat ends of a cylinder. Throughout the thesis, simulated canonical targets with similar resonant frequencies were used, so identification of radar targets was based solely on polarization information. In this framework, the resonant frequencies were identified either as peaks in the frequency response, for simple or low-damping targets such as thin metal wires, or as the imaginary parts of the complex poles, for complex or high-damping targets with significant diameter and dielectric properties. The main contribution of this thesis therefore originates from the ability to integrate the optimum polarization states in a broadband context for improved target recognition performance. In this context, the spectral dispersion arising from the broad nature of the radar signal, the lack of accuracy in extracting the target resonances, the robustness of the polarization feature set, the representation of these states in the time domain, and the modelling of the feature set under spatial variation are among the important issues addressed, with several approaches presented to overcome them. The general approach considered a subset of "representative" times in the time domain or, correspondingly, "representative" frequencies in the frequency domain, and associated optimum polarization states with each member of the subset. The first approach, in chapter 3, represented polarization by a set of frequency bands associated with the target resonant frequencies. This description involved formulating a wideband scattering matrix to accommodate the broad nature of the signal, with an appropriate bandwidth selected for each resonance; good estimation of the optimum polarization states was achievable in this procedure even at low signal-to-noise ratios. The second approach, in chapter 4, extended the work of chapter 3 by modifying the optimum polarization states with their associated powers.
In addition, this approach included an identification algorithm based on the nearest-neighbour technique. To identify the target, the algorithm used the states at a set of resonant frequencies to give a majority vote. A comparison of the modified polarization states with the original states demonstrated a good improvement when the modified set is used. Generally, the resonance set is estimated more reliably in the time domain than in the frequency domain, especially for resonances well localized in time. The third approach, in chapter 5, therefore deals with the optimum states in the time domain, where the extension to a wideband context was possible by virtue of the polarization information embodied in the energies of the resonances. This procedure used a model-based signature that represents the target impulse response as a set of resonances. The relevant resonance parameters, in this case the resonant frequency and its associated energy, were extracted using the Matrix Pencil of Function algorithm. Again, this sparse representation is necessary to find descriptors of the target impulse response that are time-invariant and, at the same time, relate robustly to the target's physical characteristics. A simple target such as a long wire showed that polarization information contained in the target resonance energies can indeed reflect the target's physical attributes. In addition, for noise-corrupted signals and without any pulse averaging, the accuracy in estimating the optimum states was sufficiently good for signal-to-noise ratios above 20 dB; below this level, extraction of some members of the resonance set is not possible. Furthermore, using more complex wire models of aircraft, these time-based optimum states could distinguish between targets of similar dimensions with small structural differences, e.g. different wing dihedral angles. The results also showed that the dominant resonance set has members belonging to different structural sections of the target, so incorporating a time-based polarization set can capture the target's full physical characteristics. In the final procedure, a statistical kernel function estimated the feature set derived in chapter 3 as a function of aspect angle. After sampling the feature set over a wide range of aspect angles, a criterion based on the Bayesian error bisected the target's global aspect into smaller sectors to decrease the variance of the estimate and, subsequently, the probability of error. In doing so, discriminative features with an acceptable minimum probability of error were achievable. The minimum-probability-of-error criterion and the angular bisection of the target could separate the feature sets of two targets with similar resonances.
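A compact sketch of matrix-pencil pole extraction of the kind used in chapter 5 to obtain resonance parameters from a sampled impulse response. The exact algorithm and parameter choices in the thesis may differ; the pencil parameter and rank truncation below are assumptions:

```python
# Sketch: estimate M complex poles s_i of a sum-of-damped-exponentials
# signal y[n] via a matrix pencil on two shifted Hankel matrices.
# Re(s_i) gives damping, Im(s_i) the resonant (angular) frequency.
import numpy as np

def matrix_pencil(y, M, dt):
    N = len(y)
    L = N // 2                        # pencil parameter, M <= L <= N - M
    # Hankel data matrix, shape (N - L) x (L + 1).
    Y = np.array([y[i:i + L + 1] for i in range(N - L)])
    Y1, Y2 = Y[:, :-1], Y[:, 1:]      # shifted sub-matrices
    # Rank-M truncated pseudo-inverse of Y1 suppresses noise.
    U, s, Vh = np.linalg.svd(Y1, full_matrices=False)
    Y1_pinv = (Vh[:M].conj().T / s[:M]) @ U[:, :M].conj().T
    z = np.linalg.eigvals(Y1_pinv @ Y2)   # z_i = exp(s_i * dt)
    z = z[np.argsort(-np.abs(z))][:M]     # keep M dominant eigenvalues
    return np.log(z) / dt                 # complex poles s_i
```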
260

System Complexity Reduction via Feature Selection

January 2011 (has links)
abstract: This dissertation transforms a set of system complexity reduction problems into feature selection problems. Three systems are considered: classification based on association rules, network structure learning, and time series classification. Furthermore, two variable importance measures are proposed to reduce the feature selection bias in tree models. Associative classifiers can achieve high accuracy, but the combination of many rules is difficult to interpret. Rule condition subset selection (RCSS) methods for associative classification are considered. RCSS aims to prune the rule conditions into a subset via feature selection; the subset can then be summarized into rule-based classifiers. Experiments show that classifiers after RCSS can substantially improve classification interpretability without loss of accuracy. An ensemble feature selection method is proposed to learn Markov blankets for either discrete or continuous networks (without linear, Gaussian assumptions). The method is compared to a Bayesian local structure learning algorithm and to alternative feature selection methods on the causal structure learning problem. Feature selection is also used to enhance the interpretability of time series classification. Existing time series classification algorithms (such as nearest-neighbor with dynamic time warping measures) are accurate but difficult to interpret. This research leverages the time-ordering of the data to extract features and generates an effective and efficient classifier referred to as a time series forest (TSF). The computational complexity of TSF is only linear in the length of the time series, and interpretable features can be extracted. These features can be further reduced and summarized for even better interpretability. Lastly, two variable importance measures are proposed to reduce the feature selection bias in tree-based ensemble models. It is well known that bias can occur when predictor attributes have different numbers of values. Two methods are proposed to solve this problem: one uses an out-of-bag sampling method, called OOBForest, and the other, based on the new concept of a partial permutation test, is called pForest. Experimental results show that the existing methods are not always reliable for multi-valued predictors, while the proposed methods have advantages. / Dissertation/Thesis / Ph.D. Industrial Engineering 2011
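A minimal sketch of the interval-feature idea behind TSF: summarize random intervals of each series by mean, standard deviation, and slope, then train an off-the-shelf tree ensemble on those interpretable features. This simplification omits TSF's tailored tree construction, and the data below are random placeholders:

```python
# Sketch: interval features (mean, std, slope) for time series
# classification, in the spirit of the time series forest (TSF).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def interval_features(series_matrix, n_intervals=20, rng_seed=0):
    rng = np.random.default_rng(rng_seed)
    n_series, length = series_matrix.shape
    feats = []
    for _ in range(n_intervals):
        start = int(rng.integers(0, length - 3))
        end = int(rng.integers(start + 3, length + 1))
        seg = series_matrix[:, start:end]
        t = np.arange(end - start)
        slope = np.polyfit(t, seg.T, 1)[0]   # per-series linear trend
        feats += [seg.mean(axis=1), seg.std(axis=1), slope]
    return np.column_stack(feats)            # n_series x (3 * n_intervals)

# Usage with placeholder data:
rng = np.random.default_rng(1)
X_raw = rng.normal(size=(100, 60))   # 100 series of length 60
y = rng.integers(0, 2, size=100)
clf = RandomForestClassifier(n_estimators=200, random_state=1)
clf.fit(interval_features(X_raw), y)
```

Because every feature is a named statistic over a specific time interval, the resulting importances point directly at where in the series the signal lives, which is the interpretability argument made above.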
