About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
241

Identifying Plankton from Grayscale Silhouette Images

Kramer, Kurt A. 27 October 2005 (has links)
Utilizing continuous silhouette images of marine plankton produced by a device called SIPPER, developed by the Marine Sciences Department, individual plankton images were extracted, features were derived, and classification was performed. Plankton recognition experiments were performed on Support Vector Machine parameter tuning, Fourier descriptors, and feature selection. Several groups of features were implemented: moments, granulometric features, Fourier transform for texture, intensity histograms, Fourier descriptors for contour, convex hull, and Eigen ratio. The Fourier descriptors were implemented in three different flavors: sampling, averaging, and hybrid (a mix of sampling and averaging). The feature selection experiments utilized a modified wrapper approach, of which several variants were explored, including Best Case Next, Forward and Backward, and Beam Search. Feature selection significantly reduced the number of features required for processing while maintaining the same level of classification accuracy, which in turn reduced processing time for both training and classification.
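To make the wrapper idea concrete, below is a minimal sketch of forward wrapper-based feature selection around an SVM, in the spirit of the Forward search mentioned above. The kernel choice, cross-validation scoring, and stopping rule are illustrative assumptions, not the thesis's exact procedure.

```python
# Minimal sketch of wrapper-style forward feature selection with an SVM.
# Assumes a feature matrix X (n_samples x n_features) and labels y; the
# scoring and stopping rules here are illustrative, not the thesis's own.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def forward_select(X, y, max_features=10, cv=5):
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining and len(selected) < max_features:
        # Evaluate each candidate feature added to the current subset.
        scores = {
            f: cross_val_score(SVC(kernel="rbf"), X[:, selected + [f]], y, cv=cv).mean()
            for f in remaining
        }
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best_score:
            break  # no candidate improves accuracy: stop early
        best_score = scores[f_best]
        selected.append(f_best)
        remaining.remove(f_best)
    return selected, best_score
```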
242

Greedy Representative Selection for Unsupervised Data Analysis

Helwa, Ahmed Khairy Farahat January 2012 (has links)
In recent years, the advance of information and communication technologies has allowed the storage and transfer of massive amounts of data. The availability of this overwhelming amount of data stimulates a growing need to develop fast and accurate algorithms to discover useful information hidden in the data. This need is even more acute for unsupervised data, which lacks information about the categories of different instances. This dissertation addresses a crucial problem in unsupervised data analysis, which is the selection of representative instances and/or features from the data. This problem can be generally defined as the selection of the most representative columns of a data matrix, which is formally known as the Column Subset Selection (CSS) problem. Algorithms for column subset selection can be directly used for data analysis or as a pre-processing step to enhance other data mining algorithms, such as clustering. The contributions of this dissertation can be summarized as outlined below.

First, a fast and accurate algorithm is proposed to greedily select a subset of columns of a data matrix such that the reconstruction error of the matrix based on the subset of selected columns is minimized. The algorithm is based on a novel recursive formula for calculating the reconstruction error, which allows the development of time and memory-efficient algorithms for greedy column subset selection. Experiments on real data sets demonstrate the effectiveness and efficiency of the proposed algorithms in comparison to the state-of-the-art methods for column subset selection.

Second, a kernel-based algorithm is presented for column subset selection. The algorithm greedily selects representative columns using information about their pairwise similarities. The algorithm can also calculate a Nyström approximation for a large kernel matrix based on the subset of selected columns. In comparison to different Nyström methods, the greedy Nyström method has been empirically shown to achieve significant improvements in approximating kernel matrices, with minimum overhead in run time.

Third, two algorithms are proposed for fast approximate k-means and spectral clustering. These algorithms employ the greedy column subset selection method to embed all data points in the subspace of a few representative points, where the clustering is performed. The approximate algorithms run much faster than their exact counterparts while achieving comparable clustering performance.

Fourth, a fast and accurate greedy algorithm for unsupervised feature selection is proposed. The algorithm is an application of the greedy column subset selection method presented in this dissertation. Similarly, the features are greedily selected such that the reconstruction error of the data matrix is minimized. Experiments on benchmark data sets show that the greedy algorithm outperforms state-of-the-art methods for unsupervised feature selection in the clustering task.

Finally, the dissertation studies the connection between the column subset selection problem and other related problems in statistical data analysis, and it presents a unified framework which allows the use of the greedy algorithms presented in this dissertation to solve different related problems.
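A naive sketch of the greedy CSS objective follows. It recomputes the projection onto the selected columns from scratch at every step, whereas the dissertation's recursive error-update formula is precisely what makes the greedy method time- and memory-efficient; the brute-force version below only illustrates the objective being minimized, and the test matrix is synthetic.

```python
# Naive greedy Column Subset Selection: at each step, pick the column that
# most reduces the Frobenius reconstruction error ||A - P_S A||_F^2, where
# P_S projects onto the span of the selected columns S.
import numpy as np

def greedy_css(A, k):
    n = A.shape[1]
    selected = []
    for _ in range(k):
        best_err, best_j = np.inf, None
        for j in range(n):
            if j in selected:
                continue
            S = A[:, selected + [j]]
            # Projection of A onto span(S) via least squares.
            coeffs, *_ = np.linalg.lstsq(S, A, rcond=None)
            err = np.linalg.norm(A - S @ coeffs, "fro") ** 2
            if err < best_err:
                best_err, best_j = err, j
        selected.append(best_j)
    return selected

A = np.random.rand(50, 20)
cols = greedy_css(A, 5)  # indices of 5 representative columns
```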
243

Learning with Feed-forward Neural Networks: Three Schemes to Deal with the Bias/Variance Trade-off

Romero Merino, Enrique 30 November 2004 (has links)
In terms of the bias/variance decomposition, very flexible (i.e., complex) supervised machine learning systems may lead to unbiased estimators but with high variance. A rigid model, in contrast, may lead to small variance but high bias. There is a trade-off between the bias and variance contributions to the error, along which the optimal performance is achieved. In this work we present three schemes related to the control of the bias/variance decomposition for feed-forward neural networks (FNNs) with the (sometimes modified) quadratic loss function:

1. An algorithm for sequential approximation with FNNs, named Sequential Approximation with Optimal Coefficients and Interacting Frequencies (SAOCIF). Most of the sequential approximations proposed in the literature select the new frequencies (the non-linear weights) guided by the approximation of the residue of the partial approximation. We propose a sequential algorithm in which the new frequency is selected taking into account its interactions with the previously selected ones. The interactions are discovered by means of their optimal coefficients (the linear weights). A number of heuristics can be used to select the new frequencies. The aim is to achieve the same level of approximation with fewer hidden units than if we only tried to match the residue as closely as possible. In terms of the bias/variance decomposition, it becomes possible to obtain simpler models with the same bias. The idea behind SAOCIF can be extended to approximation in Hilbert spaces, maintaining orthogonal-like properties; in this case, the importance of the interacting frequencies lies in the expectation of increasing the rate of approximation. Experimental results show that the idea of interacting frequencies allows us to construct better approximations than matching the residue.

2. A study and comparison of different criteria for performing feature selection (FS) with multi-layer perceptrons (MLPs) and the Sequential Backward Selection (SBS) procedure within the wrapper approach. FS procedures control the bias/variance decomposition by means of the input dimension, establishing a clear connection with the curse of dimensionality. Several critical decision points are studied and compared: first, the stopping criterion; second, the data set on which the value of the loss function is measured; and finally, two ways of computing the saliency (i.e., the relative importance) of a feature: either train a network once and then temporarily remove every feature in turn, or train a different network with every feature temporarily removed. The experiments are performed for linear and non-linear models. Experimental results suggest that the increase in computational cost associated with retraining a different network for every temporarily removed feature before computing the saliency can be rewarded with a significant performance improvement, especially if non-linear models are used. Although this idea might seem very intuitive, it has hardly been used in practice. Regarding the data set on which the value of the loss function is measured, it seems clear that the SBS procedure for MLPs benefits from measuring the loss function on a validation set. A somewhat non-intuitive conclusion is drawn from the stopping criterion, where it can be seen that forcing overtraining may be as useful as early stopping.

3. A modification of the quadratic loss function for classification problems, inspired by Support Vector Machines (SVMs) and the AdaBoost algorithm, named the Weighted Quadratic Loss (WQL) function. The modification consists in weighting the contribution of every example to the total error. In the linearly separable case, the solution of the hard-margin SVM also minimizes the proposed loss function. The hardness of the resulting solution can be controlled, as in SVMs, so that this scheme may also be used in the non-linearly separable case. The error weighting proposed in WQL forces the training procedure to pay more attention to the points with a smaller margin, so variance is controlled by not attempting to overfit points that are already well classified. The model shares several properties with the SVM framework, with some additional advantages. On the one hand, the final solution is neither restricted to an architecture with as many hidden units as points (or support vectors) in the data set nor to the use of kernel functions, and the frequencies are not restricted to be a subset of the data set. On the other hand, it handles multiclass and multilabel problems in a natural way. Experimental results confirming these claims are shown.

Extensive experimental work has been carried out with the proposed schemes, including artificial data sets, well-known benchmark data sets, and two real-world problems from the natural language processing domain. In addition to widely used activation functions, such as the hyperbolic tangent or the Gaussian function, other activation functions have been tested; in particular, sinusoidal MLPs showed very good behavior. The experimental results can be considered very satisfactory. The schemes presented in this work have been found to be very competitive when compared to other existing schemes described in the literature. In addition, they can be combined with one another, since they deal with complementary aspects of the whole learning process.
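For the third scheme, the sketch below shows one plausible instantiation of the Weighted Quadratic Loss idea: each example's squared error is weighted by its margin so that points that are already well classified stop contributing. The cutoff rule and "hardness" parameter are illustrative assumptions, not the thesis's exact weighting formula.

```python
# Hedged sketch of a margin-dependent Weighted Quadratic Loss: examples whose
# margin already exceeds a hardness threshold get zero weight, so training
# concentrates on small-margin points instead of overfitting easy ones.
import numpy as np

def weighted_quadratic_loss(outputs, targets, hardness=1.0):
    # targets in {-1, +1}; outputs are raw network outputs.
    margin = outputs * targets
    # Zero weight once the margin reaches the hardness threshold (assumed rule).
    weights = np.where(margin >= hardness, 0.0, 1.0)
    return np.mean(weights * (outputs - targets) ** 2)
```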
244

Feature selection and artifact removal in sleep stage classification

Hapuarachchi, Pasan January 2006 (has links)
Electroencephalograms (EEG) are essential to the analysis of sleep disorders in patients. With the use of electroencephalograms, electro-oculograms (EOG), and electromyograms (EMG), doctors and EEG technicians can draw conclusions about the sleep patterns of patients. In particular, the classification of the sleep data into various stages, such as NREM I-IV, REM, and Awake, is extremely important. The EEG signal itself is highly sensitive to physiological and non-physiological artifacts. Trained human experts can accommodate these artifacts while analyzing the EEG signal.

However, if some of these artifacts are removed prior to analysis, their job becomes easier. Furthermore, one of the biggest motivations of our team's research is the construction of a portable device that can analyze the sleep data as they are being collected. For this task, the sleep data must be analyzed completely automatically in order to make the classifications.

The research presented in this thesis concerns itself with the denoising and feature selection aspects of the team's goals. Since humans are able to recognize artifacts and ignore them prior to classification, an automated system should have the same capabilities, or close to them. As such, the denoising step is performed to condition the data prior to any other stage of sleep stage classification. As mentioned before, the denoising step, by itself, is useful to human EEG technicians as well.

The denoising step in this research mainly addresses EOG artifacts and artifacts isolated to a single EEG channel, such as electrode pop artifacts. The first two algorithms use wavelets exclusively (BWDA and WDA), while the third algorithm is a mixture of wavelets and Independent Component Analysis (IDA). With the BWDA algorithm, determining consistent thresholds proved to be a difficult task. With the WDA algorithm, the performance was better, since the selection of the thresholds was more straightforward and there was more control over defining the duration of the artifacts. The IDA algorithm performed worse than the WDA algorithm, which could be due to the small number of measurement channels or to the automated sub-classifier used to select the denoised EEG signal from the set of ICA-demixed signals.

The feature selection stage is extremely important, as it selects the most pertinent features for making a particular classification. Without such a step, the classifier would have to process useless data, which might result in a poorer classification. Furthermore, unnecessary features take up valuable computer cycles; in a portable device, due to battery consumption, wasting computer cycles is not an option. The research presented in this thesis shows the importance of a systematic feature selection step in EEG classification. The feature selection step produced excellent results using a maximum of just 5 features. During automated classification this is extremely important, as the automated classifier only has to calculate 5 features for each given epoch.
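As a rough illustration of the wavelet-thresholding step (in the spirit of WDA, whose actual threshold-selection rules are not reproduced here), the sketch below soft-thresholds the detail coefficients of a single channel. It assumes PyWavelets is installed; the wavelet, decomposition level, and universal threshold are illustrative choices.

```python
# Minimal sketch of wavelet-threshold denoising for a single EEG channel:
# decompose, soft-threshold the detail coefficients, and reconstruct.
import numpy as np
import pywt  # PyWavelets

def wavelet_denoise(signal, wavelet="db4", level=5):
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    # Universal threshold, with noise scale estimated from the finest details.
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thresh = sigma * np.sqrt(2 * np.log(len(signal)))
    denoised = [coeffs[0]] + [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)[: len(signal)]
```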
245

Protein Tertiary Model Assessment Using Granular Machine Learning Techniques

Chida, Anjum A 21 March 2012 (has links)
The automatic prediction of a protein's three-dimensional structure from its amino acid sequence has become one of the most important and most researched fields in bioinformatics. Since models are predictions rather than experimental structures determined with known accuracy, it is vital to estimate model quality. We attempt to solve this problem using machine learning techniques and information from both the sequence and the structure of the protein. The goal is to build a machine that learns structures from the Protein Data Bank (PDB) and, when given a new model, predicts whether it belongs to the same class as the PDB structures (correct or incorrect protein models). Different subsets of the PDB are considered for evaluating the prediction potential of the machine learning methods. Here we show two such machines, one using SVMs (support vector machines) and another using fuzzy decision trees (FDT). Using a preliminary encoding style, the SVM achieved around 70% accuracy in protein model quality assessment, and an improved fuzzy decision tree (IFDT) reached above 80% accuracy. To reduce computational overhead, a multiprocessor environment and a basic feature selection method are used in the SVM-based machine learning algorithm. Next, an enhanced scheme is introduced using a new encoding style. In the new style, information such as the amino acid substitution matrix, polarity, secondary structure, and relative distances between alpha-carbon atoms is collected through spatial traversal of the 3D structure to form training vectors. This guarantees that the properties of alpha-carbon atoms that are close together in 3D space, and thus interacting, are used in vector formation. With the use of a fuzzy decision tree, we obtained a training accuracy of around 90%, a significant improvement over the previous encoding technique in both prediction accuracy and execution time. This outcome motivates continued exploration of effective machine learning algorithms for accurate protein model quality assessment. Finally, these machines are tested using CASP8 and CASP9 templates and compared with other CASP competitors, with promising results. We further discuss the importance of model quality assessment and other protein information that could be considered for the same purpose.
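The spatial-traversal encoding can be sketched roughly as follows: for each residue, properties of alpha-carbon atoms that are nearby in 3D space (not merely adjacent in sequence) are gathered into a fixed-length vector. The coordinates, property columns, and neighbor count below are hypothetical stand-ins for the thesis's much richer encoding.

```python
# Hypothetical sketch: encode each residue from properties of its spatial
# neighbors, so atoms close in 3D space (and thus interacting) share a vector.
import numpy as np

def encode_residues(ca_coords, properties, n_neighbors=8):
    vectors = []
    for i, c in enumerate(ca_coords):
        d = np.linalg.norm(ca_coords - c, axis=1)
        nearest = np.argsort(d)[1 : n_neighbors + 1]  # skip the residue itself
        # Concatenate neighbor distances with their per-residue properties.
        vectors.append(np.concatenate([d[nearest], properties[nearest].ravel()]))
    return np.vstack(vectors)

coords = np.random.rand(100, 3) * 50  # hypothetical CA coordinates (angstroms)
props = np.random.rand(100, 4)        # e.g., polarity / secondary-structure codes
X = encode_residues(coords, props)    # one training vector per residue
```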
247

Speech Analysis and Cognition Using Category-Dependent Features in a Model of the Central Auditory System

Jeon, Woojay 13 November 2006 (has links)
It is well known that machines perform far worse than humans in recognizing speech and audio, especially in noisy environments. One method of addressing this issue of robustness is to study physiological models of the human auditory system and to adopt some of their characteristics in computers. As a first step in studying the potential benefits of an elaborate computational model of the primary auditory cortex (A1) in the central auditory system, we qualitatively and quantitatively validate the model under existing speech processing and recognition methodology. Next, we develop new insights and ideas on how to interpret the model, and reveal some of the advantages of its dimension expansion that may potentially be used to improve existing speech processing and recognition methods. This is done by statistically analyzing the neural responses to various classes of speech signals and forming empirical conjectures on how cognitive information is encoded in a category-dependent manner. We also establish a theoretical framework that shows how noise and signal can be separated in the dimension-expanded cortical space. Finally, we develop new feature selection and pattern recognition methods to exploit the category-dependent encoding of noise-robust cognitive information in the cortical response. Category-dependent features are proposed as features that "specialize" in discriminating specific sets of classes, and as a natural way of incorporating them into a Bayesian decision framework, we propose methods to construct hierarchical classifiers that perform decisions in a two-stage process. Phoneme classification tasks using the TIMIT speech database are performed to quantitatively validate all developments in this work, and the results encourage future work on exploiting high-dimensional data with category- (or class-)dependent features for improved classification or detection.
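The two-stage idea can be sketched as follows: a first-stage classifier assigns a broad category, then a second-stage classifier trained only on that category's classes, with its own feature subset, makes the final decision. The classifier choice, category groupings, and per-category feature subsets are placeholders, and the sketch assumes each broad category contains at least two fine classes.

```python
# Sketch of a two-stage hierarchical classifier with category-dependent features.
import numpy as np
from sklearn.linear_model import LogisticRegression

class TwoStageClassifier:
    def __init__(self, category_of, features_for):
        self.category_of = category_of    # maps fine label -> broad category
        self.features_for = features_for  # maps category -> feature indices
        self.stage1 = LogisticRegression(max_iter=1000)
        self.stage2 = {}

    def fit(self, X, y):
        cats = np.array([self.category_of[c] for c in y])
        self.stage1.fit(X, cats)  # stage 1: predict the broad category
        for cat in set(cats):
            mask = cats == cat
            # Stage 2: a specialist per category, using its own feature subset
            # (assumes >= 2 fine classes per category).
            clf = LogisticRegression(max_iter=1000)
            clf.fit(X[mask][:, self.features_for[cat]], y[mask])
            self.stage2[cat] = clf
        return self

    def predict(self, X):
        cats = self.stage1.predict(X)
        return np.array([
            self.stage2[c].predict(x[self.features_for[c]].reshape(1, -1))[0]
            for c, x in zip(cats, X)
        ])
```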
248

Biomarker discovery and clinical outcome prediction using knowledge-based bioinformatics

Phan, John H. 02 April 2009 (has links)
Advances in high-throughput genomic and proteomic technology have led to a growing interest in cancer biomarkers. These biomarkers can potentially improve the accuracy of cancer subtype prediction and, subsequently, the success of therapy. However, identification of statistically and biologically relevant biomarkers from high-throughput data can be unreliable due to the nature of the data: high technical variability, small sample size, and high dimensionality. Due to the lack of available training samples, data-driven machine learning methods are often insufficient without the support of knowledge-based algorithms. We investigate the benefits of using knowledge-based algorithms to solve clinical prediction problems. Because we are interested in identifying biomarkers that are also feasible in clinical prediction models, we focus on two analytical components: feature selection and predictive model selection. In addition to data variance, we must also consider the variance of analytical methods: there are many existing feature selection algorithms, each of which may produce different results, and it is not trivial to identify model parameters that maximize the sensitivity and specificity of clinical prediction. Thus, we introduce a method that uses independently validated biological knowledge to reduce the space of relevant feature selection algorithms and to improve the reliability of clinical predictors. Finally, we implement several functions of this knowledge-based method as a web-based, user-friendly, and standards-compatible software application.
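One way to picture the knowledge-based step is as a re-ranking: candidate biomarkers proposed by several feature-selection algorithms are scored by consensus and then boosted when supported by independently validated knowledge. The scoring and boosting rules below are illustrative assumptions, not the dissertation's method, and `known_genes` is a hypothetical knowledge set.

```python
# Hedged sketch: aggregate rankings from several feature-selection algorithms,
# then up-weight candidates supported by prior biological knowledge.
def rank_with_knowledge(rankings, known_genes, top_k=50):
    # rankings: list of ordered gene lists, one per selection algorithm.
    scores = {}
    for ranked in rankings:
        for pos, gene in enumerate(ranked[:top_k]):
            scores[gene] = scores.get(gene, 0.0) + (top_k - pos)
    # Boost genes supported by independently validated knowledge (assumed rule).
    for gene in scores:
        if gene in known_genes:
            scores[gene] *= 2.0
    return sorted(scores, key=scores.get, reverse=True)
```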
249

Computer-aided diagnosis for mammographic microcalcification clusters

Tembey, Mugdha. January 2003 (has links)
Breast cancer is the second leading cause of cancer deaths among women in the United States, and microcalcification clusters are one of the most important indicators of breast disease. Computer methodologies help in the detection and differentiation between benign and malignant lesions and have the potential to improve radiologists' performance and breast cancer diagnosis significantly. A Computer-Aided Diagnosis (CAD-Dx) algorithm had previously been developed to assist radiologists in the diagnosis of mammographic clusters of calcifications, with the modules: (a) detection of all calcification-like areas, (b) false-positive reduction and segmentation of the detected calcifications, (c) selection of morphological and distributional features, and (d) classification of the clusters. Classification was based on an artificial neural network (ANN) with 14 input features and assigned a likelihood of malignancy to each cluster.

The purpose of this work was threefold: (a) optimize the existing algorithm and test it on a large database, (b) rank classification features and select the best feature set, and (c) determine the impact of single- and two-view feature estimation on classification and feature ranking. Classification performance was evaluated with the NevProp4 artificial neural network trained with the leave-one-out resampling technique. Sequential forward selection was used for feature selection and ranking. Mammograms from 136 patients, containing single or two views of a breast with a calcification cluster, were digitized at 60 microns and 16 bits per pixel. 260 regions of interest (ROIs) centered on calcification clusters were defined to build the single-view dataset. 100 of the 136 patients had a two-view mammogram, which yielded 202 ROIs that formed the two-view dataset. Classification and feature selection were evaluated with both of these datasets.

To decide on the optimal features for two-view feature estimation, several combinations of CC and MLO view features were attempted. On the single-view dataset the classifier achieved an AZ of 0.8891 with 88% sensitivity and 77% specificity at an operating point of 0.4; 12 features were selected as the most important. With the two-view dataset, the classifier achieved a higher performance, with an AZ of 0.9580 and sensitivity and specificity of 98% and 80%, respectively, at an operating point of 0.4; 10 features were selected as the most important.
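To relate the reported numbers: given per-cluster malignancy likelihoods from a classifier, sensitivity and specificity are read off at a chosen operating point (0.4 in the abstract) and AZ is the area under the ROC curve. The sketch below computes these quantities; the scores and labels are synthetic, not the thesis's data.

```python
# Sketch: sensitivity/specificity at an operating point, plus AZ (ROC AUC).
import numpy as np
from sklearn.metrics import roc_auc_score

def operating_point_stats(scores, labels, threshold=0.4):
    preds = scores >= threshold
    tp = np.sum(preds & (labels == 1))   # malignant clusters flagged
    tn = np.sum(~preds & (labels == 0))  # benign clusters passed
    sensitivity = tp / np.sum(labels == 1)
    specificity = tn / np.sum(labels == 0)
    return sensitivity, specificity, roc_auc_score(labels, scores)

scores = np.array([0.9, 0.8, 0.35, 0.6, 0.2, 0.1])  # synthetic likelihoods
labels = np.array([1, 1, 1, 0, 0, 0])
sens, spec, az = operating_point_stats(scores, labels)
```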
250

System for Identifying Plankton from the SIPPER Instrument Platform

Kramer, Kurt A. 29 October 2010 (has links)
Plankton imaging systems such as SIPPER produce a large quantity of data in the form of plankton images from a variety of classes. A system known as PICES was developed to quickly extract, classify, and manage the millions of images produced from a single one-week research cruise. A new, fast technique for parameter tuning and feature selection for Support Vector Machines using wrappers was created. This technique allows for faster feature selection while maintaining, and sometimes improving, classification accuracy. It also gives the user greater flexibility in managing class contents in existing training libraries. Support vector machines are binary classifiers that can implement multi-class classification by creating a classifier for each possible pair of classes or by using a one-versus-all strategy for each class. Conventional feature selection searches for a single set of features to be used by all of the binary classifiers. This ignores the fact that features that are good discriminators for two particular classes might not do well for other class combinations; as a result, the feature selection process may not include these features in the common set used by all the support vector machines. It is shown through experimentation that by selecting features for each binary class combination, overall classification accuracy can be improved and the time required for training a multi-class support vector machine can be reduced. Another benefit of this approach is that significantly less time is required for feature selection when additional classes are added to the training data: the features selected for the existing class combinations are still valid, so feature selection only needs to be run for the newly added combinations. This work resulted in PICES, a user-friendly, GUI-based system that aids in the classification and management of over 55 million plankton images split amongst 180 classes. PICES embodies an improved means of performing wrapper-based feature selection that creates classifiers that train faster and are just as accurate, and sometimes more accurate, while reducing the feature selection time.
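The pairwise scaffolding can be sketched as follows: each one-versus-one SVM gets its own feature subset, selected using only the two classes it separates, so adding a new class only requires selecting features for the new pairs. The `select_features` argument stands in for any wrapper routine (for instance, the forward-selection sketch shown earlier); this scaffolding is illustrative, not PICES's implementation.

```python
# Sketch: per-pair feature selection for a one-versus-one multi-class SVM,
# with majority-vote prediction across the pairwise classifiers.
from itertools import combinations
import numpy as np
from sklearn.svm import SVC

def train_pairwise(X, y, select_features):
    models = {}
    for a, b in combinations(np.unique(y), 2):
        mask = (y == a) | (y == b)
        feats = select_features(X[mask], y[mask])  # per-pair feature subset
        clf = SVC(kernel="rbf").fit(X[mask][:, feats], y[mask])
        models[(a, b)] = (feats, clf)
    return models

def predict_vote(models, x):
    votes = {}
    for (a, b), (feats, clf) in models.items():
        p = clf.predict(x[feats].reshape(1, -1))[0]
        votes[p] = votes.get(p, 0) + 1
    return max(votes, key=votes.get)  # ties broken arbitrarily
```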
