1 |
A novel method for finding small highly discriminant gene sets. Gardner, Jason H., 15 November 2004.
In a normal microarray classification problem there will be many genes, on the order of thousands, and few samples, on the order of tens. This necessitates a massive feature space reduction before classification can take place. While much time and effort has gone into evaluating and comparing the performance of different classifiers, less thought has been spent on the problem of efficient feature space reduction.
The microarray classification literature contains several widely used heuristic feature reduction algorithms that do find small feature subsets to classify over. These methods work in a broad sense, but we find that they often require too much computation, find overly large gene sets, or do not generalize properly. Therefore, we believe that a systematic study of feature reduction, as it relates to microarray classification, is in order.
In this thesis we review current feature space reduction algorithms and propose a new, mixed-model algorithm. This mixed-model algorithm combines the best aspects of filter algorithms with the best aspects of wrapper algorithms to find very small yet highly discriminant gene sets. We also discuss methods to evaluate alternate, ambiguous gene sets. Applying the new mixed-model algorithm to several published datasets, we find that it outperforms current gene-finding methods.
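To make the filter/wrapper distinction concrete, the sketch below shows one generic way such a hybrid could be assembled with scikit-learn on synthetic data. It is not the algorithm developed in the thesis; the univariate filter statistic, the subset sizes, and the classifier are illustrative assumptions only.

```python
# Hypothetical filter-then-wrapper hybrid gene selection (not the thesis's algorithm).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, SequentialFeatureSelector
from sklearn.svm import LinearSVC

# Toy stand-in for a microarray: 40 samples, 2000 genes.
X, y = make_classification(n_samples=40, n_features=2000, n_informative=10, random_state=0)

# Filter step: keep the 50 genes with the highest univariate F-score.
filt = SelectKBest(f_classif, k=50).fit(X, y)
X_filtered = filt.transform(X)

# Wrapper step: greedily grow a very small gene set using cross-validated classifier accuracy.
clf = LinearSVC(dual=False)
wrapper = SequentialFeatureSelector(clf, n_features_to_select=3, direction="forward", cv=5)
wrapper.fit(X_filtered, y)

# Map the wrapper's choices back to the original gene indices.
selected = filt.get_support(indices=True)[wrapper.get_support(indices=True)]
print("Selected gene indices:", selected)
```

The filter step makes the wrapper search tractable; the wrapper step then spends extra computation to arrive at a much smaller, classifier-tuned gene set.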
|
2 |
A Comparison of Unsupervised Methods for DNA Microarray Leukemia Data. Harness, Denise, 05 April 2018.
Advancements in DNA microarray sequencing have created the need for sophisticated machine learning algorithms and feature selection methods. Probabilistic graphical models, in particular, have been used to identify whether microarrays or genes cluster together in groups of individuals having a similar diagnosis. These clusters of genes are informative, but can be misleading when every gene is used in the calculation. First, feature reduction techniques are explored; however, the size and nature of the data prevent traditional techniques from working efficiently. Our method uses the partial correlations between the features to create a precision matrix and predict which associations between genes are most important for predicting a leukemia diagnosis. This technique reduces the number of genes to a fraction of the original. The partial correlations are then extended into a spectral clustering approach. In particular, a variety of Laplacian matrices are generated from the network of connections between features, and each implies a graphical network model of gene interconnectivity. Various edge- and vertex-weighted Laplacians are considered and compared against each other in a probabilistic graphical modeling approach. The resulting multivariate Gaussian distributed clusters are subsequently analyzed to determine which genes are activated in a patient with leukemia. Finally, these results are compared against other feature engineering approaches to assess accuracy on the leukemia data set. The initial results show that the partial correlation approach to feature selection predicts the diagnosis of a leukemia patient with almost the same accuracy as a machine learning algorithm applied to the full set of genes. More calculations of the precision matrix are needed to ensure the set of most important genes is correct. Additionally, more machine learning algorithms will be implemented on the full and reduced data sets to further validate the current prediction accuracy of the partial correlation method.
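As a rough illustration of the partial-correlation idea (not the author's code), the sketch below estimates a precision matrix over a toy gene expression matrix, converts it to partial correlations, and spectrally clusters the resulting gene graph. The graphical-lasso estimator, the cluster count, and the data dimensions are placeholder assumptions.

```python
# Sketch: precision matrix -> partial correlations -> spectral clustering of genes.
import numpy as np
from sklearn.covariance import GraphicalLassoCV
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(72, 50))          # 72 samples x 50 pre-screened genes (toy data)

model = GraphicalLassoCV().fit(X)
P = model.precision_                   # estimated precision (inverse covariance) matrix

# Convert the precision matrix to partial correlations: rho_ij = -P_ij / sqrt(P_ii * P_jj).
d = np.sqrt(np.diag(P))
partial_corr = -P / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)

# Build an affinity matrix from absolute partial correlations and cluster the gene graph;
# SpectralClustering forms a graph Laplacian from this affinity internally.
affinity = np.abs(partial_corr)
labels = SpectralClustering(n_clusters=4, affinity="precomputed", random_state=0).fit_predict(affinity)
print("Genes per cluster:", np.bincount(labels))
```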
|
3 |
Improving the Performance of a Hybrid Classification Method Using a Parallel Algorithm and a Novel Data Reduction Technique. Phillips, Rhonda D., 21 August 2007.
This thesis presents both a shared memory parallel version of the hybrid classification algorithm IGSCR (iterative guided spectral class rejection) and a novel data reduction technique that can be used in conjunction with pIGSCR (parallel IGSCR). The parallel algorithm is motivated by a demonstrated need for more computing power, driven by the increasing size of remote sensing datasets due to higher resolution sensors, larger study regions, and the like. Even with a fast algorithm such as pIGSCR, reducing the dimension of a dataset is desirable in order to decrease processing time further and possibly improve overall classification accuracy.
pIGSCR was developed to produce fast and portable code using Fortran 95, OpenMP, and the Hierarchical Data Format version 5 (HDF5) and accompanying data access library. The applicability of the faster pIGSCR algorithm is demonstrated by classifying Landsat data covering most of Virginia, USA into forest and non-forest classes with approximately 90 percent accuracy. Parallel results are given using the SGI Altix 3300 shared memory computer and the SGI Altix 3700 with as many as 64 processors reaching speedups of almost 77. This fast algorithm allows an analyst to perform and assess multiple classifications to refine parameters. As an example, pIGSCR was used for a factorial analysis consisting of 42 classifications of a 1.2 gigabyte image to select the number of initial classes (70) and class purity (70%) used for the remaining two images.
A feature selection or reduction method may be appropriate for a specific classification method depending on the properties and training required for that classification method, or an alternative band selection method may be derived based on the classification method itself. This thesis introduces a feature reduction method based on the singular value decomposition (SVD). This feature reduction technique was applied to training data from two multitemporal datasets of Landsat TM/ETM+ imagery acquired over forested areas in Virginia, USA and Rondonia, Brazil. Subsequent parallel iterative guided spectral class rejection (pIGSCR) forest/non-forest classifications were performed to determine the quality of the feature reduction. The classifications of the Virginia data were five times faster using SVD-based feature reduction without affecting the classification accuracy. Feature reduction using the SVD was also compared to feature reduction using principal component analysis (PCA). The highest average accuracies for the Virginia dataset (88.34%) and for the Amazon dataset (93.31%) were achieved using the SVD. The results presented here indicate that SVD-based feature reduction can produce statistically significantly better classifications than PCA. / Master of Science
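A minimal sketch of SVD-based feature reduction on a pixel-by-band training matrix follows; the thesis's actual transform and its comparison with PCA may differ in details such as centering and weighting, and the data here are synthetic stand-ins.

```python
# Illustrative SVD-based band reduction (toy data; PCA would additionally mean-center
# the bands before the decomposition, which is the main practical difference here).
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.normal(size=(5000, 12))   # toy stand-in: 5000 training pixels x 12 bands
X_test = rng.normal(size=(2000, 12))

k = 4                                   # reduced number of features (assumed value)
U, s, Vt = np.linalg.svd(X_train, full_matrices=False)
W = Vt[:k].T                            # 12 x k projection onto the top right singular vectors

X_train_red = X_train @ W               # reduced features used to train the classifier
X_test_red = X_test @ W                 # same projection applied to the data being classified
print(X_train_red.shape, X_test_red.shape)
```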
|
4 |
Statistical Analysis of High-Dimensional Gene Expression Data. Justin Zhu, date unknown.
The use of diagnostic rules based on microarray gene expression data has received wide attention in bioinformatics research. In order to form diagnostic rules, statistical techniques are needed to form classifiers with estimates of their associated error rates, and to correct for any selection biases in those estimates. There is also the associated problem of identifying the genes most useful in making these predictions. Traditional statistical techniques require the number of samples to be much larger than the number of features, whereas gene expression datasets usually have a small number of samples but a large number of features. In this thesis, some new techniques are developed, and traditional techniques are used innovatively after appropriate modification, to analyse gene expression data.

Classification: We first consider classifying tissue samples based on the gene expression data. We employ external cross-validation with recursive feature elimination to provide classification error rates for tissue samples with different numbers of genes. The techniques are implemented as an R package, BCC (Bias-Corrected Classification), and are applied to a number of real-world datasets. The results demonstrate that the error rates vary with different numbers of genes; for each dataset, there is usually an optimal number of genes that returns the lowest cross-validation error rate.

Detecting Differentially Expressed Genes: We then consider the detection of genes that are differentially expressed in a given number of classes. As this problem concerns the selection of significant genes from a large pool of candidate genes, it needs to be carried out within the framework of multiple hypothesis testing. The focus is on the use of mixture models to handle the multiplicity issue. The mixture model approach provides a framework for estimating the prior probability that a gene is not differentially expressed, and it estimates various error rates, including the FDR (False Discovery Rate) and the FNR (False Negative Rate). We also develop a method for selecting biomarker genes for classification, based on their repeatability among the highly differentially expressed genes in cross-validation trials; this latter method incorporates both gene selection and classification.

Selection Bias: When forming a prediction rule on the basis of a small number of classified tissue samples, some form of feature (gene) selection is usually adopted. This is a necessary step if the number of features is high. As the subset of genes used in the final form of the rule has not been randomly selected, but rather chosen according to criteria designed to reflect the predictive power of the rule, there will be a selection bias inherent in estimates of the rule's error rates if care is not taken. Various situations are presented where selection biases arise in the formation of a prediction rule and where there is a consequent need to correct for them. Three types of selection bias are analysed: the bias from not using external cross-validation, the bias from not working with the full set of genes, and the bias from optimizing the classification error rate over a number of subsets obtained according to a selection method. Here we mostly employ the support vector machine with recursive feature elimination. This thesis includes a description of cross-validation schemes that are able to correct for these selection biases.
Furthermore, we examine the bias incurred when the predicted rather than the true outcomes are used to define the class labels in forming and evaluating the performance of the discriminant rule.

Case Study: We present a case study using the breast cancer datasets. In the study, we compare the 70 highly differentially expressed genes proposed by van 't Veer and colleagues against the set of genes selected using our repeatability method. The results demonstrate that there is more than one set of biomarker genes. We also examine the selection biases that may exist when analysing this dataset, and these biases are demonstrated to be substantial.
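For readers unfamiliar with the setup, the sketch below illustrates external cross-validation with SVM-based recursive feature elimination on synthetic data: the gene selection is redone inside every fold, so the estimated error rate avoids the first type of selection bias discussed above. The BCC package itself is an R package; this Python sketch only mirrors the idea, and the subset sizes are arbitrary assumptions.

```python
# External cross-validation sketch: SVM-RFE runs inside each fold via a Pipeline,
# so no information from the held-out fold leaks into the gene selection.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=60, n_features=500, n_informative=15, random_state=0)

pipe = Pipeline([
    ("rfe", RFE(LinearSVC(dual=False), n_features_to_select=20, step=0.5)),
    ("svm", LinearSVC(dual=False)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
errors = 1.0 - cross_val_score(pipe, X, y, cv=cv)   # selection happens inside each fold
print("Externally cross-validated error rate: %.3f" % errors.mean())
```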
|
5 |
A Self-Constructing Fuzzy Feature Clustering for Text Categorization. Liu, Ren-jia, 26 August 2009.
Feature clustering is a powerful method to reduce the dimensionality of feature vectors for text classification. In this paper, we propose a fuzzy similarity-based self-constructing algorithm for feature clustering. The words in the feature vector of a document set are grouped into clusters based on a similarity test, so that words similar to each other fall into the same cluster. Each cluster is characterized by a membership function with a statistical mean and deviation. When all the words have been fed in, a desired number of clusters is formed automatically. We then have one extracted feature for each cluster, and the extracted feature corresponding to a cluster is a weighted combination of the words contained in that cluster.
With this algorithm, the derived membership functions closely match and properly describe the real distribution of the training data. Moreover, the user need not specify the number of extracted features in advance, so trial-and-error for determining the appropriate number of extracted features can be avoided. The 20 Newsgroups data set and the Cade12 web directory serve as our experimental data, and we adopt the support vector machine to classify the documents. Experimental results show that our method runs faster and obtains better extracted features than other methods.
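The sketch below gives a highly simplified, hypothetical version of the self-constructing idea: word vectors are assigned incrementally to the best-matching cluster when their Gaussian membership exceeds a threshold, and seed a new cluster otherwise. The update rules, the threshold rho, and the initial deviation sigma0 are illustrative assumptions and are not taken from the paper.

```python
# Toy self-constructing fuzzy clustering of word vectors (illustrative only).
import numpy as np

def membership(x, mean, std):
    # Product of per-dimension Gaussian memberships, as in a fuzzy similarity test.
    return float(np.prod(np.exp(-((x - mean) / std) ** 2)))

def self_constructing_clustering(words, rho=0.3, sigma0=0.25):
    clusters = []  # each cluster: members, mean, per-dimension deviation
    for x in words:
        sims = [membership(x, c["mean"], c["std"]) for c in clusters]
        if not clusters or max(sims) < rho:
            # No cluster is similar enough: start a new one.
            clusters.append({"members": [x], "mean": x.copy(), "std": np.full_like(x, sigma0)})
        else:
            # Join the most similar cluster and update its statistics.
            c = clusters[int(np.argmax(sims))]
            c["members"].append(x)
            m = np.vstack(c["members"])
            c["mean"] = m.mean(axis=0)
            c["std"] = m.std(axis=0) + sigma0   # keep the deviation strictly positive
    return clusters

rng = np.random.default_rng(0)
# Toy "word" vectors, e.g. class-conditional probabilities of each word over 3 classes.
words = rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=200)
clusters = self_constructing_clustering(words)
print("Number of extracted features (clusters):", len(clusters))
```

Each resulting cluster would then yield one extracted feature as a weighted combination of the words it contains.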
|
6 |
Blood Bowl 2 race clustering by different playstyles. Ivanauskas, Tadas, January 2020.
The number of features and the number of instances have a significant impact on computation time and memory footprint for machine learning algorithms. Reducing the number of features reduces the memory footprint and computation time while allowing the number of instances to remain constant. This thesis investigates feature reduction by clustering. Nine clustering algorithms and three classification algorithms were used to investigate whether categories obtained by clustering algorithms can replace the original attributes in the data set with minimal impact on classification accuracy. The video game Blood Bowl 2 was chosen as the study subject, and Blood Bowl 2 match data was obtained from a public database. The results show that the cluster labels cannot be used as a substitute for the original features, as the substitution had no effect on the classifications. Furthermore, the cluster labels had relatively low weight values and would be excluded by activation functions in most algorithms.
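The experiment design can be pictured with the toy sketch below (synthetic data rather than Blood Bowl 2 match records): one classifier is scored on the raw attributes and another on a one-hot encoding of a k-means cluster label derived from those attributes. The algorithm choices and the cluster count are placeholder assumptions.

```python
# Sketch of the substitution experiment: raw features vs. cluster-label-only features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, random_state=0)

# Baseline: classify on the original attributes.
base = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

# Substitution: replace all attributes with a single k-means cluster label (one-hot encoded).
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)
X_labels = np.eye(8)[labels]
subst = cross_val_score(LogisticRegression(max_iter=1000), X_labels, y, cv=5).mean()

print(f"original features: {base:.3f}  cluster-label substitute: {subst:.3f}")
```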
|
7 |
Entwicklung eines Verfahrens zur Mustererkennung für die Analyse von Gasen mittels Impedanzspektroskopie (Development of a pattern recognition method for the analysis of gases by impedance spectroscopy). Li, Fei, 12 February 2019.
1. The aim of this work was the development of pattern recognition methods for the automatic classification of gases. To achieve this goal, the feature reduction method of parameter estimation via adaptive simulated annealing (ASA-PE) and a committee machine (CM) for classification were developed.
2. Using PEDOT:PSS sensors, NH3 and NO2 were measured at different concentrations by means of impedance spectroscopy. The recorded measurement data were reduced by ASA-PE, complex principal component analysis (CPCA), and discriminant analysis via support vectors (SVDA).
3. The comparison of the feature extraction methods shows that ASA-PE, the method newly developed in this work, delivers a reliable segmentation result.
4. The comparison between ASA-PE and ZView shows that ASA-PE is a reliable method for automated gas analysis. With two-dimensional features, however, there is a region in which the classes cluster together, which can confuse the evaluation by CPCA and SVDA. This problem can be solved by increasing the number of features.
5. Six classification methods were examined and compared: distance-weighted k-nearest-neighbour classification (DW-kNN), the multilayer perceptron (MLP), the support vector machine (SVM), the CM, the CM without MLP, and the CM with distance control and AAi filter. To train the classification methods, all feature reduction results from CPCA, SVDA, and ASA-PE were split into training and test data.
6. The results show that the combination of one-against-all SVM (OAA-SVM) and ASA-PE delivers the best recognition rates. With 200 training data sets, a recognition rate of up to 99.5% is achieved. With this combination, however, only 8 classes can be determined, without identification of unknown classes.
7. When the MLP is removed from the CM, the results of the CM improve slightly. Using a 6-sigma criterion, the CM without MLP shows a good recognition rate for unknown gases while the recognition rate for known gases remains at a satisfactory level.
8. The scatter of the ASA-PE features leads to poor separation between known and unknown gases. In contrast, the combination of the CM without MLP and CPCA shows good separation in this case.
Abstract
Acknowledgements
Table of contents
Abbreviations
1 Introduction
1.1 Introduction
1.2 Developments in gas sensors
1.2.1 Advances in materials and measurement methods
1.2.2 Advances in pattern recognition methods
1.3 Motivation
1.4 Structure of the thesis
2 Methods for gas analysis
2.1 Measurement methods
2.1.1 Impedance spectroscopy as a detection method
2.1.1.1 Definition of impedance
2.1.1.2 Components of the electrical model
2.1.2 Optical methods
2.1.3 Electrochemical methods
2.2 Feature recognition
2.2.1 Feature reduction
2.2.1.1 Complex principal component analysis (CPCA)
2.2.1.2 Kernel discriminant analysis via support vectors (SVDA)
2.2.2 Classification methods
2.2.2.1 Distance-weighted k-nearest-neighbour classification (DW-kNN)
2.2.2.2 Multilayer perceptron (MLP)
2.2.2.3 Support vector machine (SVM)
3 Pattern recognition methods developed in this work
3.1 Parameter estimation via adaptive simulated annealing (ASA-PE)
3.1.1 General impedance spectroscopy model of a gas sensor
3.1.2 Parameter estimation
3.1.3 Optimization methods
3.2 Committee machine
4 Application example
4.1 Experiment with a PEDOT:PSS gas sensor
4.1.1 Sensor design and simplified sensor model
4.2 Experimental results
4.2.1 Measurement setup and experimental procedure
4.2.2 Preparation for the measurements
4.2.3 Execution of the measurements
4.2.4 Error analysis
4.2.5 Gas sensor measurement results
4.3 Feature reduction results
4.3.1 CPCA and SVDA
4.3.2 Parameter estimation via adaptive simulated annealing (ASA-PE)
4.4 Classification results
4.4.1 Gas identification results using training and test sets
4.4.1.1 DW-kNN
4.4.1.2 MLP
4.4.1.3 OAO-SVM
4.4.1.4 OAA-SVM
4.4.1.5 Committee machine
4.4.1.6 CM without MLP
4.4.1.7 CM with AAi filter
4.4.2 Dependence of the classification results on the number of training samples
5 Summary and outlook
5.1 Summary
5.2 Outlook
List of figures
List of equations
References
|
8 |
Data Mining Methods For Malware Detection. Siddiqui, Muazzam, 01 January 2008.
This research investigates the use of data mining methods for malware (malicious program) detection and proposes a framework as an alternative to traditional signature-based detection methods. Traditional approaches that use signatures to detect malicious programs fail for new and unknown malware, for which signatures are not available. We present a data mining framework to detect malicious programs. We collected, analyzed, and processed several thousand malicious and clean programs to find the best features and build models that can classify a given program as malware or clean. Our research is closely related to information retrieval and classification techniques and borrows a number of ideas from those fields. We used a vector space model to represent the programs in our collection. Our data mining framework includes two separate and distinct classes of experiments. The first are supervised learning experiments that used a dataset consisting of several thousand malicious and clean program samples to train, validate, and test an array of classifiers. In the second class of experiments, we propose using sequential association analysis for feature selection and automatic signature extraction. With our experiments, we were able to achieve a detection rate as high as 98.4% and a false positive rate as low as 1.9% on novel malware.
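A generic vector-space-model sketch in the spirit of this framework is given below: programs are represented by byte n-gram features and fed to a classifier. The feature set, n-gram length, and classifier are assumptions rather than the thesis's choices, and the binaries here are random toy blobs rather than real malware samples.

```python
# Generic byte n-gram vector space model for program classification (toy data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score

def byte_ngrams(blob, n=4):
    # Represent a binary as space-separated overlapping byte n-grams in hex form.
    return " ".join(blob[i:i + n].hex() for i in range(len(blob) - n + 1))

# Toy stand-ins for executables: random byte blobs with random malicious/clean labels.
rng = np.random.default_rng(0)
blobs = [rng.integers(0, 256, size=1000, dtype=np.uint8).tobytes() for _ in range(40)]
y = rng.integers(0, 2, size=40)

docs = [byte_ngrams(b) for b in blobs]
X = CountVectorizer(binary=True).fit_transform(docs)   # program-by-n-gram presence matrix

scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print("Cross-validated accuracy on toy data:", scores.mean())
```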
|
9 |
Classification of Carpiodes Using Fourier Descriptors: A Content Based Image Retrieval Approach. Trahan, Patrick, 06 August 2009.
Taxonomic classification has always been important to the study of any biological system. At the current rate of classification, many biological species will go unclassified and become lost forever. The current state of computer technology makes image storage and retrieval possible on a global level; as a result, computer-aided taxonomy is now feasible. Content-based image retrieval techniques utilize visual features of the image for classification. By utilizing image content and computer technology, the gap between taxonomic classification and species destruction is shrinking. This content-based study utilizes the Fourier descriptors of fifteen known landmark features on three Carpiodes species: C. carpio, C. velifer, and C. cyprinus. Classification analysis involves both unsupervised and supervised machine learning algorithms. Fourier descriptors of the fifteen known landmarks provide strong classification power on image data. Feature reduction analysis indicates that feature reduction is possible, which proves useful for increasing the generalization power of the classification.
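The following sketch shows one standard way to compute Fourier descriptors from an ordered set of boundary or landmark coordinates, with simple normalizations for translation, rotation, and scale; the contour here is a toy shape, and the thesis's descriptors of fifteen landmarks may be normalized differently.

```python
# Standard Fourier descriptors of a closed contour (toy example, not the thesis's data).
import numpy as np

def fourier_descriptors(points, n_keep=10):
    # points: (N, 2) array of boundary/landmark coordinates ordered around the shape.
    z = points[:, 0] + 1j * points[:, 1]     # encode (x, y) as complex numbers
    coeffs = np.fft.fft(z)
    coeffs[0] = 0                            # drop the DC term -> translation invariance
    mags = np.abs(coeffs)                    # magnitudes -> rotation/start-point invariance
    mags = mags / mags[1]                    # divide by the first harmonic -> scale invariance
    return mags[1:n_keep + 1]

# Toy closed contour (an ellipse with a small bump) sampled at 64 points.
t = np.linspace(0, 2 * np.pi, 64, endpoint=False)
contour = np.column_stack([2.0 * np.cos(t) + 0.1 * np.cos(5 * t),
                           1.0 * np.sin(t)])
print(fourier_descriptors(contour))
```

The resulting descriptor vectors can then be fed to the unsupervised and supervised algorithms mentioned above.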
|
10 |
PCA-based dimensionality reduction of MRI images for training a support vector machine to aid diagnosis of bipolar disorder. Chen, Beichen; Chen, Amy Jinxin, January 2019.
This study aims to investigate how dimensionality reduction of neuroimaging data prior to training support vector machines (SVMs) affects the classification accuracy of bipolar disorder. The study uses principal component analysis (PCA) for dimensionality reduction. An open source data set of 19 bipolar and 31 control structural magnetic resonance imaging (sMRI) samples was used, part of the UCLA Consortium for Neuropsychiatric Phenomics LA5c Study funded by the NIH Roadmap Initiative, which aims to foster breakthroughs in the development of novel treatments for neuropsychiatric disorders. The images underwent smoothing, feature extraction, and PCA before they were used as input to train SVMs. 3-fold cross-validation was used to tune a number of hyperparameters for linear, radial, and polynomial kernels. Experiments were done to investigate the performance of SVM models trained using 1 to 29 principal components (PCs). Several PC sets reached 100% accuracy in the final evaluation, the minimal set being the first two principal components. The accumulated variance explained by the PCs used did not correlate with the performance of the model. The choice of kernel and hyperparameters is of utmost importance, as the performance obtained can vary greatly. The results support previous findings that SVMs can be useful in aiding the diagnosis of bipolar disorder, and that the use of PCA as a dimensionality reduction method in combination with SVMs may be appropriate for the classification of neuroimaging data for illnesses not limited to bipolar disorder. Due to the limitation of a small sample size, the results call for future research using larger collaborative data sets to validate the accuracies obtained.
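A compact sketch of the PCA-plus-SVM pipeline with 3-fold hyperparameter tuning is shown below on synthetic stand-in data (the real study used 50 sMRI subjects); the grid of principal components, kernels, and C values is illustrative rather than the study's exact search space.

```python
# PCA -> SVM pipeline with a 3-fold grid search over kernels and component counts.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10000))          # 50 subjects x 10000 voxel features (toy data)
y = np.array([1] * 19 + [0] * 31)         # 19 bipolar, 31 control labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

pipe = Pipeline([("pca", PCA()), ("svm", SVC())])
grid = {
    "pca__n_components": [2, 5, 10, 20],
    "svm__kernel": ["linear", "rbf", "poly"],
    "svm__C": [0.1, 1, 10],
}
search = GridSearchCV(pipe, grid, cv=3).fit(X_tr, y_tr)
print("Best settings:", search.best_params_)
print("Held-out accuracy:", search.score(X_te, y_te))
```

Keeping PCA inside the pipeline ensures the components are re-estimated within each cross-validation fold, which matters when tuning the number of components on so few subjects.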
|