Global ETD Search

111	High Specificity Literature Mining Method Based on Microarray Expression Profile for Discovering Hidden Connections among Diseases, Genes, and Drugs Wu, Jain-Shing 05 September 2011 In recent years, with the wide adoption of microarray techniques, a large body of biomedical literature has been published that provides a great deal of useful information. However, some relationships among diseases, genes, and drugs remain unexplored, because authors typically focus either on the genes significant to a disease or on the genes significant to a drug, without connecting the two to obtain new relationships. Several methods have been proposed for finding such hidden relationships, but many of them require manual involvement. The main objective of this dissertation is to discover the hidden connections between human diseases and genes, and the connections between drugs and the same genes. To achieve this goal, the intermediate nodes (significant genes) must be found first. When a gene shows a significantly different expression level in the observed group (abnormal patients) than in the control group (normal persons), it is called a significant gene for the disease. These significant genes often play a crucial role in cancer diagnosis and treatment. By classifying microarray gene expression data to find these significant genes, doctors can obtain feasible and appropriate information about the treatments that can be given to patients according to their cancer symptoms. A variety of classifiers have been proposed for this problem. However, most of them work inefficiently when the number of attributes grows into the thousands. To further improve the accuracy and speed of existing classifiers, a novel microarray attribute reduction scheme (MARS) is proposed for selecting the genes significant to a disease. 
Experimental results demonstrate that combining the proposed scheme with a multiclass support vector machine (MCSVM) yields better performance than other gene selection methods combined with the same MCSVM. In addition, the proposed scheme with MCSVM performs better than the results reported in the existing literature. Furthermore, 19 of the 22 genes selected by the proposed scheme in the acute lymphoblastic leukemia and acute myeloid leukemia (AML-ALL) dataset have been reported in the literature as related to AML and ALL. Thus the proposed scheme not only significantly reduces the large number of attributes (genes) in the gene expression classification problem, but also increases the classification accuracy. MARS finds the related gene set according to a threshold determined using the receiver operating characteristic (ROC) curve. However, determining the best threshold requires repeating the experiment many times. Hence, we propose a novel disease-oriented feature selection algorithm (DOFA) to improve on MARS. DOFA uses a genetic algorithm (GA) as the selection method to pick the related genes automatically, with a support vector machine (SVM) and K-nearest neighbors (KNN) as the classifiers. DOFA is tested on selecting related genes for the AML-ALL and Colon datasets, for which it selects 21 genes and 25 genes, respectively. According to the literature, 20 of the 21 genes selected for the AML-ALL dataset are related to the disease or to related cancers, and the remaining gene is still uncertain. Likewise, 20 of the 25 genes selected for the Colon dataset are directly related to colon cancer or to related cancers, and the remaining 5 genes are still uncertain. Three further experiments are conducted to verify the discriminability of the genes selected by DOFA. The experimental results all indicate that DOFA performs better than the competing methods. Thus DOFA not only selects genes related to the diseases, but also increases the classification accuracy. 
After obtaining the group of significant genes, we can use these genes to obtain the hidden connections. We propose a high specificity literature mining method based on microarray expression profiles for discovering hidden connections among diseases, drugs, and genes. The proposed method automatically selects related genes from the disease or drug microarray expression profiles, and uses the disease names or drug names together with the names or aliases of the selected genes to retrieve the related collections of abstracts. An alias expansion scheme and a weight function are used to eliminate unrelated literature. We run three scenarios to verify the proposed method. Experimental results show that the proposed method can obtain the hidden connections among diseases, genes, and drugs. The ROC curve shows that the proposed method not only finds the hidden connections between diseases and drugs but also has high specificity. In summary, the goal of this dissertation is to discover the hidden connections between diseases and drugs. To achieve this goal, we first proposed MARS to select the genes significant to the diseases, and then proposed DOFA to improve on MARS. Finally, we proposed a high specificity literature mining method based on microarray expression profiles for discovering the hidden connections among diseases, genes, and drugs. The proposed method builds on DOFA's ability to find the genes significant to a disease in order to obtain the hidden connections. Experimental results show that the proposed method not only obtains the hidden connections among diseases, genes, and drugs, but also has high specificity. genetic algorithm feature selection Hidden relationship literature mining gene expression profile
112	Improved Approaches for Attribute Clustering Based on the Group Genetic Algorithm Lin, Feng-Shih 09 September 2011 Feature selection is a pre-processing step in data mining and machine learning, and plays an important role in analyzing high-dimensional data. Appropriately selected features can not only reduce the complexity of the mining or learning process, but also improve the accuracy of the results. In the past, the concept of performing feature selection by attribute clustering was proposed. If similar attributes can be clustered into groups, an attribute whose values are missing can easily be replaced by another attribute in the same group. Hong et al. also proposed several genetic algorithms for finding appropriate attribute clusters. Their approaches, however, suffered from the weakness that multiple chromosomes could represent the same attribute clustering result (feasible solution) due to the combinatorial property, thus causing a larger search space than needed. In this thesis, we therefore attempt to improve the performance of the GA-based attribute-clustering process using the grouping genetic algorithm (GGA). Two GGA-based attribute clustering approaches are proposed. In the first approach, the general GGA representation and operators are used to reduce the redundancy of the chromosome representation for attribute clustering. In the second approach, a new encoding scheme with corresponding crossover and mutation operators is designed, and an improved fitness function is proposed to achieve better convergence speed and provide more flexible alternatives than the first approach. Finally, experiments are conducted to compare the efficiency and accuracy of the proposed approaches with those of the previous ones. feature selection genetic algorithm grouping genetic algorithm data mining Attribute clustering
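The redundancy described in entry 112 can be illustrated with a small sketch. It assumes a plain label encoding in which position i of a chromosome holds the cluster id of attribute i; this encoding and the function name `canonical_partition` are illustrative assumptions, not the representation used in the thesis. Two chromosomes that differ only in how the clusters are numbered describe the same clustering, which becomes visible once the labels are renumbered in order of first appearance.

```python
def canonical_partition(labels):
    """Renumber cluster ids in order of first appearance.

    Chromosomes that encode the same grouping of attributes map to the
    same canonical tuple, regardless of the original cluster numbering.
    """
    mapping = {}
    canonical = []
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)  # next unused canonical id
        canonical.append(mapping[label])
    return tuple(canonical)


# Hypothetical chromosomes over five attributes (assumed label encoding).
chromosome_a = [2, 2, 0, 1, 0]
chromosome_b = [0, 0, 1, 2, 1]   # same groups as chromosome_a, renumbered
chromosome_c = [0, 1, 1, 2, 0]   # a genuinely different clustering

same_ab = canonical_partition(chromosome_a) == canonical_partition(chromosome_b)
same_ac = canonical_partition(chromosome_a) == canonical_partition(chromosome_c)
print(same_ab, same_ac)
```

With k clusters, up to k! label-encoded chromosomes collapse to one canonical form, which is the kind of enlarged search space that a group-oriented representation is meant to avoid.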
113	Forward-Selection-Based Feature Selection for Genre Analysis and Recognition of Popular Music Chen, Wei-Yu 09 September 2012 In this thesis, a genre recognition approach for Japanese popular music using an SVM (support vector machine) with forward feature selection is proposed. First, various common acoustic features are extracted from the digital signal of popular music songs, including sub-bands, energy, rhythm, tempo, and formants. A set of the most appropriate features for genre identification is then selected by the proposed forward feature selection technique. Experiments conducted on a database of 296 Japanese popular music songs demonstrate that the proposed algorithm achieves a recognition accuracy of approximately 78.81%, and that the accuracy remains stable as the number of test songs increases. RBF (radial basis function) SVM (support vector machine) forward selection Genre recognition feature selection
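As a rough illustration of forward feature selection as used in entry 113, the sketch below greedily adds the feature that most improves a score and stops when no addition helps. In the thesis the score would presumably be SVM recognition accuracy; here `toy_score` and its per-feature gains are made-up stand-ins so the example stays self-contained, and the stopping rule is an assumption rather than the author's exact procedure.

```python
def forward_select(features, score, max_features=None):
    """Greedy forward selection.

    `score` takes a list of feature names and returns a number to maximize.
    Selection stops when no remaining feature improves the best score.
    """
    selected = []
    best_score = float("-inf")
    remaining = list(features)
    while remaining and (max_features is None or len(selected) < max_features):
        # Score every one-feature extension of the current subset.
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no candidate improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


# Hypothetical per-feature gains; a real run would train and validate a classifier.
GAINS = {"tempo": 0.30, "energy": 0.25, "rhythm": 0.02, "formants": 0.10}


def toy_score(subset):
    # Sum of gains minus a small penalty per feature (illustrative only).
    return sum(GAINS[f] for f in subset) - 0.05 * len(subset)


chosen, final_score = forward_select(list(GAINS), toy_score)
print(chosen, round(final_score, 2))
```

Because each round re-evaluates every remaining feature, the number of score evaluations grows quadratically with the feature count, which matters when each evaluation means training a classifier.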
114	Detecting Near-Duplicate Documents using Sentence-Level Features and Machine Learning Liao, Ting-Yi 23 October 2012 Effectively finding near-duplicate documents in large document collections has become a very important issue. In this paper, we propose a new method to detect near-duplicate documents in large-scale datasets. Our method is divided into three parts: feature selection, similarity measure, and discriminant derivation. In feature selection, each document is preprocessed before detection: symbols, stop words, and so on are removed. We measure the weight of each term in a sentence and choose the terms with the highest weights in the sentence. These terms are collected as features of the document, and together they form the document's feature set. The similarity measure uses a similarity function to compute the similarity value between two feature sets. Discriminant derivation uses a support vector machine (SVM), which trains a classifier to identify whether a document is a near-duplicate or not. The SVM is a supervised learning strategy that trains a classifier from training patterns. For characterizing documents, sentence-level features are more effective than term-level features. Besides, learning a discriminant with an SVM avoids the trial-and-error effort required by conventional methods, in which trial and error is used to find a threshold, a discriminant value that defines the relation between documents. Our experiments show that our method is more effective at near-duplicate document detection than other methods. Near-duplicate threshold trial-and-error support vector machine feature selection stop words similarity function
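The first two stages described in entry 114 (sentence-level feature selection and a similarity measure) might look roughly like the sketch below. The abstract does not specify the term-weighting scheme or the similarity function, so document-level term frequency and Jaccard similarity are used here purely as assumptions, and the stop-word list is a tiny hypothetical one. The third stage, in which an SVM learns the discriminant instead of a hand-tuned threshold, is not shown.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption, not the thesis's list).
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "in", "and"}


def sentence_features(text, top_k=2):
    """Collect the top_k highest-weight terms from each sentence.

    Weight is the term's frequency in the whole document (an assumed
    weighting); ties are broken alphabetically for determinism.
    """
    sentences = [s for s in re.split(r"[.!?]+", text.lower()) if s.strip()]
    tokenized = [
        [w for w in re.findall(r"[a-z0-9]+", s) if w not in STOP_WORDS]
        for s in sentences
    ]
    doc_freq = Counter(w for words in tokenized for w in words)
    features = set()
    for words in tokenized:
        ranked = sorted(set(words), key=lambda w: (-doc_freq[w], w))
        features.update(ranked[:top_k])
    return features


def jaccard(a, b):
    """Jaccard similarity between two feature sets (assumed similarity function)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc1 = "The cat sat on the mat. The cat ate fish."
doc2 = "The cat sat on the mat. The cat ate some fish."
doc3 = "Stock prices fell sharply today."

print(jaccard(sentence_features(doc1), sentence_features(doc2)))
print(jaccard(sentence_features(doc1), sentence_features(doc3)))
```

In a conventional pipeline these similarity values would be compared against a hand-picked threshold; the abstract's point is that a trained classifier replaces that manual tuning.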
115	Efficient case-based reasoning through feature weighting, and its application in protein crystallography Gopal, Kreshna 02 June 2009 Data preprocessing is critical for machine learning, data mining, and pattern recognition. In particular, selecting relevant and non-redundant features in high-dimensional data is important to efficiently construct models that accurately describe the data. In this work, I present SLIDER, an algorithm that weights features to reflect relevance in determining similarity between instances. Accurate weighting of features improves the similarity measure, which is useful in learning algorithms like nearest neighbor and case-based reasoning. SLIDER performs a greedy search for optimum weights in an exponentially large space of weight vectors. Exhaustive search being intractable, the algorithm reduces the search space by focusing on pivotal weights at which representative instances are equidistant to truly similar and different instances in Euclidean space. SLIDER then evaluates those weights heuristically, based on effectiveness in properly ranking pre-determined matches of a set of cases, relative to mismatches. I analytically show that by choosing feature weights that minimize the mean rank of matches relative to mismatches, the separation between the distributions of Euclidean distances for matches and mismatches is increased. This leads to a better distance metric, and consequently increases the probability of retrieving true matches from a database. I also discuss how SLIDER is used to improve the efficiency and effectiveness of case retrieval in a case-based reasoning system that automatically interprets electron density maps to determine the three-dimensional structures of proteins. Electron density patterns for regions in a protein are represented by numerical features, which are used in a distance metric to efficiently retrieve matching patterns by searching a large database. 
These pre-selected cases are then evaluated by more expensive methods to identify truly good matches – this strategy speeds up the retrieval of matching density regions, thereby enabling fast and accurate protein model-building. This two-phase case retrieval approach is potentially useful in many case-based reasoning systems, especially those with computationally expensive case matching and large case libraries. Case-Based Reasoning Nearest Neighbor Learning Feature Selection Feature Weighting Protein Crystallography
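The two-phase retrieval idea in entry 115 can be sketched as follows: a cheap feature-weighted Euclidean distance shortlists candidates, and a more expensive comparison picks the final match from that shortlist. The weights are taken as given (finding them is SLIDER's job and is not reproduced here), the form of the weighted distance is an assumption, and `expensive_score` is a hypothetical stand-in for whatever costly matching procedure a real system would use.

```python
import math


def weighted_distance(x, y, weights):
    """Feature-weighted Euclidean distance (assumed form: sqrt(sum w_i * d_i^2))."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


def two_phase_retrieve(query, cases, weights, expensive_score, shortlist_size=3):
    """Phase 1: shortlist by cheap weighted distance.
    Phase 2: pick the shortlisted case with the best expensive score."""
    shortlist = sorted(
        cases, key=lambda case: weighted_distance(query, case, weights)
    )[:shortlist_size]
    return max(shortlist, key=lambda case: expensive_score(query, case))


def expensive_score(query, case):
    # Hypothetical costly comparison; here simply a negated L1 distance.
    return -sum(abs(a - b) for a, b in zip(query, case))


cases = [(1.0, 0.0), (0.0, 5.0), (3.0, 3.0), (0.5, 0.5)]
weights = (1.0, 0.0)  # illustrative: the second feature is treated as irrelevant
best = two_phase_retrieve((0.0, 0.0), cases, weights, expensive_score, shortlist_size=2)
print(best)
```

The expensive comparison runs only on the shortlist, so its cost scales with `shortlist_size` rather than with the size of the case library.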
116	Operational Knowledge Acquisition of Refuse Incinerator Using Data Mining Techniques Lai, Po-Chuan 05 August 2005 The physical and chemical mechanisms in a refuse incinerator are complex. It is difficult to fully comprehend the system without thorough research and long-term on-site experiments. In addition, refuse incineration plants are equipped with many sensors and collect large amounts of data, which are expected to be useful because they may capture operational experience. To cope with data volumes that may exceed the available computational capability, the sequential forward floating search (SFFS) algorithm is used to reduce the data dimension, find relevant features, and remove redundant information. In this research, data mining techniques are applied to three critical target attributes, steam production, NOx, and SOx, to build decision tree models and extract operational experience in the form of decision rules. The models are evaluated by their prediction accuracy, and the rules extracted from the decision tree models are also of great help for on-site operation and prediction. Decision Tree Classification Refuse Incinerator Classification Analysis Feature Selection Data Mining
117	Feature Set Evaluation For A Generic Missile Detection System Avan, Selcuk Kazim 01 February 2007 A Missile Detection System (MDS) is one of the main components of a self-protection system developed against the threat of guided missiles for airborne platforms. Requirements such as time-critical operation and high classification accuracy make the 'Pattern Recognition' problem of an MDS a hard task. The problem can be divided into two main parts: 'Feature Set Evaluation' (FSE) and 'Classifier' design. The main goal of feature set evaluation is to apply a dimensionality reduction process to the input data set without degrading the resulting classification performance. In this thesis study, FSE approaches are investigated for the pattern recognition problem of a generic MDS. First, synthetic data generation is carried out in a software environment by employing generic models and assumptions in order to reflect the nature of a realistic problem environment. Then, the data sets are evaluated in order to establish a baseline for further feature set evaluation approaches. Further, a theoretical background covering the concepts of Class Separability, Feature Selection, and Feature Extraction is given. Several widely used methods are assessed for their suitability to the problem, with justifications based on the data set characteristics. Building on this background, software implementations of several feature set evaluation techniques are developed. Simulations are carried out to perform dimensionality reduction. To evaluate the resulting data sets in terms of classification performance, a classifier is implemented in software. The resulting classification performances of the applied approaches are compared and evaluated.
118	Extraction Of Buildings In Satellite Images Cetin, Melih 01 May 2010 In this study, an automated building extraction system capable of detecting buildings in satellite images using only the RGB color bands is implemented. The approach used in this work has four main steps: local feature extraction, feature selection, classification, and post-processing. There are many studies in the literature that deal with the same problem. The main issue is to find the most suitable features to distinguish a building. This work presents a feature selection scheme that is connected with the classification framework of Adaboost. In addition to Adaboost, four SVM kernels are used for classification. A detailed analysis of window type and size, feature type, feature selection, feature count, and training set is carried out to determine the optimal parameters for the classifiers. A detailed comparison of SVM and Adaboost is made based on pixel and object performances, and the results obtained are presented both numerically and visually. It is observed that SVM performs better with a quadratic kernel than with linear, RBF, or polynomial kernels. SVM performance is better if features are selected either by Adaboost or by considering the errors obtained on histograms of features. The quadratic kernel SVM operating on Adaboost-selected features achieves 38% on the pixel-based performance criterion (quality percentage) and 48% on the object-based performance criterion (correct detection) with a building detection threshold of 0.4. Adaboost performed better than SVM, resulting in 43% quality percentage and 67% correct detection with the same threshold. TK Electronics 7800-8360
119	Discovery of Evolution Patterns from Sequences of Documents Chang, Yu-Hsiu 06 August 2001 Due to the ever-increasing volume of textual documents, text mining is a rapidly growing application of knowledge discovery in databases. Past text mining techniques predominantly concentrated on discovering intra-document patterns from textual documents, such as text categorization, document clustering, query expansion, and event tracking. Mining inter-document patterns from textual documents has been largely ignored in the literature. This research focuses on discovering inter-document patterns, called evolution patterns, from document sequences, and proposes the evolution pattern discovery (EPD) technique for mining evolution patterns from a set of ordered sequences of documents. The discovery of evolution patterns can be applied in such domains as environmental scanning and knowledge management, and can be used to facilitate existing document management and retrieval techniques (e.g., event tracking). Feature Extraction Feature Selection Document Clustering Frequent Temporal Patterns Feature-Based Evolution Patterns Text Mining
120	A Classification Framework for Imbalanced Data Phoungphol, Piyaphol 18 December 2013 As information technology advances, the demand for reliable and highly accurate predictive models is increasing in many domains. Traditional classification algorithms can be limited in their performance on highly imbalanced data sets. In this dissertation, we study two common problems that arise when training data is imbalanced, and propose effective algorithms to solve them. Firstly, we investigate the problem of building a multi-class classification model from an imbalanced class distribution. We develop an effective technique to improve the performance of the model by formulating the problem as a multi-class SVM with the objective of maximizing the G-mean value. A ramp loss function is used to simplify and solve the problem. Experimental results on multiple real-world datasets confirm that our new method can effectively solve the multi-class classification problem when the datasets are highly imbalanced. Secondly, we explore the problem of learning a global classification model from distributed data sources with privacy constraints. In this problem, not only do the data sources have different class distributions, but combining the data into one central dataset is also prohibited. We propose a privacy-preserving framework for building a global SVM from distributed data sources. Our new framework avoids constructing a global kernel matrix by mapping non-linear inputs to a linear feature space and then solves a distributed linear SVM from these virtual points. Our method can solve both the imbalance and privacy problems while achieving the same level of accuracy as a regular SVM. Finally, we extend our framework to handle high-dimensional data by utilizing Generalized Multiple Kernel Learning to select a sparse combination of features and kernels. This new model produces a smaller set of features, but yields much higher accuracy. 
Imbalanced Data Privacy Distributed Learning Multiple Kernel Learning Support Vector Machine Feature Selection
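To make the G-mean objective mentioned in entry 120 concrete, the sketch below computes it as the geometric mean of per-class recalls, which is its usual multi-class definition; whether the dissertation uses exactly this form is an assumption. The example contrasts it with plain accuracy on a made-up imbalanced label set.

```python
import math
from collections import Counter


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls (common multi-class G-mean definition)."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [correct[c] / totals[c] for c in totals]
    return math.prod(recalls) ** (1.0 / len(recalls))


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up imbalanced data: 8 majority-class samples, 2 minority-class samples.
y_true = [0] * 8 + [1] * 2
always_majority = [0] * 10            # ignores the minority class entirely
one_minority_hit = [0] * 8 + [1, 0]   # recovers one of the two minority samples

print(accuracy(y_true, always_majority), g_mean(y_true, always_majority))
print(accuracy(y_true, one_minority_hit), g_mean(y_true, one_minority_hit))
```

A classifier that always predicts the majority class scores 0.8 accuracy but a G-mean of 0, since a single class with zero recall drives the product to zero. That property is why G-mean is a common target on imbalanced data.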
