41 |
Evaluating feature selection in a marketing classification problem. Salmeron Perez, Irving Ivan, January 2015
Nowadays machine learning is becoming more popular for prediction and classification tasks in many fields. In banks, the telemarketing area uses this approach by gathering information from phone calls made to clients over past campaigns. The fact is that phone calls are sometimes annoying and time-consuming for both parties, the marketing department and the client. This is why this project is intended to prove that feature selection can improve machine learning models. A Portuguese bank gathered data regarding phone calls and clients' statistical information, such as their current jobs, salaries and employment status, to determine the probability that a person would buy the offered product and/or service. A C4.5 decision tree (J48) and a multilayer perceptron (MLP) are the machine learning models used for the experiments. For feature selection, the correlation-based feature selection (Cfs), Chi-squared attribute selection and RELIEF attribute selection algorithms are used. The WEKA framework provides the tools to implement and test the experiments carried out in this research. The results were very close for the two data mining models, with a slight advantage for C4.5 on correct classifications and for MLP on the ROC curve rate. With these results it was confirmed that feature selection improves classification and/or prediction results.
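A comparable experiment can be approximated outside WEKA. The minimal scikit-learn sketch below mirrors the comparison described above, assuming a bank-marketing CSV with a binary target column "y" labelled "yes"/"no"; the file name, column names, and the choice of a top-10 feature subset are assumptions, and the entropy-based tree and MLP are only rough analogues of J48 and WEKA's MLP.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

df = pd.read_csv("bank-marketing.csv")            # assumed file name
X = pd.get_dummies(df.drop(columns="y"))          # one-hot encode categorical attributes
y = (df["y"] == "yes").astype(int)                # assumed positive label

models = {
    "C4.5-like tree": DecisionTreeClassifier(criterion="entropy"),
    "MLP": MLPClassifier(hidden_layer_sizes=(20,), max_iter=500),
}
for name, clf in models.items():
    for k in ("all", 10):                          # all features vs. a selected subset
        pipe = Pipeline([
            ("scale", MinMaxScaler()),             # chi2 requires non-negative inputs
            ("select", SelectKBest(chi2, k=k)),
            ("model", clf),
        ])
        auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
        print(f"{name}, k={k}: mean ROC AUC = {auc:.3f}")
```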
|
42 |
Feature Construction, Selection And Consolidation For Knowledge Discovery. Li, Jiexun, January 2007
With the rapid advance of information technologies, human beings increasingly rely on computers to accumulate, process, and make use of data. Knowledge discovery techniques have been proposed to automatically search large volumes of data for patterns. Knowledge discovery often requires a set of relevant features to represent the specific domain. My dissertation presents a framework of feature engineering for knowledge discovery, including feature construction, feature selection, and feature consolidation. Five essays in my dissertation present novel approaches to construct, select, or consolidate features in various applications. Feature construction is used to derive new features when relevant features are unknown. Chapter 2 focuses on constructing informative features from a relational database. I introduce a probabilistic relational model-based approach to construct personal and social features for identity matching. Experiments on a criminal dataset showed that social features can improve matching performance. Chapter 3 focuses on identifying good features for knowledge discovery from text. Four types of writeprint features are constructed and shown to be effective for authorship analysis of online messages. Feature selection is aimed at identifying a subset of significant features from a high-dimensional feature space. Chapter 4 presents a framework of feature selection techniques. This essay focuses on identifying marker genes for microarray-based cancer classification. Our experiments on gene array datasets showed excellent performance for optimal search-based gene subset selection. Feature consolidation is aimed at integrating features from diverse data sources or in heterogeneous representations. Chapter 5 presents a Bayesian framework to integrate gene functional relations extracted from heterogeneous data sources such as gene expression profiles, biological literature, and genome sequences. Chapter 6 focuses on kernel-based methods to capture and consolidate information in heterogeneous data representations. I design and compare different kernels for relation extraction from biomedical literature. Experiments show good performance of tree kernels and composite kernels for biomedical relation extraction. These five essays together compose a framework of feature engineering and present different techniques to construct, select, and consolidate relevant features. This feature engineering framework contributes to the domain of information systems by improving the effectiveness, efficiency, and interpretability of knowledge discovery.
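As a hedged illustration of the composite-kernel idea behind feature consolidation (not the dissertation's actual kernels), the sketch below combines a text view and a numeric view of the same instances as a weighted sum of kernel matrices and feeds the result to a precomputed-kernel SVM; the toy sentences, numeric features, and mixing weight are all assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel
from sklearn.svm import SVC

X_text = ["protein A binds protein B",
          "gene C regulates gene D",
          "no relation mentioned here"]              # toy sentence view
X_numeric = np.array([[0.9, 0.1], [0.8, 0.3], [0.1, 0.7]])  # toy numeric view
y = np.array([1, 1, 0])                              # toy relation labels

K_text = linear_kernel(TfidfVectorizer().fit_transform(X_text))  # kernel on the text view
K_num = rbf_kernel(X_numeric)                                    # kernel on the numeric view
alpha = 0.5                                  # mixing weight (assumed)
K = alpha * K_text + (1 - alpha) * K_num     # composite kernel: a convex sum stays PSD

clf = SVC(kernel="precomputed").fit(K, y)
print(clf.predict(K))                        # training-set predictions on the toy data
```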
|
43 |
Feature Selection for Gene Expression Data Based on Hilbert-Schmidt Independence Criterion. Zarkoob, Hadi, 21 May 2010
DNA microarrays are capable of measuring expression levels of thousands of genes, even the whole genome, in a single experiment. Based on this, they have been widely used to extend the study of cancerous tissues to a genomic level. One of the main goals in DNA microarray experiments is to identify a set of relevant genes such that the desired outputs of the experiment mostly depend on this set, to the exclusion of the rest of the genes. This is motivated by the fact that a biological process in a cell typically involves only a subset of genes, and not the whole genome. The task of selecting a subset of relevant genes is called feature (gene) selection. Herein, we propose a feature selection algorithm for gene expression data. It is based on the Hilbert-Schmidt independence criterion (HSIC), and partly motivated by Rank-One Downdate (R1D) and the Singular Value Decomposition (SVD). The algorithm is computationally very fast, scalable to large data sets, and can be applied to response variables of arbitrary type (categorical and continuous). Experimental results of the proposed technique are presented on some synthetic and well-known microarray data sets. Later, we discuss the capability of HSIC in providing a general framework which encapsulates many widely used techniques for dimensionality reduction, clustering and metric learning. We use this framework to explain two metric learning algorithms, namely Fisher discriminant analysis (FDA) and closed-form metric learning (CFML). As a result of this framework, we are able to propose a new metric learning method. The proposed technique uses concepts from normalized-cut spectral clustering and is associated with an underlying convex optimization problem.
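For intuition only, the sketch below ranks genes by the empirical HSIC score between each gene's expression values and the class labels; this is a simple filter-style use of HSIC, not the R1D/SVD-motivated algorithm of the thesis, and the toy data, kernel choices, and bandwidth are assumptions.

```python
import numpy as np

def empirical_hsic(K, L):
    """Biased empirical HSIC estimate: trace(K H L H) / (n - 1)^2."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def rbf_kernel_1d(x, sigma=1.0):
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
n_samples, n_genes = 60, 200
X = rng.standard_normal((n_samples, n_genes))    # toy expression matrix
y = (X[:, 3] + X[:, 17] > 0).astype(int)         # labels driven by genes 3 and 17

L = (y[:, None] == y[None, :]).astype(float)     # delta kernel on the class labels
scores = np.array([empirical_hsic(rbf_kernel_1d(X[:, j]), L) for j in range(n_genes)])
top = np.argsort(scores)[::-1][:5]
print("top-ranked genes:", top)                  # genes 3 and 17 should rank highly
```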
|
44 |
Integrated feature, neighbourhood, and model optimization for personalised modelling and knowledge discovery. Liang, Wen, January 2009
“Machine learning is the process of discovering and interpreting meaningful information, such as new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques” (Larose, 2005). From my understanding, machine learning is a process of using different analysis techniques to observe previously unknown, potentially meaningful information, and discover strong patterns and relationships in a large dataset. Professor Kasabov (2007b) classified computational models into three categories (global, local, and personalised), which are widely used in the areas of data analysis and decision support in general, and in medicine and bioinformatics in particular. Most recently, the concept of personalised modelling has been widely applied to various disciplines such as personalised medicine and personalised drug design for known diseases (e.g. cancer, diabetes, brain disease), as well as to other modelling problems in ecology, business, finance, crime prevention, and so on. The philosophy behind the personalised modelling approach is that every person is different from others, and thus he/she will benefit from having a personalised model and treatment. However, personalised modelling is not without issues, such as defining the correct number of neighbours or an appropriate number of features. As a result, the principal goal of this research is to study and address these issues and to create a novel framework and system for personalised modelling. The framework allows users to select and optimise the most important features and nearest neighbours for a new input sample in relation to a certain problem, based on a weighted variable distance measure, in order to obtain more precise prognostic accuracy and personalised knowledge when compared with global and local modelling approaches.
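A minimal sketch of the core idea, under heavy simplification and not Liang's actual optimisation framework: for one new sample, features are weighted by a relevance score, the K nearest neighbours are found under that weighted distance, and a small local model is fitted on the neighbourhood only; the synthetic dataset, relevance score, and K are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)
x_new, X_train, y_train = X[0], X[1:], y[1:]     # one "new" sample, rest as training data

w = mutual_info_classif(X_train, y_train, random_state=0)   # feature relevance weights
w = w / (w.sum() + 1e-12)

d = np.sqrt(((X_train - x_new) ** 2 * w).sum(axis=1))       # weighted variable distance
K = 30                                                      # neighbourhood size (assumed)
idx = np.argsort(d)[:K]                                     # K nearest neighbours

local_model = DecisionTreeClassifier(max_depth=3).fit(X_train[idx], y_train[idx])
print("personalised prediction:", local_model.predict([x_new])[0])
```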
|
46 |
Sharing visual features for multiclass and multiview object detection. Torralba, Antonio; Murphy, Kevin P.; Freeman, William T., 14 April 2004
We consider the problem of detecting a large number of different classes of objects in cluttered scenes. Traditional approaches require applying a battery of different classifiers to the image, at multiple locations and scales. This can be slow and can require a lot of training data, since each classifier requires the computation of many different image features. In particular, for independently trained detectors, the (run-time) computational complexity and the (training-time) sample complexity scale linearly with the number of classes to be detected. It seems unlikely that such an approach will scale up to allow recognition of hundreds or thousands of objects. We present a multi-class boosting procedure (joint boosting) that reduces the computational and sample complexity by finding common features that can be shared across the classes (and/or views). The detectors for each class are trained jointly, rather than independently. For a given performance level, the total number of features required, and therefore the computational cost, is observed to scale approximately logarithmically with the number of classes. Rather than specific object parts, the features selected jointly tend to be edges and generic features typical of many natural structures. These generic features generalize better and considerably reduce the computational cost of multi-class object detection.
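As a toy illustration of the shared-feature idea only (not the paper's joint boosting, whose greedy search over class subsets is omitted here), the sketch below fits one regression stump per round and shares it across all one-vs-all class problems; the synthetic data, number of rounds, and stump form are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=400, n_features=30, n_classes=4,
                           n_informative=6, random_state=0)
n, d = X.shape
classes = np.unique(y)
Z = np.where(y[:, None] == classes[None, :], 1.0, -1.0)   # +/-1 target per class
F = np.zeros((n, len(classes)))                           # additive score per class
shared = []                                               # indices of the shared features

for _ in range(10):                                       # boosting rounds
    W = np.exp(-Z * F)                                    # per-class example weights
    best = None
    for j in range(d):                                    # try every feature as a stump
        s = np.where(X[:, j] > np.median(X[:, j]), 1.0, -1.0)
        a = (W * Z * s[:, None]).sum(0) / W.sum(0)        # per-class stump coefficients
        loss = np.exp(-Z * (F + s[:, None] * a)).sum()    # multiclass exponential loss
        if best is None or loss < best[0]:
            best = (loss, j, a)
    _, j, a = best
    s = np.where(X[:, j] > np.median(X[:, j]), 1.0, -1.0)
    F += s[:, None] * a                                   # every class reuses this stump
    shared.append(j)

print("features shared by all classes:", shared)
print("training accuracy:", (F.argmax(axis=1) == y).mean())
```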
|
47 |
Query Expansion For Handling Exploratory And Ambiguous Keyword Queries. January 2011
Query expansion is a functionality of search engines that suggests a set of related queries for a user-issued keyword query. In the case of exploratory or ambiguous keyword queries, the main goal of the user is to identify and select a specific category of query results among different categorical options, in order to narrow down the search and reach the desired result. Typical corpus-driven keyword query expansion approaches return popular words in the results as expanded queries. These empirical methods fail to cover all semantics of the categories present in the query results. More importantly, these methods do not consider the semantic relationship between the keywords featured in an expanded query. Contrary to a normal keyword search setting, these factors are non-trivial in an exploratory and ambiguous query setting, where the user's precise discernment of the different categories present in the query results is more important for making subsequent search decisions. In this thesis, I propose a new framework for keyword query expansion: generating a set of queries that correspond to a categorization of the original query results, referred to as categorizing query expansion. Two classes of algorithms are proposed: one performs clustering as a pre-processing step and then generates categorizing expanded queries based on the clusters; the other handles the case of generating quality expanded queries in the presence of imperfect clusters. Dissertation/Thesis. M.S. Computer Science 2011.
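A minimal sketch of the first approach (clustering as a pre-processing step), not the thesis' actual algorithms: the result documents of an ambiguous query are clustered and one expanded query is emitted per cluster from its top centroid terms; the toy query, result snippets, and number of clusters are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

query = "jaguar"                                    # an ambiguous keyword query
results = [                                         # toy result snippets
    "jaguar is a large cat found in the americas",
    "the jaguar big cat hunts in rainforests",
    "jaguar cars unveils a new luxury sedan",
    "test drive of the latest jaguar sports car",
    "jacksonville jaguars win the football game",
    "jaguars quarterback leads team to victory",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(results)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)   # cluster the results

terms = np.array(vec.get_feature_names_out())
for c in range(3):
    top = terms[np.argsort(km.cluster_centers_[c])[::-1][:2]]  # top centroid terms
    print(f"expanded query {c}: {query} " + " ".join(top))
```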
|
48 |
Species Discrimination and Monitoring of Abiotic Stress Tolerance by Chlorophyll Fluorescence Transients. MISHRA, Anamika, January 2012
Chlorophyll fluorescence imaging has now become a versatile and standard tool in fundamental and applied plant research. This method captures time series images of the chlorophyll fluorescence emission of whole leaves or plants under various illuminations, typically a combination of actinic light and saturating flashes. Several conventional chlorophyll fluorescence parameters have been recognized that have a physiological interpretation and are useful for, e.g., assessment of plant health status and early detection of biotic and abiotic stresses. Chlorophyll fluorescence imaging enables us to probe the performance of plants by visualizing physiologically relevant fluorescence parameters reporting on the physiology and biochemistry of the plant leaves. Sometimes there is a need to find the most contrasting fluorescence features/parameters in order to quantify the stress response at a very early stage of the stress treatment. Conventional fluorescence analysis utilizes well-defined single images such as F0, Fp, Fm, Fs, or arithmetic combinations of basic images such as Fv/Fm, ΦPSII, NPQ, qP. Therefore, although conventional fluorescence parameters have a physiological interpretation, they may not represent the most contrasting image sets. In order to detect the effect of stress treatments at a very early stage, advanced statistical techniques, based on classifiers and feature selection methods, have been developed to select highly contrasting chlorophyll fluorescence images out of hundreds of captured images. We combined sets of highly performing images, resulting in images with very high contrast, the so-called combinatorial imaging. The application of advanced statistical methods to chlorophyll fluorescence imaging data allows us to succeed in tasks where conventional approaches do not work. This thesis aims to explore the application of conventional chlorophyll fluorescence parameters as well as advanced statistical techniques of classifiers and feature selection methods for high-throughput screening. We demonstrate the applicability of the technique in discriminating three species of the same family, Lamiaceae, at a very early stage of their growth. Further, we show that chlorophyll fluorescence imaging can be used for measuring cold and drought tolerance of Arabidopsis thaliana and tomato plants, respectively, in a simulated high-throughput screening.
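For reference, a short sketch of the conventional parameters named above, computed per pixel from toy fluorescence-level arrays; real transients come from the imaging instrument, and the array values, image size, and the light-adapted F0' image here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (64, 64)                          # toy image size
F0  = rng.uniform(0.10, 0.20, shape)      # minimal fluorescence, dark-adapted
Fm  = rng.uniform(0.60, 0.90, shape)      # maximal fluorescence, dark-adapted
Fmp = rng.uniform(0.40, 0.70, shape)      # Fm' under actinic light
Fs  = rng.uniform(0.20, 0.35, shape)      # steady-state fluorescence
F0p = rng.uniform(0.08, 0.15, shape)      # F0' under actinic light

FvFm    = (Fm - F0) / Fm                  # maximum quantum yield of PSII
PhiPSII = (Fmp - Fs) / Fmp                # effective PSII quantum yield
NPQ     = (Fm - Fmp) / Fmp                # non-photochemical quenching
qP      = (Fmp - Fs) / (Fmp - F0p)        # photochemical quenching coefficient

print("mean Fv/Fm:", FvFm.mean().round(3), "| mean NPQ:", NPQ.mean().round(3))
```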
|
49 |
A credit scoring model based on classifiers consensus system approach. Ala'raj, Maher A., January 2016
Managing customer credit is an important issue for every commercial bank; therefore, banks take great care when dealing with customer loans to avoid any improper decisions that can lead to loss of opportunity or financial losses. The manual estimation of customer creditworthiness has become both time- and resource-consuming. Moreover, a manual approach is subjective (dependent on the bank employee who gives the estimation), which is why devising and implementing programming models that provide loan estimations is the only way of eradicating the 'human factor' in this problem. Such a model should give recommendations to the bank in terms of whether or not a loan should be given, or otherwise give a probability that the loan will be repaid. Nowadays, a number of models have been designed, but there is no ideal classifier amongst them, since each gives some percentage of incorrect outputs; this is a critical consideration when each percentage point of incorrect answers can mean millions of dollars of losses for large banks. Nevertheless, logistic regression (LR) remains the industry-standard tool for credit-scoring model development. For this purpose, an investigation is carried out on the combination of the most efficient classifiers in the credit-scoring domain, in an attempt to produce a classifier that exceeds each of its component classifiers. In this work, a fusion model referred to as 'the Classifiers Consensus Approach' is developed, which gives considerably better performance than each of the single classifiers that constitute it. The difference between the consensus approach and the majority of other combiners lies in the fact that the consensus approach adopts a model of real expert-group behaviour during the process of finding the consensus (aggregate) answer. The consensus model is compared not only with single classifiers, but also with traditional combiners and a more complex combiner model known as the 'Dynamic Ensemble Selection' approach. As pre-processing techniques, data filtering (selecting training entries that fit the input data well and removing outliers and noisy data) and feature selection (removing useless and statistically insignificant features whose values are weakly correlated with the real quality of the loan) are used. These techniques are valuable in significantly improving the results of the consensus approach. The results clearly show that the consensus approach is statistically better (at the 95% confidence level, according to the Friedman test) than any other single classifier or combiner analysed; this means that, for similar datasets, the consensus approach can be expected to outperform all other classifiers with 95% confidence. The consensus approach gives not only the best accuracy, but also better AUC value, Brier score and H-measure for almost all datasets investigated in this thesis. Moreover, it outperformed logistic regression. Thus, it has been shown that the use of the consensus approach for credit scoring is justified and recommended in commercial banks. Along with the consensus approach, the dynamic ensemble selection approach is analysed; the results show that, under some conditions, it can rival the consensus approach. The strengths of the dynamic ensemble selection approach include its stability and high accuracy on various datasets.
The consensus approach, which is improved in this work, may be considered by banks whose data shares the characteristics of the datasets used in this work, where its utilisation could decrease the level of mistakenly rejected loans to solvent customers, and the level of mistakenly accepted loans that will never be repaid. Furthermore, the consensus approach is a notable step towards building a universal classifier that can fit data with any structure. Another advantage of the consensus approach is its flexibility: even if the input data changes for various reasons, the consensus approach can easily be re-trained and used with the same performance.
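For orientation only, the sketch below combines several credit-scoring base learners behind a feature-selection step, using simple soft voting as a stand-in combiner; the thesis' consensus approach models iterative expert-group agreement rather than a fixed vote, and the synthetic dataset, k, and base learners are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=30, weights=[0.8, 0.2],
                           random_state=0)           # stand-in imbalanced credit dataset
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),   # the industry-standard baseline
        ("rf", RandomForestClassifier(random_state=0)),
        ("mlp", MLPClassifier(max_iter=1000, random_state=0)),
    ],
    voting="soft",                                   # combine predicted probabilities
)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=15)),        # drop weakly related features
    ("combine", ensemble),
])
print("mean AUC:", cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean().round(3))
```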
|
50 |
Improved shrunken centroid method for better variable selection in cancer classification with high throughput molecular data. Xukun, Li, January 1900
Master of Science / Department of Statistics / Haiyan Wang / Cancer type classification with high throughput molecular data has received much attention, and many methods have been published in this area. One of them is PAM (the nearest shrunken centroid algorithm), which is simple and efficient and can give very good prediction accuracy. A problem with PAM is that it selects too many genes, some of which may have no influence on cancer type. A reason for this phenomenon is that PAM assumes that all genes have identical distributions and uses a common threshold parameter for gene selection. This may not hold in reality, since expressions from different genes could have very different distributions due to complicated biological processes. We propose a new method aimed at improving the ability of PAM to select informative genes. Keeping informative genes while reducing false positive variables can lead to more accurate classification results and help to pinpoint target genes for further studies. To achieve this goal, we introduce a variable-specific test based on the Edgeworth expansion to select informative genes. We apply this test to each gene and select genes based on the result of the test, so that a large number of genes are excluded. Afterwards, soft thresholding with cross-validation can be applied to decide a common threshold value. Simulation and a real application show that our method can reduce the irrelevant information and select the informative genes more precisely. The simulation results give us more insight into where the newly proposed procedure could improve accuracy, especially when the data set is skewed or unbalanced. The method can be applied to broad molecular data, including, for example, lipidomic data from mass spectra, copy number data from genomics, eQTL analysis with GWAS data, etc. We expect the proposed method to help life scientists accelerate discoveries with high-throughput data.
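A brief sketch of the baseline PAM-style classifier with a common shrinkage threshold chosen by cross-validation, as a point of reference; the thesis' variable-specific Edgeworth-expansion test is not implemented here, and the synthetic data and threshold grid are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=120, n_features=500, n_informative=20,
                           random_state=0)           # toy "expression" matrix
pipe = make_pipeline(StandardScaler(), NearestCentroid())
grid = GridSearchCV(
    pipe,
    {"nearestcentroid__shrink_threshold": np.linspace(0.1, 2.0, 10)},  # common threshold grid
    cv=5,
)
grid.fit(X, y)
print("best shrink threshold:", grid.best_params_)
print("cross-validated accuracy:", round(grid.best_score_, 3))
```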
|