1 |
L’Apprentissage artificiel pour la fouille de données multilingues : application à la classification automatique des documents arabes / Machine learning and the data mining of multilingual documents: application to the automatic classification of Arabic documents
Raheel, Saeed 22 October 2010
The automatic classification of documents, an approach drawn from machine learning and text mining, has proven very effective for organizing multilingual textual resources. Very little work addresses the automatic classification of documents written in Arabic script, despite the morphological richness of the language. This thesis therefore examines the automatic classification of documents written in Arabic script. To overcome the difficulties of processing Arabic automatically, we rely on a high-performing solution based on DIINAR.1, the computerized linguistic resource for Arabic, and its morphological analyzer. The choice of attribute type is critical to effective automatic classification and must be made with great care, since it directly affects classifier accuracy. We therefore carried out a comparative study of n-grams, roots, lemmas, and words as attribute types. It led us to conclude that classifiers built on corpora constructed from n-grams perform unstably, whereas classifiers built on corpora constructed from roots behave stably.
Moreover, most work on documents written in Arabic script relies on modern learning algorithms such as support vector machines, naive Bayes classifiers, and decision trees, which are known to be among the best-performing classifiers in the field. However, no existing work on the automatic classification of documents written in Arabic script uses the boosting algorithm.
We therefore compared the accuracy of boosted decision trees (C4.5) with that of unboosted decision trees (C4.5), support vector machines (SMO), and naive Bayes multinomial classifiers (NBM) on the automatic classification of documents written in Arabic script. We found that boosted C4.5 could not surpass the accuracy of the SVM and NBM algorithms. We attribute this weakness, without faulting boosting itself, to the fact that decision trees are very sensitive to the slightest change in their underlying data, which boosting regularly re-weights and modifies.
An Arabic document may be written in one language or in several, i.e. its content mixes words written in Arabic script with others written in Latin script. All work on the automatic classification of documents written in Arabic script approaches the subject from a monolingual point of view, i.e. it exploits only the text written in Arabic script and discards any text written in other languages. As a result, a vital part of the information in the documents is deliberately lost, even though it could have contributed to the classifier’s decision, since assigning a document to one category or another rests mainly on its content. Removing the words written in Latin script therefore truncates the text, which calls into question the degree of certainty of the final decision made by the prediction model. For this reason, this thesis also addresses the automatic classification of Arabic documents with multilingual content, i.e. written in several languages. / The automatic classification of documents is an approach resulting from the hybridization of machine learning and text mining techniques.
It has proven very effective for the automatic organization of text-based resources, in particular multilingual ones. However, very little literature addresses Arabic documents, despite the fact that Arabic is morphologically much richer than Latin-based languages. To overcome the difficulties of processing Arabic documents automatically, a deep analysis is required, such as the one performed by the morphological analyzer based on DIINAR.1, the computerized dictionary for Arabic.
One of the intrinsic elements of any automatic classification system is the choice of attribute type. Great care should be taken in making that choice, since it has a great impact on the classifier’s accuracy. One contribution of this thesis is a comparative study of Support Vector Machines (SMO) and Naïve Bayes Multinomial (NBM) algorithms on multiple corpora generated from n-grams, stems, lemmas, and words. We concluded that classifiers based on corpora generated from stems performed better than those based on lemmas and words. In addition, the performance of classifiers based on stems was more stable than that of classifiers based on corpora generated from n-grams.
Another contribution of this thesis is the use of Boosting as a classifier. None of the literature on the automatic classification of Arabic documents had used it before, despite the fact that the algorithm was designed for that purpose. We therefore conducted a comparative study of Decision Trees (C4.5), Boosted Decision Trees (C4.5 with AdaBoost.M1), SMO, and NBM. Boosting did improve the performance of C4.5, but its regular re-weighting of the dataset’s instances prevented C4.5 from surpassing the performance of SMO and NBM. This weakness stems from the very nature of decision trees, which makes them very sensitive to any change in their underlying data.
While analyzing our dataset, we noticed that an Arabic document is written either in one language (i.e. Arabic) or in several (i.e. it contains words written in Arabic mixed with a minority of words written in another language). All of the literature on the automatic classification of Arabic documents treats both cases alike and eliminates any foreign terms it finds. This deliberate elimination deprives the learning process of a vital part of the information in the documents. That information could have contributed to the classifier’s decision, since assigning a document to one category or another relies essentially on its content; discarding it compromises the degree of certainty of that decision. The main contribution of this thesis is therefore to treat the automatic classification of Arabic documents from a multilingual perspective, preserving as many of the foreign terms as possible and eliminating only the useless ones (e.g. stopwords).
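The instance re-weighting that the abstract blames for the sensitivity of boosted C4.5 can be illustrated with one round of the standard AdaBoost.M1 weight update. This is a minimal sketch, not the thesis's experimental setup: the function name `adaboost_m1_reweight` and the toy weights are assumptions introduced for illustration only.

```python
from typing import List, Tuple


def adaboost_m1_reweight(weights: List[float],
                         correct: List[bool]) -> Tuple[List[float], float]:
    """Perform one AdaBoost.M1 weight update (hypothetical helper name).

    weights: current instance weights, assumed to sum to 1.
    correct: True where the weak learner classified the instance correctly.
    Returns the normalized new weights and beta = eps / (1 - eps).
    """
    # Weighted error of the weak learner on the current distribution.
    eps = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0.0 < eps < 0.5:
        # AdaBoost.M1 requires a weak learner with weighted error below 1/2.
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    beta = eps / (1.0 - eps)
    # Correctly classified instances are down-weighted by beta;
    # misclassified instances keep their weight before normalization.
    updated = [w * beta if ok else w for w, ok in zip(weights, correct)]
    total = sum(updated)
    return [w / total for w in updated], beta
```

With four equally weighted instances and one mistake, the misclassified instance ends up carrying half of the total weight, so the next tree is trained on a markedly different distribution. That shift is the kind of change in the underlying data that the abstract says decision trees react to strongly.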
|
2 |
A multilevel search algorithm for feature selection in biomedical data
Oduntan, Idowu Olayinka 10 April 2006
The automated analysis of patients’ biomedical data can be used to derive diagnostic and prognostic inferences about the observed patients. Many noninvasive techniques for acquiring biomedical samples generate data that are characterized by a large number of distinct attributes (i.e. features) and a small number of observed patients (i.e. samples). Deriving reliable inferences, such as classifying a given patient as either cancerous or non-cancerous, using these biomedical data requires that the ratio r of the number of samples to the number of features be within the range 5 < r < 10. To satisfy this requirement, the original set of features in the biomedical datasets can be reduced to an ‘optimal’ subset of features that most discriminates the observed patients. Feature selection techniques strategically seek the ‘optimal’ subset.
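The ratio requirement stated above translates directly into a range of admissible feature-subset sizes. The sketch below works out that range for a given number of samples; the function name `feature_count_bounds` is an assumption made for illustration and does not come from the thesis.

```python
from typing import Tuple


def feature_count_bounds(n_samples: int) -> Tuple[int, int]:
    """Return (smallest, largest) feature count d with 5 < n_samples / d < 10.

    If the first value exceeds the second, no feature count satisfies the
    constraint for this sample size.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    # n/d < 10  <=>  10*d > n  <=>  d >= n // 10 + 1
    smallest = n_samples // 10 + 1
    # n/d > 5   <=>  5*d < n   <=>  d <= (n - 1) // 5
    largest = (n_samples - 1) // 5
    return smallest, largest
```

For example, with 100 observed patients the selected subset would need between 11 and 19 features, however many attributes the original dataset has.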
In this thesis, I present a new feature selection technique: multilevel feature selection. The technique seeks the ‘optimal’ feature subset in biomedical datasets using a multilevel search algorithm. This algorithm combines a hierarchical search framework with a search method. The framework, which provides the capability to easily adapt the technique to different forms of biomedical datasets, consists of increasingly coarse forms of the original feature set that are strategically and progressively explored by the search method. Tabu search (a search metaheuristic) is the search method used in the multilevel feature selection technique.
I evaluate the performance of the new technique, in terms of solution quality, using experiments that compare the classification inferences derived from the result of the technique with those derived from the results of other feature selection techniques, such as basic tabu-search-based feature selection, sequential forward selection, and random feature selection. In the experiments, the same biomedical dataset is used and an equivalent amount of computational resources is allocated to each evaluated technique to provide a common basis for comparison. The empirical results show that the multilevel feature selection technique finds ‘optimal’ subsets that enable more accurate and stable classification than those selected using the other feature selection techniques. Also, a similar comparison of the new technique with a genetic algorithm feature selection technique that selects highly discriminatory regions of consecutive features shows that the multilevel technique finds subsets that enable more stable classification. / February 2006
|
5 |
Localized Feature Selection for Classification
Armanfard, Narges January 2017
The main idea of this thesis is to present the novel concept of localized feature selection (LFS) for data classification and its application for coma outcome prediction.
Typical feature selection methods choose an optimal global feature subset that is applied over all regions of the sample space. In contrast, in this study we propose a novel localized feature selection approach whereby each region of the sample space is associated with its own distinct optimized feature set, which may vary both in membership and size across the sample space. This allows the feature set to optimally adapt to local variations in the sample space. An associated localized classification method is also proposed.
The proposed LFS method selects a feature subset such that, within a localized region, within-class and between-class distances are respectively minimized and maximized. We first determine the localized region using an iterative procedure based on the distances in the original feature space. This results in a linear programming optimization problem. Then, the second method is formulated as a non-linear joint convex/increasing quasi-convex optimization problem where a logistic function is applied to focus the optimization process on the localized region within the unknown co-ordinate system. This results in a more accurate classification performance at the expense of some sacrifice in computational time. Experimental results on synthetic and real-world data sets demonstrate the effectiveness of the proposed localized approach.
Using the LFS idea, we propose a practical machine learning approach for automatic and continuous assessment of event related potentials for detecting the presence of the mismatch negativity component, whose existence has a high correlation with coma awakening. This process enables us to determine prognosis of a coma patient. Experimental results on normal and comatose subjects demonstrate the effectiveness of the proposed method. / Dissertation / Doctor of Philosophy (PhD) / This study proposes a novel form of pattern classification method, which is formulated in a way so that it is easily executable on a computer. Two different versions of the method are developed. These are the LFS (localized feature selection) and lLFS (logistic LFS) methods. Both versions are appropriate for analysis of data with complex distributions, such as datasets that occur in biological signal processing problems. We have shown that the performance of the proposed methods is significantly improved over that of previous methods, on the datasets that were considered in this thesis.
The proposed method is applied to the specific problem of determining the prognosis of a coma patient. The viability of the formulation and the effectiveness of the proposed algorithm are demonstrated on several synthetic and real world datasets, including comatose subjects.
|
6 |
Web Page Classification Using Features from Titles and Snippets
Lu, Zhengyang January 2015
Nowadays, when a keyword is provided, a search engine can return a large number of web pages, which makes it difficult for people to find the right information. Web page classification is a technology that can help us to make a relevant and quick selection of information that we are looking for. Moreover, web page classification is important for companies that provide marketing and analytics platforms, because it can help them to build a healthy mix of listings on search engines and large directories. This will provide more insight into the distribution of the types of web pages their local business listings are found on, and finally will help marketers to make better-informed decisions about marketing campaigns and strategies.
In this thesis we perform a literature review that introduces web page classification, feature selection and feature extraction. The literature review also includes a comparison of three commonly used classification algorithms and a description of metrics for performance evaluation. The findings in the literature enable us to extend existing classification techniques, methods and algorithms to address a new web page classification problem faced by our industrial partner SweetIQ (a company that provides location-based marketing services and an analytics platform).
We develop a classification method based on SweetIQ's data and business needs. Our method includes typical feature selection and feature extraction methods, but the features we use in this thesis are largely different from traditional ones used in the literature. We test selected features and find that the text extracted from the title and snippet of a web page can help a classifier to achieve good performance. Our classification method does not require the full content of a web page. Thus, it is fast and saves a lot of space.
|
7 |
Quantifying the stability of feature selection
Nogueira, Sarah January 2018
Feature selection is central to modern data science, from exploratory data analysis to predictive model-building. The "stability" of a feature selection algorithm refers to the robustness of its feature preferences with respect to data sampling and to its stochastic nature. An algorithm is "unstable" if a small change in data leads to large changes in the chosen feature subset. Whilst the idea is simple, quantifying it has proven more challenging: we note numerous proposals in the literature, each with a different motivation and justification. We present a rigorous statistical and axiomatic treatment of this issue. In particular, with this work we consolidate the literature and provide (1) a deeper understanding of existing work based on a small set of properties, and (2) a clearly justified statistical approach with several novel benefits. This approach serves to identify a stability measure that obeys all desirable properties and (for the first time in the literature) allows confidence intervals and hypothesis tests on the stability of an approach, enabling rigorous comparison of feature selection algorithms.
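To make the notion of stability concrete, the sketch below computes one simple similarity-based score: the mean pairwise Jaccard similarity between the feature subsets chosen on different data samples. This is an illustrative stand-in, not the measure proposed in the thesis; the function names are assumptions.

```python
from itertools import combinations
from typing import Iterable, List, Set


def jaccard(a: Iterable[int], b: Iterable[int]) -> float:
    """Jaccard similarity |A & B| / |A | B| of two feature-index sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # two empty selections are treated as identical
    return len(sa & sb) / len(sa | sb)


def mean_pairwise_jaccard(subsets: List[Set[int]]) -> float:
    """Average Jaccard similarity over all pairs of selected subsets.

    Each element of `subsets` is the feature set chosen on one resample
    of the data. 1.0 means identical choices every time; values near 0
    mean the choices barely overlap.
    """
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two feature subsets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A score like this captures the intuition of "small change in data, large change in subset", but it comes with no confidence intervals or hypothesis tests, which is the gap the abstract says its statistical treatment addresses.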
|
8 |
Improving Multi-class Text Classification with Naive Bayes
Rennie, Jason D. M. 01 September 2001
There are numerous text documents available in electronic form. More and more are becoming available every day. Such documents represent a massive amount of information that is easily accessible. Seeking value in this huge collection requires organization; much of the work of organizing documents can be automated through text classification. The accuracy of such systems, and our understanding of them, greatly influence their usefulness. In this paper, we seek 1) to advance the understanding of commonly used text classification techniques, and 2) through that understanding, to improve the tools that are available for text classification. We begin by clarifying the assumptions made in the derivation of Naive Bayes, noting basic properties and proposing ways to extend and improve it. Next, we investigate the quality of Naive Bayes parameter estimates and their impact on classification. Our analysis leads to a theorem that explains the improvements that can be found in multiclass classification with Naive Bayes using Error-Correcting Output Codes. We use experimental evidence on two commonly used data sets to exhibit an application of the theorem. Finally, we show fundamental flaws in a commonly used feature selection algorithm and develop a statistics-based framework for text feature selection. Greater understanding of Naive Bayes and the properties of text allows us to make better use of it in text classification.
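For readers unfamiliar with the classifier under study, here is a minimal sketch of multinomial Naive Bayes for tokenized text. It assumes add-alpha (Laplace) smoothing and ignores words unseen in training; both are common conventions chosen for illustration, not details taken from the paper. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Model = Dict[str, Tuple[float, Dict[str, float]]]


def train_multinomial_nb(docs: List[List[str]], labels: List[str],
                         alpha: float = 1.0) -> Model:
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocab = {w for doc in docs for w in doc}
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc)
    model: Model = {}
    for y, n_docs in class_doc_counts.items():
        # Add-alpha smoothing keeps every vocabulary word at nonzero probability.
        denom = sum(word_counts[y].values()) + alpha * len(vocab)
        log_prior = math.log(n_docs / len(labels))
        log_like = {w: math.log((word_counts[y][w] + alpha) / denom)
                    for w in vocab}
        model[y] = (log_prior, log_like)
    return model


def predict(model: Model, doc: List[str]) -> str:
    """Return the class with the highest posterior log score."""
    def score(y: str) -> float:
        log_prior, log_like = model[y]
        # Words absent from the training vocabulary are skipped (assumption).
        return log_prior + sum(log_like[w] for w in doc if w in log_like)
    return max(model, key=score)
```

The sum of per-word log likelihoods is where the independence assumption enters: each token contributes as if it were drawn independently given the class.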
|
9 |
Multiclass Classification of SRBCTs
Yeo, Gene, Poggio, Tomaso 25 August 2001
A novel approach to multiclass tumor classification using Artificial Neural Networks (ANNs) was introduced in a recent paper [Khan2001]. The method successfully classified and diagnosed small, round blue cell tumors (SRBCTs) of childhood into four distinct categories, neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin lymphoma (NHL) and the Ewing family of tumors (EWS), using cDNA gene expression profiles of samples that included both tumor biopsy material and cell lines. We report that, using an approach similar to the one reported by Yeang et al. [Yeang2001], i.e. multiclass classification by combining the outputs of binary classifiers, we achieved equal accuracy with far fewer features. We report the performance of 3 binary classifiers (k-nearest neighbors (kNN), weighted voting (WV), and support vector machines (SVM)) with 3 feature selection techniques (Golub's Signal to Noise (SN) ratios [Golub99], Fisher scores (FSc) and Mukherjee's SVM feature selection (SVMFS) [Sayan98]).
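One of the feature selection techniques named above, the signal-to-noise ratio, is commonly written as (mean of class 1 minus mean of class 2) divided by (standard deviation of class 1 plus standard deviation of class 2), computed per gene. The sketch below ranks features by the absolute value of that score for a one-versus-rest split. The use of the population standard deviation, the ranking by absolute value, and the function names are assumptions for illustration and are not taken from the paper.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def signal_to_noise(values_pos: Sequence[float],
                    values_neg: Sequence[float]) -> float:
    """(mu_pos - mu_neg) / (sigma_pos + sigma_neg) for one feature."""
    denom = pstdev(values_pos) + pstdev(values_neg)
    if denom == 0:
        raise ValueError("feature is constant within both classes")
    return (mean(values_pos) - mean(values_neg)) / denom


def rank_features(X: List[List[float]], y: List[str],
                  positive: str) -> List[int]:
    """Rank feature indices by |signal-to-noise| for `positive` vs. the rest."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == positive]
        neg = [row[j] for row, label in zip(X, y) if label != positive]
        scores.append(signal_to_noise(pos, neg))
    # Largest absolute score first: the most class-discriminating features.
    return sorted(range(len(scores)), key=lambda j: abs(scores[j]),
                  reverse=True)
```

Running `rank_features` once per class and training one binary classifier per class on its top-ranked features is one way to realize the "combine binary classifiers" scheme the abstract describes.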
|
10 |
Sparse Value Function Approximation for Reinforcement Learning
Painter-Wakefield, Christopher Robert January 2013
A key component of many reinforcement learning (RL) algorithms is the approximation of the value function. The design and selection of features for approximation in RL is crucial, and an ongoing area of research. One approach to the problem of feature selection is to apply sparsity-inducing techniques in learning the value function approximation; such sparse methods tend to select relevant features and ignore irrelevant features, thus automating the feature selection process. This dissertation describes three contributions in the area of sparse value function approximation for reinforcement learning.
One method for obtaining sparse linear approximations is the inclusion in the objective function of a penalty on the sum of the absolute values of the approximation weights. This L1 regularization approach was first applied to temporal difference learning in the LARS-inspired, batch learning algorithm LARS-TD. In our first contribution, we define an iterative update equation which has as its fixed point the L1 regularized linear fixed point of LARS-TD. The iterative update gives rise naturally to an online stochastic approximation algorithm. We prove convergence of the online algorithm and show that the L1 regularized linear fixed point is an equilibrium fixed point of the algorithm. We demonstrate the ability of the algorithm to converge to the fixed point, yielding a sparse solution with modestly better performance than unregularized linear temporal difference learning.
Our second contribution extends LARS-TD to integrate policy optimization with sparse value learning. We extend the L1 regularized linear fixed point to include a maximum over policies, defining a new, "greedy" fixed point. The greedy fixed point adds a new invariant to the set which LARS-TD maintains as it traverses its homotopy path, giving rise to a new algorithm integrating sparse value learning and optimization. The new algorithm is demonstrated to be similar in performance with policy iteration using LARS-TD.
Finally, we consider another approach to sparse learning, that of using a simple algorithm that greedily adds new features. Such algorithms have many of the good properties of the L1 regularization methods, while also being extremely efficient and, in some cases, allowing theoretical guarantees on recovery of the true form of a sparse target function from sampled data. We consider variants of orthogonal matching pursuit (OMP) applied to RL. The resulting algorithms are analyzed and compared experimentally with existing L1 regularized approaches. We demonstrate that perhaps the most natural scenario in which one might hope to achieve sparse recovery fails; however, one variant provides promising theoretical guarantees under certain assumptions on the feature dictionary while another variant empirically outperforms prior methods both in approximation accuracy and efficiency on several benchmark problems. / Dissertation
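Why a penalty on the sum of absolute weights yields sparse solutions can be seen in the simplest setting: ordinary L1-regularized least squares with orthonormal features, where the solution is the soft-thresholded least-squares weight vector. The sketch below shows only that operator. It is a generic illustration under that stated assumption, not LARS-TD or any algorithm from the dissertation, and the function names are hypothetical.

```python
from typing import List


def soft_threshold(w: float, lam: float) -> float:
    """Shrink w toward zero by lam; weights within [-lam, lam] become exactly 0."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0


def sparsify(weights: List[float], lam: float) -> List[float]:
    """Apply soft-thresholding to every weight.

    With orthonormal features, this maps the unregularized least-squares
    weights to the minimizer of 0.5 * squared error + lam * sum(|w_i|).
    """
    return [soft_threshold(w, lam) for w in weights]
```

For instance, `sparsify([3.0, -0.2, 0.5, -2.0], 0.5)` gives `[2.5, 0.0, 0.0, -1.5]`: features with small weights are dropped outright, which is the automatic feature selection effect the abstract refers to.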
|