1

A GA-Fuzzy-Based Voting Mechanism for Microarray Data Classification

Chen, Ming-cheng 30 September 2008 (has links)
Microarray technology plays an important role in the field of clinical oncology: a patient's cancer can be diagnosed from microarray data. However, the classification of microarray data remains a wide-open problem. Existing methods, such as SVMs, can perform well but need to spend considerable time analyzing the data. In this thesis, we propose a novel GA-fuzzy-based voting mechanism to find the genes that affect the symptom, so as to better diagnose patients. The proposed algorithm blurs the boundary between classes in order to handle ambiguous regions. To model the gene selection mechanism, we introduce an upper-bound α-cut and a lower-bound α-cut into the voting mechanism. Two groups of datasets collected from the literature are used to test the performance of the proposed algorithm. In the first group, the accuracies on five datasets are better than those of the methods proposed by Pochet et al., while on four datasets they are slightly worse. In the second group, the accuracies on seven datasets are better than those of KerNN, proposed by Xiong and Chen, while on four datasets they are worse. Nevertheless, the experimental results show that the proposed algorithm performs best on multi-class data.
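The upper- and lower-bound α-cuts can be pictured as a three-way vote over fuzzy gene memberships: genes above the upper cut are accepted, genes below the lower cut are rejected, and the rest fall in the ambiguous band the algorithm is designed to handle. The abstract does not spell out the exact formulation, so the following Python sketch is only illustrative; the membership values and thresholds are hypothetical.

```python
import numpy as np

# Hypothetical fuzzy membership of each gene in the "relevant" set,
# e.g. produced by the GA-evolved fuzzy classifier (values in [0, 1]).
membership = np.array([0.92, 0.15, 0.71, 0.48, 0.83, 0.30])

def alpha_cut_vote(membership, lower_alpha, upper_alpha):
    """Three-way vote based on lower- and upper-bound alpha-cuts."""
    accept = membership >= upper_alpha        # confidently relevant genes
    reject = membership < lower_alpha         # confidently irrelevant genes
    ambiguous = ~accept & ~reject             # the blurred boundary region
    return accept, reject, ambiguous

accept, reject, ambiguous = alpha_cut_vote(membership, 0.4, 0.8)
print("accepted genes :", np.where(accept)[0])     # [0 4]
print("rejected genes :", np.where(reject)[0])     # [1 5]
print("ambiguous genes:", np.where(ambiguous)[0])  # [2 3]
```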
2

Analysis of the migratory potential of cancerous cells by image preprocessing, segmentation and classification / Analyse du potentiel migratoire des cellules cancéreuses par prétraitement et segmentation d'image et classification des données

Syed, Tahir Qasim 13 December 2011 (has links)
This thesis is part of a broader research project whose objective is to analyze the migratory potential of cancer cells. Within this doctorate, we are interested in the use of image processing to count and classify the cells present in an image acquired with a microscope. The partner biologists in this project study the influence of the environment on the migratory behavior of cancer cells, using cell cultures grown from different cancer cell lines. The processing of biological images has already given rise to a significant number of publications but, in the case addressed here, since the image acquisition protocol was not fixed, the challenge was to propose a chain of adaptive processing steps that does not constrain the biologists in their research. Four steps are detailed in this dissertation. The first concerns the definition of pre-processing steps to homogenize the acquisition conditions. The choice to work with the image of standard deviations rather than the brightness is one of the results of this first part. The second step is to count the number of cells present in the image. An original filter, the so-called "halo" filter, which reinforces the centre of the cells in order to facilitate counting, is proposed; a statistical validation step on the detected centres makes the result more reliable. The image segmentation stage, undoubtedly the most difficult, constitutes the third part of this work. The aim here is to extract thumbnail images each containing a single cell. The chosen segmentation algorithm is the watershed, but it had to be adapted to the context of the images in this study. Using a probability map as input yielded a segmentation that closely follows the cell edges. On the other hand, this method leads to an over-segmentation that must be reduced in order to move towards the goal of "one region = one cell". To this end, an algorithm using the concept of a cumulative hierarchy based on mathematical morphology was developed. It aggregates neighbouring regions by working on a tree representation of these regions and their associated levels. Comparing the results obtained by this method with those of other approaches for limiting over-segmentation demonstrates the effectiveness of the proposed approach. The final step of this work is the classification of the cells. Three classes were defined: elongated cells (mesenchymal migration), "blebbing" round cells (amoeboid migration) and "smooth" round cells (an intermediate stage between the migration modes). On each thumbnail obtained at the end of the segmentation step, intensity, morphological and textural features were computed. An initial analysis of these features led to a classification strategy: first separate the round cells from the elongated cells, then separate the "smooth" round cells from the "blebbing" ones. To do so, the features are divided into two sets that are used successively in these two classification stages. Several classification algorithms were tested; in the end, two neural networks were retained, achieving over 80% correct classification between elongated and round cells, and nearly 90% correct classification between "smooth" and "blebbing" round cells.
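The heart of the segmentation chain — flooding a cell-probability map with the watershed — can be sketched with scikit-image. This is only a rough illustration: the thesis's halo filter, probability-map construction and cumulative-hierarchy merging are not reproduced, and the marker and mask thresholds below are assumptions.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed

def segment_cells(prob_map, marker_threshold=0.7):
    """Watershed on an inverted cell-probability map.

    prob_map: 2-D array in [0, 1], high where a cell is likely
    (standing in for the probability map used in the thesis).
    """
    # Markers: connected components of high-confidence cell cores.
    markers, _ = ndi.label(prob_map > marker_threshold)
    # Flooding the inverted probabilities makes basins coincide with cells.
    return watershed(-prob_map, markers, mask=prob_map > 0.1)

# Toy example: two blurred "cells" on a 64x64 field.
field = np.zeros((64, 64))
field[20, 20] = field[40, 44] = 1.0
prob = ndi.gaussian_filter(field, sigma=6)
prob /= prob.max()
labels = segment_cells(prob)
print("regions found:", labels.max())  # expected: 2
```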
3

Utilising Restricted For-Loops in Genetic Programming

Li, Xiang, xiali@cs.rmit.edu.au January 2007 (has links)
Genetic programming is an approach that harnesses the power of evolution to allow computers to evolve programs. While loops are natural components of most programming languages and appear in every reasonably sized application, they are rarely used in genetic programming. This work investigates a number of restricted looping constructs to determine whether significant benefits can be obtained from them in genetic programming. Possible benefits include: solving problems which cannot be solved without loops, evolving smaller solutions which can be more easily understood by human programmers, and solving existing problems more quickly by using fewer evaluations. In this thesis, a number of explicit restricted loop formats were formulated and tested on the Santa Fe ant problem, a modified ant problem, a sorting problem, a visit-every-square problem and a difficult object classification problem. The experimental results showed that these explicit loops can be successfully used in genetic programming: the evolutionary process can decide when, where and how to use them. Runs with these loops tended to generate smaller solutions in fewer evaluations, and solutions with loops were found for some problems that could not be solved without loops. The results and analysis of this thesis establish that there are significant benefits in using loops in genetic programming. Restricted loops can avoid the difficulties of evolving consistent programs and the problem of infinite iteration. Researchers and other users of genetic programming should not be afraid of loops.
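A restricted loop can be realized as a GP primitive that hard-caps the iteration count, which is exactly what sidesteps the infinite-iteration problem. The sketch below is a hypothetical illustration, not one of the thesis's actual loop formats; the name `restricted_for` and the cap of 100 are assumptions.

```python
MAX_ITERATIONS = 100  # global cap (assumed value): keeps every evolved loop finite

def restricted_for(n, body, state):
    """Execute `body` at most min(n, MAX_ITERATIONS) times.

    `body` is an evolved subtree compiled to a callable mapping
    state -> state; `n` is an evolved integer-valued expression.
    """
    for _ in range(max(0, min(int(n), MAX_ITERATIONS))):
        state = body(state)
    return state

# Toy usage: an evolved body that just increments a counter.
final = restricted_for(10_000, lambda s: s + 1, 0)
print(final)  # 100 -- the cap bounds the 10,000 requested iterations
```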
4

Statistical analysis applied to data classification and image filtering

ALMEIDA, Marcos Antonio Martins de 21 December 2016 (has links)
Statistical analysis is a tool of wide applicability in several areas of scientific knowledge. This thesis makes use of statistical analysis in two different applications: data classification and image processing targeted at document image binarization. In the first case, this thesis presents an analysis of several aspects of the consistency of the classification of the senior researchers in computer science of the Brazilian research council, CNPq - Conselho Nacional de Desenvolvimento Científico e Tecnológico. The second application of statistical analysis developed in this thesis addresses filtering out the back-to-front interference which appears whenever a document is written or typed on both sides of translucent paper. In this topic, an assessment of the most important algorithms found in the literature is made, taking into account a wide range of parameters such as the strength of the back-to-front interference, the diffusion of the ink in the paper, and the texture and hue of the paper due to aging. A new binarization algorithm is proposed, which is capable of removing the back-to-front noise in a wide range of documents. Additionally, this thesis proposes a new concept of "intelligent" binarization for complex documents which, besides text, encompass several graphical elements such as figures, photos and diagrams.
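The thesis's own binarization algorithm is not described in the abstract; as a baseline illustration of what document binarization does, the sketch below implements Otsu's classic global threshold in NumPy. Handling back-to-front interference would require the locally adaptive processing the thesis develops on top of such baselines.

```python
import numpy as np

def otsu_threshold(gray):
    """Global Otsu threshold for an 8-bit grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, 0.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2  # between-class variance
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(gray):
    # 255 = paper (background), 0 = ink (foreground)
    return (gray >= otsu_threshold(gray)).astype(np.uint8) * 255

# Toy document: light paper (value 200) with a darker stroke (value 60).
page = np.full((32, 32), 200, dtype=np.uint8)
page[10:12, 4:28] = 60
print(otsu_threshold(page))  # a threshold strictly between 60 and 200
```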
5

Tidsaspekt för informationsklassificering inom svenska myndigheter / Timescale for information classification in Swedish governmental agencies

Susi, Tommy January 2016 (has links)
No description available.
6

Porovnanie metód machine learningu pre analýzu kreditného rizika / Comparison of machine learning methods for credit risk analysis

Bušo, Bohumír January 2015 (has links)
Recently, machine learning has increasingly been connected with the field known as "Big Data". In this field, large amounts of data are typically available, and useful information must be extracted from them. Nowadays, with ever more data generated through the use of mobile phones, credit cards and the like, the need for high-performance methods is pressing. In this work, we describe six different methods that serve this purpose: logistic regression, neural networks and deep neural networks, bagging, boosting and stacking. The last three methods form a group called ensemble learning. We apply all six methods to real data generously provided by a loan provider. These methods can help the provider distinguish between good and bad loan applicants when the lending decision is being made. Lastly, the results of the individual methods are compared, and we briefly outline possible ways of interpreting them.
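As an illustration of how such a six-way comparison might be set up with scikit-learn — the thesis's actual loan data and hyperparameters are not available, so synthetic data and mostly default settings stand in:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the loan dataset: imbalanced good/bad applicants.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.8, 0.2], random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "neural network": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500),
    "bagging": BaggingClassifier(n_estimators=50),
    "boosting": GradientBoostingClassifier(),
    "stacking": StackingClassifier(
        estimators=[("lr", LogisticRegression(max_iter=1000)),
                    ("gb", GradientBoostingClassifier())],
        final_estimator=LogisticRegression(max_iter=1000)),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name:20s} AUC = {scores.mean():.3f}")
```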
7

Data Mining of Medical Datasets with Missing Attributes from Different Sources

Sajja, Sunitha January 2010 (has links)
No description available.
8

Localized Feature Selection for Classification

Armanfard, Narges January 2017 (has links)
The main idea of this thesis is to present the novel concept of localized feature selection (LFS) for data classification and its application to coma outcome prediction. Typical feature selection methods choose an optimal global feature subset that is applied over all regions of the sample space. In contrast, in this study we propose a novel localized feature selection approach whereby each region of the sample space is associated with its own distinct optimized feature set, which may vary both in membership and size across the sample space. This allows the feature set to optimally adapt to local variations in the sample space. An associated localized classification method is also proposed. The proposed LFS method selects a feature subset such that, within a localized region, within-class and between-class distances are respectively minimized and maximized. We first determine the localized region using an iterative procedure based on the distances in the original feature space; this results in a linear programming optimization problem. The second method is then formulated as a non-linear joint convex/increasing quasi-convex optimization problem, where a logistic function is applied to focus the optimization process on the localized region within the unknown coordinate system. This yields more accurate classification at the expense of additional computational time. Experimental results on synthetic and real-world data sets demonstrate the effectiveness of the proposed localized approach. Using the LFS idea, we propose a practical machine learning approach for the automatic and continuous assessment of event-related potentials, detecting the presence of the mismatch negativity component, whose existence is highly correlated with awakening from coma. This enables us to determine the prognosis of a coma patient. Experimental results on normal and comatose subjects demonstrate the effectiveness of the proposed method. / Dissertation / Doctor of Philosophy (PhD) / This study proposes a novel form of pattern classification method, formulated so that it is easily executable on a computer. Two versions of the method are developed: LFS (localized feature selection) and lLFS (logistic LFS). Both are appropriate for the analysis of data with complex distributions, such as the datasets that occur in biological signal processing problems. We have shown that, on the datasets considered in this thesis, the performance of the proposed methods is significantly improved over that of previous methods. The proposed method is applied to the specific problem of determining the prognosis of a coma patient. The viability of the formulation and the effectiveness of the proposed algorithm are demonstrated on several synthetic and real-world datasets, including comatose subjects.
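The LFS criterion — minimize within-class and maximize between-class distances inside a local region — can be caricatured with a simple per-region feature score. The sketch below is a loose illustration, not the thesis's linear-programming formulation: the region definition (k nearest neighbours of one sample) and the ratio-style score are assumptions.

```python
import numpy as np

def local_feature_scores(X, y, centre_idx, k=50):
    """Score each feature inside the local region around one sample.

    Region = the k nearest neighbours of X[centre_idx] (a stand-in for
    the iterative region-finding procedure in the thesis). Each feature
    is scored by between-class over within-class spread inside the
    region; larger means more locally discriminative.
    """
    d = np.linalg.norm(X - X[centre_idx], axis=1)
    region = np.argsort(d)[:k]
    Xr, yr = X[region], y[region]
    classes = np.unique(yr)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        means = np.array([Xr[yr == c, j].mean() for c in classes])
        within = np.mean([Xr[yr == c, j].std() for c in classes]) + 1e-9
        scores[j] = means.std() / within
    return region, scores

# Toy data: feature 0 carries the class label, feature 1 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)
_, s = local_feature_scores(X, y, centre_idx=0)
print("feature scores:", s.round(2))  # feature 0 should score higher
```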
9

Oblique decision trees in transformed spaces.

Wickramarachchi, Darshana Chitraka January 2015 (has links)
Decision trees (DTs) play a vital role in statistical modelling. The simplicity and interpretability of the solution structure have made the method popular in a wide range of disciplines. In data classification problems, DTs recursively partition the feature space into disjoint sub-regions until each sub-region becomes homogeneous with respect to a particular class. Axis-parallel splits, the simplest form of splits, partition the feature space parallel to the feature axes. However, for some problem domains, DTs with axis-parallel splits can produce complicated boundary structures. As an alternative, oblique splits are used to partition the feature space, potentially simplifying the boundary structure. Various approaches have been explored to find optimal oblique splits. One approach is based on optimisation techniques; this is considered the benchmark approach, but its major limitation is that the tree induction algorithm is computationally expensive. On the other hand, split-finding approaches based on heuristic arguments have gained popularity and have improved on benchmark methods. This thesis proposes a methodology, based on a heuristic argument, for inducing oblique decision trees in transformed spaces. As the first goal of the thesis, a new oblique decision tree algorithm, called HHCART (HouseHolder Classification and Regression Tree), is proposed. The proposed algorithm utilises a series of Householder matrices to reflect the training data at each non-terminal node during tree construction. Householder matrices are constructed using the eigenvectors of each class's covariance matrix. Axis-parallel splits in the reflected (or transformed) spaces provide an efficient way of finding oblique splits in the original space. Experimental results show that the accuracy and size of HHCART trees are comparable with some benchmark methods in the literature. The appealing features of HHCART are that it can handle both qualitative and quantitative features in the same oblique split, and that it is conceptually simple and computationally efficient. Data mining applications often come with massive example sets, and inducing oblique DTs for such example sets often consumes considerable time. HHCART is a serial, memory-resident algorithm, which may be ineffective when handling massive example sets. As the second goal of the thesis, parallel-computing and disk-resident versions of the HHCART algorithm are presented, so that HHCART can be used irrespective of the size of the problem. HHCART is a flexible algorithm, and the eigenvectors defining the Householder matrices can be replaced by other vectors deemed effective for oblique split finding. The third endeavour of this thesis explores this aspect of HHCART: it can be used with other vectors in order to improve classification results. For example, a normal vector of the angular bisector, introduced in the Geometric Decision Tree (GDT) algorithm, is used to construct the Householder reflection matrix; the proposed method produces better results than GDT for some problem domains. In the second case, "Class Representative Vectors" are introduced and used to construct Householder reflection matrices. The results of this experiment show that these oblique trees produce classification results competitive with those achieved by some benchmark decision trees. DTs are constructed using two approaches, namely top-down and bottom-up. HHCART is a top-down tree, which is the most common approach.
As the fourth idea of the thesis, the concept of HHCART is used to induce a new DT, HHBUT, using the bottom-up approach. The bottom-up approach performs cluster analysis prior to tree building in order to identify the terminal nodes. The use of the Bayesian Information Criterion (BIC) to determine the number of clusters leads to accurate and compact trees when compared with Cross-Validation (CV) based bottom-up trees. We suggest that HHBUT is a good alternative to existing bottom-up trees, especially when the number of examples is much higher than the number of features.
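The core HHCART trick — reflect the data so that a chosen direction becomes a coordinate axis, then search for ordinary axis-parallel splits — comes down to one Householder matrix per node. A minimal NumPy sketch follows; the unit vector here stands in for the class-covariance eigenvector the thesis uses.

```python
import numpy as np

def householder_matrix(d):
    """Householder matrix H with H @ d = e1 (d must be a unit vector).

    Reflecting the data by H turns the oblique direction d into the
    first coordinate axis, so a threshold on column 0 of X @ H is the
    oblique split d.x <= t in the original space.
    """
    e1 = np.zeros_like(d)
    e1[0] = 1.0
    v = d - e1
    vv = v @ v
    if vv < 1e-12:              # d already equals e1: no reflection needed
        return np.eye(d.size)
    return np.eye(d.size) - 2.0 * np.outer(v, v) / vv

# Direction to "straighten out", e.g. a dominant eigenvector of a class
# covariance matrix (hypothetical values, normalized to unit length).
d = np.array([3.0, 4.0]) / 5.0
H = householder_matrix(d)
print(np.allclose(H @ d, [1.0, 0.0]))  # True

X = np.random.default_rng(1).normal(size=(5, 2))
X_reflected = X @ H                     # H is symmetric, so rows map as x -> H x
# An axis-parallel threshold on X_reflected[:, 0] is the oblique split d.x <= t.
print(np.allclose(X_reflected[:, 0], X @ d))  # True
```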
10

Advances Towards Data-Race-Free Cache Coherence Through Data Classification

Davari, Mahdad January 2017 (has links)
Providing a consistent view of the shared memory based on precise and well-defined semantics—the memory consistency model—has been an enabling factor in the widespread acceptance and commercial success of shared-memory architectures. Moreover, cache coherence protocols have been employed by the hardware to relieve programmers of the burden of dealing with the memory inconsistency that emerges in the presence of private caches. The principle behind all such cache coherence protocols is to guarantee that consistent values are read from the private caches at all times. In its most stringent form, a cache coherence protocol eagerly enforces two invariants before each data modification: i) no other core has a copy of the data in its private caches, and ii) all other cores know where to receive the consistent data should they need it later. Nevertheless, by partly transferring the responsibility for maintaining those invariants to the programmers, commercial multicores have adopted weaker memory consistency models, namely Total Store Order (TSO), in order to optimize performance for the more common cases. Moreover, memory models with more relaxed invariants have been proposed, based on the observation that more and more software is written in compliance with the Data-Race-Free (DRF) semantics. The semantics of DRF software can be leveraged by the hardware to infer when data in the private caches might be inconsistent. As a result, the hardware ignores the inconsistent data and retrieves the consistent data from the shared memory. DRF semantics therefore removes from the hardware the burden of eagerly enforcing the strong consistency invariants before each data modification; instead, consistency is guaranteed only when needed. This results in manifold optimizations, such as reduced energy consumption and improved performance and scalability. The efficiency of detecting and discarding the inconsistent data is an important factor in the efficiency of such coherence protocols: for instance, discarding data that is in fact consistent does not affect correctness, but results in performance loss and increased energy consumption. In this thesis we show how data classification can be leveraged as an effective tool to simplify cache coherence based on the DRF semantics. In particular, we introduce simple but efficient hardware-based private/shared data classification techniques that can be used to efficiently detect the inconsistent data, thus enabling low-overhead and scalable cache coherence solutions based on the DRF semantics.
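A private/shared classification can be pictured as a table that tracks which core(s) have touched each address region; shared regions are the only ones that need coherence actions. The thesis's mechanisms are hardware, so the following is only a behavioural sketch with assumed names and page-granularity tracking.

```python
PRIVATE, SHARED = "private", "shared"

class DataClassifier:
    """Behavioural model of private/shared classification per memory page.

    A page starts as private to its first accessor and is promoted to
    shared (permanently, in this simple model) once a second core
    touches it. Private pages can skip coherence actions entirely.
    """
    def __init__(self):
        self.owner = {}   # page -> first core that touched it
        self.state = {}   # page -> PRIVATE or SHARED

    def access(self, core, addr, page_size=4096):
        page = addr // page_size
        if page not in self.owner:
            self.owner[page], self.state[page] = core, PRIVATE
        elif self.state[page] == PRIVATE and self.owner[page] != core:
            self.state[page] = SHARED   # second core observed: promote
        return self.state[page]

c = DataClassifier()
print(c.access(core=0, addr=0x1000))  # private: first touch by core 0
print(c.access(core=0, addr=0x1008))  # private: same core, same page
print(c.access(core=1, addr=0x1004))  # shared: core 1 touches core 0's page
```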
