Global ETD Search

11	Bayesian Biclustering on Discrete Data: Variable Selection Methods Guo, Lei 18 October 2013 Biclustering is a technique for clustering rows and columns of a data matrix simultaneously. Over the past few years, we have seen its applications in biology-related fields, as well as in many data mining projects. As opposed to classical clustering methods, biclustering groups objects that are similar only on a subset of variables. Many biclustering algorithms on continuous data have emerged over the last decade. In this dissertation, we will focus on two Bayesian biclustering algorithms we developed for discrete data, more specifically categorical data and ordinal data. / Statistics Statistics Biostatistics BiClustering Categorical data Hapmap Ordinal data population structure
12	Recommender System using Reinforcement Learning January 2020 Currently, recommender systems are used extensively to match the right audience with the "right" content over various platforms. Recommendations generated by these systems aim to offer relevant items to users. Different approaches have been suggested to solve this problem, mainly by using the rating history of the user or by identifying the preferences of similar users. Most existing recommendation systems are formulated in an identical fashion, where a model is trained to capture the underlying preferences of users over different kinds of items. Once deployed, the model suggests personalized recommendations, and it is assumed that the preferences of users are perfectly reflected by the historical data. However, such user data might be limited in practice, and the characteristics of users may constantly evolve during their intensive interaction with recommendation systems. Moreover, most of these recommender systems suffer from the cold-start problem, where insufficient data for new users or products degrades the overall recommendation output. In the current study, we have built a recommender system to recommend movies to users. A biclustering algorithm is used to cluster the users and movies simultaneously at the beginning to generate explainable recommendations, and these biclusters are used to form a gridworld where Q-Learning is used to learn the policy to traverse the grid. The reward function uses the Jaccard Index, which is a measure of common users between two biclusters. Demographic details of new users are also used to generate recommendations, which addresses the cold-start problem. Lastly, the implemented algorithm is evaluated on a real-world dataset against a widely used recommendation algorithm, including its performance in cold-start cases.
/ Dissertation/Thesis / Masters Thesis Computer Science 2020 Artificial intelligence Computer science Biclustering Qlearning Recommender System Reinforcement Learning
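The Jaccard-based reward mentioned in the abstract above can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `jaccard_index`, the user IDs, and the convention of returning 0.0 for two empty biclusters are assumptions made here for illustration only.

```python
def jaccard_index(users_a, users_b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of user IDs."""
    a, b = set(users_a), set(users_b)
    union = a | b
    if not union:
        # Assumed convention: two empty biclusters share nothing.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical user sets of two neighbouring biclusters in the gridworld.
bicluster_1 = {"u1", "u2", "u3", "u4"}
bicluster_2 = {"u3", "u4", "u5"}

# 2 shared users out of 5 distinct users.
reward = jaccard_index(bicluster_1, bicluster_2)
print(reward)  # 0.4
```

Under this reading, moving between biclusters that share many users earns a reward close to 1, while moving between disjoint biclusters earns 0.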
13	Application of biclustering algorithms to biological data Eren, Kemal 20 June 2012 No description available. Computer Science biclustering data mining gene expression microarray
14	Topics in One-Way Supervised Biclustering Using Gaussian Mixture Models Wong, Monica January 2017 Cluster analysis identifies homogeneous groups that are relevant within a population. In model-based clustering, group membership is estimated using a parametric finite mixture model, commonly the mathematically tractable Gaussian mixture model. One-way clustering methods can be restrictive in cases where there are suspected relationships between the variables in each component, leading to the idea of biclustering, which refers to clustering both observations and variables simultaneously. When the relationships between the variables are known, biclustering becomes one-way supervised. To this end, this thesis focuses on a novel one-way supervised biclustering family based on the Gaussian mixture model. In cases where biclustering may be overestimating the number of components in the data, a model averaging technique utilizing Occam's window is applied to produce better clustering results. Automatic outlier detection is introduced into the biclustering family using mixtures of contaminated Gaussian mixture models. Algorithms for model-fitting and parameter estimation are presented for the techniques described in this thesis, and simulation and real data studies are used to assess their performance. / Thesis / Doctor of Philosophy (PhD) Biclustering One-way supervision Finite mixture models Model-based clustering
15	Desarrollo de técnicas de aprendizaje automático y computación evolutiva multiobjetivo para la inferencia de redes de asociación entre vías biológicas / Development of machine learning and multi-objective evolutionary computation techniques for the inference of association networks between biological pathways Dussaut, Julieta Sol 14 March 2016 In systems biology, a pathway represents a sequence of reactions or interactions among a group of expressed genes involved in a biological process. Over the last decade, the analysis of biological pathways has become a key strategy for understanding the biological meaning of high-throughput experiments on a group of genes. Behind pathway analysis lies the assumption that, for many complex cellular phenomena, it is very difficult to find an explanation through studies that focus only on the gene level. In particular, this thesis investigates techniques for analyzing cross-talk between biological pathways, enriching this information with data from microarray experiments by means of biclustering. The aim is to provide a bioinformatics methodology that identifies and explains relationships between biological pathways, providing useful information to assist experts in molecular biology. To meet this objective, computational methods were developed for both topological and enrichment analysis at the pathway level. One of the tools developed, BAT (Gallo, Dussaut, Carballido, & Ponzoni, 2010), runs the BiHEA algorithm (Gallo, Carballido, & Ponzoni, 2009), a multi-objective algorithm that performs biclustering on the data. This allows the identification of groups of genes that are co-expressed under certain subsets of experimental conditions. This tool is used in conjunction with another, called PET, designed to take topological data relevant at the gene level and project them to the pathway level for a better understanding of the signaling mechanisms that coordinate various cellular processes. These methods were studied and validated with data from Alzheimer's disease, contrasting the results with those obtained by other recently published methods. The work thus highlights the value of combining topological analysis techniques with enrichment based on expression data and the detection of synchronization between biological pathways through biclustering methods, as a comprehensive strategy for identifying cross-talk between biological processes. Computer science Bioinformatics Evolutionary algorithm Machine learning Biclustering Biological pathway
16	A Biclustering Approach to Combinatorial Transcription Control Srinivasan, Venkataraghavan 11 August 2005 Combinatorial control of transcription is a well-established phenomenon in the cell. Multiple transcription factors often bind to the same transcriptional control region of a gene and interact with each other to control the expression of the gene. It is thus necessary to consider the joint conservation of sequence pairs in order to identify combinations of binding sites to which the transcription factors bind. Conventional motif finding algorithms fail to address this issue. We propose a novel biclustering algorithm based on random sampling to identify candidate binding site combinations. We establish bounds on the various parameters of the algorithm and study the conditions under which the algorithm is guaranteed to identify candidate binding sites. We analyzed a yeast cell cycle gene expression data set using our algorithm and recovered certain novel combinations of binding sites, besides those already reported in the literature. / Master of Science Biclustering Combinatorial transcription control Promoter analysis Random sampling
17	Contributions à l'indexation et à la recherche d'information avec l'analyse formelle de concepts / Contributions to indexing and retrieval using Formal Concept Analysis Codocedo-Henríquez, Víctor 04 September 2015 One of the first models ever considered as an index for documents using terms as descriptors was a lattice structure, a couple of decades before the arrival of Formal Concept Analysis (FCA) as a solid theory for data mining and knowledge discovery. While the Information Retrieval (IR) community has shifted to more advanced techniques for document retrieval, such as probabilistic and statistical paradigms, the interest of the FCA community in developing techniques that improve the state of the art in IR while providing relevance feedback and semantic-based features has never decayed. In this thesis we present a set of contributions on what we call FCA-based IR systems. We have divided our contributions into two sets, namely retrieval and indexing. For retrieval, we propose a novel technique that exploits semantic relations among descriptors in a document corpus and a new concept lattice navigation strategy (called cousin concepts), enabling us to support classification-based reasoning and provide better results than state-of-the-art retrieval techniques. The basic notion in our strategy is to support query modification through "term replacements" using the lattice structure and semantic similarity. For indexing, we propose a new model that supports the vector space model of retrieval using concept lattices. One of the main limitations of current FCA-based IR systems is related to the binary nature of the input data required for FCA to generate a concept lattice. We propose the use of pattern structures, an extension of FCA to deal with complex object descriptions, in order to support more advanced retrieval paradigms like the vector space model. In addition, we propose an advanced model for heterogeneous indexing through which we can combine the vector space model and the Boolean retrieval model. The main advantage of this approach is the ability to support indexing of convex regions in an arbitrary vector space built from a document collection. Finally, we move on to a mining model associated with document indexing, namely exhaustive bicluster enumeration using FCA. Biclustering is an emerging data analysis technique in which objects are related by similarity under certain attributes of the description space, instead of the whole description space as in standard clustering. By translating this problem to the framework of FCA, we are able to exploit the robust machinery associated with the computation of concept lattices to provide an algorithm for mining biclusters based on similar values. We show that our technique performs better than current exhaustive enumeration biclustering techniques. Formal concept analysis Information retrieval Biclustering Relevance feedback Recommender systems 025.4
18	Data mining using the crossing minimization paradigm Abdullah, Ahsan January 2007 Our ability and capacity to generate, record and store multi-dimensional, apparently unstructured data is increasing rapidly, while the cost of data storage is going down. The data recorded is not perfect, as noise gets introduced in it from different sources. Some of the basic forms of noise are incorrect recording of values and missing values. The formal study of discovering useful hidden information in the data is called Data Mining. Because of the size and complexity of the problem, practical data mining problems are best attempted using automatic means. Data Mining can be categorized into two types: supervised learning (classification) and unsupervised learning (clustering). Clustering only the records in a database (or data matrix) gives a global view of the data and is called one-way clustering. For a detailed analysis or a local view, biclustering (also called co-clustering or two-way clustering) is required, involving the simultaneous clustering of the records and the attributes. In this dissertation, a novel, fast and white-noise-tolerant data mining solution is proposed based on the Crossing Minimization (CM) paradigm; the solution works for one-way as well as two-way clustering for discovering overlapping biclusters. For decades the CM paradigm has traditionally been used for graph drawing and VLSI (Very Large Scale Integration) circuit design for reducing wire length and congestion. The utility of the proposed technique is demonstrated by comparing it with other biclustering techniques using simulated noisy, as well as real data from Agriculture, Biology and other domains. Two other interesting and hard problems also addressed in this dissertation are (i) the Minimum Attribute Subset Selection (MASS) problem and (ii) the Bandwidth Minimization (BWM) problem of sparse matrices. 
The proposed CM technique is demonstrated to provide very convincing results while attempting to solve the said problems using real public domain data. Pakistan is the fourth largest supplier of cotton in the world. An apparent anomaly has been observed during 1989-97 between cotton yield and pesticide consumption in Pakistan showing unexpected periods of negative correlation. By applying the indigenous CM technique for one-way clustering to real Agro-Met data (2001-2002), a possible explanation of the anomaly has been presented in this thesis. 005.3
19	Avaliação sistemática de técnicas de bi-agrupamento de dados / A systematic comparative evaluation of biclustering techniques Padilha, Victor Alexandre 23 September 2016 Data clustering is a fundamental problem in unsupervised machine learning, whose objective is to find categories that describe a dataset according to similarities between its objects. In its traditional formulation, we search for partitions or hierarchies of partitions containing clusters such that the objects in the same cluster are similar to each other and dissimilar to objects from other clusters, according to a similarity or dissimilarity measure that uses all the data attributes in its calculation. It is thus assumed that all clusters are characterized in the same feature space. However, there are several applications where the clusters are characterized only in a subset of the attributes, which may differ from one cluster to another. Unlike traditional data clustering algorithms, biclustering algorithms are able to cluster the rows and columns of a data matrix simultaneously, producing biclusters formed by strongly related subsets of objects and subsets of attributes. These algorithms started to draw the scientific community's attention only after studies showed their importance for gene expression data analysis. To a lesser degree, biclustering techniques have also been used in other application domains, such as text mining and collaborative filtering in recommendation systems. The problem is that many biclustering algorithms have been proposed in recent years based on different principles and assumptions, which can lead to different outcomes on the same dataset. It therefore becomes important to perform comparative studies that contrast the behavior and performance of the various algorithms. This thesis presents a comparative study of 17 biclustering algorithms (representative of the main categories of algorithms in the literature), tested on synthetic and real data collections, with particular emphasis on gene expression data analysis. Several methodological aspects and experimental evaluation procedures were taken into account in order to overcome the limitations of previous comparative studies in the literature. Beyond the comparison itself, the comparative framework developed can be reused to compare other algorithms in the future. Data clustering Biclustering Clustering Gene expression
20	Identification of gene expression changes in human cancer using bioinformatic approaches Griffith, Obi Lee 05 1900 The human genome contains tens of thousands of gene loci which code for an even greater number of protein and RNA products. The highly complex temporal and spatial expression of these genes makes possible all the biological processes of life. Altered gene expression by mutation or deregulation is fundamental for the development of many human diseases. The ultimate aim of this thesis was to identify gene expression changes relevant to cancer. The advent of genome-wide expression profiling techniques, such as microarrays, has provided powerful new tools to identify such changes and researchers are now faced with an explosion of gene expression data. Processing, comparing and integrating these data present major challenges. I approached these challenges by developing and assessing novel methods for cross-platform analysis of expression data, scalable subspace clustering, and curation of experimental gene regulation data from the published literature. I found that combining results from different expression platforms increases reliability of coexpression predictions. However, I also observed that global correlation between platforms was generally low, and few gene pairs reached reasonable thresholds for high-confidence coexpression. Therefore, I developed a novel subspace clustering algorithm, able to identify coexpressed genes in experimental subsets of very large gene expression datasets. Biological assessment against several metrics indicates that this algorithm performs well. I also developed a novel meta-analysis method to identify consistently reported genes from differential expression studies when raw data are unavailable. This method was applied to thyroid cancer, producing a ranked list of significantly over-represented genes. 
Tissue microarray analysis of some of these candidates and others identified a number of promising biomarkers for diagnostic and prognostic classification of thyroid cancer. Finally, I present ORegAnno (www.oreganno.org), a resource for the community-driven curation of experimentally verified regulatory sequences. This resource has proven a great success with ~30,000 sequences entered from over 900 publications by ~50 contributing users. These data, methods and resources contribute to our overall understanding of gene regulation, gene expression, and the changes that occur in cancer. Such an understanding should help identify new cancer mechanisms, potential treatment targets, and have significant diagnostic and prognostic implications. Bioinformatics Gene expression Gene regulation SAGE Tissue microarray Thyroid cancer Subspace clustering Biclustering Ontology Biomarker
