1 |
Osteological Comparisons of the Eastern Newt (Notophthalmus viridescens) Between the Terrestrial Eft and Adult Stage
Hardgrave, Aaron; Carter, Richard T 06 April 2022
Eastern Newts (Notophthalmus viridescens) are a ubiquitous member of eastern North America's caudate fauna. Unlike the typical amphibian, their life cycle is split into three phases instead of two, commonly called a triphasic life cycle. The larvae of N. viridescens are fully aquatic and eventually metamorphose into terrestrial juveniles, called efts. Upon sexual maturity, the eft metamorphoses into a semi-aquatic adult whose external morphology is typical of an aquatic salamander. Because the life stages occupy different ecological niches, different forces act on their skeletons. We hypothesize that differences in buoyancy, torsion, and locomotion produce differences in the morphology of the axial skeleton. Using image data generated on a SkyScan 1273 micro-computed tomography (µCT) scanner, 3D shape analyses will be used to quantify shape differences between vertebrae and test the hypothesis. Three-dimensional digital models of each vertebra of interest will be rendered from the scans in Dragonfly (Object Research Systems). Each 3D model is then loaded into SlicerMorph (3D Slicer), where landmarks are placed on homologous structures of each vertebra. A Generalized Procrustes Analysis (GPA) followed by a Principal Component Analysis (PCA) is conducted for each vertebra to test for potential shape differences between life stages. GPA and PCA will be conducted on 10 terrestrial juveniles, 10 semi-aquatic adults, 5 aquatic juveniles, and 5 paedomorphic adults. The 5 aquatic juveniles and the 5 paedomorphic adults (eastern newts that remain in the water throughout their lives) will be used to test whether the semi-aquatic adult is truly adapting towards an aquatic lifestyle.
If GPA and PCA indicate statistical shape differences between certain vertebrae, those vertebrae will be run through the Automated Landmarking through Pointcloud Alignment and Correspondence Analysis (ALPACA) module of SlicerMorph to produce heatmap data on the 3D models showing where exactly the shape changes are occurring in the vertebra.
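The Procrustes superimposition at the core of GPA can be illustrated with a minimal sketch. It assumes 2D landmarks stored as complex numbers, whereas the study above uses 3D landmarks placed in SlicerMorph; the function names and coordinates are hypothetical and are not taken from the abstract. The sketch removes position, size, and rotation from two landmark configurations and reports the shape difference that remains. A full GPA would repeat this alignment against an iteratively updated mean shape, and PCA would then be run on the aligned coordinates.

```python
import math
from typing import List, Sequence, Tuple


def normalize(config: Sequence[complex]) -> List[complex]:
    """Center a landmark configuration and scale it to unit centroid size."""
    centroid = sum(config) / len(config)
    centered = [z - centroid for z in config]
    size = math.sqrt(sum(abs(z) ** 2 for z in centered))
    return [z / size for z in centered]


def superimpose(reference: Sequence[complex],
                target: Sequence[complex]) -> Tuple[List[complex], List[complex]]:
    """Rotate the normalized target onto the normalized reference (no reflection)."""
    ref = normalize(reference)
    tgt = normalize(target)
    # The least-squares rotation is the unit complex number conj(s) / |s|.
    s = sum(a.conjugate() * b for a, b in zip(ref, tgt))
    rotation = s.conjugate() / abs(s)
    return ref, [rotation * b for b in tgt]


def procrustes_distance(reference: Sequence[complex],
                        target: Sequence[complex]) -> float:
    """Root summed squared distance between landmarks after superimposition."""
    ref, aligned = superimpose(reference, target)
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(ref, aligned)))
```

Two configurations that differ only by translation, rotation, and scale give a distance of approximately zero, so any remaining distance reflects a difference in shape alone.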
|
2 |
A framework for emerging topic detection in biomedicine
Madlock-Brown, Charisse Renee 01 December 2014
Emerging topic detection algorithms have the potential to assist researchers in maintaining awareness of current trends in biomedical fields--a feat not easily achieved with existing methods. Though topic detection algorithms for news-cycles exist, several aspects of this particular area make applying them directly to scientific literature problematic.
This dissertation offers a framework for emerging topic detection in biomedicine. The framework includes a novel set of weightings based on the historical importance of each topic identified. Features such as journal impact factor and funding data are used to develop a fitness score that identifies which topics are likely to burst in the future. Bursts were characterized by discipline over an extended planning horizon to establish what a typical burst trend looks like in this space and, in turn, how to identify important or emerging trends. Cluster analysis was used to create an overlapping hierarchical structure of scientific literature at the discipline level. This allows for granularity adjustment (e.g. discipline level or research area level) in emerging topic detection for different users. Cluster analysis also allows for the identification of terms that may not be included in annotated taxonomies, because they are new or were not considered relevant when the taxonomy was last updated. Weighting topics by historical frequency allows for better identification of bursts that are associated with less well-known areas and are therefore more surprising. The fitness score allows for the early identification of bursty terms. This framework will benefit policy makers, clinicians and researchers.
|
3 |
MACHINE LEARNING BASED IDS LOG ANALYSIS
Tianshuai Guan (10710258) 06 May 2021
With the rapid development of information technology, network traffic is also increasing dramatically. However, many cyber-attack records are buried in this large amount of network traffic. Therefore, many Intrusion Detection Systems (IDS) that can extract those malicious activities have been developed. Zeek is one of them, and due to its powerful functions and open-source environment, Zeek has been adopted by many organizations. Information Technology at Purdue (ITaP), which uses Zeek as its IDS, captures netflow logs for all network activity across the campus but has not delved into effective use of the information. This thesis examines ways to help increase the performance of anomaly detection. To that end, this project combines basic database concepts with several different machine learning algorithms and compares the results of the different combinations to better find potential attack activities in log files.
|
4 |
K-groups: A Generalization of K-means by Energy Distance
Li, Songzi 29 April 2015
No description available.
|
5 |
Discovering Issue Networks Using Data Mining Techniques
Chuang, Tse-sheng 01 August 2002
With the recent development of data mining techniques, the knowledge discovered through data mining now ranges from business applications to fraud detection. However, too often we see only the profit-making justification for investing in data mining, losing sight of the fact that it can help resolve issues of global or national importance. In this research, we propose an architecture for issue-oriented information construction and knowledge discovery related to political or public policy issues. In this architecture, we adopt issue networks as the description model and data mining as the core technique. The study is implemented and verified by constructing a prototype system and analyzing case data.
There are three main topics in our research. The first, issue network information construction, starts by retrieving information on a specified issue from the text files of news reports. Keywords retrieved from the news reports are converted into structured network nodes and presented in the form of issue networks. The second topic is the clustering of network actors. We adopt an issue-association clustering method to provide clustered views of issue participants based on the relations among issues. In the third topic, we use a specified link analysis method to compute the importance of actors and sub-issues.
Our study concludes with a performance evaluation by domain experts. We evaluate recall and precision for the first topic above, and certainty, novelty, and utility for the others.
|
6 |
Using Hadoop to Cluster Data in Energy System
Hou, Jun 03 June 2015
No description available.
|
7 |
Approximation to K-Means-Type Clustering
Wei, Yu 05 1900
Clustering involves partitioning a given data set into several groups based on some similarity/dissimilarity measurements. Cluster analysis has been widely used in information retrieval, text and web mining, pattern recognition, image segmentation and software reverse engineering.
K-means is the most intuitive and popular clustering algorithm and the workhorse for clustering. However, the classical K-means suffers from several flaws. First, the algorithm is very sensitive to the initialization method and can easily be trapped at a local minimum with respect to the measurement (the sum of squared errors) used in the model. Second, it has been proved that finding the global minimum of the sum of squared errors is NP-hard even when k = 2. In the present model for K-means clustering, all the variables are required to be discrete and the objective is nonlinear and nonconvex.
In the first part of the thesis, we consider how to derive an optimization model for the minimum sum of squared errors of a given data set based on continuous convex optimization. For this, we first reformulate K-means clustering as a novel optimization model, 0-1 semidefinite programming, in which the eigenvalues of the matrix argument must be 0 or 1. This provides a unified framework for many other clustering approaches, such as spectral clustering and normalized cut. The new optimization model also allows us to attack the original problem via relaxed linear and semidefinite programming.
Moreover, we consider how to obtain a feasible solution of the original clustering problem from an approximate solution of the relaxed problem. Using principal component analysis, we construct a rounding procedure to extract a feasible clustering and show that our algorithm provides a 2-approximation to the global solution of the original problem. The complexity of our rounding procedure is O(n^(k^2(k-1)/2)), which substantially improves on a similar rounding procedure in the literature with complexity O(n^(k^3/2)). In particular, when k = 2, our rounding procedure runs in O(n log n) time. To the best of our knowledge, this is the lowest complexity reported in the literature for finding a solution to K-means clustering with guaranteed quality.
In the second part of the thesis, we consider approximation methods for the so-called balanced bi-clustering. Using a simple heuristic, we prove that we can slightly improve the constrained K-means for bi-clustering. For the special case where the size of each cluster is fixed, we develop a new algorithm, called Q-means, to find a 2-approximation solution to the balanced bi-clustering. We prove that Q-means has a complexity of O(n^2).
Numerical results based on our approaches will be reported in the thesis as well. / Thesis / Master of Science (MSc)
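The sensitivity to initialization described in this abstract can be reproduced with a minimal sketch of the classical K-means iteration (Lloyd's algorithm). This illustrates the baseline method only, not the thesis's semidefinite programming model or its rounding procedure; the function names and the four sample points are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points: Sequence[Point], centers: Sequence[Point]) -> List[int]:
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def update(points: Sequence[Point], labels: Sequence[int],
           centers: Sequence[Point]) -> List[Point]:
    """Move each center to the mean of its members (empty clusters stay put)."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            new_centers.append(old)
            continue
        new_centers.append(tuple(sum(m[d] for m in members) / len(members)
                                 for d in range(len(old))))
    return new_centers


def lloyd(points: Sequence[Point], centers: Sequence[Point],
          max_iter: int = 100) -> Tuple[List[int], List[Point], float]:
    """Run K-means from the given initial centers; return labels, centers, SSE."""
    centers = list(centers)
    labels = assign(points, centers)
    for _ in range(max_iter):
        centers = update(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:  # converged: assignments no longer change
            break
        labels = new_labels
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, sse


# Four corners of a 4 x 1 rectangle: the best split is left pair vs right pair.
corners = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
_, _, sse_good = lloyd(corners, [(0.0, 0.5), (4.0, 0.5)])
_, _, sse_poor = lloyd(corners, [(2.0, 0.0), (2.0, 1.0)])
```

With the first initialization the algorithm finds the left/right split; with the second it converges immediately to a top/bottom split with a much larger sum of squared errors, which is a local minimum of the kind the abstract mentions.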
|
8 |
Quantitative and evolutionary global analysis of enzyme reaction mechanisms
Nath, Neetika January 2015
The most widely used classification system describing enzyme-catalysed reactions is the Enzyme Commission (EC) number. Understanding enzyme function is important for both fundamental scientific and pharmaceutical reasons. The EC classification is essentially unrelated to the reaction mechanism. In this work we address two important questions related to the diversity of enzyme function. First, we investigate the relationship between the reaction mechanisms described in the MACiE (Mechanism, Annotation, and Classification in Enzymes) database and the top-level class of the EC classification. Second, we ask how well enzyme biocatalysis is adapted in nature. In this thesis, we retrieved 335 enzyme reactions from the MACiE database. We consider two ways of encoding the reaction mechanism in descriptors, and three approaches that encode only the overall chemical reaction. We first develop a basic model to cluster the enzymatic reactions. A global study of enzyme reaction mechanisms may provide important insights into the diversity of the chemical reactions of enzymes. Clustering analysis is very common practice in such research. Clustering algorithms suffer from various issues, such as requiring the determination of input parameters and stopping criteria, and very often the need to specify the number of clusters in advance. Using several well-known metrics, we tried to optimize the clustering outputs for each of the algorithms, with equivocal results that suggested the existence of between two and over a hundred clusters. This motivated us to design and implement our own algorithm, PFClust (Parameter-Free Clustering), for which no prior information is required to determine the number of clusters. The analysis highlights the structure of the overall and mechanistic enzyme reactions.
This suggests that mechanistic similarity can influence approaches for function prediction and automatic annotation of newly discovered protein and gene sequences. We then develop and evaluate a method for enzyme function prediction using machine learning. Our results suggest that pairs of similar enzyme reactions tend to proceed by different mechanisms. The machine learning method needs only chemoinformatics descriptors as input and is applicable to regression analysis. The last phase of this work tests the evolution of chemical mechanisms mapped onto ancestral enzymes. Domain occurrence and abundance in modern proteins have shown that the / architecture is probably the oldest fold design. These observations have important implications for the origins of biochemistry and for exploring structure-function relationships. Over half of the known mechanisms were introduced before architectural diversification over evolutionary time. The remaining mechanisms were invented gradually over the evolutionary timeline just after organismal diversification. Moreover, many common mechanisms, including fundamental building blocks of enzyme chemistry, were found to be associated with the ancestral fold.
|
9 |
Genetic diversity and structure of livestock breeds
Wilkinson, Samantha January 2012
This thesis addresses the genetic characterisation of livestock breeds, a key aspect of long-term breed preservation and, thus, of primary interest for animal breeders and management in the industry. First, the genetic diversity and structure of breeds were investigated. The application of individual-based population genetic approaches to characterising genetic structure was assessed using the British pig breeds. All approaches, except for Principal Component Analysis (PCA), found that the breeds were distinct genetic populations. Bayesian genotypic clustering tools agreed that breeds had little individual genetic admixture. However, inconsistent results were observed between the Bayesian methods. Primarily, BAPS detected finer genetic differentiation than other approaches, producing biologically credible genetic populations. BAPS also detected substructure in the British Meishan, consistent with prior known population information. In contrast, STRUCTURE detected substructure in the British Saddleback breed that could not wholly be explained. Further analysis of the British Saddleback revealed that the genetic subdivision did not reflect its historical origin (union of Essex pig and Wessex Saddleback) but was associated with herds. The Rainbarrow herd appeared to be moderately differentiated from the other herds, with relatively lower allelic diversity and higher individual inbreeding, a possible result of certain breeding strategies. The genetic structure and diversity of the British traditional chicken breeds were also characterised. The breeds were found to be highly distinctive populations with moderately high levels of within-breed genetic diversity. However, the majority of the breeds had an observed heterozygote deficit. Although individuals clustered according to their origin for some of the breeds, genetic subdivision of individuals was observed in others.
For two breeds the inferred genetic subpopulations were associated with morphological varieties, but in others they were associated with flock supplier. As with the British Saddleback breed, gene flow between flocks within the chicken breeds should be enhanced to maintain current levels of genetic diversity. Second, the thesis focused on breed identification through the assignment of individuals to breed origin. Dense genome-wide assays provide an opportunity to develop tailor-made panels for food authentication, especially for verifying traditional breed-labelled products. In European cattle breeds, the prior selection of informative markers produced higher correct individual identification than panels of randomly selected markers. Selecting breed-informative markers was more powerful using delta (allele frequency difference) and Wright's FST (allele frequency variation) than using PCA. However, no further gain in power of assignment was achieved by sampling in excess of 200 markers. The power of assignment and the number of markers required were dependent on the levels of breed genetic distinctiveness. The use of dense genome-wide assays and marker selection was further assessed in the British pig breeds. With delta, it was found that 96 informative SNP markers were sufficient for breed differentiation, with the exception of the Landrace and Welsh pair. Assignment of individuals to breed origin was high and few individuals were falsely assigned, especially for the traditional breeds. The probability that a sample of a presumed origin actually originated from that breed was high in the traditional breeds. Validation of the 96-SNP panel using independent test samples of known origin and market samples revealed a high level of breed label conformity.
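The two marker-ranking statistics named in this abstract can be sketched for the simplest case. The sketch assumes biallelic markers, exactly two breeds, and equal weighting of the breeds; the marker names and allele frequencies are invented for illustration, and the thesis may have used different estimators.

```python
from typing import Callable, Dict, List, Tuple


def delta(p_a: float, p_b: float) -> float:
    """Absolute allele frequency difference between two breeds."""
    return abs(p_a - p_b)


def fst(p_a: float, p_b: float) -> float:
    """Wright's FST for one biallelic marker and two equally weighted breeds."""
    p_bar = (p_a + p_b) / 2
    h_total = 2 * p_bar * (1 - p_bar)          # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0                              # marker is fixed in both breeds
    h_within = (2 * p_a * (1 - p_a) + 2 * p_b * (1 - p_b)) / 2
    return (h_total - h_within) / h_total


def top_markers(freqs: Dict[str, Tuple[float, float]], n: int,
                score: Callable[[float, float], float] = delta) -> List[str]:
    """Return the n marker names with the highest score, best first."""
    ranked = sorted(freqs, key=lambda name: score(*freqs[name]), reverse=True)
    return ranked[:n]


# Hypothetical allele frequencies (breed A, breed B) for three SNP markers.
example_freqs = {
    "snp1": (0.9, 0.1),
    "snp2": (0.5, 0.45),
    "snp3": (0.2, 0.7),
}
panel = top_markers(example_freqs, 2)
```

Markers whose allele frequencies differ most between the breeds rank first under either statistic, which is the property a breed-assignment panel relies on.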
|
10 |
Clustering Algorithms for Time Series Gene Expression in Microarray Data
Zhang, Guilin 08 1900
Clustering techniques are important for gene expression data analysis. However, efficient computational algorithms for clustering time-series data are still lacking. This work documents two improvements to an existing profile-based greedy algorithm for short time-series data: the first is a scaling method applied during pre-processing of the raw data to handle some extreme cases; the second is a modified strategy for generating better clusters. Simulated data and real microarray data were used to evaluate these improvements, and the approach efficiently generated more accurate clusters. A new feature-based algorithm was also developed, in which the steady-state value, overshoot, rise time, settling time, and peak time are generated by a second-order control system for the purpose of clustering. This feature-based approach is much faster and more accurate than the existing profile-based algorithm for long time-series data.
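The five features listed in this abstract can be read directly off a sampled response, as in the following minimal sketch. It is an assumption-laden illustration rather than the thesis's method: the steady-state value is taken to be the last sample, the response is assumed to rise towards a positive steady state, rise time uses the common 10% to 90% convention, and settling uses a 2% band. The function name and the sample series are hypothetical.

```python
from typing import Dict, Sequence


def step_features(times: Sequence[float], values: Sequence[float],
                  band: float = 0.02) -> Dict[str, float]:
    """Extract step-response style features from a sampled time series."""
    steady = values[-1]  # assumption: the series has settled by its last sample
    peak_index = max(range(len(values)), key=lambda i: values[i])
    overshoot = max(0.0, (values[peak_index] - steady) / steady)

    # Rise time: first sample at or above 10% to first sample at or above 90%.
    t10 = next(t for t, v in zip(times, values) if v >= 0.1 * steady)
    t90 = next(t for t, v in zip(times, values) if v >= 0.9 * steady)

    # Settling time: first sample after which the series stays inside the band.
    settle_index = 0
    for i, v in enumerate(values):
        if abs(v - steady) > band * abs(steady):
            settle_index = i + 1

    return {
        "steady_state": steady,
        "overshoot": overshoot,
        "rise_time": t90 - t10,
        "settling_time": times[settle_index],
        "peak_time": times[peak_index],
    }


# Hypothetical expression profile sampled at seven time points.
sample_times = [0, 1, 2, 3, 4, 5, 6]
sample_values = [0.0, 0.5, 1.2, 0.9, 1.01, 1.0, 1.0]
features = step_features(sample_times, sample_values)
```

Each series is reduced to a short feature vector of fixed length, so long time series can be clustered on five numbers instead of on every sample.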
|