11 |
Synthesizing Regularity Exposing Attributes in Large Protein Databases
de la Maza, Michael 01 May 1993
This thesis describes a system that synthesizes regularity exposing attributes from large protein databases. After processing primary and secondary structure data, this system discovers an amino acid representation that captures what are thought to be the three most important amino acid characteristics (size, charge, and hydrophobicity) for tertiary structure prediction. A neural network trained using this 16 bit representation achieves a performance accuracy on the secondary structure prediction problem that is comparable to the one achieved by a neural network trained using the standard 24 bit amino acid representation. In addition, the thesis describes bounds on secondary structure prediction accuracy, derived using an optimal learning algorithm and the probably approximately correct (PAC) model.
|
12 |
Structure Pattern Analysis Using Term Rewriting and Clustering Algorithm
Fu, Xuezheng 27 June 2007
Biological data are accumulating at a fast pace. However, raw data are generally difficult to understand and not useful unless we unlock the information hidden in them. Knowledge and information can be extracted as the patterns or features buried within the data. Thus data mining, which aims at uncovering underlying rules, relationships, and patterns in data, has emerged as one of the most exciting fields in computational science. In this dissertation, we develop efficient approaches to the structure pattern analysis of RNA and protein three-dimensional structures. The major techniques used in this work include term rewriting and clustering algorithms. First, a new approach is designed to study the interaction of RNA secondary structure motifs using the concept of term rewriting. Second, an improved K-means clustering algorithm is proposed to estimate the number of clusters in data. A new distance descriptor is introduced for the appropriate representation of three-dimensional structure segments of RNA and proteins. The experimental results show improvements in the determination of the number of clusters in data, evaluation of RNA structure similarity, and RNA structure database search, as well as a better understanding of the protein sequence-structure correspondence.
|
13 |
An Improved Density-Based Clustering Algorithm Using Gravity and Aging Approaches
Al-Azab, Fadwa Gamal Mohammed January 2015
Density-based clustering is a well-known family of algorithms that group samples according to their densities. In existing density-based clustering algorithms, samples are clustered according to the total number of points within the radius of the defined dense region. This method of determining density, however, provides little knowledge about the similarities among points. Additionally, these algorithms are not flexible enough to deal with dynamic data that changes over time. The current study addresses these challenges by proposing a new approach that incorporates new measures to evaluate attribute similarities while clustering incoming samples, rather than considering only the total number of points within a radius. The new approach is based on the notion of Gravity, where incoming samples are clustered according to the force of their neighbouring samples. The Mass (density) of a cluster is measured using various approaches, including the number of neighbouring samples and the Silhouette measure. The neighbouring sample with the highest force is then the one that pulls the new incoming sample into its cluster. Taking into account the attribute similarities of points provides more information by accurately defining the dense regions around the incoming samples. It also determines the best neighbourhood to which the new sample belongs. In addition, the proposed algorithm introduces a new approach, called Aging, that forms clusters with different shapes over time when dealing with dynamic data. Aging enables the proposed algorithm to use memory efficiently by removing aged points if they do not participate in clustering incoming samples, consequently changing the shapes of the clusters incrementally.
Four experiments are conducted in this study to evaluate the performance of the proposed algorithm. The performance and effectiveness of the proposed algorithm are validated on a synthetic dataset (to visualize the changes of the clusters’ shapes over time), as well as real datasets. The experimental results confirm that the proposed algorithm is improved in terms of the performance measures including Dunn Index and SD Index. The experimental results also demonstrate that the proposed algorithm utilizes less memory, with the ability to form clusters with arbitrary shapes that are changeable over time.
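As a rough illustration of the Gravity idea described in this abstract, the sketch below assigns an incoming sample to the cluster of the neighbour that exerts the largest force on it. The force formula (mass divided by squared distance), the use of a neighbour count as mass, and the names `neighbour_count` and `assign_by_gravity` are all assumptions made for illustration; the abstract does not give the thesis's actual definitions, and the Aging step is not modelled here.

```python
import math

def neighbour_count(point, points, radius):
    """Number of other points within `radius` of `point` (assumed 'mass')."""
    return sum(1 for q in points if q is not point and math.dist(point, q) <= radius)

def assign_by_gravity(sample, labelled_points, radius):
    """Return the label of the neighbour with the strongest pull on `sample`.

    labelled_points: list of (point, label) pairs.
    Assumed force: mass / distance**2, with mass = neighbour count + 1.
    Returns None if no labelled point lies within `radius`.
    """
    points = [p for p, _ in labelled_points]
    best_label, best_force = None, 0.0
    for p, label in labelled_points:
        d = math.dist(sample, p)
        if d > radius:
            continue  # only neighbours inside the radius can attract the sample
        mass = neighbour_count(p, points, radius) + 1
        force = mass / max(d, 1e-12) ** 2  # guard against zero distance
        if force > best_force:
            best_label, best_force = label, force
    return best_label
```

A sample that has no labelled neighbour within the radius is left unassigned (`None`); whether such a sample should start a new cluster or be treated as noise is a design choice this sketch leaves open.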
|
14 |
Transforming GPS Points to Daily Activities Using Simultaneously Optimized DBSCAN-TE Parameters
Riches, Gillian Michele 05 December 2022
With the recent upsurge in mental health concerns and ongoing isolation regulations brought about by the COVID-19 pandemic, it is important to understand how an individual's daily travel behavior can affect their mental health. Before finding any correlations to mental health, researchers must first have individual travel behavior information: an accurate number of activities and the locations of those activities. One way to obtain daily travel behavior information is through the interpretation of cellular Global Positioning System (GPS) data. Previous methods that interpret GPS data into travel behavior information have limitations. Specifically, rule-based algorithms are structured around subjective rule-based tests, clustering algorithms include only spatial parameters that are chosen sequentially or require further exploration, and imputation algorithms are sensitive to provided context (input parameters) and/or require large amounts of training data to validate the results of the algorithm. Due to the lack of the training data that an imputation algorithm would require, this thesis uses a previously adopted clustering method. The objective of this thesis is to determine which spatial, entropy, and time parameters cause the clustering algorithm to give the most accurate travel behavior results. This optimal set of parameters was determined using a comparison of two non-linear optimization methods: simulated annealing and a limited-memory Broyden-Fletcher-Goldfarb-Shanno Bound (L-BFGS-B) optimizer. Ultimately, simulated annealing optimization found the best set of clustering parameters, leading to 91% clustering algorithm accuracy, whereas L-BFGS-B optimization found parameters that were only able to produce a maximum of 79% accuracy. Using the optimal set of parameters in the clustering algorithm, an entire set of GPS data can be interpreted to determine an individual's daily travel behavior.
This resulting individual travel behavior sets the groundwork to answer the question of how individual travel behavior can affect mental health.
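To make the parameter search in this abstract more concrete, here is a minimal simulated annealing sketch over a box of parameter values. It is a generic textbook-style version, not the thesis's implementation: the linear cooling schedule, the Gaussian proposal width, and the function name `simulated_annealing` are assumptions. The routine minimises an objective, so maximising clustering accuracy would correspond to minimising an error such as 1 minus accuracy.

```python
import math
import random

def simulated_annealing(objective, start, bounds, steps=2000, temp0=1.0, seed=0):
    """Minimise `objective` over a box; returns (best_params, best_value)."""
    rng = random.Random(seed)
    current = list(start)
    current_val = objective(current)
    best, best_val = list(current), current_val
    for step in range(steps):
        temp = temp0 * (1 - step / steps) + 1e-9  # assumed linear cooling schedule
        # Propose a small random move, clipped to the bounds.
        candidate = [
            min(hi, max(lo, x + rng.gauss(0, 0.1 * (hi - lo))))
            for x, (lo, hi) in zip(current, bounds)
        ]
        cand_val = objective(candidate)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if cand_val < current_val or rng.random() < math.exp((current_val - cand_val) / temp):
            current, current_val = candidate, cand_val
            if current_val < best_val:
                best, best_val = list(current), current_val
    return best, best_val
```

Because worse moves are sometimes accepted early on, the search can leave a local optimum that a gradient-based optimizer started at the same point might stay in, which is one possible reading of the accuracy gap reported above.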
|
15 |
Initialization of the k-means algorithm: A comparison of three methods
Jorstedt, Simon January 2023
k-means is a simple and flexible clustering algorithm that has remained in common use for 50+ years. In this thesis, we discuss the algorithm in general, its advantages, weaknesses and how its ability to locate clusters can be enhanced with a suitable initialization method. We formulate appropriate requirements for the (batched) UnifRandom, k-means++ and Kaufman initialization methods and compare their performance on real and generated data through simulations. We find that all three methods (followed by the k-means procedure) are able to accurately locate at least up to nine well-separated clusters, but the appropriately batched UnifRandom and the Kaufman methods are both significantly more computationally expensive than the k-means++ method already for K = 5 clusters in a dataset of N = 1000 points.
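Of the three initialization methods named in this abstract, k-means++ is the simplest to sketch: the first centre is drawn uniformly, and each later centre is drawn with probability proportional to its squared distance from the nearest centre chosen so far. The code below is a minimal pure-Python sketch; the function name `kmeans_pp_init` and the fixed seed are illustrative choices, and it assumes the data contain at least `k` distinct points.

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centres using k-means++ (D^2-weighted sampling).

    Assumes `points` contains at least k distinct points.
    """
    rng = random.Random(seed)
    centres = [rng.choice(points)]
    # Squared distance from each point to its nearest chosen centre.
    d2 = [sum((a - b) ** 2 for a, b in zip(p, centres[0])) for p in points]
    while len(centres) < k:
        # Sample the next centre with probability proportional to d2.
        nxt = rng.choices(points, weights=d2, k=1)[0]
        centres.append(nxt)
        # Update nearest-centre distances with the newly added centre.
        d2 = [min(old, sum((a - b) ** 2 for a, b in zip(p, nxt)))
              for old, p in zip(d2, points)]
    return centres
```

The returned centres would then be handed to the usual k-means iterations. Each seeding round is a single pass over the data, which is consistent with the lower computational cost the abstract reports for k-means++.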
|
16 |
Exploring the Nature of Benefits and Costs of Open Innovation for Universities by Using a Stochastic Multi-Criteria Clustering Approach: The Case of University-Industry Research Collaboration
Zare, Javid January 2022
Open innovation, which Henry Chesbrough introduced in 2003, promotes the use of input from outsiders to strengthen internal innovation processes and the search for outside commercialization opportunities for what is developed internally. Open innovation has enabled both academics and practitioners to design innovation strategies based on the reality of our connected world.
Although the literature has identified and explored a variety of benefits and costs, to the best of our knowledge, no study has reviewed the benefits and costs of open innovation in terms of their importance for strategic performance. To conduct such a study, we need to take into account two main issues. First, the benefits and costs of open innovation are numerous, so a comprehensive comparison must cover a large number of them. Second, to have a fair comparison, benefits and costs must be compared in terms of different performance criteria, both financial and non-financial.
Given the issues above, exploring benefits and costs is a complex process. We therefore use multiple-criteria decision-making (MCDM) methods, which have shown promise on complex exploratory problems. In particular, we show how a stochastic multi-criteria clustering algorithm, one of the recently introduced MCDM methods, can bring promising results when exploring the strategic importance of the benefits and costs of open innovation.
Since there is no comprehensive understanding of the nature of the benefits and costs of open innovation, the proposed model aims to cluster them into hierarchical groups to help researchers identify the most crucial benefits and costs concerning different dimensions of performance. In addition, the model is able to deal with uncertainties related to technical parameters such as criteria weights and preference thresholds. We apply the model in the context of open innovation for universities concerning their research collaboration with industries. An online survey was conducted to collect experts' opinions on the open-innovation benefits and costs of university-industry research collaboration, given different performance dimensions.
The results obtained through the cluster analysis indicate that university researchers collaborate with industry mainly for knowledge-related and research-related reasons rather than economic reasons. This research also indicates that the most important benefits of university-industry research collaboration for universities are implementing the learnings, increased know-how, accessing specialized infrastructures, accessing a greater idea and knowledge base, sensing and seizing new technological trends, and keeping the employees engaged. In addition, the results show that the most important costs are the lack of necessary resources to monitor activities between university and industry, an increased resistance to change among employees, conflict of interest (different missions), an increased tendency among employees to avoid using knowledge that they did not create themselves, time costs associated with bureaucratic rules, and loss of focus. The research's findings enable researchers to analyze issues related to open innovation for universities more effectively and to define their research projects on these issues in line with the priorities of universities.
|
17 |
Clustering System and Clustering Support Vector Machine for Local Protein Structure Prediction
Zhong, Wei 02 August 2006
Protein tertiary structure plays a very important role in determining a protein's possible functional sites and chemical interactions with other related proteins. Experimental methods to determine protein structure are time consuming and expensive. As a result, the gap between the number of known protein sequences and known structures has widened substantially due to high-throughput sequencing techniques. The limitations of experimental methods motivate us to develop computational algorithms for protein structure prediction. In this work, a clustering system is used to predict local protein structure. First, recurring sequence clusters are explored with an improved K-means clustering algorithm. Carefully constructed sequence clusters are used to predict local protein structure. After obtaining the sequence clusters and motifs, we study how sequence variation within sequence clusters may influence their structural similarity. Analysis of the relationship between sequence variation and structural similarity shows that sequence clusters with tight sequence variation have high structural similarity and sequence clusters with wide sequence variation have poor structural similarity. Based on this knowledge, the established clustering system is used to predict the tertiary structure of local sequence segments. Test results indicate that the highest quality clusters can give highly reliable prediction results and high quality clusters can give reliable prediction results. In order to improve the performance of the clustering system for local protein structure prediction, a novel computational model called Clustering Support Vector Machines (CSVMs) is proposed. In our previous work, the sequence-to-structure relationship was explored with the conventional K-means algorithm. The K-means clustering algorithm may not capture the nonlinear sequence-to-structure relationship effectively.
As a result, we consider using Support Vector Machine (SVM) to capture the nonlinear sequence-to-structure relationship. However, SVM is not favorable for huge datasets including millions of samples. Therefore, we propose a novel computational model called CSVMs. Taking advantage of both the theory of granular computing and advanced statistical learning methodology, CSVMs are built specifically for each information granule partitioned intelligently by the clustering algorithm. Compared with the clustering system introduced previously, our experimental results show that accuracy for local structure prediction has been improved noticeably when CSVMs are applied.
|
18 |
Discovery and Extraction of Protein Sequence Motif Information that Transcends Protein Family Boundaries
Chen, Bernard 17 July 2009
Protein sequence motifs are attracting more and more attention in the field of sequence analysis. These recurring patterns have the potential to determine the conformation, function and activities of proteins. In our work, we obtain protein sequence motifs that are universally conserved across protein family boundaries. Therefore, unlike most popular motif discovery algorithms, our input dataset is extremely large. As a result, an efficient technique is essential. We use two granular computing models, Fuzzy Improved K-means (FIK) and Fuzzy Greedy K-means (FGK), in order to efficiently generate protein motif information. After that, we develop an efficient Super Granular SVM Feature Elimination model to further extract the motif information. During the motif search process, setting up a fixed window size in advance may reduce the computational complexity and increase the efficiency. However, due to the fixed size, our model may deliver a number of similar motifs that are simply shifted by some bases or include mismatches. We develop a new strategy named Positional Association Super-Rule to confront the problem of motifs generated from a fixed window size. It combines super-rule analysis with a novel Positional Association Rule algorithm. We use the super-rule concept to construct a Super-Rule-Tree (SRT) by a modified HHK clustering, which requires no parameter setup to identify the similarities and dissimilarities between the motifs. The positional association rule is created and applied to search for similar motifs that are shifted by some residues. By analyzing the motif results generated by our approaches, we find that these motifs are significant not only in sequence but also in secondary structure similarity and biochemical properties.
|
19 |
A General Framework for Discovering Multiple Data Groupings
Sweidan, Dirar January 2018
Clustering helps users gain insights from their data by discovering hidden structures in an unsupervised way. Unlike classification tasks, which are evaluated using well-defined target labels, clustering is an intrinsically subjective task, as it depends on the interpretation, needs and interests of users. In many real-world applications, multiple meaningful clusterings can be hidden in the data, and different users are interested in exploring different perspectives and use cases of the same data. Despite this, most existing clustering techniques only attempt to produce a single clustering of the data, which can be too strict. In this thesis, a general method is proposed to discover multiple alternative clusterings of the data and let users select the clustering(s) they are most interested in. In order to cover a large set of possible clustering solutions, a diverse set of clusterings is first generated based on various projections of the data. Then, similar clusterings are found, filtered, and aggregated into one representative clustering, allowing the user to explore only a small set of non-redundant representative clusterings. We compare the proposed method against others and analyze its advantages and disadvantages, based on artificial and real-world datasets, as well as on images that enable a visual assessment of the meaningfulness of the discovered clustering solutions. In addition, extensive studies and analyses of a variety of techniques used in the method are carried out. Results show that the proposed method is able to discover multiple interesting and meaningful clustering solutions.
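The step of finding and filtering similar clusterings can be illustrated with a standard pairwise agreement measure. The sketch below uses the Rand index, which is a swapped-in choice for illustration: the abstract does not say which similarity measure the thesis uses. The names `rand_index` and `filter_redundant` and the 0.9 threshold are assumptions. Unlike the method in the abstract, this sketch only drops near-duplicates and does not aggregate them into a representative clustering.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree.

    A pair 'agrees' if both clusterings put it in the same cluster
    or both put it in different clusters. Assumes at least two points.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

def filter_redundant(clusterings, threshold=0.9):
    """Greedily keep clusterings whose Rand index to every kept one is below threshold."""
    kept = []
    for labels in clusterings:
        if all(rand_index(labels, other) < threshold for other in kept):
            kept.append(labels)
    return kept
```

Because the Rand index compares pair memberships rather than label values, two clusterings that differ only in how the clusters are numbered count as identical and are filtered as redundant.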
|
20 |
Approaches to Modularity in Product Architecture
Börjesson, Fredrik January 2012
Modular product architecture is characterized by the existence of standardized interfaces between the physical building blocks. A module is a collection of technical solutions that perform a function, with interfaces selected for company-specific strategic reasons. Approaches to modularity are the structured methods by which modular product architectures are derived. The approaches include Modular Function Deployment (MFD), Design Structure Matrix (DSM), Function Structure Heuristics and many others, including hybrids. The thesis includes a survey of relevant theory and a discussion of four challenges in product architecture research, detailed in the appended papers. One common experience from project work is that structured methods such as DSM or MFD often do not yield fully conclusive results. This is usually because the algorithms used to generate modules do not have enough relevant data. Thus, we ask whether it is possible to introduce new data to make the output more conclusive. A case study is used to answer this question. The analysis indicates that with additional properties to capture product geometry, and flow of matter, energy, or information, the output is more conclusive. Even when product development projects have an architecture definition phase, very little time is spent actually selecting the most suitable tool. Several academic models are available, but they use incompatible criteria, and do not capture experience-based or subjective criteria we may wish to include. The research question is whether we can define selection criteria objectively using academic models and experience-based criteria. The author gathers criteria from three academic models, adds experience criteria, performs a pairwise comparison of all available criteria and applies a hierarchical cluster analysis, with subsequent interpretation. The resulting evaluation model is tested on five approaches to modularity. Several conclusions are discussed.
One is that of the five approaches studied, MFD and DSM have the most complementary sets of strengths and weaknesses, and that hybrids between these two fundamental approaches would be particularly interesting. The majority of all product development tries to improve existing products. A common criticism against all structured approaches to modularity is that they work best for existing products. Is this perhaps a misconception? We ask whether MFD and DSM can be used on novel product types at an early phase of product development. MFD and DSM are applied to the hybrid drive train of a Forwarder. The output of the selected approaches is compared and reconciled, indicating that conclusions about a suitable modular architecture can be derived, even when many technical solutions are unknown. Among several conclusions, one is that the electronic inverter must support several operating modes that depend on high-level properties of the drive train itself (such as whether regeneration is used). A modular structure for the electronic inverter is proposed. Module generation in MFD is usually done with Hierarchical Cluster Analysis (HCA), where the results are presented in the form of a Dendrogram. Statistical software can generate a Dendrogram in a matter of seconds. For DSM, the situation is different. Most available algorithms require a fair amount of processing time. One popular algorithm, the Idicula-Gutierrez-Thebeau Algorithm (IGTA), requires a total time of a few hours for a problem of medium complexity (about 60 components). The research question is whether IGTA can be improved to execute faster, while maintaining or improving quality of output. Two algorithmic changes together reduce the required execution time by a factor of seven to eight in the trials, and improve quality of output by about 15 percent.
|