
Hybrid Algorithms of Finding Features for Clustering Sequential Data

Proteins are
the structural components of living cells and tissues, and thus an
important building block of all living organisms. Patterns in
protein sequences are subsequences that appear frequently.
Such patterns often denote important functional regions in proteins and
can be used to characterize a protein family or to discover the
function of a protein. Moreover, they provide valuable information
about the evolution of species. Grouping protein sequences that
share similar structure helps in identifying sequences with similar
functionality. Many algorithms have been proposed for clustering
proteins according to their similarity, i.e., the sequential
patterns they share in protein databases; examples are the
feature-based clustering algorithms of the global approach and the
local approach. These algorithms first mine sequential patterns to
solve the no-gap-limit sequential pattern problem in a protein
sequence database, and then find global features and local features
separately for clustering. Feature-based clustering algorithms are
an entirely different approach to protein clustering: they do not
require an all-against-all analysis and instead use a K-means based
clustering algorithm of near-linear complexity. Although
feature-based clustering algorithms are scalable and lead to
reasonably good clusters, they spend extra time performing the global
approach and the local approach separately. Therefore, in this
thesis, we propose hybrid algorithms to find and mark features for
feature-based clustering algorithms. We observe an interesting
relation between the local features and the closed frequent
sequential patterns: some features among the closed frequent
sequential patterns can be decomposed into several features among
the selected local features, and the total support of these local
features equals the support of the corresponding feature among the
closed frequent sequential patterns.
After mining sequential patterns, both the global approach and the
local approach have two phases: find-feature and mark-feature. In
the hybrid algorithm of Method 1 (LocalG), we first find and mark
the local features. Then, we find the global features. Finally, we
mark the bit vectors of the global features efficiently from the bit
vectors of the local features. In the hybrid algorithm of Method 2
(CLoseLG), we first find the closed frequent sequential patterns
directly. Next, we find local candidate features efficiently from
the closed frequent sequential patterns and then mark the local
features. Finally, we find and mark the global features. Our
performance study on biological data and synthetic data shows that
the proposed hybrid algorithms are more efficient than the
feature-based algorithm.
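
The terms used above (support, bit-vector marking, and closed frequent patterns) can be illustrated with a minimal Python sketch. It is not the thesis's LocalG or CLoseLG algorithm. It assumes, for simplicity, that a pattern is a contiguous substring of bounded length, and the function names and the three toy sequences are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def contiguous_patterns(seq, max_len):
    """Distinct contiguous substrings of seq, up to max_len characters.

    Assumption: contiguous substrings stand in for sequential patterns.
    """
    return {seq[i:i + n]
            for n in range(1, max_len + 1)
            for i in range(len(seq) - n + 1)}


def mark_features(database, max_len=3):
    """Map each pattern to a bit vector stored as an int.

    Bit k is set when sequence k of the database contains the pattern.
    """
    bits = defaultdict(int)
    for k, seq in enumerate(database):
        for pattern in contiguous_patterns(seq, max_len):
            bits[pattern] |= 1 << k
    return dict(bits)


def support(bitvec):
    """Number of sequences containing the pattern (count of set bits)."""
    return bin(bitvec).count("1")


def closed_frequent(bits, min_support):
    """Frequent patterns with no longer frequent pattern of equal support.

    Closedness is judged only among patterns up to the max_len used in
    mark_features, which is a simplification.
    """
    frequent = {p: b for p, b in bits.items() if support(b) >= min_support}
    closed = {}
    for p, b in frequent.items():
        absorbed = any(len(q) > len(p) and p in q and support(c) == support(b)
                       for q, c in frequent.items())
        if not absorbed:
            closed[p] = support(b)
    return closed


# Hypothetical toy database of three short amino-acid strings.
database = ["MKVL", "MKVA", "GKVL"]
bits = mark_features(database)
closed = closed_frequent(bits, min_support=2)
print(sorted(closed.items()))  # [('KV', 3), ('KVL', 2), ('MKV', 2)]
```

In this toy run, "KV" occurs in all three sequences while its extensions "MKV" and "KVL" each occur in two, so all three patterns are closed. Shorter patterns such as "MK" are absorbed by a longer pattern with the same support.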

Identifier: oai:union.ndltd.org:NSYSU/oai:NSYSU:etd-0708110-144724
Date: 08 July 2010
Creators: Chang, Hsi-mei
Contributors: Chien-I Lee, Gen-Huey Chen, Ye-In Chang, none
Publisher: NSYSU
Source Sets: NSYSU Electronic Thesis and Dissertation Archive
Language: English
Detected Language: English
Type: text
Format: application/pdf
Source: http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0708110-144724
Rights: withheld, Copyright information available at source archive
