Return to search

Efficient case-based reasoning through feature weighting, and its application in protein crystallography

Data preprocessing is critical for machine learning, data mining, and pattern
recognition. In particular, selecting relevant and non-redundant features in highdimensional
data is important to efficiently construct models that accurately describe the
data. In this work, I present SLIDER, an algorithm that weights features to reflect
relevance in determining similarity between instances. Accurate weighting of features
improves the similarity measure, which is useful in learning algorithms like nearest
neighbor and case-based reasoning. SLIDER performs a greedy search for optimum
weights in an exponentially large space of weight vectors. Exhaustive search being
intractable, the algorithm reduces the search space by focusing on pivotal weights at
which representative instances are equidistant to truly similar and different instances in
Euclidean space. SLIDER then evaluates those weights heuristically, based on
effectiveness in properly ranking pre-determined matches of a set of cases, relative to
mismatches.
I analytically show that by choosing feature weights that minimize the mean rank of
matches relative to mismatches, the separation between the distributions of Euclidean
distances for matches and mismatches is increased. This leads to a better distance metric,
and consequently increases the probability of retrieving true matches from a database. I
also discuss how SLIDER is used to improve the efficiency and effectiveness of case
retrieval in a case-based reasoning system that automatically interprets electron density
maps to determine the three-dimensional structures of proteins. Electron density patterns
for regions in a protein are represented by numerical features, which are used in a distance metric to efficiently retrieve matching patterns by searching a large database.
These pre-selected cases are then evaluated by more expensive methods to identify truly
good matches – this strategy speeds up the retrieval of matching density regions, thereby
enabling fast and accurate protein model-building. This two-phase case retrieval
approach is potentially useful in many case-based reasoning systems, especially those
with computationally expensive case matching and large case libraries.

Identiferoai:union.ndltd.org:tamu.edu/oai:repository.tamu.edu:1969.1/ETD-TAMU-1906
Date02 June 2009
CreatorsGopal, Kreshna
ContributorsIoerger, Thomas R.
Source SetsTexas A and M University
Languageen_US
Detected LanguageEnglish
TypeBook, Thesis, Electronic Dissertation, text
Formatelectronic, application/pdf, born digital

Page generated in 0.0021 seconds