With the rapid advance of information technologies, human beings increasingly rely on computers to accumulate, process, and make use of data. Knowledge discovery techniques have been proposed to automatically search large volumes of data for patterns. Knowledge discovery often requires a set of relevant features to represent the specific domain. My dissertation presents a framework of feature engineering for knowledge discovery, including feature construction, feature selection, and feature consolidation.Five essays in my dissertation present novel approaches to construct, select, or consolidate features in various applications. Feature construction is used to derive new features when relevant features are unknown. Chapter 2 focuses on constructing informative features from a relational database. I introduce a probabilistic relational model-based approach to construct personal and social features for identity matching. Experiments on a criminal dataset showed that social features can improve the matching performance. Chapter 3 focuses on identifying good features for knowledge discovery from text. Four types of writeprint features are constructed and shown effective for authorship analysis of online messages. Feature selection is aimed at identifying a subset of significant features from a high dimensional feature space. Chapter 4 presents a framework of feature selection techniques. This essay focuses on identifying marker genes for microarray-based cancer classification. Our experiments on gene array datasets showed excellent performance for optimal search-based gene subset selection. Feature consolidation is aimed at integrating features from diverse data sources or in heterogeneous representations. Chapter 5 presents a Bayesian framework to integrate gene functional relations extracted from heterogeneous data sources such as gene expression profiles, biological literature, and genome sequences. Chapter 6 focuses on kernel-based methods to capture and consolidate information in heterogeneous data representations. I design and compare different kernels for relation extraction from biomedical literature. Experiments show good performances of tree kernels and composite kernels for biomedical relation extraction.These five essays together compose a framework of feature engineering and present different techniques to construct, select, and consolidate relevant features. This feature engineering framework contributes to the domain of information systems by improving the effectiveness, efficiency, and interpretability of knowledge discovery.
Identifer | oai:union.ndltd.org:arizona.edu/oai:arizona.openrepository.com:10150/193819 |
Date | January 2007 |
Creators | Li, Jiexun |
Contributors | Chen, Hsinchun, Chen, Hsinchun, Nunamaker, Jr., Jay F., Zhang, Zhu |
Publisher | The University of Arizona. |
Source Sets | University of Arizona |
Language | English |
Detected Language | English |
Type | text, Electronic Dissertation |
Rights | Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author. |
Page generated in 0.0016 seconds