About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.

Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
41

Query Expansion For Handling Exploratory And Ambiguous Keyword Queries

January 2011 (has links)
abstract: Query expansion is a search engine functionality that suggests a set of related queries for a user-issued keyword query. For exploratory or ambiguous keyword queries, the user's main goal is to identify and select a specific category of query results among different categorical options, in order to narrow down the search and reach the desired result. Typical corpus-driven keyword query expansion approaches return popular words in the results as expanded queries. These empirical methods fail to cover all semantics of the categories present in the query results; more importantly, they do not consider the semantic relationships among the keywords featured in an expanded query. Contrary to a normal keyword search setting, these factors are non-trivial in an exploratory and ambiguous query setting, where the user's precise discernment of the different categories present in the query results is more important for making subsequent search decisions. In this thesis, I propose a new framework for keyword query expansion: generating a set of queries that correspond to a categorization of the original query results, referred to as categorizing query expansion. Two families of algorithms are proposed: one performs clustering as a pre-processing step and then generates categorizing expanded queries based on the clusters; the other handles the generation of quality expanded queries in the presence of imperfect clusters. / Dissertation/Thesis / M.S. Computer Science 2011
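As an illustration of the clustering-based family of algorithms, the following Python sketch clusters the result documents and labels each cluster with its top TF-IDF terms. It is a minimal stand-in under assumed components (TF-IDF features, k-means), not the thesis's actual algorithm, and all names are illustrative.

```python
# Minimal sketch of clustering-based categorizing query expansion.
# Assumptions: TF-IDF features and k-means stand in for the thesis's
# clustering step; function and parameter names are hypothetical.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def categorizing_expansions(results, original_query, n_categories=4, terms_per_query=3):
    """Cluster result documents, then label each cluster with its top terms."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(results)
    km = KMeans(n_clusters=n_categories, n_init=10, random_state=0).fit(X)
    vocab = np.array(vectorizer.get_feature_names_out())
    expansions = []
    for center in km.cluster_centers_:
        # Highest-weighted terms in the cluster centroid serve as the category label.
        top = vocab[np.argsort(center)[::-1][:terms_per_query]]
        expansions.append(f"{original_query} " + " ".join(top))
    return expansions
```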
42

Species Discrimination and Monitoring of Abiotic Stress Tolerance by Chlorophyll Fluorescence Transients

MISHRA, Anamika January 2012 (has links)
Chlorophyll fluorescence imaging has become a versatile and standard tool in fundamental and applied plant research. This method captures time-series images of the chlorophyll fluorescence emission of whole leaves or plants under various illuminations, typically a combination of actinic light and saturating flashes. Several conventional chlorophyll fluorescence parameters have been recognized that have a physiological interpretation and are useful for, e.g., assessing plant health status and detecting biotic and abiotic stresses early. Chlorophyll fluorescence imaging enables us to probe the performance of plants by visualizing physiologically relevant fluorescence parameters reporting on the physiology and biochemistry of the leaves. Sometimes one needs to find the most contrasting fluorescence features/parameters in order to quantify the stress response at a very early stage of the stress treatment. Conventional fluorescence analysis utilizes well-defined single images such as F0, Fp, Fm, and Fs, or arithmetic combinations of basic images such as Fv/Fm, ΦPSII, NPQ, and qP. Therefore, although conventional fluorescence parameters have a physiological interpretation, they may not represent highly contrasting image sets. To detect the effect of stress treatments at a very early stage, advanced statistical techniques based on classifiers and feature selection methods have been developed to select highly contrasting chlorophyll fluorescence images out of hundreds of captured images. We combined sets of highly performing images, resulting in images with very high contrast, the so-called combinatorial imaging. Applying advanced statistical methods to chlorophyll fluorescence imaging data allows us to succeed in tasks where conventional approaches do not work. This thesis explores the application of conventional chlorophyll fluorescence parameters, as well as advanced statistical techniques of classifiers and feature selection methods, for high-throughput screening. We demonstrate the applicability of the technique in discriminating three species of the family Lamiaceae at a very early stage of their growth. Further, we show that chlorophyll fluorescence imaging can be used to measure the cold and drought tolerance of Arabidopsis thaliana and tomato plants, respectively, in a simulated high-throughput screening.
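As a hedged illustration of selecting contrasting images, the sketch below scores each captured frame by a two-sample t-statistic between control and stressed plants and keeps the top-ranked frames; the thesis's classifier and feature selection pipeline is considerably more elaborate, and all names here are illustrative.

```python
# Hedged sketch: rank fluorescence frames by contrast between treatments.
# Assumption: a simple per-frame t-statistic stands in for the thesis's
# classifier/feature-selection scoring.
import numpy as np
from scipy import stats

def rank_frames_by_contrast(control, stressed, top_k=5):
    """control, stressed: arrays of shape (n_plants, n_frames) holding
    per-plant mean fluorescence for each captured frame."""
    t, _ = stats.ttest_ind(control, stressed, axis=0)
    order = np.argsort(-np.abs(t))          # most contrasting frames first
    return order[:top_k], t[order[:top_k]]
```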
43

A credit scoring model based on classifiers consensus system approach

Ala'raj, Maher A. January 2016 (has links)
Managing customer credit is an important issue for every commercial bank; therefore, banks take great care when dealing with customer loans to avoid improper decisions that can lead to loss of opportunity or financial losses. Manual estimation of customer creditworthiness has become both time- and resource-consuming. Moreover, a manual approach is subjective (dependent on the bank employee who makes the estimation), which is why devising and implementing programmatic models that provide loan estimations is the only way of eradicating the ‘human factor’ from this problem. Such a model should recommend to the bank whether or not a loan should be given, or give a probability that the loan will be repaid. A number of models have been designed, but there is no ideal classifier among them, since each produces some percentage of incorrect outputs; this is a critical consideration when each percent of incorrect answers can mean millions of dollars of losses for large banks. Nevertheless, logistic regression (LR) remains the industry-standard tool for developing credit-scoring models. For this reason, an investigation is carried out into combining the most efficient classifiers in the credit-scoring domain, in an attempt to produce a classifier that exceeds each of its components. In this work, a fusion model referred to as the ‘Classifiers Consensus Approach’ is developed, which performs substantially better than any of the single classifiers that constitute it. The difference between the consensus approach and the majority of other combiners lies in the fact that the consensus approach models the behaviour of a real expert group during the process of finding the consensus (aggregate) answer. The consensus model is compared not only with single classifiers, but also with traditional combiners and a rather complex combiner known as the ‘Dynamic Ensemble Selection’ approach. As pre-processing techniques, stepwise data filtering (selecting training entries that fit the input data well and removing outliers and noisy data) and feature selection (removing useless and statistically insignificant features whose values are weakly correlated with the real quality of the loan) are used; these techniques significantly improve the results of the consensus approach. Results clearly show that the consensus approach is statistically better (at the 95% confidence level, according to the Friedman test) than any other single classifier or combiner analysed. The consensus approach gives not only the best accuracy, but also better AUC values, Brier scores, and H-measures for almost all datasets investigated in this thesis; in particular, it outperformed logistic regression. Thus, the use of the consensus approach for credit scoring is justified and can be recommended to commercial banks. Alongside the consensus approach, the dynamic ensemble selection approach is analysed; the results show that, under some conditions, it can rival the consensus approach, its strengths being stability and high accuracy on various datasets.
The consensus approach, as improved in this work, may be considered by banks whose data share the characteristics of the datasets used here, where its use could decrease both the level of mistakenly rejected loans of solvent customers and the level of mistakenly accepted loans that will never be repaid. Furthermore, the consensus approach is a notable step toward building a universal classifier that can fit data of any structure. Another advantage of the consensus approach is its flexibility: even if the input data changes for various reasons, the consensus approach can easily be re-trained and used with the same performance.
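For orientation, a minimal Python stand-in for combining credit-scoring classifiers is shown below using soft voting. The consensus approach itself models iterative expert-group agreement, which plain voting does not capture, so this is only a setup sketch with illustrative base classifiers.

```python
# Minimal stand-in for classifier combination in credit scoring: a
# soft-voting ensemble over common base classifiers. NOTE: the thesis's
# consensus approach reconciles member opinions iteratively, like a real
# expert group; plain voting shown here is a simpler combiner.
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),   # industry-standard baseline
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("nn", MLPClassifier(max_iter=500, random_state=0)),
    ],
    voting="soft",   # average predicted default probabilities
)
# Usage: ensemble.fit(X_train, y_train); ensemble.predict_proba(X_test)
```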
44

Novel Methods of Biomarker Discovery and Predictive Modeling using Random Forest

January 2017 (has links)
abstract: Random forest (RF) is a popular and powerful technique that can be used for classification, regression, and unsupervised clustering. In its original form, introduced by Leo Breiman, RF is used as a predictive model to generate predictions for new observations. Recent research has proposed several RF-based methods for feature selection and for generating prediction intervals, but these are limited in their applicability and accuracy. In this dissertation, RF is applied to build a predictive model for a complex dataset and used as the basis for two novel methods, one for biomarker discovery and one for generating prediction intervals. First, a biodosimetry model is developed using RF to determine the absorbed radiation dose from gene expression measured in blood samples of potentially exposed individuals. To improve its prediction accuracy, day-specific models were built to deal with the day interaction effect, and a nested-modeling technique was proposed; the nested models can fit these complex data, with their large variability and non-linear relationships. Second, a panel of biomarkers was selected using a data-driven feature selection method as well as by hand, considering prior knowledge and other constraints. To incorporate domain knowledge, a method called Know-GRRF was developed based on guided regularized RF. This method incorporates domain knowledge as a penalty term that regulates the selection of candidate features in RF, adding flexibility to data-driven feature selection and improving the interpretability of models. Know-GRRF showed significant improvement in cross-species prediction when cross-species correlation was used to guide the selection of biomarkers; it can also compete with existing methods when intrinsic data characteristics are used as an alternative to domain knowledge in simulated datasets. Lastly, a novel non-parametric method, RFerr, was developed to generate prediction intervals using RF regression. This method is widely applicable to any predictive model and was shown to have better coverage and precision than existing methods on the real-world radiation dataset, as well as on benchmark and simulated datasets. / Dissertation/Thesis / Doctoral Dissertation Biomedical Informatics 2017
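A hedged sketch of the prediction interval idea follows: build the interval from out-of-bag residual quantiles of an RF regressor. The exact construction of RFerr in the dissertation may differ; this shows one standard non-parametric variant.

```python
# Hedged sketch of a non-parametric RF prediction interval built from
# out-of-bag residual quantiles; RFerr's exact construction may differ.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_prediction_interval(X_train, y_train, X_new, alpha=0.1):
    rf = RandomForestRegressor(n_estimators=500, oob_score=True,
                               random_state=0).fit(X_train, y_train)
    resid = y_train - rf.oob_prediction_           # out-of-bag errors
    lo, hi = np.quantile(resid, [alpha / 2, 1 - alpha / 2])
    pred = rf.predict(X_new)
    return pred + lo, pred + hi                    # (1 - alpha) interval
```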
45

On Feature Selection Stability: A Data Perspective

January 2013 (has links)
abstract: The rapid growth in high-throughput technologies over the last few decades has made manual processing of the generated data impracticable. Even worse, machine learning and data mining techniques seem paralyzed by these massive datasets. High dimensionality is one of the most common challenges for machine learning and data mining tasks. Feature selection aims to reduce dimensionality by selecting a small subset of the features that performs at least as well as the full feature set. Generally, learning performance (e.g., classification accuracy) and algorithm complexity are used to measure the quality of an algorithm. Recently, the stability of feature selection algorithms has gained increasing attention as a new indicator, owing to the need to select similar subsets of features each time the algorithm is run on the same dataset, even in the presence of a small amount of perturbation. In order to address the selection stability issue, we should first understand the causes of instability. In this dissertation, we investigate the causes of instability in high-dimensional datasets using well-known feature selection algorithms. We found that stability is mostly data-dependent. Based on these findings, we propose a framework to improve selection stability by addressing its main causes. In particular, we found that data noise greatly impacts both stability and learning performance, so we propose to reduce it in order to improve both. However, current noise reduction approaches cannot distinguish between data noise and variation among samples from different classes. We overcome this limitation with Supervised noise reduction via Low Rank Matrix Approximation (SLRMA for short). The proposed framework has proved successful on different types of high-dimensional datasets, such as microarray and image datasets. However, this framework cannot handle unlabeled data; hence, we propose Local SVD to overcome this limitation. / Dissertation/Thesis / Ph.D. Computer Science 2013
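A minimal sketch of quantifying selection stability follows, assuming bootstrap perturbations of the data and average pairwise Jaccard similarity of the selected subsets; the selector and subset size shown are illustrative, not the dissertation's specific choices.

```python
# Minimal sketch of measuring feature selection stability: run a selector
# on bootstrap perturbations and average pairwise Jaccard similarity of
# the selected subsets. Selector and sizes are illustrative assumptions.
import numpy as np
from itertools import combinations
from sklearn.feature_selection import SelectKBest, f_classif

def selection_stability(X, y, k=50, n_runs=10, seed=0):
    rng = np.random.default_rng(seed)
    subsets = []
    for _ in range(n_runs):
        idx = rng.choice(len(y), size=len(y), replace=True)   # perturbation
        sel = SelectKBest(f_classif, k=k).fit(X[idx], y[idx])
        subsets.append(set(np.flatnonzero(sel.get_support())))
    jaccards = [len(a & b) / len(a | b) for a, b in combinations(subsets, 2)]
    return float(np.mean(jaccards))    # 1.0 = perfectly stable selection
```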
46

Enhanced Contour Description for People Detection in Images

Du, Xiaoyun January 2014 (has links)
People detection has been an attractive technology in computer vision, with many useful applications in daily life, for instance intelligent surveillance and driver assistance systems. People detection is challenging because people adopt a wide range of poses, wear diverse clothes, and appear against different kinds of backgrounds with significant changes in illumination. In this thesis, advanced techniques and powerful tools are presented for designing a robust people detection system. First, a baseline model is implemented by combining the Histogram of Oriented Gradients (HOG) descriptor and a linear Support Vector Machine (SVM); this baseline performs well on the well-known INRIA dataset. Second, an advanced model is proposed: a two-layer cascade framework that achieves both accurate detection and lower computational complexity. In the first layer, the baseline model is used as a filter to generate candidates; most positive samples survive this procedure, while the majority of negative samples are rejected according to a preset threshold. The second layer uses a more discriminative model: we combine the Variational Local Binary Patterns descriptor and the HOG descriptor into a new discriminative feature, and use multi-scale feature descriptors to improve the discriminative power of the Variational Local Binary Patterns feature. We then perform feature selection with the Feature Generating Machine to derive a concise descriptor from this concatenated feature. A Histogram Intersection Kernel SVM is employed as an efficient classifier, and the bootstrapping algorithm is used during training to exploit the information in the dataset. Our approach performs well on the INRIA dataset, with results superior to the baseline model.
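A minimal sketch of the baseline layer follows, assuming skimage's HOG implementation and the common 64x128 pedestrian window rather than the thesis's exact configuration.

```python
# Minimal sketch of the HOG + linear SVM baseline detector. Assumptions:
# skimage provides the descriptor; 128x64 grayscale detection windows and
# the HOG cell/block sizes follow the common pedestrian setup.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def extract_hog(windows):
    """windows: iterable of 128x64 grayscale arrays (one detection window each)."""
    return np.array([hog(w, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for w in windows])

# Usage sketch: train, then threshold decision scores as cascade layer 1.
# clf = LinearSVC(C=0.01).fit(extract_hog(train_windows), labels)
# scores = clf.decision_function(extract_hog(test_windows))
```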
47

Embedded Feature Selection for Model-based Clustering

January 2020 (has links)
abstract: Model-based clustering is a sub-field of statistical modeling and machine learning. Mixture models use probabilities to describe the degree to which a data point belongs to a cluster, and these probabilities are updated iteratively during clustering. While mixture models have demonstrated superior performance in handling noisy data in many fields, challenges remain for high-dimensional datasets. Among a large number of features, some may not contribute to delineating the cluster profiles; including these “noisy” features confuses the model's identification of the real cluster structure and increases computational cost. Recognizing this issue, in this dissertation I first propose a new feature selection algorithm for continuous data, then extend it to mixed data types, and finally conduct uncertainty quantification for the feature selection results as the third topic. The first topic is an embedded feature selection algorithm termed Expectation-Selection-Maximization (ESM), which automatically selects features while optimizing the parameters of a Gaussian Mixture Model. I introduce a relevancy index (RI) revealing each feature's contribution to the clustering process to assist feature selection, and demonstrate the efficacy of ESM on two synthetic datasets, four benchmark datasets, and an Alzheimer's Disease dataset. The second topic extends the ESM algorithm to handle mixed data types: the Gaussian mixture model is generalized to the Generalized Model of Mixture (GMoM), which can handle not only continuous features but also binary and nominal features. The last topic concerns Uncertainty Quantification (UQ) of the feature selection. A new algorithm termed ESOM is proposed, which takes variance information into consideration while conducting feature selection; a set of outliers is also generated during the feature selection process to infer the uncertainty in the input data. Finally, the selected features and detected outlier instances are evaluated through visual comparison. / Dissertation/Thesis / Doctoral Dissertation Industrial Engineering 2020
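As a rough illustration of feature relevance in model-based clustering, the sketch below rates each feature by how far the fitted component means spread relative to the feature's overall variance; the dissertation's relevancy index (RI) is computed differently, inside the ESM iterations, so this is only an assumed proxy.

```python
# Hedged sketch: score feature relevance for model-based clustering by
# between-component spread of the fitted means relative to total variance.
# This proxy is an assumption; it is not the dissertation's RI definition.
import numpy as np
from sklearn.mixture import GaussianMixture

def feature_relevance(X, n_components=3):
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(X)
    spread = gmm.means_.var(axis=0)      # between-component spread per feature
    total = X.var(axis=0) + 1e-12        # guard against zero-variance features
    return spread / total                # near 0 => likely a "noisy" feature
```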
48

A Comparative Study of Feature Selection and Classification Methods for Gene Expression Data

Abusamra, Heba 05 1900 (has links)
Microarray technology has enriched the study of gene expression so that scientists can now measure the expression levels of thousands of genes in a single experiment. Microarray gene expression data have gained great importance in recent years due to their role in disease diagnosis and prognosis, which helps in choosing the appropriate treatment plan for patients. Although this technology has ushered in a new era of molecular classification, interpreting gene expression data remains a difficult problem and an active research area due to its inherent “high dimensional, low sample size” nature. Such problems pose great challenges to existing classification methods, so effective feature selection techniques are often needed to correctly classify different tumor types and, consequently, to better understand genetic signatures and improve treatment strategies. This thesis presents a comparative study of state-of-the-art feature selection methods, classification methods, and combinations of the two, based on gene expression data. We compared the efficiency of three classification methods (support vector machines, k-nearest neighbor, and random forest) and eight feature selection methods (information gain, twoing rule, sum minority, max minority, Gini index, sum of variances, t-statistics, and one-dimensional support vector machine). Five-fold cross-validation was used to evaluate classification performance. Two publicly available gene expression datasets of glioma were used for this study. Different experiments compared the performance of the classification methods with and without feature selection. Results revealed the important role of feature selection in classifying gene expression data: by performing feature selection, classification accuracy can be significantly boosted using a small number of genes. The relationship between features selected by different methods is investigated, and the features most frequently selected in each fold across all methods are evaluated for both datasets.
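A minimal sketch of the evaluation protocol follows: pairing a feature selector with a classifier in one pipeline so that five-fold cross-validation scores them together. The selector/classifier pair shown approximates one of the studied combinations (information gain via mutual information, k-NN) with scikit-learn components; the gene count is illustrative.

```python
# Minimal sketch of the comparison protocol: selector + classifier in one
# pipeline, evaluated with 5-fold CV so selection happens inside each fold.
# Assumptions: mutual information approximates information gain; k=50 genes
# is illustrative.
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ("select", SelectKBest(mutual_info_classif, k=50)),  # keep 50 genes
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
# Usage: scores = cross_val_score(pipe, X, y, cv=5)  # accuracy per fold
```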
49

Importance-Aware Information Networking toward Smart Cities

Inagaki, Yuichi 24 September 2021 (has links)
Kyoto University / New system, course-based doctorate / Doctor of Informatics / Kō No. 23547 / Jōhaku No. 777 / 新制||情||132 (University Library) / Kyoto University Graduate School of Informatics, Department of Communications and Computer Engineering / (Chief examiner) Professor Eiji Oki, Professor Hiroshi Harada, Professor Sadao Kurohashi / Qualified under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Informatics / Kyoto University / DGAM
50

Self-Learning Prediction System for Optimisation of Workload Management in a Mainframe Operating System

Bensch, Michael, Brugger, Dominik, Rosenstiel, Wolfgang, Bogdan, Martin, Spruth, Wilhelm 06 November 2018 (has links)
We present a framework for extraction and prediction of online workload data from the workload manager of a mainframe operating system. To boost overall system performance, the prediction will be incorporated into the workload manager so it can take preventive action before a bottleneck develops. Model and feature selection automatically create a prediction model based on the given training data, thereby keeping the system flexible. We tailor data extraction, preprocessing, and training to this specific task, keeping in mind the nonstationarity of business processes. Using error measures suited to our task, we show that our approach is promising. To conclude, we discuss our first results and give an outlook on future work.
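A hedged sketch of the prediction step follows, assuming sliding-window samples over the extracted workload series and a ridge regressor; the actual framework selects the model and features automatically, so both the window length and the model here are illustrative.

```python
# Hedged sketch: turn the extracted workload series into sliding-window
# samples and fit a regressor to forecast the next value, so the workload
# manager can act before a bottleneck forms. Window length and model are
# illustrative assumptions; the framework chooses both automatically.
import numpy as np
from sklearn.linear_model import Ridge

def windowed(series, width=12):
    X = np.array([series[i:i + width] for i in range(len(series) - width)])
    y = np.array(series[width:])
    return X, y

# Usage sketch (workload_measurements is a hypothetical series, e.g. CPU
# demand per interval):
# X, y = windowed(workload_measurements)
# model = Ridge(alpha=1.0).fit(X[:-100], y[:-100])
# next_load = model.predict(X[-1:])   # trigger preventive action if too high
```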
