Global ETD Search

11	Coevolution Based Prediction Of Protein-protein Interactions With Reduced Training Data Pamuk, Bahar 01 February 2009 (has links) (PDF) Protein-protein interactions are important for the prediction of protein functions since two interacting proteins usually have similar functions in a cell. Available protein interaction networks are incomplete, but they can be used to predict new interactions in a supervised learning framework. However, when the known protein network includes a large number of protein pairs, the training time of the machine learning algorithm becomes quite long. In this thesis work, our aim is to predict protein-protein interactions with a known portion of the interaction network. We used Support Vector Machines (SVM) as the machine learning algorithm and used the already known protein pairs in the network. We chose to use phylogenetic profiles of proteins to form the feature vectors required for the learner since the similarity of two proteins in evolution gives a reasonable indication of whether the two proteins interact. For large data sets, the training time of SVM becomes quite long; therefore, we reduced the data size in a sensible way while keeping approximately the same prediction accuracy. We applied a number of clustering techniques to extract the most representative data and features in a two categorical framework. Knowing that the training data set is a two-dimensional matrix, we applied data reduction methods in both dimensions, i.e., both in data size and in feature vector size. We observed that the data clustered by the k-means clustering technique gave superior prediction accuracy compared to another data clustering algorithm that was also developed for reducing data size for SVM training. Still, the true positive and false positive rates (TPR-FPR) of the training data sets constructed by the two clustering methods did not conclusively show which method outperforms the other.
On the other hand, we applied feature selection methods on the feature vectors of training data by selecting the most representative features in biological and in statistical meaning. We used the phylogenetic tree of organisms to identify the organisms which are evolutionarily significant. Additionally, we applied Fisher's test method to select the features which are most representative statistically. The accuracy and TPR-FPR values obtained by the feature selection methods did not allow a definitive performance comparison. However, the phylogenetic tree method resulted in acceptable prediction values when compared to Fisher's test. QA Computer Software 76.75-76.765
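The data-reduction idea in entry 11 (shrinking an SVM training set by clustering while keeping class labels) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `kmeans` and `reduce_training_set`, the per-class clustering strategy, and the toy two-feature vectors are all assumptions made for illustration. Each class is clustered separately and replaced by its centroids, so an SVM would train on far fewer rows.

```python
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on a list of equal-length float tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members; keep empty clusters in place.
        new_centroids = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


def reduce_training_set(features, labels, k_per_class):
    """Replace each class by at most k_per_class centroids, keeping the class label."""
    reduced = []
    for label in sorted(set(labels)):
        members = [f for f, y in zip(features, labels) if y == label]
        k = min(k_per_class, len(members))
        reduced.extend((c, label) for c in kmeans(members, k))
    return reduced


# Hypothetical toy data: six 2-feature vectors, two classes.
features = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05), (1.0, 0.9), (0.9, 1.0), (0.95, 0.95)]
labels = [0, 0, 0, 1, 1, 1]
reduced = reduce_training_set(features, labels, k_per_class=1)
```

Here `reduced` holds one labeled centroid per class, which could be passed to any SVM trainer in place of the full data. Clustering per class is a design choice of this sketch; it prevents a centroid from averaging interacting and non-interacting pairs together.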
12	Optimal Location for a Mobile Base Station in a Complex Network Moazzami, Farzad, Dean, Richard, Astatke, Yacob 10 1900 (has links) ITC/USA 2013 Conference Proceedings / The Forty-Ninth Annual International Telemetering Conference and Technical Exhibition / October 21-24, 2013 / Bally's Hotel & Convention Center, Las Vegas, NV / The focus of this work is the development of a complete network architecture to enhance telemetry performance using a mobile base station (MBS). The present study proposes a means of enabling both the mobile ad-hoc network (MANET) and a cellular network to operate simultaneously within the same spectrum. In this paper, the application of a modified k-means clustering to organize several hundred TAs in a complex network environment is presented. A mobile base station is added to the network to locate the congested area and support the network by positioning itself in the mixed network environment. A scenario with two base stations (one mobile and one stationary) is simulated and results are presented. It is observed that the use of an additional mobile base station could greatly increase the quality of communication by providing uniform distribution of node traffic and interference across the clusters in a complex telemetry environment with several hundred TAs. Ad-hoc networks K-means clustering Mixed Networks Spectrum Efficiency QoS
13	Telemetry Network Intrusion Detection System Maharjan, Nadim, Moazzemi, Paria 10 1900 (has links) ITC/USA 2012 Conference Proceedings / The Forty-Eighth Annual International Telemetering Conference and Technical Exhibition / October 22-25, 2012 / Town and Country Resort & Convention Center, San Diego, California / Telemetry systems are migrating from links to networks. Security solutions that simply encrypt radio links no longer protect the network of Test Articles or the networks that support them. The use of network telemetry is dramatically expanding and new risks and vulnerabilities are challenging issues for telemetry networks. Most of these vulnerabilities are silent in nature and cannot be detected with simple tools such as traffic monitoring. The Intrusion Detection System (IDS) is a security mechanism suited to telemetry networks that can help detect abnormal behavior in the network. Our previous research in Network Intrusion Detection Systems focused on "Password" attacks and "Syn" attacks. This paper presents a generalized method that can detect both "Password" and "Syn" attacks. In this paper, a K-means Clustering algorithm is used for vector quantization of network traffic. This reduces the scope of the problem by reducing the entropy of the network data. A Hidden Markov Model (HMM) is then employed to further characterize and analyze the behavior of the network into states that can be labeled as normal, attack, or anomaly. Our experiments show that the IDS can discover and expose telemetry network vulnerabilities using Vector Quantization and the Hidden Markov Model, providing a more secure telemetry environment. Our paper shows how these can be generalized into a Network Intrusion system that can be deployed on telemetry networks. Intrusion Detection System Vector Quantization K-means Clustering Hidden Markov Model Security iNET
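The vector quantization step in entry 13 can be sketched as follows, under stated assumptions. The paper does not list its traffic features or codebook, so the two features used here (packets per second, fraction of SYN packets), the codebook values, and the function name `quantize` are hypothetical. The sketch assumes a codebook has already been learned, for example by k-means, and maps each traffic vector to the index of its nearest codeword. The resulting symbol sequence is the kind of discrete observation stream an HMM could consume.

```python
def quantize(vectors, codebook):
    """Map each feature vector to the index of its nearest codeword."""

    def sqdist(a, b):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda j: sqdist(v, codebook[j]))
        for v in vectors
    ]


# Hypothetical codebook: codeword 0 ~ quiet traffic, codeword 1 ~ SYN-heavy burst.
codebook = [(10.0, 0.05), (200.0, 0.9)]

# Hypothetical traffic windows: (packets per second, fraction of SYN packets).
traffic = [(12.0, 0.04), (9.0, 0.06), (180.0, 0.85), (210.0, 0.95), (11.0, 0.05)]

symbols = quantize(traffic, codebook)
print(symbols)  # [0, 0, 1, 1, 0]
```

In practice the features would likely need scaling before distances are computed, since the packet rate dominates the raw distance here; that preprocessing is omitted from this sketch.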
14	Stability Selection of the Number of Clusters Reizer, Gabriella v 18 April 2011 (has links) Selecting the number of clusters is one of the greatest challenges in clustering analysis. In this thesis, we propose a variety of stability selection criteria based on cross validation for determining the number of clusters. Clustering stability measures the agreement of clusterings obtained by applying the same clustering algorithm on multiple independent and identically distributed samples. We propose to measure the clustering stability by the correlation between two clustering functions. These criteria are motivated by the concept of clustering instability proposed by Wang (2010), which is based on a form of clustering distance. In addition, the effectiveness and robustness of the proposed methods are numerically demonstrated on a variety of simulated and real world samples. Consistency Cross validation Hierarchical clustering Instability k-means clustering Spectral clustering Stability Mathematics
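One way to read "correlation between two clustering functions" in entry 14 is sketched below. This is an illustrative interpretation, not the thesis's definition: each clustering is turned into pairwise co-membership indicators (1 if two points share a cluster, 0 otherwise), and the Pearson correlation of the two indicator vectors is taken as the agreement score. The function names are assumptions.

```python
from itertools import combinations
from math import sqrt


def comembership(labels):
    """Indicator per unordered pair of points: 1.0 if both share a cluster."""
    return [
        1.0 if labels[i] == labels[j] else 0.0
        for i, j in combinations(range(len(labels)), 2)
    ]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        # Correlation is undefined when a clustering is trivial; report 0 by convention.
        return 0.0
    return cov / sqrt(vx * vy)


def clustering_agreement(labels_a, labels_b):
    """Agreement between two labelings of the same points, in [-1, 1]."""
    return pearson(comembership(labels_a), comembership(labels_b))


# The same partition under swapped label names agrees perfectly.
same = clustering_agreement([0, 0, 1, 1], [1, 1, 0, 0])
# A partition that splits every original pair agrees poorly.
different = clustering_agreement([0, 0, 1, 1], [0, 1, 0, 1])
```

Working with co-membership rather than raw labels makes the score invariant to label permutation. To select the number of clusters, one could average this score over repeated sample splits for each candidate k and prefer the k with the highest average; that selection loop is an assumption of this sketch and is omitted.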
15	Alternativní způsob měření rozvoje zemí. / Alternative approach to measuring development progress of countries. Efimenko, Valeria January 2018 (has links) This thesis studies the relationship between GDP and the Social Progress Index (SPI), the components of the social progress model, and their dimensions. Using a dataset of 49 countries, Bayesian Model Averaging (BMA), and clustering analysis, we found that there is no straightforward relationship between GDP and SPI. By testing 15 different models for each of the 3 dimensions of SPI (Basic Human Needs, Foundations of Wellbeing and Opportunity), we found that the best combination of components is to include all of them for each dimension. Using the BMA approach, we found that the best model of SPI out of 12 components includes only the intercept, tolerance and inclusion variables. The rest of the components show quite a low probability of inclusion; however, none of them showed 0 posterior probability. JEL Classification A13, C11, E01, I30, Keywords Kuznets, progress, SPI, GDP, BMA Author's e-mail valeria.e.efimenko@gmail.com Supervisor's e-mail daniel.vach@gmail.com
16	Contributions to Optimal Experimental Design and Strategic Subdata Selection for Big Data January 2020 (has links) abstract: In this dissertation two research questions in the field of applied experimental design were explored. First, methods for augmenting the three-level screening designs called Definitive Screening Designs (DSDs) were investigated. Second, schemes for strategic subdata selection for nonparametric predictive modeling with big data were developed. Under sparsity, the structure of DSDs can allow for the screening and optimization of a system in one step, but in non-sparse situations estimation of second-order models requires augmentation of the DSD. In this work, augmentation strategies for DSDs were considered, given the assumption that the correct form of the model for the response of interest is quadratic. Series of augmented designs were constructed and explored, and power calculations, model-robustness criteria, model-discrimination criteria, and simulation study results were used to identify the number of augmented runs necessary for (1) effectively identifying active model effects, and (2) precisely predicting a response of interest. When the goal is identification of active effects, it is shown that supersaturated designs are sufficient; when the goal is prediction, it is shown that little is gained by augmenting beyond the design that is saturated for the full quadratic model. Surprisingly, augmentation strategies based on the I-optimality criterion do not lead to better predictions than strategies based on the D-optimality criterion. Computational limitations can render standard statistical methods infeasible in the face of massive datasets, necessitating subsampling strategies. In the big data context, the primary objective is often prediction but the correct form of the model for the response of interest is likely unknown. Here, two new methods of subdata selection were proposed. 
The first is based on clustering, the second is based on space-filling designs, and both are free from model assumptions. The performance of the proposed methods was explored visually via low-dimensional simulated examples; via real data applications; and via large simulation studies. In all cases the proposed methods were compared to existing, widely used subdata selection methods. The conditions under which the proposed methods provide advantages over standard subdata selection strategies were identified. / Dissertation/Thesis / Doctoral Dissertation Statistics 2020 Statistics Design augmentation k-means clustering Latin hypercube designs Model discrimination Model robustness Supersaturated designs
17	Abstractive Representation Modeling for Image Classification Li, Xin 05 October 2021 (has links) No description available. Artificial Intelligence Image Classification Explainability Convolutional Neural Network Abstraction K-Means Clustering
18	Data-driven persona development for a knowledge management system Baldi, Annika January 2021 (has links) Generating personas based entirely on data has gained popularity. Personas describe characteristics of a user group in a human-like format. This project presents the persona creation process from raw data to evaluated personas for Zapiens' knowledge management system. The objective of the personas is to learn about customer behavior and aid in customer communication. For the described methodology, platform log data was clustered to group the users. The quantitative approach is, thereby, fast, updatable, and scalable. The analysis was split into two different features of the Zapiens platform. An attempt was made to create persona sets for both the training component and the chatbot component of Zapiens. The group characteristics were then enhanced with data from user surveys. This approach proved successful only for the training analysis. The collected data is presented in a web-based persona template to make the personas easily accessible and sharable. The finished training persona set was evaluated using the Persona Perception Scale. The results showed three personas of satisfying quality. The project aims to provide a complete overview of the data-driven persona development process. Personas data-driven persona development k-means clustering Persona Perception Scale Media and Communication Technology Medieteknik
19	A Concave Pairwise Fusion Approach to Clustering of Multi-Response Regression and Its Robust Extensions Chen, Chen, 0000-0003-1175-3027 January 2022 (has links) Solution-path convex clustering is combined with concave penalties by Ma and Huang (2017) to reduce clustering bias. Their method was introduced in the setting of single-response regression to handle heterogeneity. Such heterogeneity may come from either the regression intercepts or the regression slopes. The procedure, realized by the alternating direction method of multipliers (ADMM) algorithm, can simultaneously identify the grouping structure of observations and estimate regression coefficients. In the first part of our work, we extend this procedure to multi-response regression. We propose models to solve cases with heterogeneity in either the regression intercepts or the regression slopes. We combine the existing gadgets of the ADMM algorithm and group-wise concave penalties to find solutions for the model. Our work improves model performance in both clustering accuracy and estimation accuracy. We also demonstrate the necessity of such extension through the fact that by utilizing information in multi-dimensional space, the performance can be greatly improved. In the second part, we introduce robust solutions to our proposed work. We introduce two approaches to handle outliers or long-tail distributions. The first is to replace the squared loss with robust loss, among which are absolute loss and Huber loss. The second is to characterize and remove outliers' effects by a mean-shift vector. We demonstrate that these robust solutions outperform the squared loss based method when outliers are present, or the underlying distribution is long-tailed. / Statistics Statistics K-means clustering Optimization Penalized estimation Robust solution Subgroup detection
20	Quantifying Trust in Deep Learning Ultrasound Models by Investigating Hardware and Operator Variance Zhu, Calvin January 2021 (has links) Ultrasound (US) is the most widely used medical imaging modality due to its low cost, portability, real time imaging ability and use of non-ionizing radiation. However, unlike other imaging modalities such as CT or MRI, it is heavily operator dependent, requiring trained expertise to leverage these benefits. Recently there has been an explosion of interest in artificial intelligence (AI) across the medical community and many are turning to the growing trend of deep learning (DL) models to assist in diagnosis. However, deep learning models do not perform as well when training data is not fully representative of the problem. Due to this difference in training and deployment, model performance suffers, which can lead to misdiagnosis. This issue is known as dataset shift. Two aims to address dataset shift were proposed. The first was to quantify how US operator skill and hardware affect acquired images. The second was to use this skill quantification method to screen and match data to deep learning models to improve performance. A BLUE phantom from CAE Healthcare (Sarasota, FL) with various mock lesions was scanned by three operators using three different US systems (Siemens S3000, Clarius L15, and Ultrasonix SonixTouch) producing 39013 images. DL models were trained on a specific set to classify the presence of a simulated tumour and tested with data from differing sets. The Xception, VGG19, and ResNet50 architectures were used to test the effects with varying frameworks. K-Means clustering was used to separate images generated by operator and hardware into clusters. This clustering algorithm was then used to screen incoming images during deployment to best match input to an appropriate DL model which is trained specifically to classify that type of operator or hardware.
Results showed a noticeable difference when models were given data from differing datasets with the largest accuracy drop being 81.26% to 31.26%. Overall, operator differences more significantly affected DL model performance. Clustering models had much higher success separating hardware data compared to operator data. The proposed method reflects this result with a much higher accuracy across the hardware test set compared to the operator data. / Thesis / Master of Applied Science (MASc)
