  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Simultaneous prediction of symptom severity and cause in data from a test battery for Parkinson patients, using machine learning methods

Khan, Imran Qayyum January 2009 (has links)
The main purpose of this thesis project is to predict symptom severity and cause from test-battery data for Parkinson's disease patients, based on data mining. The data were collected from a test battery running on a hand-held computer. We use the Chi-Square method to check which variables are important and which are not. We then normalize the data, apply different data mining techniques to it, and check which technique or method gives good results. The implementation of this thesis is in WEKA. The methods we used are Naïve Bayes, CART and KNN. We draw Bland-Altman plots and compute Spearman's correlation to check the final results and predictions: the Bland-Altman plot tells us at what confidence level the predictions are correct, and Spearman's correlation tells us how strong the relationship is. On the basis of the results and analysis we see that all three methods give nearly the same results, but CART (J48 decision tree) gives good results, with under- and over-predicted values lying between -2 and +2. The correlation between the actual and predicted values is 0.794 for CART. Cause gives a better percentage classification result than disability because it uses only two classes.
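As a rough illustration of the workflow this abstract describes (chi-square variable screening followed by a comparison of Naïve Bayes, CART and KNN, checked with Spearman's correlation), here is a minimal Python sketch using scikit-learn. The thesis itself works in WEKA; the data, feature counts and parameters below are placeholders, not the actual test-battery data.

```python
# Hypothetical sketch: chi-square feature selection, then a comparison of
# Naive Bayes, a CART-style decision tree, and KNN. All data are synthetic
# placeholders standing in for the thesis's test-battery measurements.
import numpy as np
from scipy.stats import spearmanr
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 12))            # placeholder test-battery features
y = rng.integers(0, 2, 200)          # placeholder severity/cause labels

X_norm = MinMaxScaler().fit_transform(X)                  # normalize to [0, 1]
X_sel = SelectKBest(chi2, k=6).fit_transform(X_norm, y)   # keep informative variables

for name, clf in [("Naive Bayes", GaussianNB()),
                  ("CART", DecisionTreeClassifier()),
                  ("KNN", KNeighborsClassifier(n_neighbors=5))]:
    pred = cross_val_predict(clf, X_sel, y, cv=5)
    rho, _ = spearmanr(y, pred)      # Spearman correlation, actual vs. predicted
    print(f"{name}: accuracy={np.mean(pred == y):.3f}, spearman={rho:.3f}")
```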
2

Protein Contact Prediction Based on Protein Sequences

Lin, Dong-Jian 06 September 2011 (has links)
The biological function of a protein is mainly maintained by its three-dimensional structure. Protein folds support the three-dimensional structure of a protein, and the inter-residue contacts in the protein affect the formation of protein folds and the stability of the protein structure. Therefore, protein contacts play a critical role in building protein structures and analyzing biological functions. In this thesis, we propose a methodology to predict the residue-residue contacts of a target protein and develop a new measurement to evaluate the accuracy of prediction. With three prediction tools, the support vector machine (SVM), the k-nearest neighbor algorithm (KNN), and penalized discriminant analysis (PDA), we compare these classifiers based on self-testing of the training set, which is derived from representative protein chains from PDB (PDB-REPRDB), and apply the best (SVM) to predict a testing set of 173 protein chains derived from a previous study. The experimental results show that the accuracy of our prediction reaches 24.84%, 15.68%, and 8.23% for three categories of different contacts, which greatly improves on the result of random exploration (5.31%, 3.33%, and 1.12%, respectively).
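A minimal sketch of the kind of contact classifier described above: an SVM labelling residue pairs as contact or non-contact from sequence-derived features. The feature construction below is a stand-in; the thesis derives its features from PDB-REPRDB chains, and its exact descriptors are not reproduced here.

```python
# Hedged sketch: SVM classification of residue pairs. Features and labels
# are random placeholders for sequence-derived descriptors per residue pair.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
pair_features = rng.random((500, 20))   # e.g. windowed sequence profiles per pair
contacts = rng.integers(0, 2, 500)      # 1 = residues in contact, 0 = not

svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(pair_features, contacts)

accuracy = svm.score(pair_features, contacts)   # self-testing, as in the thesis
print(f"self-test accuracy: {accuracy:.3f}")
```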
3

High Performance Lead-free Piezoelectric Materials

Gupta, Shashaank 10 June 2013 (has links)
Piezoelectric materials find applications in a number of devices requiring inter-conversion of mechanical and electrical energy. These devices include different types of sensors, actuators and energy harvesting devices. A number of lead-based perovskite compositions (PZT, PMN-PT, PZN-PT, etc.) have dominated the field in the last few decades owing to their giant piezoresponse and convenient application-relevant tunability. With increasing environmental concerns, the focus over the last decade has shifted towards developing a better understanding of lead-free piezoelectric compositions in order to achieve improved application-relevant performance. Sodium potassium niobate (KxNa1-xNbO3, abbreviated as KNN) is one of the most interesting candidates in the class of lead-free piezoelectrics. The absence of any poisonous element makes it unique among the other lead-free candidates, which contain bismuth. A Curie temperature of 400 °C, even higher than that of PZT, is another advantage from the point of view of device applications. The present work focuses on developing a fundamental understanding of the crystallographic nature, domain structure and domain dynamics of KNN. Since compositions close to x = 0.5 are of primary interest because of their superior piezoelectric activity among the compositions (0 < x < 1), the crystallographic and domain structure studies are focused on this region of the phase diagram. KNN random ceramics, textured ceramics and single crystals were synthesized, which complement each other in building an understanding of the behavior of KNN. K0.5Na0.5NbO3 single crystals grown by the flux method were characterized for their ferroelectric and piezoelectric behavior, and dynamical scaling analysis was performed to reveal the origin of their moderate piezoelectric performance. The optical birefringence technique, used to reveal the macro-level crystallographic nature of the x = 0.4, 0.5 and 0.6 crystals, suggested that they have monoclinic Mc, monoclinic MA/B and orthorhombic structures, respectively. Contrary to that, pair distribution function analysis performed on crystals of the same compositions implies that they all belong to the monoclinic Mc structure at the local scale. Linear birefringence and piezoresponse force microscopy (PFM) were used to reveal the domain structure at the macro and micro scales, respectively. A novel sintering technique was developed to achieve > 99% density for KNN ceramics. These high-density ceramics were characterized for their dielectric, ferroelectric and piezoelectric properties. A significant improvement in the different piezoelectric coefficients of these ceramics validates the advantages of this sintering technique. The lower defect levels in these high-density ceramics also lead to superior ferroelectric fatigue behavior. To understand the role of seed crystals in the switching behavior of textured ceramics, highly textured KNN ceramics (Lotgering factor ~ 88%) were synthesized using the TGG method. A sintering technique similar to the one employed for the random ceramics was used to sinter the textured KNN ceramics as well. A piezoresponse force microscopy (PFM) study suggested that these textured ceramics have domains of about 6 µm, compared to a 2 µm domain size for the random ceramics. Local switching behavior studied using switching spectroscopy PFM (SS-PFM) revealed an approximately two-and-a-half-fold improvement in local piezoresponse compared to the random counterpart. / Ph. D.
4

Statistical Methods for High Throughput Screening Drug Discovery Data

Wang, Yuanyuan (Marcia) January 2005 (has links)
High Throughput Screening (HTS) is used in drug discovery to screen large numbers of compounds against a biological target. Data on activity against the target are collected for a representative sample of compounds selected from a large library. The goal of drug discovery is to relate the activity of a compound to its chemical structure, which is quantified by various explanatory variables, and hence to identify further active compounds. Often, this application has a very unbalanced class distribution, with a rare active class.

Classification methods are commonly proposed as solutions to this problem. However, regarding drug discovery, researchers are more interested in ranking compounds by predicted activity than in the classification itself. This feature makes my approach distinct from common classification techniques.

In this thesis, two AIDS data sets from the National Cancer Institute (NCI) are mainly used. Local methods, namely K-nearest neighbours (KNN) and classification and regression trees (CART), perform very well on these data in comparison with linear/logistic regression, neural networks, and Multivariate Adaptive Regression Splines (MARS) models, which assume more smoothness. One reason for the superiority of local methods is the local behaviour of the data. Indeed, I argue that conventional classification criteria such as misclassification rate or deviance tend to select too small a tree or too large a value of k (the number of nearest neighbours). A more local model (bigger tree or smaller k) gives a better performance in terms of drug discovery.

Because off-the-shelf KNN works relatively well, this thesis takes this promising method and makes several novel modifications, which further improve its performance. The choice of k is optimized for each test point to be predicted. The empirically observed superiority of allowing k to vary is investigated. The nature of the problem, ranking of objects rather than estimating the probability of activity, enables the k-varying algorithm to stand out. Similarly, KNN combined with a kernel weight function (weighted KNN) is proposed and demonstrated to be superior to the regular KNN method.

High dimensionality of the explanatory variables is known to cause problems for KNN and many other classifiers. I propose a novel method (subset KNN) of averaging across multiple classifiers based on building classifiers on subspaces (subsets of variables). It improves the performance of KNN for HTS data. When applied to CART, it also performs as well as or even better than the popular methods of bagging and boosting. Part of this improvement is due to the discovery that classifiers based on irrelevant subspaces (unimportant explanatory variables) do little damage when averaged with good classifiers based on relevant subspaces (important variables). This result is particular to the ranking of objects rather than estimating the probability of activity. A theoretical justification is proposed. The thesis also suggests diagnostics for identifying important subsets of variables and hence further reducing the impact of the curse of dimensionality.

In order to have a broader evaluation of these methods, subset KNN and weighted KNN are applied to three other data sets: the NCI AIDS data with Constitutional descriptors, Mutagenicity data with BCUT descriptors, and Mutagenicity data with Constitutional descriptors. The k-varying algorithm as a method for unbalanced data is also applied to the NCI AIDS data with Constitutional descriptors. As a baseline, the performance of KNN on such data sets is reported. Although different methods are best for the different data sets, some of the proposed methods are always amongst the best.

Finally, methods are described for estimating activity rates and error rates in HTS data. By combining auxiliary information about repeat tests of the same compound, likelihood methods can extract interesting information about the magnitudes of the measurement errors made in the assay process. These estimates can be used to assess model performance, which sheds new light on how various models handle the large random or systematic assay errors often present in HTS data.
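Of the modifications proposed here, weighted KNN lends itself to a compact sketch: compounds are ranked by a kernel-weighted vote of their nearest neighbours. The Gaussian kernel, bandwidth and data below are illustrative assumptions, not the thesis's actual choices.

```python
# Hedged sketch of kernel-weighted KNN for ranking compounds by predicted
# activity. Kernel choice, bandwidth, and data are illustrative placeholders.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def weighted_knn_scores(X_train, y_train, X_test, k=25, bandwidth=1.0):
    """Score each test compound by a kernel-weighted vote of its k neighbours."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    dist, idx = nn.kneighbors(X_test)
    weights = np.exp(-(dist / bandwidth) ** 2)   # Gaussian kernel weights
    # Weighted fraction of active (y=1) neighbours; higher scores rank earlier.
    return (weights * y_train[idx]).sum(axis=1) / weights.sum(axis=1)

rng = np.random.default_rng(2)
X_train, y_train = rng.random((1000, 8)), rng.integers(0, 2, 1000)
X_test = rng.random((100, 8))
scores = weighted_knn_scores(X_train, y_train, X_test)
ranking = np.argsort(-scores)   # compounds ordered by predicted activity
```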
5

Pivot-based Data Partitioning for Distributed k Nearest Neighbor Mining

Kuhlman, Caitlin Anne 20 January 2017 (has links)
This thesis addresses the need for a scalable distributed solution for k-nearest-neighbor (kNN) search, a fundamental data mining task. This unsupervised method poses particular challenges on shared-nothing distributed architectures, where global information about the dataset is not available to individual machines. The distance to search for neighbors is not known a priori, and therefore a dynamic data partitioning strategy is required to guarantee that exact kNN can be found autonomously on each machine. Pivot-based partitioning has been shown to facilitate bounding of partitions; however, state-of-the-art methods suffer from prohibitive data duplication (upwards of 20x the size of the dataset). In this work, an innovative method for solving exact distributed kNN search, called PkNN, is presented. The key idea is to perform computation over several rounds, leveraging pivot-based data partitioning at each stage. Aggressive data-driven bounds limit communication costs, and a number of optimizations are designed for efficient computation. An experimental study on large real-world data (over 1 billion points) compares PkNN to the state-of-the-art distributed solution, demonstrating that the benefits of additional stages of computation in the PkNN method heavily outweigh the added I/O overhead. PkNN achieves a data duplication rate close to 1, significant speedup over previous solutions, and scales effectively in data cardinality and dimension. PkNN can facilitate distributed solutions to other unsupervised learning methods which rely on kNN search as a critical building block. As one example, a distributed framework for the Local Outlier Factor (LOF) algorithm is given. Testing on large real-world and synthetic data with varying characteristics measures the scalability of PkNN and the distributed LOF framework in data size and dimensionality.
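The core idea of pivot-based partitioning can be sketched on a single machine: assign every point to its nearest pivot, so that each partition could later be searched independently by one worker. The sketch below uses random pivots and omits PkNN's bounding and multi-round logic entirely.

```python
# Simplified, single-machine sketch of pivot-based partitioning. Pivot
# selection is random here; PkNN's actual bounding logic is not reproduced.
import numpy as np

def pivot_partition(points, pivots):
    """Assign each point to the partition of its nearest pivot."""
    # Pairwise distances, shape (n_points, n_pivots)
    d = np.linalg.norm(points[:, None, :] - pivots[None, :, :], axis=2)
    return d.argmin(axis=1)

rng = np.random.default_rng(3)
points = rng.random((10_000, 3))
pivots = points[rng.choice(len(points), size=8, replace=False)]  # random pivots
labels = pivot_partition(points, pivots)
# Each labelled group would be shipped to one worker for local kNN search.
```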
6

Visualization of Regional Liver Function with Hepatobiliary Contrast Agent Gd-EOB-DTPA

Samuelsson, Johanna January 2011 (has links)
Liver biopsy is a very common but invasive procedure for diagnosing liver disease. However, such a biopsy may result in severe complications and in some cases even death. Therefore, it would be highly desirable to develop a non-invasive method which would provide the same amount of information on the staging of the disease and also on the location of pathologies. This thesis describes the implementation of such a non-invasive method for visualizing and quantifying liver function by combining MRI (Magnetic Resonance Imaging), image reconstruction, image analysis, and pharmacokinetic modeling. The first attempt involved automatic segmentation, functional clustering (k-means) and classification (kNN) of the input data (liver, spleen and blood vessel segments) for the pharmacokinetic model. However, after implementing and analyzing this method some important issues were identified, and the image segmentation method was therefore revised. The segmentation method that was subsequently developed involved a semi-automatic procedure based on a modified image foresting transform (IFT). The data were then simulated and optimized using a pharmacokinetic model describing the pharmacokinetics of the liver-specific contrast agent Gd-EOB-DTPA in the human body. The output from the modeling procedure was then further analyzed, using a least-squares method, in order to assess liver function by estimating the fractions of hepatocytes, extracellular extravascular space (EES) and blood plasma in each voxel of the image. The results were in fair agreement with literature values, although further analyses and developments will be required to validate and confirm the accuracy of the method.
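The final least-squares step can be illustrated as follows: given per-compartment reference curves from the pharmacokinetic model, the fractions of hepatocytes, EES and plasma in a voxel are estimated from its signal-time curve. The reference curves below are synthetic placeholders, not actual Gd-EOB-DTPA model output.

```python
# Illustrative sketch: per-voxel compartment fractions via non-negative
# least squares. All curves are synthetic stand-ins for model output.
import numpy as np
from scipy.optimize import nnls

t = np.linspace(0, 30, 60)                      # minutes after injection
basis = np.stack([1 - np.exp(-t / 10),          # hypothetical hepatocyte curve
                  np.exp(-t / 5) * t / 5,       # hypothetical EES curve
                  np.exp(-t / 2)]).T            # hypothetical plasma curve

true_fracs = np.array([0.7, 0.2, 0.1])
noise = 0.01 * np.random.default_rng(4).standard_normal(len(t))
voxel_curve = basis @ true_fracs + noise        # simulated voxel signal

fracs, _ = nnls(basis, voxel_curve)             # non-negative least squares
fracs /= fracs.sum()                            # normalize to fractions
print(fracs)                                    # approx. [0.7, 0.2, 0.1]
```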
7

DNIDS: A dependable network intrusion detection system using the CSI-KNN algorithm

Kuang, Liwei 14 September 2007 (has links)
The dependability of an Intrusion Detection System (IDS) relies on two factors: its ability to detect intrusions and its survivability in hostile environments. Machine learning-based anomaly detection approaches are gaining increasing attention in the network intrusion detection community because of their intrinsic ability to discover novel attacks. This ability has become critical, since the number of new attacks has kept growing in recent years. However, most of today's anomaly-based IDSs generate high false positive rates and miss many attacks because of a deficiency in their ability to discriminate attacks from legitimate behaviors. These unreliable results damage the dependability of IDSs. In addition, even if the detection method is sound and effective, the IDS might still be unable to deliver detection service when under attack. With the increasing importance of the IDS, some attackers attempt to disable the IDS before they launch a thorough attack. In this thesis, we propose a Dependable Network Intrusion Detection System (DNIDS) based on the Combined Strangeness and Isolation measure K-Nearest Neighbor (CSI-KNN) algorithm. The DNIDS can effectively detect network intrusions while providing continued service even under attack. The intrusion detection algorithm analyzes different characteristics of network data by employing two measures: strangeness and isolation. Based on these measures, a correlation unit raises intrusion alerts with associated confidence estimates. In the DNIDS, multiple CSI-KNN classifiers work in parallel to deal with different types of network traffic. An intrusion-tolerant mechanism monitors the classifiers and the hosts on which the classifiers reside, enabling the IDS to survive component failures due to intrusions. As soon as a failed IDS component is discovered, a copy of the component is installed to replace it and the detection service continues. We evaluate our detection approach on the KDD'99 benchmark dataset. The experimental results show that the performance of our approach is better than the best result of the KDD'99 contest winner. In addition, the intrusion alerts generated by our algorithm provide graded confidence that offers some insight into the reliability of the intrusion detection. To verify the survivability of the DNIDS, we test the prototype in simulated attack scenarios. In addition, we evaluate the performance of the intrusion-tolerant mechanism and analyze the system reliability. The results demonstrate that the mechanism can effectively tolerate intrusions and achieve high dependability. / Thesis (Master, Computing) -- Queen's University, 2007-09-05 14:36:57.128
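A generic kNN strangeness measure of the kind CSI-KNN builds on can be sketched briefly: a sample is "strange" with respect to a class when it lies far from that class's nearest neighbours relative to the other classes' neighbours. This is an illustrative formulation, not the exact CSI-KNN algorithm.

```python
# Hedged sketch of a kNN strangeness measure: ratio of distances to
# same-class vs. other-class nearest neighbours. Data are synthetic.
import numpy as np

def strangeness(x, X, y, target_class, k=5):
    """Ratio of distances to same-class vs. other-class nearest neighbours."""
    d = np.linalg.norm(X - x, axis=1)
    same = np.sort(d[y == target_class])[:k]
    other = np.sort(d[y != target_class])[:k]
    return same.sum() / other.sum()   # > 1 suggests x is strange for the class

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(3, 1, (100, 4))])
y = np.array([0] * 100 + [1] * 100)
print(strangeness(X[0] + 5, X, y, target_class=0))  # far from class 0: large value
```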
