21

On the effect of INQUERY term-weighting scheme on query-sensitive similarity measures

Kini, Ananth Ullal 12 April 2006 (has links)
Cluster-based information retrieval systems often use a similarity measure to compute the association among text documents. In this thesis, we focus on a class of similarity measures named Query-Sensitive Similarity (QSS) measures. Recent studies have shown QSS measures to positively influence the outcome of a clustering procedure. These studies have used QSS measures in conjunction with the ltc term-weighting scheme. Several term-weighting schemes have superseded the ltc scheme and demonstrated better retrieval performance relative to it. We test whether introducing one of these schemes, INQUERY, offers any benefit over the ltc scheme when used in the context of QSS measures. The testing procedure uses the Nearest Neighbor (NN) test to quantify the clustering effectiveness of QSS measures under the corresponding term-weighting scheme. The NN tests are applied to certain standard test document collections, and the results are tested for statistical significance. On analyzing the NN test results relative to those obtained for the ltc scheme, we find several instances where the INQUERY scheme improves the clustering effectiveness of QSS measures. To be able to apply the NN test, we designed a software test framework, Ferret, by complementing the features provided by dtSearch, a search engine. The test framework automates the generation of NN coefficients by processing standard test document collection data. We provide an insight into the construction and working of the Ferret test framework.
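As an illustration of the class of measures studied here, the sketch below shows one common form of query-sensitive similarity, in which the document-document cosine is modulated by the pair's joint similarity to the query. This is a minimal stand-in under assumed definitions, not the thesis's exact formulation; the qss function and toy vectors are illustrative only.

    import numpy as np

    def cosine(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return 0.0 if denom == 0 else float(a @ b) / denom

    def qss(x, y, q):
        # document-document similarity scaled by how close the pair lies to the query;
        # x and y would come from a term-weighting scheme such as ltc or INQUERY
        return cosine(x, y) * cosine(x + y, q)

    x = np.array([1.0, 0.0, 2.0])   # toy weighted term vectors
    y = np.array([0.5, 1.0, 1.5])
    q = np.array([1.0, 1.0, 0.0])   # query vector
    print(qss(x, y, q))

The NN test would then check, for each document, whether its nearest neighbors under such a measure share its relevance assessments.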
22

Predicting gene–phenotype associations in humans and other species from orthologous and paralogous phenotypes

Woods, John Oates, III 21 February 2014 (has links)
Phenotypes and diseases may be related by seemingly dissimilar phenotypes in other species by means of the orthology of underlying genes. Such "orthologous phenotypes," or "phenologs," are examples of deep homology, and one member of the orthology relationship may be used to predict candidate genes for its counterpart. (There exists evidence of "paralogous phenotypes" as well, but validation is non-trivial.) In Chapter 2, I demonstrate the utility of including plant phenotypes in our database, and provide as an example the prediction of mammalian neural crest defects from an Arabidopsis thaliana phenotype, negative gravitropism defective. In the third chapter, I describe the incorporation of additional phenotypes into our database (including chicken, zebrafish, E. coli, and new C. elegans datasets). I present a method, developed in coordination with Martin Singh-Blom, for ranking predicted candidate genes by way of a k nearest neighbors naïve Bayes classifier drawing phenolog information from a variety of species. The fourth chapter relates to a computational method and application for identifying shared and overlapping pathways which contribute to phenotypes. I describe a method for rapidly querying a database of phenotype–gene associations for Boolean combinations of phenotypes, which yields improved predictions. This method offers insight into the divergence of orthologous pathways in evolution. I demonstrate connections between breast cancer and zebrafish methylmercury response (through oxidative stress and apoptosis); human myopathy and plant red light response genes, minus those involved in water deprivation response (via autophagy); and holoprosencephaly and an array of zebrafish phenotypes. In the first appendix, I present the SciRuby Project, which I co-founded in order to bring scientific libraries to the Ruby programming language. I describe the motivation behind SciRuby and my role in its creation. Finally, in Appendix B, I discuss the first beta release of NMatrix, a dense and sparse matrix library for the Ruby language, which I developed in part to facilitate and validate rapid phenolog searches. In this work, I describe the concept of phenologs as well as the development of the necessary computational tools for discovering phenotype orthology relationships, for predicting associated genes, and for statistically validating the discovered relationships and predicted associations.
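The candidate-gene ranking step can be pictured with the simplified sketch below: phenotypes from other species act as neighbors weighted by gene-set overlap, and each neighbor's genes vote with that weight. This weighted-vote scheme is only a stand-in for the k-nearest-neighbors naïve Bayes classifier described above, and the data structures are hypothetical.

    from collections import Counter

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    def rank_candidates(query_genes, phenotypes, k=10):
        # phenotypes: phenotype name -> set of orthology-mapped genes
        neighbors = sorted(phenotypes.items(),
                           key=lambda kv: jaccard(query_genes, kv[1]),
                           reverse=True)[:k]
        votes = Counter()
        for _, genes in neighbors:
            w = jaccard(query_genes, genes)
            for g in genes - query_genes:   # genes not yet linked to the query phenotype
                votes[g] += w
        return votes.most_common()

    phenotypes = {"zebrafish_p1": {"g1", "g2", "g3"}, "yeast_p2": {"g2", "g4"}}
    print(rank_candidates({"g1", "g2"}, phenotypes, k=2))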
23

An Analysis Tool for Flight Dynamics Monte Carlo Simulations

Restrepo, Carolina 1982- 16 December 2013 (has links)
Spacecraft design is inherently difficult due to the nonlinearity of the systems involved as well as the expense of testing hardware in a realistic environment. The number and cost of flight tests can be reduced by performing extensive simulation and analysis work to understand vehicle operating limits and identify circumstances that lead to mission failure. A Monte Carlo simulation approach that varies a wide range of physical parameters is typically used to generate thousands of test cases. Currently, the data analysis process for a fully integrated spacecraft is mostly performed manually on a case-by-case basis, often requiring several analysts to write additional scripts in order to sort through the large data sets. There is no single method that can be used to identify these complex variable interactions in a reliable and timely manner as well as be applied to a wide range of flight dynamics problems. This dissertation investigates the feasibility of a unified, general approach to the process of analyzing flight dynamics Monte Carlo data. The main contribution of this work is the development of a systematic approach to finding and ranking the most influential variables and combinations of variables for a given system failure. Specifically, a practical and interactive analysis tool that uses tractable pattern recognition methods to automate the analysis process has been developed. The analysis tool has two main parts: the analysis of individual influential variables and the analysis of influential combinations of variables. This dissertation describes in detail the two main algorithms used: kernel density estimation and nearest neighbors. Both are non-parametric density estimation methods that are used to analyze hundreds of variables and combinations thereof to provide an analyst with insightful information about the potential cause for a specific system failure. Examples of dynamical systems analysis tasks using the tool are provided.
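A minimal sketch of the kernel-density-estimation half of such a tool: for each varied parameter, compare its distribution in failed runs against passed runs, and rank parameters by the divergence between the two densities. The total-variation score below is an assumption; the dissertation's exact ranking statistic is not given in the abstract.

    import numpy as np
    from scipy.stats import gaussian_kde

    def influence_score(x_pass, x_fail, grid_size=200):
        # a large divergence between the pass and fail densities of a variable
        # suggests it is influential for the failure under study
        lo = min(x_pass.min(), x_fail.min())
        hi = max(x_pass.max(), x_fail.max())
        grid = np.linspace(lo, hi, grid_size)
        p = gaussian_kde(x_pass)(grid)
        f = gaussian_kde(x_fail)(grid)
        p /= p.sum()
        f /= f.sum()
        return 0.5 * np.abs(p - f).sum()   # total variation distance

    rng = np.random.default_rng(0)
    print(influence_score(rng.normal(0.0, 1.0, 500),    # variable in passed runs
                          rng.normal(1.0, 1.0, 300)))   # same variable in failed runs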
24

Efficient Approach for Order Selection of Projection-Based Model Order Reduction

Baggu, Gnanesh 08 August 2018 (has links)
The present thrust in the electronics industry towards integrating multiple functions on a single chip while operating at very high frequencies has highlighted the need for efficient Electronic Design Automation (EDA) tools to shorten the design cycle and capture market windows. However, the increasing complexity of modern circuit design has made simulation a computationally cumbersome task. The notion of model order reduction has emerged as an effective tool to address this difficulty. Typically, there are numerous approaches and several issues involved in the implementation of model-order reduction techniques. Among the most important of these issues is the problem of determining a suitable order (or size) for the reduced system. An optimal order would be the minimal order that enables the reduced system to capture the behavior of the original (more complex and larger) system up to a user-defined frequency. The contribution presented in this thesis is a new approach for determining the order of the reduced system. The proposed approach is based on approximating the impulse response of the original system in the time domain. The core methodology in obtaining that approximation is numerically inverting the Laplace-domain representation of the impulse response from the complex domain (s-domain) into the time domain. The main advantage of the proposed approach is that it allows the order selection algorithm to operate directly on the time-domain form of the impulse response. It is well known that numerically generating the impulse response in the time domain is very difficult, if not impossible, since it requires driving the original network with the Dirac delta function, which is a mathematical abstraction rather than a concrete waveform that can be implemented on a digital computer. However, this difficulty is avoided in the proposed approach, since it uses the Laplace-domain image of the impulse response to obtain its time-domain representation. The numerical simulations presented in the thesis demonstrate that using the time-domain waveform of the impulse response, computed with the proposed approach and properly filtered with a Butterworth filter, guides the order selection algorithm to select a smaller order, i.e., the reduced system becomes more compact in size. The phrase "smaller or more compact" in this context refers to the comparison with existing techniques currently in use, which seek to generate some form of time-domain approximation of the impulse response by driving the original network with a pulse-shaped function (e.g., a Gaussian pulse).
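The core numerical step can be sketched with mpmath's invertlaplace, which recovers time-domain samples of an impulse response directly from its s-domain image, with no Dirac-delta excitation needed. The transfer function below is a toy stand-in for a reduced network, and the thesis's Butterworth filtering step is omitted.

    from mpmath import mp, invertlaplace

    mp.dps = 15
    H = lambda s: 1 / (s**2 + s + 1)   # toy s-domain impulse response (transfer function)

    # sample the time-domain impulse response h(t) on a small grid
    for t in [0.5 * i for i in range(1, 6)]:
        print(t, invertlaplace(H, t, method='talbot'))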
25

An Extension of the Nearest Neighbors Prediction Method for the Mixed Poisson Model

Helder Alves Arruda 28 March 2017 (has links)
Many proposals have emerged in recent years for problems involving the prediction of future observations in mixed models; however, there are few studies for cases in which random-effect values must be assigned to new groups. Tamura, Giampaoli and Noma (2013) proposed a method, the Nearest Neighbors Prediction Method (NNPM), that computes the distances between a new group and groups with known random effects based on the values of the covariates, considering the mixed logistic model. The goal of this dissertation was to extend the NNPM to the mixed Poisson model and to obtain confidence intervals for the predictions. To that end, new measures of prediction performance were proposed, along with the use of the Bootstrap methodology to construct the intervals. The prediction method was applied to two real datasets and within simulation studies, and good performance was obtained in both cases. Thus, the NNPM proved to be a very satisfactory prediction method in the mixed Poisson case as well.
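A minimal sketch of a 1-nearest-neighbor reading of NNPM for the mixed Poisson case: the new group borrows the random effect of the covariate-closest known group, and the count mean follows from the log link. The array shapes and toy values are assumptions, and the bootstrap confidence intervals are omitted.

    import numpy as np

    def nnpm_predict(x_new, group_covariates, group_effects, beta):
        # assign the random effect of the covariate-nearest known group (1-NN),
        # then form the Poisson mean through the log link
        dists = np.linalg.norm(group_covariates - x_new, axis=1)
        b = group_effects[np.argmin(dists)]
        return np.exp(x_new @ beta + b)

    groups = np.array([[0.2, 1.0], [1.5, 0.3]])    # covariates of known groups
    effects = np.array([0.4, -0.2])                # their estimated random effects
    beta = np.array([0.8, 0.1])                    # fixed-effect coefficients
    print(nnpm_predict(np.array([0.3, 0.9]), groups, effects, beta))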
26

Scaling out-of-core k-nearest neighbors computation on single machines

Olivares, Javier 19 December 2016 (has links)
The K-Nearest Neighbors (KNN) algorithm is an efficient method for finding similar items within a large dataset. Over the years, a huge number of applications have used KNN's capabilities to discover similarities in data generated in areas as diverse as business, medicine, music, and computer science. Although years of research have produced several variants of the algorithm, its implementation remains a challenge, particularly today when data is growing at unprecedented rates. In this context, running KNN on large datasets raises two major issues: huge memory footprints and very long runtimes. Because of these high costs in computational resources and time, state-of-the-art KNN works do not consider the fact that data can change over time; they assume that the data remains static throughout the computation, which unfortunately does not conform to reality at all. In this thesis, we address these challenges in our contributions. Firstly, we propose an out-of-core approach to compute KNN on large datasets using a commodity single PC. We advocate this approach as an inexpensive way to scale the KNN computation compared to the high cost of a distributed algorithm, both in terms of computational resources and of coding, debugging, and deployment effort. Secondly, we propose a multithreaded out-of-core approach to face the challenges of computing KNN on data that changes rapidly and continuously over time. After a thorough evaluation, we observe that our main contributions address the challenges of computing KNN on large datasets, leveraging the restricted resources of a single machine, decreasing runtimes compared to the baselines, and scaling the computation both on static and on dynamic datasets.
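The out-of-core idea can be pictured with a memory-mapped, chunked brute-force scan: the dataset lives on disk and only one block at a time is brought into RAM. This single-threaded sketch is an assumption for illustration; it does not reproduce the thesis's multithreaded design or its handling of dynamic data.

    import numpy as np

    n, d, k = 100_000, 16, 10
    rng = np.random.default_rng(0)

    # build a small on-disk dataset standing in for one larger than RAM
    data = np.memmap("points.f32", dtype=np.float32, mode="w+", shape=(n, d))
    data[:] = rng.random((n, d), dtype=np.float32)
    data.flush()

    query = rng.random(d, dtype=np.float32)
    best_d = np.full(k, np.inf)
    best_i = np.full(k, -1, dtype=np.int64)

    # scan the file in fixed-size chunks, keeping the k best distances seen so far
    for start in range(0, n, 10_000):
        block = np.asarray(data[start:start + 10_000])
        dists = np.linalg.norm(block - query, axis=1)
        cand_d = np.concatenate([best_d, dists])
        cand_i = np.concatenate([best_i, np.arange(start, start + len(block))])
        order = np.argsort(cand_d)[:k]
        best_d, best_i = cand_d[order], cand_i[order]

    print(best_i, best_d)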
27

PREDICTION OF PUBLIC BUS TRANSPORTATION PLANNING BASED ON PASSENGER COUNT AND TRAFFIC CONDITIONS

Heidaripak, Samrend January 2021 (has links)
Artificial intelligence has become a hot topic in the past couple of years because of its potential for solving problems. The most widely used subset of artificial intelligence today is machine learning, which is essentially the way a machine can learn to perform tasks without explicit instructions. A problem that has historically been solved by common knowledge and experience, and has thus been prone to mistakes, is the planning of bus transportation. This thesis investigates how to extract the key features of a raw dataset and whether machine learning algorithms can be applied to predict and plan public bus transportation while also considering weather conditions. A pre-processing method extracts the features before a k-nearest neighbors model and an artificial neural network model are created and evaluated; predicting the passenger count on a given route could then support the planning of bus transportation. The outcome of the thesis was that the feature extraction was successful, and both models could successfully predict the passenger count under normal conditions. However, under extreme conditions such as the pandemic during 2020, the models could not be shown to predict the passenger count successfully, nor to be usable for planning bus transportation.
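A minimal sketch of the k-nearest-neighbors half of such a setup, with a hypothetical feature layout (hour of day, day of week, temperature, precipitation); the actual features, dataset, and the artificial neural network model are not reproduced here.

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # hypothetical features: hour of day, day of week, temperature (C), precipitation (mm)
    X = np.array([[8, 1, 5.0, 0.0],
                  [17, 1, 7.5, 1.2],
                  [8, 6, 3.0, 0.4],
                  [17, 5, 10.0, 0.0]])
    y = np.array([42, 55, 12, 48])   # toy passenger counts

    model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=2))
    model.fit(X, y)
    print(model.predict([[8, 2, 6.0, 0.0]]))   # predicted count for a new departure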
28

Efficient Algorithms for Data Mining with Federated Databases

Young, Barrington R. St. A. 03 July 2007 (has links)
No description available.
29

A Parallel Algorithm for Query Adaptive, Locality Sensitive Hash Search

Carraher, Lee A. 17 September 2012 (has links)
No description available.
30

Predicting basketball performance based on draft pick : A classification analysis

Harmén, Fredrik January 2022 (has links)
In this thesis, we look to predict the performance of a basketball player coming into the NBA depending on where the player was picked in the NBA draft. This is done by testing different machine learning models on data from the previous 35 NBA drafts and then comparing the models to see which has the highest classification accuracy. The machine learning methods used are Linear Discriminant Analysis, K-Nearest Neighbors, Support Vector Machines, and Random Forests. The results show that the method with the highest classification accuracy was Random Forests, with an accuracy of 42%.
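A sketch of the comparison protocol on synthetic placeholder data (the draft dataset itself is not shown): the four classifiers are cross-validated on the same feature matrix and compared by mean accuracy.

    from sklearn.datasets import make_classification
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    # synthetic stand-in for per-player draft features and performance classes
    X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                               n_classes=3, random_state=0)

    models = {"LDA": LinearDiscriminantAnalysis(),
              "KNN": KNeighborsClassifier(),
              "SVM": SVC(),
              "Random Forest": RandomForestClassifier(random_state=0)}

    for name, clf in models.items():
        scores = cross_val_score(clf, X, y, cv=5)
        print(f"{name}: {scores.mean():.2f}")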
