61. Improving Search Ranking Using a Composite Scoring Approach
Snedden, Larry D (01 January 2017)
This thesis studies how to improve the relevance of computerized search results. Information search tools return ranked lists of documents ordered by their relevance to the user-supplied search. Because a small number of words and phrases must represent complex ideas and concepts, user search queries are information sparse, and this sparsity makes it difficult for search tools to locate relevant documents. A review of the challenges in information search helps to identify the problems and to suggest improvements to current search tools. Building on the suggestions put forth by the Strategic Workshop on Information Retrieval in Lorne (SWIRL), a composite scoring approach (the Composite Scorer) is developed. The Composite Scorer considers various aspects of an information need to improve ranked search results, returning records relevant to the user's information need.
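As an illustration of the composite idea, a minimal sketch follows; the signal names and weights are assumptions chosen for illustration rather than the Composite Scorer's actual formulation:

```python
# Illustrative composite relevance scorer: combine several per-document
# signals into a single weighted score. The specific signals and weights
# below are assumptions, not the thesis's exact formulation.

def composite_score(doc_signals, weights=None):
    """doc_signals: dict of signal name -> normalized value in [0, 1]."""
    if weights is None:
        weights = {"tfidf": 0.5, "recency": 0.2, "entity_overlap": 0.3}
    return sum(weights[name] * doc_signals.get(name, 0.0) for name in weights)

docs = {
    "report_17": {"tfidf": 0.62, "recency": 0.90, "entity_overlap": 0.40},
    "report_42": {"tfidf": 0.71, "recency": 0.10, "entity_overlap": 0.05},
}

# Rank documents by descending composite score.
ranking = sorted(docs, key=lambda d: composite_score(docs[d]), reverse=True)
print(ranking)
```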
The Florida Fusion Center (FFC), a local law enforcement agency, needs a more effective information search tool. The agency processes large volumes of police reports daily, typically written as text documents. Current search methods require inordinate amounts of time and skill to identify relevant reports within this large collection.
An experiment conducted by FFC investigators contrasted the composite scoring approach against a common search scoring approach (TF/IDF). Police investigators used a custom-built software interface to run several use-case scenarios, searching for documents related to various criminal investigations. These expert users then judged the relevance of the top ten ranked documents returned by each scorer. The collected judgments were used to measure the performance of the two scorers. A search that returns many irrelevant documents costs users time and, potentially, unsolved crimes; a cost function contrasted the cost of the two scoring methods across the use cases. Mean Average Precision (MAP), a common method for evaluating ranked-list search results, was also computed for both scoring methods to provide a numeric value representing each scorer's accuracy at returning relevant documents within the top ten results.
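For reference, a minimal sketch of how MAP can be computed over top-ten results from binary relevance judgments (the judgments shown are invented for illustration):

```python
# Mean Average Precision (MAP) over ranked lists, given binary relevance
# judgments for each returned document (1 = relevant, 0 = not relevant).

def average_precision(relevance):
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(queries):
    return sum(average_precision(r) for r in queries) / len(queries)

# Hypothetical top-ten judgments for three searches under one scorer.
judgments = [
    [1, 1, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
]
print(round(mean_average_precision(judgments), 3))
```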
The purpose of this study is to determine whether a composite scoring approach to ranked lists, one that considers multiple aspects of a user's search, can improve the quality of search by returning a greater number of relevant documents. This research contributes to the understanding of composite scoring methods for improving search results. Understanding their value allows researchers to evaluate, explore, and possibly extend the approach by incorporating other information aspects, such as word and document meaning.
62. ESTIMATION ON GIBBS ENTROPY FOR AN ENSEMBLE
Sake, Lekhya Sai (01 December 2015)
In a world of rapidly growing technology, even small improvements can be revolutionary, and one such development in computer science is parallel computing. A single parallel execution is not sufficient to reveal its non-deterministic behavior, since the same execution with the same data at different times can follow different paths. Characterizing how non-deterministic a parallel execution can be therefore requires an ensemble of executions. This project implements a program to estimate the Gibbs entropy for an ensemble of parallel executions. The goal is to develop tools for studying the non-determinism of parallel code based on execution entropy, and to use these tools in current and future research.
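A minimal sketch of the underlying calculation, assuming the entropy is taken over the observed frequencies of distinct execution paths in the ensemble (the traces shown are hypothetical):

```python
import math
from collections import Counter

# Gibbs entropy S = -sum(p_i * ln p_i) over the probability p_i of each
# distinct execution path observed in an ensemble of parallel runs
# (Boltzmann's constant is dropped, i.e. k = 1, for a dimensionless value).

def gibbs_entropy(ensemble):
    counts = Counter(ensemble)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Hypothetical ensemble: each string encodes the thread interleaving
# observed in one execution of the same parallel program on the same data.
ensemble = ["ABAB", "ABAB", "AABB", "ABBA", "ABAB", "AABB"]
print(round(gibbs_entropy(ensemble), 4))
```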
63. Active Analytics: Adapting Web Pages Automatically Based on Analytics Data
Carle, William R., II (01 January 2016)
Web designers are expected to perform the difficult task of adapting a site's design to fit changing usage trends. Web analytics tools give designers a window into website usage patterns, but the data must be analyzed and applied to a website's user interface design manually. A framework for marrying live analytics data with user interface design could allow interfaces to adapt dynamically to usage patterns, with little or no action from the designers. The goal of this research is to create a framework that uses web analytics data to automatically update and enhance web user interfaces. We present a solution for extracting analytics data from Google Analytics via web services and transforming them into reporting data that informs user interface improvements. Once the data are extracted and summarized, we expose the summarized reports via our own web services in a form that can be consumed by our client-side User Interface (UI) framework, which dynamically updates the content and navigation on the page to reflect the data mined from the web usage reports. The resulting system reacts to changing usage patterns of a website and updates the user interface accordingly. We evaluated the framework by assigning navigation tasks to users on the UNF website and measuring the time it took them to complete those tasks, with one group using the site with our framework enabled and one group using the original website. The group using the modified version of the site navigated it more quickly and effectively.
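A minimal sketch of the reporting-to-UI step, assuming a summarized pageviews report has already been pulled from the analytics web service (the report values and navigation targets are invented for illustration, not actual UNF analytics data or the framework's real API):

```python
# Illustrative server-side step: turn a summarized analytics report into a
# navigation ordering that a client-side UI framework could consume.
# The report contents below are made-up values.

pageview_report = {
    "/admissions": 18200,
    "/academics": 9400,
    "/athletics": 7100,
    "/library": 2600,
    "/parking": 12900,
}

def navigation_order(report, limit=5):
    """Return navigation targets sorted by descending pageviews."""
    ranked = sorted(report.items(), key=lambda kv: kv[1], reverse=True)
    return [path for path, _ in ranked[:limit]]

# A payload like this could be served to the client-side UI framework,
# which would then promote the most-visited pages in the page navigation.
print(navigation_order(pageview_report))
```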
64. Performance Evaluation of LINQ to HPC and Hadoop for Big Data
Sivasubramaniam, Ravishankar (01 January 2013)
There is currently considerable enthusiasm around MapReduce and related distributed computing paradigms for analyzing large volumes of data. Apache Hadoop is the most popular open-source implementation of the MapReduce model, and LINQ to HPC is Microsoft's alternative to it. In this thesis, the performance of LINQ to HPC and Hadoop is compared using different benchmarks.
To this end, we identified four benchmarks (Grep, Word Count, Read, and Write) and ran them on both LINQ to HPC and Hadoop. For each benchmark, we measured each system's performance metrics (execution time, average CPU utilization, and average memory utilization) for various degrees of parallelism on clusters of different sizes. The results revealed some interesting trade-offs: LINQ to HPC performed better on three of the four benchmarks (Grep, Read, and Write), whereas Hadoop performed better on the Word Count benchmark. While extensive research has focused on Hadoop, there are few comparable studies of the LINQ to HPC platform, which was still evolving during the writing of this thesis.
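For context, the sketch below illustrates the map and reduce steps behind the Word Count benchmark as a single in-process script; it is a stand-in for, not a reproduction of, the benchmark implementations actually run on LINQ to HPC and Hadoop:

```python
# Minimal in-process illustration of the Word Count MapReduce pattern:
# map emits (word, 1) pairs, shuffle groups by key, reduce sums the counts.

from collections import defaultdict

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, one in pairs:   # the "shuffle" is implicit in the dict grouping
        counts[word] += one
    return dict(counts)

corpus = ["the quick brown fox", "the lazy dog", "the quick dog"]
print(reduce_phase(map_phase(corpus)))
```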
65. A PROBABILISTIC MACHINE LEARNING FRAMEWORK FOR CLOUD RESOURCE SELECTION ON THE CLOUD
Khan, Syeduzzaman (01 January 2020)
Executing scientific applications on the Cloud offers great flexibility, scalability, cost-effectiveness, and substantial computing power. Market-leading Cloud service providers such as Amazon Web Services (AWS), Azure, and Google Cloud Platform (GCP) offer general-purpose, memory-intensive, and compute-intensive Cloud instances for running scientific applications. The scientific community, especially small research institutions and undergraduate universities, faces many hurdles when conducting high-performance computing research without large dedicated clusters. The Cloud provides an attractive alternative to dedicated clusters; however, the wide range of Cloud computing choices makes instance selection difficult for end-users. This thesis aims to simplify Cloud instance selection by proposing a probabilistic machine learning framework that allows users to select a suitable Cloud instance for their scientific applications.
This research builds on the previously proposed A2Cloud-RF framework that recommends high-performing Cloud instances by profiling the application and the selected Cloud instances. The framework produces a set of objective scores called the A2Cloud scores, which denote the compatibility level between the application and the selected Cloud instances. When used alone, the A2Cloud scores become increasingly unwieldy with an increasing number of tested Cloud instances. Additionally, the framework only examines the raw application performance and does not consider the execution cost to guide resource selection. To improve the usability of the framework and assist with economical instance selection, this research adds two Naïve Bayes (NB) classifiers that consider both the application’s performance and execution cost. These NB classifiers include: 1) NB with a Random Forest Classifier (RFC) and 2) a standalone NB module.
Naïve Bayes with a Random Forest Classifier (RFC) augments the A2Cloud-RF framework's final instance ratings with the execution cost metric. In the training phase, the classifier builds the frequency and probability tables. The classifier recommends a Cloud instance based on the highest posterior probability for the selected application.
The standalone NB module uses the generated A2Cloud score (an intermediate result from the A2Cloud-RF framework) and the execution cost metric to construct an NB classifier. The classifier forms a frequency table and probability (prior and likelihood) tables. To recommend a Cloud instance for a test application, it calculates the posterior probability of each Cloud instance and recommends the instance with the highest posterior probability. This study executes eight real-world applications on 20 Cloud instances from AWS, Azure, GCP, and Linode. We train the NB classifiers using 80% of this dataset and use the remaining 20% for testing. The testing yields more than 90% recommendation accuracy for the chosen applications and Cloud instances. Because the dataset is imbalanced and the classification is multi-class, we describe model performance using the confusion matrix (true positives, false positives, true negatives, and false negatives) and F1 scores, which exceed 0.9. The final goal of this research is to make Cloud computing an accessible resource for conducting high-performance scientific executions by enabling users to select an effective Cloud instance from across multiple providers.
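The sketch below illustrates the standalone NB recommendation step with discretized features; the feature values, instance names, and smoothing choice are assumptions for illustration, not the framework's actual training data or implementation:

```python
# Toy standalone Naive Bayes: build prior and likelihood tables from labeled
# examples, then recommend the Cloud-instance class with the highest posterior.
# Features and labels below are invented for illustration only.

from collections import Counter, defaultdict

def train_nb(samples):
    """samples: list of (feature_dict, label). Returns priors, likelihoods, n."""
    priors = Counter(label for _, label in samples)
    likelihoods = defaultdict(Counter)   # (feature, label) -> Counter of values
    for features, label in samples:
        for name, value in features.items():
            likelihoods[(name, label)][value] += 1
    return priors, likelihoods, len(samples)

def posterior(features, label, priors, likelihoods, n, alpha=1.0):
    # Unnormalized posterior: prior times the product of smoothed likelihoods.
    p = priors[label] / n
    for name, value in features.items():
        table = likelihoods[(name, label)]
        # Crude Laplace smoothing so unseen feature values do not zero out p.
        p *= (table[value] + alpha) / (sum(table.values()) + alpha * (len(table) + 1))
    return p

training = [
    ({"a2cloud_score": "high", "cost": "low"},  "c5.xlarge"),
    ({"a2cloud_score": "high", "cost": "high"}, "c5.xlarge"),
    ({"a2cloud_score": "low",  "cost": "low"},  "e2-standard-4"),
    ({"a2cloud_score": "mid",  "cost": "low"},  "e2-standard-4"),
]
priors, likes, n = train_nb(training)
test = {"a2cloud_score": "high", "cost": "low"}
print(max(priors, key=lambda lbl: posterior(test, lbl, priors, likes, n)))
```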
66. Generating a Normalized Database Using Class Normalization
Sudhindaran, Daniel Sushil (01 January 2017)
Relational databases remain the most popular databases used by enterprise applications to store persistent data, offering considerable flexibility and efficiency. A process called database normalization helps ensure that a database is free from redundancies and update anomalies. In a Database-First approach to software development, the database is designed first, and an Object-Relational Mapping (ORM) tool is then used to generate the programming classes (the data layer) that interact with the database. Finally, the business logic code is written to interact with the data layer and persist business data to the database. In modern application development, however, a Code-First approach has evolved in which the domain classes and the business logic that interacts with them are written first, and an ORM tool is then used to generate the database from the domain classes. Because database design is not an explicit concern in this approach, software programmers may ignore the process of database normalization altogether. To help programmers in this situation, this thesis takes the theory behind the five database normal forms (1NF - 5NF) and proposes Five Class Normal Forms (1CNF - 5CNF) that programmers may use to normalize their domain classes. The thesis demonstrates that when the Five Class Normal Forms are applied manually to a class, the database generated through the Code-First approach is also normalized according to the rules of relational theory.
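As an illustrative sketch only (it does not reproduce the thesis's Class Normal Form rules), the example below shows a denormalized domain class split into reference-linked classes, so that a Code-First ORM would generate a less redundant schema:

```python
# Illustrative refactoring in the spirit of class normalization. The
# denormalized Order class repeats customer attributes on every order, so a
# Code-First ORM would generate a redundant, anomaly-prone table from it.
from dataclasses import dataclass

@dataclass
class DenormalizedOrder:
    order_id: int
    customer_name: str      # repeated for every order by the same customer
    customer_email: str     # update anomaly: must be changed in many rows
    item: str
    quantity: int

# Normalized form: customer attributes live in their own class, and orders
# reference the customer by key, mirroring a 2NF/3NF relational design.
@dataclass
class Customer:
    customer_id: int
    name: str
    email: str

@dataclass
class Order:
    order_id: int
    customer_id: int        # foreign-key style reference to Customer
    item: str
    quantity: int
```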
67. Compiling Unit Clauses for the Warren Abstract Machine
Herbert, George D. (01 January 1987)
This thesis describes the design, development, and installation of a computer program that compiles unit clauses generated in a Prolog-based environment at Argonne National Laboratory into Warren Abstract Machine (WAM) code. The program enhances the capabilities of the environment by providing rapid unification and subsumption tests for the very significant class of unit clauses. This should improve performance substantially for large programs that generate and use many unit clauses.
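For readers unfamiliar with WAM code, the sketch below shows the general shape of the translation for a ground unit clause; the instruction names follow the standard WAM convention of get instructions for head arguments, and the compiler shown is a hypothetical toy, not the Argonne program described here:

```python
# Toy translation of a ground unit clause, e.g. parent(tom, bob)., into a
# WAM-like instruction list: one get_constant per head argument, then proceed.
# This is an illustrative sketch, not the compiler described in the thesis.

def compile_unit_clause(functor, args):
    code = [f"{functor}/{len(args)}:"]
    for i, arg in enumerate(args, start=1):
        code.append(f"    get_constant {arg}, A{i}")
    code.append("    proceed")
    return code

print("\n".join(compile_unit_clause("parent", ["tom", "bob"])))
```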
68. Development of a continuous condition monitoring system based on probabilistic modelling of partial discharge data for polymeric insulation cables
Ahmed, Zeeshan (09 August 2019)
Partial discharge (PD) measurements have been widely accepted as an efficient online insulation condition assessment method for high voltage equipment. Two experimental PD measurement setups were established with the aim of studying how partial discharge characteristics vary as insulation degrades, in terms of the physical phenomena taking place at the PD sources, up to the point of failure. Probabilistic lifetime modeling techniques based on classification, regression, and multivariate time series analysis were applied to a system of PD response variables: average charge, pulse repetition rate, average charge current, and largest repetitive discharge magnitude over the data acquisition period. Experimental lifelong PD data obtained from samples subjected to accelerated degradation was used to study the dynamic trends and relationships among these response variables. Distinguishable data clusters detected by the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm allow state-of-the-art modeling techniques to be examined on the PD data, and the response behavior of the trained models makes it possible to distinguish the different stages of insulation degradation. In parallel with the classification and regression models, a multivariate time series analysis was performed to forecast PD activity, that is, the PD response variables corresponding to insulation degradation. Observed values and forecasted mean values lie within the 95th percentile confidence interval responses over a definite horizon period, which demonstrates the soundness and accuracy of the models. A life-prediction model based on the cointegrated relations among the multiple response variables, with trained model responses correlated against experimentally evaluated time-to-breakdown values and well-known physical discharge mechanisms, can be used to set an early-warning alarm trigger and is a step toward establishing long-term continuous monitoring of partial discharge activity. Furthermore, this dissertation proposes an effective PD monitoring system based on wavelet and deflation compression techniques for optimal data acquisition, together with an algorithm for large-scale data reduction that minimizes PD data size and retains only the useful PD information. This historically recorded information can then be used not only for post-fault diagnostics, but also to improve the performance of the modeling algorithms and to support accurate threshold detection.
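A small sketch of the clustering step, assuming the PD response variables have been assembled into a feature matrix (the synthetic values and the scikit-learn t-SNE call are illustrative assumptions, not the dissertation's actual data or pipeline):

```python
# Illustrative t-SNE embedding of PD feature vectors (average charge, pulse
# repetition rate, average charge current, largest repetitive discharge).
# Synthetic data stands in for the experimental measurements.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
early_stage = rng.normal(loc=[1.0, 50, 0.2, 5],   scale=0.1, size=(100, 4))
late_stage  = rng.normal(loc=[4.0, 200, 1.5, 40], scale=0.3, size=(100, 4))
features = np.vstack([early_stage, late_stage])

# Standardize, then embed into 2-D; distinct clusters would suggest
# separable degradation stages, as reported for the experimental PD data.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    StandardScaler().fit_transform(features)
)
print(embedding.shape)   # (200, 2)
```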