About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.

Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
11

Mining frequent sequences in one database scan using distributed computers

Brajczuk, Dale A. 01 September 2011 (has links)
Existing frequent-sequence mining algorithms perform multiple scans of a database, or of a structure that captures the database. In this M.Sc. thesis, I propose a frequent-sequence mining algorithm that mines each database row as it reads it, so that it can potentially complete mining in the time it takes to read the database once. I achieve this by having my algorithm enumerate all sub-sequences of each row as it reads it. Since sub-sequence enumeration is a time-consuming process, I create a method to distribute the work over multiple computers, processors, and threads, balancing the load across all resources and limiting the amount of communication so that my algorithm scales well with the number of computers used. Experimental results show that my algorithm is effective and can potentially complete the mining process in close to the time it takes to perform one scan of the input database.
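
As a rough illustration of the single-scan idea (a minimal sketch, not the thesis's distributed implementation), the following Python fragment enumerates every sub-sequence of each row as the row is read and tallies support in one pass; the enumeration step is the exponential-cost work the thesis distributes across machines:

```python
from collections import Counter
from itertools import combinations

def subsequences(row):
    # Enumerate every non-empty sub-sequence (order-preserving selection
    # of items) of one database row. Exponential in the row length, which
    # is why the thesis distributes this step across machines.
    for length in range(1, len(row) + 1):
        for idx in combinations(range(len(row)), length):
            yield tuple(row[i] for i in idx)

def mine_one_scan(database, min_support):
    counts = Counter()
    for row in database:                       # a single pass over the data
        counts.update(set(subsequences(row)))  # count each pattern once per row
    return {s: n for s, n in counts.items() if n >= min_support}

db = [("a", "b", "c"), ("a", "c"), ("b", "c")]
print(mine_one_scan(db, min_support=2))  # ('a',), ('b',), ('c',), ('a','c'), ('b','c')
```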
12

Parallel computational techniques for explicit finite element analysis

Sziveri, Janos January 1997 (has links)
No description available.
13

PERFORMANCE ESTIMATION AND SCHEDULING FOR PARALLEL PROGRAMS WITH CRITICAL SECTIONS

Dutta, Sourav 01 May 2017 (has links)
A fundamental problem in multithreaded parallel programs is the partial serialization imposed by mutual-exclusion variables, or critical sections. In this work we investigate a model in which each thread consists of the same number L of functional blocks, where each block has the same duration and either accesses a critical section or executes non-critical code. We derive formulas to estimate the average time spent in a critical section both in the presence of a synchronization barrier and in its absence. We also develop, and establish the optimality of, a fast polynomial-time algorithm that finds a schedule with the shortest makespan for any number of threads and any number of critical sections in the case L = 2. For the general case L > 2, which is NP-complete, we present a competitive heuristic and provide experimental comparisons with an exact integer linear programming (ILP) formulation.
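
To make the model concrete, here is a hedged Python sketch (a greedy simulation of my own construction, not the authors' optimal algorithm) of threads built from equal-duration blocks, where blocks naming the same critical section can never run in the same step; the makespan is the number of steps until every thread finishes:

```python
def makespan(threads):
    # Greedy unit-time simulation. Each thread is a tuple of L blocks;
    # a block is a critical-section id (str) or None for non-critical code.
    # Blocks naming the same critical section never run in the same step.
    pos = [0] * len(threads)
    steps = 0
    while any(pos[i] < len(t) for i, t in enumerate(threads)):
        held = set()                      # critical sections busy this step
        for i, t in enumerate(threads):
            if pos[i] == len(t):
                continue                  # thread already finished
            block = t[pos[i]]
            if block is None:             # non-critical code always proceeds
                pos[i] += 1
            elif block not in held:       # acquire the section for this step
                held.add(block)
                pos[i] += 1
            # otherwise the thread stalls and retries next step
        steps += 1
    return steps

# Three threads with L = 2 blocks each, sharing one critical section 'cs':
print(makespan([("cs", None), ("cs", None), (None, "cs")]))  # -> 3
```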
14

Bulk primitives in Linda run-time systems

Rowstron, Antony Ian Taylor January 1996 (has links)
No description available.
15

An Architecture for Geographically-Oriented Service Discovery on the Internet

Li, Qiyan January 2002 (has links)
Most of the service discovery protocols available on the Internet are built upon its logical structure, and this shows in how they behave. For instance, Jini and SLP service providers announce their presence by multicasting service advertisements, an approach that is neither intended nor able to scale to the size of the Internet. With mobile and wireless devices becoming increasingly popular, there is a growing need to perform service discovery in a wide-area context, yet there is very little direct correlation between the Internet topology and geographic location. Even for desktop computers, such a need can arise from time to time. This suggests the need for an architecture that allows users to locate resources on the Internet using geographic criteria. This thesis presents such an architecture, one that can be deployed with minimal effort in the existing network infrastructure. The geographic information can be shared among multiple applications, much as DNS is shared throughout the Internet. The design and implementation of the architecture are discussed in detail, and three case studies illustrate how the architecture can be employed by various applications to satisfy dramatically different end-user needs.
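
As a toy illustration of geographically-oriented discovery (an assumption-laden sketch, not the thesis's actual architecture; all names below are hypothetical), a registry could index providers by coordinates and answer radius queries instead of relying on LAN multicast:

```python
from math import hypot

class GeoRegistry:
    # Toy geographic directory: providers register with coordinates and
    # clients query by position and radius, instead of LAN multicast.
    def __init__(self):
        self.services = []                      # (name, lat, lon) triples

    def register(self, name, lat, lon):
        self.services.append((name, lat, lon))

    def discover(self, lat, lon, radius_km):
        # Rough planar distance; a deployable system would use great-circle
        # distance and a hierarchical, DNS-like index of regions.
        km_per_degree = 111.0
        return [n for n, la, lo in self.services
                if hypot(la - lat, lo - lon) * km_per_degree <= radius_km]

reg = GeoRegistry()
reg.register("printer.example.org", 43.47, -80.54)
reg.register("kiosk.example.org", 43.65, -79.38)
print(reg.discover(43.47, -80.52, radius_km=10))  # ['printer.example.org']
```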
16

Uniform Access to Signal Data in a Distributed Heterogeneous Computing Environment

Jeffreys, Steven 10 1900 (has links)
International Telemetering Conference Proceedings / October 26-29, 1992 / Town and Country Hotel and Convention Center, San Diego, California / One of the problems in analyzing data is getting the data to the analysis system. The data can be stored in a variety of ways, from simple disk and tape files to a sophisticated relational database system. The variety of storage techniques requires the data analysis system to be aware of the details of how the data may be accessed (e.g., file formats, SQL statements, BBN/Probe commands, etc.). The problem is much worse in a network of heterogeneous machines; besides the details of each storage method, the analysis system must handle the details of network access, and may have to translate data from one vendor format to another as it moves from machine to machine. This paper describes a simple and powerful software interface to telemetry data in a distributed heterogeneous networking environment, and how that interface is being used in a diagnostic expert system. In this case, the interface connects the expert system, running on a Sun UNIX machine, with the data on a VAX/VMS machine. The interface exists as a small subroutine library that can be linked into a variety of data analysis systems. The interface insulates the expert system from all details of data access, providing transparent access to data across the network. A further benefit of this approach is that the data source itself can be a sophisticated data analysis system that may perform some processing of the data, again transparently to the user of the interface. The interface subroutine library can be readily applied to a wide variety of data analysis applications.
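
The insulation idea can be sketched as an abstract interface with interchangeable back ends; the class and method names below are hypothetical stand-ins, not the paper's actual subroutine library:

```python
from abc import ABC, abstractmethod

class SignalSource(ABC):
    # The uniform interface: analysis code only ever calls read(); where
    # and how the samples are stored stays hidden behind the subclass.
    @abstractmethod
    def read(self, channel, start, count):
        ...

class FileSignalSource(SignalSource):
    def __init__(self, samples_by_channel):
        self._data = samples_by_channel         # stand-in for a disk file

    def read(self, channel, start, count):
        return self._data[channel][start:start + count]

class RemoteSignalSource(SignalSource):
    def __init__(self, host):
        self._host = host                       # a real version opens a socket

    def read(self, channel, start, count):
        # Placeholder: fetch from the remote machine and translate vendor
        # formats (e.g., VAX floats to IEEE) before returning.
        raise NotImplementedError("network fetch goes here")

src = FileSignalSource({"ch1": [0.0, 0.5, 1.0, 0.5]})
print(src.read("ch1", 1, 2))  # [0.5, 1.0] -- caller never sees storage details
```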
17

The agile design and manufacture of rolling bearings via AI and Internet tools

Pan, Peiyuan January 1999 (has links)
No description available.
18

Distributed Text Mining in R

Theußl, Stefan, Feinerer, Ingo, Hornik, Kurt 16 March 2011 (has links) (PDF)
R has recently gained explicit text mining support with the "tm" package, enabling statisticians to answer many interesting research questions via statistical analysis or modeling of (text) corpora. However, we typically face two challenges when analyzing large corpora: (1) the amount of data that can be processed on a single machine is usually limited by the available main memory (i.e., RAM), and (2) an increase in the amount of data to be analyzed leads to an increasing computational workload. Fortunately, adequate parallel programming models like MapReduce, and the corresponding open source implementation Hadoop, allow for processing data sets beyond what would fit into memory. In this paper we present the package "tm.plugin.dc", offering a seamless integration between "tm" and Hadoop. We show, on the basis of an application in culturomics, that we can efficiently handle data sets of significant size. / Series: Research Report Series / Department of Statistics and Mathematics
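
The MapReduce model the package builds on can be sketched in a few lines. The following self-contained Python analogue (not the R or Hadoop code itself) shows the map, shuffle, and reduce phases for term counting over a tiny corpus:

```python
from collections import defaultdict

def map_phase(text):
    # Emit a (term, 1) pair for every token in one document.
    for term in text.lower().split():
        yield term, 1

def reduce_phase(term, partial_counts):
    # Combine all partial counts for one term.
    return term, sum(partial_counts)

def mapreduce_term_counts(corpus):
    groups = defaultdict(list)        # the shuffle: group pairs by key,
    for text in corpus:               # as Hadoop does between the phases
        for term, one in map_phase(text):
            groups[term].append(one)
    return dict(reduce_phase(t, c) for t, c in groups.items())

print(mapreduce_term_counts(["to be or not to be", "to see or not to see"]))
# {'to': 4, 'be': 2, 'or': 2, 'not': 2, 'see': 2}
```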
19

A tm Plug-In for Distributed Text Mining in R

Theußl, Stefan, Feinerer, Ingo, Hornik, Kurt 11 1900 (has links) (PDF)
R has gained explicit text mining support with the tm package, enabling statisticians to answer many interesting research questions via statistical analysis or modeling of (text) corpora. However, we typically face two challenges when analyzing large corpora: (1) the amount of data that can be processed on a single machine is usually limited by the available main memory (i.e., RAM), and (2) the more data there is to analyze, the greater the need for efficient procedures to compute results. Fortunately, adequate programming models like MapReduce facilitate the parallelization of text mining tasks and allow for processing data sets beyond what would fit into memory, by using a distributed file system possibly spanning several machines, e.g., in a cluster of workstations. In this paper we present a plug-in package to tm called tm.plugin.dc, implementing a distributed corpus class that can take advantage of the Hadoop MapReduce library for large-scale text mining tasks. We show, on the basis of an application in culturomics, that we can efficiently handle data sets of significant size. (authors' abstract)
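
Conceptually, a distributed corpus class splits the documents into chunks and maps transformations over whole chunks in parallel. This Python sketch mimics that design under stated assumptions (the real package is R code running over Hadoop, and every name here is illustrative):

```python
from concurrent.futures import ProcessPoolExecutor

def lowercase_chunk(chunk):
    # A per-chunk transformation; in Hadoop it would run on the worker
    # that stores the chunk rather than moving the data.
    return [doc.lower() for doc in chunk]

class DistributedCorpus:
    # Documents held as chunks; transformations map over whole chunks.
    def __init__(self, docs, chunk_size=2):
        self.chunks = [docs[i:i + chunk_size]
                       for i in range(0, len(docs), chunk_size)]

    def tm_map(self, func, workers=2):
        with ProcessPoolExecutor(max_workers=workers) as ex:
            self.chunks = list(ex.map(func, self.chunks))
        return self

    def docs(self):
        return [doc for chunk in self.chunks for doc in chunk]

if __name__ == "__main__":
    corpus = DistributedCorpus(["One TEXT", "Another Text", "More text"])
    print(corpus.tm_map(lowercase_chunk).docs())
    # ['one text', 'another text', 'more text']
```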
20

Pivot-based Data Partitioning for Distributed k Nearest Neighbor Mining

Kuhlman, Caitlin Anne 20 January 2017 (has links)
This thesis addresses the need for a scalable distributed solution for k-nearest-neighbor (kNN) search, a fundamental data mining task. This unsupervised method poses particular challenges on shared-nothing distributed architectures, where global information about the dataset is not available to individual machines. The distance to search for neighbors is not known a priori, and therefore a dynamic data partitioning strategy is required to guarantee that exact kNN can be found autonomously on each machine. Pivot-based partitioning has been shown to facilitate bounding of partitions; however, state-of-the-art methods suffer from prohibitive data duplication (upwards of 20x the size of the dataset). In this work an innovative method for solving exact distributed kNN search called PkNN is presented. The key idea is to perform computation over several rounds, leveraging pivot-based data partitioning at each stage. Aggressive data-driven bounds limit communication costs, and a number of optimizations are designed for efficient computation. Experimental study on large real-world data (over 1 billion points) compares PkNN to the state-of-the-art distributed solution, demonstrating that the benefits of additional stages of computation in the PkNN method heavily outweigh the added I/O overhead. PkNN achieves a data duplication rate close to 1, significant speedup over previous solutions, and scales effectively in data cardinality and dimension. PkNN can facilitate distributed solutions to other unsupervised learning methods which rely on kNN search as a critical building block. As one example, a distributed framework for the Local Outlier Factor (LOF) algorithm is given.
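
The core partitioning step can be sketched as nearest-pivot assignment. The Python fragment below (a simplification, not the PkNN implementation) shows how pivots induce the partitions over which each machine could then compute kNN locally:

```python
import random

def pivot_partition(points, pivots):
    # Assign each point to its nearest pivot -- the core partitioning step.
    # Each partition can then solve kNN for its own points, replicating only
    # the boundary points whose pivot distances say they might be neighbors.
    parts = {i: [] for i in range(len(pivots))}
    for p in points:
        nearest = min(range(len(pivots)),
                      key=lambda i: sum((a - b) ** 2
                                        for a, b in zip(p, pivots[i])))
        parts[nearest].append(p)
    return parts

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(1000)]
parts = pivot_partition(pts, random.sample(pts, 4))
print([len(v) for v in parts.values()])   # partition sizes sum to 1000
```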
