1 |
Identifying Relationships between Scientific Datasets / Alawini, Abdussalam. 28 June 2016
<p> Scientific datasets associated with a research project can proliferate over time as a result of activities such as sharing datasets among collaborators, extending existing datasets with new measurements, and extracting subsets of data for analysis. As such datasets begin to accumulate, it becomes increasingly difficult for a scientist to keep track of their derivation history, which complicates data sharing, provenance tracking, and scientific reproducibility. Understanding what relationships exist between datasets can help scientists recall their original derivation history. For instance, if dataset <i>A</i> is contained in dataset <i>B,</i> then the connection between <i>A</i> and <i>B</i> could be that <i>A</i> was extended to create <i>B.</i> </p><p> We present a relationship-identification methodology as a solution to this problem. To examine the feasibility of our approach, we articulated a set of relevant relationships, developed algorithms for efficient discovery of these relationships, and organized these algorithms into a new system called ReConnect to assist scientists in relationship discovery. We also evaluated existing alternative approaches that rely on flagging differences between two spreadsheets and found that they were impractical for many relationship-discovery tasks. Additionally, we conducted a user study, which showed that relationships do occur in real-world spreadsheets, and that ReConnect can improve scientists' ability to detect such relationships between datasets. </p><p> The promising results of ReConnect's evaluation encouraged us to explore a more automated approach for relationship discovery. In this dissertation, we introduce an automated end-to-end prototype system, ReDiscover, that identifies, from a collection of datasets, the pairs that are most likely related, and the relationship between them. 
Our experimental results demonstrate the overall effectiveness of ReDiscover in predicting relationships in a scientist's or a small group of researchers' collections of datasets, and the sensitivity of the overall system to the performance of its various components.</p>
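<p> The containment example in the abstract (dataset <i>A</i> contained in dataset <i>B</i>) can be illustrated with a minimal sketch. This is not ReConnect's or ReDiscover's algorithm, which the abstract does not describe; it assumes each dataset is a list of rows with identical column order, and the function names and relationship labels are hypothetical. </p>

```python
from typing import List, Sequence


def row_containment(a: Sequence[Sequence[str]], b: Sequence[Sequence[str]]) -> float:
    """Return the fraction of distinct rows of `a` that also appear in `b`."""
    rows_a = {tuple(r) for r in a}
    rows_b = {tuple(r) for r in b}
    if not rows_a:
        return 1.0  # an empty dataset is trivially contained in any other
    return len(rows_a & rows_b) / len(rows_a)


def guess_relationship(a: Sequence[Sequence[str]], b: Sequence[Sequence[str]]) -> str:
    """Label the row-level relationship between two datasets (hypothetical labels)."""
    a_in_b = row_containment(a, b)
    b_in_a = row_containment(b, a)
    if a_in_b == 1.0 and b_in_a == 1.0:
        return "duplicate"
    if a_in_b == 1.0:
        return "A contained in B"  # B may have been created by extending A
    if b_in_a == 1.0:
        return "B contained in A"  # B may be a subset extracted from A
    if a_in_b > 0.0:
        return "overlap"
    return "unrelated"


# Illustrative data only: B extends A with one new measurement.
A: List[tuple] = [("s1", "1.0"), ("s2", "2.0")]
B: List[tuple] = A + [("s3", "3.0")]
relationship = guess_relationship(A, B)
```

<p> Exact row matching is the simplest possible choice; real spreadsheets would likely also need column alignment and tolerance for reformatted values. </p>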
|
2 |
Towards a systematic study of big data performance and benchmarking / Ekanayake, Saliya. 06 December 2016
<p> Big data queries are increasing in complexity and the performance of data analytics is of growing importance. To this end, Big Data on high-performance computing (HPC) infrastructure is becoming a pathway to high-performance data analytics. The state of performance studies on this convergence between Big Data and HPC, however, is limited and ad hoc. A systematic performance study is thus timely and forms the core of this research. </p><p> This thesis investigates the challenges involved in developing Big Data applications with significant computations and strict latency guarantees on multicore HPC clusters. Three key areas it considers are thread models, affinity, and communication mechanisms. Thread models discuss the challenges of exploiting intra-node parallelism on modern multicore chips, while affinity looks at data locality and Non-Uniform Memory Access (NUMA) effects. Communication mechanisms investigate the difficulties of Big Data communications. For example, parallel machine learning depends on collective communications, unlike classic scientific simulations, which mostly use neighbor communications. Minimizing this cost while scaling out to higher parallelisms requires non-trivial optimizations, especially when using high-level languages such as Java or Scala. The investigation also includes a discussion on performance implications of different programming models such as dataflow and message passing used in Big Data analytics. The optimizations identified in this research are incorporated in developing the Scalable Parallel Interoperable Data Analytics Library (SPIDAL) in Java, which includes a collection of multidimensional scaling and clustering algorithms optimized to run on HPC clusters. </p><p> Besides presenting performance optimizations, this thesis explores a novel scheme for characterizing Big Data benchmarks. Fundamentally, a benchmark evaluates a certain performance-related aspect of a given system. 
For example, HPC benchmarks such as LINPACK and the NAS Parallel Benchmarks (NPB) evaluate floating-point operations per second (flops) through a computational workload. The challenge with Big Data workloads is the diversity of their applications, which makes it impossible to classify them along a single dimension. Convergence Diamonds (CDs) is a multifaceted scheme that identifies four dimensions of Big Data workloads: problem architecture, execution, data source and style, and processing view. </p><p> The performance optimizations together with the richness of CDs provide a systematic guide to developing high-performance Big Data benchmarks, specifically targeting data analytics on large, multicore HPC clusters.</p>
|
3 |
Knowledge creation, sharing and reuse in online technical support for Open Source Software / Singh, Vandana. January 2008
Thesis (Ph.D.)--University of Illinois at Urbana-Champaign, 2008. / Source: Dissertation Abstracts International, Volume: 69-05, Section: A, page: 1581. Adviser: Michael B. Twidale. Includes bibliographical references. Available on microfilm from ProQuest Information and Learning.
|
4 |
Data preservation in intermittently connected sensor networks via data aggregation / Hsu, Tiffany. 08 April 2014
<p> Intermittently connected sensor networks are a subset of wireless sensor networks that have a high data volume and suffer from the problem of infrequent data offloading. When the generated data exceeds the storage capacity of the network between offloading opportunities, there must be some method of data preservation. This requires two phases: compressing the data and redistributing it. The use of data aggregation with consideration given to minimizing total energy is examined as a solution for the compression problem. Simulations of the optimal solution and an approximation heuristic are compared.</p>
|
5 |
Multiagent Business Modeling / Telang, Pankaj Ramesh. 07 December 2013
<p> Cross-organizational business processes are common in today’s economy. Of necessity, enterprises conduct their business in cooperation to create products and services for the marketplace. Thus business processes inherently involve autonomous partners with heterogeneous software designs and implementations. The existing business modeling approaches that employ high-level abstractions are difficult to operationalize, and the approaches that employ low-level abstractions lead to highly rigid processes that lack business semantics. We propose a novel business model based on multiagent abstractions. Unlike existing approaches, our model gives primacy to the contractual relationships among the business partners, thus providing a notion of business-level correctness, and offers flexibility to the participants. Our approach employs reusable patterns as building blocks to model recurring business scenarios. A step-by-step methodology guides a modeler in constructing a business model. Our approach employs temporal logic to formalize the correctness properties of a business model, and model checking to verify if a given operationalization satisfies those properties. Developer studies found that our approach yields improved model quality compared to the traditional approaches from the supply chain and healthcare domains.</p><p> Commitments capture how an agent relates with another agent, whereas goals describe states of the world that an agent is motivated to bring about. It makes intuitive sense that goals and commitments be understood as being complementary to each other. More importantly, an agent’s goals and commitments ought to be coherent, in the sense that an agent’s goals would lead it to adopt or modify relevant commitments and an agent’s commitments would lead it to adopt or modify relevant goals. However, despite the intuitive naturalness of the above connections, they have not yet been studied in a formal framework. 
This dissertation provides a combined operational semantics for goals and commitments. Our semantics yields important desirable properties, including convergence of the configurations of cooperating agents, thereby delineating some theoretically well-founded yet practical modes of cooperation in a multiagent system.</p><p> We formalize the combined operational semantics of achievement commitments and goals in terms of hierarchical task networks (HTNs) and show how HTN planning provides a natural representation and reasoning framework for them. Our approach combines a domain-independent theory capturing the lifecycles of goals and commitments, generic patterns of reasoning, and domain models. We go beyond existing approaches by proposing a first-order representation that accommodates settings where the commitments and goals are templatic and may be applied repeatedly with differing bindings for domain objects. Doing so not only leads to a more perspicuous modeling, it also enables us to support a variety of practical patterns.</p>
|
6 |
A Relational Framework for Clustering and Cluster Validity and the Generalization of the Silhouette Measure / Rawashdeh, Mohammad Y. 28 August 2014
<p> Clustering seeks to partition a given set of points into a number of clusters such that points in the same cluster are similar to one another and dissimilar to points in other clusters. By virtue of this goal, relational data are a natural fit for clustering. The similarity and dissimilarity relations between the data points are the nuts and bolts of cluster formation; thus, the task is driven by the notion of similarity between the data points. In practice, similarity is usually measured by the pairwise distances between the data points. Indeed, the objective functions of two widely used clustering algorithms, namely <i>k</i>-means and fuzzy <i>c</i>-means, are expressed in terms of the pairwise distances between the data points. </p><p> The clustering task is complicated by the need to choose a distance measure and to estimate the number of clusters. Fuzzy <i>c</i>-means is convenient when there is uncertainty in allocating points in overlapping areas to clusters. The <i>k</i>-means algorithm allocates the points unequivocally to clusters, overlooking the similarities between points in overlapping areas. The fuzzy approach allows a point to be a member of as many clusters as necessary; thus it provides better insight into the relations between the points in overlapping areas. </p><p> In this thesis we develop a relational framework inspired by the silhouette measure of clustering quality. The framework asserts the relations between the data points by means of logical reasoning with the cluster membership values. The original description of computing silhouettes is limited to crisp partitions. A natural generalization of silhouettes to fuzzy partitions is given within our framework. Moreover, two notions of silhouette emerge within the framework at different levels of granularity, namely the point-wise silhouette and the center-wise silhouette. 
With this generalization, each silhouette can measure the extent to which a crisp or fuzzy partition has fulfilled the clustering goal at the level of the individual points or of the cluster centers. The partitions are evaluated by the silhouette measure in conjunction with point-to-point or center-to-point distances. </p><p> With the generalization, the average silhouette value becomes a reasonable device for selecting between crisp and fuzzy partitions of the same data set. Accordingly, one can find out which partition better represents the relations between the data points, in accordance with their pairwise distances. This powerful feature of the generalized silhouettes has exposed a problem with the partitions generated by fuzzy <i>c</i>-means. We have observed that defuzzifying the fuzzy <i>c</i>-means partitions always improves the overall representation of the relations between the data points. This is due to the inconsistency between some of the membership values and the distances between the data points. This inconsistency has been reported by others on a couple of occasions in real-life applications. </p><p> Finally, we present an experiment that demonstrates a successful application of the generalized silhouette measure to feature selection for highly imbalanced classification. A significant reduction in the number of features resulted in a significant improvement in classification for a real data set. </p>
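<p> For reference, the original crisp silhouette that the thesis generalizes can be sketched as follows. For a point i, a(i) is the mean distance to the other points in its own cluster, b(i) is the smallest mean distance to the points of any other cluster, and s(i) = (b(i) - a(i)) / max(a(i), b(i)). The sketch covers only this standard crisp definition with Euclidean distance; the fuzzy and center-wise generalizations are not specified in the abstract and are not reproduced here. The function name and the choice of 0.0 for singleton clusters are assumptions. </p>

```python
import math
from typing import Dict, List, Sequence


def crisp_silhouettes(points: Sequence[Sequence[float]], labels: Sequence[int]) -> List[float]:
    """Point-wise silhouette values for a crisp partition (Euclidean distance)."""
    clusters: Dict[int, List[int]] = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    values: List[float] = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own or len(clusters) < 2:
            values.append(0.0)  # assumed convention for singletons / one cluster
            continue
        # a(i): mean distance to the other members of i's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b(i): smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Illustrative data: two well-separated pairs of points.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = crisp_silhouettes(pts, [0, 0, 1, 1])  # partition matches the geometry
bad = crisp_silhouettes(pts, [0, 1, 0, 1])   # partition cuts across the pairs
```

<p> Averaging the point-wise values gives the average silhouette; in this toy example the partition that matches the geometry scores close to 1, and the mismatched one scores below 0. </p>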
|
7 |
Time-Slicing of Movement Data for Efficient Trajectory Clustering / Edens, Jared M. 10 September 2014
<p> Spatio-temporal research frequently involves analyzing large data sets (i.e., data sets larger than will fit in the main memory of a common PC). Currently, many techniques for analyzing large data sets begin by sampling the data so that it can all reside in main memory. Depending upon the research question posed, information can be lost when outliers are discarded. For example, if the focus of the analysis is on clusters of automobiles, the outliers may not be represented in the sampled data set. The purpose of this study is to use similarity measures to detect anomalies. The clustering algorithm used in this thesis research is DBSCAN. Synthetic data is generated and then analyzed to evaluate the effectiveness of detecting anomalies using similarity measures. Results from this study support the hypothesis, "If similarity measures can be developed, then DBSCAN can be used to find anomalies in trajectory data using time slices." Synthetic data is analyzed using DBSCAN to address the research question, "Can DBSCAN be used to find anomalies in trajectory data using time slices?"</p>
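<p> A minimal sketch of the general idea: bucket movement observations into fixed-width time slices, run DBSCAN within each slice, and treat points labeled as noise as candidate anomalies. The thesis's own similarity measures are not given in the abstract, so plain Euclidean distance between positions is assumed here; the observation layout (object id, time, x, y), the function names, and the parameter values are all hypothetical. </p>

```python
import math
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

NOISE = -1


def dbscan(points: Sequence[Tuple[float, float]], eps: float, min_pts: int) -> List[int]:
    """Plain O(n^2) DBSCAN; returns a cluster id per point, NOISE for outliers."""
    labels: List[Optional[int]] = [None] * len(points)

    def neighbors(i: int) -> List[int]:
        # Neighborhood includes the point itself, as in the usual definition.
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point, so expand from it
                queue.extend(nbrs)
        cluster += 1
    return [lab if lab is not None else NOISE for lab in labels]


def anomalies_by_slice(observations, slice_width: float, eps: float, min_pts: int) -> Dict[int, List[str]]:
    """Group (obj_id, t, x, y) observations into time slices; report noise per slice."""
    slices = defaultdict(list)
    for obj_id, t, x, y in observations:
        slices[int(t // slice_width)].append((obj_id, (x, y)))
    result: Dict[int, List[str]] = {}
    for key in sorted(slices):
        ids, pts = zip(*slices[key])
        labels = dbscan(list(pts), eps, min_pts)
        result[key] = [obj for obj, lab in zip(ids, labels) if lab == NOISE]
    return result


# Illustrative synthetic data: car "d" is far from the others only in slice 0.
obs = [
    ("a", 1, 0, 0), ("b", 2, 0, 1), ("c", 3, 1, 0), ("d", 4, 50, 50),
    ("a", 11, 5, 5), ("b", 12, 5, 6), ("c", 13, 6, 5), ("d", 14, 6, 6),
]
found = anomalies_by_slice(obs, slice_width=10, eps=2.0, min_pts=3)
```

<p> Because each slice is clustered independently, only one slice's points need to be in memory at a time, which is the motivation for slicing rather than sampling. </p>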
|
8 |
Enhancing Recommender Systems Using Social Indicators / Gartrell, Charles M. 23 October 2014
<p> Recommender systems are increasingly driving user experiences on the Internet. In recent years, online social networks have quickly become the fastest growing part of the Web. The rapid growth in social networks presents a substantial opportunity for recommender systems to leverage social data to improve recommendation quality, both for recommendations intended for individuals and for groups of users who consume content together. This thesis shows that incorporating social indicators improves the predictive performance of group-based and individual-based recommender systems. We analyze the impact of social indicators through small-scale and large-scale studies, implement and evaluate new recommendation models that incorporate our insights, and demonstrate the feasibility of using these social indicators and other contextual data in a deployed mobile application that provides restaurant recommendations to small groups of users.</p>
|
9 |
ROVER: A DNS-based method to detect and prevent IP hijacks / Gersch, Joseph E. 26 February 2014
<p> The Border Gateway Protocol (BGP) is critical to the global Internet infrastructure. Unfortunately, BGP routing was designed with limited regard for security. As a result, IP route hijacking has been observed for more than 16 years. Well-known incidents include a 2008 hijack of YouTube, loss of connectivity for Australia in February 2012, and an event that partially crippled Google in November 2012. Concern has been escalating because critical national infrastructure relies on a secure foundation for the Internet. Disruptions to military, banking, utilities, industry, and commerce can be catastrophic. </p><p> In this dissertation we propose ROVER (Route Origin VERification System), a novel and practical solution for detecting and preventing origin and sub-prefix hijacks. ROVER exploits the reverse DNS for storing route origin data and provides a fail-safe, best-effort approach to authentication. This approach can be used with a variety of operational models, including fully dynamic in-line BGP filtering, periodically updated authenticated route filters, and real-time notifications for network operators. </p><p> Our thesis is that ROVER systems can be deployed by a small number of institutions in an incremental fashion and still effectively thwart origin and sub-prefix IP hijacking despite non-participation by the majority of Autonomous System owners. We then present research results supporting this statement. We evaluate the effectiveness of ROVER using simulations on an Internet-scale topology as well as tests on real operational systems. Analyses include a study of IP hijack propagation patterns, effectiveness of various deployment models, critical mass requirements, and an examination of ROVER resilience and scalability.</p>
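<p> The origin-verification idea can be sketched as a lookup of an announced prefix against published route-origin data. The sketch below is a simplification under stated assumptions, not ROVER's implementation: an in-memory table (the hypothetical <code>PUBLISHED_ORIGINS</code>) stands in for the data that ROVER stores in the reverse DNS, the record format is not modeled, and the rule of deciding by the most specific covering prefix is an assumption. Prefixes and AS numbers come from the ranges reserved for documentation. </p>

```python
import ipaddress

# Hypothetical stand-in for published route-origin data: prefix -> authorized origin ASNs.
PUBLISHED_ORIGINS = {
    ipaddress.ip_network("192.0.2.0/24"): {64500},
    ipaddress.ip_network("198.51.100.0/24"): {64501, 64502},
}


def check_announcement(prefix: str, origin_asn: int) -> str:
    """Classify a BGP announcement as 'valid', 'invalid', or 'unknown'."""
    announced = ipaddress.ip_network(prefix)
    # Published prefixes that equal or cover the announced prefix.
    covering = [
        net
        for net in PUBLISHED_ORIGINS
        if net.version == announced.version and announced.subnet_of(net)
    ]
    if not covering:
        # No data published for this address space: do not reject the route.
        return "unknown"
    most_specific = max(covering, key=lambda net: net.prefixlen)
    return "valid" if origin_asn in PUBLISHED_ORIGINS[most_specific] else "invalid"


legit = check_announcement("192.0.2.0/24", 64500)             # authorized origin
origin_hijack = check_announcement("192.0.2.0/24", 64511)     # wrong origin AS
subprefix_hijack = check_announcement("192.0.2.128/25", 64511)  # more-specific prefix
unpublished = check_announcement("203.0.113.0/24", 64511)     # no published data
```

<p> Returning "unknown" rather than "invalid" for unpublished address space reflects the incremental-deployment setting the abstract describes, where most Autonomous System owners do not participate. </p>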
|
10 |
The construction, use, and evaluation of a lexical knowledge base for English-Chinese cross-language information retrieval / Chen, Jiangping. Liddy, Elizabeth D. Unknown Date
Thesis (Ph.D.)--Syracuse University, 2003. / "Publication number AAT 3113231."
|