451.
Bringing interpretability and visualization with artificial neural networks
Gritsenko, Andrey, 01 August 2017
Extreme Learning Machine (ELM) is a training algorithm for Single-Layer Feed-forward Neural Networks (SLFN). In theory, ELM differs from other training algorithms in that it has an explicitly given solution, which exists because the randomly initialized input weights are never updated. In practice, ELMs achieve performance similar to that of other state-of-the-art training techniques while taking much less time to train a model. Experiments show that training an ELM can be up to five orders of magnitude faster than the standard Error Back-propagation algorithm.
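To make the closed-form training step concrete, here is a minimal sketch of ELM training for regression in Python/NumPy. It assumes a single hidden layer with a tanh activation and solves for the output weights with the Moore-Penrose pseudoinverse; the function names, activation choice, and hidden-layer size are illustrative assumptions, not details taken from the thesis.

```python
# Minimal ELM sketch: random, fixed input weights; closed-form output weights.
import numpy as np

def elm_train(X, y, n_hidden=50, rng=None):
    """Fit output weights in closed form via the Moore-Penrose pseudoinverse."""
    rng = np.random.default_rng(rng)
    n_features = X.shape[1]
    # Input weights and biases are drawn once and never updated.
    W = rng.standard_normal((n_features, n_hidden))
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W + b)          # hidden-layer activations
    beta = np.linalg.pinv(H) @ y    # explicit solution, no iterative training
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Usage: fit a noisy sine curve.
X = np.linspace(0, 2 * np.pi, 200).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * np.random.default_rng(0).standard_normal(200)
W, b, beta = elm_train(X, y, n_hidden=50, rng=0)
y_hat = elm_predict(X, W, b, beta)
```

Because the hidden-layer weights stay fixed, fitting reduces to a single least-squares solve, which is the source of the speedup over iterative back-propagation noted above.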
ELM is a relatively recent technique that has proved its efficiency in classic regression and classification tasks, including multi-class cases. In this thesis, extensions of ELMs to problems not typically addressed with Artificial Neural Networks (ANNs) are presented. The first extension, described in the third chapter, allows ELMs to produce probabilistic outputs for multi-class classification problems. The standard way of solving this type of problem is based on a 'majority vote' over the classifier's raw outputs. This approach can raise issues when the penalty for misclassification differs between classes; in such cases, probabilistic outputs are more useful. Within the scope of this extension, two methods are proposed. Additionally, an alternative way of interpreting probabilistic outputs is proposed.
The ELM method also proves useful for non-linear dimensionality reduction and visualization, based on repetitive re-training and re-evaluation of the model. The fourth chapter introduces adaptations of ELM-based visualization for classification and regression tasks. A set of experiments has been conducted to show that these adaptations provide better visualization results, which can then be used to perform classification or regression on previously unseen samples.
Shape registration of 3D models with non-isometric distortion is an open problem in 3D Computer Graphics and Computational Geometry. The fifth chapter discusses a novel approach to this problem that introduces a similarity metric for spectral descriptors. In practice, this approach has been implemented in two methods. The first utilizes a Siamese Neural Network to embed the original spectral descriptors into a lower-dimensional metric space in which the Euclidean distance provides a good measure of similarity. The second method uses Extreme Learning Machines to learn a similarity metric directly on the original spectral descriptors. Over a set of experiments, the consistency of the proposed approach for solving the deformable registration problem has been demonstrated.
452.
Shared and distributed memory parallel algorithms to solve big data problems in biological, social network and spatial domain applications
Sharma, Rahil, 01 December 2016
Big data refers to information that cannot be processed and analyzed using traditional approaches and tools due to the four V's: the sheer Volume, the Velocity at which data is received and processed, and the data's Variety and Veracity. Today, massive volumes of data originate in domains such as geospatial analysis and biological and social networks. Hence, scalable algorithms for efficient processing of this massive data are a significant challenge in the field of computer science. One way to achieve such efficient and scalable algorithms is to use shared- and distributed-memory parallel programming models. In this thesis, we present a variety of such algorithms to solve problems in the domains mentioned above. We solve five problems that fall into two categories.
The first group of problems deals with the issue of community detection. Detecting communities in real-world networks is of great importance because communities consist of patterns that can be viewed as independent components, each of which has distinct features and can be detected based upon network structure. For example, communities in social networks can help target users for marketing purposes, provide user recommendations for connecting with and joining communities or forums, etc. We develop a novel sequential algorithm to accurately detect community structures in biological protein-protein interaction networks, where a community corresponds to a functional module of proteins. Generally, such sequential algorithms are computationally expensive, which makes them impractical for large real-world networks. To address this limitation, we develop a new, highly scalable Symmetric Multiprocessing (SMP) based parallel algorithm to detect high-quality communities in large subsections of social networks like Facebook and Amazon. Due to the SMP architecture, however, our algorithm cannot process networks whose size exceeds the RAM of a single machine. With the increasing size of social networks, community detection has become even more difficult, since network size can reach hundreds of millions of vertices and edges. Processing such massive networks requires several hundred gigabytes of RAM, which is only possible with a distributed infrastructure. To address this, we develop a novel hybrid (shared + distributed memory) parallel algorithm to efficiently detect high-quality communities in massive Twitter and .uk domain networks.
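The thesis's sequential, SMP, and hybrid algorithms are not reproduced here; as a rough illustration of the community-detection task itself, the following sketch runs plain label propagation, a standard baseline method unrelated to the thesis, on a small graph. All names and parameters are illustrative assumptions.

```python
# Label-propagation sketch: each node repeatedly adopts the most common label
# among its neighbors; stable label groups are reported as communities.
import random
from collections import Counter

def label_propagation(adj, max_iters=20, seed=0):
    """adj: dict mapping node -> list of neighbor nodes."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}            # every node starts in its own community
    nodes = list(adj)
    for _ in range(max_iters):
        rng.shuffle(nodes)
        changed = False
        for v in nodes:
            if not adj[v]:
                continue
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.values())
            # Adopt the most frequent neighbor label (ties broken randomly).
            choice = rng.choice([l for l, c in counts.items() if c == best])
            if choice != labels[v]:
                labels[v], changed = choice, True
        if not changed:
            break
    return labels

# Usage: two triangles joined by a single edge form two communities.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(label_propagation(adj))
```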
The second group of problems deals with the issue of efficiently processing spatial Light Detection and Ranging (LiDAR) data. LiDAR data is widely used in forest and agricultural crop studies, landscape classification, 3D urban modeling, etc. Technological advancements in LiDAR sensors have enabled highly accurate and dense LiDAR point clouds, resulting in massive data volumes that pose computational issues for processing and storage. We develop the first published landscape-driven data reduction algorithm, which uses the slope map of the terrain as a filter to reduce the data without sacrificing its accuracy. Our algorithm is highly scalable and adopts a shared-memory parallel architecture. We also develop a parallel interpolation technique that is used to generate highly accurate continuous terrains, i.e., Digital Elevation Models (DEMs), from discrete LiDAR point clouds.
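As a hedged sketch of the slope-driven reduction idea, the following code grids LiDAR returns into a coarse DEM, derives a slope map from it, and keeps more points in steep cells than in flat ones. The cell size, slope threshold, and sampling fractions are illustrative assumptions rather than parameters from the thesis, and the sketch is sequential rather than parallel.

```python
# Slope-map-driven thinning of a LiDAR point cloud (illustrative sketch).
import numpy as np

def slope_filter(points, cell=10.0, slope_thresh=15.0,
                 flat_keep=0.2, steep_keep=1.0, rng=None):
    """points: (N, 3) array of x, y, z LiDAR returns."""
    rng = np.random.default_rng(rng)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    xi = ((x - x.min()) / cell).astype(int)
    yi = ((y - y.min()) / cell).astype(int)
    shape = (xi.max() + 1, yi.max() + 1)
    # Build a coarse DEM by averaging elevations per cell.
    counts = np.zeros(shape)
    sums = np.zeros(shape)
    np.add.at(counts, (xi, yi), 1.0)
    np.add.at(sums, (xi, yi), z)
    dem = sums / np.maximum(counts, 1.0)
    # Slope in degrees from finite differences of the DEM.
    gx, gy = np.gradient(dem, cell)
    slope = np.degrees(np.arctan(np.hypot(gx, gy)))
    # Keep every point in steep cells, a random subsample in flat cells.
    keep_prob = np.where(slope[xi, yi] >= slope_thresh, steep_keep, flat_keep)
    return points[rng.random(len(points)) < keep_prob]

# Usage: thin 100,000 synthetic returns over gently rolling terrain.
rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 1000.0, size=(100_000, 3))
pts[:, 2] = 5.0 * np.sin(pts[:, 0] / 100.0)   # synthetic elevations
print(len(pts), "->", len(slope_filter(pts, cell=10.0)))
```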
453.
From Crisis to Crisis: A Big Data, Antenarrative Analysis of How Social Media Users Make Meaning During and After Crisis Events
Bair, Adam R., 01 May 2016
This dissertation examines how individuals use social media to respond to crisis situations, both during and after the event. Using both rhetorical criticism and David Boje’s theories and concepts regarding the development of antenarrative—a process of making sense of past, present, and future events—I explored how social media users make sense of and respond to a crisis. Specifically, my research was guided by three major questions: Are traditional, pre-social media image-repair strategies effective in social media environments? How do participants use social media in crisis events, and how does this usage shape the rhetorical framing of a crisis? How might organizations effectively adapt traditional crisis communication plans to be used in social media during future crisis events?
These questions were applied to four case studies to provide a range of insights about not only how individuals respond to a crisis, but also what strategies organizations use to present information about it. These cases were carefully selected to include a variety of crisis types and responses, and include the following:
- A business (H&R Block) communicating to clients about a software error
- A governmental organization (the NTSB) presenting information about the cause of an airplane crash and about missteps in its response
- A governmental group (the CDC) responding to a global health crisis with various audiences and types of responses
- An activist movement (Black Lives Matter) attempting to unify social media users to lobby for change and highlight the scope of the issues to the nation
Analyses of these cases show not only how individuals and groups used social media to make sense of crisis events, but also how rhetorical strategies were used to respond to a crisis situation. Understanding how individuals and groups make sense of crises will provide additional understanding to information designers, public relations professionals, organizations and businesses, and individuals using social media to effect change.
454.
DCMS: A Data Analytics and Management System for Molecular Simulation
Berrada, Meryem, 16 March 2015
Although Molecular Simulation (MS) systems represent a major research tool in multiple scientific and engineering fields, there is still a lack of systems for effective data management and fast data retrieval and processing. This is mainly due to the nature of MS, which generates very large amounts of data: a system usually encompasses millions of data items, and one query usually runs over tens of thousands of time frames. For this purpose, we designed and developed a new application, DCMS (a Data Analytics and Management System for Molecular Simulation), that intends to speed up the process of new discovery in the medical and physics fields.
DCMS stores simulation data in a database and provides users with a user-friendly interface to upload, retrieve, query, and analyze MS data without having to deal with any raw data. In addition, we also created a new indexing scheme, the Time-Parameterized Spatial (TPS) tree, to accelerate query processing through indexes that take advantage of the locality relationships between atoms. The tree was implemented directly inside the PostgreSQL kernel, on top of the SP-GiST platform. Along with this new tree, two new data types were defined, as well as new algorithms for five data-point retrieval queries.
455.
An Analysis of (Bad) Behavior in Online Video Games
Blackburn, Jeremy, 04 September 2014
This dissertation studies bad behavior at large scale using data traces from online video games. Video games provide a natural laboratory for exploring bad behavior due to their popularity, explicitly defined (programmed) rules, and a competitive nature that provides motivation for bad behavior. More specifically, we look at two forms of bad behavior: cheating and toxic behavior.
Cheating is most simply defined as breaking the rules of the game to give one player an edge over another. In video games, cheating is most often accomplished using programs, or "hacks," that circumvent the rules implemented by game code. Cheating is a threat to the gaming industry in that it diminishes the enjoyment of fair players, siphons off money that is paid to cheat creators, and requires investment in anti-cheat technologies.
Toxic behavior is a more nebulously defined term, but can be thought of as actions that violate social norms, especially those that harm other members of the society. Toxic behavior ranges from insults or harassment of players (which has clear parallels to the real world) to domain-specific instances such as repeatedly "suiciding" to help an enemy team. While toxic behavior has clear parallels to bad behavior in other online domains, e.g., cyberbullying, if left unchecked it has the potential to "kill" a game by driving away its players.
We first present a distributed architecture and reference implementation for the collection and analysis of large-scale social data. Using this implementation, we then study the social structure of over 10 million gamers collected from a planetary-scale Online Social Network, about 720 thousand of whom have been labeled cheaters, finding a significant correlation between social structure and the probability of partaking in cheating behavior. We additionally collect over half a billion daily observations of the cheating status of these gamers. Using about 10 months of detailed server logs from a community-owned and -operated game server, we next analyze how relationships in the aforementioned online social network are backed by in-game interactions. Next, we use the insights gained to find evidence for a contagion process underlying the spread of cheating behavior and perform a data-driven simulation using mathematical models for contagion. Finally, we build a model using millions of crowdsourced decisions for predicting toxic behavior in online games.
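To illustrate what a data-driven contagion simulation over a social graph can look like, here is a minimal sketch using a basic SI (susceptible-infected) process. The infection probability, the random graph, and the seed choice are illustrative assumptions, not the fitted models from the dissertation.

```python
# SI contagion sketch: initially "cheating" seed nodes spread the behavior to
# susceptible neighbors with a fixed per-step probability.
import random

def simulate_si(adj, seeds, p_infect=0.05, steps=20, seed=0):
    """adj: dict node -> list of neighbors; seeds: initially infected nodes."""
    rng = random.Random(seed)
    infected = set(seeds)
    history = [len(infected)]
    for _ in range(steps):
        newly = set()
        for u in infected:
            for v in adj[u]:
                if v not in infected and rng.random() < p_infect:
                    newly.add(v)
        infected |= newly
        history.append(len(infected))          # infected count per step
    return history

# Usage: a small random graph with a single initial cheater.
rng = random.Random(1)
nodes = range(200)
adj = {v: [] for v in nodes}
for v in nodes:
    for u in rng.sample(list(nodes), 5):
        if u != v:
            adj[v].append(u)
            adj[u].append(v)
print(simulate_si(adj, seeds=[0]))
```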
To the best of our knowledge, this dissertation presents the largest study of bad behavior to date. Our findings confirm theories about cheating and unethical behavior that have previously remained untested outside of controlled laboratory experiments or small, survey-based studies. We find that the intensity of interactions between players is a predictor of a future relationship forming. We provide statistically significant evidence for cheating as a contagion. Finally, from our model for detecting toxic behavior we extract insights into how human reviewers perceive the presence and severity of bad behavior.
456.
Ontology Driven Model for an Engineered Agile Healthcare System
Ramadoss, Balaji, 14 February 2014
Healthcare is in urgent need of an effective way to manage the complexity of its systems and to prepare quickly for immense changes in the economics of healthcare delivery and reimbursement. The Centers for Medicare & Medicaid Services (CMS) releases policies for inpatient and long-term care hospitals that directly affect reimbursement and payment rates. One of these policy changes, a quality-reporting program called Hospital Inpatient Quality Reporting (IQR), will affect approximately 3,400 acute-care and 440 long-term care hospitals. IQR sets guidelines and measures that contain financial incentives and penalties based on the quality of care provided. CMS, the largest healthcare payer, is aggressively promoting high quality of care by linking payment incentives to outcomes. With CMS assessing each hospital's performance by comparing its Quality Achievement and Quality Improvement scores, there is a growing need to understand these quality measures in the context of patient care, data management, and system integration. This focus on patient-centered quality care is difficult for healthcare systems due to the lack of a systemic view of the patient and patient care.

This research uniquely addresses the hospital's need to meet these challenges by presenting a healthcare-specific framework and methodology for translating data on quality metrics into actionable processes and feedback to produce the desired quality outcome. The solution is based on a patient-care-level process ontology, rather than the technology itself, and creates a bridge that applies systems engineering principles to permit observation and control of the system. This is a transformative framework conceived to meet the needs of the rapidly changing healthcare landscape. Without this framework, healthcare is dealing with outcomes that are six to seven months old, meaning patients may not have been cared for effectively.

In this research, a framework and methodology called the Healthcare Ontology Based Systems Engineering Model (HOB-SEM) is developed to allow for observability and controllability of compartmental healthcare systems. HOB-SEM applies systems and controls engineering principles to healthcare using ontology as the method and the data lifecycle as the framework. The ontology view of patient-level system interaction and the framework to deliver data management and quality lifecycles enable the development of an agile, systemic healthcare view for observability and controllability.
457.
Performance Optimization Techniques and Tools for Data-Intensive Computation Platforms: An Overview of Performance Limitations in Big Data Systems and Proposed Optimizations
Kalavri, Vasiliki, January 2014
Big data processing has recently gained a lot of attention from both academia and industry. The term refers to tools, methods, techniques and frameworks built to collect, store, process and analyze massive amounts of data. Big data can be structured, unstructured or semi-structured. Data is generated from various different sources and can arrive in the system at various rates. In order to process these large amounts of heterogeneous data in an inexpensive and efficient way, massive parallelism is often used. The common architecture of a big data processing system consists of a shared-nothing cluster of commodity machines. However, even in such a highly parallel setting, processing is often very time-consuming. Applications may take hours or even days to produce useful results, making interactive analysis and debugging cumbersome.

One of the main problems is that good performance requires both good data locality and good resource utilization. A characteristic of big data analytics is that the amount of data that is processed is typically large in comparison with the amount of computation done on it. In this case, processing can benefit from data locality, which can be achieved by moving the computation close to the data, rather than vice versa. Good utilization of resources means that the data processing is done with maximal parallelization. Both locality and resource utilization are aspects of the programming framework's runtime system. Requiring the programmer to work explicitly with parallel process creation and process placement is not desirable. Thus, providing good optimizations that relieve the programmer from low-level, error-prone instrumentation while achieving good performance is essential.

The main goal of this thesis is to study, design and implement performance optimizations for big data frameworks. This work contributes methods and techniques to build tools for easy and efficient processing of very large data sets. It describes ways to make systems faster, by inventing ways to shorten job completion times. Another major goal is to facilitate application development in distributed data-intensive computation platforms and make big data analytics accessible to non-experts, so that users with limited programming experience can benefit from analyzing enormous datasets.

The thesis provides results from a study of existing optimizations in MapReduce and Hadoop-related systems. The study presents a comparison and classification of existing systems based on their main contribution. It then summarizes the current state of the research field and identifies trends and open issues, while also providing our vision on future directions. Next, this thesis presents a set of performance optimization techniques and corresponding tools for data-intensive computing platforms:
- PonIC, a project that ports the high-level dataflow framework Pig on top of the data-parallel computing framework Stratosphere. The results of this work show that Pig can benefit greatly from using Stratosphere as the backend system and gain performance without any loss of expressiveness. The work also identifies the features of Pig that negatively impact execution time and presents a way of integrating Pig with different backends.
- HOP-S, a system that uses in-memory random sampling to return approximate, yet accurate, query answers. It uses a simple yet efficient implementation of a random sampling technique, which significantly improves the accuracy of online aggregation.
- An optimization that exploits computation redundancy in analysis programs, together with m2r2, a system that stores intermediate results and uses plan matching and rewriting in order to reuse results in future queries. Our prototype on top of the Pig framework demonstrates significantly reduced query response times.
- Finally, an optimization framework for iterative fixed points, which exploits asymmetry in large-scale graph analysis. The framework uses a mathematical model to explain several optimizations and to formally specify the conditions under which optimized iterative algorithms are equivalent to the general solution.
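As a rough illustration of the random-sampling idea behind online aggregation in HOP-S, the sketch below estimates a large sum from a small uniform sample and reports a normal-approximation confidence interval. This is textbook sampling theory, not HOP-S's actual implementation, and all names and parameters are illustrative assumptions.

```python
# Approximate aggregation from a uniform random sample with an error bound.
import math
import random

def approximate_sum(values, sample_fraction=0.01, seed=0):
    """Estimate sum(values) with a 95% confidence half-width from a sample."""
    rng = random.Random(seed)
    n = len(values)
    k = max(2, int(n * sample_fraction))
    sample = rng.sample(values, k)
    mean = sum(sample) / k
    var = sum((x - mean) ** 2 for x in sample) / (k - 1)
    est_sum = n * mean
    half_width = 1.96 * n * math.sqrt(var / k)   # normal-approximation interval
    return est_sum, half_width

# Usage: estimate the sum of one million values from a 1% sample.
values = [i % 100 for i in range(1_000_000)]
estimate, pm = approximate_sum(values, sample_fraction=0.01)
print(f"sum ~ {estimate:.0f} +/- {pm:.0f} (exact: {sum(values)})")
```

Sampling more data shrinks the interval, which is the trade-off online aggregation exposes to the user: early, approximate answers that become exact as processing completes.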
458.
Efficient and Private Processing of Analytical Queries in Scientific Datasets
Kumar, Anand, 01 January 2013
Large amounts of data are generated by applications used in basic-science research and development. The size of the data introduces great challenges in storage, analysis, and privacy preservation. This dissertation proposes novel techniques to efficiently analyze the data and to reduce storage space requirements through a data compression technique, while preserving privacy and providing data security.
We present an efficient technique to compute an analytical query called the spatial distance histogram (SDH), which exploits special spatiotemporal properties present in the data to process SDH efficiently on the fly. General-purpose graphics processing units (GPGPUs, or just GPUs) are employed to further boost the performance of the algorithm.
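For concreteness, the following brute-force sketch shows what an SDH query computes: a histogram of all pairwise atom distances for one time frame. The thesis computes this far more efficiently by exploiting spatiotemporal properties and GPUs; this sketch only illustrates the query semantics, and the bucket parameters are illustrative assumptions.

```python
# Brute-force spatial distance histogram for a single simulation frame.
import numpy as np

def sdh(positions, bucket_width, n_buckets):
    """positions: (N, 3) atom coordinates for a single time frame."""
    diffs = positions[:, None, :] - positions[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    iu = np.triu_indices(len(positions), k=1)      # each unordered pair once
    hist, _ = np.histogram(dists[iu], bins=n_buckets,
                           range=(0.0, bucket_width * n_buckets))
    return hist

# Usage: 1,000 atoms in a 10x10x10 box, 25 buckets of width 0.7.
rng = np.random.default_rng(0)
atoms = rng.uniform(0.0, 10.0, size=(1000, 3))
print(sdh(atoms, bucket_width=0.7, n_buckets=25))
```

The quadratic cost of this naive version is exactly what makes on-the-fly SDH computation over many frames challenging and motivates the optimized algorithm described above.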
The size of the data generated in scientific applications poses problems of disk space requirements, input/output (I/O) delays, and data transfer bandwidth requirements. These problems are addressed by applying the proposed compression technique. We also address the issue of preserving privacy and security in scientific data by proposing a security model. The security model monitors user queries input to the database that stores and manages the scientific data. Outputs of user queries are also inspected to detect privacy breaches. Privacy policies are enforced by the monitor to allow only those queries and results that satisfy the data owner's specified policies.
459.
Modeling Large Social Networks in Context
Ho, Qirong, 01 July 2014
Today’s social and internet networks contain millions or even billions of nodes, and copious amounts of side information (context) such as text, attribute, temporal, image and video data. A thorough analysis of a social network should consider both the graph and the associated side information, yet we also expect the algorithm to execute in a reasonable amount of time on even the largest networks. Towards the goal of rich analysis on societal-scale networks, this thesis provides (1) modeling and algorithmic techniques for incorporating network context into existing network analysis algorithms based on statistical models, and (2) strategies for network data representation, model design, algorithm design and distributed multi-machine programming that, together, ensure scalability to very large networks. The methods presented herein combine the flexibility of statistical models with key ideas and empirical observations from the data mining and social networks communities, and are supported by software libraries for cluster computing based on original distributed systems research. These efforts culminate in a novel mixed-membership triangle motif model that easily scales to large networks with over 100 million nodes on just a few cluster machines, and can be readily extended to accommodate network context using the other techniques presented in this thesis.
460.
以MapReduce做有效率的天際線查詢 / Efficient Skyline Computation with MapReduce
陳家慶 (Chen, Chia Ching), Unknown Date
With the issue of big data being taken seriously today, more and more big data analysis is performed with MapReduce. In database querying, the skyline query is a common method for decision analysis; its purpose is to help users find the records in a database whose values in each dimension are close to the user's query conditions. In previous approaches to querying large data sets, however, processing becomes inefficient when the number of records is large or the data space involves many dimensions. Therefore, this study presents an efficient method for processing skyline queries over large data sets with MapReduce. Experimental results show that our method is more efficient than previous methods.
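As a hedged illustration of how skyline computation maps onto MapReduce, the sketch below computes a local skyline per partition (the map phase) and merges the local skylines into the global skyline (the reduce phase). This shows the generic partition-and-merge pattern, not the specific method proposed in the study; the example points and the assumption that smaller values are better are illustrative.

```python
# MapReduce-style skyline: local skylines per partition, then a global merge.
from functools import reduce

def dominates(p, q):
    """p dominates q if p is no worse in every dimension and strictly better
    in at least one (assuming smaller values are better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def local_skyline(points):
    skyline = []
    for p in points:
        if any(dominates(q, p) for q in skyline):
            continue                                   # p is dominated, drop it
        skyline = [q for q in skyline if not dominates(p, q)] + [p]
    return skyline

def skyline_mapreduce(partitions):
    local = map(local_skyline, partitions)             # map phase
    merged = reduce(lambda a, b: a + b, local, [])     # shuffle to one reducer
    return local_skyline(merged)                       # reduce phase

# Usage: hotel (price, distance) tuples split across three partitions.
partitions = [
    [(50, 8), (60, 2), (90, 1)],
    [(55, 7), (70, 3), (100, 0.5)],
    [(45, 9), (65, 2.5), (80, 1.5)],
]
print(skyline_mapreduce(partitions))
```

The key property the pattern relies on is that any globally dominated point is already dominated within its own partition or by a local skyline point, so merging local skylines loses no answers.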