401 |
Evaluation of SMP Shared Memory Machines for Use with In-Memory and OpenMP Big Data Applications. Younge, Andrew J., Reidy, Christopher, Henschel, Robert, Fox, Geoffrey C. 05 1900 (has links)
While distributed memory systems have shaped the field of distributed systems for decades, demand for many-core shared memory resources is increasing. Symmetric Multiprocessor Systems (SMPs) have recently become increasingly important across a wide array of disciplines, ranging from bioinformatics to astrophysics and beyond. With the growth of big data computing, workloads often outpace the size and scope of traditional commodity server systems. While some big data applications can be mapped to the distributed memory systems found in many cluster and cloud technologies today, this effort represents a large barrier to entry that some projects cannot cross. Shared memory SMP systems look to fill this niche within distributed systems effectively and efficiently by providing high throughput and performance with minimal development effort, as the computing environment is often one that many researchers are already familiar with. In this paper, we look at the use of two common shared memory systems: the ScaleMP vSMP virtualized SMP deployment at Indiana University, and the SGI UV architecture deployed at the University of Arizona. While the two systems differ notably in their design, their potential impact on computing is remarkably similar. As such, we first compare each system under a set of OpenMP threaded benchmarks from the SPEC group, and then follow up with our experience using each machine for Trinity de novo assembly. We find both SMP systems are well suited to support various big data applications, with the newer vSMP deployment often slightly faster; however, certain caveats and performance considerations are necessary when considering such SMP systems.
|
402 |
Technology and Big Data Meet the Risk of Terrorism in an Era of Predictive Policing and Blanket Surveillance. Patti, Alexandra C 15 May 2015 (has links)
Surveillance studies suffer from a near-total lack of empirical data, due in part to the highly secretive nature of surveillance programs. However, documents leaked by Edward Snowden in June 2013 provided unprecedented proof of top-secret American data mining initiatives that covertly monitor electronic communications and collect and store previously unfathomable quantities of data. These documents presented an ideal opportunity to test theory against data and better understand contemporary surveillance. This qualitative content analysis compared themes of technology, privacy, national security, and legality in the NSA documents to those found in sets of publicly available government reports, laws, and guidelines, finding inconsistencies in the portrayal of governmental commitments to privacy, transparency, and civil liberties. These inconsistencies are best explained by the risk society theoretical model, which predicts that surveillance is an attempt to prevent risk in globalized and complex contemporary societies.
|
403 |
Bayesian-based Traffic State Estimation in Large-Scale Networks Using Big Data. Gu, Yiming 01 February 2017 (has links)
Traffic state estimation (TSE) aims to estimate the time-varying traffic characteristics (such as flow rate, flow speed, flow density, and occurrence of incidents) of all roads in a traffic network, given observations that are sparse in time and location. TSE is critical to transportation planning, operation, and infrastructure design. In this new era of “big data”, massive volumes of sensing data from a variety of sources (such as cell phones, GPS, probe vehicles, and inductive loops) enable TSE in an efficient, timely, and accurate manner. This research develops a Bayesian theoretical framework, along with statistical inference algorithms, to (1) capture the complex flow patterns in urban traffic networks consisting of both highways and arterials; (2) incorporate heterogeneous data sources into the process of TSE; (3) enable both estimation and prediction of traffic states; and (4) demonstrate scalability to large-scale urban traffic networks. To achieve these goals, a hierarchical Bayesian probabilistic model is proposed to capture spatio-temporal traffic states. The propagation of traffic states is encapsulated through mesoscopic network flow models (namely the Link Queue Model) and equilibrated fundamental diagrams. Traffic states in the hierarchical Bayesian model are inferred using an Expectation-Maximization Extended Kalman Filter (EM-EKF). To better estimate and predict states, infrastructure supply is also estimated as part of the TSE process, by adopting a series of algorithms that translate Twitter data into traffic incident information. Finally, the proposed EM-EKF algorithm is implemented and examined on the road network of Washington, DC. The results show that the proposed methods can handle large-scale traffic state estimation while achieving superior results compared to traditional temporal and spatial smoothing methods.
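To make the estimation step concrete, the following is a minimal, illustrative sketch (not code from the thesis) of a single extended Kalman filter update over a vector of link densities; the transition function here is a placeholder standing in for the Link Queue Model, and the EM step that learns model parameters is omitted:

import numpy as np

def ekf_step(x, P, f, F_jac, H, Q, R, z):
    """One extended Kalman filter step over a vector of link traffic densities."""
    # Predict: propagate the state through the (nonlinear) traffic flow model
    x_pred = f(x)
    F = F_jac(x)
    P_pred = F @ P @ F.T + Q
    # Update: correct the prediction with sparse sensor observations
    innovation = z - H @ x_pred
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ innovation
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Toy 4-link network in which only links 0 and 2 carry sensors
n = 4
x = np.full(n, 30.0)                          # initial density guess (veh/km)
P, Q = np.eye(n) * 10.0, np.eye(n) * 1.0
H = np.array([[1.0, 0, 0, 0], [0, 0, 1.0, 0]])
R = np.eye(2) * 4.0
f = lambda s: 0.9 * s + 2.0                   # placeholder dynamics, not the Link Queue Model
F_jac = lambda s: 0.9 * np.eye(n)
z = np.array([35.0, 28.0])                    # sensor readings for links 0 and 2
x, P = ekf_step(x, P, f, F_jac, H, Q, R, z)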
|
404 |
Distributed Feature Selection in Large n and Large p Regression Problems. Wang, Xiangyu January 2016 (has links)
Fitting statistical models is computationally challenging when the sample size or the dimension of the dataset is huge. An attractive approach for down-scaling the problem size is to first partition the dataset into subsets and then fit using distributed algorithms. The dataset can be partitioned either horizontally (in the sample space) or vertically (in the feature space), and the challenge is to define an algorithm with low communication, theoretical guarantees, and excellent practical performance in general settings. For sample space partitioning, I propose a MEdian Selection Subset AGgregation Estimator (message) algorithm to address these issues. The algorithm applies feature selection in parallel to each subset using regularized regression or Bayesian variable selection methods, calculates the "median" feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves very minimal communication, scales efficiently in sample size, and has theoretical guarantees. I provide extensive experiments showing excellent performance in feature selection, estimation, prediction, and computation time relative to the usual competitors.

While sample space partitioning is useful for handling datasets with large sample size, feature space partitioning is more effective when the data dimension is high. Existing methods for partitioning features, however, are either vulnerable to high correlations or inefficient at reducing the model dimension. In the thesis, I propose a new embarrassingly parallel framework named DECO for distributed variable selection and parameter estimation. In DECO, variables are first partitioned and allocated to m distributed workers. The decorrelated subset data within each worker are then fitted via any algorithm designed for high-dimensional problems. We show that by incorporating the decorrelation step, DECO can achieve consistent variable selection and parameter estimation on each subset with (almost) no assumptions. In addition, the convergence rate is nearly minimax optimal for both sparse and weakly sparse models and does not depend on the partition number m. Extensive numerical experiments are provided to illustrate the performance of the new framework.

For datasets with both large sample sizes and high dimensionality, I propose a new "divide-and-conquer" framework, DEME (DECO-message), by leveraging both the DECO and message algorithms. The new framework first partitions the dataset in the sample space into row cubes using message and then partitions the feature space of the cubes using DECO. This procedure is equivalent to partitioning the original data matrix into multiple small blocks, each with a feasible size that can be stored and fitted in a computer in parallel. The results are then synthesized via the DECO and message algorithms in reverse order to produce the final output. The whole framework is extremely scalable. / Dissertation
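The sample-space step described above can be sketched in a few lines. The following is a minimal single-machine illustration of the message idea (subset-wise selection, median inclusion, refit, average), using the lasso as one possible selector; it is not the thesis implementation, which also supports Bayesian variable selection and runs subsets in parallel:

import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

def message_estimator(X, y, n_subsets=5, seed=0):
    """Median-selection subset aggregation sketch for large-n regression."""
    rng = np.random.default_rng(seed)
    subsets = np.array_split(rng.permutation(len(y)), n_subsets)

    # Step 1: feature selection on each row subset (lasso as one possible selector)
    inclusion = np.zeros((n_subsets, X.shape[1]))
    for k, s in enumerate(subsets):
        coef = LassoCV(cv=5).fit(X[s], y[s]).coef_
        inclusion[k] = coef != 0

    # Step 2: median feature inclusion index (a majority vote across subsets)
    selected = np.median(inclusion, axis=0) >= 0.5
    if not selected.any():
        return np.zeros(X.shape[1])

    # Step 3: refit on the selected features per subset, then average coefficients
    beta = np.zeros(X.shape[1])
    coefs = [LinearRegression().fit(X[s][:, selected], y[s]).coef_ for s in subsets]
    beta[selected] = np.mean(coefs, axis=0)
    return beta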
|
405 |
Maritime Transportation Optimization Using Evolutionary Algorithms in the Era of Big Data and Internet of Things. Cheraghchi, Fatemeh 19 July 2019 (has links)
With the maritime industry carrying nearly 90% of the volume of global trade, algorithms and solutions that provide quality of service in maritime transportation are of great importance to both academia and industry. This research investigates an optimization problem using evolutionary algorithms and big data analytics to address an important challenge in maritime disruption management, and illustrates how it can be integrated with information technologies and the Internet of Things. Accordingly, in this thesis, we design, develop, and evaluate methods to improve decision support systems (DSSs) in maritime supply chain management.
We pursue three research goals in this thesis. First, the Vessel Schedule Recovery Problem (VSRP) is reformulated and a bi-objective optimization approach is proposed. We employ multi-objective evolutionary algorithms (MOEAs) to solve the resulting optimization problems. An optimal Pareto front provides a valuable trade-off between the two objectives (minimizing delay and minimizing financial loss) for a stakeholder in the freight shipping company. We evaluate the problem along three dimensions, namely scalability, vessel steaming policies, and voyage distance, and statistically validate the significance of the algorithms' performance differences. According to the experiments, problem complexity varies across scenarios, while NSGA-II performs better than the other MOEAs in all scenarios.
In the second work, a new data-driven VSRP is proposed which benefits from available Automatic Identification System (AIS) data. In the new formulation, the trajectory between port calls is divided and encoded into adjacent geohashed regions. In each geohash, historical speed profiles are extracted from the AIS data. This results in a large-scale optimization problem called G-S-VSRP with three objectives (minimizing loss, minimizing delay, and maximizing compliance), where the compliance objective maximizes the agreement of the optimized speeds with the historical data. Assuming the historical speed profiles are a reliable proxy for feasible operational speeds, based on other ships' experience, maximizing compliance with them offers some degree of risk avoidance. Three MOEAs tackled the problem and provided the stakeholder with a Pareto front reflecting the trade-off among the three objectives. Geohash granularity and dimensionality reduction techniques were evaluated and discussed for the model. G-S-VSRP is a large-scale optimization problem and suffers from the curse of dimensionality (problems become difficult to solve due to the exponential growth of the multi-dimensional solution space); however, due to a special characteristic of the problem instance, a large number of function evaluations in MOEAs can still find a good set of solutions.
Finally, when the compliance objective in G-S-VSRP is changed to minimization, regular MOEAs perform poorly due to the curse of dimensionality. We focus on improving the performance of the large-scale G-S-VSRP through a novel distributed multi-objective cooperative coevolution algorithm (DMOCCA). The proposed DMOCCA improves the quality of the performance metrics compared to the regular MOEAs (NSGA-II, NSGA-III, and GDE3). Additionally, DMOCCA achieves speedup when running on a cluster.
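As one concrete illustration of the bi-objective output discussed above, the sketch below (not taken from the thesis) extracts the non-dominated set from candidate recovered schedules scored on delay and financial loss; an MOEA such as NSGA-II maintains and evolves exactly this kind of front:

import numpy as np

def pareto_front(objectives):
    """Return indices of non-dominated points for a minimization problem.
    objectives: shape (n_solutions, n_objectives), e.g. columns
    (total delay, financial loss) for candidate recovered schedules."""
    n = len(objectives)
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        # a point is dominated if another point is no worse in every
        # objective and strictly better in at least one
        dominated_by = np.all(objectives <= objectives[i], axis=1) & \
                       np.any(objectives < objectives[i], axis=1)
        if dominated_by.any():
            keep[i] = False
    return np.where(keep)[0]

# Toy usage: five candidate schedules scored on (delay in hours, loss in k$)
cands = np.array([[4.0, 120.0], [6.0, 90.0], [5.0, 100.0],
                  [7.0, 95.0], [4.5, 130.0]])
print(pareto_front(cands))   # keeps the mutually non-dominated schedules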
|
406 |
Vybrané problémy technologické realizace evropské ochrany osobních údajů / Selected issues in technological realization of European data protection. Kubica, Jan January 2019 (has links)
This thesis focuses on the legal regulation of selected aspects of personal data protection at the European level. Fuelled by technological progress, this area of regulation is becoming increasingly important, as the use of personal data can be a source of both innovation and economic progress, but it also has the potential to negatively impact individuals' rights (a "chilling effect"). The thesis analyses the use of big data and automated individual decision-making; both phenomena are assessed against the principles contained in the GDPR. The aim of the thesis is to evaluate, as far as these two phenomena are concerned, the functionality and prospects of the European regulation. Apart from the introduction and the conclusion, the thesis is divided into three chapters. The first chapter briefly introduces the concept of the right to the protection of personal data and the fundamental legal framework of the European regulation. It is followed by a chapter on big data in which, after a necessary technical introduction, the current practices of data controllers are contrasted with the corresponding principles of data protection regulation. Particular attention is also paid to the pitfalls of anonymization. At the end of this chapter, it is concluded that all relevant...
|
407 |
Rediseño del proceso de creación de propuestas de negocios mediante gestión del conocimiento y la aplicación de un modelo de clasificación de minería de datos / Redesign of the business proposal creation process through knowledge management and the application of a data mining classification model. Llanos Soto, Renato Jorge Emilio January 2017 (links)
Magíster en Ingeniería de Negocios con Tecnologías de Información / Knowledge is a resource found in people, in the objects they use, in the environment in which they move, and in the processes of the organizations to which they belong, enabling them to act upon and interpret what happens around them.

The mining market is currently going through a complex economic period, driven mainly by a significant worldwide drop in project investment, which directly affects companies that provide services to the mining industry in Chile. This has reduced the demand for services in the market, resulting in fewer project proposals on offer and greater competition to win them.

Empresa de Ingeniería Minera is engaged in oil and gas extraction and refining, energy, and mining across much of the world. The organization seeks to remain competitive and stay relevant in this changing and saturated market. Its revenues come mainly from the number of new projects completed successfully, so generating a possible increase in the organization's source of income is a key issue, and it is precisely this point that the study aims to support. In this context, the goal is to give the organization a better chance of winning new business, based on an analysis of its current situation and a process redesign supported by a data mining classification model, in order to improve the department's management with the available resources and to leverage existing information and knowledge to increase the likelihood of winning more projects and thereby raise the organization's profits relative to the previous year.
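The abstract does not specify which classification algorithm is used; as a hedged illustration only, the sketch below assumes hypothetical proposal features (budget, client history, number of competitors) and a logistic-regression classifier to score the win probability of new proposals from historical outcomes:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical proposal features: estimated budget (MUSD), past projects won
# with that client, and number of competing bidders; labels are synthetic.
rng = np.random.default_rng(1)
X = np.column_stack([rng.uniform(0.5, 50, 200),
                     rng.integers(0, 10, 200),
                     rng.integers(1, 8, 200)])
y = rng.integers(0, 2, 200)          # 1 = proposal won, 0 = lost

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)

# Rank new proposals by predicted probability of winning
print(model.predict_proba(X_te)[:5, 1])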
|
408 |
A-OPTIMAL SUBSAMPLING FOR BIG DATA GENERAL ESTIMATING EQUATIONS. Chung Ching Cheung (7027808) 13 August 2019 (links)
A significant hurdle for analyzing big data is the lack of effective technology and statistical inference methods. A popular approach for analyzing data with large sample sizes is subsampling. Many subsampling probabilities have been introduced in the literature (Ma et al., 2015) for linear models. In this dissertation, we focus on generalized estimating equations (GEE) with big data and derive the asymptotic normality of the estimator without resampling and the estimator with resampling. We also give the asymptotic representation of the bias of the estimator without resampling and the estimator with resampling, and show that the bias becomes significant when the data are high-dimensional. We further present a novel subsampling method, called A-optimal subsampling, which is derived by minimizing the trace of certain dispersion matrices (Peng and Tan, 2018), and derive the asymptotic normality of the estimator based on A-optimal subsampling. We conduct extensive simulations on large samples with high dimension to evaluate the performance of the proposed methods using MSE as a criterion. High-dimensional data are further investigated, and we show through simulations that minimizing the asymptotic variance does not imply minimizing the MSE when the bias is not negligible. We apply the proposed subsampling method to a real dataset, gas sensor data with more than four million data points. In both the simulations and the real data analysis, our A-optimal method outperforms the traditional uniform subsampling method.
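As a rough illustration of the subsampling workflow (not the dissertation's GEE derivation), the sketch below compares uniform subsampling with non-uniform probabilities built from pilot residuals and a leverage-type norm, a form often motivated by A-optimality for linear working models; the exact A-optimal weights for GEE differ:

import numpy as np

def subsample_fit(X, y, r, probs=None, seed=0):
    """Draw r rows with the given probabilities (uniform if None) and fit an
    inverse-probability-weighted least squares estimator on the subsample."""
    n = len(y)
    p = np.full(n, 1.0 / n) if probs is None else probs / probs.sum()
    idx = np.random.default_rng(seed).choice(n, size=r, replace=True, p=p)
    w = 1.0 / p[idx]                          # inverse-probability weights
    Xs, ys = X[idx], y[idx]
    WX = Xs * w[:, None]
    return np.linalg.solve(Xs.T @ WX, WX.T @ ys)

# Synthetic data standing in for a large dataset
rng = np.random.default_rng(42)
n, d = 100_000, 10
X = rng.standard_normal((n, d))
beta = np.ones(d)
y = X @ beta + rng.standard_normal(n)

# Scores motivated by A-optimality for a linear working model:
# pilot residuals times a leverage-type norm (illustrative, not the GEE formula)
pilot = np.linalg.lstsq(X[:5000], y[:5000], rcond=None)[0]
resid = np.abs(y - X @ pilot)
lev = np.linalg.norm(X @ np.linalg.inv(X.T @ X), axis=1)
scores = resid * lev

b_unif = subsample_fit(X, y, r=2000)
b_aopt = subsample_fit(X, y, r=2000, probs=scores)
print(np.linalg.norm(b_unif - beta), np.linalg.norm(b_aopt - beta))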
|
409 |
Tapping the Untapped Potential of Big Data to Assess the Type of Organization-Stakeholder Relationship on Social Media. Devin T Knighton (6997697) 14 August 2019 (links)
Social media is impacting the practice of public relations in many different ways, but the focus of this dissertation is on the power of big data from social media to identify and assess the relationship that stakeholders have with the organization. Social media analytics have tended to measure reactions to messages rather than the strength of the relationship, even though public relations is responsible for building strong relationships with the organization's stakeholders. Yet social media provides insight into the conversations that stakeholders have with other stakeholders about the organization, and can thus reveal the quality of the relationship they have with it.

This dissertation takes a networked approach to understanding the strength of the relationship that the organization has with its stakeholders, meaning it acknowledges that the relationship two entities have with each other is influenced by the relationships those entities have with others in common. In this case, the relationship that a stakeholder has with the organization is influenced by the relationships the stakeholder has with other stakeholders. Thus, one way to study the relationship that a stakeholder has with the organization is to look at the conversations and postings on social media among the various stakeholders. The ultimate aim of the dissertation is to show how the relationship can be assessed so the organization can create strategies that develop mutually beneficial relationships over time.

The context for the study is two major events where companies deliberately gather their stakeholders to interact in person and on social media about issues and products related to the organization's future. The first event is Adobe Creative Max, which Adobe hosts each year for creative professionals. The second is Dreamforce, hosted by Salesforce.com, which draws so many attendees that the company brings in cruise ships to dock in San Francisco Bay during the event because all the hotels in the area sell out far in advance. These two events provide a specific situation where stakeholders interact with other stakeholders outside of a crisis, which represents the majority of day-to-day public relations practice. Twitter data was collected during each week of each conference, and all company tweets were filtered out of the data sample. A text-mining approach was then used to examine the conversations among the stakeholders at the events.

Findings indicate that the strongest relationship was developed by Salesforce.com with its stakeholders at the Dreamforce 2018 event, in large part because of the CEO's keynote and the organizational commitment to social justice and sustainability. Granted, Salesforce had already worked to develop a culture among employees and customers based on the concept of "family," or "Ohana." However, the text of the conversations reveals that the focus at this conference was on societal issues presented by the CEO. In contrast, the findings from the Adobe conference suggest the organization has a transactional relationship with its stakeholders, in part because the CEO keynote focused heavily on products and technology. These findings indicate that big data from social media can be used to assess relationships, especially when the data represent conversations and interactions among stakeholders. The findings also show the influence of CEO communications on the relationship and the vital role that public relations practitioners play in setting that CEO communications agenda.
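The dissertation's text-mining pipeline is not reproduced here; as a hedged sketch with made-up example tweets, the snippet below shows how simple term weighting can surface whether stakeholder conversation vocabulary leans toward values and relationships or toward products:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny stand-in corpora; the study used a week of event tweets per conference,
# with the companies' own tweets filtered out.
dreamforce = ["ohana and equality on the keynote stage",
              "ceo talked sustainability and trust, proud to be part of this family"]
adobe_max = ["new photoshop features look amazing",
             "keynote demo of the latest creative cloud tools"]

def top_terms(docs, k=5):
    """Return the k highest-weighted TF-IDF terms in a set of stakeholder tweets."""
    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform(docs)
    weights = np.asarray(tfidf.sum(axis=0)).ravel()
    terms = vec.get_feature_names_out()
    return [terms[i] for i in weights.argsort()[::-1][:k]]

print(top_terms(dreamforce))   # relationship/values vocabulary
print(top_terms(adobe_max))    # product/feature vocabulary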
|
410 |
Identifying and Evaluating Early Stage Fintech Companies: Working with Consumer Internet Data and Analytic Tools. Dymov, Khasan 24 January 2018 (links)
The purpose of this project is to work as an interdisciplinary team whose primary role is to mentor a team of WPI undergraduate students completing their Major Qualifying Project (MQP) in collaboration with Vestigo Ventures, LLC (“Vestigo Ventures”) and Cogo Labs. We worked closely with the project sponsors at Vestigo Ventures and Cogo Labs to understand each sponsor’s goals and desires, and then translated those thoughts into actionable items and concrete deliverables to be completed by the undergraduate student team. As a graduate student team with a diverse set of educational backgrounds and a range of academic and professional experiences, we provided two primary functions throughout the duration of this project. The first function was to develop a roadmap for each individual project, with concrete steps, justification, goals and deliverables. The second function was to provide the undergraduate team with clarification and assistance throughout the implementation and completion of each project, as well as provide our opinions and thoughts on any proposed changes. The two teams worked together in lock-step in order to provide the project sponsors with a complete set of deliverables, with the undergraduate team primarily responsible for implementation and final delivery of each completed project.
|