401

Distributed Feature Selection in Large n and Large p Regression Problems

Wang, Xiangyu January 2016 (has links)
Fitting statistical models is computationally challenging when the sample size or the dimension of the dataset is huge. An attractive approach for down-scaling the problem size is to first partition the dataset into subsets and then fit using distributed algorithms. The dataset can be partitioned either horizontally (in the sample space) or vertically (in the feature space), and the challenge lies in defining an algorithm with low communication, theoretical guarantees, and excellent practical performance in general settings. For sample space partitioning, I propose a MEdian Selection Subset AGgregation Estimator (message) algorithm for solving these issues. The algorithm applies feature selection in parallel to each subset using a regularized regression or Bayesian variable selection method, calculates the 'median' feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves very minimal communication, scales efficiently in sample size, and has theoretical guarantees. I provide extensive experiments to show excellent performance in feature selection, estimation, prediction, and computation time relative to the usual competitors.

While sample space partitioning is useful for handling datasets with a large sample size, feature space partitioning is more effective when the data dimension is high. Existing methods for partitioning features, however, are either vulnerable to high correlations or inefficient in reducing the model dimension. In the thesis, I propose a new embarrassingly parallel framework named DECO for distributed variable selection and parameter estimation. In DECO, variables are first partitioned and allocated to m distributed workers. The decorrelated subset data within each worker are then fitted via any algorithm designed for high-dimensional problems. We show that by incorporating the decorrelation step, DECO can achieve consistent variable selection and parameter estimation on each subset with (almost) no assumptions. In addition, the convergence rate is nearly minimax optimal for both sparse and weakly sparse models and does NOT depend on the partition number m. Extensive numerical experiments are provided to illustrate the performance of the new framework.

For datasets with both large sample sizes and high dimensionality, I propose a new "divide-and-conquer" framework DEME (DECO-message) that leverages both the DECO and the message algorithms. The new framework first partitions the dataset in the sample space into row cubes using message and then partitions the feature space of the cubes using DECO. This procedure is equivalent to partitioning the original data matrix into multiple small blocks, each with a feasible size that can be stored and fitted in a computer in parallel. The results are then synthesized via the DECO and message algorithms in reverse order to produce the final output. The whole framework is extremely scalable. / Dissertation
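As a rough illustration of the sample-space "message" workflow summarized in this abstract, the sketch below partitions synthetic data, runs lasso-based selection on each subset, applies a median inclusion rule, and averages per-subset coefficient estimates. It uses scikit-learn's LassoCV as the per-subset selector; the function names and thresholds are illustrative choices, not taken from the thesis.

```python
# A minimal sketch of the sample-space "message" idea, assuming a lasso
# selector on each subset. Names and thresholds are illustrative only.
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

def message_sketch(X, y, n_subsets=4, rng=None):
    rng = np.random.default_rng(rng)
    subsets = np.array_split(rng.permutation(len(y)), n_subsets)

    # Step 1: feature selection on each subset (run in parallel in practice).
    inclusion = np.zeros((n_subsets, X.shape[1]), dtype=int)
    for k, rows in enumerate(subsets):
        lasso = LassoCV(cv=5).fit(X[rows], y[rows])
        inclusion[k] = (np.abs(lasso.coef_) > 1e-8).astype(int)

    # Step 2: "median" feature inclusion index -- keep features selected by
    # at least half of the subsets.
    selected = np.where(np.median(inclusion, axis=0) >= 0.5)[0]

    # Step 3: estimate coefficients for the selected features on each subset,
    # then average the estimates across subsets.
    coefs = np.zeros((n_subsets, len(selected)))
    for k, rows in enumerate(subsets):
        coefs[k] = LinearRegression().fit(X[rows][:, selected], y[rows]).coef_
    return selected, coefs.mean(axis=0)

# Example usage on synthetic data with 5 true signals among 50 features.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 50))
    beta = np.zeros(50); beta[:5] = [3, -2, 1.5, 2.5, -1]
    y = X @ beta + rng.normal(size=2000)
    sel, est = message_sketch(X, y, n_subsets=4, rng=1)
    print("selected features:", sel)
    print("averaged estimates:", np.round(est, 2))
```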
402

Maritime Transportation Optimization Using Evolutionary Algorithms in the Era of Big Data and Internet of Things

Cheraghchi, Fatemeh 19 July 2019 (has links)
With the maritime industry carrying out nearly 90% of the volume of global trade, the algorithms and solutions that provide quality of service in maritime transportation are of great importance to both academia and industry. This research investigates an optimization problem using evolutionary algorithms and big data analytics to address an important challenge in maritime disruption management, and illustrates how it can be engaged with information technologies and the Internet of Things. Accordingly, in this thesis, we design, develop and evaluate methods to improve decision support systems (DSSs) in maritime supply chain management. We pursue three research goals in this thesis. First, the Vessel Schedule Recovery Problem (VSRP) is reformulated and a bi-objective optimization approach is proposed. We employ multi-objective evolutionary algorithms (MOEAs) to solve the optimization problems. An optimal Pareto front provides a valuable trade-off between the two objectives (minimizing delay and minimizing financial loss) for a stakeholder in the freight shipping company. We evaluate the problem in three domains, namely scalability analysis, vessel steaming policies, and voyage distance analysis, and statistically validate the significance of the performance differences. According to the experiments, the problem complexity varies across scenarios, while NSGAII performs better than the other MOEAs in all scenarios. In the second work, a new data-driven VSRP is proposed, which benefits from the available Automatic Identification System (AIS) data. In the new formulation, the trajectory between the port calls is divided and encoded into adjacent geohashed regions. In each geohash, the historical speed profiles are extracted from AIS data. This results in a large-scale optimization problem called G-S-VSRP with three objectives (minimizing loss, minimizing delay, and maximizing compliance), where the compliance objective maximizes the compliance of optimized speeds with the historical data. Assuming that the historical speed profiles are a reliable guide to actual operational speeds based on other ships' experience, maximizing the compliance of optimized speeds with these historical data offers some degree of risk avoidance. Three MOEAs tackled the problem and provided the stakeholder with a Pareto front reflecting the trade-off among the three objectives. Geohash granularity and dimensionality reduction techniques were evaluated and discussed for the model. G-S-VSRP is a large-scale optimization problem and suffers from the curse of dimensionality (i.e., problems are difficult to solve due to the exponential growth in the size of the multi-dimensional solution space); however, due to a special characteristic of the problem instance, a large number of function evaluations in MOEAs can still find a good set of solutions. Finally, when the compliance objective in G-S-VSRP is changed to minimization, the regular MOEAs perform poorly due to the curse of dimensionality. We focus on improving the performance of the large-scale G-S-VSRP through a novel distributed multiobjective cooperative coevolution algorithm (DMOCCA). The proposed DMOCCA improves the quality of the performance metrics compared to the regular MOEAs (i.e., NSGAII, NSGAIII, and GDE3). Additionally, DMOCCA results in speedup when running on a cluster.
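To make the geohash-based speed-profile step more concrete, here is a small, self-contained sketch of grouping historical AIS speed observations by trajectory cell. A real implementation would use an actual geohash library and real AIS fields; the record layout and the coarse lat/lon rounding used in place of geohashing are assumptions for illustration only.

```python
# A rough sketch of the geohash-based speed-profile idea. Real geohashing
# would use a dedicated library; simple lat/lon rounding stands in for the
# geohash cell so the example is self-contained. AIS field names are assumed.
from collections import defaultdict
from statistics import mean, median

def cell_key(lat, lon, precision=1):
    """Stand-in for a geohash: bucket positions on a coarse lat/lon grid."""
    return (round(lat, precision), round(lon, precision))

def speed_profiles(ais_records, precision=1):
    """Group historical AIS speed observations by trajectory cell."""
    buckets = defaultdict(list)
    for rec in ais_records:  # rec: dict with 'lat', 'lon', 'sog' (speed over ground)
        buckets[cell_key(rec["lat"], rec["lon"], precision)].append(rec["sog"])
    # Summarise each cell; these summaries would feed a compliance objective
    # (how far an optimized speed departs from historical speeds in that cell).
    return {k: {"median": median(v), "mean": mean(v), "n": len(v)}
            for k, v in buckets.items()}

# Tiny illustrative input: three observations falling into two cells.
ais = [
    {"lat": 49.28, "lon": -123.11, "sog": 12.5},
    {"lat": 49.31, "lon": -123.09, "sog": 13.1},
    {"lat": 48.42, "lon": -123.37, "sog": 9.8},
]
print(speed_profiles(ais, precision=1))
```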
403

Vybrané problémy technologické realizace evropské ochrany osobních údajů / Selected issues in technological realization of European data protection

Kubica, Jan January 2019 (has links)
This thesis focuses on the legal regulation of selected aspects of personal data protection at the European level. Fuelled by technological progress, this area of legal regulation is becoming increasingly important, as the usage of personal data can be a source of both innovation and economic progress, but it also has the potential to negatively impact individuals' rights (the "chilling effect"). The thesis analyses the usage of big data and automated individual decision making; both phenomena are assessed through the principles contained in the GDPR. The aim of the thesis is to evaluate the functionality and prospects of the European regulation as far as these two phenomena are concerned. The thesis is, apart from the introduction and the conclusion, divided into three chapters. The first part briefly introduces the concept of the right to the protection of personal data and the fundamental legal framework of the European regulation. This chapter is followed by a chapter focused on big data, in which, after a necessary technical introduction, current practices of data controllers are contrasted with the corresponding principles of data protection regulation. Particular attention is also paid to the pitfalls of anonymization. At the end of this chapter, it is concluded that all relevant...
404

Rediseño del proceso de creación de propuestas de negocios mediante gestión del conocimiento y la aplicación de un modelo de clasificación de minería de datos / Redesign of the business proposal creation process through knowledge management and the application of a data mining classification model

Llanos Soto, Renato Jorge Emilio January 2017 (has links)
Magíster en Ingeniería de Negocios con Tecnologías de Información / Knowledge is a resource found in people, in the objects they use, in the environment in which they operate, and in the processes of the organizations to which they belong, allowing them to act on and interpret what happens around them. The mining market is currently going through a complex economic situation, driven mainly by a significant worldwide decline in project investment, which directly affects companies that supply services to the mining industry in Chile. This has reduced the demand for services in the market, resulting in fewer project proposals on offer and greater competition to win them. Empresa de Ingeniería Minera is engaged in oil and gas extraction and refining, energy, and mining across most of the world. The organization seeks to remain competitive and stay relevant in this changing and saturated market. Its revenue comes mainly from the number of new projects carried out successfully, so a potential increase in the organization's revenue stream is a key concern, and it is precisely this point that the study seeks to support. In this context, the goal is to improve the organization's chances of winning new business, based on an analysis of its current situation and a process redesign supported by a data mining classification model that improves the department's management of its available resources and exploits existing information and knowledge to increase the likelihood of winning more projects and thereby raise the organization's profits relative to the previous year.
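As an illustration of the kind of data mining classification model such a redesign could rely on, the hedged sketch below trains a classifier to predict whether a historical proposal was won. The features, data, and model choice are entirely hypothetical and are not drawn from the thesis.

```python
# Hypothetical sketch: predicting proposal outcomes from historical attributes.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
n = 500
# Hypothetical historical proposals: bid amount, prior projects with the
# client, estimated margin, and whether the proposal was ultimately won.
X = np.column_stack([
    rng.uniform(0.5, 10.0, n),   # bid amount (MM USD)
    rng.integers(0, 10, n),      # prior projects with this client
    rng.uniform(0.0, 0.3, n),    # estimated margin
])
y = (0.3 * X[:, 1] + 20 * X[:, 2] + rng.normal(0, 1, n) > 3).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```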
405

A-OPTIMAL SUBSAMPLING FOR BIG DATA GENERAL ESTIMATING EQUATIONS

Chung Ching Cheung (7027808) 13 August 2019 (has links)
A significant hurdle for analyzing big data is the lack of effective technology and statistical inference methods. A popular approach for analyzing data with large sample sizes is subsampling. Many subsampling probabilities have been introduced in the literature (Ma et al., 2015) for the linear model. In this dissertation, we focus on generalized estimating equations (GEE) with big data and derive the asymptotic normality of the estimator without resampling and the estimator with resampling. We also give the asymptotic representation of the bias of the estimator without resampling and the estimator with resampling, and we show that the bias becomes significant when the data are high-dimensional. We also present a novel subsampling method, called A-optimal, which is derived by minimizing the trace of certain dispersion matrices (Peng and Tan, 2018). We derive the asymptotic normality of the estimator based on A-optimal subsampling. We conduct extensive simulations on large samples with high dimension to evaluate the performance of our proposed methods, using MSE as the criterion. High-dimensional data are further investigated, and we show through simulations that minimizing the asymptotic variance does not imply minimizing the MSE, as the bias is not negligible. We apply our proposed subsampling method to analyze a real data set, gas sensor data with more than four million data points. In both the simulations and the real data analysis, our A-optimal method outperforms the traditional uniform subsampling method.
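To give a flavour of the subsampling workflow, the sketch below applies an A-optimality-style recipe to a plain linear model: a uniform pilot fit, importance probabilities built from residuals and leverage-like terms, and a weighted fit on the informative subsample. The dissertation derives these quantities for general estimating equations; this simplified linear-model version only illustrates the overall pipeline, and the specific probability formula is an assumption.

```python
# Hedged sketch of an A-optimality-style subsampling pipeline for a linear
# model (pilot fit -> importance probabilities -> weighted fit on subsample).
import numpy as np

def a_optimal_subsample_fit(X, y, r_pilot=500, r=2000, rng=None):
    rng = np.random.default_rng(rng)
    n = len(y)

    # Pilot step: uniform subsample to get rough coefficients and residuals.
    pilot = rng.choice(n, size=r_pilot, replace=False)
    beta_pilot, *_ = np.linalg.lstsq(X[pilot], y[pilot], rcond=None)
    resid = y - X @ beta_pilot

    # A-optimality-style probabilities (assumed form): weight points by how
    # much they perturb the estimator, here |residual_i| * ||(X'X)^{-1} x_i||.
    M_inv = np.linalg.inv(X.T @ X)
    lever = np.linalg.norm(X @ M_inv, axis=1)
    probs = np.abs(resid) * lever
    probs /= probs.sum()

    # Subsample with these probabilities and fit by weighted least squares,
    # reweighting by 1/prob to keep the estimating equation unbiased.
    take = rng.choice(n, size=r, replace=True, p=probs)
    w = 1.0 / probs[take]
    Xw = X[take] * np.sqrt(w)[:, None]
    yw = y[take] * np.sqrt(w)
    beta_hat, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    return beta_hat

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100_000, 10))
    beta = np.arange(1, 11, dtype=float)
    y = X @ beta + rng.normal(size=100_000)
    print(np.round(a_optimal_subsample_fit(X, y, rng=1), 2))
```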
406

Tapping the Untapped Potential of Big Data to Assess the Type of Organization-Stakeholder Relationship on Social Media

Devin T Knighton (6997697) 14 August 2019 (has links)
Social media is impacting the practice of public relations in many different ways, but the focus of this dissertation is on the power of big data from social media to identify and assess the relationship that stakeholders have with the organization. Social media analytics have tended to measure reactions to messages rather than the strength of the relationship, even though public relations is responsible for building strong relationships with the organization's stakeholders. Yet social media provides insight into the conversations that stakeholders have with other stakeholders about the organization and thus can reveal the quality of the relationship they have with it.

This dissertation takes a networked approach to understanding the strength of the relationship that the organization has with its stakeholders, meaning it acknowledges that the relationship two entities have with each other is influenced by the relationships those entities have with others in common. In this case, the relationship that a stakeholder has with the organization is influenced by the relationships the stakeholder has with other stakeholders. Thus, one way to study the relationship that a stakeholder has with the organization is to look at the conversation and the postings on social media among the various stakeholders. The ultimate aim of the dissertation is to show how the relationship can be assessed so the organization can create strategies that develop mutually beneficial relationships over time.

The context for the study is based on two major events where companies deliberately gather their stakeholders to interact in person and on social media about issues and products related to the organization's future. The first event is Adobe Creative Max, which Adobe hosts each year for creative professionals. The second is Dreamforce, hosted by Salesforce.com, which attracts so many attendees that the company brings in cruise ships to dock in the San Francisco Bay during the event because all the hotels in the area sell out far in advance. These two events provide a specific situation where stakeholders interact with other stakeholders outside of a crisis, which represents the majority of day-to-day public relations practice. Twitter data was collected during the week of each conference, and all company tweets were filtered out of the data sample. A text-mining approach was then used to examine the conversations among the stakeholders at the events.

Findings indicate that the strongest relationship was developed by Salesforce.com with its stakeholders at the Dreamforce 2018 event, in large part because of the CEO's keynote and the organizational commitment to social justice and sustainability. Granted, Salesforce had already worked to develop a culture among employees and customers based on the concept of "family," or "Ohana." However, the text of the conversations reveals that the focus at this conference was on societal issues presented by the CEO. In contrast, the findings from the Adobe conference suggest the organization has a transactional relationship with its stakeholders, in part because the CEO keynote focused heavily on products and technology. The implications of these findings indicate that big data from social media can be used to assess relationships, especially when the social media data represent conversations and interactions among stakeholders. The findings also show the influence of CEO communications on the relationship and the vital role that public relations practitioners play in setting the CEO communications agenda.
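A minimal sketch of the data-preparation step described here, filtering out the company's own tweets and tallying the most frequent stakeholder conversation terms, might look as follows. The tweet structure, handle list, and stop-word set are illustrative assumptions, not the dissertation's actual pipeline.

```python
# Sketch: drop the company's own tweets, then surface frequent conversation
# terms among stakeholders. Handles and the record layout are assumptions.
import re
from collections import Counter

COMPANY_HANDLES = {"@salesforce", "@adobe"}  # hypothetical filter list
STOPWORDS = {"the", "a", "an", "and", "to", "of", "at", "for", "is", "in", "was", "on"}

def stakeholder_term_counts(tweets):
    """tweets: iterable of dicts with 'author' and 'text' keys (assumed)."""
    counts = Counter()
    for t in tweets:
        if t["author"].lower() in COMPANY_HANDLES:
            continue  # keep only stakeholder-to-stakeholder conversation
        words = re.findall(r"[a-z']+", t["text"].lower())
        counts.update(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts

sample = [
    {"author": "@adobe", "text": "Welcome to MAX!"},
    {"author": "@designer1", "text": "The keynote focus on products was impressive"},
    {"author": "@dev42", "text": "Loving the Ohana spirit and the social justice keynote"},
]
print(stakeholder_term_counts(sample).most_common(5))
```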
407

Identifying and Evaluating Early Stage Fintech Companies: Working with Consumer Internet Data and Analytic Tools

Dymov, Khasan 24 January 2018 (has links)
The purpose of this project is to work as an interdisciplinary team whose primary role is to mentor a team of WPI undergraduate students completing their Major Qualifying Project (MQP) in collaboration with Vestigo Ventures, LLC. (“Vestigo Ventures”) and Cogo Labs. We worked closely with the project sponsors at Vestigo Ventures and Cogo Labs to understand each sponsor’s goals and desires, and then translated those thoughts into actionable items and concrete deliverables to be completed by the undergraduate student team. As a graduate student team with a diverse set of educational backgrounds and a range of academic and professional experiences, we provided two primary functions throughout the duration of this project. The first function was to develop a roadmap for each individual project, with concrete steps, justification, goals and deliverables. The second function was to provide the undergraduate team with clarification and assistance throughout the implementation and completion of each project, as well as provide our opinions and thoughts on any proposed changes. The two teams worked together in lock-step in order to provide the project sponsors with a complete set of deliverables, with the undergraduate team primarily responsible for implementation and final delivery of each completed project.
408

Identifying and Evaluating Early Stage Fintech Companies: Working with Consumer Internet Data and Analytic Tools

Shoop, Alexander 24 January 2018 (has links)
The purpose of this project is to work as an interdisciplinary team whose primary role is to mentor a team of WPI undergraduate students completing their Major Qualifying Project (MQP) in collaboration with Vestigo Ventures, LLC. (“Vestigo Ventures”) and Cogo Labs. We worked closely with the project sponsors at Vestigo Ventures and Cogo Labs to understand each sponsor’s goals and desires, and then translated those thoughts into actionable items and concrete deliverables to be completed by the undergraduate student team. As a graduate student team with a diverse set of educational backgrounds and a range of academic and professional experiences, we provided two primary functions throughout the duration of this project. The first function was to develop a roadmap for each individual project, with concrete steps, justification, goals and deliverables. The second function was to provide the undergraduate team with clarification and assistance throughout the implementation and completion of each project, as well as provide our opinions and thoughts on any proposed changes. The two teams worked together in lock-step in order to provide the project sponsors with a complete set of deliverables, with the undergraduate team primarily responsible for implementation and final delivery of each completed project.
409

Revenue Generation in Data-driven Healthcare : An exploratory study of how big data solutions can be integrated into the Swedish healthcare system

Jonsson, Hanna, Mazomba, Luyolo January 2019 (has links)
The purpose of this study is to investigate how big data solutions in the Swedish healthcare system can generate revenue. As technology continues to evolve, the use of big data is beginning to transform processes in many different industries, making them more efficient and effective. The opportunities presented by big data have been researched to a large extent in commercial fields; however, research on the use of big data in healthcare is scarce, and this is particularly true in the case of Sweden. Furthermore, there is a lack of research that explores the interface between big data, healthcare, and revenue models. The interface between these three fields of research is important, as innovation and the integration of big data in healthcare could be affected by the ability of companies to generate revenue from developing such innovations or solutions. Thus, this thesis aims to fill this gap and contribute to the limited body of knowledge that exists on this topic. The study was conducted using qualitative methods: a literature search was done and interviews were conducted with individuals who hold managerial positions at Region Västerbotten. The purpose of conducting these interviews was to establish a better understanding of the Swedish healthcare system and how its structure has influenced the use, or lack thereof, of big data in the healthcare delivery process, as well as how this structure enables the generation of revenue through big data solutions. The data collected were analysed using the grounded theory approach, which involves coding and thematising the empirical data in order to identify the key areas of discussion. The findings revealed that the current state of the Swedish healthcare system does not present an environment in which big data solutions developed for the system can thrive and generate revenue. However, if action is taken to change the current state of the system, then revenue generation may be possible in the future. The findings also identified key barriers that need to be overcome in order to increase the integration of big data into the healthcare system: (i) a lack of big data knowledge and expertise, (ii) data protection regulations, (iii) national budget allocation, and (iv) a lack of structured data. Through collaborative work between actors in both the public and private sectors, these barriers can be overcome, and Sweden could be on its way to transforming its healthcare system with the use of big data solutions, thus improving the quality of care provided to its citizens. Key words: big data, healthcare, Swedish healthcare system, AI, revenue models, data-driven revenue models
410

Big data analysis of Customers’ information: A case study of Swedish Energy Company’s strategic communication

Afzal, Samra January 2019 (has links)
Big data analysis and inbound marketing are interlinked and can play a significant role in identifying the target audience and in producing communication content tailored to that audience's needs for strategic communication campaigns. By bringing the marketing concepts of big data analysis and inbound marketing into the field of strategic communication, this quantitative study attempts to fill a gap in the limited body of knowledge of strategic communication research and practice. The study uses marketing campaigns as case studies to introduce a new strategic communication model, incorporating big data analysis and inbound marketing strategy into the three-stage model of strategic communication presented by Gulbrandsen, I. T., & Just, S. N. in 2016. Big-data-driven campaigns are used to explain the procedure of target audience selection, key concepts of big data analysis, future opportunities, and practical applications of big data for strategic communication practitioners and researchers, identifying the need for more academic research on, and practical use of, big data analysis and inbound marketing in the strategic communication area. The study shows that big data analysis has the potential to contribute to the field of strategic and target-oriented communication. Inbound marketing and big data analysis have been used and regarded as marketing strategies, but this study attempts to shift attention towards their role in strategic communication; there is thus a need to study big data analysis and inbound marketing with an open mind, without confining them to particular fields.
