131

Using a Scalable Feature Selection Approach For Big Data Regressions

Qingdong Cheng (6922766) 13 August 2019 (has links)
Logistic regression is a widely used statistical method in data analysis and machine learning. When the volume of data is large, training models with the traditional approach becomes time-consuming and can even be infeasible, so an efficient way to evaluate feature combinations and update learning models is essential. With the approach proposed by Yang, Wang, Xu, and Zhang (2018), a system can be represented by matrices small enough to be held in memory. These working sufficient statistics matrices can then be used to update logistic regression models. This study applies the working sufficient statistics approach to logistic regression to examine how the new method improves performance, comparing it against the traditional approach implemented in Spark's machine learning package. The experiments showed that the working sufficient statistics method improved the performance of training logistic regression models when the input size was large.
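The abstract does not reproduce the algorithm itself, so the following Python sketch only illustrates the general idea of compressing a large dataset into small, memory-resident sufficient-statistics matrices for logistic regression, here via a plain iteratively reweighted least squares (IRLS) step aggregated over data blocks. It is not the exact method of Yang, Wang, Xu, and Zhang (2018), and the block layout, helper names, and toy data are illustrative assumptions.

```python
import numpy as np

def sufficient_stats(X_block, y_block, beta):
    """Aggregate one data block into small matrices for an IRLS-style
    logistic regression update (p x p and p x 1, independent of block size)."""
    eta = X_block @ beta
    mu = 1.0 / (1.0 + np.exp(-eta))                      # predicted probabilities
    w = mu * (1.0 - mu)                                  # IRLS weights
    z = eta + (y_block - mu) / np.clip(w, 1e-12, None)   # working response
    XtWX = X_block.T @ (w[:, None] * X_block)
    XtWz = X_block.T @ (w * z)
    return XtWX, XtWz

def irls_update(blocks, beta):
    """Sum per-block statistics (e.g. collected from Spark partitions)
    and solve one small linear system for the next coefficient vector."""
    p = beta.shape[0]
    A, b = np.zeros((p, p)), np.zeros(p)
    for X_block, y_block in blocks:
        XtWX, XtWz = sufficient_stats(X_block, y_block, beta)
        A += XtWX
        b += XtWz
    return np.linalg.solve(A, b)

# toy usage: two "partitions" of a small synthetic dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_beta = np.array([1.0, -2.0, 0.5, 0.0, 1.5])
y = (X @ true_beta + rng.logistic(size=1000) > 0).astype(float)
blocks = [(X[:500], y[:500]), (X[500:], y[500:])]
beta = np.zeros(5)
for _ in range(10):
    beta = irls_update(blocks, beta)
print(beta)   # roughly recovers true_beta
```

Each block contributes only a p x p matrix and a p-vector, so the aggregates stay small no matter how many rows a partition holds, which is the property the abstract relies on for training at scale.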
132

Att skapa och fånga värde med stöd av Big data / To create and capture value through the use of Big data

Mazuga, Alicja, Jurevica, Kristine January 2019 (has links)
Big data is a much-discussed topic, and a growing number of organizations want to invest in and use Big data for decision support. The purpose of this study has therefore been to gain a deeper understanding of the concept of Big data and, more concretely, to examine how Big data affects value creation and value capture in business models. To address this purpose, the following research question was posed: How does Big data affect value creation and value capture in business models? A multiple case study design of a qualitative character was applied to fulfil the study's purpose and answer the research question. Data were collected through semi-structured interviews with four respondents from three different organizations, and the collected data were then analysed using thematic analysis. The results show that Big data analytics can create substantial business value. Big data is a powerful asset for organizations that can use data analysis as a basis for well-founded decisions, which can lead to development and greater efficiency in the organization. However, working with Big data brings challenges concerning communication between data-driven units and people, as well as identifying which sources actually deliver value. Value creation and value capture in business models is a continuous process that organizations strive for. The study has shown that value creation supported by Big data is about satisfying customer needs and building long-term customer relationships by using data analysis to identify customer behaviour, find new trends, and develop new products. To increase value creation with the support of Big data, organizations must carry out various kinds of activities, including finding sources that deliver essential information and investing in the right tools and in employees who can handle and process the large data volumes. How much value organizations can capture, that is, how much revenue is generated from sales, depends on the organization's access to resources and knowledge and on whether all employees in the organization are involved in using Big data. Integrating employees into data analysis can increase value creation, since their knowledge and experience raise the chances of finding hidden patterns and new trends and of predicting events that have not yet occurred, which over time will satisfy customers, create value, and generate revenue.
133

GDPR - Så påverkas detaljhandelns datahantering : en studie av hur GDPR påverkat detaljhandelns datahantering / GDPR - How retail data management is affected: a study of how GDPR has affected data management in retail

Stafilidis, Dennis, Sjögren, Ludwig January 2019 (has links)
The advance of digitalization has meant that the use of Big Data analytics has increased, and with it the demands on data security and the protection of personal privacy have grown. The GDPR came into force in 2018 and sets stricter requirements for the handling of personal data and customer data; it aims to protect people's privacy and personal information. Retail is one of the industries that handles large volumes of data and uses Big Data analytics to gain insights about its customers. But for Big Data analytics to reach its full potential, the data must be used repeatedly for different purposes, whereas the GDPR prescribes that customer data may not be used for purposes other than those for which it was collected. The purpose of the study is to examine and describe how retail companies adapt their data management to the requirements of the GDPR. To investigate the research question, we used a qualitative approach: data were collected through interviews, which were then analysed using thematic analysis. The results of our study show that the GDPR has affected how retail handles customer data. Data collection has become stricter, and the purchase of external customer data has ceased. The analysis of customer data has been affected through additional processing steps and through more restrictive access to the databases and data used for analysis. The aggregation of customer data has changed in that the data sources used have changed. The storage of customer data has changed, as an integration solution has been created that makes it possible to delete stored customer data across different databases.
134

Interpreting "Big Data": Rock Star Expertise, Analytical Distance, and Self-Quantification

Willis, Margaret Mary January 2015 (has links)
Thesis advisor: Natalia Sarkisian / The recent proliferation of technologies to collect and analyze “Big Data” has changed the research landscape, making it easier for some to use unprecedented amounts of real-time data to guide decisions and build ‘knowledge.’ In the three articles of this dissertation, I examine what these changes reveal about the nature of expertise and the position of the researcher. In the first article, “Monopoly or Generosity? ‘Rock Stars’ of Big Data, Data Democrats, and the Role of Technologies in Systems of Expertise,” I challenge the claims of recent scholarship, which frames the monopoly of experts and the spread of systems of expertise as opposing forces. I analyze video recordings (N=30) of the proceedings of two professional conferences about Big Data Analytics (BDA), and I identify distinct orientations towards BDA practice among presenters: (1) those who argue that BDA should be conducted by highly specialized “Rock Star” data experts, and (2) those who argue that access to BDA should be “democratized” to non-experts through the use of automated technology. While the “data democrats” argue that automating technology enhances the spread of the system of BDA expertise, they ignore the ways that it also enhances, and hides, the monopoly of the experts who designed the technology. In addition to its implications for practitioners of BDA, this work contributes to the sociology of expertise by demonstrating the importance of focusing on both monopoly and generosity in order to study power in systems of expertise, particularly those relying extensively on technology. Scholars have discussed several ways that the position of the researcher affects the production of knowledge. In “Distance Makes the Scholar Grow Fonder? The Relationship Between Analytical Distance and Critical Reflection on Methods in Big Data Analytics,” I pinpoint two types of researcher “distance” that have already been explored in the literature (experiential and interactional), and I identify a third type of distance—analytical distance—that has not been examined so far. Based on an empirical analysis of 113 articles that utilize Twitter data, I find that the analytical distance that authors maintain from the coding process is related to whether the authors include explicit critical reflections about their research in the article. Namely, articles in which the authors automate the coding process are significantly less likely to reflect on the reliability or validity of the study, even after controlling for factors such as article length and author’s discipline. These findings have implications for numerous research settings, from studies conducted by a team of scholars who delegate analytic tasks, to “big data” or “e-science” research that automates parts of the analytic process. Individuals who engage in self-tracking—collecting data about themselves or aspects of their lives for their own purposes—occupy a unique position as both researcher and subject. In the sociology of knowledge, previous research suggests that low experiential distance between researcher and subject can lead to more nuanced interpretations but also blind the researcher to his or her underlying assumptions. However, these prior studies of distance fail to explore what happens when the boundary between researcher and subject collapses in “N of one” studies.
In “The Collapse of Experiential Distance and the Inescapable Ambiguity of Quantifying Selves,” I borrow from art and literary theories of grotesquerie—another instance of the collapse of boundaries—to examine the collapse of boundaries in self-tracking. Based on empirical analyses of video testimonies (N=102) and interviews (N=7) with members of the Quantified Self community of self-trackers, I find that ambiguity and multiplicity are integral facets of these data practices. I discuss the implications of these findings for the sociological study of researcher distance, and also the practical implications for the neoliberal turn that assigns responsibility to individuals to collect, analyze, and make the best use of personal data. / Thesis (PhD) — Boston College, 2015. / Submitted to: Boston College. Graduate School of Arts and Sciences. / Discipline: Sociology.
135

Outlier Detection In Big Data

Cao, Lei 29 March 2016 (has links)
The dissertation focuses on scaling outlier detection to work both on huge static datasets and on dynamic streaming datasets. Outliers are patterns in the data that do not conform to the expected behavior. Outlier detection techniques are broadly applied in applications ranging from credit fraud prevention and network intrusion detection to stock investment tactical planning. For such mission-critical applications, a timely response is often of paramount importance, yet processing outlier detection requests is algorithmically complex and resource-intensive. In this dissertation we investigate the challenges of detecting outliers in big data, in particular those caused by the high velocity of streaming data, the large volume of static data, and the large cardinality of the input parameter space for tuning outlier mining algorithms. Effective optimization techniques are proposed to ensure the responsiveness of outlier detection in big data. We first propose a novel optimization framework called LEAP to continuously detect outliers over data streams. The continuous discovery of outliers is critical for a large range of online applications that monitor high-volume, continuously evolving streaming data. LEAP encompasses two general optimization principles that exploit the rarity of outliers and the temporal priority relationships among stream data points. Leveraging these two principles, LEAP not only continuously delivers outliers with respect to a set of popular outlier models, but also provides near real-time support for processing powerful outlier analytics workloads composed of large numbers of outlier mining requests with varied parameter settings. Second, we develop a distributed approach to efficiently detect outliers over massive static datasets. In this big data era, as the volume of data advances to new levels, the power of distributed compute clusters must be employed to detect outliers within a short turnaround time. Our approach optimizes the key factors determining the efficiency of distributed data analytics, namely communication costs and load balancing. In particular, we prove that the traditional frequency-based load-balancing assumption is not effective, and we design a novel cost-driven data partitioning strategy that achieves load balancing. Furthermore, we abandon the traditional approach of running one detection algorithm on all compute nodes and instead propose a multi-tactic methodology that adaptively selects the most appropriate algorithm for each node based on the characteristics of the data partition assigned to it. Third, traditional outlier detection systems process each individual outlier detection request, instantiated with a particular parameter setting, one at a time. This is not only prohibitively time-consuming for large datasets but also tedious for analysts as they explore the data to hone in on the most appropriate parameter setting or the desired results. We therefore design an interactive outlier exploration paradigm that not only answers traditional outlier detection requests in near real-time, but also offers innovative outlier analytics tools to help analysts quickly extract, interpret, and understand the outliers of interest. Our experimental studies, including performance evaluations and user studies conducted on real-world datasets (stock, sensor, moving object, and geolocation data), confirm both the effectiveness and the efficiency of the proposed approaches.
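LEAP is described only at a high level above, so the sketch below is not its implementation; it merely illustrates, under simplifying assumptions, one of the two stated principles, temporal priority, in a toy sliding-window, distance-threshold detector for one-dimensional values: a neighbour that arrives after a point expires later than the point itself, so once a point accumulates enough succeeding neighbours it is safe and need not be re-examined at later window slides. The window size, slide, radius, and neighbour count are arbitrary choices.

```python
from collections import deque
import random

def window_outliers(stream, window=100, slide=20, radius=1.0, k=5):
    """Toy count-based sliding-window, distance-threshold outlier detector.

    A point is an inlier if it has at least k neighbors within `radius` of it.
    A neighbor that arrives *after* a point always expires later than the
    point itself (temporal priority), so once a point has accumulated k such
    succeeding neighbors it stays an inlier for the rest of its lifetime and
    never needs to be re-checked at later window slides.
    """
    buf = deque()          # items: [value, arrival_index, succeeding_neighbors]
    flagged = set()
    for i, x in enumerate(stream):
        # the new arrival is a *succeeding* neighbor of every older point
        for item in buf:
            if abs(item[0] - x) <= radius:
                item[2] += 1
        buf.append([x, i, 0])
        if len(buf) > window:
            buf.popleft()
        if (i + 1) % slide == 0:           # report at each slide boundary
            for val, idx, succ in buf:
                if succ >= k:
                    continue               # provably safe, skip re-checking
                # otherwise fall back to counting all current neighbors
                nbrs = sum(1 for v, j, _ in buf if j != idx and abs(v - val) <= radius)
                if nbrs < k:
                    flagged.add((idx, val))
    return sorted(flagged)

# toy usage: Gaussian noise with two injected spikes
random.seed(1)
data = [random.gauss(0, 1) for _ in range(500)]
data[123], data[321] = 9.0, -8.5
print(window_outliers(data))   # the two spikes are reported as outliers
```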
136

The City as Data Machine: Local Governance in the Age of Big Data

Baykurt, Burcu January 2019 (has links)
This dissertation is a study of the social dimensions and implications of the smart city, a new kind of urbanism that augments the city's existing infrastructures with sensors, wireless communication, and software algorithms to generate unprecedented reams of real-time data. It investigates how smartness reshapes civic ties and transforms the ways of seeing and governing urban centers long plagued by racial and economic divides. How does the uneven adoption of smart technologies and data-driven practices affect the relationship between citizens and local government? What mediates the understanding and experience of urban inequalities in a data-driven city? In what ways does data-driven local governance address or exacerbate pervasive divides? The dissertation addresses these questions through three years of ethnographic fieldwork in Kansas City, where residents and public officials have partnered with Google and Cisco to test a gigabit internet service and a smart city program, respectively. I show that the foray of tech companies into cities not only changes how urban problems are identified, but also reproduces civic divides. Young, middle-class, white residents embrace the smart city with the goal of turning the city's problems into an economic opportunity, while already-vulnerable residents are reluctant to adopt what they perceive as surveillance technologies. This divide widens when the data-driven practices of the smart city compel public officials and entrepreneurial residents to feign deliberate ignorance of longstanding issues and familiar solutions, or to explore spurious connections between different datasets because of their assumptions about how creative breakthroughs surface in the smart city. These enthusiasts hope to discover connections they did not know existed, but their practices perpetuate existing stereotypes and miss underlying patterns in urban inequalities. By teasing out the intertwined relationships among tech giants, federal and local governments, local entrepreneurial groups, civic tech organizations, and nonprofits, this research demonstrates how the interests and cultural techniques of the contemporary tech industry seep into age-old governance practices of classification, record keeping, and commensuration. I find that while these new modes of knowledge production in local government restructure the ways public officials and various publics see the city, seeing like a city also shapes the possibilities and limits of governing by data.
137

Comparative Geospatial Analysis of Twitter Sentiment Data during the 2008 and 2012 U.S. Presidential Elections

Gordon, Josef 10 October 2013 (has links)
The goal of this thesis is to assess and characterize the representativeness of sampled data that is voluntarily submitted through social media. The case study uses Twitter data associated with the 2012 U.S. Presidential election, compared against similarly collected data from the 2008 election, to ascertain statewide changes in the pro-Democrat bias of sentiment-derived tweets mentioning either the Republican or the Democratic Presidential candidate. The comparative analysis shows that the mean absolute error (MAE) fell by nearly half, from 13.1% in 2008 to 7.23% in 2012, which would initially suggest a less biased sample. However, the stronger positive correlation between tweets per county and population density actually suggests a much more geographically biased sample.
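As a concrete illustration of the two quantities discussed, the hedged Python sketch below computes a state-level mean absolute error between sentiment-derived and actual vote shares, and the Pearson correlation between tweets per county and population density. The numbers are invented placeholders, not the thesis's data.

```python
import numpy as np

def mae(predicted, actual):
    """Mean absolute error between sentiment-derived and actual vote shares."""
    predicted, actual = np.asarray(predicted, float), np.asarray(actual, float)
    return np.mean(np.abs(predicted - actual))

def pearson_r(x, y):
    """Pearson correlation, e.g. between tweets per county and population density."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.corrcoef(x, y)[0, 1]

# hypothetical state-level vote shares (percent for the Democratic candidate)
sentiment_share_2012 = [52.1, 47.3, 60.0, 41.8]   # derived from tweet sentiment
actual_share_2012    = [51.0, 44.5, 58.2, 40.9]   # certified election results
print("state-level MAE:", mae(sentiment_share_2012, actual_share_2012))

# hypothetical county-level figures
tweets_per_county  = [120, 45, 3000, 800, 15]
population_density = [85.0, 30.5, 2700.0, 640.0, 12.3]   # people per sq. mile
print("tweets vs. density r:", pearson_r(tweets_per_county, population_density))
```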
138

Is operational research in UK universities fit-for-purpose for the growing field of analytics?

Mortenson, Michael J. January 2018 (has links)
Over the last decade, considerable interest has been generated in the use of analytical methods in organisations. Along with this, many have reported a significant gap between organisational demand for analytically trained staff and the number of potential recruits qualified for such roles. This interest is of high relevance to the operational research discipline, both in terms of raising the profile of the field and in the teaching and training of graduates to fill these roles. What is less clear, however, is the extent to which operational research teaching in universities, or indeed teaching on the various courses labelled as 'analytics', offers a curriculum that can prepare graduates for these roles. It is within this space that this research is positioned, specifically seeking to analyse the suitability of current provision, limited to master's education in UK universities, and to make recommendations on how curricula may be developed. To do so, a mixed-methods research design, in the pragmatic tradition, is presented. This includes a variety of research instruments. Firstly, a computational literature review of analytics is presented, assessing (amongst other things) the amount of research into analytics from a range of disciplines. Secondly, a historical analysis is performed of the literature regarding elements that can be seen as precursors of analytics, such as management information systems, decision support systems and business intelligence. Thirdly, an analysis of job adverts is included, utilising an online topic model and correlation analyses. Fourthly, online materials from UK universities concerning relevant degrees are analysed using a bagged support vector classifier and a bespoke module analysis algorithm. Finally, interviews with both potential employers of graduates and academics involved in analytics courses are presented. The results of these separate analyses are synthesised and contrasted. The outcome is an assessment of the current state of the market, some reflections on the role operational research may have, and a framework for the development of analytics curricula. The principal contribution of this work is practical: providing tangible recommendations on curriculum design and development, as well as to the operational research community in general with respect to how it may react to the growth of analytics. Additional contributions are made with respect to methodology, with a novel mixed-method approach employed, and to theory, with insights into how trends develop in both the jobs market and academia. It is hoped that these insights may be of value to course designers seeking to react to similar trends in a wide range of disciplines and fields.
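One of the instruments mentioned is a bagged support vector classifier applied to university course materials. The sketch below is a minimal, hypothetical version of that idea (TF-IDF features, stratified bootstrap resamples of linear SVMs, majority voting); it is not the author's bespoke classifier, and the example texts, labels, and bag count are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# tiny, made-up module descriptions labelled as analytics-related (1) or not (0)
texts = [
    "statistical modelling and machine learning for business analytics",
    "data mining, visualisation and predictive analytics",
    "romantic poetry of the nineteenth century",
    "organic chemistry laboratory techniques",
]
labels = np.array([1, 1, 0, 0])

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# manual bagging: train several SVMs on stratified bootstrap resamples
rng = np.random.default_rng(0)
n_bags = 15
models = []
pos_idx, neg_idx = np.where(labels == 1)[0], np.where(labels == 0)[0]
for _ in range(n_bags):
    # resample within each class so every bag contains both classes
    idx = np.concatenate([
        rng.choice(pos_idx, size=len(pos_idx), replace=True),
        rng.choice(neg_idx, size=len(neg_idx), replace=True),
    ])
    clf = LinearSVC()
    clf.fit(X[idx], labels[idx])
    models.append(clf)

def predict(descriptions):
    """Majority vote across the bagged SVMs."""
    Xq = vectorizer.transform(descriptions)
    votes = np.stack([m.predict(Xq) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)

print(predict(["big data analytics with python", "medieval history seminar"]))
```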
139

Biostatistical and meta-research approaches to assess diagnostic test use

O'Sullivan, Jack William January 2018 (has links)
The aim of this thesis was to assess test use in primary care. Test use is an essential part of general practice, yet there is surprisingly little data exploring and quantifying this activity. My overarching hypothesis was that test use in primary care is sub-optimal: specifically, that tests are overused (overtesting), ordered when they will lead to no patient benefit, and underused (undertesting), not ordered when they would lead to patient benefit. Previous metrics used to identify potential over- and undertesting have been categorised into direct and indirect measures. Indirect measures take a population-level approach and capture 'unexpected variation' in healthcare resource use, such as geographical variation. Direct measures consider individual patient data and directly compare resource use with an appropriateness criterion (such as a guideline). In this thesis, I examined three indirect measures: temporal change in test use, between-practice variation in test use, and variation between general practices in the proportion of test results that return an abnormal result. In chapter 3, I identified which tests were subject to the greatest change in their use from 2000/1 to 2015/16 in UK primary care. In chapter 4, I identified the tests subject to the greatest between-practice variation in their use in UK primary care. In chapter 5, I present a method to identify general practices whose doctors order a lower proportion of tests that return a normal result. In chapter 6, I present a method to directly quantify over- and undertesting: a systematic review of studies that measured the adherence of general practitioners' test use to guidelines. In chapter 7, acknowledging that the use of guidelines to audit general practitioners' test use is flawed (guidelines are of varying quality and not designed to dictate clinical practice), I determine the quality and reporting of guidelines and the quality of the evidence underpinning their recommendations, and I explore the association between guideline quality and non-adherence. Overall, I have shown that most tests have increased substantially in use (MRI knee, vitamin D and MRI brain the most), that there is marked between-practice variation in the use of many tests (drug monitoring, urine albumin and pelvic CT the most), and that some general practices order a significantly lower proportion of tests that return an abnormal result. I have also shown that there is marked variation in how often GPs follow guidelines, but that guidelines based on high-quality evidence are adhered to significantly more frequently. Lastly, in my Discussion chapter, I discuss the implications of my thesis, how it fits into the wider literature, and a proposed step-wise approach to systematically identify overtesting.
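The abstract does not specify the statistical machinery behind the between-practice comparisons, so the following is only an illustrative sketch, under assumed data, of one common way to flag practices whose proportion of tests returning an abnormal result is unusually low: approximate binomial funnel-plot control limits around the overall proportion. Practice names and counts are invented, and this is not presented as the thesis's actual method.

```python
import math

practices = {                 # practice: (tests ordered, abnormal results)
    "Practice A": (4000, 620),
    "Practice B": (1500, 240),
    "Practice C": (5200, 460),
    "Practice D": (800, 130),
}

total_tests = sum(n for n, _ in practices.values())
total_abnormal = sum(a for _, a in practices.values())
p0 = total_abnormal / total_tests          # overall abnormal-result proportion

z = 3.0                                    # ~99.7% control limits
for name, (n, abnormal) in practices.items():
    p = abnormal / n
    # normal approximation to the binomial lower funnel limit for n tests
    lower = p0 - z * math.sqrt(p0 * (1 - p0) / n)
    flag = "LOW (possible overtesting signal)" if p < lower else "within limits"
    print(f"{name}: {p:.3f} vs lower limit {lower:.3f} -> {flag}")
```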
140

A scalable data store and analytic platform for real-time monitoring of data-intensive scientific infrastructure

Suthakar, Uthayanath January 2017 (has links)
Real-time monitoring of data-intensive scientific infrastructures, covering jobs, data transfers, and hardware failures, is vital for efficient operation. Due to the high volume and velocity of the events produced, traditional methods are no longer optimal. Several techniques, as well as enabling architectures, are available to address the Big Data challenge. In this respect, this thesis complements existing survey work by contributing an extensive literature review of both traditional and emerging Big Data architectures. Scalability, low latency, fault tolerance, and intelligence are key challenges for the traditional architecture, whereas Big Data technologies and approaches have become increasingly popular for use cases that demand scalable, data-intensive (parallel) processing, fault tolerance through data replication, and support for low-latency computation. In the context of a scalable data store and analytics platform for monitoring data-intensive scientific infrastructure, the Lambda Architecture was adapted and evaluated on the Worldwide LHC Computing Grid and proved effective, especially for computationally and data-intensive use cases. This thesis presents an efficient strategy for the collection and storage of large volumes of data for computation: by moving the transformation logic out of the data pipeline and into the analytics layers, the architecture and overall process are simplified, processing time is reduced, untampered raw data are kept at the storage level for fault tolerance, and the required transformations can be performed when needed. An optimised Lambda Architecture (OLA) is presented, which models an efficient way of joining the batch and streaming layers with minimal code duplication while supporting scalability, low latency, and fault tolerance. Several models were evaluated: a pure streaming layer, a pure batch layer, and a combination of the two. Experimental results demonstrate that the OLA performed better than both the traditional architecture and the standard Lambda Architecture. The OLA was further enhanced with an intelligence layer for predicting data access patterns; this layer actively adapts and updates the model built by the batch layer, eliminating re-training time while providing a high level of accuracy using Deep Learning techniques. The fundamental contribution to knowledge is a scalable, low-latency, fault-tolerant, intelligent, and heterogeneous architecture for monitoring a data-intensive scientific infrastructure that can benefit from Big Data technologies and approaches.
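The abstract describes the Lambda-style split between batch and streaming layers only at a high level; the sketch below is a minimal, hypothetical illustration of the query-time merge such an architecture performs (a batch view rebuilt periodically from the raw master data plus a speed-layer delta of recent events), not the thesis's OLA or the WLCG deployment. The event fields and the metric, transfer bytes per site, are assumptions.

```python
from collections import defaultdict

master_data = []                 # immutable, append-only raw events (batch layer input)
batch_view = {}                  # precomputed aggregate, rebuilt periodically
speed_view = defaultdict(int)    # incremental aggregate of events since the last batch run

def ingest(event):
    """New event: append to raw storage and update the real-time view."""
    master_data.append(event)
    speed_view[event["site"]] += event["bytes"]

def rebuild_batch_view():
    """Batch layer: recompute the view from all raw data, then reset the speed layer."""
    global batch_view
    view = defaultdict(int)
    for event in master_data:
        view[event["site"]] += event["bytes"]
    batch_view = dict(view)
    speed_view.clear()

def query(site):
    """Serving layer: merge the (possibly stale) batch view with recent deltas."""
    return batch_view.get(site, 0) + speed_view.get(site, 0)

# toy usage
ingest({"site": "CERN", "bytes": 500})
ingest({"site": "FNAL", "bytes": 200})
rebuild_batch_view()
ingest({"site": "CERN", "bytes": 300})   # arrives after the batch run
print(query("CERN"))                      # 800: batch view (500) + speed layer (300)
```

Keeping the raw events append-only at the storage level is what allows the batch view to be recomputed, or the transformation logic changed, at any time, which mirrors the abstract's argument for moving transformations out of the ingestion pipeline and into the analytics layers.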
