11

Data Mining Academic Emails to Model Employee Behaviors and Analyze Organizational Structure

Straub, Kayla Marie 06 June 2016 (has links)
Email correspondence has become the predominant method of communication for businesses. If not for the inherent privacy concerns, this electronically searchable data could be used to better understand how employees interact. After the Enron dataset was made available, researchers were able to provide great insight into employee behaviors despite the many challenges with that dataset. The work in this thesis applies a suite of methods to an appropriately anonymized academic email dataset created from volunteers' email metadata. This new dataset, from an internal email server, is first used to validate feature extraction and machine learning algorithms in order to generate insight into the interactions within the center. Based solely on email metadata, a random forest approach models behavior patterns and predicts employee job titles with 96% accuracy. This result represents classifier performance not only on participants in the study but also on other members of the center who were connected to participants through email. Furthermore, the data revealed relationships not present in the center's formal operating structure. The culmination of this work is an organic organizational chart, which contains a fuller understanding of the center's internal structure than can be found in the official organizational chart. / Master of Science
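As a rough, hypothetical sketch of the classification setup summarized in this abstract (not the author's actual pipeline), the snippet below trains a random forest on made-up, aggregated email-metadata features to predict job titles; all feature names and values are assumptions.

```python
# Hypothetical sketch: predicting job titles from aggregated email-metadata
# features with a random forest, in the spirit of the approach described above.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Assumed per-employee features derived from email metadata (illustrative only).
df = pd.DataFrame({
    "msgs_sent_per_week":   [120, 45, 300, 80, 15, 220],
    "distinct_recipients":  [35, 12, 90, 25, 6, 70],
    "after_hours_fraction": [0.10, 0.05, 0.30, 0.12, 0.02, 0.25],
    "avg_thread_depth":     [2.1, 1.4, 3.8, 2.0, 1.1, 3.2],
    "job_title":            ["staff", "student", "director",
                             "staff", "student", "director"],
})

X = df.drop(columns="job_title")
y = df["job_title"]

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)

# Predict the title of a previously unseen employee from metadata alone.
new_employee = pd.DataFrame([{"msgs_sent_per_week": 150,
                              "distinct_recipients": 40,
                              "after_hours_fraction": 0.15,
                              "avg_thread_depth": 2.5}])
print(clf.predict(new_employee))
```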
12

CrowdCloud: Combining Crowdsourcing with Cloud Computing for SLO Driven Big Data Analysis

Flatt, Taylor 01 December 2017 (has links)
The evolution of structured data from simple rows and columns on a spreadsheet to more complex unstructured data such as tweets, videos, voice, and others has resulted in a need for more adaptive analytical platforms. It is estimated that upwards of 80% of data on the Internet today is unstructured. There is a drastic need for crowdsourcing platforms to perform better in the wake of this tsunami of data. We investigated the employment of a monitoring service which would allow the system to take corrective action in the event the results were trending away from meeting the accuracy, budget, and time SLOs. Initial implementation and system validation have shown that taking corrective action generally leads to a better success rate of reaching the SLOs. Having a system which can dynamically adjust internal parameters in order to perform better can lead to more harmonious interactions between humans and machine algorithms and to more efficient use of resources.
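The corrective-action idea in this abstract can be illustrated roughly as follows; the SLO fields, thresholds, and action names below are hypothetical and not taken from the thesis.

```python
# Illustrative sketch of an SLO monitor that suggests corrective action when
# accuracy, budget, or time metrics drift away from their targets.
from dataclasses import dataclass

@dataclass
class SLO:
    min_accuracy: float   # required answer accuracy
    max_cost: float       # budget ceiling in dollars
    max_seconds: float    # wall-clock deadline

def check_and_correct(slo: SLO, accuracy: float, cost: float, elapsed: float):
    """Return a list of (hypothetical) corrective actions for at-risk SLOs."""
    actions = []
    if accuracy < slo.min_accuracy:
        # e.g. assign more redundant workers or fall back to a stronger model
        actions.append("increase_task_redundancy")
    if cost > 0.8 * slo.max_cost:
        # approaching the budget ceiling: route easy tasks to machines only
        actions.append("shift_easy_tasks_to_machine")
    if elapsed > 0.8 * slo.max_seconds:
        # approaching the deadline: raise worker pay to speed up recruitment
        actions.append("raise_worker_incentive")
    return actions

# Example: accuracy is trending below target, so the monitor reacts.
print(check_and_correct(SLO(0.9, 100.0, 3600.0),
                        accuracy=0.84, cost=40.0, elapsed=1200.0))
```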
13

The dynamic management revolution of Big Data : A case study of Åhlen’s Big Data Analytics operation

Rystadius, Gustaf, Monell, David, Mautner, Linus January 2020 (has links)
Background: The implementation of Big Data Analytics (BDA) has drastically increased within several sectors, such as retailing. Due to its rapidly changing environment, companies have to adapt and modify their business strategies and models accordingly. The concepts of ambidexterity and agility are said to act as mediators of these changes in relation to a company's Big Data Analytics capabilities (BDAC). Problem: Research within the respective fields of dynamic mediators and BDAC has been conducted, but the investigation of specific traits of these mediators, their interconnection, and their impact on BDAC is scant. Scholars have found this surprising and have called for further empirical investigation. Purpose: This paper sought to empirically investigate which specific traits of ambidexterity and agility emerged within the BDA operation of the case company Åhlens, and how these traits are interconnected. It further studied how these traits and their interplay impact the firm's talent and managerial BDAC. Method: A qualitative case study of the retail firm Åhlens was conducted with three participants central to the firm's BDA operation. Semi-structured interviews were conducted with questions derived from a conceptual framework based upon reviewed literature and pilot interviews. The data was then analyzed and matched to the literature using a thematic analysis approach. Results: Five ambidextrous traits and three agile traits were found within Åhlens' BDA operation. Analysis of these traits showed a clear positive impact on Åhlens' BDAC when properly interconnected. Further, it was found that in the absence of such interplay, the dynamic mediators did not have as positive an impact and occasionally even had disruptive effects on the firm's BDAC. Hence, it was concluded that a proper connection between the mediators had to be present in order to successfully impact and enhance these capabilities.
14

How to capture that business value everyone talks about? : An exploratory case study on business value in agile big data analytics organizations

Svenningsson, Philip, Drubba, Maximilian January 2020 (has links)
Background: Big data analytics has been referred to as a hype over the past decade, making many organizations adopt data-driven processes to stay competitive in their industries. Many of the organizations adopting big data analytics use agile methodologies where the most important outcome is to maximize business value. Multiple scholars argue that big data analytics leads to increased business value; however, there is a theoretical gap within the literature about how agile organizations can capture this business value in a practically relevant way. Purpose: Building on a combined definition that capturing business value means being able to define, communicate, and measure it, the purpose of this thesis is to explore how agile organizations capture business value from big data analytics, as well as to find out which aspects of value are relevant when defining it. Method: This study follows an abductive research approach with a foundation in theory through the use of a qualitative research design. A single case study of Nike Inc. was conducted to generate the primary data for this thesis, in which nine participants from different domains within the organization were interviewed; the results were analysed with a thematic content analysis. Findings: The findings indicate that, in order for agile organizations to capture business value generated from big data analytics, they need to (1) define the value through a synthesized value map, (2) establish a common language with the help of a business translator and agile methods, and (3) measure the business value before, during, and after development by using individually identified KPIs derived from the business value definition.
15

Health Data Analytics: Data and Text Mining Approaches for Pharmacovigilance

Liu, Xiao January 2016 (has links)
Pharmacovigilance is defined as the science and activities relating to the detection, assessment, understanding, and prevention of adverse drug events (WHO 2004). Post-approval adverse drug events are a major health concern. They account for about 700,000 emergency department visits, 120,000 hospitalizations, and $75 billion in medical costs annually (Yang et al. 2014). However, certain adverse drug events are preventable if detected early. Timely and accurate pharmacovigilance in the post-approval period is an urgent goal of the public health system. The availability of various sources of healthcare data for analysis in recent years opens new opportunities for data-driven pharmacovigilance research. In attempting to leverage this emerging healthcare big data, pharmacovigilance research faces several challenges. Most studies in pharmacovigilance focus on structured and coded data, and therefore miss important textual data from patient social media and clinical documents in EHRs. Most prior studies develop drug safety surveillance systems using a single data source with only one data mining algorithm. The performance of such systems is hampered by bias in the data and the pitfalls of the data mining algorithms adopted. In my dissertation, I address two broad research questions: 1) How do we extract rich adverse drug event related information from textual data for active drug safety surveillance? 2) How do we design an integrated pharmacovigilance system to improve the decision-making process for drug safety regulatory intervention? To these ends, the dissertation comprises three essays. The first essay examines how to develop a high-performance information extraction framework for patient reports of adverse drug events in health social media. I found that medical entity extraction, drug-event relation extraction, and report source classification are necessary components for this task. In the second essay, I address the scalability issue of using social media for pharmacovigilance by proposing a distant supervision approach for information extraction. In the last essay, I develop a MetaAlert framework for pharmacovigilance with advanced text mining and data mining techniques to provide timely and accurate detection of adverse drug reactions. Models, frameworks, and design principles proposed in these essays advance not only pharmacovigilance research, but also more broadly contribute to health IT, business analytics, and design science research.
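A very rough, hypothetical illustration of the entity- and relation-extraction components mentioned in the first essay: dictionary matching plus sentence-level co-occurrence on a toy patient post. The lexicons and text are invented, and a real system would add supervised classification and negation handling.

```python
# Hedged sketch of drug and adverse-event extraction from a toy social-media
# post via dictionary matching and sentence-level co-occurrence (assumptions).
import re

DRUG_LEXICON = {"lipitor", "metformin", "warfarin"}
EVENT_LEXICON = {"muscle pain", "nausea", "dizziness", "rash"}

post = ("Started Lipitor last month and the muscle pain is unbearable. "
        "My doctor also has me on metformin but no nausea so far.")

drug_event_pairs = []
for sentence in re.split(r"(?<=[.!?])\s+", post.lower()):
    drugs = [d for d in DRUG_LEXICON if d in sentence]
    events = [e for e in EVENT_LEXICON if e in sentence]
    # Naive relation extraction: pair every drug and event co-occurring in the
    # same sentence. Note the second sentence yields a false positive because
    # this toy sketch does not handle negation ("no nausea").
    drug_event_pairs += [(d, e) for d in drugs for e in events]

print(drug_event_pairs)  # includes ('lipitor', 'muscle pain')
```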
16

Estimating Bus Passengers' Origin-Destination of Travel Route Using Data Analytics on Wi-Fi and Bluetooth Signals

Jalali, Shahrzad 16 May 2019 (has links)
Accurate estimation of the Origin and Destination (O-D) of passengers has been an essential objective for public transit agencies because knowledge of passenger flow enables them to forecast ridership and plan bus schedules and routes. However, obtaining O-D information in traditional ways, such as conducting surveys, cannot fulfill today's requirements of intelligent transportation and route planning in smart cities. Estimating bus passengers' O-D using Wi-Fi and Bluetooth signals detected from their mobile devices is the primary objective of this project. For this purpose, we collected anonymized passenger data using the SMATS TrafficBox™ sensor provided by SMATS Traffic Solutions. We then performed pre-processing steps, including data cleaning, feature extraction, and data normalization, and built various models using data mining techniques. The main challenge in this project was to distinguish between passengers' and non-passengers' signals, since the sensor captures all signals in its surrounding environment, including substantial noise from devices outside of the bus. To address this challenge, we applied hierarchical and K-Means clustering algorithms to separate passengers' from non-passengers' signals automatically. By assigning GPS data to passengers' signals, we could find commuters' O-D. Moreover, we developed a second method based on an online analysis of sequential data, where specific thresholds were set to recognize passengers' signals in real time. This method could create the O-D matrix online. Finally, in the validation phase, we compared the ground-truth data with the estimated O-D matrices from both approaches and calculated their accuracy. Based on the final results, our proposed approaches can detect more than 20% of passengers (compared to the 5% detection rate of traditional survey-based methods), and estimate the origin and destination of passengers with an accuracy of about 93%. With such promising results, these approaches are suitable alternatives to traditional and time-consuming ways of obtaining O-D data. This enables public transit companies to enhance their service offering by efficiently planning and scheduling bus routes, improving ride comfort, and lowering the operating costs of urban transportation.
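A hypothetical sketch of the clustering step described in this abstract: K-Means over two assumed per-device features (detection count and detection span) to separate on-board passengers from roadside noise. The features and values are illustrative, not the project's actual data.

```python
# Hypothetical sketch: separating passenger from non-passenger Wi-Fi/Bluetooth
# detections by clustering per-device detection features.
import numpy as np
from sklearn.cluster import KMeans

# Assumed features per detected MAC address:
# [number of detections on the trip, minutes between first and last detection]
devices = np.array([
    [42, 25.0],   # detected many times over a long span -> likely on board
    [38, 22.5],
    [55, 30.0],
    [2, 0.5],     # seen only briefly -> likely a device outside the bus
    [1, 0.2],
    [3, 0.8],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(devices)

# The cluster whose centroid has the longer detection span is taken to be
# the passengers; the other cluster is treated as roadside noise.
passenger_cluster = int(np.argmax(kmeans.cluster_centers_[:, 1]))
is_passenger = kmeans.labels_ == passenger_cluster
print("passenger flags:", is_passenger)
```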
17

Outlier Detection In Big Data

Cao, Lei 29 March 2016 (has links)
This dissertation focuses on scaling outlier detection to work both on huge static and on dynamic streaming datasets. Outliers are patterns in the data that do not conform to the expected behavior. Outlier detection techniques are broadly applied in applications ranging from credit fraud prevention and network intrusion detection to stock investment tactical planning. For such mission-critical applications, a timely response is often of paramount importance. Yet processing outlier detection requests is algorithmically complex and resource-intensive. In this dissertation we investigate the challenges of detecting outliers in big data -- in particular those caused by the high velocity of streaming data, the big volume of static data, and the large cardinality of the input parameter space for tuning outlier mining algorithms. Effective optimization techniques are proposed to assure the responsiveness of outlier detection in big data. We first propose a novel optimization framework called LEAP to continuously detect outliers over data streams. The continuous discovery of outliers is critical for a large range of online applications that monitor high-volume, continuously evolving streaming data. LEAP encompasses two general optimization principles that utilize the rarity of outliers and the temporal priority relationships among stream data points. Leveraging these two principles, LEAP is not only able to continuously deliver outliers with respect to a set of popular outlier models, but also provides near real-time support for processing powerful outlier analytics workloads composed of large numbers of outlier mining requests with various parameter settings. Second, we develop a distributed approach to efficiently detect outliers over massive-scale static data sets. In this big data era, as the volume of data advances to new levels, the power of distributed compute clusters must be employed to detect outliers in a short turnaround time. Our approach optimizes the key factors determining the efficiency of distributed data analytics, namely communication costs and load balancing. In particular, we prove that the traditional frequency-based load-balancing assumption is not effective. We thus design a novel cost-driven data partitioning strategy that achieves load balancing. Furthermore, we abandon the traditional approach of using one detection algorithm for all compute nodes and instead propose a novel multi-tactic methodology which adaptively selects the most appropriate algorithm for each node based on the characteristics of the data partition assigned to it. Third, traditional outlier detection systems process each individual outlier detection request, instantiated with a particular parameter setting, one at a time. This is not only prohibitively time-consuming for large datasets, but also tedious for analysts as they explore the data to home in on the most appropriate parameter setting or on the desired results. We thus design an interactive outlier exploration paradigm that is not only able to answer traditional outlier detection requests in near real-time, but also offers innovative outlier analytics tools to assist analysts in quickly extracting, interpreting, and understanding the outliers of interest. Our experimental studies, including performance evaluations and user studies conducted on real-world datasets (stock, sensor, moving object, and geolocation), confirm both the effectiveness and efficiency of the proposed approaches.
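As a rough illustration of the kind of sliding-window, distance-based streaming outlier detection this dissertation optimizes (the sketch shows only the basic definition, not the LEAP optimizations): a point is flagged when it has fewer than K neighbors within radius R in the current window. The window size, K, R, and the sample stream are assumptions.

```python
# Illustrative distance-based outlier check over a sliding window of a stream.
# A point is an outlier if fewer than K neighbors lie within radius R of it
# among the points currently in the window (parameters are assumptions).
from collections import deque

WINDOW, K, R = 8, 2, 1.5

def stream_outliers(stream):
    """Yield window points lacking K neighbors within radius R, evaluated
    once the window is full."""
    window = deque(maxlen=WINDOW)
    for x in stream:
        window.append(x)
        if len(window) < WINDOW:
            continue  # wait until the window is full before reporting
        for i, p in enumerate(window):
            neighbors = sum(1 for j, q in enumerate(window)
                            if j != i and abs(q - p) <= R)
            if neighbors < K:
                yield p

values = [10.1, 10.3, 9.8, 10.0, 25.0, 10.2, 9.9, 10.4]
print(sorted(set(stream_outliers(values))))  # the 25.0 spike is flagged
```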
18

Big Data Analytics and Engineering for Medicare Fraud Detection

Unknown Date (has links)
The United States (U.S.) healthcare system produces an enormous volume of data, with a vast number of financial transactions generated by physicians administering healthcare services. This makes healthcare fraud difficult to detect, especially when there are considerably fewer fraudulent transactions than non-fraudulent ones. Fraud is an extremely important issue for healthcare, as fraudulent activities within the U.S. healthcare system contribute to significant financial losses. In the U.S., the elderly population continues to rise, increasing the need for programs, such as Medicare, to help with associated medical expenses. Unfortunately, due to healthcare fraud, these programs are being adversely affected, draining resources and reducing the quality and accessibility of necessary healthcare services. In response, advanced data analytics have recently been explored to detect possible fraudulent activities. The Centers for Medicare and Medicaid Services (CMS) released several ‘Big Data’ Medicare claims datasets for different parts of their Medicare program to help facilitate this effort. In this dissertation, we employ three CMS Medicare Big Data datasets to evaluate the fraud detection performance available using advanced data analytics techniques, specifically machine learning. We use two distinct approaches, designated as anomaly detection and traditional fraud detection, each with very distinct data processing and feature engineering. Anomaly detection experiments classify by provider specialty, determining whether outlier physicians within the same specialty signal fraudulent behavior. Traditional fraud detection refers to the experiments directly classifying physicians as fraudulent or non-fraudulent, leveraging machine learning algorithms to discriminate between the classes. We present our novel data engineering approaches for both anomaly detection and traditional fraud detection, including data processing, fraud mapping, and the creation of a combined dataset consisting of all three Medicare parts. We incorporate the List of Excluded Individuals and Entities database to identify real-world fraudulent physicians for model evaluation. Regarding features, the final datasets for anomaly detection contain only claim counts for every procedure a physician submits, while traditional fraud detection incorporates aggregated counts and payment information, specialty, and gender. Additionally, we compare cross-validation to the real-world application of building a model on a training dataset and evaluating it on a separate test dataset, under severe class imbalance and rarity. / Includes bibliography. / Dissertation (Ph.D.)--Florida Atlantic University, 2019. / FAU Electronic Theses and Dissertations Collection
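A hedged sketch of the "train on one dataset, evaluate on a separate, severely imbalanced test set" setup described in this abstract, using synthetic data; the features, class ratio, and classifier choice are illustrative assumptions, not the CMS data or the dissertation's models.

```python
# Hedged sketch of imbalanced fraud classification with a held-out test set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 6))             # stand-in for aggregated claim features
y = (rng.random(n) < 0.01).astype(int)  # ~1% "fraudulent" labels (rare class)
X[y == 1] += 1.5                        # give the rare class a detectable shift

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0)
clf.fit(X_tr, y_tr)

# With severe imbalance, raw accuracy is misleading; AUC scores the ranking.
scores = clf.predict_proba(X_te)[:, 1]
print("test AUC:", roc_auc_score(y_te, scores))
```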
19

Data analytics, interpretation and machine learning for environmental forensics using peak mapping methods

Ghasemi Damavandi, Hamidreza 01 August 2016 (has links)
In this work our driving motivation is to develop mathematically robust and computationally efficient algorithms that will help chemists toward their goal of pattern matching. Environmental chemistry today broadly faces difficult computational and interpretational challenges for vast and ever-increasing data repositories. A driving factor behind these challenges is the little-known, intricate relationships between the constituent analytes that make up complex mixtures spanning a range of target and non-target compounds. While the end goals of different environmental applications are diverse, computationally speaking, many data interpretation bottlenecks arise from a lack of efficient algorithms and robust mathematical frameworks to identify, cluster, and interpret compound peaks. There is a compelling need for compound-cognizant quantitative interpretation that accounts for the full informational range of gas chromatographic (and mass spectrometric) datasets. Traditional target-oriented analysis focuses only on the dominant compounds of the chemical mixture, and thus is agnostic of the contribution of unknown non-target analytes. On the other extreme, statistical methods prevalent in chemometric interpretation ignore compound identity altogether and consider only the multivariate data statistics, and thus are agnostic of intrinsic relationships between the well-known target and unknown non-target analytes. Thus, both schools of thought (target-based or statistical) in current-day chemical data analysis and interpretation fall short of quantifying the complex interaction between major and minor compound peaks in the molecular mixtures commonly encountered in environmental toxin studies. Such insights are not revealed by these standard techniques unless a deeper analysis of these patterns is taken into account within a quantitative mathematical framework that is at once compound-cognizant and comprehensive in its coverage of all peaks, major and minor. This thesis aims to meet this grand challenge using a combination of signal processing, pattern recognition, and data engineering techniques. We focus on petroleum biomarker analysis and polychlorinated biphenyl (PCB) congener studies in human breast milk as our target applications. We propose a novel approach to chemical data analytics and interpretation that bridges the gap between the target-cognizant traditional analysis of environmental chemistry and the compound-agnostic computational methods of chemometric data engineering. Specifically, we propose computational methods for target-cognizant data analytics that also account for local unknown analytes allied to the established target peaks. The key intuition behind our methods is based on the underlying topography of the gas chromatographic landscape, and we extend recent peak mapping methods as well as propose novel peak clustering and peak neighborhood allocation methods to achieve our data analytic aims. Data-driven results based on a multitude of environmental applications are presented.
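A toy, hypothetical illustration of the peak-oriented view this abstract describes: detect peaks in synthetic chromatogram-like signals and match them across samples by retention-time proximity. The signal shapes, detection thresholds, and matching tolerance are assumptions, not the thesis's actual peak-mapping method.

```python
# Illustrative sketch: peak detection in synthetic chromatograms and a crude
# cross-sample peak matching by retention time (all parameters are assumptions).
import numpy as np
from scipy.signal import find_peaks

t = np.linspace(0, 30, 3000)  # pseudo retention time (minutes)

def synthetic_chromatogram(centers, heights):
    """Sum of Gaussian peaks plus mild noise (toy stand-in for GC data)."""
    rng = np.random.default_rng(1)
    signal = sum(h * np.exp(-((t - c) ** 2) / 0.02)
                 for c, h in zip(centers, heights))
    return signal + 0.01 * rng.normal(size=t.size)

sample_a = synthetic_chromatogram([5.0, 12.3, 21.7], [1.0, 0.6, 0.8])
sample_b = synthetic_chromatogram([5.05, 12.25, 18.0], [0.9, 0.7, 0.5])

peaks = {}
for name, sig in (("A", sample_a), ("B", sample_b)):
    idx, _ = find_peaks(sig, height=0.2, distance=50)
    peaks[name] = t[idx]

# Group peaks from the two samples that fall within a retention-time tolerance.
TOL = 0.2
matches = [(a, b) for a in peaks["A"] for b in peaks["B"] if abs(a - b) < TOL]
print("peaks A:", np.round(peaks["A"], 2))
print("peaks B:", np.round(peaks["B"], 2))
print("matched peak pairs:", [(round(a, 2), round(b, 2)) for a, b in matches])
```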
20

Data analytics for networked and possibly private sources

Wang, Ting 05 April 2011 (has links)
This thesis focuses on two grand challenges facing designers and operators of data analytics systems today. First, how to fuse information from multiple autonomous yet correlated sources and provide consistent views of the underlying phenomena? Second, how to respect externally imposed constraints (privacy concerns in particular) without compromising the efficacy of the analysis? To address the first challenge, we apply a general correlation network model to capture the relationships among data sources, and propose Network-Aware Analysis (NAA), a library of novel inference models, to capture (i) how the correlation of the underlying sources is reflected in the spatial and/or temporal relevance of the collected data, and (ii) how to track causality in the data arising from the dependency of the data sources. We have also developed a set of space- and time-efficient algorithms to address (i) how to correlate relevant data and (ii) how to forecast future data. To address the second challenge, we further extend the concept of a correlation network to encode the semantic (possibly virtual) dependencies and constraints among the entities in question (e.g., medical records). We show through a set of concrete cases that correlation networks convey significant utility for their intended applications, and meanwhile are often used as a stepping stone by adversaries to perform inference attacks. Using correlation networks as the pivot for analyzing privacy-utility trade-offs, we propose Privacy-Aware Analysis (PAA), a general design paradigm for constructing analytical solutions with theoretical backing for both privacy and utility.
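A minimal, hypothetical sketch of the correlation-network idea in this abstract: estimate pairwise correlations between source streams and keep the strong ones as network edges. The source names, data, and threshold are assumptions, not the thesis's NAA models.

```python
# Minimal sketch of a correlation-network view of multiple data sources:
# estimate pairwise correlations between source streams and keep the strong
# ones as network edges.
import numpy as np

rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)

# Three hypothetical sources; A and B observe related phenomena, C does not.
sources = {
    "sensor_A": base + 0.1 * rng.normal(size=n),
    "sensor_B": 0.8 * base + 0.3 * rng.normal(size=n),
    "sensor_C": rng.normal(size=n),
}

names = list(sources)
corr = np.corrcoef([sources[s] for s in names])

THRESHOLD = 0.5
edges = [(names[i], names[j], round(corr[i, j], 2))
         for i in range(len(names)) for j in range(i + 1, len(names))
         if abs(corr[i, j]) >= THRESHOLD]
print("correlation-network edges:", edges)
```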
