441 |
Opportunities and challenges of Big Data Analytics in healthcare: An exploratory study on the adoption of big data analytics in the management of Sickle Cell Anaemia. Saenyi, Betty. January 2018.
Background: With increasing technological advancements, healthcare providers are adopting electronic health records (EHRs) and new health information technology systems. Consequently, data from these systems is accumulating at a faster rate, creating a need for more robust ways of capturing, storing and processing it. Big data analytics is used to extract insight from such large amounts of medical data and is increasingly becoming a valuable practice for healthcare organisations. Could these strategies be applied in disease management, especially in rare conditions like Sickle Cell Disease (SCD)? The study answers the following research questions:
1. What data management practices are used in sickle cell anaemia management?
2. What areas in the management of sickle cell anaemia could benefit from the use of big data analytics?
3. What are the challenges of applying big data analytics in the management of sickle cell anaemia?
Purpose: The purpose of this research was to serve as a pre-study establishing the opportunities and challenges of applying big data analytics in the management of SCD.
Method: The study adopted both deductive and inductive approaches. Data was collected through interviews based on a framework modified specifically for this study, and was then inductively analysed to answer the research questions.
Conclusion: Although big data analytics holds considerable potential for SCD in areas like population health management, evidence-based medicine and personalised care, its adoption is not a certainty, owing to the lack of interoperability between existing systems and the strenuous legal compliance processes involved in data acquisition.
|
442 |
Development of computational approaches for whole-genome sequence variation and deep phenotyping. Haimel, Matthias. January 2019.
The rare disease pulmonary arterial hypertension (PAH) results in high blood pressure in the lung caused by narrowing of the lung arteries. Genes causative in PAH were discovered through family studies and very often harbour rare variants. However, the genetic cause in heritable (31%) and idiopathic (79%) PAH cases is not yet known, but these cases are speculated to be caused by rare variants. Advances in high-throughput sequencing (HTS) technologies have made it possible to detect variants in 98% of the human genome. A drop in sequencing costs made it feasible to sequence 10,000 individuals, including 1,250 subjects diagnosed with PAH and their relatives, as part of the NIHR BioResource - Rare Diseases (BR-RD) study. This large cohort allows the genome-wide identification of rare variants to discover novel causative genes associated with PAH in a case-control study, advancing our understanding of the underlying aetiology. In the first part of my thesis, I establish a phenotype capture system that allows research nurses to record clinical measurements and other patient-related information for PAH patients recruited to the NIHR BR-RD study. The implemented extensions provide programmatic data transfer and an automated data release pipeline for analysis-ready data. The second part is dedicated to the discovery of novel disease genes in PAH. I focus on one well-characterised PAH disease gene to establish variant filter strategies that enrich for rare disease-causing variants. I apply these filter strategies to all known PAH disease genes and describe the phenotypic differences based on clinically relevant values. Genome-wide results from different filter strategies are tested for association with PAH. I describe the findings of the rare variant association tests and provide a detailed interrogation of two novel disease genes.
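The abstract does not spell out the filter strategies themselves; the sketch below shows the general shape of a rare-variant filter. The field names and thresholds (population allele frequency, consequence classes, call quality) are illustrative assumptions, not values from the thesis.

```python
# Illustrative rare-variant filter: thresholds and field names are
# assumptions for this sketch, not taken from the thesis.
RARE_AF = 0.0001   # assumed population allele-frequency cutoff
DAMAGING = {"stop_gained", "frameshift_variant", "missense_variant"}

def passes_filter(variant):
    """Keep variants that are rare in the reference population,
    predicted damaging, and confidently called."""
    return (variant["pop_af"] <= RARE_AF
            and variant["consequence"] in DAMAGING
            and variant["call_quality"] >= 30)

variants = [
    {"pop_af": 0.00005, "consequence": "stop_gained", "call_quality": 99},
    {"pop_af": 0.21, "consequence": "missense_variant", "call_quality": 99},
    {"pop_af": 0.00001, "consequence": "synonymous_variant", "call_quality": 99},
]
kept = [v for v in variants if passes_filter(v)]   # only the first survives
```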
The last part describes the data characteristics of variant information and available NoSQL (non-SQL) implementations, and evaluates the suitability and scalability of distributed compute frameworks for storing and analysing population-scale variation data. Based on this evaluation, I implement a variant analysis platform that incrementally merges samples, annotates variants and enables the analysis of 10,000 individuals in minutes. An incremental design for variant merging and annotation has not been described before. Using the framework, I develop a quality score to reduce technical variation and other biases. The results from the rare variant association test are compared with traditional methods.
|
443 |
Big Data : le nouvel enjeu de l'apprentissage à partir des données massives / Big Data: the new challenge of learning from massive data. Adjout Rehab, Moufida. 01 April 2016.
The intersection of globalisation and the continuous development of information technologies has led to an explosion in the volume of available data. Capacities for producing, storing and processing data have crossed such a threshold that a new term has been put forward: Big Data. The increase in the quantities of data to be considered requires new processing tools. Indeed, classical learning tools are poorly suited to this change in volume, both in computational complexity and in the time required for processing. Moreover, processing is most often centralised and sequential, which makes learning methods dependent on the capacity of the machine used. Consequently, the difficulties in analysing a large dataset are manifold.
In this thesis, we are interested in the problems encountered by supervised learning on large volumes of data. To face these new challenges, new processes and methods must be developed to make the best use of all the available data. The objective of this thesis is to explore the approach of designing scalable versions of these classical methods, relying on the distribution of both processing and data to increase the capacity of the methods without harming their accuracy.
Our contribution consists of two parts, each proposing a new learning approach for massive data processing. Both contributions fall within the domain of supervised predictive learning from voluminous data, covering Multiple Linear Regression and ensemble methods such as Bagging.
The first contribution, named MLR-MR, concerns scaling up Multiple Linear Regression by distributing the processing over a cluster of machines. The goal is to optimise the processing and the induced computational load without changing the underlying computation (QR factorisation), so that the method yields the same coefficients as the classical approach.
The second contribution, called "Bagging MR_PR_D" (Bagging-based MapReduce with Distributed PRuning), implements a scalable approach to Bagging that allows distributed processing at two levels: learning and pruning of the models. Its aim is to design an algorithm that is efficient and scalable across all processing phases (learning and pruning), thereby guaranteeing a broad spectrum of applications.
Both approaches were evaluated on a variety of regression datasets ranging from a few thousand to several million examples. The experimental results show that the proposed approaches are competitive in predictive performance while significantly reducing training and prediction time, thanks to distributed processing in cloud computing.
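The claim that a distributed computation can reproduce exactly the coefficients of classical Multiple Linear Regression rests on the QR factorisation. A minimal divide-and-conquer sketch in the spirit of MLR-MR is shown below; numpy and in-process blocks stand in for the cluster, and the data and block count are assumptions of the sketch.

```python
import numpy as np

def distributed_lstsq(X_blocks, y_blocks):
    """Divide-and-conquer least squares: QR-factor each data block,
    then combine the small R factors in a second-level reduction.
    This yields the same coefficients as one centralized QR on the
    stacked data, since X = (blockdiag(Q_b) @ Q2) @ R2."""
    Rs, Qtys = [], []
    for Xb, yb in zip(X_blocks, y_blocks):
        Q, R = np.linalg.qr(Xb)       # local factorization per "worker"
        Rs.append(R)
        Qtys.append(Q.T @ yb)
    Q2, R2 = np.linalg.qr(np.vstack(Rs))   # reduce the partial factors
    rhs = Q2.T @ np.concatenate(Qtys)
    return np.linalg.solve(R2, rhs)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.01, size=1000)
beta = distributed_lstsq(np.split(X, 4), np.split(y, 4))
```

The reduction step is associative, which is what makes a MapReduce-style implementation possible: partial `R` factors can be merged pairwise in any order.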
|
444 |
Extreme Learning Machines: novel extensions and application to Big Data. Akusok, Anton. 01 May 2016.
Extreme Learning Machine (ELM) is a recently introduced way of training single-layer feed-forward neural networks with an explicitly given solution, which exists because the input weights and biases are generated randomly and never change. The method generally achieves performance comparable to error back-propagation, but with training times up to five orders of magnitude smaller. Despite the random initialization, the regularization procedures explained in the thesis ensure consistently good results.
While the general methodology of ELMs is well developed, the sheer speed of the method enables its untypical usage in state-of-the-art techniques based on repetitive model re-training and re-evaluation. Three such techniques are explained in the third chapter: a way of visualizing high-dimensional data onto a provided fixed set of visualization points, an approach for detecting samples in a dataset with incorrect labels (mistakenly assigned, mistyped, or low-confidence), and a way of computing confidence intervals for ELM predictions. All three methods prove useful and allow even more applications in the future.
The ELM method is a promising basis for dealing with Big Data because it naturally handles the problem of large data size. An adaptation of ELM to Big Data problems, and a corresponding toolbox (published and freely available), are described in chapter 4. The adaptation includes an iterative solution of ELM that satisfies limited computer memory constraints and allows for convenient parallelization. Other tools are GPU-accelerated computations and support for a convenient huge-data storage format. The chapter also provides two real-world examples of dealing with Big Data using ELMs, which present other problems of Big Data, such as veracity and velocity, and solutions to them in the particular problem context.
|
445 |
Fast demand response with datacenter loads: a green dimension of big data. McClurg, Josiah. 01 August 2017.
Demand response is one of the critical technologies necessary for allowing large-scale penetration of intermittent renewable energy sources in the electric grid. Data centers are especially attractive candidates for providing flexible, real-time demand response services to the grid because they are capable of fast power ramp rates, a large dynamic range, and finely controllable power consumption. This thesis makes a contribution toward implementing load shaping with server clusters through a detailed experimental investigation of three broadly applicable datacenter workload scenarios. We experimentally demonstrate the feasibility of datacenter demand response with a distributed video transcoding application and a simple distributed power controller. We also show that while some software power-capping interfaces performed better than others, all the interfaces we investigated had the high dynamic range and low power variance required to achieve high-quality power tracking. Our next investigation presents an empirical performance evaluation of algorithms that replace arithmetic operations with low-level bit operations for power-aware Big Data processing. Specifically, we compare two different data structures in terms of execution time and power efficiency: (a) a baseline design using arrays, and (b) a design using bit-slice indexing (BSI) and distributed BSI arithmetic. Across three different datasets and three popular queries, we show that the bit-slicing queries consistently outperform the array algorithm in both power efficiency and execution time. In the context of datacenter power shaping, this performance optimization enables additional power flexibility, achieving the same or greater performance than the baseline approach even under power constraints.
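Bit-slice indexing stores bit k of every value as its own bitmap, so aggregates reduce to bitwise operations and popcounts instead of per-row arithmetic. A minimal sketch using Python integers as bitmaps (illustrative only, not the thesis implementation):

```python
def build_bsi(values, bits=8):
    """One bitmap per bit position: bit j of slices[k] is set exactly
    when bit k of values[j] is set."""
    slices = [0] * bits
    for j, v in enumerate(values):
        for k in range(bits):
            if (v >> k) & 1:
                slices[k] |= 1 << j
    return slices

def bsi_sum(slices):
    """Sum of all indexed values from popcounts of the slices:
    sum_k popcount(slice_k) * 2**k, with no per-row loop."""
    return sum(bin(s).count("1") << k for k, s in enumerate(slices))

values = [3, 7, 0, 5]
total = bsi_sum(build_bsi(values))   # 15, same as sum(values)
```

On real hardware the popcount is a single instruction over machine words, which is the source of the execution-time and power advantage the abstract reports.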
The investigation of read-optimized index queries leads to an experimental investigation of the tradeoffs among power constraint, query freshness, and update aggregation size in a dynamic big data environment. We compare several update strategies, presenting a bitmap update optimization that improves performance over both a baseline approach and an existing state-of-the-art update strategy. Performing this investigation in the context of load shaping, we show that read-only range queries can be served without performance impact under a power cap, and that index updates can be tuned to provide a flexible base load. The thesis concludes with a brief discussion of the control implementation and a summary of our findings.
|
446 |
An Exploratory Statistical Method For Finding Interactions In A Large Dataset With An Application Toward Periodontal Diseases. Lambert, Joshua. 01 January 2017.
It is estimated that periodontal disease affects up to 90% of the adult population. Given the complexity of the host environment, many factors contribute to expression of the disease. Age, gender, socioeconomic status, smoking status, and race/ethnicity are all known risk factors, as are a handful of known comorbidities. Certain vitamins and minerals have been shown to be protective against the disease, while some toxins and chemicals have been associated with increased prevalence. The role of toxins, chemicals, vitamins, and minerals in relation to the disease is believed to be complex and potentially modified by known risk factors. A large comprehensive dataset from the 1999-2003 National Health and Nutrition Examination Survey (NHANES) contains full and partial mouth examinations for the measurement of periodontal disease, as well as patient demographic information and approximately 150 environmental variables. In this dissertation, a Feasible Solution Algorithm (FSA) is used to investigate statistical interactions among these various chemical and environmental variables in relation to periodontal disease. This sequential algorithm can be applied with traditional statistical modeling methods to explore two- and three-way interactions related to the outcome of interest. FSA can also be used to identify unique subgroups of patients in whom periodontitis is most (or least) prevalent. In this dissertation, FSA is used to explore the NHANES data and suggest interesting relationships between the toxins, chemicals, vitamins, minerals, and known risk factors that have not been previously identified.
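The sequential search can be sketched as follows: start from a random set of k variables, repeatedly evaluate every single-variable swap, and keep the best one until no swap improves the criterion. The criterion below (absolute correlation of the interaction product with the outcome) and the toy data are stand-ins for the model-based criteria used in the dissertation.

```python
import numpy as np

def fsa(X, y, k=2, n_starts=10, seed=0):
    """Sketch of a Feasible Solution Algorithm for k-way interactions:
    greedy one-variable-at-a-time swaps from random starts, each run
    ending at a locally optimal ("feasible") solution."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]

    def score(cols):
        prod = X[:, list(cols)].prod(axis=1)      # interaction term
        return abs(np.corrcoef(prod, y)[0, 1])    # stand-in criterion

    best, best_s = None, -1.0
    for _ in range(n_starts):
        cols = list(rng.choice(p, size=k, replace=False))
        while True:
            swaps = [cols[:i] + [c] + cols[i + 1:]
                     for i in range(k) for c in range(p) if c not in cols]
            cand = max(swaps, key=score)          # best single swap
            if score(cand) <= score(cols):
                break                             # local optimum reached
            cols = cand
        if score(cols) > best_s:
            best, best_s = sorted(cols), score(cols)
    return best

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))
y = X[:, 2] * X[:, 7] + 0.1 * rng.normal(size=500)   # planted interaction
found = fsa(X, y, k=2)
```

Each run costs only k*(p-k) criterion evaluations per sweep, which is why the approach scales to the roughly 150 environmental variables in NHANES where exhaustive enumeration of three-way interactions would not.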
|
447 |
ESSAYS ON EXTERNAL FORCES IN CAPITAL MARKETS. Painter, Marcus. 01 January 2019.
In the first chapter, I find that counties more likely to be affected by climate change pay more in underwriting fees and initial yields to issue long-term municipal bonds than counties unlikely to be affected by climate change. This difference disappears when comparing short-term municipal bonds, implying that the market prices climate change risks only for long-term securities. Higher issuance costs for climate-risk counties are driven by bonds with lower credit ratings. Investor attention is a driving factor, as the difference in issuance costs between bonds issued by climate-affected and non-climate-affected counties increases after the release of the 2006 Stern Review on climate change. In the second chapter, I document the investment value of alternative data and examine how market participants react to the data's dissemination. Using satellite images of the parking lots of US retailers, I find that a long-short trading strategy based on growth in car count earns an alpha of 1.6% per month. I then show that, after the release of the satellite data, hedge fund trades are more sensitive to growth in car count and are more profitable in affected stocks. Conversely, individual investor demand becomes less sensitive to growth in car count and less profitable in affected stocks. Further, the increase in information asymmetry between investors due to the availability of alternative data leads to a decrease in the liquidity of affected firms.
|
448 |
DATA COLLECTION FRAMEWORK AND MACHINE LEARNING ALGORITHMS FOR THE ANALYSIS OF CYBER SECURITY ATTACKS. Unknown date.
The integrity of network communications is constantly being challenged by more sophisticated intrusion techniques. Attackers are shifting to stealthier and more complex forms of attack in an attempt to bypass known mitigation strategies. Moreover, many detection methods for popular network attacks have been developed using outdated or non-representative attack data. To develop modern detection methodologies effectively, there is a need to acquire data that can fully encompass the behaviors of persistent and emerging threats. When collecting modern-day network traffic for intrusion detection, substantial amounts of traffic can be gathered, much of which consists of relatively few attack instances compared to normal traffic. This skewed distribution between normal and attack data can lead to high levels of class imbalance. Machine learning techniques can aid in attack detection, but large imbalances between normal (majority) and attack (minority) instances can lead to inaccurate detection results. Dissertation (Ph.D.), Florida Atlantic University, 2019.
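A common first response to the class imbalance described above is resampling before training. A minimal sketch of random undersampling of the majority class follows; the record layout and the 1:1 target ratio are illustrative assumptions, not the dissertation's method.

```python
import random

def undersample(records, label_key="label", minority="attack", seed=0):
    """Randomly discard majority-class records until the two classes are
    balanced 1:1. Field names and the target ratio are illustrative."""
    rng = random.Random(seed)
    minority_recs = [r for r in records if r[label_key] == minority]
    majority_recs = [r for r in records if r[label_key] != minority]
    kept_majority = rng.sample(majority_recs, k=len(minority_recs))
    balanced = minority_recs + kept_majority
    rng.shuffle(balanced)
    return balanced

# 98 normal flows, 2 attacks -> 2 normal flows, 2 attacks
traffic = [{"label": "normal"}] * 98 + [{"label": "attack"}] * 2
balanced = undersample(traffic)
```

Undersampling trades information loss for balance; oversampling the minority class or cost-sensitive learning are the usual alternatives when the attack instances are too few to discard data.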
|
449 |
Organizational Success in the Big Data Era: Development of the Albrecht Data-Embracing Climate Scale (ADEC). Albrecht, Lauren Rebecca. 01 September 2016.
In today’s information age, technological advances in virtually every industry allow organizations, both big and small, to create and store more data than ever before. Though data are highly abundant, they are still often an underutilized resource for improving organizational performance. The popularity of and intrigue around big data specifically have opened up new opportunities to study how organizations embrace evidence and use it to improve their business. Generally, the focus of big data has been on specific technologies, techniques, or uses in everyday life; what has been critically missing from the conversation, however, is the consideration of culture and climate in supporting effective data use in organizations. Currently, many organizations want to develop a data-embracing climate or make changes so that their existing climates become more data-informed. The purpose of this project was to develop a scale to assess the current state of data usage in organizations, which can be used to help organizations measure how well they manage, share, and use data to make informed decisions. I defined the phenomenon of a data-embracing climate based on a review of a broad range of business, computer science, and industrial-organizational psychology literature. Using this definition, I developed a scale to measure the newly defined construct by first conducting an exploratory factor analysis, then an item retranslation task, and finally a confirmatory factor analysis. This research provides support for the reliability and validity of the Albrecht Data-Embracing Climate Scale (ADEC); however, future work in this new area of research would benefit from replicating the results of this study and gaining further support for the new construct. Implications for science and practice are discussed. I sought to make a valuable contribution to the field of I-O psychology and a useful instrument for researchers and practitioners in multiple and diverse fields.
I hope others will benefit from this scale to measure how organizations use evidence from data to make informed decisions and gain a competitive advantage beyond intuition alone. Do not cite without express permission from the author.
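Scale-development work of this kind standardly reports internal-consistency reliability alongside the factor analyses. A minimal sketch of Cronbach's alpha follows (the general formula only; the data shown are toy responses, not the ADEC sample):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a respondents x items score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)         # per-item sample variance
    total_var = items.sum(axis=1).var(ddof=1)     # variance of summed scale
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Perfectly correlated items give an alpha of 1
responses = np.array([[1, 1], [2, 2], [3, 3], [4, 4]])
alpha = cronbach_alpha(responses)   # 1.0
```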
|
450 |
Essays in History and Spatial Economics with Big Data. Lee, Sun Kyoung. January 2019.
This dissertation contains three essays in history and spatial economics with big data. As part of my dissertation, I develop a modern machine-learning-based approach to connecting large datasets. Merging several massive databases and matching the records within them presents challenges, some straightforward and others more complex. I employ artificial intelligence and machine learning technologies to link, and then analyze, massive amounts of historical US federal census, Department of Labor, and Bureau of Labor Statistics data.
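The record-linkage problem can be illustrated with a toy similarity-based linker, with string similarity from the standard library standing in for the machine-learning models of the dissertation. The similarity threshold and age window below are illustrative assumptions for two censuses taken roughly ten years apart.

```python
from difflib import SequenceMatcher

def link_records(census_a, census_b, min_sim=0.9, max_age_gap=12):
    """Toy record linkage: pair each record in census_a with its best
    name match in census_b, subject to a similarity threshold and a
    plausible ~10-year age progression (thresholds are illustrative)."""
    def sim(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    links = {}
    for ra in census_a:
        candidates = [rb for rb in census_b
                      if 8 <= rb["age"] - ra["age"] <= max_age_gap]
        if not candidates:
            continue
        best = max(candidates, key=lambda rb: sim(ra["name"], rb["name"]))
        if sim(ra["name"], best["name"]) >= min_sim:
            links[ra["name"]] = best["name"]
    return links

# Misspellings across enumerations are the hard part of the problem
census_1900 = [{"name": "John Smith", "age": 24}, {"name": "Mary Jones", "age": 30}]
census_1910 = [{"name": "Jon Smith", "age": 34}, {"name": "Mary Johnson", "age": 41}]
links = link_records(census_1900, census_1910)
```

A learned matcher replaces the fixed threshold with a model trained on labeled pairs, but the blocking-then-scoring structure is the same.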
The transformation of the US economy over this period was remarkable: from a rural economy at the beginning of the nineteenth century to an industrial nation by its end. More strikingly, after lagging behind the technological frontier for most of the nineteenth century, the United States entered the twenty-first century as the global technology leader and the richest nation in the world. Results from this dissertation reveal how people lived and how businesses operated. They tell us about the past that led us to where we are now in terms of people, geography, prices and wages, wealth, revenue, output, capital, the number and types of workers, urbanization, migration, and industrialization.
As part of this endeavor, the first chapter studies how the benefits of improving urban mass transit infrastructure are shared across workers with different skills. It exploits a unique historical setting to estimate the impact of urban transportation infrastructure: the introduction of mass public transit in late nineteenth- and early twentieth-century New York City. I linked individual-level US census data to investigate how urban transit infrastructure differentially affects the welfare of workers with heterogeneous skills. My second chapter measures immigrants' role in the rise of the United States as an economic power. In particular, it focuses on a potential mechanism by which immigrants might have spurred economic prosperity: the transfer of new knowledge. This is the first project to use advances in quantitative spatial theory along with advanced big-data techniques to understand the contribution of immigrants to the process of US economic growth. The key benefit of this approach is to link modern theory with massive amounts of microeconomic data about individual immigrants, their locations and occupations, to address questions that are extremely difficult to assess otherwise. Specifically, the dataset helps researchers understand the extent to which the novel ideas and expertise immigrants brought to US shores drove the nation's emergence as an industrial and technological powerhouse.
My third chapter exploits advances in data digitization and machine learning to study intergenerational mobility in the United States before World War II. Using machine learning techniques, I construct a massive database of multiple generations of fathers and sons. This allows us to identify "lands of opportunity": locations and times in American history where kids had the chance to move up the income ladder. I find that intergenerational mobility elasticities were relatively stable during 1880-1940, that there are regional disparities in the opportunities kids had to move up, and that these geographic disparities in intergenerational mobility have evolved over time.
|