441 |
Big Data : le nouvel enjeu de l'apprentissage à partir des données massives / Big Data: the new challenge of learning from massive data
Adjout Rehab, Moufida 01 April 2016 (has links)
The conjunction of globalization and the continuous development of information technologies has led to an explosion in the volume of available data. Capacities for producing, storing, and processing data have crossed such a threshold that a new term has been put forward: Big Data. The growth in the quantity of data to be considered requires new processing tools, because classical learning tools are poorly suited to this change in scale, both in computational complexity and in the time needed for processing. That processing is most often centralized and sequential, which makes learning methods dependent on the capacity of the machine used. The difficulties in analyzing a large dataset are therefore numerous. In this thesis, we are interested in the problems that supervised learning encounters on large volumes of data. To face these new challenges, new processes and methods must be developed to make the best use of all available data. The objective of this thesis is to explore the approach of designing scalable versions of these classical methods, relying on the distribution of processing and data to increase the capacity of the approaches without harming their accuracy. Our contribution consists of two parts, each proposing a new learning approach for massive data processing. Both belong to supervised predictive learning from voluminous data, namely Multiple Linear Regression and ensemble methods such as Bagging. The first contribution, named MLR-MR, concerns scaling up Multiple Linear Regression by distributing the processing over a cluster of machines. The goal is to optimize the processing and the induced computational load without, of course, changing the computation principle (QR factorization), which yields the same coefficients as the classical method. The second contribution, called "Bagging MR_PR_D" (Bagging-based MapReduce with Distributed PRuning), implements a scalable approach to Bagging that distributes processing at two levels: model learning and model pruning. Its goal is to design an algorithm that performs and scales well across all processing phases (learning and pruning), thereby guaranteeing a wide spectrum of applications. Both approaches were tested on a variety of datasets associated with regression problems, with several million observations each. Our experimental results demonstrate the efficiency and speed of our approaches based on distributing processing in the cloud. / In recent years we have witnessed tremendous growth in the volume of generated data, partly due to the continuous development of information technologies. Managing these amounts of data requires fundamental changes in the architecture of data management systems in order to adapt to large and complex data. Single machines do not have the capacity to process such massive data, which motivates the need for scalable solutions. This thesis focuses on building scalable data management systems for treating large amounts of data.
Our objective is to study the scalability of supervised machine learning methods in large-scale scenarios. In most existing algorithms and data structures there is a trade-off among efficiency, complexity, and scalability. To address these issues, we explore recent techniques for distributed learning in order to overcome the limitations of current learning algorithms. Our contribution consists of two new machine learning approaches for large-scale data. The first tackles the scalability of Multiple Linear Regression in distributed environments: it learns quickly from massive volumes of existing data using parallel computing and a divide-and-conquer approach, and it provides the same coefficients as the classic approach. The second contribution introduces a new scalable approach for ensembles of models that allows both learning and pruning to be deployed in a distributed environment. Both approaches have been evaluated on a variety of regression datasets ranging from a few thousand to several million examples. The experimental results show that the proposed approaches are competitive in terms of predictive performance while significantly reducing training and prediction time.
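The abstract pins the exact-coefficient property of MLR-MR on QR factorization combined with divide-and-conquer. The thesis itself runs on a MapReduce cluster; the following is only a minimal single-process sketch of the underlying blockwise-QR idea (function and variable names are my own, not the thesis's): each worker reduces its row block to a small R factor plus Q^T y, and re-factorizing the stacked R factors recovers exactly the coefficients of a full least-squares solve.

    import numpy as np

    def distributed_lstsq(X_blocks, y_blocks):
        """Divide-and-conquer least squares via blockwise QR.

        Each worker reduces its row block (X_i, y_i) to a small (p x p)
        R factor and a (p,) vector Q_i^T y_i; stacking these and
        re-factorizing yields the same coefficients as one QR of the
        full matrix.
        """
        partial_R, partial_qty = [], []
        for X_i, y_i in zip(X_blocks, y_blocks):   # "map" step, one per worker
            Q_i, R_i = np.linalg.qr(X_i)
            partial_R.append(R_i)
            partial_qty.append(Q_i.T @ y_i)
        # "reduce" step: factorize the stacked R factors
        Q2, R = np.linalg.qr(np.vstack(partial_R))
        qty = Q2.T @ np.concatenate(partial_qty)
        return np.linalg.solve(R, qty)             # coefficients beta

    # Sanity check against a single-machine solve
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(10_000, 5)), rng.normal(size=10_000)
    beta_dist = distributed_lstsq(np.array_split(X, 4), np.array_split(y, 4))
    beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
    assert np.allclose(beta_dist, beta_full)

The final assertion illustrates the abstract's claim: the distributed path and the classical solve agree on the coefficients, up to floating-point rounding.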
|
442 |
Extreme Learning Machines: novel extensions and application to Big Data
Akusok, Anton 01 May 2016 (has links)
Extreme Learning Machine (ELM) is a recently proposed way of training single-layer feed-forward neural networks with an explicitly given solution, which exists because the input weights and biases are generated randomly and never change. In general the method achieves performance comparable to error back-propagation, but with training times up to five orders of magnitude smaller. Despite the random initialization, the regularization procedures explained in the thesis ensure consistently good results.
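As a minimal sketch of the training scheme just described (assuming a standard regression ELM with tanh activations; names are illustrative, not taken from the thesis's toolbox):

    import numpy as np

    def elm_train(X, y, n_hidden=100, seed=0):
        """Minimal ELM: random hidden layer, closed-form output weights."""
        rng = np.random.default_rng(seed)
        W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights, never trained
        b = rng.normal(size=n_hidden)                # random biases, never trained
        H = np.tanh(X @ W + b)                       # hidden-layer activations
        beta, *_ = np.linalg.lstsq(H, y, rcond=None) # explicit least-squares solution
        return W, b, beta

    def elm_predict(X, W, b, beta):
        return np.tanh(X @ W + b) @ beta

Because only the output weights beta are solved for, and in closed form, there is no iterative gradient descent at all, which is where the speed advantage over back-propagation comes from.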
While the general methodology of ELMs is well developed, the sheer speed of the method enables atypical uses in state-of-the-art techniques based on repeated model re-training and re-evaluation. Three such techniques are explained in the third chapter: a way of visualizing high-dimensional data onto a given fixed set of visualization points, an approach for detecting samples in a dataset with incorrect labels (mistakenly assigned, mistyped, or labeled with low confidence), and a way of computing confidence intervals for ELM predictions. All three methods prove useful and open up further applications.
The ELM method is a promising basis for dealing with Big Data because it naturally handles the problem of large data size. An adaptation of ELM to Big Data problems, and a corresponding toolbox (published and freely available), are described in chapter 4. The adaptation includes an iterative solution of ELM that satisfies limited computer-memory constraints and allows for convenient parallelization. Other tools are GPU-accelerated computations and support for a convenient storage format for huge data. The chapter also provides two real-world examples of dealing with Big Data using ELMs, which exhibit other Big Data problems such as veracity and velocity, and presents solutions to them in the particular problem context.
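The abstract describes the memory-bounded iterative solution only at a high level. One standard way to realize it, an assumption on my part rather than the toolbox's documented implementation, is to accumulate the normal-equation terms H^T H and H^T y batch by batch, so that only one batch is ever in memory and workers can accumulate in parallel:

    import numpy as np

    def elm_train_batched(batches, n_features, n_hidden=100, reg=1e-3, seed=0):
        """Memory-bounded ELM sketch: stream (X_batch, y_batch) pairs,
        accumulate H^T H and H^T y, then solve once at the end."""
        rng = np.random.default_rng(seed)
        W = rng.normal(size=(n_features, n_hidden))
        b = rng.normal(size=n_hidden)
        HtH = np.zeros((n_hidden, n_hidden))
        Hty = np.zeros(n_hidden)
        for X_batch, y_batch in batches:           # batches can stream from disk
            H = np.tanh(X_batch @ W + b)
            HtH += H.T @ H                         # accumulation is trivially parallel
            Hty += H.T @ y_batch
        # Ridge term keeps the solve stable, matching the thesis's emphasis
        # on regularization for consistently good results.
        beta = np.linalg.solve(HtH + reg * np.eye(n_hidden), Hty)
        return W, b, beta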
|
443 |
Fast demand response with datacenter loads: a green dimension of big data
McClurg, Josiah 01 August 2017 (has links)
Demand response is one of the critical technologies necessary for allowing large-scale penetration of intermittent renewable energy sources in the electric grid. Data centers are especially attractive candidates for providing flexible, real-time demand response services to the grid because they are capable of fast power ramp rates, a large dynamic range, and finely controllable power consumption. This thesis contributes toward implementing load shaping with server clusters through a detailed experimental investigation of three broadly applicable datacenter workload scenarios. We experimentally demonstrate the feasibility of datacenter demand response with a distributed video transcoding application and a simple distributed power controller. We also show that while some software power-capping interfaces performed better than others, all the interfaces we investigated had the high dynamic range and low power variance required to achieve high-quality power tracking. Our next investigation presents an empirical performance evaluation of algorithms that replace arithmetic operations with low-level bit operations for power-aware Big Data processing. Specifically, we compare two different data structures in terms of execution time and power efficiency: (a) a baseline design using arrays, and (b) a design using bit-slice indexing (BSI) and distributed BSI arithmetic. Across three different datasets and three popular queries, we show that the bit-slicing queries consistently outperform the array algorithm in both power efficiency and execution time. In the context of datacenter power shaping, this performance optimization enables additional power flexibility, achieving the same or greater performance than the baseline approach even under power constraints. The investigation of read-optimized index queries leads to an experimental investigation of the trade-offs among power constraint, query freshness, and update aggregation size in a dynamic big data environment. We compare several update strategies, presenting a bitmap update optimization that improves performance over both a baseline approach and an existing state-of-the-art update strategy. Performing this investigation in the context of load shaping, we show that read-only range queries can be served without performance impact under a power cap, and that index updates can be tuned to provide a flexible base load. The thesis concludes with a brief discussion of the control implementation and a summary of our findings.
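To make the bit-slice indexing idea concrete: a column of integers is stored as one bitmap per bit position, so aggregates reduce to bitwise operations and bit counts. The following toy sketch is my own illustration and is far simpler than the distributed BSI arithmetic the thesis evaluates:

    import numpy as np

    def to_bit_slices(values, n_bits=8):
        """Store a column of small non-negative ints as one bitmap per
        bit position: slice k holds bit k of every value."""
        return [(values >> k) & 1 for k in range(n_bits)]

    def bsi_sum(slices):
        """Column sum from bit counts alone: sum = sum_k 2^k * popcount(slice_k)."""
        return sum((1 << k) * int(s.sum()) for k, s in enumerate(slices))

    values = np.array([3, 7, 1, 4, 6], dtype=np.uint8)
    slices = to_bit_slices(values)
    assert bsi_sum(slices) == values.sum() == 21

Real BSI engines store the slices as compressed bitmaps and use word-level popcount instructions, which is what makes these queries both fast and power-efficient.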
|
444 |
An Exploratory Statistical Method For Finding Interactions In A Large Dataset With An Application Toward Periodontal Diseases
Lambert, Joshua 01 January 2017 (has links)
It is estimated that Periodontal Diseases effects up to 90% of the adult population. Given the complexity of the host environment, many factors contribute to expression of the disease. Age, Gender, Socioeconomic Status, Smoking Status, and Race/Ethnicity are all known risk factors, as well as a handful of known comorbidities. Certain vitamins and minerals have been shown to be protective for the disease, while some toxins and chemicals have been associated with an increased prevalence. The role of toxins, chemicals, vitamins, and minerals in relation to disease is believed to be complex and potentially modified by known risk factors. A large comprehensive dataset from 1999-2003 from the National Health and Nutrition Examination Survey (NHANES) contains full and partial mouth examinations on subjects for measurement of periodontal diseases as well as patient demographic information and approximately 150 environmental variables. In this dissertation, a Feasible Solution Algorithm (FSA) will be used to investigate statistical interactions of these various chemical and environmental variables related to periodontal disease. This sequential algorithm can be used on traditional statistical modeling methods to explore two and three way interactions related to the outcome of interest. FSA can also be used to identify unique subgroups of patients where periodontitis is most (or least) prevalent. In this dissertation, FSA is used to explore the NHANES data and suggest interesting relationships between the toxins, chemicals, vitamins, minerals and known risk factors that have not been previously identified.
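In outline, FSA hill-climbs over variable subsets: start from a random set of m variables, repeatedly swap one variable at a time for whichever replacement most improves the criterion (for example, the strength of the interaction term those variables form in a fitted model), and stop at a subset that no single swap can improve, a "feasible solution"; random restarts surface multiple solutions. A generic sketch, with the model-fitting criterion abstracted into a user-supplied score function and all names my own:

    import itertools, random

    def fsa(variables, score, order=2, restarts=20, seed=0):
        """Feasible Solution Algorithm sketch: hill-climb over variable
        subsets of size `order` by single swaps, from random starts."""
        rng = random.Random(seed)
        solutions = set()
        for _ in range(restarts):
            current = rng.sample(variables, order)
            improved = True
            while improved:
                improved = False
                best, best_score = current, score(current)
                # try swapping each position for each unused variable
                for i, v in itertools.product(range(order), variables):
                    if v in current:
                        continue
                    candidate = current[:i] + [v] + current[i + 1:]
                    s = score(candidate)
                    if s > best_score:
                        best, best_score, improved = candidate, s, True
                current = best
            solutions.add(frozenset(current))   # a "feasible solution"
        return solutions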
|
445 |
ESSAYS ON EXTERNAL FORCES IN CAPITAL MARKETS
Painter, Marcus 01 January 2019 (has links)
In the first chapter, I find that counties more likely to be affected by climate change pay more in underwriting fees and initial yields to issue long-term municipal bonds than counties unlikely to be affected. This difference disappears for short-term municipal bonds, implying that the market prices climate change risk only for long-term securities. Higher issuance costs for climate-risk counties are driven by bonds with lower credit ratings. Investor attention is a driving factor: the difference in issuance costs between climate-affected and unaffected counties widens after the release of the 2006 Stern Review on climate change. In the second chapter, I document the investment value of alternative data and examine how market participants react to the data's dissemination. Using satellite images of the parking lots of US retailers, I find that a long-short trading strategy based on growth in car count earns an alpha of 1.6% per month. I then show that, after the release of the satellite data, hedge fund trades become more sensitive to growth in car count and more profitable in affected stocks. Conversely, individual investor demand becomes less sensitive to growth in car count and less profitable in affected stocks. Further, the increase in information asymmetry between investors due to the availability of alternative data leads to a decrease in the liquidity of affected firms.
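The abstract does not spell out the portfolio construction; a standard quantile long-short design, assumed here purely for illustration (column names are hypothetical), sorts stocks each month by growth in car count and holds the top quintile against the bottom:

    import pandas as pd

    def long_short_returns(panel, signal="car_count_growth",
                           ret="next_month_ret"):
        """Equal-weighted quintile long-short returns, one value per month.
        `panel` is a stock-month DataFrame with a 'month' column."""
        def one_month(df):
            ranks = df[signal].rank(pct=True)
            long_leg = df.loc[ranks >= 0.8, ret].mean()   # top quintile
            short_leg = df.loc[ranks <= 0.2, ret].mean()  # bottom quintile
            return long_leg - short_leg
        return panel.groupby("month").apply(one_month)

The alpha quoted in the abstract would then be the intercept from regressing this return series on standard risk factors, which is the usual, though here assumed, evaluation step.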
|
446 |
DATA COLLECTION FRAMEWORK AND MACHINE LEARNING ALGORITHMS FOR THE ANALYSIS OF CYBER SECURITY ATTACKS
Unknown Date (has links)
The integrity of network communications is constantly challenged by increasingly sophisticated intrusion techniques. Attackers are shifting to stealthier and more complex forms of attack in an attempt to bypass known mitigation strategies. Moreover, many detection methods for popular network attacks have been developed using outdated or non-representative attack data. To develop modern detection methodologies effectively, there is a need to acquire data that fully encompasses the behaviors of persistent and emerging threats. When collecting modern network traffic for intrusion detection, substantial amounts of traffic can be gathered, most of it normal traffic containing relatively few attack instances. This skewed distribution between normal and attack data can lead to high levels of class imbalance. Machine learning techniques can aid attack detection, but a large imbalance between normal (majority) and attack (minority) instances can lead to inaccurate detection results. / Includes bibliography. / Dissertation (Ph.D.)--Florida Atlantic University, 2019. / FAU Electronic Theses and Dissertations Collection
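The abstract names the imbalance problem without naming remedies; two common ones, an assumption on my part rather than the dissertation's documented techniques, are class weighting and majority-class undersampling:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    # Option 1, class weighting: penalize mistakes on the rare attack
    # class more heavily instead of changing the data.
    clf = RandomForestClassifier(class_weight="balanced", random_state=0)

    # Option 2, random undersampling: shrink the normal (majority) class
    # to the size of the attack (minority) class before training.
    def undersample(df, label_col="is_attack", seed=0):
        attacks = df[df[label_col] == 1]
        normal = df[df[label_col] == 0].sample(n=len(attacks),
                                               random_state=seed)
        return pd.concat([attacks, normal]).sample(frac=1, random_state=seed)

Undersampling discards majority data, which matters at the traffic volumes described here, so in practice the choice between these options is itself an experimental question.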
|
447 |
Organizational Success in the Big Data Era: Development of the Albrecht Data-Embracing Climate Scale (ADEC)
Albrecht, Lauren Rebecca 01 September 2016 (has links)
In today’s information age, technological advances in virtually every industry allow organizations, both big and small, to create and store more data than ever before. Though data are highly abundant, they often remain underutilized as a resource for improving organizational performance. The popularity of and intrigue around big data have opened up new opportunities to study how organizations embrace evidence and use it to improve their business. The focus of big data has mainly been on specific technologies, techniques, or its use in everyday life; what has been critically missing from the conversation, however, is consideration of the culture and climate needed to support effective data use in organizations. Many organizations currently want to develop a data-embracing climate or to make their existing climates more data-informed. The purpose of this project was to develop a scale to assess the current state of data usage in organizations, which can be used to help organizations measure how well they manage, share, and use data to make informed decisions. I defined the phenomenon of a data-embracing climate based on a review of a broad range of business, computer science, and industrial-organizational psychology literature. Using this definition, I developed a scale to measure this newly defined construct by first conducting an exploratory factor analysis, then an item retranslation task, and finally a confirmatory factor analysis. This research provides support for the reliability and validity of the Albrecht Data-Embracing Climate Scale (ADEC); the future of this new area of research would benefit from replicating these results and gaining further support for the new construct. Implications for science and practice are discussed. I sought to make a valuable contribution to the field of I-O psychology and to create a useful instrument for researchers and practitioners in multiple and diverse fields. I hope others will benefit from this scale to measure how organizations use evidence from data to make informed decisions and gain a competitive advantage beyond intuition alone. Do not cite without express permission from the author.
|
448 |
Essays in History and Spatial Economics with Big Data
Lee, Sun Kyoung January 2019 (has links)
This dissertation contains three essays in History and Spatial Economics with Big Data. As part of my dissertation, I develop a modern, machine-learning-based approach to connecting large datasets. Merging several massive databases and matching the records within them presents challenges, some straightforward and others more complex. I employ artificial intelligence and machine learning technologies to link and then analyze massive amounts of historical US federal census, Department of Labor, and Bureau of Labor Statistics data.
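The dissertation's actual linkage models are machine-learned; the sketch below shows only the skeleton such systems are typically built on, blocking plus fuzzy match scoring, with hypothetical field names and a standard-library string similarity standing in for a trained matcher:

    from difflib import SequenceMatcher

    def match_score(rec_a, rec_b):
        # Toy linkage score over two census records: fuzzy name
        # similarity minus a penalty for inconsistent birth years.
        name_sim = SequenceMatcher(None, rec_a["name"].lower(),
                                   rec_b["name"].lower()).ratio()
        year_gap = min(abs(rec_a["birth_year"] - rec_b["birth_year"]), 5) / 5
        return name_sim - 0.3 * year_gap

    def link(records_1880, records_1900, threshold=0.8):
        # Blocking: only compare records sharing a coarse key (first letter
        # of the name and birthplace), which keeps census-scale linkage
        # tractable.
        blocks = {}
        for r in records_1900:
            blocks.setdefault((r["name"][0], r["birthplace"]), []).append(r)
        links = []
        for a in records_1880:
            candidates = blocks.get((a["name"][0], a["birthplace"]), [])
            scored = [(match_score(a, b), b) for b in candidates]
            if scored:
                best_score, best = max(scored, key=lambda t: t[0])
                if best_score >= threshold:
                    links.append((a, best))
        return links

A learned matcher replaces the hand-set weights and threshold with a model trained on labeled matched pairs, but the block-then-score structure is the same.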
The transformation of the US economy during this period was remarkable: from a rural economy at the beginning of the nineteenth century to an industrial nation by its end. More strikingly, after lagging behind the technological frontier for most of the nineteenth century, the United States entered the twentieth century as the global technology leader and the richest nation in the world. Results from this dissertation reveal how people lived and how businesses operated. They trace the past that led us to where we are now in terms of people, geography, prices and wages, wealth, revenue, output, capital, the numbers and types of workers, urbanization, migration, and industrialization.
As part of this endeavor, the first chapter studies how the benefits of improving urban mass transit infrastructure in cities are shared across workers with different skills. It exploits a unique historical setting to estimate the impact of urban transportation infrastructure: the introduction of mass public transit in late nineteenth- and early twentieth-century New York City. I link individual-level US census data to investigate how urban transit infrastructure differentially affects the welfare of workers with heterogeneous skills. My second chapter measures immigrants' role in the US's rise as an economic power. In particular, it focuses on a potential mechanism by which immigrants might have spurred economic prosperity: the transfer of new knowledge. This is the first project to use advances in quantitative spatial theory along with advanced big-data techniques to understand the contribution of immigrants to the process of US economic growth. The key benefit of this approach is to link modern theory with massive amounts of microeconomic data about individual immigrants (their locations and occupations) to address questions that are extremely difficult to assess otherwise. Specifically, the dataset helps researchers understand the extent to which the novel ideas and expertise immigrants brought to US shores drove the nation's emergence as an industrial and technological powerhouse.
My third chapter exploits advances in data digitization and machine learning to study intergenerational mobility in the United States before World War II. Using machine learning techniques, I construct a massive database of multiple generations of fathers and sons. This allows me to identify "lands of opportunity": locations and times in American history where children had a chance to move up the income ladder. I find that intergenerational mobility elasticities were relatively stable during 1880-1940; that there were regional disparities in the opportunities children had to move up; and that the geography of intergenerational mobility evolved over time.
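For reference, the intergenerational mobility elasticity reported here is conventionally the slope in a log-log regression of son's income on father's income (the standard definition, assumed to be the one used):

    \log y_i^{\mathrm{son}} = \alpha + \beta \, \log y_i^{\mathrm{father}} + \varepsilon_i

A smaller beta means a weaker link between a father's economic position and his son's, that is, more mobility; a beta that changes little across decades is what the chapter's finding of "relatively stable" elasticities over 1880-1940 refers to.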
|
449 |
Value as a Motivating Factor for Collaboration : The case of a collaborative network for wind asset owners for potential big data sharing
Kenjangada Kariappa, Ganapathy; Bjersér, Marcus January 2019 (has links)
The world's need for energy is increasing even as we realize the consequences of existing unsustainable methods of energy production. Wind power is a potential partial solution, but it is a relatively new source of energy. Advances in technology and innovation can help, but the wind energy industry is embracing them too slowly due, among other reasons, to a lack of incentives in terms of the added value provided. Collaboration and big data may provide a key to overcoming this. However, to our knowledge, this research area has received little attention, especially in the context of the wind energy industry. The purpose of this study is to explore value as a motivating factor for potential big data collaboration via a collaborative network. This is explored within the context of big data collaboration and the collaborative network for wind asset owners O2O WIND International. A cross-sectional, multi-method, qualitative, in-depth single case study is conducted. The data collected and analyzed are based on four semi-structured interviews and a set of rich documentary secondary data on 25 of the participants in the collaborative network, comprising 3866 pages and 124 web pages visited. The main findings are as follows. The 25 participants of the collaborative network were evaluated, and their approaches to three different types of value were visualized through a novel model: a three-dimensional value-approach space. In this visualization, clusters of participants reveal six distinct approaches to value among the 25 participants. Furthermore, 14 categories of value that the participants believe can be created through the collaborative network were identified. These values were categorized based on fundamental types of value, their dimensions, and four value processes, and were analyzed for patterns and similarities. The classification results in a unique categorization of the participants in a collaborative network; these categories serve as customer segments that the focal firm of the collaborative network can target. The interviews yielded insights into the current state of the industry, existing and future market problems and needs, and existing and future market opportunities. Possible business-model implications of our findings, for the focal firm behind the collaborative network O2O WIND International as well as for the participants in the collaboration, are then discussed. We conclude that big data and collaborative networks have potential for value creation in the wind power sector if the business models of those involved take them into account. More research is needed, however, and suggestions for future work are made.
|
450 |
A big data analytics framework to improve healthcare service delivery in South Africa
Mgudlwa, Sibulela January 2018 (has links)
Thesis (MTech (Information Technology))--Cape Peninsula University of Technology, 2018. / Healthcare facilities in South Africa accumulate big data daily. However, this data is not being utilised to its full potential. The healthcare sector still uses traditional methods to store, process, and analyse data, and currently no big data analytics tools are being used in the South African healthcare environment.
This study was conducted to establish which factors hinder the effective use of big data in the South African healthcare environment. To fulfil the objectives of this research, qualitative methods were followed. Using the case study method, two healthcare organisations were selected as cases, enabling the researcher to find similarities between the cases that support generalisation. The data collected in this study were analysed using Actor-Network Theory (ANT). Through the application of ANT, the researcher was able to uncover the factors influencing big data analytics in the healthcare environment. ANT was essential to the study, as it brought out the different interactions that take place between human and non-human actors, resulting in big data. From the analysis, findings were drawn and interpreted. The interpretation of the findings led to the framework developed in Figure 5.5, which was designed to guide the South African healthcare sector in selecting appropriate big data analytics tools.
The contribution of this study is twofold: theoretical and practical. Theoretically, the developed framework acts as a useful guide for selecting big data analytics tools. Practically, the guide can be used by South African healthcare practitioners to gain a better understanding of big data analytics and of how it can be used to improve healthcare service delivery.
|