1 |
Distributed Ensemble Learning With Apache Spark
Lind, Simon, January 2016 (has links)
No description available.
|
2 |
"Big Data" Management and Security Application to Telemetry Data ProductsKalibjian, Jeff 10 1900 (has links)
ITC/USA 2013 Conference Proceedings / The Forty-Ninth Annual International Telemetering Conference and Technical Exhibition / October 21-24, 2013 / Bally's Hotel & Convention Center, Las Vegas, NV / "Big Data" [1], and the security challenge of managing it, is a hot topic in the IT world. The term "Big Data" describes very large data sets that cannot be processed by traditional database applications in "tractable" periods of time. Securing data in a conventional database is challenging enough; securing data whose size may exceed hundreds of terabytes or even petabytes is even more daunting! As the size of telemetry products and post-processed telemetry products continues to grow, "Big Data" management techniques, and the securing of that data, may find ever-increasing application in the telemetry realm. After reviewing the basics of "Big Data" and of "Big Data" security and management, potential applications to post-processed telemetry products are explored.
|
3 |
Virtual wind sensors: improving wind forecasting using big data analytics
Gray, Kevin Alan, January 2016 (has links)
A dissertation submitted to the Faculty of Science, University of the Witwatersrand, Johannesburg, in fulfilment of the requirements for the degree of Master of Science. Johannesburg, 2016. / Wind sensors provide very accurate measurements; however, it is not feasible to have
a network of wind sensors large enough to provide these accurate readings everywhere.
A “virtual” wind sensor uses existing weather forecasts, as well as historical weather
station data to predict what readings a regular wind sensor would provide. This study
attempts to develop a method using Big Data Analytics to predict wind readings for
use in “virtual” wind sensors. The study uses Random Forests and linear regression to
estimate wind direction and magnitude using various transformations of a Digital Elevation
Model, as well as data from the European Centre for Medium-Range Weather Forecasts.
The model is evaluated for accuracy against existing high-resolution weather
station data, showing a slight improvement in the estimation of wind direction
and magnitude over the forecast data.
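As a rough illustration of the modelling approach described above, the sketch below fits a Random Forest to estimate a horizontal wind component from terrain and forecast features. All feature names and the synthetic data are hypothetical stand-ins under assumed parameters, not the study's actual variables or method.

```python
# Sketch: Random Forest regression of a wind component from DEM-derived
# terrain features plus a coarse forecast. All columns are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([
    rng.uniform(0, 2000, n),   # elevation (m)
    rng.uniform(0, 30, n),     # slope (degrees)
    rng.uniform(0, 360, n),    # aspect (degrees)
    rng.normal(0, 5, n),       # coarse-forecast u-component (m/s)
    rng.normal(0, 5, n),       # coarse-forecast v-component (m/s)
])
# Synthetic target: the observed u-component, loosely tied to the
# forecast and modulated by terrain aspect.
y = X[:, 3] + 0.1 * np.sin(np.radians(X[:, 2])) + rng.normal(0, 1, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_tr, y_tr)
print("MAE (m/s):", mean_absolute_error(y_te, model.predict(X_te)))
```

Predicting the u and v components separately and recombining them into direction and magnitude sidesteps the circular-variable problem of regressing on direction directly.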
|
4 |
Approaching “Big Data” in Biological Research Imaging Spectroscopy with Novel Compression
Chen, Yixuan, 10 April 2014 (has links)
This research focuses on providing a fast and space-efficient compression method for answering information queries on spectroscopic data. Our primary hypothesis was that a conversion from decimal data to character/integer space could be done in a manner that enables the use of succinct structures and provides good compression. The compression algorithm is motivated by the need to handle queries on spectroscopic data whose size approaches the limits of main computer memory.
The primary hypothesis is supported: the new compression method saves 79.20% - 94.07% of storage space on average. The average of the maximum error rates is also acceptable, ranging from 0.05% to 1.36% depending on the subject the data was collected from. Additionally, compression rate and entropy are negatively correlated, while compression rate and maximum error are positively correlated when the maximum error rates are transformed by the natural logarithm. The effects of different data sources on compression rate were also studied: fungus datasets achieved the highest compression rates, while mouse brain datasets obtained the lowest among the four types of data sources. Finally, the effect of the studied compression algorithm on integrating spectral bands was investigated. The spectral integration for determining lipid, CH2, and dense-core plaque yielded good image quality, and the errors can be considered inconsequential except in the case of determining creatine deposits; although creatine deposits remain recognizable in the reconstructed image, the image quality was reduced.
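The core idea tested here, mapping decimal spectra into an integer alphabet so that succinct structures apply at the cost of a small bounded error, can be sketched as a simple scalar quantizer. The 16-bit alphabet and per-band min/max scaling below are illustrative assumptions, not the thesis's actual scheme.

```python
# Sketch: lossy decimal-to-integer quantization of one spectral band,
# the kind of mapping that makes integer/succinct structures usable.
# The 16-bit alphabet and per-band scaling are illustrative choices.
import numpy as np

def quantize(band: np.ndarray, levels: int = 2**16):
    lo, hi = band.min(), band.max()
    scale = (hi - lo) / (levels - 1) if hi > lo else 1.0
    codes = np.round((band - lo) / scale).astype(np.uint16)
    return codes, lo, scale

def dequantize(codes: np.ndarray, lo: float, scale: float) -> np.ndarray:
    return codes.astype(np.float64) * scale + lo

band = np.random.default_rng(1).normal(0.5, 0.1, 10_000)  # fake band
codes, lo, scale = quantize(band)
recon = dequantize(codes, lo, scale)
print(f"max relative error: {np.max(np.abs(recon - band) / np.abs(band)):.6%}")
```

Each reconstructed value is within scale/2 of the original, so the maximum error is controlled by the alphabet size, mirroring the bounded error rates reported above.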
|
5 |
Charla sobre aplicaciones de Bigdata en el mercado / Talk on Big Data Applications in the Market
Díaz Huiza, César; Quezada Balcázar, César, 12 September 2019 (has links)
César Díaz Huiza (DMC Perú) / César Quezada Balcázar (DMC Perú) / The talk covers the evolution and importance of Big Data in the market and its impact on the economy.
|
6 |
Velká data: nová perspektiva pro řešení konfliktů / Big Data: A New Perspective on Conflict Resolution
Šerstka, Anastasija, January 2021 (has links)
The thesis examines the role of big data in resolving modern conflicts. The study combines the concept of big data with conflict resolution theory and applies them to three directions of conflict resolution: non-violent, violent, and conflict prevention. Each of the three groups is accompanied by a case study. This method allows a detailed understanding of various aspects of resolving current conflicts using technology and big data analytics. The thesis examines empirical data associated with many innovative projects that have been implemented, or are under development, for the resolution of ongoing conflicts: UN projects focused on big data collection, technology projects developed by US state research centers, and databases of large volumes of conflict-related data. Based on the acquired knowledge, this work explores big data analysis for conflict resolution: its forms, advantages, disadvantages, and limitations. Big data perspectives on the resolution of modern conflicts, based on empirical analysis, are summarized in three groups: operational (real-time data collection and processing), tactical (real-time decision-making based on big data analysis outcomes), and strategic (data-driven strategic advantage). The thesis concludes that the main advantage of...
|
7 |
Towards a big data analytics platform with Hadoop/MapReduce framework using simulated patient data of a hospital system
Chrimes, Dillon, 28 November 2016 (has links)
Background: Big data analytics (BDA) is important for reducing healthcare costs, but it presents many challenges. The study objective was to establish a high-performance, interactive BDA platform for a hospital system.
Methods: A Hadoop/MapReduce framework formed the BDA platform, with HBase (a NoSQL database) using hospital-specific metadata and file ingestion. Query performance was tested with Apache tools in Hadoop's ecosystem.
Results: At the optimized iteration, Hadoop Distributed File System (HDFS) ingestion required three seconds, but HBase required four to twelve hours to complete the Reducer stage of MapReduce. HBase bulk loads took a week for one billion records (10 TB) and over two months for three billion (30 TB). Simple and complex queries returned results in about two seconds on the one-billion and three-billion record sets, respectively.
Interpretations: The HBase-based BDA platform, distributed by Hadoop, operated successfully at high performance on volumes representing the Province's entire data. Inconsistencies in MapReduce limited operational efficiencies. The importance of Hadoop/MapReduce for health informatics is discussed further. / Graduate
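For a feel of what client queries against such an HBase store could look like, here is a minimal sketch using the happybase Thrift client. The host, table name, column family, and row-key layout are hypothetical assumptions for illustration, not the platform's actual schema.

```python
# Sketch: scanning simulated patient-encounter rows from HBase via the
# happybase Thrift client. Table, column, and row-key names are
# hypothetical, not the platform's actual schema.
import happybase

connection = happybase.Connection('hbase-thrift-host')  # assumed gateway
table = connection.table('patient_encounters')

# Row keys are assumed to be '<hospital_id>#<encounter_ts>', so a prefix
# scan retrieves all encounters for one hospital.
for row_key, data in table.scan(row_prefix=b'hosp042#', limit=100):
    diagnosis = data.get(b'enc:diagnosis_code', b'').decode()
    print(row_key.decode(), diagnosis)

connection.close()
```

A prefix scan like this stays fast regardless of total table size because HBase keeps rows sorted by key, which is one reason row-key design dominates query latency in such platforms.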
|
8 |
Indexing and analysis of very large masses of time series / Indexation et analyse de très grandes masses de séries temporelles
Yagoubi, Djamel edine, 19 March 2018 (has links)
Time series arise in many application domains such as finance, agronomy, health, earth monitoring, and weather forecasting, to name a few. Because of advances in sensor technology, such applications may produce millions to trillions of time series per day, requiring fast analytical and summarization techniques. The processing of these massive volumes of data has opened up new challenges in time series data mining; in particular, indexing techniques have shown poor performance when processing large databases. In this thesis, we focus on the problem of parallel similarity search in such massive sets of time series. For this, we first need to develop efficient search operators that can query a very large distributed database of time series with low response times.
The search operator can be implemented by using an index constructed before executing the queries. The objective of indices is to improve the speed of data retrieval operations. In databases, an index is a data structure that, based on search criteria, efficiently locates data entries satisfying the requirements. Indexes often make the response time of the lookup operation sublinear in the database size. After reviewing the state of the art, we propose three novel approaches for parallel indexing and querying of large time series datasets. First, we propose DPiSAX, a novel and efficient parallel solution that includes a parallel index construction algorithm that takes advantage of distributed environments to build iSAX-based indices over vast volumes of time series efficiently. Our solution also involves a parallel query processing algorithm that, given a similarity query, exploits the available processors of the distributed system to answer the query efficiently in parallel by using the constructed parallel index. Second, we propose RadiusSketch, a random-projection-based approach that scales nearly linearly in parallel environments and provides high-quality answers. RadiusSketch includes a parallel index construction algorithm that takes advantage of distributed environments to efficiently build sketch-based indices over very large databases of time series, and then query the databases in parallel. Third, we propose ParCorr, an efficient parallel solution for detecting similar time series across distributed data streams. ParCorr uses the sketch principle for representing the time series. Our solution includes a parallel approach for incremental computation of the sketches in sliding windows and a partitioning approach that projects sketch vectors of time series into subvectors and builds a distributed grid structure. Our solutions have been evaluated using real and synthetic datasets, and the results confirm their high efficiency compared to the state of the art.
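The sketch principle behind RadiusSketch and ParCorr can be illustrated in a few lines: project each normalized series onto a small set of shared random vectors and compare series in the reduced space. The snippet below is a generic random-projection sketch under assumed sizes, not the thesis's exact incremental, distributed construction.

```python
# Sketch: random-projection summaries of time series for similarity
# search. A generic illustration of the sketch principle only; the
# thesis's incremental sliding-window and grid structures are omitted.
import numpy as np

rng = np.random.default_rng(42)
n_series, length, sketch_size = 1000, 256, 16

# Random +/-1 projection vectors shared by all series.
R = rng.choice([-1.0, 1.0], size=(sketch_size, length))

X = rng.normal(size=(n_series, length))
X += 0.5 * np.sin(np.linspace(0, 8 * np.pi, length))  # shared trend

# Z-normalize each series so sketch distances track correlation.
Xn = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
S = Xn @ R.T / np.sqrt(sketch_size)  # one 16-dim sketch per series

# Johnson-Lindenstrauss-style guarantee: distances between sketches
# approximate distances between the normalized series, so highly
# correlated series land close together in sketch space.
d = np.linalg.norm(S - S[0], axis=1)
print("nearest neighbours of series 0:", np.argsort(d)[:5])
```

Because each sketch is only 16 numbers, candidate pairs can be found with cheap comparisons (or a grid over sketch subvectors, as in ParCorr) and verified against the full series afterwards.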
|
9 |
INVESTIGATING MACHINE LEARNING ALGORITHMS WITH IMBALANCED BIG DATA
Unknown Date (has links)
Recent technological developments have engendered an expeditious production of big data and also enabled machine learning algorithms to produce high-performance models from such data. Nonetheless, class imbalance (in binary classification) between the majority and minority classes in big data can skew the predictive performance of classification algorithms toward the majority (negative) class, whereas the minority (positive) class usually holds greater value for decision makers. Such bias may lead to adverse consequences, some of them even life-threatening, since false negatives are generally costlier than false positives. The size of the minority class can vary from fair to extraordinarily small, which can lead to different performance scores for machine learning algorithms. Class imbalance is a well-studied area for traditional data, i.e., not big data. However, there is limited research focusing on both rarity and severe class imbalance in big data. / Includes bibliography. / Dissertation (Ph.D.)--Florida Atlantic University, 2019. / FAU Electronic Theses and Dissertations Collection
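As a minimal illustration of the imbalance effect discussed above, the sketch below compares a classifier with and without class weighting, one common mitigation. The 1% positive rate, dataset size, and model choice are arbitrary assumptions, not the dissertation's experimental setup.

```python
# Sketch: effect of class weighting on a severely imbalanced binary
# problem. The 1% positive rate and model choice are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=20_000, n_features=20,
                           weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for cw in (None, "balanced"):
    clf = RandomForestClassifier(class_weight=cw, random_state=0)
    clf.fit(X_tr, y_tr)
    # Recall on the minority (positive) class is the number to watch:
    # overall accuracy stays high even when most positives are missed.
    print(f"class_weight={cw}")
    print(classification_report(y_te, clf.predict(X_te), digits=3))
```

With 1% positives, a model can reach 99% accuracy by predicting the negative class everywhere, which is exactly the skew toward the majority class that the abstract describes.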
|
10 |
Big data v sociologii / Big Data in Sociology
Lančová, Táňa, January 2018 (has links)
The aim of this work is to provide a holistic view of Big Data in sociology and thereby to reflect on a timely topic that has not yet been systematically elaborated. The thesis summarizes approaches to specifying Big Data, which provides insight into the complexity of the phenomenon. It describes the attitudes of contemporary sociology toward Big Data. It identifies the specifics of Big Data that explain why Big Data has not yet been fully accepted by sociology. It provides a comprehensive description of Big Data sources sorted by owner and an overview of methods for Big Data analysis. It classifies and reflects on Critical Data Studies and raises new topics. Key words: Big data, Big data analysis, methodology
|