Global ETD Search

1	Virtual wind sensors: improving wind forecasting using big data analytics Gray, Kevin Alan January 2016 A dissertation submitted to the Faculty of Science, University of the Witwatersrand, Johannesburg, in fulfilment of requirements for the degree of Master of Science. Johannesburg, 2016. / Wind sensors provide very accurate measurements; however, it is not feasible to have a network of wind sensors large enough to provide these accurate readings everywhere. A “virtual” wind sensor uses existing weather forecasts, as well as historical weather station data, to predict what readings a regular wind sensor would provide. This study attempts to develop a method using Big Data Analytics to predict wind readings for use in “virtual” wind sensors. The study uses Random Forests and linear regression to estimate wind direction and magnitude using various transformations of a Digital Elevation Model, as well as data from the European Centre for Medium-Range Weather Forecasts. The model is evaluated based on its accuracy when compared to existing high resolution weather station data, and shows a slight improvement in the estimation of wind direction and magnitude over the forecast data. / LG2017 Wind forecasting Big data
2	Approaching “Big Data” in Biological Research Imaging Spectroscopy with Novel Compression Chen, Yixuan 10 April 2014 This research focuses on providing a fast and space efficient compression method to answer information queries on spectroscopic data. Our primary hypothesis was that a conversion from decimal data to character/integer space could be done in a manner that enables use of succinct structures and provides good compression. This compression algorithm is motivated by the need to handle queries on spectroscopic data that approach the limits of main computer memory. The primary hypothesis is supported in that the new compression method can save 79.20% to 94.07% of computer space on average. The average of maximum error rates is also acceptable, being 0.05% to 1.36% depending on the subject that the data was collected from. Additionally, the data’s compression rate and entropy are negatively correlated, while compression rate and maximum error were positively correlated when the maximum error rates underwent a natural logarithm transformation. The effects of different types of data sources on compression rate have been studied as well. Fungus datasets achieved the highest compression rates, while mouse brain datasets obtained the lowest compression rates among four types of data sources. Finally, the effect of the studied compression algorithm and method on integrating spectral bands has been investigated in this study. The spectral integration for determining lipid, CH2 and dense core plaque obtained good image quality and the errors can be considered inconsequential except in the case of determining creatine deposits. Although creatine deposits are still recognizable in the reconstructed image, the image quality was reduced. Image Compression Big Data
3	Talk on Big Data applications in the market / Charla sobre aplicaciones de Bigdata en el mercado Díaz Huiza, César, Quezada Balcázar, César 12 September 2019 César Díaz Huiza (DMC Perú) / César Quezada Balcázar (DMC Perú) / The talk will cover the evolution and importance of Big Data in the market and its impact on the economy. Economy Market Big data
4	Indexing and analysis of very large masses of time series / Indexation et analyse de très grandes masses de séries temporelles Yagoubi, Djamel edine 19 March 2018 Time series arise in many application domains such as finance, agronomy, health, earth monitoring, and weather forecasting, to name a few. Because of advances in sensor technology, such applications may produce millions to trillions of time series per day, requiring fast analytical and summarization techniques. The processing of these massive volumes of data has opened up new challenges in time series data mining. In particular, indexing techniques, which have shown poor performance when processing large databases, need to be improved. In this thesis, we focus on the problem of parallel similarity search in such massive sets of time series. For this, we first need to develop efficient search operators that can query a very large distributed database of time series with low response times. The search operator can be implemented by using an index constructed before executing the queries. The objective of indices is to improve the speed of data retrieval operations. In databases, the index is a data structure that, based on search criteria, efficiently locates data entries satisfying the requirements. Indexes often make the response time of the lookup operation sublinear in the database size. Relational systems have mainly been supported by hash structures, B-trees, and multidimensional structures such as R-trees, with bit vectors playing a supporting role. Such structures work well for lookups, and adequately for similarity queries. After reviewing the state of the art, we propose three novel approaches for parallel indexing and querying of large time series datasets. First, we propose DPiSAX, a novel and efficient parallel solution that includes a parallel index construction algorithm that takes advantage of distributed environments to build iSAX-based indices over vast volumes of time series efficiently. Our solution also involves a parallel query processing algorithm that, given a similarity query, exploits the available processors of the distributed system to efficiently answer the query in parallel by using the constructed parallel index. Second, we propose RadiusSketch, a random projection-based approach that scales nearly linearly in parallel environments and provides high quality answers. RadiusSketch includes a parallel index construction algorithm that takes advantage of distributed environments to efficiently build sketch-based indices over very large databases of time series, and then query the databases in parallel. Third, we propose ParCorr, an efficient parallel solution for detecting similar time series across distributed data streams. ParCorr uses the sketch principle for representing the time series. Our solution includes a parallel approach for incremental computation of the sketches in sliding windows, which avoids recomputing the sketches from scratch, and a partitioning approach that projects sketch vectors of time series into subvectors and builds a distributed grid structure. Our solutions have been evaluated using real and synthetic datasets, and the results confirm their high efficiency compared to the state of the art. Time series Big data Indexing
5	INVESTIGATING MACHINE LEARNING ALGORITHMS WITH IMBALANCED BIG DATA Unknown Date Recent technological developments have engendered an expeditious production of big data and also enabled machine learning algorithms to produce high-performance models from such data. Nonetheless, class imbalance (in binary classifications) between the majority and minority classes in big data can skew the predictive performance of the classification algorithms toward the majority (negative) class, whereas the minority (positive) class usually holds greater value for the decision makers. Such bias may lead to adverse consequences, some of them even life-threatening, when the existence of false negatives is generally costlier than false positives. The size of the minority class can vary from fair to extraordinarily small, which can lead to different performance scores for machine learning algorithms. Class imbalance is a well-studied area for traditional data, i.e., not big data. However, there is limited research focusing on both rarity and severe class imbalance in big data. / Includes bibliography. / Dissertation (Ph.D.)--Florida Atlantic University, 2019. / FAU Electronic Theses and Dissertations Collection Algorithms Machine learning Big data--Data processing Big data
6	A note on exploration of IoT generated big data using semantics Ranjan, R., Thakker, Dhaval, Haller, A., Buyya, R. 27 July 2017 Yes / Welcome to this special issue of the Future Generation Computer Systems (FGCS) journal. The special issue compiles seven technical contributions that significantly advance the state-of-the-art in exploration of Internet of Things (IoT) generated big data using semantic web techniques and technologies. Internet of Things (IoT) Big data Semantic web Multimedia big data
7	Visualization of multivariate process data for fault detection and diagnosis Wang, Ray Chen 02 October 2014 This report introduces the concept of three-dimensional (3D) radial plots for the visualization of multivariate large scale datasets in plant operations. A key concept of this representation of data is the introduction of time as the third dimension in a two dimensional radial plot, which allows for the display of time series data in any number of process variables. This report shows the ability of 3D radial plots to conduct systemic fault detection and classification in chemical processes through the use of confidence ellipses, which capture the desired operating region of process variables during a defined period of steady-state operation. Principal component analysis (PCA) is incorporated into the method to reduce multivariate interactions and the dimensionality of the data. The method is applied to two case studies with systemic faults present (compressor surge and column flooding) as well as data obtained from the Tennessee Eastman simulator, which contained localized faults. Fault classification using the interior angles of the radial plots is also demonstrated in the paper. / text Visualization Big data Fault detection
8	Impact analysis of characteristics in product development : Change in product property with respect to component generations Lindström, Frej, Andersson, Daniel January 2017 Scania has developed a unique modular product system which is an important success factor, creating flexibility, and lies at the heart of their business model. R&D use product and vehicle product properties to describe the product key factors. These product properties are both used during the development of new features and products, and also utilized by the project office to estimate the total contribution of a project. Scania want to develop a new method to understand and be able to track and compare the projects' effect over time and also predict future vehicle improvements. In this thesis, we investigate how to quantify the impact on vehicle product properties and predict component improvements, based on data sources that have not been utilized for these purposes before. The impact objective is ultimately to increase the understanding of the development process of heavy vehicles, and the aim for this project was to provide statistical methods that can be used for investigative and predictive purposes. First, with analysis of variance we statistically verified and quantified differences in a product property between comparable vehicle populations with respect to component generations. Then, Random Forest and Artificial Neural Networks were implemented to predict future effect on product property with respect to component improvements. We could see a difference of approximately 10 % between the comparable components of interest, which was more than the expected difference. The expectations are based on performance measurements from a test environment. The implemented Random Forest model was not able to predict future effect based on these performance measures. Artificial Neural Networks was able to capture structures from the test environment and its predictive performance and reliability was, under the given circumstances, relatively good. Statistics big data Mathematics
9	ProGENitor : an application to guide your career Hauptli, Erich Jurg 20 January 2015 This report introduces ProGENitor, a system to empower individuals with career advice based on vast amounts of data. Specifically, it develops a machine learning algorithm that shows users how to efficiently reach specific career goals based upon the histories of other users. A reference implementation of this algorithm is presented, along with experimental results that show that it provides quality actionable intelligence to users. / text Graph theory Analytics Big data
10	Towards a big data analytics platform with Hadoop/MapReduce framework using simulated patient data of a hospital system Chrimes, Dillon 28 November 2016 Background: Big data analytics (BDA) is important to reduce healthcare costs. However, there are many challenges. The study objective was to establish a high-performance, interactive BDA platform for a hospital system. Methods: A Hadoop/MapReduce framework formed the BDA platform with HBase (NoSQL database) using hospital-specific metadata and file ingestion. Query performance was tested with Apache tools in Hadoop’s ecosystem. Results: At optimized iteration, Hadoop distributed file system (HDFS) ingestion required three seconds but HBase required four to twelve hours to complete the Reducer of MapReduce. HBase bulkloads took a week for one billion (10TB) and over two months for three billion (30TB). Simple and complex query results showed about two seconds for one and three billion, respectively. Interpretations: The BDA platform of HBase distributed by Hadoop performed successfully, with high performance, at large volumes representing the Province’s entire data. Inconsistencies of MapReduce limited operational efficiencies. The importance of Hadoop/MapReduce for the representation of health informatics is further discussed. / Graduate / 0566 / 0769 / 0984 / dillon.chrimes@viha.ca Big Data Big Data Analytics Big Data Tools Big Data Visualizations Hadoop Ecosystem Health Big Data Hospital Systems Interactive Big Data Patient Data Simulations
