
Gradient Boosting Machine and Artificial Neural Networks in R and H2O

Sabo, Juraj January 2016
Artificial neural networks are fascinating machine learning algorithms. They were long considered unreliable and computationally very expensive; it is now known that modern neural networks can be quite useful, although their computational cost unfortunately remains high. Statistical boosting is considered one of the most important ideas in machine learning: it builds an ensemble of weak models that together form a powerful learning system. The goal of this thesis is to compare these machine learning models on three use cases. The first deals with modeling the probability of burglary in the city of Chicago, the second is the typical example of customer churn prediction in the telecommunications industry, and the last is a computer vision problem. The second goal of the thesis is to introduce H2O, an open-source machine learning platform that provides, among other things, an interface for R and is designed to run in standalone mode or on Hadoop. The thesis also includes an introduction to Apache Hadoop, an open-source software framework for distributed processing of big data, and specifically to its open-source distribution, the Hortonworks Data Platform.
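A minimal sketch of the workflow the abstract describes, shown here with H2O's Python client rather than the R interface covered in the thesis; the input file, column names, and hyperparameters are illustrative assumptions, not taken from the thesis.

    import h2o
    from h2o.estimators.gbm import H2OGradientBoostingEstimator
    from h2o.estimators.deeplearning import H2ODeepLearningEstimator

    h2o.init()  # start or connect to a local H2O instance (H2O can also run on Hadoop)
    churn = h2o.import_file("telco_churn.csv")              # hypothetical churn data set
    churn["churn"] = churn["churn"].asfactor()              # treat the response as a categorical label
    train, test = churn.split_frame(ratios=[0.8], seed=42)
    predictors = [c for c in churn.columns if c != "churn"]

    # Gradient boosting machine
    gbm = H2OGradientBoostingEstimator(ntrees=100, max_depth=5, learn_rate=0.1)
    gbm.train(x=predictors, y="churn", training_frame=train)

    # Feed-forward artificial neural network (H2O "deep learning")
    ann = H2ODeepLearningEstimator(hidden=[64, 64], epochs=20)
    ann.train(x=predictors, y="churn", training_frame=train)

    print(gbm.model_performance(test).auc(), ann.model_performance(test).auc())

H2O's R package exposes the same two algorithms (h2o.gbm and h2o.deeplearning), so the R workflow in the thesis follows an analogous call sequence.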

Performance Evaluation of LINQ to HPC and Hadoop for Big Data

Sivasubramaniam, Ravishankar 01 January 2013
There is currently considerable enthusiasm around MapReduce, a distributed computing paradigm for the analysis of large volumes of data. Apache Hadoop is the most popular open-source implementation of the MapReduce model, and LINQ to HPC is Microsoft's alternative to it. In this thesis, the performance of LINQ to HPC and Hadoop is compared using different benchmarks. To this end, we identified four benchmarks (Grep, Word Count, Read, and Write) that we ran on both LINQ to HPC and Hadoop. For each benchmark, we measured each system's performance metrics (execution time, average CPU utilization, and average memory utilization) for various degrees of parallelism on clusters of different sizes. The results revealed some interesting trade-offs: LINQ to HPC performed better on three of the four benchmarks (Grep, Read, and Write), whereas Hadoop performed better on the Word Count benchmark. While extensive research has focused on Hadoop, there are few references to similar research on the LINQ to HPC platform, which was still evolving while this thesis was being written.
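To make the Word Count benchmark concrete, here is a minimal sketch of how it is commonly expressed for Hadoop Streaming; the mapper/reducer file names and the job submission line are illustrative assumptions, not the benchmark code used in the thesis.

    #!/usr/bin/env python3
    # mapper.py -- emit ("word", 1) for every whitespace-separated token on stdin
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py -- sum the counts for each word (Hadoop delivers the input sorted by key)
    import sys

    current_word, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").rsplit("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{total}")
            current_word, total = word, 0
        total += int(count)
    if current_word is not None:
        print(f"{current_word}\t{total}")

A job of this shape is typically submitted through the Hadoop Streaming jar, roughly: hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input <input dir> -output <output dir> (the jar location varies between Hadoop distributions).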

Big Data Processing from Large IoT Networks

Benkő, Krisztián January 2019
The goal of this diploma thesis is to design and develop a system for collecting, processing, and storing data from large IoT networks. The developed system is a comprehensive solution able to process data from various IoT networks using the Apache Hadoop ecosystem. The data are processed in real time and stored in a NoSQL database, and they are also kept in the file system for potential later processing. The system is optimized and tested using data from an IQRF network. The data stored in the NoSQL database are visualized, and the system periodically generates derived predictions. Users interact with the system through an information system that automatically generates notifications when monitored values fall outside their expected range.
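A minimal sketch of the storage-and-notification step described above; the abstract does not name the concrete NoSQL database, so this example assumes an HBase-style store reached through the happybase client, with a hypothetical table name, column family, and alert threshold.

    import happybase

    TEMP_LIMIT = 40.0  # hypothetical alert threshold for a monitored value

    def store_reading(table, sensor_id, timestamp, temperature):
        # Persist one sensor reading and report whether it should trigger a notification.
        row_key = f"{sensor_id}-{timestamp}".encode()
        table.put(row_key, {b"data:temperature": str(temperature).encode()})
        return temperature > TEMP_LIMIT

    connection = happybase.Connection("localhost")   # assumes an HBase Thrift server is running
    readings = connection.table("iot_readings")      # hypothetical table

    if store_reading(readings, "sensor-42", "2019-05-01T12:00:00", 42.7):
        print("notification: monitored value out of range")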

Geo-Locating Tweets with Latent Location Information

Lee, Sunshin 13 February 2017
As part of our work on the NSF-funded Integrated Digital Event Archiving and Library (IDEAL) project and the Global Event and Trend Archive Research (GETAR) project, we collected over 1.4 billion tweets, starting from 2009, using over 1,000 keywords, key phrases, mentions, or hashtags. Since many tweets talk about events with useful location information, such as natural disasters, emergencies, and accidents, it is important to geo-locate those tweets whenever possible. Due to possible location ambiguity, finding a tweet's location is often challenging. Many distinct places share the same geoname; e.g., "Greenville" matches 50 different locations in the U.S.A. Frequently, the explicit location information in tweets, such as the geonames mentioned, is insufficient, because tweets are brief and incomplete; they carry only a small fraction of the full location information of an event due to the 140-character limit. Location indicative words (LIWs) may carry latent location information; for example, "Water main break near White House" contains no geoname, yet the key phrase "White House" ties it to the location "1600 Pennsylvania Ave NW, Washington, DC 20500 USA". To disambiguate tweet locations, we first extracted geospatial named entities (geonames) and predicted implicit state information (e.g., Virginia or California) from those entities using machine learning algorithms, including Support Vector Machine (SVM), Naive Bayes (NB), and Random Forest (RF); implicit state information helps reduce ambiguity. We also studied how the location information of events is expressed in tweets and how latent location indicative information can help geo-locate them. We then used a machine learning (ML) approach to predict the implicit state from geonames and LIWs. We conducted experiments with tweets (e.g., about potholes) and found significant improvement in disambiguating tweet locations using an ML algorithm together with the Stanford NER. Adding state information predicted by our classifiers increased the chance of finding the state-level geo-location unambiguously by up to 80%. We also studied over 6 million tweets (three mid-size and two large collections about water main breaks, sinkholes, potholes, car crashes, and car accidents) covering 17 months, and found that up to 91.1% of tweets contain at least one type of location information (geo-coordinates or geonames) or LIWs. We also demonstrated that in most cases adding LIWs helps geo-locate tweets with less ambiguity using a geo-coding API. Finally, we conducted additional experiments with the five tweet collections and found significant improvement in disambiguating tweet locations using an ML approach with geonames and all LIWs present in the tweet texts as features. / Ph. D.
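A minimal sketch of the state-prediction idea: a bag-of-words classifier over the geonames and LIWs extracted from a tweet, using Naive Bayes, one of the algorithms named in the abstract. The training examples and state labels below are invented for illustration and are not data from the study.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Each training "document" is the set of geonames and LIWs found in one tweet.
    features = [
        "Greenville pothole Main Street courthouse",
        "water main break White House Pennsylvania Ave",
        "sinkhole Blacksburg drillfield campus",
    ]
    states = ["SC", "DC", "VA"]   # hypothetical implicit state labels

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(features, states)

    print(model.predict(["pothole reported near Greenville courthouse"]))

In the study itself the feature set is richer (all geonames plus LIWs present in the tweet text) and several classifiers (SVM, NB, RF) were compared; this sketch only shows the overall shape of such a model.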

Distributed Forensic Digital Data Repository

Josefík, Martin January 2018
This work deals with the design of a distributed repository aimed at storing digital forensic data. The theoretical part of the thesis describes digital forensics and its purpose, and explains big data and suitable storage systems, including their properties, advantages, and disadvantages. The main part of the thesis deals with the design and implementation of distributed storage for digital forensic data. The design also focuses on suitable indexing of the stored data and on support for new types of digital forensic data. The performance of the implemented system was evaluated for one chosen type of digital forensic data, PCAP files.
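As an illustration of what indexing PCAP data can involve, here is a minimal sketch (not code from the thesis) that extracts per-packet metadata from a capture file using the scapy library; the file name and the choice of fields are assumptions made for the example.

    from scapy.all import rdpcap, IP

    def pcap_index_records(path):
        # Yield one small metadata record per IP packet, suitable for a document index.
        for pkt in rdpcap(path):           # loads the whole capture into memory
            if IP in pkt:
                yield {
                    "time": float(pkt.time),   # capture timestamp
                    "src": pkt[IP].src,        # source address
                    "dst": pkt[IP].dst,        # destination address
                    "length": len(pkt),        # packet size in bytes
                }

    for record in pcap_index_records("capture.pcap"):
        print(record)

Records of this shape can then be bulk-loaded into whatever index or NoSQL store the repository uses.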
