Global ETD Search

1	Using statistical learning to predict survival of passengers on the RMS Titanic Whitley, Michael Aaron January 1900 Master of Science / Statistics / Christopher Vahl / Predictive analytics techniques have proven effective for exploring data. In this report, the efficiency of several predictive analytics methods is explored. At the time of this study, Kaggle.com, a data science competition website, was hosting the predictive modeling competition "Titanic: Machine Learning from Disaster". This competition posed a classification problem: build a model to predict the survival of passengers on the RMS Titanic. Our approach focused on applying a traditional classification and regression tree algorithm. The algorithm is greedy and can overfit the training data, which can yield non-optimal prediction accuracy. To correct these issues, we implemented cost complexity pruning and ensemble methods such as bagging and random forests. However, no improvement was observed, which may be an artifact of the Titanic data and may not be representative of those methods’ performance. The decision trees and prediction accuracy of each method are presented and compared. Results indicate that sex/title, fare price, age, and passenger class are the most important predictors of passenger survival. Decision tree Ensemble Kaggle Titanic Statistics (0463)
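The greedy step that the abstract above attributes to classification and regression trees can be sketched as follows. This is a minimal illustration, not the thesis code: it assumes Gini impurity as the split criterion, searches one numeric feature only, and uses made-up toy values rather than the Kaggle Titanic data. The names `gini`, `best_split`, `toy_fares` and `toy_survived` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels):
    """Greedy search for the threshold on one numeric feature that
    minimises the weighted Gini impurity of the two child nodes."""
    n = len(values)
    best_threshold, best_score = None, gini(labels)
    for threshold in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        if not left or not right:
            continue  # a split must send samples to both children
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Made-up toy data (NOT the Kaggle Titanic data): fare paid and survival flag.
toy_fares = [7.0, 8.0, 9.0, 30.0, 50.0, 80.0]
toy_survived = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(toy_fares, toy_survived)
```

Because each split is chosen only for its immediate impurity reduction, a fully grown tree can fit noise in the training set, which is the motivation the abstract gives for pruning and ensembling.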
2	Rozpoznávání druhu jídla s pomocí hlubokých neuronových sítí / Food classification using deep neural networks Kuvik, Michal January 2019 The aim of this thesis is to study deep convolutional neural networks and their use in image classification, and to experiment with the architecture of a particular network in order to obtain the most accurate results on the selected dataset. The thesis is divided into two parts. The first part outlines the theoretical properties and structure of neural networks and briefly introduces selected networks. The second part describes experiments with this network, covering the impact of data augmentation, batch size, and dropout layers on its accuracy. Finally, all results are compared and discussed; the best result was an accuracy of 86.44% on test data.
3	How can a module for sentiment analysis be designed to classify tweets about covid19 / Hur kan man designa en modul inom sentimentanalys för att klassificera tweets om covid19 Ly, Denny, Saad Abdul Malik, Tamara January 2021 Sentiment analysis of text is receiving increasing attention from a range of organizations for a variety of reasons. Emotion mining (sentiment analysis) is an interesting subject to explore, and the research question of this thesis is: how can a module for sentiment analysis be designed to classify tweets about Covid-19? The dataset used for this project was taken from Kaggle and preprocessed with methods such as Bag of Words and term frequency-inverse document frequency. The models are based on the following algorithms: KNN, SVM, DT, and NB. Some models combine ML with a lexicon. The experiment showed that the lexicon method, with an accuracy of 87%, outperformed both the machine learning methods implemented in this thesis and the experiments published by the ML community on Kaggle. This suggests that the traditional lexicon approach is still a suitable choice in the field of sentiment analysis.
Sentiment Analysis Machine Learning Lexicon technique Kaggle Preprocessing Computer and Information Sciences
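The Bag of Words and term frequency-inverse document frequency steps named in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis pipeline: tokenisation is plain whitespace splitting, the weighting is the textbook variant tf = count / document length and idf = log(N / document frequency), and the example tweets are made up. The names `bag_of_words`, `tf_idf` and `toy_tweets` are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase, whitespace-tokenise, and count terms."""
    return Counter(text.lower().split())


def tf_idf(documents):
    """Return one {term: weight} dict per document, using
    tf = count / document length and idf = log(N / document frequency)."""
    bags = [bag_of_words(doc) for doc in documents]
    n_docs = len(bags)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for bag in bags for term in bag)
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return weights


# Made-up example tweets, not taken from the thesis dataset.
toy_tweets = ["covid vaccine works", "covid lockdown again", "vaccine news"]
toy_weights = tf_idf(toy_tweets)
```

In this toy corpus, "works" appears in one tweet and "covid" in two, so "works" receives the larger weight in the first tweet. Library implementations often add smoothing or normalisation, so their values will generally differ from this sketch.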
4	Comparison of Popular Data Processing Systems Nasr, Kamil January 2021 Data processing is generally defined as the collection and transformation of data to extract meaningful information. It involves a multitude of processes such as validation, sorting, summarization, and aggregation, to name a few. Many analytics engines exist today for large-scale data processing, among them Apache Spark, Apache Flink and Apache Beam. Each of these engines has its own advantages and drawbacks. In this thesis report, we used all three engines to process data from the Carbon Monoxide Daily Summary Dataset to determine the emission levels per area and unit of time. We then compared the performance of the three engines using different metrics. The results showed that Apache Beam, while offering greater convenience when writing programs, was slower than Apache Flink and Apache Spark. The Spark Runner was the fastest runner in Beam, and Apache Spark was the fastest data processing framework overall.
Apache Spark Apache Flink Apache Beam Spark Runner Flink Runner Direct Runner Big Data Analytics Data Processing Systems Benchmarking Kaggle Computer and Information Sciences
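The aggregation described in the abstract above, emission levels per area and unit of time, can be sketched in plain Python as follows. This is only an illustration of the group-and-average logic, not Spark, Flink or Beam code and not the thesis implementation. The record layout (area, ISO date, reading), the choice of month as the time unit, and the names `mean_by_area_and_month` and `toy_records` are all assumptions, and the values are made up.

```python
from collections import defaultdict


def mean_by_area_and_month(records):
    """Group (area, 'YYYY-MM-DD', value) records by (area, 'YYYY-MM')
    and return the mean value of each group."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for area, date, value in records:
        key = (area, date[:7])  # keep only the year and month
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Made-up readings; the real dataset's columns are not given in the abstract.
toy_records = [
    ("A", "2020-01-01", 0.5),
    ("A", "2020-01-02", 1.5),
    ("A", "2020-02-01", 1.0),
    ("B", "2020-01-01", 2.0),
]
toy_means = mean_by_area_and_month(toy_records)
```

A distributed engine expresses the same computation as a keyed group-by followed by a per-key reduction, which lets the work be partitioned across machines.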
