41

Shluková analýza rozsáhlých souborů dat: nové postupy založené na metodě k-průměrů / Cluster analysis of large data sets: new procedures based on the method k-means

Žambochová, Marta January 2005
Abstract Cluster analysis has become one of the main tools used in extracting knowledge from data, a field known as data mining. In this area of data analysis, data of large dimensions are often processed, both in the number of objects and in the number of variables that characterize the objects. Many methods for data clustering have been developed. One of the most widely used is the k-means method, which is suitable for clustering data sets containing a large number of objects. It is based on finding the best clustering with respect to an initial assignment of objects to clusters, followed by a step-by-step reassignment of objects to clusters according to the optimization function. The aim of this Ph.D. thesis was a comparison of selected variants of existing k-means methods, a detailed characterization of their strengths and weaknesses, new variants of this method, and an experimental comparison with existing approaches. These objectives were met. In my work I focused on modifications of the k-means method for clustering large numbers of objects, specifically on the BIRCH k-means, filtering, k-means++ and two-phase algorithms. I examined the time complexity of the algorithms, the effect of the initial distribution and of outliers, and the validity of the resulting clusters. Two real data files and several generated data sets were used. The common and distinguishing features of the investigated methods are summarized at the end of the work. The main aim and contribution of the work is the design of my own modifications, which address the bottlenecks of the basic procedure and of the existing variants, together with their implementation and verification. Some modifications accelerated the processing. Applying the main ideas of the k-means++ algorithm to other variants of the k-means method improved the clustering results. The most significant of the proposed changes is a modification of the filtering algorithm that adds an entirely new capability: the detection of outliers. The accompanying CD includes the source code of programs written in the MATLAB development environment. The programs were created specifically for the purpose of this work and are intended for experimental use. The CD also contains the data files used for the various experiments.
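The abstract names k-means++ seeding as the idea that improved the other k-means variants. As an illustration only, the following minimal Python sketch shows the standard k-means++ initialization; it is not the thesis's MATLAB code nor its filtering/outlier-detection modification.

```python
import numpy as np

def kmeans_pp_init(X, k, rng=np.random.default_rng(0)):
    """Choose k initial centers from an (n, d) array X with k-means++ seeding:
    each new center is drawn with probability proportional to its squared
    distance from the nearest center already chosen."""
    centers = [X[rng.integers(len(X))]]  # first center: uniform at random
    for _ in range(k - 1):
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1), axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])  # distance-weighted draw
    return np.array(centers)
```

A standard k-means loop can then start from `kmeans_pp_init(X, k)` instead of purely random centers, which is what typically stabilizes the resulting clustering.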
42

Μηχανική μάθηση σε ανομοιογενή δεδομένα / Machine learning in imbalanced data sets

Λυπιτάκη, Αναστασία Δήμητρα Δανάη 07 July 2015
Οι αλγόριθμοι μηχανικής μάθησης είναι επιθυμητό να είναι σε θέση να γενικεύσουν για οποιαδήποτε κλάση με ίδια ακρίβεια. Δηλαδή σε ένα πρόβλημα δύο κλάσεων - θετικών και αρνητικών περιπτώσεων - ο αλγόριθμος να προβλέπει με την ίδια ακρίβεια και τα θετικά και τα αρνητικά παραδείγματα. Αυτό είναι φυσικά η ιδανική κατάσταση. Σε πολλές εφαρμογές οι αλγόριθμοι καλούνται να μάθουν από ένα σύνολο στοιχείων, το οποίο περιέχει πολύ περισσότερα παραδείγματα από τη μια κλάση σε σχέση με την άλλη. Εν γένει, οι επαγωγικοί αλγόριθμοι είναι σχεδιασμένοι να ελαχιστοποιούν τα σφάλματα. Ως συνέπεια οι κλάσεις που περιέχουν λίγες περιπτώσεις μπορούν να αγνοηθούν κατά ένα μεγάλο μέρος επειδή το κόστος λανθασμένης ταξινόμησης της υπερ-αντιπροσωπευόμενης κλάσης ξεπερνά το κόστος λανθασμένης ταξινόμησης της μικρότερης κλάσης. Το πρόβλημα των ανομοιογενών συνόλων δεδομένων εμφανίζεται και σε πολλές πραγματικές εφαρμογές όπως στην ιατρική διάγνωση, στη ρομποτική, στις διαδικασίες βιομηχανικής παραγωγής, στην ανίχνευση λαθών δικτύων επικοινωνίας, στην αυτοματοποιημένη δοκιμή του ηλεκτρονικού εξοπλισμού, και σε πολλές άλλες περιοχές. Η παρούσα διπλωματική εργασία με τίτλο ‘Μηχανική Μάθηση με Ανομοιογενή Δεδομένα’ (Machine Learning with Imbalanced Data) αναφέρεται στην επίλυση του προβλήματος αποδοτικής χρήσης αλγορίθμων μηχανικής μάθησης σε ανομοιογενή/ανισοκατανεμημένα δεδομένα. Η διπλωματική περιλαμβάνει μία γενική περιγραφή των βασικών αλγορίθμων μηχανικής μάθησης και των μεθόδων αντιμετώπισης του προβλήματος ανομοιογενών δεδομένων. Παρουσιάζεται πλήθος αλγοριθμικών τεχνικών διαχείρισης ανομοιογενών δεδομένων, όπως οι αλγόριθμοι AdaCost, Cost Sensitive Boosting, Metacost και άλλοι. Παρατίθενται οι μετρικές αξιολόγησης των μεθόδων Μηχανικής Μάθησης σε ανομοιογενή δεδομένα, όπως οι καμπύλες διαχείρισης λειτουργικών χαρακτηριστικών (ROC curves), καμπύλες ακρίβειας (PR curves) και καμπύλες κόστους. Στο τελευταίο μέρος της εργασίας προτείνεται ένας υβριδικός αλγόριθμος που συνδυάζει τις τεχνικές OverBagging και Rotation Forest. Συγκρίνεται ο προτεινόμενος αλγόριθμος σε ένα σύνολο ανομοιογενών δεδομένων με άλλους αλγόριθμους και παρουσιάζονται τα αντίστοιχα πειραματικά αποτελέσματα που δείχνουν την καλύτερη απόδοση του προτεινόμενου αλγόριθμου. Τελικά διατυπώνονται τα συμπεράσματα της εργασίας και δίνονται χρήσιμες ερευνητικές κατευθύνσεις. / Ideally, machine learning (ML) algorithms should generalize to every class with the same accuracy: in a two-class problem with positive and negative cases, the algorithm should predict positive and negative examples equally well. In many applications, however, ML algorithms must learn from data sets that contain far more examples of one class than of the other. In general, inductive algorithms are designed to minimize the errors that occur. As a consequence, classes that contain few cases can be largely ignored, because the cost of misclassifying the over-represented class outweighs the cost of misclassifying the minority class. The problem of imbalanced data sets occurs in many real-world applications, such as medical diagnosis, robotics, industrial production processes, communication network error detection, automated testing of electronic equipment and other related areas. This dissertation, entitled ‘Machine Learning with Imbalanced Data’, addresses the problem of using ML algorithms efficiently on imbalanced data sets.
The thesis includes a general description of basic ML algorithms and of methods for dealing with imbalanced data sets. A number of algorithmic techniques for handling imbalanced data sets are presented, such as AdaCost, Cost Sensitive Boosting, MetaCost and other algorithms. The evaluation metrics of ML methods for imbalanced data sets are presented, including ROC (Receiver Operating Characteristic) curves, PR (Precision-Recall) curves and cost curves. A new hybrid ML algorithm combining the OverBagging and Rotation Forest algorithms is introduced, and the proposed algorithmic procedure is compared with other related algorithms using the WEKA environment. Experimental results demonstrate the performance superiority of the proposed algorithm. Finally, the conclusions of this research work are presented and several future research directions are given.
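The proposed hybrid combines OverBagging with Rotation Forest; the abstract does not include code, so the following Python sketch illustrates only the OverBagging half — a bagging ensemble whose bootstrap samples oversample the minority class — assuming scikit-learn is available and that class labels are small non-negative integers. It is an illustrative sketch, not the thesis's implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class OverBagging:
    """Sketch of OverBagging: every base tree is trained on a bootstrap sample
    in which each class is resampled up to the size of the largest class."""

    def __init__(self, n_estimators=10, random_state=0):
        self.n_estimators = n_estimators
        self.rng = np.random.default_rng(random_state)
        self.estimators_ = []

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        classes, counts = np.unique(y, return_counts=True)
        n_max = counts.max()
        for _ in range(self.n_estimators):
            idx = np.concatenate([
                self.rng.choice(np.where(y == c)[0], size=n_max, replace=True)
                for c in classes          # minority classes are oversampled here
            ])
            self.estimators_.append(
                DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
        return self

    def predict(self, X):
        votes = np.stack([est.predict(np.asarray(X)) for est in self.estimators_])
        # plain majority vote over the ensemble (assumes integer class labels)
        return np.apply_along_axis(
            lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
```

The full method described above would additionally apply Rotation Forest's per-estimator feature rotation (PCA on random feature subsets) before fitting each tree; that step is omitted here.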
43

En jämförelse mellan databashanterare med prestandatester och stora datamängder / A comparison between database management systems with performance testing and large data sets

Brander, Thomas, Dakermandji, Christian January 2016
Företaget Nordicstation hanterar stora datamängder åt Swedbank där datalagringen sker i relationsdatabasen Microsoft SQL Server 2012 (SQL Server). Då det finns andra databashanterare designade för stora datavolymer är det oklart om SQL Server är den optimala lösningen för situationen. Detta examensarbete har tagit fram en jämförelse med hjälp av prestandatester, beträffande exekveringstiden av databasfrågor, mellan databaserna SQL Server, Cassandra och NuoDB vid hanteringen av stora datamängder. Cassandra är en kolumnbaserad databas designad för hantering av stora datavolymer, NuoDB är en minnesdatabas som använder internminnet som lagringsutrymme och är designad för skalbarhet. Resultaten togs fram i en virtuell servermiljö med Windows Server 2012 R2 på en testplattform skriven i Java. Jämförelsen visar att SQL Server var den databas mest lämpad för gruppering, sortering och beräkningsoperationer. Däremot var Cassandra bäst i skrivoperationer och NuoDB presterade bäst i läsoperationer. Analysen av resultatet visade att mindre access till disken ger kortare exekveringstid men den skalbara lösningen, NuoDB, lider av kraftiga prestandaförluster av att endast konfigureras med en nod. Nordicstation rekommenderas att uppgradera till Microsoft SQL Server 2014, eller senare, där möjlighet finns att spara tabeller i internminnet. / The company Nordicstation handles large amounts of data for Swedbank, where data is stored using the relational database Microsoft SQL Server 2012 (SQL Server). The existence of other databases designed for handling large amounts of data makes it unclear whether SQL Server is the best solution for this situation. This degree project describes a comparison between databases using performance testing, with regard to the execution time of database queries. The chosen databases were SQL Server, Cassandra and NuoDB. Cassandra is a column-oriented database designed for handling large amounts of data, while NuoDB is an in-memory database that uses main memory for data storage and is designed for scalability. The performance tests were executed in a virtual server environment with Windows Server 2012 R2 using an application written in Java. SQL Server was the database most suited for grouping, sorting and arithmetic operations. Cassandra had the shortest execution time for write operations while NuoDB performed best in read operations. This degree project concludes that minimizing disk operations leads to shorter execution times, but the scalable solution, NuoDB, suffers severe performance losses when configured as a single node. Nordicstation is recommended to upgrade to Microsoft SQL Server 2014, or later, because of the possibility of storing tables in main memory.
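The measurements behind this comparison were taken on a Java test platform that is not reproduced in the abstract. As a hedged illustration of the measurement idea only, the following Python sketch times repeated executions of a query, where `run_query` is a hypothetical placeholder for the driver-specific call to SQL Server, Cassandra or NuoDB.

```python
import time
import statistics

def time_query(run_query, repetitions=10):
    """Return simple execution-time statistics for one query.

    `run_query` is a hypothetical callable wrapping the database-specific
    driver call; it stands in for the thesis's Java test platform.
    """
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()                      # execute the query once
        samples.append(time.perf_counter() - start)
    return {"median_s": statistics.median(samples),
            "min_s": min(samples),
            "max_s": max(samples)}
```

Repeating each query and reporting the median rather than a single run reduces the influence of caching and background load, which matters when the differences between systems are small.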
44

Inverkan av delmaterialens variationer på betongens egenskaper / Effect of variations in the constituents on the properties of concrete

Ghafori, Abbas, Estrada Bernuy, Gabriel January 2015
Vid betongframställning förekommer det spridningar i delmaterialens egenskaper som påverkar den färska och hårdnande betongen. Spridningarna i betongens delmaterial har studerats hos tre av Skanskas betongfabriker (Göteborg, Luleå och Norrköping), genom provuttag som analyserats hos Cementa Research. Provuttag har gjorts en gång per månad under ett års tid från fabrikerna. Delmaterialen som har analyserats är ballast, cement, flytmedel och kalkfiller (endast hos Göteborg och Norrköping). Siktning av ballast 0-8 mm har utförts med den traditionella siktningen. För kornstorlekar mindre än 0,25 mm, cement samt kalkfiller har lasersiktning använts. För att få en överskådlig bild över spridningarna hos delmaterialen har en analys utförts som illustrerar avvikelserna med exakta siffror. Analysen har visat att den traditionella siktningen har mindre spridning jämfört med lasersiktning. Dessutom visar analysen att sättmåttet har större spridning jämfört med hållfastheten. För ballast 0-8 mm har minst spridning visats hos Luleå och störst hos Norrköping, däremot så har Luleå visat störst spridning i ballast < 0,25 mm, cement, kalkfiller, flytmedel och hållfasthet samtidigt som Göteborg visat minst spridning i dessa och istället störst spridning i sättmått. För att få en överskådlig bild över vilka egenskapsförändringar som förväntas i betongen om respektive delmaterial förändrats åt något håll har deskriptiv analys tillämpats parallellt med teoretisk analys. Den deskriptiva analysen har avgränsats genom att undersöka hur förändringar i delmaterialen ballast, cement, kalkfiller och flytmedel påverkar sättmåttet och hållfastheten. Resultaten från den deskriptiva analysen har visat att en utökad mängd grövre ballast 0-8 mm ger upphov till större sättmått och en utökad mängd finare ballast 0-8 mm ger högre hållfasthet för majoriteten av proverna. För ballast 0-8 mm < 0,25 mm har analysen visat att finare ballast < 0,25 mm ger upphov till större sättmått. Hos Göteborg visar dessutom majoriteten av proverna högre hållfasthet för finare ballast < 0,25 mm. Prover från Göteborg har visat att grövre kalkfiller ger högre hållfasthet. Hos Norrköping visar dessutom majoriteten av proverna större sättmått för finare kalkfiller och högre hållfasthet för grövre kalkfiller. För cementet har analysen visat att majoriteten av proverna hos Luleå har gett upphov till större sättmått för finare cement och högre hållfasthet för grövre cement. Hos Norrköping har analysen visat samma gällande hållfasthet, däremot tvärtom för sättmåttet, d.v.s. grövre cement har gett upphov till större sättmått. För flytmedel har majoriteten av proverna hos Luleå visat att högre torrhalt gett upphov till större sättmått och lägre torrhalt resulterat i högre hållfasthet. / During concrete production, variations occur in the properties of the constituents that affect the fresh and hardened concrete. The variation in the constituents has been studied at three of Skanska’s concrete plants (Gothenburg, Luleå and Norrköping) through samples analyzed at Cementa Research. Sampling at these plants took place once per month over a one-year period. The constituents that have been analyzed are aggregates, cement, superplasticizers and limestone filler (only in Gothenburg and Norrköping). Sieving of aggregates 0-8 mm has been conducted with traditional sieving. For grain sizes smaller than 0.25 mm, as well as for cement and limestone filler, laser sieving has been used. To get a clear picture of the variations in the constituents, an analysis was performed that illustrates the discrepancies with exact figures. The analysis shows that traditional sieving has less variation compared to laser sieving. Moreover, the analysis shows that the slump has larger variation than the compressive strength. Luleå showed the least variation for aggregates 0-8 mm while the largest variation was apparent at Norrköping. However, Luleå has shown the largest variation in aggregates < 0.25 mm, cement, limestone filler, superplasticizers and compressive strength, while Gothenburg showed the least variation in these and instead the largest variation in the slump. To better understand the property changes that are expected in the concrete when the respective constituents change in either direction, descriptive analysis was applied in parallel with theoretical analysis. The descriptive analysis has been limited to exploring how changes in aggregates, cement, limestone filler and superplasticizers affect the slump and compressive strength. The results of the descriptive analysis have shown that an increased amount of coarser aggregates 0-8 mm gives rise to larger slump and an increased amount of finer aggregates 0-8 mm gives higher compressive strength for the majority of the samples. For the fraction of aggregates 0-8 mm smaller than 0.25 mm, the analysis has shown that finer aggregates < 0.25 mm give rise to greater slump. In Gothenburg, the majority of the samples also show higher compressive strength for finer aggregates < 0.25 mm. Samples from Gothenburg have shown that coarser limestone filler provides higher compressive strength. In Norrköping, the majority of the samples also show greater slump for finer limestone filler and higher compressive strength for coarser limestone filler. For cement, the analysis has shown that the majority of the samples in Luleå gave rise to greater slump for finer cement and higher compressive strength for coarser cement. In Norrköping, the analysis showed the same for compressive strength, but the opposite for the slump, i.e. coarser cement gave rise to greater slump. For superplasticizers, the majority of the samples in Luleå showed that a higher dry content gave rise to greater slump and a lower dry content resulted in higher compressive strength.
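As a hedged illustration of the descriptive analysis of spread (not the thesis's actual calculations or data), the relative variation of a measured property across the monthly samples can be summarized with a coefficient of variation:

```python
import statistics

def coefficient_of_variation(samples):
    """Relative spread (standard deviation divided by mean) of one property,
    e.g. twelve monthly slump or compressive-strength values from one plant."""
    return statistics.stdev(samples) / statistics.mean(samples)

# Hypothetical monthly slump measurements in mm -- illustrative values only.
slump_mm = [180, 175, 190, 185, 170, 195, 180, 185, 175, 190, 180, 185]
print(f"slump CV: {coefficient_of_variation(slump_mm):.3f}")
```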
45

Analýza dat síťové komunikace mobilních zařízení / Analysis of Mobile Devices Network Communication Data

Abraham, Lukáš January 2020
At the beginning, the work describes the DNS and SSL/TLS protocols, focusing on the communication between devices that use them. It then discusses data preprocessing and data cleaning. Furthermore, the thesis deals with basic data mining techniques such as data classification, association rules, information retrieval, regression analysis and cluster analysis. The next chapter describes how mobile devices can be identified on the network. The data sets containing the communication collected over the above-mentioned protocols, which are used in the practical part, are then evaluated. After that, the thesis presents the design of a system for analyzing network communication data, describes the libraries used and the entire system implementation, and finally evaluates a large number of experiments.
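The abstract mentions classifying devices from their DNS and SSL/TLS traffic; the following Python sketch is a hedged illustration of that step with entirely hypothetical feature names, values and labels (the thesis's actual feature set, data and implementation are not shown here), assuming scikit-learn is available.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-device feature vectors derived from DNS/TLS traffic:
# [distinct domains queried, mean DNS queries per hour, TLS handshakes, distinct SNI values].
# Feature names, values and labels are illustrative assumptions, not the thesis's data.
X_train = [
    [120, 35.0, 300, 40],   # smartphone
    [15,   2.0,  20,  5],   # IoT sensor
    [200, 50.0, 450, 60],   # smartphone
    [10,   1.5,  15,  4],   # IoT sensor
]
y_train = ["smartphone", "iot", "smartphone", "iot"]

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
print(clf.predict([[18, 3.0, 25, 6]]))  # expected: ['iot']
```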
46

Distributed Support Vector Machine With Graphics Processing Units

Zhang, Hang 06 August 2009
Training a Support Vector Machine (SVM) requires the solution of a very large quadratic programming (QP) optimization problem. Sequential Minimal Optimization (SMO) is a decomposition-based algorithm which breaks this large QP problem into a series of the smallest possible QP problems. However, it still costs O(n²) computation time. In our SVM implementation, we can train with huge data sets in a distributed manner: the data set is broken into chunks, Message Passing Interface (MPI) is used to distribute each chunk to a different machine, and SVM training is performed within each chunk. In addition, we moved the kernel calculation in SVM classification to a graphics processing unit (GPU), which has zero scheduling overhead for creating concurrent threads. In this thesis, we take advantage of this GPU architecture to improve the classification performance of the SVM.
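The abstract describes the chunk-and-distribute training scheme in prose; the following Python sketch illustrates only that chunking idea, assuming mpi4py and scikit-learn are installed, with synthetic data. It is an illustrative sketch, not the thesis's implementation, and the GPU-based kernel computation is not shown.

```python
import numpy as np
from mpi4py import MPI
from sklearn.svm import SVC

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    # Root process holds the full training set and splits it into one chunk per rank.
    X = np.random.rand(10000, 20)
    y = (X[:, 0] > 0.5).astype(int)          # synthetic labels, illustration only
    chunks = list(zip(np.array_split(X, size), np.array_split(y, size)))
else:
    chunks = None

X_chunk, y_chunk = comm.scatter(chunks, root=0)        # each rank receives one chunk
local_model = SVC(kernel="rbf").fit(X_chunk, y_chunk)  # SMO runs only on the local chunk

# Collect a per-chunk summary (number of support vectors) back at the root.
counts = comm.gather(len(local_model.support_), root=0)
if rank == 0:
    print("support vectors per chunk:", counts)
```

Run with, for example, `mpiexec -n 4 python train_chunks.py`; each process then trains an independent SVM on its own chunk, mirroring the distribution step described above.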
