551 |
Využití technik Data Mining v různých odvětvích / Using Data Mining in Various Industries
Fabian, Jaroslav (January 2014)
This master’s thesis concerns the use of data mining techniques in the banking, insurance and shopping-centre industries. It describes the theory behind the algorithms and the CRISP-DM methodology for data mining processes. Drawing on this theoretical knowledge and these methods, the thesis suggests possible solutions for various industries within business intelligence processes.
|
552 |
Time Dynamic Topic Models
Jähnichen, Patrick (22 March 2016)
Information extraction from large corpora can be a useful tool for many applications in industry and academia. For instance, political communication science has only recently begun to use the opportunities that come with the massive amounts of information available through the Internet and the computational tools that natural language processing can provide. We give a linguistically motivated interpretation of topic modeling, a state-of-the-art algorithm for extracting latent semantic sets of words from large text corpora, and extend this interpretation to cover issues and issue-cycles as theoretical constructs from political communication science. We build on a dynamic topic model, a model whose semantic sets of words are allowed to evolve over time governed by a Brownian motion stochastic process, and apply a new form of analysis to its result. This analysis is based on the notion of volatility, as in the rate of change of stocks or derivatives known from econometrics. We claim that the rate of change of sets of semantically related words can be interpreted as issue-cycles, with the word sets describing the underlying issue. Generalizing over the existing work, we introduce dynamic topic models that are driven by general Gaussian processes (Brownian motion is a special case of our model), a family of stochastic processes defined by the function that determines their covariance structure. We use the above assumption and apply a certain class of covariance functions to allow for an appropriate rate of change in word sets while preserving the semantic relatedness among words. Applying our findings to a large newspaper data set, the New York Times Annotated Corpus (all articles between 1987 and 2007), we are able to identify sub-topics in time, called time-localized topics, and find patterns in their behavior over time. However, we have to drop the assumption of semantic relatedness over all available time for any one topic.
Time-localized topics are consistent in themselves but do not necessarily share semantic meaning with each other. They can, however, be interpreted to capture the notion of issues, and their behavior that of issue-cycles.
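The remark that Brownian motion is a special case of a Gaussian process can be made concrete through its covariance function, k(s, t) = variance * min(s, t). The following plain-Python sketch builds covariance matrices for that kernel and for a squared-exponential kernel. The second kernel is only an assumed stand-in, since the abstract does not name the class of covariance functions used, and all function names are illustrative.

```python
import math

def brownian_kernel(s, t, variance=1.0):
    """Covariance of Brownian motion started at 0: variance * min(s, t)."""
    return variance * min(s, t)

def squared_exponential_kernel(s, t, variance=1.0, length_scale=1.0):
    """A stationary kernel, chosen here purely for illustration."""
    return variance * math.exp(-((s - t) ** 2) / (2.0 * length_scale ** 2))

def covariance_matrix(kernel, times):
    """Gram matrix K[i][j] = kernel(times[i], times[j])."""
    return [[kernel(s, t) for t in times] for s in times]

times = [1.0, 2.0, 3.0]
K_bm = covariance_matrix(brownian_kernel, times)
K_se = covariance_matrix(squared_exponential_kernel, times)
```

Under the Brownian kernel the variance on the diagonal grows with time, whereas the squared-exponential covariance depends only on the time difference, so swapping the kernel changes how quickly a modelled quantity may drift.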
|
553 |
Application of small area estimation techniques in modelling accessibility of water, sanitation and electricity in South Africa : the case of Capricorn District
Mokobane, Reshoketswe (January 2019)
Thesis (Ph.D. (Statistics)) -- University of Limpopo, 2019 / This study presents the application of Direct and Indirect methods of Small Area Estimation (SAE) techniques. The study is aimed at estimating the trends and the proportions of households accessing water, sanitation, and electricity for lighting at small areas of the Limpopo Province, South Africa. The study modified Statistics South Africa’s General Household Survey series 2009-2015 and Census 2011 data. The option categories of three variables, Water, Sanitation and Electricity for lighting, were re-coded. Empirical Bayes and Hierarchical Bayes models, known as Markov Chain Monte Carlo (MCMC) methods, were used to refine estimates in SAS. The Census 2011 data aggregated in ‘Supercross’ was used to validate the results obtained from the models. The SAE methods were applied to account for the census undercoverage counts and rates. It was found that the electricity services were more prioritised than water and sanitation in the Capricorn District of the Limpopo Province. The greatest challenge, however, lies with the poor provision of sanitation services in the country, particularly in the small rural areas. The key point is to suggest policy considerations to the South African government for future equitable provisioning of water, sanitation and electricity services across the country.
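The relationship between direct and model-based (indirect) estimates can be illustrated with an area-level shrinkage estimator of the Fay-Herriot type. This is a sketch under assumptions, not the SAS models used in the thesis: the function name, the variances and the proportions below are all hypothetical.

```python
def shrinkage_estimate(direct, synthetic, sampling_var, model_var):
    """Blend a direct survey estimate with a model-based (synthetic) one.

    gamma = model_var / (model_var + sampling_var) is the weight given to
    the direct estimate, so noisy direct estimates are pulled towards the
    model-based value.
    """
    gamma = model_var / (model_var + sampling_var)
    return gamma * direct + (1.0 - gamma) * synthetic, gamma

# Hypothetical proportions of households with access to a service.
estimate, gamma = shrinkage_estimate(
    direct=0.50, synthetic=0.70, sampling_var=0.03, model_var=0.01
)
```

With a small sample in an area, the sampling variance is large, gamma is small, and the estimate leans on the model; with a large sample the direct estimate dominates.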
|
554 |
Prediction of the transaction confirmation time in Ethereum Blockchain
Singh, Harsh Jot (08 1900)
Blockchain offers a decentralized, immutable and transparent system of records. It provides a peer-to-peer network of nodes with no centralised governing entity, which makes it ‘unhackable’ and therefore more secure than traditional paper-based or centralised systems of records such as banks. While the paper-based recording approach has certain advantages, it does not work well with digital relationships where the data is in constant flux.
Unlike traditional channels governed by centralized entities, blockchain offers its users a certain level of anonymity by letting them interact without disclosing their personal identities, and allows them to build trust without a third-party governing entity.
Due to these characteristics, more and more users around the globe are inclined to make digital transactions via blockchain rather than via rudimentary channels. There is therefore a pressing need to gain insight into how these transactions are processed by the blockchain and how much time it may take for a peer to confirm a transaction and add it to the blockchain network.
In this thesis, we aim to introduce a novel approach that uses machine learning to estimate the time (in block time or otherwise) it would take for the Ethereum blockchain to accept and confirm a transaction to a block. We explore two of the most fundamental machine learning approaches, classification and regression, in order to determine which of the two predicts confirmation time in the Ethereum blockchain more accurately. More specifically, we explore the Naïve Bayes, Random Forest and Multilayer Perceptron classifiers for the classification approach. Since most transactions in the network are confirmed well within the average confirmation time of two block confirmations, or 15 seconds, we also discuss ways to tackle the skewed dataset problem encountered with the classification approach. We also aim to compare the predictive accuracy of two machine learning regression models, the Random Forest Regressor and the Multilayer Perceptron, against previously proposed statistical regression models under a set evaluation criterion; the objective is to determine whether machine learning offers a more accurate predictive model than conventional statistical models.
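The skewed dataset problem mentioned above can be sketched as follows: confirmation times are binned into classes, and each class is weighted inversely to its frequency. The bucket thresholds and sample times are hypothetical, and the weighting formula is a common heuristic rather than the remedy the thesis necessarily adopted.

```python
from collections import Counter

def to_class(confirmation_seconds, thresholds=(15, 30, 60)):
    """Map a confirmation time to a class index (0 = fastest bucket)."""
    for index, limit in enumerate(thresholds):
        if confirmation_seconds <= limit:
            return index
    return len(thresholds)

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Hypothetical confirmation times in seconds; most fall in the fastest bucket.
times = [4, 9, 12, 14, 7, 11, 28, 95]
labels = [to_class(t) for t in times]
weights = balanced_class_weights(labels)
```

Rare slow-confirmation classes receive larger weights, so a classifier trained with them is penalised more for ignoring the minority classes.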
|
555 |
Near Real-time Detection of Masquerade attacks in Web applications : catching imposters using their browsing behavor
Panopoulos, Vasileios (January 2016)
This thesis details research on the machine learning techniques that are central to anomaly and masquerade attack detection. The main focus is on web applications because of their immense popularity and ubiquity. This popularity has led to an increase in attacks, making them the most targeted entry point for violating a system. Specifically, a group of attacks ranging from identity theft using social engineering to cross-site scripting aims at exploiting and masquerading as users. Masquerade attacks are even harder to detect because they resemble normal sessions, thus posing an additional burden. Concerning prevention, the diversity and complexity of those systems make it harder to define reliable protection mechanisms. Additionally, new and emerging attack patterns make manually configured and signature-based systems less effective, since they must be continuously updated with new rules and signatures. This leads to a situation where they eventually become obsolete if left unmanaged. Finally, the huge amount of traffic makes manual inspection of attacks and false alarms an impossible task. To tackle those issues, anomaly detection systems are proposed using powerful and proven machine learning algorithms. Within the context of anomaly detection and machine learning, this thesis first gives several basic definitions, such as user behavior, normality, and normal and anomalous behavior. Those definitions set the context that the proposed method targets and define the theoretical premises. To ease the transition into the implementation phase, the underlying methodology is also explained in detail. The implementation is then presented: starting from server logs, a method is described for pre-processing the data into a form suitable for classification.
This pre-processing phase was constructed from several statistical analyses and normalization methods (univariate selection, ANOVA) to clean and transform the given logs and perform feature selection. Furthermore, given that the proposed detection method is based on the source and request URLs, a method of aggregation is proposed to limit the user privacy and classifier over-fitting issues. Subsequently, two popular classification algorithms (Multinomial Naive Bayes and Support Vector Machines) are tested and compared to determine which one performs better in the given situations. Each of the implementation steps (pre-processing and classification) requires a number of different parameters to be set, and thus a method called hyper-parameter optimization is defined. This method searches for the parameters that improve the classification results. Moreover, the training and testing methodology is outlined alongside the experimental setup. Hyper-parameter optimization and training are the most computationally intensive steps, especially given a large number of samples/users. To overcome this obstacle, a scaling methodology is also defined and evaluated to demonstrate its ability to handle larger data sets. To complete this framework, several other options are also evaluated and compared to each other to challenge the method and implementation decisions. Examples are the "Transitions-vs-Pages" dilemma, the block restriction effect, the DR usefulness and the optimization of the classification parameters. Moreover, a survivability analysis is performed to demonstrate how the produced alarms could be correlated, affecting the resulting detection rates and interval times. The implementation of the proposed detection method and the outlined experimental setup lead to interesting results. In addition, the data set that was used to produce this evaluation is provided online to promote further investigation and research in this field.
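The abstract says that detection relies on source and request URLs and that these are aggregated to limit privacy and over-fitting issues, but it does not say how. One possible reading, offered purely as an assumption, is to truncate each URL to its leading path segment and count page-to-page transitions; the function names and log entries below are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

def aggregate_url(url, depth=1):
    """Keep only the first `depth` path segments, dropping host, query and IDs."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return "/" + "/".join(segments[:depth])

def transition_counts(log):
    """Count (source page -> requested page) transitions in one session."""
    return Counter(
        (aggregate_url(source), aggregate_url(request)) for source, request in log
    )

# Hypothetical (referrer, request) pairs taken from a server log.
session = [
    ("https://example.org/home", "https://example.org/profile/42?tab=info"),
    ("https://example.org/profile/42", "https://example.org/profile/42/edit"),
    ("https://example.org/home", "https://example.org/profile/77"),
]
features = transition_counts(session)
```

Counts of this kind could serve as the input features for a classifier such as Multinomial Naive Bayes, with user identifiers and query strings no longer visible in the feature names.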
|
556 |
Comparison of Automatic Classifiers’ Performances using Word-based Feature Extraction Techniques in an E-government setting
Marin Rodenas, Alfonso (January 2011)
Nowadays email is commonly used by citizens to communicate with their government. Among the emails received, governments deal with some common queries and subjects that handling officers have to answer manually. Automatic classification of incoming emails can increase communication efficiency by decreasing the delay between a query and its response. This thesis is part of the IMAIL project, which aims to provide an automatic answering solution to the Swedish Social Insurance Agency (SSIA) (“Försäkringskassan” in Swedish). The goal of this thesis is to analyze and compare the classification performance of different sets of features extracted from SSIA emails on different automatic classifiers. The features extracted from the emails also depend on the preprocessing that is carried out beforehand. Compound splitting, lemmatization, stop-word removal, part-of-speech tagging and n-grams are the processes applied to the data set. Classification is performed using Support Vector Machines, k-Nearest Neighbors and Naive Bayes. Precision, recall and F-measure are used to analyze and compare the results. In the results obtained in this thesis, SVM provides the best classification, with an F-measure of 0.787. However, Naive Bayes classifies most of the email categories better than SVM. Thus, it cannot be concluded whether SVM classifies better than Naive Bayes or not. Furthermore, a comparison to Dalianis et al. (2011) is made. The results obtained with this approach outperform those obtained before: SVM reached an F-measure of 0.858 when using PoS tagging on the original emails, an improvement of almost 0.03 (roughly 3 percentage points) over the 0.83 obtained in Dalianis et al. (2011). In this case, SVM was clearly better than Naive Bayes.
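For reference, the three evaluation measures named above can be computed per category as in the sketch below. The category names and predictions are hypothetical and are not taken from the SSIA data.

```python
def precision_recall_f1(true_labels, predicted_labels, positive):
    """Precision, recall and F-measure (F1) for one category."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical email categories.
true_labels = ["pension", "pension", "housing", "pension", "housing"]
predicted = ["pension", "housing", "housing", "pension", "pension"]
p, r, f = precision_recall_f1(true_labels, predicted, positive="pension")
```

Because the F-measure is the harmonic mean of precision and recall, a classifier can lead on the overall score while another does better on most individual categories, which is the situation the abstract describes for SVM and Naive Bayes.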
|
557 |
The past, present or future? : A comparative NLP study of Naive Bayes, LSTM and BERT for classifying Swedish sentences based on their tense
Navér, Norah (January 2021)
Natural language processing (NLP) is a field of computer science that is becoming increasingly important. One important part of NLP is the ability to sort text into the past, present or future, depending on when an event took place or will take place. The objective of this thesis was to use text classification to classify Swedish sentences by tense: past, present or future. A further objective was to compare how lemmatisation affects the performance of the models. The problem was tackled by implementing three machine learning models, Naive Bayes, LSTM and BERT, on both lemmatised and non-lemmatised data. The results showed that overall performance was negatively affected when the data was lemmatised. The best-performing model was BERT, with an accuracy of 96.3%. The result was useful, as the best-performing model had very high accuracy and performed well on newly constructed sentences.
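As a rough illustration of the simplest of the three models, the sketch below trains a multinomial Naive Bayes classifier with add-one smoothing on a handful of invented Swedish sentences. The training data, function names and tokenisation by whitespace are assumptions made for this example and do not reproduce the thesis's setup.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (tokens, label). Returns the counts used for prediction."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary

def predict(model, tokens):
    """Return the label with the highest smoothed log-probability."""
    class_counts, word_counts, vocabulary = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokens:
            if token in vocabulary:  # unseen words are ignored
                score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy sentences, one tense label each.
training = [
    ("jag åt frukost".split(), "past"),
    ("hon sprang hem".split(), "past"),
    ("jag äter frukost".split(), "present"),
    ("hon springer hem".split(), "present"),
    ("jag ska äta frukost".split(), "future"),
    ("hon ska springa hem".split(), "future"),
]
model = train_naive_bayes(training)
prediction = predict(model, "vi ska äta".split())
```

In this toy data the tense is carried by the verb form (åt, äter, ska äta). One plausible reading of the reported drop under lemmatisation, offered only as an assumption, is that mapping such forms to a single lemma removes the cue a bag-of-words model relies on.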
|
558 |
Maskininlärning: avvikelseklassificering på sekventiell sensordata. En jämförelse och utvärdering av algoritmer för att klassificera avvikelser i en miljövänlig IoT produkt med sekventiell sensordata
Heidfors, Filip; Moltedo, Elias (January 2019)
A company has developed an environmentally friendly IoT device with sequential sensor data and wants to use machine learning to classify anomalies in its data. Over the years, several well-functioning classification algorithms have been developed. However, no single algorithm is optimal for every problem. The purpose of this work was therefore to investigate, compare and evaluate different classifiers within supervised machine learning to find out which classifier gives the best accuracy when classifying anomalies in the kind of IoT device that the company has developed. Through a literature review, we first identified which classifiers are commonly used and have worked well in related work for similar purposes and applications. We decided to further compare and evaluate Random Forest, Naïve Bayes and Support Vector Machines. We created a dataset of 513 examples that we used for training and validation of each classifier. The results showed that Random Forest had considerably higher accuracy, at 95.7%, than Naïve Bayes (81.5%) and Support Vector Machines (78.6%). The conclusion of this work is that Random Forest, at 95.7%, gives a high enough accuracy for the company to use the machine learning model to improve its product. The results also indicate that Random Forest is the best classifier within supervised machine learning for this thesis's specific classification problem, but that even higher accuracy might be achievable with other techniques such as unsupervised or semi-supervised machine learning.
|
559 |
Vytvoření nových klasifikačních modulů v systému pro dolování z dat na platformě NetBeans / Creation of New Clasification Units in Data Mining System on NetBeans Platform
Kmoščák, Ondřej (January 2009)
This diploma thesis deals with data mining and the creation of a data mining unit for the data mining system being developed at FIT. The system is a client application consisting of a kernel, its graphical user interface and independent mining modules. The application relies on Oracle Data Mining. The data mining system is implemented in Java, and its graphical user interface is built on the NetBeans platform. The thesis introduces the topic of knowledge discovery and then presents the chosen Bayesian classification method, for which a stand-alone data mining module is subsequently implemented. Finally, the implementation of this module is described.
|
560 |
Using Natural Language Processing and Machine Learning for Analyzing Clinical Notes in Sickle Cell Disease Patients
Khizra, Shufa (January 2018)
No description available.
|