21 |
Analýza reálných dat produktové redakce Alza.cz pomocí metod DZD / Analysis of real data from Alza.cz product department using methods of KDD
Válek, Martin January 2014 (has links)
This thesis deals with data analysis using methods of knowledge discovery in databases (KDD). The goal is to select appropriate methods and tools for the implementation of a specific project based on real data from the Alza.cz product department. The data analysis is performed using association rules and decision rules in LISp-Miner and decision trees in RapidMiner, following the CRISP-DM methodology. The thesis is divided into three main sections. The first section is a theoretical summary of KDD: it defines the basic terms and describes the types of tasks and methods of KDD. The second section introduces the CRISP-DM methodology. The practical part first introduces the company Alza.cz and its goals for this task. Afterwards, the basic structure of the data and its preparation for the next step (data mining) are described. In conclusion, the results are evaluated and the possibilities for their use are outlined.
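The association and decision rules mined in such tasks are commonly assessed by quantifiers such as support, confidence and lift. As a minimal, hypothetical sketch (with an invented shopping-cart dataset, not the thesis's Alza.cz data), these can be computed in plain Python:

```python
def rule_metrics(transactions, antecedent, succedent):
    """Support, confidence and lift of the rule antecedent => succedent,
    computed over a list of transactions (sets of items)."""
    n = len(transactions)
    a = sum(1 for t in transactions if antecedent <= t and succedent <= t)
    ant = sum(1 for t in transactions if antecedent <= t)
    suc = sum(1 for t in transactions if succedent <= t)
    support = a / n
    confidence = a / ant if ant else 0.0
    lift = confidence / (suc / n) if suc else 0.0
    return support, confidence, lift

# Invented shopping-cart data, purely for illustration
carts = [{"phone", "case"}, {"phone", "case", "charger"},
         {"phone"}, {"case"}, {"charger"}]
s, c, l = rule_metrics(carts, {"phone"}, {"case"})
```

A lift above 1 indicates that the succedent occurs more often with the antecedent than it would by chance alone.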
|
22 |
Reálná úloha dobývání znalostí / The Real Knowledge Discovery Task
Kolafa, Ondřej January 2012 (has links)
The major objective of this thesis is to perform a real data mining task: classifying holders of term deposit accounts. For this task, anonymised data on bank customers with a low funds position are used. In correspondence with the CRISP-DM methodology, the work is guided through these steps: business understanding, data understanding, data preparation, modeling, evaluation and deployment. The RapidMiner application is used for modeling. The methods and procedures used in the actual task are described in the theoretical part, which introduces the basic concepts of data mining, with special attention to the CRM segment, as well as the CRISP-DM methodology and the techniques suitable for this task. Because of the difference in the proportions of term deposit holders and non-holders, the data set had to be rebalanced in favour of holders. At the final stage, twelve models are built, and according to the chosen criteria (area under the curve and F-measure) the two best models (logistic regression and Bayesian network) are selected. In the last stage of the data mining process, a possible real-world utilisation is mentioned. This part is developed only in the form of recommendations, because it cannot be applied to the real situation.
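One of the model-selection criteria above, the F-measure, combines precision and recall into a single score. A sketch of how it is computed from confusion-matrix counts (the counts below are invented, not the thesis's results):

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-measure from confusion-matrix counts; beta=1 gives the F1 score."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# e.g. 40 true positives, 10 false positives, 20 false negatives
f1 = f_measure(40, 10, 20)
```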
|
23 |
Automatizace dataminingového procesu v datech o dopravních nehodách v Londýně / Automation of a data mining process in the road accidents data from London by the LISp-Miner system
Soukup, Tomáš January 2015 (has links)
This thesis focuses on automated data mining and describes the steps involved in solving analytical questions with the LISp-Miner system over data containing road accident records. The analytical tasks were created primarily on the basis of domain knowledge from road accident statistics in Great Britain and from a previous analysis in my semester project. The aim of this thesis is to create an automated data mining process that analyses the input data by applying the 4ft-Miner, Ac4ft-Miner and SD4ft-Miner procedures and searches for new knowledge in every single year of the analysed period. The implementation language is LMCL, which makes the functionality of the LISp-Miner system available in an automated way. The created scripts can be reused to analyse another dataset with the same structure, or, after some manual changes to the initial parameters, quite different data.
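The LMCL scripts themselves are not reproduced here; as a language-neutral sketch of the per-year loop they automate (record fields invented for illustration, not the actual accident schema):

```python
from collections import defaultdict

def split_by_year(records):
    """Group accident records by year so that each yearly slice can be
    mined as a separate task run, mirroring the automated per-year loop."""
    by_year = defaultdict(list)
    for rec in records:
        by_year[rec["year"]].append(rec)
    return dict(by_year)

# Invented records; the real data would carry many more attributes
accidents = [{"year": 2013, "severity": "slight"},
             {"year": 2014, "severity": "serious"},
             {"year": 2013, "severity": "fatal"}]
slices = split_by_year(accidents)
```

Each slice would then be handed to the mining procedures in turn, and the discovered rules collected per year.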
|
24 |
Datamining - theory and its application / Datamining - teorie a praxe
Popelka, Aleš January 2012 (has links)
This thesis deals with the technology called data mining. First, it describes the term data mining as an independent discipline, and then its processing methods and most common uses. The term is further explained with the help of methodologies describing all parts of the process of knowledge discovery in databases: CRISP-DM and SEMMA. The study then presents data mining methods and particular algorithms: decision trees, neural networks and genetic algorithms. This theoretical introduction is followed by a practical application: searching for the causes of meningoencephalitis development in a certain sample of patients. Decision trees in the Clementine system, one of the top data mining tools, were used for the analysis.
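Decision-tree learners such as the one in Clementine typically choose splits by information gain, i.e. the reduction in class entropy a split achieves. A minimal sketch with invented labels (not the thesis's patient data):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy reduction achieved by splitting `labels` into `groups`."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

# A perfectly separating split removes all uncertainty (gain of 1 bit here)
parent = ["ill", "ill", "healthy", "healthy"]
gain = information_gain(parent, [["ill", "ill"], ["healthy", "healthy"]])
```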
|
25 |
An exploratory study of manufacturing data and its potential for continuous process improvements from a production economical perspective
Todorovac, Kennan, Wiking, Nils January 2021 (has links)
Background: Continuous improvement in production is essential in order to compete on the market. To be an active competitor, companies need to know their strengths and weaknesses and to improve and develop their production continually. Today, process industries generate enormous volumes of data, and data are considered a valuable source for companies to find new ways to boost the productivity and profitability of their operations. Data Mining (DM) is the process of discovering useful patterns and trends in large data sets. Several authors have pointed out data mining as a suitable analysis process for manufacturing because of the large amount of data generated and collected from production processes. In manufacturing, DM has two primary goals: descriptive, focused on discovering patterns that describe the data, and predictive, where a model is used to determine future values of important variables. Objectives: The objective of this study was to gain a deeper understanding of how data collected from production can lead to insights about potential production economic improvements by following the CRISP-DM methodology; in particular, whether, for the chosen production line, there were any differences in replenishment durations between different procedures. Duration in this study is the time the line is halted during a material replenishment. The procedures in question are single-replenishment versus double-replenishment. It was further investigated whether there were any differences in replenishment duration depending on which shift team performed the replenishment and during which shift time. Methods: In this study the CRISP-DM methodology was used for structuring the data collected from the case company. The data were primarily historical data from a continuous production process. To verify the objective of the study, three hypotheses derived from the objective were tested using t-tests with a Bonferroni correction.
Results: The results showed that the duration of a double-replenishment is lower than that of two single-replenishments. Further results showed a significant difference in single-replenishment duration between the different shift times and the different working teams. The interpretation is that, in the short term, implementing double-replenishments could reduce the throughput time and possibly also the lead time. Conclusions: This study contributes knowledge for others who seek a way to use data to uncover information or deeper knowledge about a continuous production process. The findings could be specifically interesting for cable manufacturers and, in general, for continuous process manufacturers. A further conclusion is that time-based competition is one way of increasing competitive advantage in the market. By using manufacturing-generated data, it is possible to analyse and find valuable information that can contribute to continuous process improvements and increase competitive advantage.
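The hypothesis testing described above can be sketched as follows; the replenishment durations are invented numbers, and Welch's t statistic is shown as one common variant of the t-test (the abstract does not specify which variant was used):

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)
    return (mean(sample_a) - mean(sample_b)) / sqrt(va / na + vb / nb)

def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level under a Bonferroni correction."""
    return alpha / n_tests

# Invented halt durations (minutes): two single-replenishments vs one double
two_singles = [12.0, 11.5, 13.0, 12.5]
one_double = [9.0, 9.5, 8.5, 10.0]
t_stat = welch_t(two_singles, one_double)
alpha_adj = bonferroni_alpha(0.05, 3)   # three hypotheses were tested
```

A large positive t statistic, compared against the adjusted significance level, would support the conclusion that the double procedure halts the line for less total time.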
|
26 |
Försäljningsprediktion : en jämförelse mellan regressionsmodeller / Sales prediction : a comparison between regression models
Fridh, Anton, Sandbecker, Erik January 2021 (has links)
Today there are many companies, in different industries and both large and small, that want to predict their sales. Among other things, they want to know how many products they should buy or manufacture, and which products should be prioritised for investment over others, in the short term and in the long term. In the past this has been done with intuition and statistics: most people know that ski jackets do not sell well in the summer, or that beach products do not sell well during the winter. This is a simple example, but what happens when the complexity increases and there are a large number of products and stores? With the help of machine learning, such a problem can be managed. A machine learning algorithm is applied to a time series, a set of data with ordered observations at different times over a certain period. In this study's case, it is the sales of different products sold in different stores, and sales are to be predicted on a monthly basis. The time series in question is a dataset from Kaggle.com called "Predict Future Sales". The algorithms used in this study to handle this time series problem are XGBoost, MLP and MLR, which in previous research have performed well on similar problems, with car sales, availability and demand for taxis, and bitcoin prices, among others, in focus. All algorithms performed well on the evaluation metrics used by those studies, and this study uses the same metrics.
The algorithms' performance is described according to the evaluation metrics R², MAE, RMSE and MSE. These measures are used in the results and discussion chapters to describe how well the algorithms perform. The main research question of the study is therefore: which of the algorithms MLP, XGBoost and MLR will perform best according to R², MAE, RMSE and MSE on the time series "Predict Future Sales"? The time series is treated with a well-known approach in the field called CRISP-DM, following the method's steps, which include data understanding, data preparation and modeling. This method ultimately leads to the results, where the output of the different models created through CRISP-DM is presented. In the end, MLP obtained the best results according to the measured values, followed by MLR and XGBoost: MLP achieved an RMSE of 0.863, MLR of 1.233 and XGBoost of 1.262.
|
27 |
An investigation of the relationship between online activity on Studi.se and academic grades of newly arrived immigrant students : An application of educational data mining
Menon, Akash, Islam, Nahida January 2017 (has links)
This study attempts to analyze the impact of an online educational resource on the academic performance of newly arrived immigrant students in grades six to nine in the Swedish school system. The study focuses on the web-based educational resource Studi.se, made by Komplementskolan AB. The aim of the study was to investigate the relationship between academic performance and the use of Studi.se; another purpose was to see what other factors can impact academic performance. The study used the data mining process Cross-Industry Standard Process for Data Mining (CRISP-DM) to understand and prepare the data and then create a regression model that is evaluated. The regression model tries to predict the dependent variable, grade, from the independent variables Studi.se activity, gender and years in Swedish schools. The data set includes the grades in mathematics, physics, chemistry, biology and religion of newly arrived students in Sweden from six municipalities that have access to Studi.se, together with metrics of the students' activity on Studi.se. The results show a negative correlation between grade and gender across all subjects; in this report, the negative correlation means that female students perform better than male students. Furthermore, there was a positive correlation between the number of years a student has been in the same school and their academic grade. The study could not establish a statistically significant relationship between activity on Studi.se and the students' academic grades. Additional explanatory independent variables are needed to make a predictive model, as well as an investigation of regression models other than multiple linear regression. In the sample, a majority of the students had little or no activity on Studi.se despite having free access to the resource through their municipality.
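The multiple linear regression fitted in the study generalises ordinary least squares to several predictors; as a simplified one-predictor sketch (the years/grade numbers are invented, not the study's data):

```python
def ols_fit(x, y):
    """Least-squares intercept and slope for y ≈ a + b*x. The study uses
    several predictors; this shows only the one-variable special case."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return a, b

# Invented example: years in Swedish schools vs an averaged grade
years = [1, 2, 3, 4]
grade = [2.0, 4.0, 6.0, 8.0]
a, b = ols_fit(years, grade)
```

The sign and magnitude of a fitted coefficient like `b` is what the study interprets as a positive or negative correlation with grade.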
|
28 |
Reálná úloha dobývání znalostí / Actual role of knowledge discovery in databases
Pešek, Jiří January 2012 (has links)
The thesis "Actual role of knowledge discovery in databases" is concerned with churn prediction in mobile telecommunications. The task is based on real data from a telecommunication company and covers all steps of the data mining process. In accordance with the CRISP-DM methodology, the work looks thoroughly at the following stages: business understanding, data understanding, data preparation, modeling, evaluation and deployment. As the system for knowledge discovery in databases, the tool IBM SPSS Modeler was selected. The introductory chapter of the theoretical part familiarises the reader with so-called churn management, which frames the given assignment, and defines the basic concepts related to data mining. Attention is also given to the basic types of knowledge discovery tasks and to the algorithms pertinent to the selected assignment (decision trees, regression, neural networks, Bayesian networks and SVM). The methodologies describing the phases of knowledge discovery in databases are covered in a separate chapter, wherein CRISP-DM is examined in greater detail, since it represents the foundation for the solution of the practical assignment. The conclusion of the theoretical part also surveys commercial and freely available systems for knowledge discovery in databases.
|
29 |
A Framework for How to Make Use of an Automatic Passenger Counting System
Fihn, John, Finndahl, Johan January 2011 (has links)
Most modern cities today face tremendous traffic congestion, a consequence of the increasing usage of private motor vehicles. Public transport plays a crucial role in reducing this traffic, but to be an attractive alternative to private motor vehicles it needs to provide services that suit the citizens' requirements for travelling. A system that can provide transit agencies with rapid feedback about the usage of their transport network is the Automatic Passenger Counting (APC) system, which registers the number of passengers boarding and alighting a vehicle. Knowledge about passengers' travel behaviour can be used by transit agencies to adapt and improve their services, but to gain this knowledge transit agencies need to know how to use an APC system. This thesis investigates how a transit agency can make use of an APC system. The research took place in Melbourne, where Yarra Trams, operator of the tram network, is now putting effort into utilising its APC system. A theoretical framework based on theories about knowledge discovery from data, system development, and human-computer interaction is built, tested, and evaluated in a case study at Yarra Trams. The case study resulted in a software system that can process and model Yarra Trams' APC data. The result of the research is a proposed framework consisting of different steps and events that can be used as a guide for a transit agency that wants to make use of an APC system.
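An APC system's raw output is per-stop boarding and alighting counts; the on-board load at each stop follows by accumulation. A minimal sketch with invented counts for a five-stop trip:

```python
from itertools import accumulate

def onboard_load(boardings, alightings):
    """Passengers on board after each stop, from per-stop APC counts."""
    deltas = [b - a for b, a in zip(boardings, alightings)]
    return list(accumulate(deltas))

# Invented counts; a trip should end with an empty vehicle
load = onboard_load([10, 4, 6, 0, 0], [0, 2, 3, 8, 7])
```

Modelling quantities like this per stop and per trip is the kind of processing the case-study software performs on the APC data.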
|
30 |
Automatizace dataminingového procesu v datech o dopravních nehodách v České republice / Automation of a data mining process in the data about traffic accidents in the Czech Republic
Podavka, Jan January 2017 (has links)
This master thesis deals with automating the data mining process in the LISp-Miner program. The aim is to create an automated process that answers analytical questions over data about traffic accidents in the Czech Republic, using the LMCL scripting language and the LM Exec module. The theoretical part describes the process of knowledge discovery in databases and its most widely used methodologies, along with the topics relevant to working with LISp-Miner. The practical part focuses on a description of traffic accidents in the Czech Republic, a description of the data used, the creation and evaluation of analytical questions and, above all, a description of the created scripts. The output of the thesis is a group of scripts and a manual describing how to use them again, so that they can be reused for analyses of up-to-date data on traffic accidents, not only from the Czech Republic, provided the data have the same structure.
|