Global ETD Search

21	Effects of Sampling Sufficiency and Model Selection on Predicting the Occurrence of Stream Fish Species at Large Spatial Extents Krueger, Kirk L. 17 February 2009 (has links) Knowledge of species occurrence is a prerequisite for efficient and effective conservation and management. Unfortunately, knowledge of species occurrence is usually insufficient, so models that use environmental predictors and species occurrence records are used to predict species occurrence. Predicting the occurrence of stream fishes is often difficult because sampling data insufficiently describe species occurrence and important environmental conditions and predictive models insufficiently describe relations between species and environmental conditions. This dissertation 1) examines the sufficiency of fish species occurrence records at four spatial extents in Virginia, 2) compares modeling methods for predicting stream fish occurrence, and 3) assesses relations between species traits and model prediction characteristics. The sufficiency of sampling is infrequently addressed at the large spatial extents at which many management and conservation actions take place. In the first chapter of this dissertation I examine factors that determine the sufficiency of sampling to describe stream fish species richness at four spatial extents across Virginia using sampling simulations. Few regions of Virginia are sufficiently sampled, portending difficulty in accurately predicting fish species occurrence in most regions. The sufficient number of samples is often large and varies among regions and spatial scales, but it can be substantially reduced by reducing errors of sampling omission and increasing the spatial coverage of samples. Many methods are used to predict species occurrence. 
In the second chapter of this dissertation I compare the accuracy of the predictions of occurrence of seven species in each of three regions using linear discriminant function, generalized linear, classification tree, and artificial neural network statistical models. I also assess the efficacy of stream classification methods for predicting species occurrence. No modeling method proved distinctly superior. Species occurrence data and predictor data quality and quantity limited the success of predictions of stream fish occurrence for all methods. How predictive models are built and applied may be more important than the statistical method used. The accuracy, generality (transferability), and resolution of predictions of species occurrence vary among species. The ability to anticipate and understand variation in prediction characteristics among species can facilitate the proper application of predictions of species occurrence. In the third chapter of this dissertation I describe some conservation implications of relations between predicted occurrence characteristics and species traits for fishes in the upper Tennessee River drainage. Usually weak relations and variation in the strength and direction of relations among families precludes the accurate prediction of predicted occurrence characteristics. Most predictions of species occurrence have insufficient accuracy and resolution to guide conservation decisions at fine spatial grains. Comparison of my results with alternative model predictions and the results of many models described in peer-reviewed journals suggests that this is a common problem. Predictions of species occurrence should be rigorously assessed and cautiously applied to conservation problems. Collectively, the three chapters of this dissertation demonstrate some important limitations of models that are used to predict species occurrence. Model predictions of species occurrence are often used in lieu of sufficient species occurrence data. 
However, regardless of the method used to predict species occurrence most predictions have relatively low accuracy, generality and resolution. Model predictions of species occurrence can facilitate management and conservation, but they should be rigorously assessed and applied cautiously. / Ph. D. spatial extent classification tree artificial neural network predicting occurrence multiple logistic regression stream fish resolution sampling sufficiency
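The sampling simulations mentioned in entry 21 can be illustrated with a species accumulation curve: shuffle the order of samples many times and track how cumulative richness grows. This is only a minimal sketch under stated assumptions, not the dissertation's procedure; the species names, the sample sets, and the function name `accumulation_curve` are hypothetical.

```python
import random

def accumulation_curve(samples, n_permutations=200, seed=0):
    """Mean cumulative species richness after 1..n samples, averaged over
    random orderings of the samples (a permutation-based accumulation curve)."""
    rng = random.Random(seed)
    n = len(samples)
    totals = [0.0] * n
    for _ in range(n_permutations):
        order = samples[:]          # copy so the caller's list is untouched
        rng.shuffle(order)
        seen = set()
        for k, sample in enumerate(order):
            seen |= sample          # add the species found in this sample
            totals[k] += len(seen)
    return [t / n_permutations for t in totals]

# Hypothetical occurrence records: each set is the species seen in one sample.
samples = [
    {"darter", "dace"},
    {"dace"},
    {"darter", "chub", "shiner"},
    {"dace", "sculpin"},
    {"chub"},
    {"dace", "darter"},
]
curve = accumulation_curve(samples)
total_richness = len(set().union(*samples))
```

A curve that is still climbing at the last sample suggests that the region is not yet sufficiently sampled; a curve that has flattened near `total_richness` suggests that additional samples add few new species.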
22	Influencing Elections with Statistics: Targeting Voters with Logistic Regression Trees Rusch, Thomas, Lee, Ilro, Hornik, Kurt, Jank, Wolfgang, Zeileis, Achim 03 1900 (has links) (PDF) Political campaigning has become a multi-million dollar business. A substantial proportion of a campaign's budget is spent on voter mobilization, i.e., on identifying and influencing as many people as possible to vote. Based on data, campaigns use statistical tools to provide a basis for deciding who to target. While the data available is usually rich, campaigns have traditionally relied on a rather limited selection of information, often including only previous voting behavior and one or two demographical variables. Statistical procedures that are currently in use include logistic regression or standard classification tree methods like CHAID, but there is a growing interest in employing modern data mining approaches. Along the lines of this development, we propose a modern framework for voter targeting called LORET (for logistic regression trees) that employs trees (with possibly just a single root node) containing logistic regressions (with possibly just an intercept) in every leaf. Thus, they contain logistic regression and classification trees as special cases and allow for a synthesis of both techniques under one umbrella. We explore various flavors of LORET models that (a) compare the effect of using the full set of available variables against using only limited information and (b) investigate their varying effects either as regressors in the logistic model components or as partitioning variables in the tree components. To assess model performance and illustrate targeting, we apply LORET to a data set of 19,634 eligible voters from the 2004 US presidential election. 
We find that augmenting the standard set of variables (such as age and voting history) with additional predictor variables (such as the household composition in terms of party affiliation and each individual's rank in the household) clearly improves predictive accuracy. We also find that LORET models based on tree induction outperform the unpartitioned competitors. Additionally, LORET models using both partitioning variables and regressors in the resulting nodes can improve the efficiency of allocating campaign resources while still providing intelligible models. / Series: Research Report Series / Department of Statistics and Mathematics
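The LORET idea in entry 22 (a tree whose leaves each hold a logistic regression) can be sketched with a single split and one regressor per leaf. This is a toy illustration under stated assumptions, not the authors' implementation: the split is fixed in advance rather than induced, the fit uses plain gradient ascent, and the field names `voted_last` and `age_c` (age centered and scaled to decades) and all data values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, steps=2000, lr=0.1):
    """Fit p(y=1 | x) = sigmoid(b0 + b1 * x) by gradient ascent
    on the mean log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

def fit_stump(rows, split_key, x_key, y_key):
    """One leaf per value of the partitioning variable,
    each leaf holding its own logistic model."""
    leaves = {}
    for value in {r[split_key] for r in rows}:
        subset = [r for r in rows if r[split_key] == value]
        leaves[value] = fit_logistic([r[x_key] for r in subset],
                                     [r[y_key] for r in subset])
    return leaves

def predict(leaves, row, split_key, x_key):
    b0, b1 = leaves[row[split_key]]
    return sigmoid(b0 + b1 * row[x_key])

# Hypothetical voters: partition on past turnout, regress on centered age.
ages = [-2, -1, 0, 1, 2, -2, -1, 0, 1, 2]
rows = (
    [{"voted_last": 1, "age_c": a, "voted": v}
     for a, v in zip(ages, [0, 0, 1, 1, 1, 1, 0, 0, 1, 1])]
    + [{"voted_last": 0, "age_c": a, "voted": v}
       for a, v in zip(ages, [1, 0, 0, 0, 0, 0, 1, 0, 0, 0])]
)
leaves = fit_stump(rows, "voted_last", "age_c", "voted")
p_old_voter = predict(leaves, {"voted_last": 1, "age_c": 2}, "voted_last", "age_c")
p_young_voter = predict(leaves, {"voted_last": 1, "age_c": -2}, "voted_last", "age_c")
p_old_nonvoter = predict(leaves, {"voted_last": 0, "age_c": 2}, "voted_last", "age_c")
```

With only a root node this reduces to ordinary logistic regression, and with intercept-only leaves it reduces to a classification tree, which matches the special cases named in the abstract.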
23	Plant species rarity and data restriction influence the prediction success of species distribution models Mugodo, James, n/a January 2002 (has links) There is a growing need for accurate distribution data for both common and rare plant species for conservation planning and ecological research purposes. A database of more than 500 observations for nine tree species with different ecological and geographical distributions and a range of frequencies of occurrence in south-eastern New South Wales (Australia) was used to compare the predictive performance of logistic regression models, generalised additive models (GAMs) and classification tree models (CTMs) using different data restriction regimes and several model-building strategies. Environmental variables (mean annual rainfall, mean summer rainfall, mean winter rainfall, mean annual temperature, mean maximum summer temperature, mean minimum winter temperature, mean daily radiation, mean daily summer radiation, mean daily June radiation, lithology and topography) were used to model the distribution of each of the plant species in the study area. Model predictive performance was measured as the area under the curve of a receiver operating characteristic (ROC) plot. The initial predictive performance of logistic regression models and generalised additive models (GAMs) using unrestricted, temperature restricted, major gradient restricted and climatic domain restricted data gave results that were contrary to current practice in species distribution modelling. Although climatic domain restriction has been used in other studies, it was found to produce models that had the lowest predictive performance. The performance of domain restricted models was significantly (p = 0.007) inferior to the performance of major gradient restricted models when the predictions of the models were confined to the climatic domain of the species. 
Furthermore, the effect of data restriction on model predictive performance was found to depend on the species, as shown by a significant interaction between species and data restriction treatment (p = 0.013). As found in other studies, however, the predictive performance of GAM was significantly (p = 0.003) better than that of logistic regression. The superiority of GAM over logistic regression was unaffected by different data restriction regimes and was not significantly different within species. The logistic regression models used in the initial performance comparisons were based on models developed using the forward selection procedure in a rigorous-fitting model-building framework that was designed to produce parsimonious models. The rigorous-fitting model-building framework involved testing for the significant reduction in model deviance (p = 0.05) and significance of the parameter estimates (p = 0.05). The size of the parameter estimates and their standard errors were inspected because large estimates and/or standard errors are an indication of model degradation from overfitting or effects such as multicollinearity. For additional variables to be included in a model, they had to contribute significantly (p = 0.025) to the model predictive performance. In an attempt to improve the performance of species distribution models using logistic regression models in a rigorous-fitting model-building framework, the backward elimination procedure was employed for model selection, but it yielded models with reduced performance. A liberal-fitting model-building framework that used significant model deviance reduction at p = 0.05 (low significance models) and 0.00001 (high significance models) levels as the major criterion for variable selection was employed for the development of logistic regression models using the forward selection and backward elimination procedures. 
Liberal fitting yielded models that had a significantly greater predictive performance than the rigorous-fitting logistic regression models (p = 0.0006). The predictive performance of the former models was comparable to that of GAM and classification tree models (CTMs). The low significance liberal-fitting models had a much larger number of variables than the high significance liberal-fitting models, but with no significant increase in predictive performance. To develop liberal-fitting CTMs, the tree shrinking program in S-PLUS was used to produce a number of trees of different sizes (subtrees) by optimally reducing the size of a full CTM for a given species. The 10-fold cross-validated model deviance for the subtrees was plotted against the size of the subtree as a means of selecting an appropriate tree size. In contrast to liberal-fitting logistic regression, liberal-fitting CTMs had poor predictive performance. Species geographical range and species prevalence within the study area were used to categorise the tree species into different distributional forms. These were then used to compare the effect of plant species rarity on the predictive performance of logistic regression models, GAMs and CTMs. The distributional forms included restricted and rare (RR) species (Eucalyptus paliformis and Eucalyptus kybeanensis), restricted and common (RC) species (Eucalyptus delegatensis, Eucryphia moorei and Eucalyptus fraxinoides), widespread and rare (WR) species (Eucalyptus elata) and widespread and common (WC) species (Eucalyptus sieberi, Eucalyptus pauciflora and Eucalyptus fastigata). There were significant differences (p = 0.076) in predictive performance among the distributional forms for the logistic regression and GAM. The predictive performance for the WR distributional form was significantly lower than the performance for the other plant species distributional forms. 
The predictive performance for the RC and RR distributional forms was significantly greater than the performance for the WC distributional form. The trend in model predictive performance among plant species distributional forms was similar for CTMs except that the CTMs had poor predictive performance for the RR distributional form. This study shows the importance of data restriction to model predictive performance with major gradient data restriction being recommended for consistently high performance. Given the appropriate model selection strategy, logistic regression, GAM and CTM have similar predictive performance. Logistic regression requires a high significance liberal-fitting strategy to both maximise its predictive performance and to select a relatively small model that could be useful for framing future ecological hypotheses about the distribution of individual plant species. The results for the modelling of plant species for conservation purposes were encouraging since logistic regression and GAM performed well for the restricted and rare species, which are usually of greater conservation concern. accurate distribution data restriction of data prediction models plants generalized additive models classification tree models GAMs CTMs RR restricted and rare restricted and common widespread and common WC WR
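Entry 23 measures predictive performance as the area under the ROC curve. One way to compute that quantity is the rank-based (Mann-Whitney) form: the fraction of presence/absence pairs in which the presence site receives the higher predicted score, with ties counted as half. The sketch below is an illustration only; the labels and scores are made up, and the function name `roc_auc` is hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive case is scored above a randomly chosen negative case."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical presence (1) / absence (0) records and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
auc = roc_auc(labels, scores)   # 8 of 9 presence/absence pairs ranked correctly
```

An AUC of 0.5 corresponds to a model that ranks sites no better than chance, and 1.0 to a model that ranks every presence above every absence.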
24	ISTQB : Black Box testing Strategies used in Financial Industry for Functional testing Saeed, Umar, Amjad, Ansur Mahmood January 2009 Black box testing techniques test the functionality of a system without knowledge of its inner details, helping to ensure that the system behaves correctly, consistently, completely and accurately. Black box testing strategies are used to test logical, data or behavioral dependencies, to generate test data, and to improve the quality of test cases so that they are more likely to reveal defects. They play a pivotal role in detecting possible defects in a system and can help it be completed successfully according to its required functionality. This thesis presents studies of five companies regarding important black box testing strategies. It explores the black box testing techniques that are present in the literature and also practiced in industry. Interview studies were conducted in companies in Pakistan that provide solutions to the finance industry, in an attempt to determine how these techniques are used. The advantages and disadvantages of the identified black box testing strategies are discussed, and the techniques are compared with respect to defect guessing, dependencies, sophistication, effort, and cost. BBTS (black box testing strategies) BVA (boundary value analysis) EP (Equivalence Partitioning) DT (Decision Table) CT (Classification Tree) SD (State Diagram) UC (Use Case) Software Engineering
25	A Classification Tool for Predictive Data Analysis in Healthcare Victors, Mason Lemoyne 07 March 2013 (has links) (PDF) Hidden Markov Models (HMMs) have seen widespread use in a variety of applications ranging from speech recognition to gene prediction. While developed over forty years ago, they remain a standard tool for sequential data analysis. More recently, Latent Dirichlet Allocation (LDA) was developed and soon gained widespread popularity as a powerful topic analysis tool for text corpora. We thoroughly develop LDA and a generalization of HMMs and demonstrate the conjunctive use of both methods in predictive data analysis for health care problems. While these two tools (LDA and HMM) have been used in conjunction previously, we use LDA in a new way to reduce the dimensionality involved in the training of HMMs. With both LDA and our extension of HMM, we train classifiers to predict development of Chronic Kidney Disease (CKD) in the near future. predictive data analysis Hidden Markov Models Latent Dirichlet Allocation health care convex analysis Markov chains Expectation Maximization Gibbs sampling classification tree random forest Mathematics
26	應用遺傳規劃法於知識管理流程之知識擷取和整合機制 / GP-Based Knowledge Acquisition and Integration Mechanisms in Knowledge Management Processes 郭展盛, Kuo, Chan Sheng Unknown Date In today’s business environment, many enterprises make efforts to manage and apply organizational knowledge in order to sustain their core competence and create competitive advantage. The effective management of organizational knowledge can reduce the time and cost of solving problems, improve organizational performance, and increase organizational learning as well as innovative competence. Moreover, because of the need to accumulate knowledge resources, many enterprises have devoted themselves to developing knowledge repositories. These repositories store organizational and individual knowledge that is used to increase overall organizational efficiency, support daily operations, and implement business strategies. Knowledge management (KM) is the modern paradigm for effective management of organizational knowledge to improve organizational performance. 
The intent of KM is to emphasize knowledge flows and the main processes of acquisition, integration, storage/categorization, dissemination, and application. Furthermore, existing information technologies can make knowledge management more effective. One of the important issues in knowledge management is knowledge acquisition. Contributing knowledge to repositories is an extra burden for knowledge workers, so an effective method that generates organizational knowledge automatically, and thereby reduces that burden, will play a critical role in knowledge management. A second important issue in knowledge management is the integration of knowledge from different sources, which may conflict. A knowledge-integration method that combines multiple knowledge sources will gradually become a necessity for enterprises. Classification problems, such as financial prediction and disease diagnosis, are encountered in many applications. In the past, classification rules were often generated by decision-tree methods and used to solve classification problems. In this dissertation, we propose two GP-based knowledge-acquisition methods and two GP-based knowledge-integration methods to support knowledge acquisition and knowledge integration, respectively, in the knowledge management processes for classification tasks. Of the two proposed knowledge-acquisition methods, the first finds the desired classification tree quickly and easily, but it may generate a complicated tree. The second modifies the first and produces a more concise and more applicable classification tree. The classification tree obtained can be converted into a rule set, which can then be fed into a knowledge base to support decision making and facilitate daily operations. 
Furthermore, of the two proposed knowledge-integration methods, the first can automatically combine multiple knowledge sources into one integrated knowledge base, but it considers integration at a single time point only. The second extends the first to handle integration across different time points. Additionally, three new genetic operators are designed for the evolving process to remove redundancy, subsumption and contradiction, thus producing more concise and consistent classification rules than are produced without them. Finally, the proposed methods are applied to credit card data and breast cancer data to evaluate their effectiveness. They are also compared with several well-known classification methods. The experimental results show the good performance and feasibility of the proposed approaches. knowledge acquisition knowledge integration genetic programming classification tree classification problem knowledge management
27	Análise de dados sequenciais heterogêneos baseada em árvore de decisão e modelos de Markov : aplicação na logística de transporte / Analysis of heterogeneous sequential data based on decision trees and Markov models: application to transport logistics Ataky, Steve Tsham Mpinda 16 October 2015 Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) / In recent years, data mining techniques have been developed in many application domains to analyze large volumes of data, which may be simple and/or complex. Transport logistics, and the railway sector in particular, is one such area: the available data are numerous and varied in nature (classic variables such as top speed or type of train, symbolic variables such as the set of tracks traveled by a train, etc.). This dissertation addresses the problem of classifying and predicting heterogeneous data through two main approaches. First, an automatic classification approach based on the classification tree technique was implemented, which also allows new data to be efficiently integrated into the initial partitions. The second contribution concerns the analysis of sequential data: the classification method is combined with Markov models to partition temporal sequences into homogeneous and meaningful groups based on probabilities. The resulting model offers a good interpretation of the classes built and makes it possible to estimate the evolution of the sequences of a particular vehicle. Both approaches were then applied to real data from a Brazilian railway information system, with the aim of supporting the strategic management of planning and of predictions that adhere to the plan. The work first provides a finer planning typology to solve the problems associated with the existing classification into groups of homogeneous circulations. Second, it defines a typology of train paths (successions of circulations of the same train) in order to predict the statistical characteristics of the most likely next circulation of a train running the same route. The general methodology provides a decision-support environment for monitoring and controlling the planning organization. To that end, a formula with two variants was proposed to calculate the degree of adherence between the path actually carried out, or in progress, and the planned one. Data mining Data analysis Automatic classification Decision tree Markov processes Logistics - transport Sequence data analysis Heterogeneous data Train planning Adherence Replanning Planning Forecasting Classification tree
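The Markov-model step described in entry 27 (estimating how a vehicle's sequence evolves and predicting its most likely next movement) can be sketched with a first-order Markov chain fitted by counting transitions. This is an assumption-laden illustration, not the dissertation's model: the state names, the example sequences, and the function names are hypothetical, and the abstract does not say which order of Markov model was used.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequences):
    """Estimate first-order transition probabilities P(next | current)
    by counting consecutive pairs across all sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: c / sum(following.values()) for nxt, c in following.items()}
        for state, following in counts.items()
    }

def most_likely_next(probs, state):
    """Most probable next state, or None if the state was never left."""
    following = probs.get(state)
    if not following:
        return None
    return max(following, key=following.get)

# Hypothetical circulation sequences for one group of trains.
sequences = [
    ["on_time", "delayed", "on_time", "rerouted"],
    ["on_time", "delayed", "delayed", "on_time"],
    ["delayed", "on_time", "delayed"],
]
probs = transition_probabilities(sequences)
next_after_on_time = most_likely_next(probs, "on_time")
```

Fitting one such transition table per class from the classification tree would give a separate, interpretable model for each homogeneous group of sequences.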
28	Využití vybraných metod strojového učení pro modelování kreditního rizika / Machine Learning Methods for Credit Risk Modelling Drábek, Matěj January 2017 This master's thesis is divided into three parts. The first part describes P2P lending, its characteristics, basic concepts and practical implications, and compares the P2P markets in the Czech Republic, the UK and the USA. The second part covers the theoretical basis of the chosen machine learning methods, namely the naive Bayes classifier, classification tree, random forest and logistic regression, and describes methods for evaluating the quality of these classification models. The third part is practical and shows the complete workflow of creating a classification model, from data preparation to model evaluation.
29	Machine Vision Assisted In Situ Ichthyoplankton Imaging System Iyer, Neeraj 12 July 2013 Indiana University-Purdue University Indianapolis (IUPUI) / Recently there has been considerable effort to develop systems for sampling and automatically classifying plankton from the oceans. Existing methods assume the specimens have already been precisely segmented, or aim at analyzing images containing a single specimen (extracting its features and/or recognizing specimens as single in-focus targets in small images). The resolution of the existing systems is limiting. Our goal is to develop automated, very high resolution image sensing of critically important, yet under-sampled, components of the planktonic community by addressing both the physical sensing system (e.g. camera, lighting, depth of field) and the crucial image extraction and recognition routines. The objective of this thesis is to develop a framework that aims to (i) detect and segment all organisms of interest automatically, directly from the raw data, while filtering out noise and out-of-focus instances, (ii) extract the best features from the images and (iii) identify and classify the plankton species. Our approach focusses on utilizing the full computational power of a multicore system by implementing a parallel programming approach that can process large volumes of high resolution plankton images obtained from our newly designed imaging system, the In Situ Ichthyoplankton Imaging System (ISIIS). We compare some of the widely used segmentation methods, with emphasis on accuracy and speed, to find the one that works best on our data. We design a robust, scalable, fully automated system for high-throughput processing of the ISIIS imagery. 
Plankton Segmentation Recognition Classification Tree Plankton Computer vision -- Methodology Classification Parallel programming (Computer science) Image processing -- Digital techniques Document imaging systems -- Research Image analysis Pattern recognition systems Computer algorithms
30	Data mining and predictive analytics application on cellular networks to monitor and optimize quality of service and customer experience Muwawa, Jean Nestor Dahj 11 1900 This research study focuses on application models of Data Mining and Machine Learning covering cellular network traffic, with the objective of giving Mobile Network Operators a full view of performance branches (Services, Device, Subscribers). The purpose is to optimize and minimize the time needed to detect service and subscriber behaviour patterns. Different data mining techniques and predictive algorithms have been applied to real cellular network datasets to uncover different data usage patterns using specific Key Performance Indicators (KPIs) and Key Quality Indicators (KQIs). The following tools are used to develop the concept: RStudio for Machine Learning and process visualization, Apache Spark and SparkSQL for data and big data processing, and clicData for service visualization. Two use cases have been studied during this research. In the first, data and predictive analytics are applied in the field of Telecommunications to address users’ experience efficiently, with the goal of increasing customer loyalty and decreasing churn, or customer attrition. Using real cellular network transactions, predictive analytics are used to identify customers who are likely to churn, which can result in revenue loss. Prediction algorithms and models including Classification Tree, Random Forest, Neural Networks and Gradient Boosting have been used together with an exploratory data analysis that determines the relationships between predictor variables. The data is split into two parts, a training set to train the model and a testing set to test the model. The best performing model is selected on the basis of prediction accuracy, sensitivity, specificity and the confusion matrix on the test set. 
The second use case analyses Service Quality Management (SQM) using modern data mining techniques and the advantages of in-memory big data processing with Apache Spark and SparkSQL to save cost on tool investment; thus, a low-cost Service Quality Management model is proposed and analyzed. With the increase in smartphone adoption and access to mobile internet services, applications such as streaming and interactive chat require a certain service level to ensure customer satisfaction. As a result, an SQM framework is developed with a Service Quality Index (SQI) and a Key Performance Index (KPI). The research concludes with recommendations and future studies around modern technology applications in Telecommunications, including Internet of Things (IoT), Cloud and recommender systems. / Cellular networks have evolved and are still evolving, from traditional circuit-switched GSM (Global System for Mobile Communication), which supported only voice services and extremely low data rates, to all-packet LTE networks accommodating high speed data for service applications such as video streaming, video conferencing and heavy torrent downloads, and, in the near future, to the roll-out of fifth generation (5G) cellular networks, which are intended to support complex technologies such as IoT (Internet of Things) and High Definition video streaming and are projected to carry massive amounts of data. With high demand for network services and easy access to mobile phones, billions of transactions are performed by subscribers. The transactions appear in the form of SMSs, handovers, voice calls, web browsing activities, video and audio streaming, and heavy downloads and uploads. Nevertheless, the rapid growth in data traffic and the high requirements of new services pose bigger challenges to Mobile Network Operators (MNOs) in analysing the big data traffic flowing in the network. Therefore, Quality of Service (QoS) and Quality of Experience (QoE) become a challenge. 
Inefficiency in mining and analysing data and in applying predictive intelligence to network traffic can produce a high rate of unhappy customers or subscribers, loss of revenue and a negative perception of services. Researchers and service providers are investing in Data Mining, Machine Learning and AI (Artificial Intelligence) methods to manage services and experience. / Electrical and Mining Engineering / M. Tech (Electrical Engineering) Data Mining Predictive Analytics Big Data Quality of Service (QoS) Customer Experience Business Intelligence (BI) Network Churn Key Quality Index (KQI) Key Performance Index (KPI) Service Quality Management (SQM) Neural Network (NN) Deep Learning (DL) Random Forest (RF) Classification Tree Regression In-memory Data processing Data Science 006.312 Data mining Machine learning Business intelligence Packet switching (Data transmission) Quality of service (Computer networks) Artificial intelligence
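Entry 30 evaluates churn models by accuracy, sensitivity, specificity and the confusion matrix on a held-out test set. The sketch below shows how those figures relate to the four confusion-matrix cells. It is an illustration only: the labels are made up, the function names are hypothetical, and it assumes both classes are present in the test set (otherwise a ratio would divide by zero).

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, tn, fn) for binary labels where 1 means 'churned'."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 1:
            fp += 1
        elif a == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def summarize(actual, predicted):
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    return {
        "accuracy": (tp + tn) / len(actual),   # share of all cases predicted correctly
        "sensitivity": tp / (tp + fn),         # share of churners that were caught
        "specificity": tn / (tn + fp),         # share of non-churners left alone
    }

# Hypothetical test-set labels and model predictions.
actual    = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
report = summarize(actual, predicted)
```

Because churners are usually a minority, a model can reach high accuracy while missing most of them, which is one reason to report sensitivity and specificity alongside accuracy.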
