Global ETD Search

11	The role of classifiers in feature selection : number vs nature Chrysostomou, Kyriacos January 2008 Wrapper feature selection approaches are widely used to select a small subset of relevant features from a dataset. However, Wrappers suffer from the fact that they only use a single classifier when selecting the features. The problem of using a single classifier is that each classifier is of a different nature and will have its own biases. This means that each classifier will select different feature subsets. To address this problem, this thesis aims to investigate the effects of using different classifiers for Wrapper feature selection. More specifically, it aims to investigate the effects of using different numbers of classifiers and classifiers of different nature. This aim is achieved by proposing a new data mining method called Wrapper-based Decision Trees (WDT). The WDT method has the ability to combine multiple classifiers from four different families, including Bayesian Network, Decision Tree, Nearest Neighbour and Support Vector Machine, to select relevant features and visualise the relationships among the selected features using decision trees. Specifically, the WDT method is applied to investigate three research questions of this thesis: (1) the effects of number of classifiers on feature selection results; (2) the effects of nature of classifiers on feature selection results; and (3) which of the two (i.e., number or nature of classifiers) has more of an effect on feature selection results. Two types of user preference datasets derived from Human-Computer Interaction (HCI) are used with WDT to assist in answering these three research questions. The results from the investigation revealed that the number of classifiers and nature of classifiers greatly affect feature selection results.
In terms of number of classifiers, the results showed that few classifiers selected many relevant features whereas many classifiers selected few relevant features. In addition, it was found that using three classifiers resulted in highly accurate feature subsets. In terms of nature of classifiers, it was shown that Decision Tree, Bayesian Network and Nearest Neighbour classifiers caused significant differences in both the number of features selected and the accuracy levels of the features. A comparison of results regarding number of classifiers and nature of classifiers revealed that the former has more of an effect on feature selection than the latter. The thesis makes contributions to three communities: data mining, feature selection, and HCI. For the data mining community, this thesis proposes a new method called WDT which integrates the use of multiple classifiers for feature selection and decision trees to effectively select and visualise the most relevant features within a dataset. For the feature selection community, the results of this thesis have shown that the number of classifiers and nature of classifiers can truly affect the feature selection process. The results and suggestions based on the results can provide useful insight about classifiers when performing feature selection. For the HCI community, this thesis has shown the usefulness of feature selection for identifying a small number of highly relevant features for determining the preferences of different users. 005.3
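As an illustration of the idea behind entry 11, a wrapper can be run once per classifier and the resulting feature subsets combined by vote. The following is a minimal Python sketch under stated assumptions, not the WDT method itself: the two toy classifiers (1-nearest-neighbour and nearest-centroid), the greedy forward search, the leave-one-out scoring, the `min_votes` rule and the toy data are all hypothetical choices made for this example.

```python
from collections import Counter


def one_nn(train_X, train_y, x):
    """Predict with the single nearest training point (squared Euclidean)."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_X]
    return train_y[dists.index(min(dists))]


def nearest_centroid(train_X, train_y, x):
    """Predict the class whose mean vector is closest to x."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, t in zip(train_X, train_y) if t == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(centroid, x))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_accuracy(classifier, X, y, features):
    """Leave-one-out accuracy using only the given feature indices."""
    hits = 0
    for i in range(len(X)):
        train_X = [[r[f] for f in features] for j, r in enumerate(X) if j != i]
        train_y = [t for j, t in enumerate(y) if j != i]
        hits += classifier(train_X, train_y, [X[i][f] for f in features]) == y[i]
    return hits / len(X)


def wrapper_select(classifier, X, y):
    """Greedy forward selection: add a feature only if accuracy strictly improves."""
    selected, best = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining:
        scores = [(loo_accuracy(classifier, X, y, selected + [f]), f) for f in remaining]
        top_score = max(s for s, _ in scores)
        if top_score <= best:
            break
        top_feature = next(f for s, f in scores if s == top_score)
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected


def vote(subsets, min_votes):
    """Keep features chosen by at least min_votes of the per-classifier subsets."""
    counts = Counter(f for subset in subsets for f in subset)
    return sorted(f for f, c in counts.items() if c >= min_votes)


# Toy data (hypothetical): feature 0 separates the classes, 1 and 2 are noise.
X = [[0.0, 0, 0.5], [0.1, 1, 0.5], [0.2, 0, 0.5], [0.3, 1, 0.5],
     [1.0, 0, 0.5], [1.1, 1, 0.5], [1.2, 0, 0.5], [1.3, 1, 0.5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

subsets = [wrapper_select(clf, X, y) for clf in (one_nn, nearest_centroid)]
chosen = vote(subsets, min_votes=2)
```

Raising `min_votes` keeps only the features that several classifiers agree on, which is one simple way to see how adding classifiers can shrink the selected subset.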
12	Making Sense of the Noise: Statistical Analysis of Environmental DNA Sampling for Invasive Asian Carp Monitoring Near the Great Lakes Song, Jeffery W. 01 May 2017 Sensitive and accurate detection methods are critical for monitoring and managing the spread of aquatic invasive species, such as invasive Silver Carp (SC; Hypophthalmichthys molitrix) and Bighead Carp (BH; Hypophthalmichthys nobilis) near the Great Lakes. A new detection tool called environmental DNA (eDNA) sampling, the collection and screening of water samples for the presence of the target species’ DNA, promises improved detection sensitivity compared to conventional surveillance methods. However, the application of eDNA sampling for invasive species management has been challenging due to the potential for false positives, that is, detecting a species’ eDNA in the absence of live organisms. In this dissertation, I study the sources of error and uncertainty in eDNA sampling and develop statistical tools to show how eDNA sampling should be utilized for monitoring and managing invasive SC and BH in the United States. In chapter 2, I investigate the environmental and hydrologic variables, e.g. reverse flow, that may be contributing to positive eDNA sampling results upstream of the electric fish dispersal barrier in the Chicago Area Waterway System (CAWS), where live SC are not expected to be present. I used a beta-binomial regression model, which showed that reverse flow volume across the barrier has a statistically significant positive relationship with the probability of SC eDNA detection upstream of the barrier from 2009 to 2012, while other covariates, such as water temperature, season and chlorophyll concentration, do not. This is a potential alternative explanation for why SC eDNA has been detected upstream of the barrier but intact SC have not.
In chapter 3, I develop and parameterize a statistical model to evaluate how changes made to the US Fish and Wildlife Service (USFWS)’s eDNA sampling protocols for invasive BH and SC monitoring from 2013 to 2015 have influenced their sensitivity. The model shows that changes to the protocol have caused the sensitivity to fluctuate. Overall, when assuming that eDNA is randomly distributed, the sensitivity of the current protocol is higher for BH eDNA detection and similar for SC eDNA detection compared to the original protocol used from 2009 to 2012. When assuming that eDNA is clumped, the sensitivity of the current protocol is slightly higher for BH eDNA detection but worse for SC eDNA detection. In chapter 4, I apply the model developed in chapter 3 to estimate the BH and SC eDNA concentration distributions in two pools of the Illinois River where BH and SC are considered to be present, one pool where they are absent, and upstream of the electric barrier in the CAWS, given eDNA sampling data and knowledge of the eDNA sampling protocol used in 2014. The results show that the estimated mean eDNA concentrations in the Illinois River are highest in the invaded pools (La Grange; Marseilles) and are lower in the uninvaded pool (Brandon Road). The estimated eDNA concentrations in the CAWS are much lower than the concentrations in the Marseilles pool, which indicates that the few eDNA detections in the CAWS (3% of samples positive for SC and 0.4% of samples positive for BH) do not signal the presence of live BH or SC. The model shows that more than 50% of samples must be positive for BH or SC eDNA to infer Asian carp (AC) presence in the CAWS, i.e., for the estimated concentrations to be similar to those found in the Marseilles pool. Finally, in chapter 5, I develop a decision tree model to evaluate the value of information that monitoring provides for making decisions about BH and SC prevention strategies near the Great Lakes.
The optimal prevention strategy is dependent on prior beliefs about the expected damage of AC invasion, the probability of invasion, and whether or not BH and SC have already invaded the Great Lakes (which is informed by monitoring). Given no monitoring, the optimal strategy is to stay with the status quo of operating electric barriers in the CAWS for low probabilities of invasion and low expected invasion costs. However, if the probability of invasion is greater than 30% and the cost of invasion is greater than $100 million a year, the optimal strategy changes to installing an additional barrier in the Brandon Road pool. Greater risk-aversion (i.e., aversion to monetary losses) causes less prevention (e.g., status quo instead of additional barriers) to be preferred. Given monitoring, the model shows that monitoring provides value for making this decision, only if the monitoring tool has perfect specificity (false positive rate = 0%). Asian Carp Bayesian statistics Decision tree Detection sensitivity Environmental DNA
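The value-of-information reasoning in entry 12 can be illustrated with a small two-action decision tree. The sketch below is only a generic, risk-neutral calculation under stated assumptions: the action names, the cost figures in `COSTS`, and the prior, sensitivity and specificity values are hypothetical, and the code does not reproduce the dissertation's model or its finding about perfect specificity.

```python
def best_expected_cost(p_invaded, costs):
    """Lowest expected cost over all actions for a given invasion probability."""
    return min(
        p_invaded * c["invaded"] + (1 - p_invaded) * c["not_invaded"]
        for c in costs.values()
    )


def value_of_monitoring(prior, sensitivity, specificity, costs):
    """Expected cost saved by observing one monitoring result before acting."""
    no_monitoring = best_expected_cost(prior, costs)

    # Probability of each monitoring outcome under the prior.
    p_pos = sensitivity * prior + (1 - specificity) * (1 - prior)
    p_neg = 1 - p_pos

    # Average the best achievable cost over both outcomes (Bayes' rule posteriors).
    with_monitoring = 0.0
    if p_pos > 0:
        posterior_pos = sensitivity * prior / p_pos
        with_monitoring += p_pos * best_expected_cost(posterior_pos, costs)
    if p_neg > 0:
        posterior_neg = (1 - sensitivity) * prior / p_neg
        with_monitoring += p_neg * best_expected_cost(posterior_neg, costs)

    return no_monitoring - with_monitoring


# Hypothetical annual costs (arbitrary units) for two hypothetical actions.
COSTS = {
    "status_quo": {"invaded": 100.0, "not_invaded": 0.0},
    "extra_barrier": {"invaded": 20.0, "not_invaded": 10.0},
}

perfect = value_of_monitoring(0.3, 1.0, 1.0, COSTS)
imperfect = value_of_monitoring(0.3, 0.9, 0.8, COSTS)
uninformative = value_of_monitoring(0.3, 0.5, 0.5, COSTS)
```

In this simplified setting a test that is no better than a coin flip has no value, because the posterior equals the prior and the chosen action never changes. Risk aversion, which the abstract discusses, is not modelled here.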
13	Customer Churn Prediction Using Big Data Analytics TANNEEDI, NAREN NAGA PAVAN PRITHVI January 2016 Customer churn is always a grievous issue for the Telecom industry, as customers do not hesitate to leave if they don’t find what they are looking for. They certainly want competitive pricing, value for money and, above all, high quality service. Customer churn is directly related to customer satisfaction. It is a known fact that the cost of customer acquisition is far greater than the cost of customer retention, which makes retention crucial to the business. There is no standard model which accurately addresses the churn issues of global telecom service providers. Big Data analytics with Machine Learning were found to be an efficient way of identifying churn. This thesis aims to predict customer churn using Big Data analytics, namely a J48 decision tree on a Java-based benchmark tool, WEKA. Three different datasets from various sources were considered: the first includes a Telecom operator’s six-month aggregate data usage volumes for active and churned users, the second includes globally surveyed data, and the third comprises individual weekly data usage analysis of 22 Android customers along with their average quality, annoyance and churn scores from accompanying theses. Statistical analyses were performed and J48 decision trees were drawn for the three datasets. From the statistics of normalized volumes, autocorrelations were small, so the confidence intervals were reliable, but the intervals overlapped and lay close together; therefore little significance could be established and no strong trends could be observed. From decision tree analytics, decision trees with 52%, 70% and 95% accuracies were achieved for the three data sources respectively. Data preprocessing, data normalization and feature selection were shown to be prominently influential. Monthly data volumes did not show much decision power.
Average Quality, Churn Risk and to some extent, Annoyance scores may point out a probable churner. Weekly data volumes with customer’s recent history and necessary attributes like age, gender, tenure, bill, contract, data plan, etc., are pivotal for churn prediction. Big Data churn prediction decision tree Quality of Experience
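J48 is WEKA's implementation of the C4.5 algorithm, which chooses splits by gain ratio. The following Python sketch shows that criterion on a toy example; it is a simplified illustration for categorical attributes only, and the attribute names (`contract`, `gender`) and the churn labels are hypothetical data, not taken from the thesis.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total) for n in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of splitting on `values`, normalised by split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)

    # Weighted entropy of the class labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder

    # Split information penalises attributes with many distinct values.
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy records: one attribute predicts churn, the other does not.
churned = ["yes", "yes", "no", "no"]
contract = ["monthly", "monthly", "yearly", "yearly"]
gender = ["f", "m", "f", "m"]

scores = {
    "contract": gain_ratio(contract, churned),
    "gender": gain_ratio(gender, churned),
}
root_attribute = max(scores, key=scores.get)
```

A tree builder would place `root_attribute` at the root and repeat the same calculation within each branch.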
14	Alokační model projektu Miss Sport / Allocation model of Miss Sport project Kyselý, Ondřej January 2010 The goal of the diploma thesis is to create an allocation model for the Miss Sport project. This project is a platform that allows sponsors to be matched effectively with the female athletes who are members of the project. The result is a decision tree, whose main advantages are transparency and speed of decision making. One of the objectives is to analyze the most important criteria needed to segment the female athletes. One part is a list of aspects that are important to sponsorship but are not included directly in the allocation model. The research focuses on evaluating the attractiveness of female athletes as one of the criteria that are important to potential sponsors.
15	Classificação da exatidão de coordenadas obtidas com a fase da portadora L1 do GPS / Accuracy's classification of GPS L1 carrier phase obtained coordinates Menzori, Mauro 20 December 2005 Fixing the double-difference ambiguities when processing Global Positioning System (GPS) carrier phase data is one of the most crucial steps in relative static positioning. The fixed (integer) ambiguity solution is also used as a quality indicator and gives greater confidence in the positioning result. However, it is purely statistical information, based on the precision of the measurements and unrelated to the accuracy of the coordinates produced by the solution. When points are measured through a single baseline, the accuracy of their coordinates is always inaccessible, whether the solution is fixed or float. Moreover, accepting a float solution carries a greater risk, even though it may have good, albeit unknown, accuracy. For these reasons, many contractors of carrier phase GPS services reject float solutions and require a new data collection, with the consequent loss of time and money. This thesis was developed to find a procedure that improves this situation. To that end, it investigated the behaviour of accuracy in measurements obtained with the GPS L1 carrier phase, monitoring the variable factors present in this kind of measurement, which made it possible to classify the accuracy of results. The investigation was carried out in three steps. First, a set of GPS data collected during 2003, 2004 and 2005 at two continuous monitoring stations of USP was analysed systematically to characterise the behaviour of the variables contained in the data. Next, a database was structured and used as the reference for inducing a decision tree, which was adopted as the paradigm. Finally, this tree was used to infer the accuracy of positioning solutions obtained with the L1 carrier. The procedure was validated by classifying the accuracy of results for several baselines collected under different conditions and at different places in the state of São Paulo and in Brazil. accuracy decision tree GPS
16	Metodologias para mapeamento de suscetibilidade a movimentos de massa Riffel, Eduardo Samuel January 2017 Mapping areas predisposed to adverse events that threaten and damage society is of great importance, mainly because of the role it plays in planning and in environmental, territorial and risk management. This work therefore seeks to contribute to the improvement of methodologies and morphometric parameters for mapping susceptibility to mass movements using GIS and Remote Sensing. One of its objectives is to apply and compare methodologies for assessing susceptibility to mass movements, among them Shalstab and the Decision Tree, which is still little used in this field. To reach a consensus from the literature, it was necessary to organise the information on adverse events through classification; to that end, disaster-related concepts such as susceptibility, vulnerability, hazard and risk were reviewed. A study was also carried out in the municipality of Três Coroas, RS, relating the occurrences of mass movements to the risk zones of CPRM. From morphometric parameters, patterns of landslide occurrence were identified, along with the contribution of factors such as land use, occupation and slope. Finally, two susceptibility mapping methods, the Shalstab model and the Decision Tree, were compared. Morphometric parameters extracted from SRTM images and landslide samples identified in high spatial resolution satellite images were used as model input. In the comparison of the methodologies and the accuracy analysis, the Decision Tree performed better. The difference, however, was small, and both methods can represent the susceptibility map satisfactorily. Shalstab, however, presented more limitations because it needs data of higher spatial resolution. Applying methodologies based on GIS and Remote Sensing contributed to better prevention of damage caused by mass movements. The need for consistent inventories, to make the application of the models more reliable, is nevertheless emphasised. Landslides Disasters Geoprocessing Decision Tree Shalstab
17	An Empirical Application with Data Mining in the Construction of Predictive Model on Corruption Wu, Hsing-yi 03 August 2006 Taiwan is not the only country that faces the threat of corruption. Greedy politicians and insatiable merchants are involved in an unending series of scandals around the world, which harm both national economies and people's wealth. The topic of this research is how to choose the important variables from corruption cases. In recent years, Data Mining techniques have been widely applied to the behavioral analysis of shopping, customer relationship management and crime investigation; however, their application in the political and social domains is still limited. In this research, we introduce the concepts and techniques of Data Mining and use them to build a selection model that the government can consider for corruption prevention. The research also explores the opportunities that Data Mining offers for social sciences research. Artificial Neural Network Corruption Decision Tree Data Mining Clustering
18	Evaluating feature selection in a marketing classification problem Salmeron Perez, Irving Ivan January 2015 Nowadays machine learning is becoming more popular in prediction and classification tasks in many fields. In banks, the telemarketing area uses this approach by gathering information from phone calls made to clients during past campaigns. In practice, phone calls are sometimes annoying and time consuming for both parties, the marketing department and the client. This is why this project is intended to show that feature selection can improve machine learning models. A Portuguese bank gathered data on phone calls and client statistics, such as their jobs, salaries and employment status, to determine the probability that a person would buy the offered product and/or service. The C4.5 decision tree (J48) and the multilayer perceptron (MLP) are the machine learning models used for the experiments. For feature selection, the correlation-based feature selection (Cfs), Chi-squared attribute selection and RELIEF attribute selection algorithms are used. The WEKA framework provides the tools to implement and test the experiments carried out in this research. The results were very close for the two data mining models, with a slight improvement by C4.5 in correct classifications and by MLP in ROC curve rate. With these results it was confirmed that feature selection improves classification and/or prediction results. Neural networks bank marketing decision tree feature selection
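Of the filter methods named in entry 18, chi-squared attribute selection is the simplest to show by hand: each attribute is scored by the chi-squared statistic of its contingency table against the class, and attributes are ranked by that score. The Python sketch below is a plain illustration of that statistic, not WEKA's implementation; the attribute names (`previous_outcome`, `day_parity`) and all records are hypothetical.

```python
from collections import Counter


def chi_squared(values, labels):
    """Pearson chi-squared statistic between a categorical attribute and the class."""
    n = len(labels)
    observed = Counter(zip(values, labels))
    value_totals = Counter(values)
    label_totals = Counter(labels)

    stat = 0.0
    for value, value_count in value_totals.items():
        for label, label_count in label_totals.items():
            # Expected cell count if attribute and class were independent.
            expected = value_count * label_count / n
            stat += (observed[(value, label)] - expected) ** 2 / expected
    return stat


def rank_attributes(table, labels):
    """Return attribute names sorted from highest to lowest chi-squared score."""
    scores = {name: chi_squared(column, labels) for name, column in table.items()}
    return sorted(scores, key=scores.get, reverse=True), scores


# Hypothetical toy records: did the client buy the product?
bought = ["yes", "yes", "no", "no"]
table = {
    "day_parity": ["odd", "even", "odd", "even"],
    "previous_outcome": ["success", "success", "failure", "failure"],
}

ranking, scores = rank_attributes(table, bought)
```

Attributes at the top of `ranking` would be kept and the rest dropped before training a classifier such as a decision tree.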
19	Prognosis of Glioblastoma Multiforme Using Textural Properties on MRI Heydari, Maysam Unknown Date No description available. glioblastoma GBM MRI texture machine learning prognosis survival decision tree
20	Decision Trees for Dynamic Decision Making And System Dynamics Modelling Calibration and Expansion 2014 June 1900 Many practical problems raise the challenge of making decisions over time in the presence of both dynamic complexity and pronounced uncertainty regarding the evolution of important factors that affect the dynamics of the system. In this thesis, we provide an end-to-end implementation of an easy-to-use system to confront such challenges. This system gives policy makers a new approach to take complementary advantage of decision analysis techniques and System Dynamics by allowing easy creation, evaluation, and interactive exploration of hybrid models. As an important application of this methodology, we extended a System Dynamics model within the context of West Nile virus transmission in Saskatchewan.
