  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Benchmarking purely functional data structures

Moss, Graeme E. January 2000
No description available.
2

Text Categorization for E-Government Applications: The Case of City Mayor's Mailbox

Kuo, Chiung-Jung 29 August 2006
The central government and most local governments in Taiwan have adopted e-mail services that allow citizens to request services or express their opinions over the Internet. Traditionally, these requests and opinions are manually classified into the appropriate departments for service rendering. However, given the ever-increasing number of requests and opinions received, the manual classification approach is time-consuming and has become impractical. In this study, we therefore apply text categorization techniques to automatically construct a classification mechanism, with the aim of establishing an efficient e-government service portal. The purpose of this thesis is to investigate the effectiveness of different text categorization methods in supporting automatic classification of service request/opinion e-mails sent to the Mayor's mailbox. Specifically, in each phase of text categorization learning, we adopt and evaluate two methods commonly employed in prior research. In the feature selection phase, both the maximal χ² statistic method and the weighted-average χ² statistic method are evaluated. We consider the binary and TF×IDF representation schemes in the document representation phase. Finally, we adopt decision tree induction and support vector machines (SVM) for inducing a text categorization model for our target e-government application. Our empirical evaluation shows that the text categorization method that employs the maximal χ² statistic for feature selection, the binary representation scheme, and support vector machines as the underlying induction algorithm reaches an accuracy rate of 77.28% and recall and precision rates of more than 77%. Such satisfactory classification effectiveness suggests that the text categorization approach can be employed to establish an effective and intelligent e-government service portal.
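As an illustrative sketch only (none of this code is from the thesis), the pipeline the abstract describes — χ² feature selection over a binary bag-of-words representation, classified with a linear SVM — might look as follows in scikit-learn. Note that sklearn's chi2 scorer aggregates the statistic across classes rather than taking the per-class maximum, and the dataset, labels, and number of selected features are hypothetical.

```python
# Hypothetical sketch: chi-square feature selection over a binary
# document representation, followed by a linear SVM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    # Binary representation: 1 if the term occurs in the e-mail, else 0
    ("vectorize", CountVectorizer(binary=True)),
    # Keep the k terms scoring highest on the chi-square statistic
    ("select", SelectKBest(chi2, k=1000)),          # k is hypothetical
    ("classify", LinearSVC()),
])

# emails: list of request/opinion texts; departments: routing targets
# pipeline.fit(emails, departments)
# predicted = pipeline.predict(new_emails)
```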
3

Empirical Evaluations of Different Strategies for Classification with Skewed Class Distribution

Ling, Shih-Shiung 09 August 2004
Existing classification analysis techniques (e.g., decision tree induction) generally exhibit satisfactory classification effectiveness when dealing with data with a non-skewed class distribution. However, real-world applications (e.g., churn prediction and fraud detection) often involve highly skewed decision outcomes. Such a highly skewed class distribution, if not properly addressed, imperils the resulting learning effectiveness. In this study, we empirically evaluate three approaches to classification with a highly skewed class distribution: under-sampling, over-sampling, and the multi-classifier committee approach. Owing to its popularity, C4.5 is selected as the underlying classification analysis technique. Based on 10 datasets with highly skewed class distributions, our empirical evaluations suggest that the multi-classifier committee generally outperformed the under-sampling and over-sampling approaches, using recall rate, precision rate, and F1-measure as the evaluation criteria. Furthermore, for applications aiming at a high recall rate, the over-sampling approach is suggested. On the other hand, if the precision rate is the primary concern, adopting the classification model induced directly from the original dataset is recommended.
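For illustration, the under- and over-sampling strategies evaluated above can be sketched as below — a minimal sketch assuming scikit-learn's CART tree standing in for C4.5, with hypothetical data.

```python
# Sketch of under-/over-sampling for skewed classes, using a CART
# decision tree (scikit-learn ships no C4.5 implementation).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def rebalance(X, y, minority_label, oversample=True):
    """Duplicate minority instances (over-sampling) or discard majority
    instances (under-sampling) until the two classes are even."""
    is_min = (y == minority_label)
    X_min, y_min = X[is_min], y[is_min]
    X_maj, y_maj = X[~is_min], y[~is_min]
    if oversample:
        X_min, y_min = resample(X_min, y_min,
                                n_samples=len(X_maj), random_state=0)
    else:
        X_maj, y_maj = resample(X_maj, y_maj, replace=False,
                                n_samples=len(X_min), random_state=0)
    return np.vstack([X_min, X_maj]), np.concatenate([y_min, y_maj])

# tree = DecisionTreeClassifier().fit(*rebalance(X_train, y_train, 1))
```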
4

Applications of Data Mining on Drug Safety: Predicting Proper Dosage of Vancomycin for Patients with Renal Insufficiency and Impairment

Yon, Chuen-huei 24 August 2004
Drug misuse wastes medical resources and imposes significant costs on society. Because of vancomycin's narrow therapeutic range, an appropriate vancomycin dosage is difficult to determine, and when an inappropriate dosage is used, side effects such as poisoning reactions or drug resistance may occur. Clinically, medical professionals adjust vancomycin protocols based on Therapeutic Drug Monitoring (TDM) results. TDM is usually defined as the clinical use of drug blood concentration measurements as an aid in finding and adjusting dosage. However, TDM cannot be applied to first-time treatments, in which case dosage decisions need to rely on medical professionals' clinical experience and judgment. Data mining has been applied in various medical and healthcare applications. In this study, we employ decision-tree induction (specifically, C4.5) and a backpropagation neural network for predicting the appropriateness of vancomycin usage for patients with renal insufficiency and impairment. In addition, we evaluate whether the boosting and bagging algorithms improve predictive accuracy. Our empirical evaluation results suggest that they do. Specifically, C4.5 in conjunction with the AdaBoost algorithm achieves an overall accuracy of 79.65%, significantly improving on the existing practice's accuracy rate of 41.38%. With respect to the appropriateness category ("Y") and the inappropriateness category ("N"), C4.5 in conjunction with AdaBoost achieves recall rates of 78.75% and 80.25%, respectively. Hence, incorporating data mining techniques into decision support would enhance drug safety, which in turn would improve patient safety and reduce subsequent waste of medical resources.
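As a rough sketch of the boosted and bagged tree setups described above (again with scikit-learn's CART standing in for C4.5; the feature matrix and the "Y"/"N" labels are hypothetical, and the `estimator` keyword assumes scikit-learn >= 1.2):

```python
# Sketch: boosting and bagging decision trees, approximating the
# C4.5 + AdaBoost / bagging combinations evaluated above.
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.metrics import recall_score
from sklearn.tree import DecisionTreeClassifier

boosted = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                             n_estimators=50, random_state=0)
bagged = BaggingClassifier(estimator=DecisionTreeClassifier(),
                           n_estimators=50, random_state=0)

# boosted.fit(X_train, y_train)          # y in {"Y", "N"}
# Per-class recall, cf. the 78.75% / 80.25% figures reported above:
# recall_score(y_test, boosted.predict(X_test), average=None)
```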
5

Improving Data Quality: Development and Evaluation of Error Detection Methods

Lee, Nien-Chiu 25 July 2002
High-quality data are essential to decision support in organizations. However, estimates have shown that 15-20% of the data within an organization's databases can be erroneous. Some databases contain a large number of errors, posing a serious potential problem if they are used for managerial decision-making. To improve data quality, data cleaning efforts are needed and have been initiated by many organizations. Broadly, data quality problems can be classified into three categories: incompleteness, inconsistency, and incorrectness. Among the three, data incorrectness represents the major source of low-quality data. Thus, this research focuses on error detection for improving data quality. In this study, we developed a set of error detection methods based on the semantic constraint framework. Specifically, we proposed methods for uniqueness detection, domain detection, attribute value dependency detection, attribute domain inclusion detection, and entity participation detection. Empirical evaluation results showed that some of the proposed error detection techniques (e.g., uniqueness detection) achieved low miss rates and low false alarm rates. Overall, our error detection methods together could identify around 50% of the errors introduced by subjects during the experiments.
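Two of the proposed checks, uniqueness detection and domain detection, are simple enough to sketch. The sketch below is not from the thesis; it assumes pandas and entirely hypothetical column names and files.

```python
# Hypothetical sketch of two semantic-constraint error checks.
import pandas as pd

def uniqueness_violations(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Rows whose supposedly unique key value occurs more than once."""
    return df[df.duplicated(subset=[key], keep=False)]

def domain_violations(df: pd.DataFrame, column: str, domain: set) -> pd.DataFrame:
    """Rows whose value falls outside the attribute's declared domain."""
    return df[~df[column].isin(domain)]

# records = pd.read_csv("customers.csv")            # hypothetical file
# dupes = uniqueness_violations(records, "customer_id")
# bad = domain_violations(records, "gender", {"M", "F"})
```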
6

An Integrative Approach for Examining the Determinants of Abnormal Returns: The Cases of Internet Security Breach and Ecommerce Initiative

Andoh-Baidoo, Francis Kofi 01 January 2006
Researchers in various business disciplines use the event study methodology to assess the market value of firms through capital market reaction to news in the public media about the firm's activities. Capital market reaction is assessed based on cumulative abnormal return, the sum of abnormal returns over the event window. In this study, the event study methodology is used to assess the impact that two important information technology activities, Internet security breaches and ecommerce initiatives, have on the market value of firms. While prior research on the relationship between these business activities and cumulative abnormal return relied on regression analysis, in this study we use both decision tree induction and regression. For the Internet security breach study, we use negative cumulative abnormal return as a surrogate for damage to the breached firm. In contrast to what has been reported in the research literature, our results suggest that the relationship between cumulative abnormal return and the independent variables, for both the Internet security breach and the ecommerce initiative studies, is complex, often involving conditional interactions between the independent variables. We report that incomplete contract theory is unable to effectively explain the relationship between cumulative abnormal return and the organizational variables; other ecommerce theories support the findings from our analysis. We show that both attack and firm characteristics are determinants of damage to breached firms. Our results reveal that decision tree induction offers insight beyond that provided by regression models. We illustrate that there is value in using data mining techniques to study the market value of ecommerce initiatives and Internet security breaches, that this approach has applicability in other domains, and that decision tree induction can enhance the event study methodology. We demonstrate that decision tree induction can be used for both theory building and theory testing; we specifically employ it to test and enhance ecommerce theories and to develop a theoretical model for cumulative abnormal return and ecommerce. We also present theoretical models for Internet security breaches and damage to the breached firm. These models can be used by decision makers in formulating and implementing Internet security and ecommerce investment strategies.
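The core quantity of the event study, cumulative abnormal return, follows directly from the definition given above. A minimal sketch (market-model parameters assumed to be estimated on a pre-event window; all inputs hypothetical):

```python
# Sketch: cumulative abnormal return (CAR) = sum over the event window
# of (actual firm return - market-model expected return).
import numpy as np

def cumulative_abnormal_return(firm_returns, market_returns, alpha, beta):
    """CAR over the event window, given market-model parameters
    (alpha, beta) estimated on a pre-event estimation window."""
    expected = alpha + beta * np.asarray(market_returns)
    abnormal = np.asarray(firm_returns) - expected
    return abnormal.sum()

# A negative CAR after a breach announcement serves above as a
# surrogate for damage to the breached firm.
```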
7

Enhancing Accuracy of Hybrid Recommender Systems Through Adapting the Domain Trends

Aksel, Fatih 01 September 2010
Traditional hybrid recommender systems typically follow a manually created, fixed prediction strategy in their decision-making process. Experts usually design these static strategies as fixed combinations of different techniques. However, people's tastes and desires are transient and gradually evolve, and each domain has its own characteristics, trends, and user interests. Recent research has mostly focused on static hybridization schemes that do not change at runtime. In this thesis work, we describe an adaptive hybrid recommender system, called AdaRec, that modifies its attached prediction strategy at runtime according to the performance of its prediction techniques (user feedback). Our approach to this problem is to use adaptive prediction strategies. Experimental results show that our system outperforms a naive hybrid recommender.
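AdaRec's actual strategy-switching logic is not spelled out in the abstract; purely as an assumption-laden analogy, a hybrid that re-weights its component predictors from user feedback could be sketched as follows (the predictor interface is invented for illustration).

```python
# Hypothetical sketch: a weighted hybrid whose component weights shift
# toward whichever predictor has been more accurate on recent feedback.
class AdaptiveHybrid:
    def __init__(self, predictors, learning_rate=0.1):
        self.predictors = predictors          # each has .predict(user, item)
        self.weights = [1.0 / len(predictors)] * len(predictors)
        self.lr = learning_rate

    def predict(self, user, item):
        parts = [p.predict(user, item) for p in self.predictors]
        return sum(w * r for w, r in zip(self.weights, parts))

    def feedback(self, user, item, true_rating):
        # Penalize components with high error on the observed rating,
        # then renormalize so the weights stay a convex combination.
        errors = [abs(p.predict(user, item) - true_rating)
                  for p in self.predictors]
        worst = max(errors) + 1e-9
        for i, e in enumerate(errors):
            self.weights[i] *= (1.0 - self.lr * e / worst)
        total = sum(self.weights)
        self.weights = [w / total for w in self.weights]
```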
8

Estimation of distribution algorithms for clustering and classification

Cagnini, Henry Emanuel Leal 20 March 2017
Extracting meaningful information from data is not an easy task. Data can arrive in batches or as a continuous stream, and can be complete or incomplete, duplicated, or noisy. Moreover, there are many algorithms for performing data mining tasks, and the no-free-lunch theorem states that no single algorithm is best for all problems. As a final obstacle, algorithms usually require hyperparameters to be set in order to operate, which, not surprisingly, often demands a minimum knowledge of the application domain. Since many traditional data mining algorithms employ a greedy local search strategy, fine-tuning these hyperparameters becomes a crucial step towards achieving better predictive models. On the other hand, estimation of distribution algorithms perform a global search, which is often more efficient than an exhaustive search through the set of possible solutions. By using a quality function, estimation of distribution algorithms iteratively seek better solutions throughout their evolutionary process. Based on the benefits that estimation of distribution algorithms may offer to clustering and decision-tree induction, two data mining tasks considered to be NP-hard and NP-hard/complete, respectively, this work aims at developing novel estimation of distribution algorithms that obtain better results than traditional greedy methods and baseline evolutionary approaches.
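To make the idea concrete, the simplest estimation of distribution algorithm (UMDA, with univariate marginals over bit strings) can be sketched in a few lines; the thesis itself develops far richer EDAs for clustering and decision-tree induction, so this is only a generic illustration.

```python
# Sketch: univariate marginal distribution algorithm (UMDA) for a
# bit-string fitness function.
import numpy as np

def umda(fitness, n_bits, pop_size=100, elite=20, generations=50):
    probs = np.full(n_bits, 0.5)     # initial marginal distribution
    rng = np.random.default_rng(0)
    for _ in range(generations):
        # Sample a population from the current distribution
        pop = (rng.random((pop_size, n_bits)) < probs).astype(int)
        scores = np.apply_along_axis(fitness, 1, pop)
        # Select the fittest and re-estimate the distribution from them
        best = pop[np.argsort(scores)[-elite:]]
        probs = best.mean(axis=0)
    return probs

# e.g. onemax: umda(lambda bits: bits.sum(), n_bits=30)
```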
9

Applications of Knowledge Discovery in Quality Registries - Predicting Recurrence of Breast Cancer and Analyzing Non-compliance with a Clinical Guideline

Razavi, Amir Reza January 2007
In medicine, data are produced by different sources and continuously stored in data repositories. Examples of these growing databases are quality registries. In Sweden, there are many cancer registries where data on cancer patients are gathered and recorded, used mainly for reporting survival analyses to high-level health authorities. In this thesis, a breast cancer quality registry operating in the south-east of Sweden is used as the data source for newer analytical techniques, i.e., data mining as part of the knowledge discovery in databases (KDD) methodology. Analyses are done to sift through these data in order to find interesting information and hidden knowledge. KDD consists of multiple steps, starting with gathering data from different sources and preparing them in data pre-processing stages prior to data mining. Data were cleaned of outliers and noise, and missing values were handled. Then a proper subset of the data was chosen by canonical correlation analysis (CCA) in a dimensionality reduction step. This technique was chosen because there were multiple outcomes and the variables had complex relationships to one another. After the data were prepared, they were analyzed with a data mining method. Decision tree induction, a simple and efficient method, was used to mine the data. To show the benefits of proper data pre-processing, results from data mining with pre-processing were compared with results from data mining without it. The comparison showed that data pre-processing results in a more compact model with better performance in predicting the recurrence of cancer. An important part of knowledge discovery in medicine is to increase the involvement of medical experts in the process. This starts with enquiry about current problems in their field, which leads to finding areas where computer support can be helpful. The experts can suggest potentially important variables and should then approve and validate new patterns or knowledge as predictive or descriptive models. If it can be shown that the performance of a model is comparable to that of domain experts, it is more probable that the model will be used to support physicians in their daily decision-making. In this thesis, we validated the model by comparing predictions made by data mining with those made by domain experts, finding no significant difference between them. Breast cancer patients who are treated with mastectomy are recommended to receive radiotherapy. This treatment is called postmastectomy radiotherapy (PMRT), and there is a guideline for prescribing it. A history of this treatment is stored in breast cancer registries. We analyzed these datasets using rules from a clinical guideline and identified cases that had not been treated according to the PMRT guideline. Data mining revealed some patterns of non-compliance with the PMRT guideline, and further analysis revealed some reasons for the non-compliance. These patterns were then compared with reasons acquired from manual inspection of patient records. The comparisons showed that patterns resulting from data mining were limited to the variables stored in the registry; a prerequisite for better results is the availability of comprehensive datasets. Medicine can take advantage of the KDD methodology in different ways. The main advantage is being able to reuse information and explore hidden knowledge that can be obtained using advanced analysis techniques.
The results depend on good collaboration between medical informaticians and domain experts and on the availability of high-quality data.
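A compressed sketch of the pre-process-then-mine flow described above, assuming scikit-learn. Note the simplification: here the canonical components themselves feed the tree, whereas the thesis used CCA to select a subset of the original registry variables, and all data names are hypothetical.

```python
# Hypothetical sketch: CCA-based dimensionality reduction followed by
# decision-tree induction. X: candidate predictors from the registry;
# Y: the multiple outcomes; recurrence: binary recurrence labels.
from sklearn.cross_decomposition import CCA
from sklearn.tree import DecisionTreeClassifier

cca = CCA(n_components=2)
# X_reduced, _ = cca.fit_transform(X, Y)
# model = DecisionTreeClassifier().fit(X_reduced, recurrence)
```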
10

Classification Analysis Techniques for Skewed Class

Chyi, Yu-Meei 12 February 2003
Existing classification analysis techniques (e.g., decision tree induction, backpropagation neural networks, k-nearest neighbor classification) generally exhibit satisfactory classification effectiveness when dealing with data with a non-skewed class distribution. However, real-world applications (e.g., churn prediction and fraud detection) often involve highly skewed decision outcomes (e.g., 2% churners and 98% non-churners). Such a highly skewed class distribution, if not properly addressed, imperils the resulting learning effectiveness and might result in a "null" prediction system that simply assigns every instance the majority decision class of the training instances (e.g., predicting all customers to be non-churners). In this study, we extended the multi-classifier class-combiner approach and proposed a clustering-based multi-classifier class-combiner technique to address the highly skewed class distribution problem in classification analysis. In addition, we proposed four distance-based methods for selecting a subset of the instances having the majority decision class, to lower the degree of skewness in a dataset. Using two real-world datasets (mortality prediction for burn patients and customer loyalty prediction), empirical results suggest that the proposed clustering-based multi-classifier class-combiner technique generally outperformed the traditional multi-classifier class-combiner approach and the four distance-based methods.
Keywords: Data Mining, Classification Analysis, Skewed Class Distribution, Decision Tree Induction, Multi-classifier Class-combiner Approach, Clustering-based Multi-classifier Class-combiner Approach
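Purely as an assumption-laden sketch of the clustering-based class-combiner idea (the thesis's exact procedure is richer): cluster the majority class, pair each cluster with all minority instances, train one classifier per balanced subset, and combine predictions by majority vote.

```python
# Hypothetical sketch of a clustering-based multi-classifier
# class-combiner for skewed data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def train_committee(X_maj, y_maj, X_min, y_min, k=5):
    """One decision tree per (majority cluster + full minority class)."""
    clusters = KMeans(n_clusters=k, n_init=10,
                      random_state=0).fit_predict(X_maj)
    committee = []
    for c in range(k):
        X_sub = np.vstack([X_maj[clusters == c], X_min])
        y_sub = np.concatenate([y_maj[clusters == c], y_min])
        committee.append(DecisionTreeClassifier().fit(X_sub, y_sub))
    return committee

def committee_predict(committee, X):
    votes = np.stack([m.predict(X) for m in committee])   # shape (k, n)
    # Majority vote per instance across the committee
    return np.array([max(set(col), key=list(col).count) for col in votes.T])
```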
