Global ETD Search

171	Discovering and summarizing email conversations Zhou, Xiaodong 05 1900 With the ever-increasing popularity of email, it is now very common for a group of people to discuss specific issues, events or tasks by email. Those discussions can be viewed as email conversations and are valuable to the user as a personal information repository. For instance, in the 10 minutes before a meeting, a user may want to quickly go through a previous email discussion that is about to be taken up in the meeting. In this case, rather than reading each individual email one by one, it is preferable to read a concise summary of the previous discussion with the major information summarized. In this thesis, we study the problem of discovering and summarizing email conversations. We believe that our work can greatly support users with their email folders. However, the characteristics of email conversations, e.g., lack of synchronization, conversational structure and informal writing style, make this task particularly challenging. In this thesis, we tackle this task by considering the following aspects: discovering emails in one conversation, capturing the conversation structure and summarizing the email conversation. We first study how to discover all emails belonging to one conversation. Specifically, we study the hidden email problem, which is important for email summarization and other applications but has not been studied before. We propose a framework to discover and regenerate hidden emails. The empirical evaluation shows that this framework is accurate and scalable to large folders. Second, we build a fragment quotation graph to capture email conversations. The hidden emails belonging to each conversation are also included in the corresponding graph. Based on the quotation graph, we develop a novel email conversation summarizer, ClueWordSummarizer. 
The comparison with a state-of-the-art email summarizer as well as with a popular multi-document summarizer shows that ClueWordSummarizer obtains a higher accuracy in most cases. Furthermore, to address the characteristics of email conversations, we study several ways to improve the ClueWordSummarizer by considering more lexical features. The experiments show that many of those improvements, especially subjective words and phrases, can significantly increase the accuracy. / Science, Faculty of / Computer Science, Department of / Graduate email summarization data mining
172	Mining continuous classes using evolutionary computing Potgieter, Gavin 22 July 2005 Data mining is the term given to knowledge discovery paradigms that attempt to infer knowledge, in the form of rules, from structured data using machine learning algorithms. Specifically, data mining attempts to infer rules that are accurate, crisp, comprehensible and interesting. There are not many data mining algorithms for mining continuous classes. This thesis develops a new approach for mining continuous classes. The approach is based on a genetic program, which utilises an efficient genetic algorithm approach to evolve the non-linear regressions described by the leaf nodes of individuals in the genetic program's population. The approach also optimises the learning process by using an efficient, fast data clustering algorithm to reduce the training pattern search space. Experimental results from both algorithms are compared with results obtained from a neural network. The experimental results of the genetic program are also compared against a commercial data mining package (Cubist). These results indicate that the genetic algorithm technique is substantially faster than the neural network, and produces comparable accuracy. The genetic program produces substantially less complex rules than those of both the neural network and Cubist. / Dissertation (MSc)--University of Pretoria, 2006. / Computer Science / unrestricted Data mining UCTD
173	Ant colony optimization approach for stacking configurations CHEN, Yijun 01 January 2011 In data mining, classifiers are generated to predict the class labels of the instances. An ensemble is a decision making system which applies certain strategies to combine the predictions of different classifiers and generate a collective decision. Previous research has empirically and theoretically demonstrated that an ensemble classifier can be more accurate and stable than its component classifiers in most cases. Stacking is a well-known ensemble which adopts a two-level structure: the base-level classifiers to generate predictions and the meta-level classifier to make collective decisions. A consequential problem is: what learning algorithms should be used to generate the base-level and meta-level classifiers in the Stacking configuration? It is not easy to find a suitable configuration for a specific dataset. In some early works, the selection of a meta classifier and its training data was the major concern. Recently, researchers have tried to apply metaheuristic methods to optimize the configuration of the base classifiers and the meta classifier. Ant Colony Optimization (ACO), which is inspired by the foraging behaviors of real ant colonies, is one of the most popular approaches among the metaheuristics. In this work, we propose a novel ACO-Stacking approach that uses ACO to tackle the Stacking configuration problem. This work is the first to apply ACO to the Stacking configuration problem. Different implementations of the ACO-Stacking approach are developed. The first version identifies the appropriate learning algorithms in generating the base-level classifiers while using a specific algorithm to create the meta-level classifier. The second version simultaneously finds the suitable learning algorithms to create the base-level classifiers and the meta-level classifier. 
Moreover, we study how different kinds of local information about classifiers affect the classification results. Several pieces of local information collected from the initial phase of ACO-Stacking are considered, such as the precision and F-measure of each classifier and the correlative differences of paired classifiers. A series of experiments is performed to compare the ACO-Stacking approach with other ensembles on a number of datasets of different domains and sizes. The experiments show that the new approach can achieve promising results and gain advantages over other ensembles. The correlative differences of the classifiers could be the best local information in this approach. Under the agile ACO-Stacking framework, an application to deal with a direct marketing problem is explored. A real-world database from a US-based catalog company, containing more than 100,000 customer marketing records, is used in the experiments. The results indicate that our approach can gain more cumulative response lifts and cumulative profit lifts in the top deciles. In conclusion, it is competitive with some well-known conventional and ensemble data mining methods. Data mining Business
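The two-level Stacking structure described in this record can be illustrated with a minimal sketch. It is not the ACO-Stacking method itself: no ant colony search is performed, and the base learner (a one-feature threshold stump), the meta learner (a lookup table over base predictions) and all function names are assumptions chosen for brevity. Practical stacking would normally train the meta level on held-out base predictions, which this sketch skips.

```python
from collections import Counter, defaultdict

def train_stump(X, y, feature):
    """Hypothetical base-level classifier: threshold one feature at its mean."""
    t = sum(row[feature] for row in X) / len(X)
    acc = sum((row[feature] > t) == bool(label) for row, label in zip(X, y)) / len(y)
    flip = acc < 0.5  # invert the rule if it is worse than chance
    return lambda row: int((row[feature] > t) != flip)

def train_meta(base_preds, y):
    """Hypothetical meta-level classifier: majority label per tuple of base predictions."""
    votes = defaultdict(Counter)
    for preds, label in zip(base_preds, y):
        votes[preds][label] += 1
    default = Counter(y).most_common(1)[0][0]
    table = {k: c.most_common(1)[0][0] for k, c in votes.items()}
    return lambda preds: table.get(preds, default)

def train_stacking(X, y, features):
    """Train base-level stumps, then a meta-level model on their predictions."""
    bases = [train_stump(X, y, f) for f in features]
    # Simplification: the meta level is trained on in-sample base predictions.
    base_preds = [tuple(b(row) for b in bases) for row in X]
    meta = train_meta(base_preds, y)
    return lambda row: meta(tuple(b(row) for b in bases))

# XOR-like toy data: neither stump alone separates the classes.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]
model = train_stacking(X, y, features=[0, 1])
predictions = [model(row) for row in X]
```

On this toy data each stump is only 50% accurate on its own, while the meta level recovers every label from the pair of base predictions, which is the sense in which an ensemble can outperform its components.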
174	Mining for Significant Information from Unstructured and Structured Biological Data and Its Applications Al-Azzam, Omar Ghazi January 2012 Massive amounts of biological data are being accumulated in science. Searching for significant meaningful information and patterns from different types of data is necessary towards gaining knowledge from these large amounts of data available to users. However, data mining techniques do not normally deal with significance. Integrating data mining techniques with standard statistical procedures provides a way for mining statistically significant, interesting information from both structured and unstructured data. In this dissertation, different algorithms for mining significant biological information from both unstructured and structured data are proposed. A weighted-density-based approach is presented for mining item data from unstructured textual representations. Different algorithms in the area of radiation hybrid mapping are developed for mining significant information from structured binary data. The proposed algorithms have different applications in the ordering problem in radiation hybrid mapping including: identifying unreliable markers, and building solid framework maps. Effectiveness of the proposed algorithms towards improving map stability is demonstrated. Map stability is determined based on resampling analysis. The proposed algorithms deal effectively and efficiently with multidimensional data and also reduce computational cost dramatically. Evaluation shows that the proposed algorithms outperform comparative methods in terms of both accuracy and computation cost. Data mining. Gene mapping.
175	Analyse et fouille de données de trajectoires d'objets mobiles / Analysis and data mining of moving object trajectories El Mahrsi, Mohamed Khalil 30 September 2013 First, we study the sampling of trajectory streams. Keeping all of the trajectories captured by modern geolocation devices can be costly in storage space and computation time. Developing suitable sampling techniques therefore becomes essential in order to reduce the size of the data by removing certain positions while preserving as many of the spatiotemporal characteristics of the original trajectories as possible. In the data stream context, these techniques must also run "on the fly" and adapt to the continuous and ephemeral nature of the data. To this end, we propose the STSS (spatiotemporal stream sampling) algorithm, which has low time complexity and guarantees an upper bound on sampling errors. We demonstrate the performance of our proposal by comparing it with other existing approaches. We also study the problem of unsupervised classification (clustering) of trajectories constrained by a road network. We propose three approaches to handle this case. The first approach focuses on discovering groups of trajectories that traveled along the same parts of the road network. The second approach aims to group road segments that are visited very frequently by the same trajectories. The third approach combines the two aspects in order to perform simultaneous co-clustering of trajectories and segments. We show how these approaches can be used to characterize traffic and movement dynamics in the road network, and we carry out experimental studies to evaluate their performance. / In this thesis, we explore two problems related to managing and mining moving object trajectories. 
First, we study the problem of sampling trajectory data streams. Storing the entirety of the trajectories provided by modern location-aware devices can entail severe storage and processing overheads. Therefore, adapted sampling techniques are necessary in order to discard unneeded positions and reduce the size of the trajectories while still preserving their key spatiotemporal features. In streaming environments, this process needs to be conducted "on-the-fly" since the data are transient and arrive continuously. To this end, we introduce a new sampling algorithm called spatiotemporal stream sampling (STSS). This algorithm is computationally efficient and guarantees an upper bound for the approximation error introduced during the sampling process. Experimental results show that STSS achieves good performance and can compete with more sophisticated and costly approaches. The second problem we study is clustering trajectory data in road network environments. We present three approaches to clustering such data: the first approach discovers clusters of trajectories that traveled along the same parts of the road network; the second approach is segment-oriented and aims to group together road segments based on trajectories that they have in common; the third approach combines both aspects and simultaneously clusters trajectories and road segments. We show how these approaches can be used to reveal useful knowledge about flow dynamics and characterize traffic in road networks. We also provide experimental results in which we evaluate the performance of our proposals. Fouille de données Data mining
176	Applications of Data Mining in Healthcare Peng, Bo 05 1900 Indiana University-Purdue University Indianapolis (IUPUI) / With increases in the quantity and quality of healthcare related data, data mining tools have the potential to improve people’s standard of living through personalized and predictive medicine. In this thesis we improve the state-of-the-art in data mining for several problems in the healthcare domain. In problems such as drug-drug interaction prediction and Alzheimer’s Disease (AD) biomarker discovery and prioritization, current methods either require tedious feature engineering or have unsatisfactory performance. New effective computational tools are needed that can tackle these complex problems. In this dissertation, we develop new algorithms for two healthcare problems: high-order drug-drug interaction prediction and amyloid imaging biomarker prioritization in Alzheimer’s Disease. Drug-drug interactions (DDIs) and their associated adverse drug reactions (ADRs) represent a significant detriment to public health. Existing research on DDIs primarily focuses on pairwise DDI detection and prediction. Effective computational methods for high-order DDI prediction are desired. In this dissertation, I present a deep learning based model, D³I, for cardinality-invariant and order-invariant high-order DDI prediction. The proposed models achieve 0.740 F1 value and 0.847 AUC value on high-order DDI prediction, and outperform classical methods on order-2 DDI prediction. These results demonstrate the strong potential of D³I and deep learning based models in tackling the prediction problems of high-order DDIs and their induced ADRs. The second problem I consider in this thesis is amyloid imaging biomarker discovery, for which I propose an innovative machine learning paradigm enabling precision medicine in this domain. The paradigm tailors the imaging biomarker discovery process to individual characteristics of a given patient. 
I implement this paradigm using a newly developed learning-to-rank method PLTR. The PLTR model seamlessly integrates two objectives for joint optimization: pushing up relevant biomarkers and ranking among relevant biomarkers. The empirical study of PLTR conducted on the ADNI data yields promising results to identify and prioritize individual-specific amyloid imaging biomarkers based on the individual’s structural MRI data. The resulting top ranked imaging biomarkers have the potential to aid personalized diagnosis and disease subtyping. Data mining Healthcare
177	Trajectory Data Mining in the Design of Intelligent Vehicular Networks Soares de Sousa, Roniel 02 November 2022 Vehicular networks are a promising technology to help solve complex problems of modern society, such as urban mobility. However, the vehicular environment has some characteristics that pose challenges for wireless communication in vehicular networks not usually found in traditional networks. Therefore, the scientific community is still investigating alternative techniques to improve data delivery in vehicular networks. In this context, the recent and increasing availability of trajectory data offers us valuable information in many research areas. These data comprise the so-called "big trajectory data" and represent a new opportunity for improving vehicular networks. However, there is a lack of specific data mining techniques to extract the hidden knowledge from these data. This thesis explores vehicle trajectory data mining to design intelligent vehicular networks. In the first part of this thesis, we deal with errors intrinsic to vehicle trajectory data that hinder their applicability. We propose a trajectory reconstruction framework composed of several preprocessing techniques to convert flawed GPS-based data to road-network constrained trajectories. This new data representation reduces trajectory uncertainty and removes problems such as noise and outliers compared to raw GPS trajectories. After that, we develop a novel and scalable cluster-based trajectory prediction framework that uses enhanced big trajectory data. Besides the prediction framework, we propose a new hierarchical agglomerative clustering algorithm for road-network constrained trajectories that automatically detects the most appropriate number of clusters. The proposed clustering algorithm is one of the components that allow the prediction framework to process large-scale datasets. 
The second part of this thesis applies the enhanced trajectory representation and the prediction framework to improve the vehicular network. We propose the VDDTP algorithm, a novel vehicle-assisted data delivery algorithm based on trajectory prediction. VDDTP creates an extended trajectory model and uses predicted road-network constrained trajectories to calculate packet delivery probabilities. Then, it applies the predicted trajectories and some proposed heuristics in a data forwarding strategy, aiming to improve the vehicular network's global metrics (i.e., delivery ratio, communication overhead, and delivery delay). In this part, we also propose the DisTraC protocol to demonstrate the applicability of vehicular networks to detect traffic congestion and improve urban mobility. DisTraC uses V2V communication to measure road congestion levels cooperatively and reroute vehicles to reduce travel time. We evaluate the proposed solutions through extensive experiments and simulations. For that, we prepare a new large-scale and real-world dataset based on the city of Rio de Janeiro, Brazil. We also use other publicly available real-world datasets. The results demonstrate the potential of the proposed data mining techniques (i.e., trajectory reconstruction and prediction frameworks) and vehicular network algorithms. vehicular networks data mining
178	Feature Tracking in Two Dimensional Time Varying Datasets Thampy, Sajjit 10 May 2003 This study investigates methods that can be used for tracking features in computational fluid dynamics datasets. The two approaches of overlap-based feature tracking and attribute-based feature tracking are studied. Overlap-based techniques use the actual degree of overlap between successive time steps to conclude a match. Attribute-based techniques use characteristics native to the feature being studied, such as size, orientation and speed, to conclude a match between candidate features. Due to limitations on the number of time steps that can be held in a computer's memory, it may be possible to load only a time-subsampled data set. This might result in a decrease in the overlap obtained, and hence a subsequent decrease in the confidence of the match. This study looks into using specific attributes of features, such as rotational velocity and linear velocity, to predict the presence of that feature in a future time step. The use of predictive techniques is tested on swirling features, i.e., vortices. An ellipse-like representation is assumed to be a good approximation of any such feature. The locations of a feature in previous time steps are used to predict its position in a future time step. The ellipse-like representation of the feature is translated to the predicted location and aligned in the predicted orientation. An overlap test is then done. Use of predictive techniques will help increase the overlap, and subsequently the confidence in the match obtained. The techniques were tested on an artificial data set for linear velocity and rotation and on a real data set from a simulation of flow past a cylinder. Regions of swirling flow, detected by computing the swirl parameter, were taken as features for study. The degrees of overlap obtained by a basic overlap test and by the use of predictive methods were tabulated. 
The results show that the use of predictive techniques improved the overlap. Data mining Feature Tracking
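The idea of moving a feature to its predicted location before the overlap test can be sketched as follows. This is an illustrative toy, not the thesis's implementation: features are assumed to be sets of grid cells rather than ellipses, only linear velocity is predicted (no rotation), and the overlap measure (shared cells divided by the size of the smaller feature) and all function names are assumptions.

```python
def centroid(cells):
    """Mean (x, y) position of a feature given as a set of grid cells."""
    n = len(cells)
    return (sum(x for x, _ in cells) / n, sum(y for _, y in cells) / n)

def overlap(a, b):
    """Assumed overlap measure: shared cells over the smaller feature's size."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def predict(prev, curr):
    """Shift curr by the centroid displacement from prev to curr
    (constant linear velocity assumption)."""
    (px, py), (cx, cy) = centroid(prev), centroid(curr)
    dx, dy = round(cx - px), round(cy - py)
    return {(x + dx, y + dy) for x, y in curr}

def block(x0, y0, w=3, h=3):
    """Hypothetical helper: a w-by-h rectangular feature with corner (x0, y0)."""
    return {(x, y) for x in range(x0, x0 + w) for y in range(y0, y0 + h)}

# A 3x3 feature moving two cells per (subsampled) time step along x.
t0, t1, t2 = block(0, 0), block(2, 0), block(4, 0)

basic = overlap(t1, t2)                   # direct overlap between t1 and t2
predicted = overlap(predict(t0, t1), t2)  # overlap after moving t1 to its predicted spot
```

In this toy case the direct overlap is one third, because the feature moves most of its own width between time steps, while the predicted placement coincides with the feature at the next step and overlaps it fully.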
179	Mining Truth Tables and Straddling Biclusters in Binary Datasets Owens, Clifford Conley 07 January 2010 As the world swims deeper into a deluge of data, binary datasets relating objects to properties can be found in many different fields. Such datasets abound in practically any area of interest, including biology, politics, entertainment, and education. This explosion calls for the definition of new types of patterns in binary data, as well as algorithms to find these patterns efficiently. In this work, we introduce truth tables as a new class of patterns to be mined in binary datasets. Truth tables represent a subset of properties which exhibit maximal variability (and hence, suggest independence) in occurrence patterns over the underlying objects. Unlike other measures of independence, truth tables possess anti-monotone features that can be exploited in order to mine them effectively. We present a level-wise algorithm that takes advantage of these features, showing results on real and synthetic data. These results demonstrate the scalability of our algorithm. We also introduce new methods of mining straddling biclusters. Biclusters relate subsets of objects to subsets of properties they share within a single dataset. Straddling biclusters extend biclusters by relating a subset of objects to subsets of properties they share in two datasets. We present two level-wise algorithms, named UnionMiner and TwoMiner, which discover straddling biclusters efficiently by treating multiple datasets as a single dataset. We show results on real and synthetic data, and explore the advantages and limitations of each algorithm. We develop guidelines which suggest which of these algorithms is likely to perform better based on features of the datasets. / Master of Science data mining binary datasets
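A level-wise search that exploits an anti-monotone property can be sketched as below. The sketch assumes one plausible reading of "maximal variability": a set of k properties counts as a truth table when all 2^k combinations of 0/1 values occur among the objects. That definition, the candidate generation scheme and the function names are assumptions for illustration and may differ from the thesis's algorithm.

```python
from itertools import combinations

def is_truth_table(rows, props):
    """True if every one of the 2^k value combinations over props occurs in some row.
    Rows are assumed to be tuples of 0/1 values."""
    seen = {tuple(row[p] for p in props) for row in rows}
    return len(seen) == 2 ** len(props)

def mine_truth_tables(rows):
    """Level-wise (Apriori-style) enumeration of all property sets that
    satisfy is_truth_table, smallest sets first."""
    n = len(rows[0])
    level = [(p,) for p in range(n) if is_truth_table(rows, (p,))]
    found = list(level)
    while level:
        prev = set(level)
        candidates = set()
        for a in level:
            for p in range(a[-1] + 1, n):
                cand = a + (p,)
                # Anti-monotone pruning: every subset one element smaller
                # must already be a truth table.
                if all(sub in prev for sub in combinations(cand, len(a))):
                    candidates.add(cand)
        level = sorted(c for c in candidates if is_truth_table(rows, c))
        found.extend(level)
    return found

# Four objects, three binary properties; property 2 always equals property 0.
rows = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]
result = mine_truth_tables(rows)
```

Here properties 0 and 2 always agree, so the pair (0, 2) is rejected, and the triple (0, 1, 2) is pruned without scanning the data because one of its subsets already failed.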
180	Critical Success Factors in Data Mining Projects. Sim, Jaesung 08 1900 The increasing awareness of data mining technology, along with the attendant increase in the capturing, warehousing, and utilization of historical data to support evidence-based decision making, is leading many organizations to recognize that the effective use of data is the key element in the next generation of client-server enterprise information technology. The concept of data mining is gaining acceptance in business as a means of seeking higher profits and lower costs. To deploy data mining projects successfully, organizations need to know the key factors for successful data mining. Implementing emerging information systems (IS) can be risky if the critical success factors (CSFs) have been researched insufficiently or documented inadequately. While numerous studies have listed the advantages and described the data mining process, there is little research on the success factors of data mining. This dissertation identifies CSFs in data mining projects. Chapter 1 introduces the history of the data mining process and states the problems, purposes, and significance of this dissertation. Chapter 2 reviews the literature, discusses general concepts of data mining and data mining project contexts, and reviews general concepts of CSF methodologies. It also describes the identification process for the various CSFs used to develop the research framework. Chapter 3 describes the research framework and methodology, detailing how the CSFs were identified and validated from more than 1,300 articles published on data mining and related topics. The validated CSFs, organized into a research framework using 7 factors, generate the research questions and hypotheses. Chapter 4 presents analysis and results, along with the chain of evidence for each research question, the quantitative instrument and survey results. 
In addition, it discusses how the data were collected and analyzed to answer the research questions. Chapter 5 concludes with a summary of the findings, describing assumptions and limitations and suggesting future research. Data mining. Critical success factors data mining CSF

Search results