41 |
Trajectory Data Mining in the Design of Intelligent Vehicular Networks / Soares de Sousa, Roniel / 02 November 2022
Vehicular networks are a promising technology to help solve complex problems of modern society, such as urban mobility. However, the vehicular environment poses challenges for wireless communication that are not usually found in traditional networks. Therefore, the scientific community is still investigating alternative techniques to improve data delivery in vehicular networks. In this context, the recent and increasing availability of trajectory data offers valuable information for many research areas. These data comprise the so-called "big trajectory data" and represent a new opportunity for improving vehicular networks. However, there is a lack of specific data mining techniques to extract the hidden knowledge from these data.
This thesis explores vehicle trajectory data mining to design intelligent vehicular networks. In the first part of this thesis, we deal with errors intrinsic to vehicle trajectory data that hinder their applicability. We propose a trajectory reconstruction framework composed of several preprocessing techniques to convert flawed GPS-based data to road-network constrained trajectories. This new data representation reduces trajectory uncertainty and removes problems such as noise and outliers compared to raw GPS trajectories. After that, we develop a novel and scalable cluster-based trajectory prediction framework that uses enhanced big trajectory data. Besides the prediction framework, we propose a new hierarchical agglomerative clustering algorithm for road-network constrained trajectories that automatically detects the most appropriate number of clusters. The proposed clustering algorithm is one of the components that allow the prediction framework to process large-scale datasets.
The second part of this thesis applies the enhanced trajectory representation and the prediction framework to improve the vehicular network. We propose the VDDTP algorithm, a novel vehicle-assisted data delivery algorithm based on trajectory prediction. VDDTP creates an extended trajectory model and uses predicted road-network constrained trajectories to calculate packet delivery probabilities. Then, it applies the predicted trajectories and some proposed heuristics in a data forwarding strategy, aiming to improve the vehicular network's global metrics (i.e., delivery ratio, communication overhead, and delivery delay). In this part, we also propose the DisTraC protocol to demonstrate the applicability of vehicular networks to detect traffic congestion and improve urban mobility. DisTraC uses V2V communication to measure road congestion levels cooperatively and reroute vehicles to reduce travel time.
We evaluate the proposed solutions through extensive experiments and simulations. For that, we prepare a new large-scale, real-world dataset based on the city of Rio de Janeiro, Brazil. We also use other publicly available real-world datasets. The results demonstrate the potential of the proposed data mining techniques (i.e., trajectory reconstruction and prediction frameworks) and vehicular network algorithms.
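The idea of clustering road-network constrained trajectories without fixing the number of clusters in advance can be illustrated with a minimal sketch. It is not the thesis's algorithm: the Jaccard distance over road-segment IDs, the average linkage, the `max_distance` stopping threshold, the function names, and the toy trips are all assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Distance between two trajectories given as sequences of road-segment IDs."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def agglomerative_clusters(trajectories, max_distance=0.5):
    """Average-linkage agglomerative clustering with a distance-based stop.

    Merging stops once the closest pair of clusters is farther apart than
    max_distance (an assumed criterion), so the number of clusters is not
    fixed in advance. Returns clusters as sorted lists of trajectory indices.
    """
    clusters = [[i] for i in range(len(trajectories))]

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        total = sum(
            jaccard_distance(trajectories[i], trajectories[j])
            for i in c1
            for j in c2
        )
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters (x < y by construction).
        (x, y), dist = min(
            (
                ((x, y), linkage(clusters[x], clusters[y]))
                for x, y in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if dist > max_distance:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Hypothetical trips, each a list of road-segment IDs after map matching.
trips = [
    ["r1", "r2", "r3", "r4"],
    ["r1", "r2", "r3", "r5"],
    ["r7", "r8", "r9"],
    ["r7", "r8", "r9", "r10"],
    ["r20", "r21"],
]
print(agglomerative_clusters(trips))
```

With the default threshold the two pairs of overlapping trips are merged and the isolated trip stays alone; a stricter threshold leaves more singleton clusters.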
|
42 |
Exploring Node Attributes for Data Mining in Attributed Graphs / Jihwan Lee (6639122) / 10 June 2019
Graphs have attracted researchers in various fields because many different kinds of real-world entities and the relationships between them can be represented and analyzed effectively and efficiently using graphs. In particular, researchers in data mining and machine learning have developed algorithms and models to better understand complex graph data and perform various data mining tasks. While a large body of work exists on graph mining, most existing work does not fully exploit the attributes attached to graph nodes or edges.
In this dissertation, we exploit node attributes to generate better solutions to several graph data mining problems addressed in the literature. First, we introduce the notion of statistically significant attribute associations in attributed graphs and propose an effective and efficient algorithm to discover those associations. The effectiveness analysis shows that our proposed algorithm can reveal insightful attribute associations that cannot be identified using earlier methods focused solely on frequency. Second, we build a probabilistic generative model for observed attributed graphs. Under the assumption that hidden communities exist behind the nodes in a graph, we adopt the idea of latent topic distributions to model the generative process of node attribute values and link structure more precisely. This model can be used to detect hidden communities and profile missing attribute values. Lastly, we investigate how to employ node attributes to learn latent representations of nodes in lower-dimensional embedding spaces and use the learned representations to improve the performance of data mining tasks over attributed graphs.
|
43 |
Modeling and computational strategies for medical decision making / Yuan, Fan / 27 May 2016
In this dissertation, we investigate three topics: predictive models for disease diagnosis and patient behavior, optimization for cancer treatment planning, and public health decision making for infectious disease prevention. In the first topic, we propose a multi-stage classification framework that incorporates Particle Swarm Optimization (PSO) for feature selection and discriminant analysis via mixed integer programming (DAMIP) for classification. By utilizing the reserved judgment region, the framework allows the classifier to delay decisions on ‘difficult-to-classify’ observations and to develop new classification rules in later stages. We apply the framework to four real-life medical problems: 1) Patient readmissions: identifies emergency department patients who return within 72 hours using the patient’s demographic information, complaints, diagnosis, tests, and hospital real-time utility. 2) Flu vaccine responders: predicts high/low responders to the flu vaccine among subjects in 5 years using gene signatures. 3) Knee reinjection: predicts whether a patient needs a second surgery within 3 years of his/her first knee injection, and deals with missing data. 4) Alzheimer’s disease: distinguishes subjects in normal, mild cognitive impairment (MCI), and Alzheimer’s disease (AD) groups using neuropsychological tests. In the second topic, we first investigate multi-objective optimization approaches to determine the optimal dose configuration and radiation seed locations in brachytherapy treatment planning. Tumor dose escalation and dose-volume constraints on critical organs are incorporated to kill the tumor while preserving organ function. Based on the optimization framework, we propose a non-linear optimization model that optimizes the tumor control probability (TCP). The model is solved by a strategy that incorporates piecewise linear approximation and local search.
In the third topic, we study optimal strategies for public health emergencies under limited resources. First, we investigate vaccination strategies against a pandemic flu to find the optimal strategy when limited vaccines are available, by constructing a mathematical model of the course of the 2009 H1N1 pandemic flu and of the vaccination process. Second, we analyze the cost-effectiveness of emergency response strategies against a large-scale anthrax attack to protect the entire regional population.
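The reserved-judgment idea in the first topic (delay the decision on hard observations and let a later stage handle them) can be sketched as follows. This is not DAMIP, which is a mixed integer programming model; the fixed confidence threshold, the two toy scoring stages, and the field names `score_a` and `score_b` are hypothetical and serve only to show the control flow.

```python
def classify_with_reservation(posteriors, threshold=0.8):
    """Return the most probable class, or None to reserve judgment."""
    label, prob = max(posteriors.items(), key=lambda kv: kv[1])
    return label if prob >= threshold else None


def multi_stage_classify(observation, stages, threshold=0.8):
    """Pass an observation through successive stages until one commits.

    Returns (label, stage_number); label is None if every stage reserved
    judgment, in which case stage_number is the last stage tried.
    """
    for stage_number, stage in enumerate(stages, start=1):
        decision = classify_with_reservation(stage(observation), threshold)
        if decision is not None:
            return decision, stage_number
    return None, len(stages)


def clamp(p):
    """Keep a score inside [0, 1] so it can be read as a probability."""
    return min(max(p, 0.0), 1.0)


def stage_one(obs):
    # Hypothetical first-stage rule based on a single feature.
    p = clamp(obs["score_a"])
    return {"return": p, "no_return": 1.0 - p}


def stage_two(obs):
    # Hypothetical later-stage rule that also uses a second feature.
    p = clamp(0.5 * obs["score_a"] + 0.5 * obs["score_b"])
    return {"return": p, "no_return": 1.0 - p}


stages = [stage_one, stage_two]
print(multi_stage_classify({"score_a": 0.9, "score_b": 0.1}, stages))
print(multi_stage_classify({"score_a": 0.7, "score_b": 1.0}, stages))
print(multi_stage_classify({"score_a": 0.6, "score_b": 0.0}, stages))
```

The first observation is decided immediately, the second is reserved at stage one and decided at stage two, and the third remains undecided after both stages.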
|
44 |
Efficient decision tree building algorithms for uncertain data / Tsang, Pui-kwan, Smith., 曾沛坤. / January 2008
published_or_final_version / Computer Science / Master / Master of Philosophy
|
45 |
New results on online job scheduling and data stream algorithms / Lee, Lap-kei, 李立基 / January 2009
published_or_final_version / Computer Science / Doctoral / Doctor of Philosophy
|
46 |
Cluster analysis on uncertain data / Ngai, Wang-kay., 倪宏基. / January 2008
published_or_final_version / Computer Science / Doctoral / Doctor of Philosophy
|
47 |
Customer Churn Predictive Heuristics from Operator and Users' Perspective / MOUNIKA REDDY, CHANDIRI / January 2016
Telecommunication companies face growing customer-service pressure as they launch various user-desired services. Delivering poor customer experiences puts customer relationships and revenue at risk. One of the metrics telecommunications companies use to gauge their relationship with customers is “churn”. After substantial research on churn prediction over many years, Big Data analytics with data mining techniques has been found to be an efficient way to identify churn. These techniques are usually applied to predict customer churn by building models, classifying patterns, and learning from historical data. Although some work has already been undertaken on the users’ perspective, it appears to be in its infancy. The aim of this thesis is to validate churn predictive heuristics from the operator perspective and close to the user end. To that end, experiments are conducted with different groups of people regarding their data usage, and a model is designed that is close to the user end and fitted to the data obtained through the survey. The examined churn indicators are correlated and validated, together with the traffic volume variation, against the users’ feedback collected by accompanying theses. A literature review analyzes previous work, identifies the difficulties in analyzing users’ sentiment, and surveys methodologies for working around accuracy problems in churn prediction algorithms. Experiments are conducted with different groups of people across the globe. Their experiences with call and data quality are analyzed, along with the reasons they would give for churning if they are looking to change operator in the future. Their feedback is validated using existing heuristics. The collected data set is analyzed statistically and validated against different datasets obtained from operators’ data.
Statistical and Big Data analysis has also been done on an operator’s data covering the monthly data volume usage of active and churned customers. A possible correlation between user churn and users’ feedback is studied by calculating percentages, and the results are further correlated with the operators’ data and the data produced by the mobile app. The results show that monthly volumes alone have little decision power and that additional attributes, such as higher time resolution, age, gender and others, are needed. The global survey, meanwhile, shows similarities with the feedback of the operator’s customers and reveals issues common “around the globe”, such as data plan issues, pricing, and issues with connectivity and speed. Data preprocessing and feature selection have proven to be the key factors. Churn predictive models gave a better classification result of 69.7 % when more attributes were provided. Classification of telecom operators’ data gave an accuracy of 51.7 % after preprocessing and for the variables we chose. Finally, close observation of the end user revealed the possibility of a much higher classification precision of 95.2 %.
|
48 |
Data Visualization for the Benchmarking Engine / Joish, Sudha / 16 May 2003
In today's information age, data collection is not the ultimate goal; it is simply the first step in extracting knowledge-rich information to shape future decisions. In this thesis, we present ChartVisio - a simple web-based visual data-mining system that lets users quickly explore databases and transform raw data into processed visuals. It is highly interactive, easy to use and hides the underlying complexity of querying from its users. Data from tables is internally mapped into charts using aggregate functions across tables. The tool thus integrates querying and charting into a single general-purpose application. ChartVisio has been designed as a component of the Benchmark data engine, being developed at the Computer Science department, University of New Orleans. The data engine is an intelligent website generator and users who create websites using the Data Engine are the site owners. Using ChartVisio, owners may generate new charts and save them as XML templates for prospective website surfers. Everyday Internet users may view saved charts with the touch of a button and get real-time data, since charts are generated dynamically. Website surfers may also generate new charts, but may not save them as templates. As a result, even non-technical users can design and generate charts with minimal time and effort.
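The mapping from table data to chart data via aggregate functions can be illustrated with a small sketch. It is not ChartVisio's implementation: the function name, the set of aggregates, the output layout (`labels` and `values`), and the sample benchmark rows are assumptions made for this example.

```python
from collections import defaultdict

# Aggregate functions a chart series could be built with (assumed set).
AGGREGATES = {
    "sum": sum,
    "avg": lambda values: sum(values) / len(values),
    "count": len,
    "min": min,
    "max": max,
}


def table_to_chart_series(rows, category, measure, aggregate="sum"):
    """Group table rows by a category column and aggregate a measure column.

    Returns a dict with parallel lists: one label per category (sorted) and
    one aggregated value per label, ready to hand to a charting component.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[category]].append(row[measure])
    reducer = AGGREGATES[aggregate]
    labels = sorted(groups)
    return {"labels": labels, "values": [reducer(groups[label]) for label in labels]}


# Hypothetical benchmark results table.
rows = [
    {"machine": "A", "runtime_s": 10},
    {"machine": "A", "runtime_s": 14},
    {"machine": "B", "runtime_s": 7},
]
print(table_to_chart_series(rows, "machine", "runtime_s", aggregate="avg"))
```

Keeping the aggregates in a lookup table means a saved chart template only needs to store the column names and the aggregate key, which is one plausible way to regenerate charts dynamically from live data.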
|
49 |
Bioinformatic mining and analysis of genetic elements in genomes. / CUHK electronic theses & dissertations collection / January 2013
In the post-genomic era, it is a huge challenge to detect the functional elements in the "ocean" of data and provide meaningful biological inferences. Here, many interesting functional elements have been characterized and analyzed among targeted genomes. / First, by compiling more than 70,000 experimentally determined post-translational modification (PTM) events from 7 eukaryotic organisms, the features and functions of proteins regulated by multiple types of PTMs (Mtp-Proteins) are detected and analyzed by comparison with proteins harboring no known PTM target site. (1) Mtp-Proteins are found to be significantly enriched in protein complexes, to have more protein partners, and to preferentially act as hubs in the protein-protein interaction (PPI) network. (2) Mtp-Proteins also possess a distinct functional focus and biased subcellular locations. (3) Overall, about 80% of the analyzed PTM events are embedded in intrinsically disordered regions (IDRs), and most Mtp-Proteins have more IDRs than proteins without PTM sites.
This suggests that IDRs may largely explain why some proteins can harbor so many extraordinary functions. (4) Interestingly, some Mtp-Proteins that preferentially carry PTMs in ordered regions are observed to be mainly related to "protein-DNA complex assembly". (5) We further evaluated the energetic effects of PTMs on the stability of PPIs and found that only a small fraction of single PTM events influence the binding energy by more than 2 kcal/mol, but combinational use of PTM types, e.g. combined phosphorylation and acetylation, can change the binding energy dramatically. / In the second part, the different components of the ubiquitination system (ubiquitin, E1, E2, E3 and the substrates of E3) are identified and analyzed comparatively across 74 fungal genomes. The main results are: (1) The ubiquitin number is significantly higher in the mushroom-forming genomes than in other basidiomycota genomes. (2) The number of E1, with an average of 2.92, is consistent among most genomes, whereas the number of E2 is significantly higher in mushroom-forming genomes than in other basidiomycota genomes. (3) For the E3 candidates, the number of Paracaspase and F-box domains in the mushroom-forming genomes is significantly higher than in the other basidiomycota genomes. These results suggest that the ubiquitination system may play a vital role in the divergence of fungal morphogenesis, especially the formation of mushrooms. / Then, the focus shifts to genomic islands (GIs). Compared to the whole genome, transcription initiation positions are first found to be highly enriched in GI regions. Based on this heterogeneous transcriptional regulatory signal, a novel procedure for genomic island detection, GIST (Genome-island Identification by Signals of Transcription), is designed.
Interestingly, our method demonstrates higher sensitivity in detecting genomic islands harboring genes with biased GI-like function, preferred subcellular localization, skewed GC property and shorter gene length. Finally, using GIST, many interesting GIs are detected and analyzed in the German outbreak strain TY-2482 for the first time. / In summary, this work not only considerably expands our understanding of several functional genetic elements, such as genomic islands and proteins regulated by combinational multiple PTMs, but also provides important tools and clues, such as GIST and potential E3 expansion in mushroom-forming fungi, for further related studies. / Detailed summary in vernacular field only. / Huang, Qianli. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2013. / Includes bibliographical references (leaves 161-186). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstracts also in Chinese.
/ Abstract / Abstract (in Chinese) / Abbreviations / Acknowledgements / Declaration / Table of Contents / List of Figures / List of Tables
/ Chapter 1 --- Literature Review / 1.1 --- General introduction / 1.2 --- Post-translational modification / 1.2.1 --- Combinational multiple types of post-translational modification / 1.3 --- Genomic islands / 1.3.1 --- Brief introduction / 1.3.2 --- Bioinformatic tools and database for identification of Genomic islands / 1.4 --- Objectives and significance
/ Chapter 2 --- Systematic analysis on features and functions of proteins regulated by combinational multiple types of post-translational modifications / 2.1 --- Introduction / 2.2 --- Materials and Methods / 2.2.1 --- Annotation of PTM pattern and analyses on target residues / 2.2.2 --- Classification of Human Proteins / 2.2.3 --- Dataset of human protein-protein interactions (PPIs) and Construction of PPI network / 2.2.4 --- Calculation of Binding Energy / 2.2.5 --- Functional characterization and subcellular localization analysis / 2.2.6 --- Annotating IDR regions / 2.2.7 --- Statistical analyses / 2.3 --- Results / 2.3.1 --- Combinational interactions of multiple PTM types are undergoing evolutionary selection / 2.3.2 --- Evolutionary profile of modified amino acid residues / 2.3.3 --- Mtp-Proteins are enriched in the protein complex / 2.3.4 --- Multiple PTMs enable target protein function as hub or super-hub in PPI network / 2.3.5 --- Energetic effect of PTMs on the Stability of protein-protein binding / 2.3.6 --- Mtp-Proteins demonstrate distinct function focus / 2.3.7 --- Mtp-Proteins: located preferedly in Cytoplasm and Nucleus / 2.3.8 --- Why Mtp-Proteins possess so many special features : importance of IDR / 2.4 --- Discussion / 2.4.1 --- The hints from the features of Mtp-Proteins / 2.4.2 --- The implication of combinational interaction between two different functional PTM categories: biased locating in IDRs and ordered regions respectively
/ Chapter 3 --- Genome-wide comparative analyses of ubiquitome among basidiomycota and other typical fungi genomes / 3.1 --- Introduction / 3.2 --- Materials and Methods / 3.2.1 --- Genome sequences and annotation acquirement / 3.2.2 --- Bioinformatic prediction of components in ubiquitome / 3.3 --- Results / 3.3.1 --- Identification of ubiquitin candidates among 74 fungi genomes / 3.3.2 --- Detection of potential E1 and E2 among all considered genomes / 3.3.3 --- Prediction and comparative analysis of different types of E3 / 3.3.4 --- The possible substrates of E3 / 3.4 --- Discussion
/ Chapter 4 --- Genomic islands Identification by Signals of Transcription / 4.1 --- Introduction / 4.2 --- Materials and Methods / 4.2.1 --- Genome sequence and annotation data / 4.2.2 --- Transcription start points (TSPs) scanning / 4.2.3 --- Genomic island dataset construction / 4.2.4 --- GIST: Genomic-island Identification by Signal of Transcription / 4.2.5 --- Functional characterization and subcellular localization analysis / 4.2.6 --- Codon usage, GC content and gene length / 4.2.7 --- Statistical analyses / 4.3 --- Results / 4.3.1 --- High-density transcriptional initiation signals associated with GIs / 4.3.2 --- Predict the potential novel GIs through GIST: Genomic-island Identification by Signal of Transcription / 4.3.3 --- Comparative Analysis: Distribution of gene function categories / 4.3.4 --- Comparative Analysis: Divergence of subcellular locations / 4.3.5 --- Comparative Analysis: GC property and gene length / 4.3.6 --- Hints of "non-optimal" codon usage bias / 4.3.7 --- Application of GIST to analyze GIs in the German E. coli O104:H4 outbreak strain / 4.4 --- Discussion
/ Chapter 5 --- Concluding remarks / References
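The abstract of this record describes detecting genomic islands from regions where transcription initiation signals are unusually dense. A minimal sketch of that general idea follows. It is not the GIST procedure itself, whose scoring is not given here: the fixed window size, the z-score cutoff, the merging of adjacent windows, and all function names are assumptions for illustration, and the positions are synthetic.

```python
from statistics import mean, pstdev


def window_counts(positions, genome_length, window):
    """Count predicted transcription start points per fixed-size window."""
    n_windows = -(-genome_length // window)  # ceiling division
    counts = [0] * n_windows
    for p in positions:
        counts[p // window] += 1
    return counts


def enriched_regions(positions, genome_length, window=1000, z_cutoff=1.5):
    """Return (start, end) intervals whose windows are unusually dense.

    A window is flagged when its count lies at least z_cutoff standard
    deviations above the genome-wide mean (an assumed rule); adjacent
    flagged windows are merged into one candidate region.
    """
    counts = window_counts(positions, genome_length, window)
    mu, sigma = mean(counts), pstdev(counts)
    if sigma == 0:
        return []
    flagged = [(c - mu) / sigma >= z_cutoff for c in counts]
    regions, start = [], None
    # The trailing False closes a region that runs to the last window.
    for i, hit in enumerate(flagged + [False]):
        if hit and start is None:
            start = i
        elif not hit and start is not None:
            regions.append((start * window, min(i * window, genome_length)))
            start = None
    return regions


# Synthetic example: sparse background plus a dense stretch at 4000-5999.
background = [100, 1100, 2100, 3100, 6100, 7100, 8100, 9100]
dense = list(range(4000, 6000, 100))
print(enriched_regions(background + dense, genome_length=10000))
```

A real pipeline would obtain the start positions from a promoter or transcription start predictor and would likely compare against annotated islands; both steps are outside this sketch.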
|
50 |
Visually Mining Interesting Patterns in Multivariate Datasets / Guo, Zhenyu / 06 January 2013
Data mining for patterns and knowledge discovery in multivariate datasets are important processes that help analysts understand the dataset, describe the dataset, and predict unknown data values. However, conventional computer-supported data mining approaches often prevent the user from getting involved in the mining process and from interacting during pattern discovery. Besides, without a visual representation of the extracted knowledge, analysts can have difficulty explaining and understanding the patterns. Therefore, instead of directly applying automatic data mining techniques, it is necessary to develop appropriate techniques and visualization systems that allow users to interactively perform knowledge discovery, visually examine the patterns, adjust the parameters, and discover more interesting patterns based on their requirements. In the dissertation, I will discuss different proposed visualization systems to assist analysts in mining patterns and discovering knowledge in multivariate datasets, including their design, implementation, and evaluation. Three different types of patterns are proposed and discussed: trends, clusters of subgroups, and local patterns. For trend discovery, the parameter space is visualized to allow the user to visually examine the space and find where good linear patterns exist. For cluster discovery, the user is able to interactively set the query range on a target attribute and retrieve all the sub-regions that satisfy the user's requirements. The sub-regions that satisfy the same query and are near each other are grouped and aggregated to form clusters. For local pattern discovery, the patterns for the local sub-region with a focal point and its neighbors are computationally extracted and visually represented. To discover interesting local neighbors, the extracted local patterns are integrated and visually shown to the analysts.
Evaluations of the three visualization systems using formal user studies are also performed and discussed.
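The cluster discovery step (retrieve the sub-regions that satisfy a query range on a target attribute, then group those that are near each other) can be sketched as below. This is an illustrative simplification and not the dissertation's system: the 2-D grid of sub-regions, the use of edge adjacency as the notion of "near", and the function name are assumptions.

```python
def query_clusters(grid, low, high):
    """Group neighbouring grid cells whose target value lies in [low, high].

    grid[r][c] holds the target attribute value summarizing one sub-region.
    Cells that satisfy the query and share an edge are merged into one
    cluster. Returns a list of clusters, each a sorted list of (row, col).
    """
    rows, cols = len(grid), len(grid[0])
    hits = {
        (r, c)
        for r in range(rows)
        for c in range(cols)
        if low <= grid[r][c] <= high
    }
    clusters = []
    while hits:
        seed = min(hits)  # deterministic starting cell
        hits.remove(seed)
        stack, cluster = [seed], []
        while stack:
            r, c = stack.pop()
            cluster.append((r, c))
            # Visit the four edge-adjacent neighbours that also match.
            for neighbour in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if neighbour in hits:
                    hits.remove(neighbour)
                    stack.append(neighbour)
        clusters.append(sorted(cluster))
    return clusters


# Hypothetical 3x4 grid of sub-region values for the target attribute.
grid = [
    [5, 6, 1, 1],
    [5, 1, 1, 7],
    [1, 1, 1, 8],
]
print(query_clusters(grid, low=5, high=8))
```

Re-running the function with a different range mimics the interactive adjustment of the query: the set of matching sub-regions changes and the clusters are rebuilt from scratch.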
|