31

Information Extraction from data

Sottovia, Paolo 22 October 2019 (has links)
Data analysis is the process of inspecting, cleaning, extracting, and modeling data in order to derive useful information that supports users in their decisions. With the advent of Big Data, data analysis has become more complicated due to the volume and variety of the data. The process begins with the acquisition of the data and the selection of the portion that is useful for the desired analysis. With such an amount of data, even expert users are unable to inspect the data and judge whether a dataset is suitable for their purposes. In this dissertation, we focus on five problems in the broad data analysis process that help users find insights in data they do not know well. First, we analyze the data description problem, where the user is looking for a description of the input dataset. We introduce data descriptions: compact, readable, and insightful formulas of boolean predicates that represent a set of data records. Finding the best description for a dataset is computationally expensive and task-specific; we therefore introduce a set of metrics and heuristics for generating meaningful descriptions at interactive speed. Second, we look at the problem of order dependency discovery, another kind of metadata that may help the user understand the characteristics of a dataset. Our approach leverages the observation that discovering order dependencies can be guided by the discovery of a more specific form of dependency called order compatibility dependencies. Third, textual data encodes much hidden information, and there has been increasing interest in extracting structural information from it so that it can reach its full potential. In this regard, we propose a novel approach for extracting events based on temporal co-reference among entities. We consider an event to be a set of entities that collectively experience relationships between them in a specific period of time, and we developed a distributed strategy that scales to the largest online encyclopedia available, Wikipedia. Then, we deal with the evolving nature of data by focusing on the problem of finding synonymous attributes in evolving Wikipedia infoboxes. Over time, several attributes have been used to indicate the same characteristic of an entity, which raises several issues when analyzing content from different time periods. To solve this, we propose a clustering strategy that combines two contrasting distance metrics, and we developed an approximate solution that we assess over 13 years of Wikipedia history, demonstrating its flexibility and accuracy. Finally, we tackle the problem of identifying movements of attributes in evolving datasets. In an evolving environment, entities not only change their characteristics but sometimes exchange them over time; we propose a strategy that discovers those cases and test it on real datasets. We formally present the five problems, validate them both in terms of theoretical results and experimental evaluation, and demonstrate that the proposed approaches efficiently scale to large amounts of data.
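As a rough illustration of the data-description idea above (a small formula of boolean predicates covering a set of records), the sketch below greedily selects attribute-value predicates that cover the most remaining records. The function, record format, and stopping rule are illustrative assumptions, not the dissertation's actual metrics or heuristics.

```python
from collections import Counter

def greedy_description(records, max_predicates=3):
    """Greedily build a disjunction of (attribute == value) predicates
    that covers as many of the given records as possible."""
    uncovered = list(records)
    description = []
    for _ in range(max_predicates):
        if not uncovered:
            break
        # Count how many uncovered records each candidate predicate would cover.
        counts = Counter((attr, val) for rec in uncovered for attr, val in rec.items())
        (attr, val), _ = counts.most_common(1)[0]
        description.append((attr, val))
        uncovered = [r for r in uncovered if r.get(attr) != val]
    return description

rows = [
    {"city": "Trento", "lang": "it"},
    {"city": "Trento", "lang": "en"},
    {"city": "Povo", "lang": "it"},
]
print(greedy_description(rows))  # e.g. [('city', 'Trento'), ('city', 'Povo')]
```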
32

Motivating Introductory Computing Students with Pedagogical Datasets

Bart, Austin Cory 03 May 2017 (has links)
Computing courses struggle to retain introductory students, especially as learner demographics have expanded to include more diverse majors, backgrounds, and career interests. Motivational contexts for these courses must extend beyond short-term interest to empower students and connect to learners' long-term goals, while maintaining a scaffolded experience. To solve ongoing problems such as student retention, methods should be explored that can engage and motivate students. I propose Data Science as an introductory context that can appeal to a wide range of learners. To test this hypothesis, my work uses two educational theories — the MUSIC Model of Academic Motivation and Situated Learning Theory — to evaluate different components of a student's learning experience for their contribution to the student's motivation. I analyze existing contexts that are used in introductory computing courses, such as game design and media computation, and their limitations in regard to educational theories. I also review how Data Science has been used as a context, and its associated affordances and barriers. Next, I describe two research projects that make it simple to integrate Data Science into introductory classes. The first project, RealTimeWeb, was a prototypical exploration of how real-time web APIs could be scaffolded into introductory projects and problems. RealTimeWeb evolved into the CORGIS Project, an extensible framework populated by a diverse collection of freely available "Pedagogical Datasets" designed specifically for novices. These datasets are available in easy-to-use libraries for multiple languages, various file formats, and also through accessible web-based tools. While developing these datasets, I identified and systematized a number of design issues, opportunities, and concepts involved in the preparation of Pedagogical Datasets. With the completed technology, I staged a number of interventions to evaluate Data Science as an introductory context and to better understand the relationship between student motivation and course outcomes. I present findings that show evidence for the potential of a Data Science context to motivate learners. While I found evidence that the course content naturally has a stronger influence on course outcomes, the course context is a valuable component of the course's learning experience. / Ph. D.
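As a minimal sketch of the kind of novice-facing interface a pedagogical dataset library might expose (the dataset, loader name, and columns here are hypothetical and are not the actual CORGIS or RealTimeWeb API), the point is that students can loop over ready-made records without touching file handling or parsing:

```python
import csv
import io

# A tiny, pre-cleaned "pedagogical dataset" embedded as CSV for illustration;
# a real library would ship such data in multiple languages and formats.
WEATHER_CSV = """city,temperature
Blacksburg,18.5
Blacksburg,21.0
Roanoke,23.5
"""

def load_weather():
    """Return the dataset as a list of dicts, so novices can iterate over
    records without worrying about files, encodings, or parsing."""
    return list(csv.DictReader(io.StringIO(WEATHER_CSV)))

# A first-week exercise: average temperature per city.
cities = {}
for report in load_weather():
    cities.setdefault(report["city"], []).append(float(report["temperature"]))
for city, temps in cities.items():
    print(city, sum(temps) / len(temps))
```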
33

Optimising Machine Learning Models for Imbalanced Swedish Text Financial Datasets: A Study on Receipt Classification : Exploring Balancing Methods, Naive Bayes Algorithms, and Performance Tradeoffs

Hu, Li Ang, Ma, Long January 2023 (has links)
This thesis investigates imbalanced Swedish-text financial datasets, specifically receipt classification with machine learning models. The study explores the effectiveness of under-sampling and over-sampling methods for Naive Bayes algorithms, in a controlled experiment carried out in collaboration with Fortnox. Evaluation metrics compare the balancing methods in terms of accuracy, Matthews correlation coefficient (MCC), F1 score, precision, and recall. The findings contribute to Swedish text classification by providing insights into balancing methods. The report examines balancing methods and parameter tuning of machine learning models for imbalanced datasets. Multinomial Naive Bayes (MultiNB) algorithms for natural language processing (NLP) are studied, with a potential application in image classification for assessing the deformation of thin industrial components. Experiments show that balancing methods significantly affect MCC and recall, with a recall-MCC-accuracy tradeoff, and that smaller alpha values generally improve accuracy. Two balancing techniques are considered: the Synthetic Minority Oversampling Technique (SMOTE) and Tomek links, a link-removal algorithm developed in 1976 by Ivan Tomek. Applying Tomek first and then SMOTE (TomekSMOTE) yields promising accuracy improvements; due to time constraints, training with SMOTE first followed by Tomek link cleaning (SMOTETomek) is incomplete. The report finds that the best MCC on the imbalanced datasets is achieved when alpha is 0.01.
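A minimal sketch of this kind of pipeline using scikit-learn and imbalanced-learn (the toy texts, class sizes, and parameters are illustrative, not the Fortnox receipt data); imbalanced-learn also ships combined resamplers such as SMOTETomek in imblearn.combine.

```python
# pip install scikit-learn imbalanced-learn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import matthews_corrcoef, recall_score
from imblearn.over_sampling import SMOTE

# Toy imbalanced "receipt" texts: many office-supply receipts, few travel ones.
texts = ["kontorsmaterial papper pennor"] * 8 + ["tågbiljett stockholm resa"] * 3
labels = ["office"] * 8 + ["travel"] * 3

X = CountVectorizer().fit_transform(texts)

# Over-sample the minority class before fitting the Naive Bayes model.
X_res, y_res = SMOTE(k_neighbors=2, random_state=0).fit_resample(X, labels)
clf = MultinomialNB(alpha=0.01).fit(X_res, y_res)

pred = clf.predict(X)
print("MCC:", matthews_corrcoef(labels, pred))
print("Recall (travel):", recall_score(labels, pred, pos_label="travel"))
```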
34

On the discovery of relevant structures in dynamic and heterogeneous data

Preti, Giulia 22 October 2019 (has links)
We are witnessing an explosion of available data coming from a huge number of sources and domains, which is leading to the creation of ever larger and richer datasets. Understanding, processing, and extracting useful information from those datasets requires specialized algorithms that take into consideration both the dynamism and the heterogeneity of the data they contain. Although several pattern mining techniques have been proposed in the literature, most of them fall short of providing interesting structures when the data can be interpreted differently from user to user, when it can change from time to time, and when it has different representations. In this thesis, we propose novel approaches that go beyond the traditional pattern mining algorithms, and can effectively and efficiently discover relevant structures in dynamic and heterogeneous settings. In particular, we address the tasks of pattern mining in multi-weighted graphs, pattern mining in dynamic graphs, and pattern mining in heterogeneous temporal databases. In pattern mining in multi-weighted graphs, we consider the problem of mining patterns for a new category of graphs called multi-weighted graphs. In these graphs, nodes and edges can carry multiple weights that represent, for example, the preferences of different users or applications, and that are used to assess the relevance of the patterns. We introduce a novel family of scoring functions that assign a score to each pattern based on both the weights of its appearances and their number, and that respect the anti-monotone property, pivotal for efficient implementations. We then propose a centralized and a distributed algorithm that solve the problem both exactly and approximately. The approximate solution has better scalability in terms of the number of edge weighting functions, while achieving good accuracy in the results found. An extensive experimental study shows the advantages and disadvantages of our strategies, and proves their effectiveness. Then, in pattern mining in dynamic graphs, we focus on the particular task of discovering structures that are both well-connected and correlated over time, in graphs where nodes and edges can change over time. These structures represent edges that are topologically close and exhibit a similar behavior of appearance and disappearance in the snapshots of the graph. To this aim, we introduce two measures for computing the density of a subgraph whose edges change in time, and a measure to compute their correlation. The density measures are able to detect subgraphs that are silent in some periods of time but highly connected in others, and thus they can detect events or anomalies that happened in the network. The correlation measure can identify groups of edges that tend to co-appear, as well as edges that are characterized by similar levels of activity. For both variants of the density measure, we provide an effective solution that enumerates all the maximal subgraphs whose density and correlation exceed given minimum thresholds, but can also return a more compact subset of representative subgraphs that exhibit high levels of pairwise dissimilarity. Furthermore, we propose an approximate algorithm that scales well with the size of the network, while achieving high accuracy. We evaluate our framework with an extensive set of experiments on both real and synthetic datasets, and compare its performance with the main competitor algorithm.
The results confirm the correctness of the exact solution, the high accuracy of the approximate one, and the superiority of our framework over the existing solutions. In addition, they demonstrate the scalability of the framework and its applicability to networks of different natures. Finally, we address the problem of entity resolution in heterogeneous temporal databases, which are datasets that contain records giving different descriptions of the status of real-world entities at different periods of time, and thus are characterized by different sets of attributes that can change over time. Detecting records that refer to the same entity in such a scenario requires a record similarity measure that takes into account the temporal information and that is aware of the absence of a common fixed schema between the records. However, existing record matching approaches either ignore the dynamism in the attribute values of the records, or assume that all the records share the same set of attributes throughout time. In this thesis, we propose a novel time-aware schema-agnostic similarity measure for temporal records to find pairs of matching records, and integrate it into an exact and an approximate algorithm. The exact algorithm can find all the maximal groups of pairwise similar records in the database. The approximate algorithm, on the other hand, can achieve higher scalability with the size of the dataset and the number of attributes, by relying on a technique called meta-blocking. This algorithm can find a good-quality approximation of the actual groups of similar records, by adopting an effective and efficient clustering algorithm.
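To illustrate the notion of temporally correlated edges in a dynamic graph, the sketch below scores pairs of edges by the Jaccard similarity of their appearance vectors across snapshots; the measure and threshold are simplified placeholders, not the dissertation's density and correlation measures.

```python
import numpy as np

# Each row is an edge, each column a snapshot; 1 = the edge is present.
activity = np.array([
    [1, 1, 0, 1, 1],   # edge a
    [1, 1, 0, 1, 0],   # edge b
    [0, 0, 1, 0, 1],   # edge c
])

def jaccard(u, v):
    """Fraction of snapshots where both edges appear, over snapshots where either does."""
    inter = np.logical_and(u, v).sum()
    union = np.logical_or(u, v).sum()
    return inter / union if union else 0.0

threshold = 0.6
n = len(activity)
correlated = [(i, j) for i in range(n) for j in range(i + 1, n)
              if jaccard(activity[i], activity[j]) >= threshold]
print(correlated)  # edges a and b co-appear often: [(0, 1)]
```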
35

ESSAYS ON SCALABLE BAYESIAN NONPARAMETRIC AND SEMIPARAMETRIC MODELS

Chenzhong Wu (18275839) 29 March 2024 (has links)
In this thesis, we delve into the exploration of several nonparametric and semiparametric econometric models within the Bayesian framework, highlighting their applicability across a broad spectrum of microeconomic and macroeconomic issues. Positioned in the big data era, where data collection and storage expand at an unprecedented rate, the complexity of economic questions we aim to address is similarly escalating. This dual challenge necessitates leveraging increasingly large datasets, thereby underscoring the critical need for designing flexible Bayesian priors and developing scalable, efficient algorithms tailored for high-dimensional datasets.
The initial two chapters, Chapters 2 and 3, are dedicated to crafting Bayesian priors suited for environments laden with a vast array of variables. These priors, alongside their corresponding algorithms, are optimized for computational efficiency, scalability to extensive datasets, and, ideally, distributability. We aim for these priors to accommodate varying levels of dataset sparsity. Chapter 2 assesses nonparametric additive models, employing a smoothing prior alongside a band matrix for each additive component. Utilizing the Bayesian backfitting algorithm significantly alleviates the computational load. In Chapter 3, we address multiple linear regression settings by adopting a flexible scale mixture of normal priors for coefficient parameters, thus allowing data-driven determination of the necessary amount of shrinkage. The use of a conjugate prior enables a closed-form solution for the posterior, markedly enhancing computational speed.
The subsequent chapters, Chapters 4 and 5, pivot towards time series dataset modeling and Bayesian algorithms. A semiparametric modeling approach dissects the stochastic volatility in macro time series into persistent and transitory components, the latter additional component addressing outliers. Utilizing a Dirichlet process mixture prior for the transitory part and a collapsed Gibbs sampling algorithm, we devise a method capable of efficiently processing over 10,000 observations and 200 variables. Chapter 4 introduces a simple univariate model, while Chapter 5 presents comprehensive Bayesian VARs. Our algorithms, more efficient and effective in managing outliers than existing ones, are adept at handling extensive macro datasets with hundreds of variables.
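As a small illustration of the closed-form posterior that a conjugate normal prior gives in linear regression (the theme of Chapter 3), the sketch below assumes a known noise variance and an illustrative prior scale; it is a generic conjugate-prior example, not the thesis's scale-mixture prior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated regression data: y = X @ beta_true + noise.
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, 0.0, -2.0])
sigma2 = 0.5 ** 2
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Conjugate prior beta ~ N(0, tau2 * I) with known noise variance sigma2:
# the posterior is N(m, V) with V = (X'X / sigma2 + I / tau2)^{-1}, m = V X'y / sigma2.
tau2 = 10.0
V = np.linalg.inv(X.T @ X / sigma2 + np.eye(p) / tau2)
m = V @ X.T @ y / sigma2

print("posterior mean:", np.round(m, 3))               # close to beta_true
print("posterior sd:  ", np.round(np.sqrt(np.diag(V)), 3))
```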
36

Networks and multivariate statistics as applied to biological datasets and wine-related omics / Netwerke en meerveranderlike statistiek toegepas op biologiese datastelle en wyn-verwante omika

Jacobson, Daniel A. 12 1900 (has links)
Thesis (PhD)--Stellenbosch University, 2013. / ENGLISH ABSTRACT: Introduction: Wine production is a complex biotechnological process aiming at productively coordinating the interactions and outputs of several biological systems, including grapevine and many microorganisms such as wine yeast and wine bacteria. High-throughput data generating tools in the fields of genomics, transcriptomics, proteomics, metabolomics and microbiomics are being applied both locally and globally in order to better understand complex biological systems. As such, the datasets available for analysis and mining include de novo datasets created by collaborators as well as publicly available datasets which one can use to get further insight into the systems under study. In order to model the complexity inherent in and across these datasets it is necessary to develop methods and approaches based on network theory and multivariate data analysis as well as to explore the intersections between these two approaches to data modelling, mining and interpretation. Networks: The traditional reductionist paradigm of analysing single components of a biological system has not provided tools with which to adequately analyse data sets that are attempting to capture systems-level information. Network theory has recently emerged as a new discipline with which to model and analyse complex systems and has arisen from the study of real and often quite large networks derived empirically from the large volumes of data that have been collected from communications, internet, financial and biological systems. This is in stark contrast to previous theoretical approaches to understanding complex systems such as complexity theory, synergetics, chaos theory, self-organised criticality, and fractals, which were all sweeping theoretical constructs based on small toy models that proved unable to address the complexity of real world systems. Multivariate Data Analysis: Principal component analysis (PCA) and Partial Least Squares (PLS) regression are commonly used to reduce the dimensionality of a matrix (and amongst matrices in the case of PLS) in which there are a considerable number of potentially related variables. PCA and PLS are variance-focused approaches where components are ranked by the amount of variance they each explain. Components are, by definition, orthogonal to one another and as such, uncorrelated. Aims: This thesis explores the development of Computational Biology tools that are essential to fully exploit the large data sets that are being generated by systems-based approaches in order to gain a better understanding of wine-related organisms such as grapevine (and tobacco as a laboratory-based plant model), plant pathogens, microbes and their interactions. The broad aim of this thesis is therefore to develop computational methods that can be used in an integrated systems-based approach to model and describe different aspects of the wine making process from a biological perspective. To achieve this aim, computational methods have been developed and applied in the areas of transcriptomics, phylogenomics, chemiomics and microbiomics. Summary: The primary approaches taken in this thesis have been the use of networks and multivariate data analysis methods to analyse highly dimensional data sets. Furthermore, several of the approaches have started to explore the intersection between networks and multivariate data analysis.
This would seem to be a logical progression as both networks and multivariate data analysis are focused on matrix-based data modelling and therefore have many of their roots in linear algebra. / AFRIKAANSE OPSOMMING: Inleiding: Wynproduksie is 'n komplekse biotegnologiese proses wat mik op die produktiewe koördinering van verskeie interaksies en uitsette van verskeie biologiese sisteme. Hierdie sisteme sluit in die wingerd, wat van besondere belang is, asook die wyn gis en wyn bakterieë. Hoë-deurset data generasie word huidiglik beide globaal en plaaslik toegepas in die velde van genomika, transkriptomika, proteomika, metabolomika en mikrobiomika. As sulks is hierdie tipe datastelle beskikbaar vir ontleding, bemyning en verkening. Die datastelle kan de novo gegenereer word, met behulp van medewerkers, of dit kan vanuit die publieke databasisse gewerf word waar sulke datastelle dikwels beskikbaar gemaak word sodat verdere insig verkry kan word met betrekking tot die sisteem onder studie. Die hoë-deurset datastelle onder bespreking bevat 'n hoë mate van inherente kompleksiteit, beide ten opsigte van ditself asook tussen verskeie datastelle. Om ten einde hierdie datastelle en hul inherente kompleksiteit te modelleer is dit nodig om metodes en benaderings te ontwikkel wat gesetel is in netwerk teorie en meerveranderlike statistiek. Verdermeer is dit ook nodig om die kruisings tussen netwerk teorie en meerveranderlike statistiek te verken om sodoende die modellering, bemyning, verkening en interpretasie van data te verbeter. Netwerke: Die tradisionele reduksionistiese paradigma, waarby enkele komponente van 'n biologiese sisteem geontleed word, het tot dusver nie voldoende metodes en gereedskap gelewer waarmee datastelle, wat streef om sisteemvlak informasie te bekom, geontleed kan word nie. Netwerk teorie het na vore gekom as 'n nuwe dissipline wat toegepas kan word vir die model-skepping en ontleding van komplekse sisteme. Dit stem uit die studie van egte, dikwels groot netwerke wat empiries afgelei word uit die groot volumes data wat tans na vore kom vanuit kommunikasie-, internet-, finansiële- en biologiese sisteme. Dit is in skrille kontras met vorige teoretiese benaderings wat gestreef het om komplekse sisteme te verstaan met konsepte soos kompleksiteits teorie, synergetics, chaos teorie, self-georganiseerde kritikaliteit en fraktale. Al die bogenoemde is breë teoretiese konstrukte, gebaseer op relatief kleinskaal modelle, wat nie instaat was om oplossings vir die kompleksiteit van egte-wêreld sisteme te bied nie. Meerveranderlike Data-analise: Hoofkomponente-ontleding (PCA) en Partial Least Squares (PLS) regressie word dikwels gebruik om die dimensionaliteit van 'n matriks (en tussen matrikse in die geval van PLS) te verminder. Hierdie matrikse bevat dikwels 'n aansienlike groot hoeveelheid moontlik verwante veranderlikes. PCA en PLS is variansie gedrewe metodes en behels dat komponente gerang word deur die hoeveelheid variansie wat elke komponent verduidelik. Komponente is by definisie ortogonaal ten opsigte van mekaar en as sulks ongekorreleerd. Doelwitte: Hierdie tesis verken die ontwikkeling van verskeie Computational Biology metodes wat noodsaaklik is om ten volle die groot skaal datastelle te benut wat tans deur sisteem-gebaseerde benaderings gegenereer word. Die doel is om beter begrip en kennis van wyn verwante organismes te kry; hierdie organismes sluit in die wingerd (met tabak as laboratorium-gebaseerde plant model), plant patogene en mikrobes sowel as hulle interaksies.
Die breë mikpunt van hierdie tesis is dus om gerekenaardiseerde metodes te ontwikkel wat gebruik kan word in 'n geïntegreerde sisteem-gebaseerde benadering tot die modellering en beskrywing van verskillende aspekte van die wynmaak proses vanuit 'n biologiese standpunt. Om die mikpunt te bereik is gerekenaardiseerde metodes ontwikkel en toegepas in die velde van transkriptomika, filogenomika, chemiomika en mikrobiomika. Opsomming: Die primêre benadering geneem in hierdie tesis is die gebruik van netwerke en meerveranderlike data-ontleding metodes om hoë-dimensie datastelle te ontleed. Verdermeer, verskeie van die metodes begin om die gemeenskaplike grond tussen netwerke en meerveranderlike data-ontleding te verken. Dit blyk om 'n logiese progressie te wees, aangesien beide netwerke en meerveranderlike data-ontleding gefokus is op matriks-gebaseerde data modellering en dus gewortel is in liniêre algebra.
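A brief sketch of the variance-focused dimensionality reduction described above: PCA computed via the singular value decomposition of a centered toy "omics" matrix, with components ranked by the variance they explain. The simulated data and dimensions are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "omics" matrix: 50 samples x 200 correlated variables.
latent = rng.normal(size=(50, 2))
loadings = rng.normal(size=(2, 200))
X = latent @ loadings + 0.1 * rng.normal(size=(50, 200))

# PCA via SVD of the centered matrix: components are orthogonal and
# ranked by the amount of variance each explains.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s ** 2 / (s ** 2).sum()
scores = Xc @ Vt[:2].T            # project samples onto the first two components

print("variance explained by PC1, PC2:", np.round(explained[:2], 3))
print("scores shape:", scores.shape)   # (50, 2)
```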
37

Classification de bases de données déséquilibrées par des règles de décomposition / Handling imbalanced datasets by reconstruction rules in decomposition schemes

D'Ambrosio, Roberto 07 March 2014 (has links)
Le déséquilibre entre la distribution des a priori est rencontré dans un nombre très large de domaines. Les algorithmes d’apprentissage conventionnels sont moins efficaces dans la prévision d’échantillons appartenant aux classes minoritaires. Notre but est de développer une règle de reconstruction adaptée aux catégories de données biaisées. Nous proposons une nouvelle règle, la Reconstruction Rule par sélection, qui, dans le schéma ‘One-per-Class’, utilise la fiabilité, des étiquettes et des distributions a priori pour permettre de calculer une décision finale. Les tests démontrent que la performance du système s’améliore en utilisant cette règle plutôt que des règles classiques. Nous étudions également les règles dans l’ ‘Error Correcting Output Code’ (ECOC) décomposition. Inspiré par une règle de reconstitution de données statistiques conçue pour le ‘One-per-Class’ et ‘Pair-Wise Coupling’ des approches sur la décomposition, nous avons développé une règle qui s’applique à la régression ‘softmax’ sur la fiabilité afin d’évaluer la classification finale. Les résultats montrent que ce choix améliore les performances avec respect de la règle statistique existante et des règles de reconstructions classiques. Sur ce thème d’estimation fiable nous remarquons que peu de travaux ont porté sur l’efficacité de l’estimation postérieure dans le cadre de boosting. Suivant ce raisonnement, nous développons une estimation postérieure efficace en boosting Nearest Neighbors. Utilisant Universal Nearest Neighbours classification nous prouvons qu’il existe une sous-catégorie de fonctions, dont la minimisation apporte statistiquement de simples et efficaces estimateurs de Bayes postérieurs. / Disproportion among class priors is encountered in a large number of domains, making conventional learning algorithms less effective in predicting samples belonging to the minority classes. We aim at developing a reconstruction rule suited to multiclass skewed data. In performing this task we use the classification reliability, which conveys useful information on the goodness of classification acts. In the framework of the One-per-Class decomposition scheme we design a novel reconstruction rule, Reconstruction Rule by Selection, which uses classifier reliabilities, crisp labels and a-priori distributions to compute the final decision. Tests show that system performance improves using this rule rather than well-established reconstruction rules. We also investigate the rules in the Error Correcting Output Code (ECOC) decomposition framework. Inspired by a statistical reconstruction rule designed for the One-per-Class and Pair-Wise Coupling decomposition approaches, we have developed a rule that applies softmax regression on reliability outputs in order to estimate the final classification. Results show that this choice improves performance with respect to the existing statistical rule and to well-established reconstruction rules. On the topic of reliability estimation, we notice that little attention has been given to efficient posterior estimation in the boosting framework. For this reason, we develop an efficient posterior estimator by boosting Nearest Neighbours. Using the Universal Nearest Neighbours classifier, we prove that a sub-class of surrogate losses exists whose minimization yields simple and statistically efficient estimators of Bayes posteriors.
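A simplified, hypothetical illustration of a reliability-based reconstruction rule in a One-per-Class scheme (not the thesis's exact Reconstruction Rule by Selection): trust a unique positive vote, otherwise fall back to prior-weighted reliabilities.

```python
import numpy as np

def reconstruct(crisp, reliability, priors):
    """Simplified one-per-class reconstruction: if exactly one binary classifier
    votes positive, return that class; otherwise pick the class with the highest
    prior-weighted reliability. (Illustrative only.)"""
    crisp = np.asarray(crisp)
    positives = np.flatnonzero(crisp == 1)
    if len(positives) == 1:
        return int(positives[0])
    score = np.asarray(reliability) * np.asarray(priors)
    return int(np.argmax(score))

# Three one-vs-rest classifiers disagree: none votes positive, so the
# prior-weighted reliabilities decide.
print(reconstruct(crisp=[0, 0, 0],
                  reliability=[0.40, 0.55, 0.35],
                  priors=[0.70, 0.05, 0.25]))   # -> class 0 wins after prior weighting
```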
38

Fast and accurate estimation of large-scale phylogenetic alignments and trees

Liu, Kevin Jensen 06 July 2011 (has links)
Phylogenetics is the study of evolutionary relationships. Phylogenetic trees and alignments play important roles in a wide range of biological research, including reconstruction of the Tree of Life - the evolutionary history of all organisms on Earth - and the development of vaccines and antibiotics. Today's phylogenetic studies seek to reconstruct trees and alignments on a greater number and variety of organisms than ever before, primarily due to exponential growth in affordable sequencing and computing power. The importance of phylogenetic trees and alignments motivates the need for methods to reconstruct them accurately and efficiently on large-scale datasets. Traditionally, phylogenetic studies proceed in two phases: first, an alignment is produced from biomolecular sequences with differing lengths, and, second, a tree is produced using the alignment. My dissertation presents the first empirical performance study of leading two-phase methods on datasets with up to hundreds of thousands of sequences. Relatively accurate alignments and trees were obtained using methods with high computational requirements on datasets with a few hundred sequences, but as datasets grew past 1000 sequences and up to tens of thousands of sequences, the set of methods capable of analyzing a dataset diminished and only the methods with the lowest computational requirements and lowest accuracy remained. Alternatively, methods have been developed to simultaneously estimate phylogenetic alignments and trees. Methods optimizing the treelength optimization problem - the most widely-used approach for simultaneous estimation - have not been shown to return more accurate trees and alignments than two-phase approaches. I demonstrate that treelength optimization under a particular class of optimization criteria represents a promising means for inferring accurate trees and alignments. The other methods for simultaneous estimation are not known to support analyses of datasets with a few hundred sequences due to their high computational requirements. The main contribution of my dissertation is SATe, the first fast and accurate method for simultaneous estimation of alignments and trees on datasets with up to several thousand nucleotide sequences. SATe improves upon the alignment and topological accuracy of all existing methods, especially on the most difficult-to-align datasets, while retaining reasonable computational requirements. / text
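A toy illustration of the two-phase idea described above, assuming the alignment phase is already done: pairwise p-distances are computed from aligned sequences and a UPGMA-style tree is built with SciPy's average-linkage clustering. This is a simplification for illustration, not SATe or any of the surveyed methods.

```python
import numpy as np
from scipy.cluster.hierarchy import average
from scipy.spatial.distance import squareform

# A toy *aligned* set of sequences (phase 1 is assumed already done).
aligned = {
    "seq1": "ACGTACGT",
    "seq2": "ACGTACGA",
    "seq3": "ACGAACGA",
    "seq4": "TCGAACGA",
}
names = list(aligned)

def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

n = len(names)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = p_distance(aligned[names[i]], aligned[names[j]])

# Phase 2: a UPGMA-style (average-linkage) tree from the distance matrix.
tree = average(squareform(D))
print(tree)  # linkage matrix; leaves 0..3 correspond to seq1..seq4
```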
39

Point cloud classification for water surface identification in Lidar datasets

Sangireddy, Harish 07 July 2011 (has links)
Light Detection and Ranging (Lidar) is a remote sensing technique that provides high resolution range measurements between the laser scanner and Earth's topography. These range measurements are mapped as a 3D point cloud with high accuracy (< 0.1 meters). Depending on the geometry of the illuminated surfaces on Earth, one or more backscattered echoes are recorded for every pulse emitted by the laser scanner. Lidar has the advantage of being able to create elevation surfaces in 3D while also capturing the intensity of the returned pulse at each point, so it can be treated both as a spatial and as a spectral data system. The 3D elevation attributes of Lidar data are used in this study to identify possible water surface points quickly and efficiently. The approach incorporates Laplacian curvature computed via wavelets, where the wavelets are the first- and second-order derivatives of a Gaussian kernel. In computer science, a kd-tree is a space-partitioning data structure used for organizing points in a k-dimensional space. The 3D point cloud is segmented using a kd-tree, and following this segmentation, the neighborhood of each point is identified and Laplacian curvature is computed at each point record. A combination of positive curvature values and elevation measures is used to determine the threshold for identifying possible water surface points in the point cloud. The efficiency and accurate localization of the extracted water surface points are demonstrated using the Lidar data for Williamson County in Texas. Six different test sites are identified and the results are compared against high resolution imagery. The resulting point features map accurately onto streams and other water surfaces in the test sites. The combination of curvature and elevation filtering allowed the procedure to omit roads and bridges in the test sites and only identify points belonging to streams, small ponds and floodplains. This procedure shows the capability of Lidar data for water surface mapping, thus providing valuable datasets for a number of applications in geomorphology, hydrology and hydraulics. / text
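The sketch below mimics the overall pipeline with a crude finite-difference proxy for curvature rather than the Gaussian-derivative wavelets used in the thesis: a kd-tree gathers each point's neighbours, a Laplacian-like value is computed from them, and positive values combined with a low-elevation filter flag candidate water points. The synthetic cloud and thresholds are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(2)

# Toy point cloud: x, y and elevation z. A flat, low "pond" sits near the origin.
xy = rng.uniform(-50, 50, size=(2000, 2))
z = 0.05 * np.hypot(xy[:, 0], xy[:, 1]) + rng.normal(scale=0.02, size=2000)
z[np.hypot(xy[:, 0], xy[:, 1]) < 10] = 0.0          # water surface: flat and low

tree = cKDTree(xy)
_, idx = tree.query(xy, k=9)                         # each point plus its 8 nearest neighbours

# A crude Laplacian proxy: mean neighbour elevation minus the point's own elevation.
lap = z[idx[:, 1:]].mean(axis=1) - z

# Candidate water points: locally flat or concave-up (non-negative proxy) and low-lying.
water = (lap >= 0) & (z < np.percentile(z, 20))
print("candidate water points:", int(water.sum()))
```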
40

Developing and Evaluating Methods for Mitigating Sample Selection Bias in Machine Learning

Pelayo Ramirez, Lourdes Unknown Date
No description available.
