31. Information Extraction from data / Sottovia, Paolo, 22 October 2019
Data analysis is the process of inspecting, cleaning, extracting, and modeling data with the intention of extracting useful information that supports users in their decisions. With the advent of Big Data, data analysis has become more complicated due to the volume and variety of the data. The process begins with the acquisition of the data and the selection of the portion that is useful for the desired analysis. With such an amount of data, even expert users are unable to inspect it and understand whether a dataset is suitable for their purposes. In this dissertation, we focus on five problems in the broad data analysis process that help users find insights in data they do not know well. First, we analyze the data description problem, where the user is looking for a description of the input dataset. We introduce data descriptions: compact, readable, and insightful formulas of boolean predicates that represent a set of data records. Finding the best description for a dataset is computationally expensive and task-specific; we therefore introduce a set of metrics and heuristics for generating meaningful descriptions at interactive speed. Secondly, we look at the problem of order dependency discovery, which uncovers another kind of metadata that may help the user understand the characteristics of a dataset. Our approach leverages the observation that the discovery of order dependencies can be guided by the discovery of a more specific form of dependencies called order compatibility dependencies. Thirdly, textual data encodes much hidden information. To allow this data to reach its full potential, there has been increasing interest in extracting structured information from it. In this regard, we propose a novel approach for extracting events based on temporal co-reference among entities: we consider an event to be a set of entities that collectively experience relationships in a specific period of time. We developed a distributed strategy that scales to the largest online encyclopedia available, Wikipedia. Then, we deal with the evolving nature of data by focusing on the problem of finding synonymous attributes in evolving Wikipedia infoboxes. Over time, several attributes have been used to indicate the same characteristic of an entity, which raises several issues when analyzing content from different time periods. To solve this, we propose a clustering strategy that combines two contrasting distance metrics. We developed an approximate solution that we assess over 13 years of Wikipedia history, demonstrating its flexibility and accuracy. Finally, we tackle the problem of identifying movements of attributes in evolving datasets: in an evolving environment, entities not only change their characteristics but sometimes exchange them over time. We propose a strategy that discovers such cases and test it on real datasets. We formally present the five problems, validate them both with theoretical results and experimental evaluation, and demonstrate that the proposed approaches scale efficiently to large amounts of data.
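As an illustration of the data-description idea, the short Python sketch below scores a candidate description (a disjunction of attribute-value conjunctions) by how many target records it covers and how many other records it wrongly captures. The records, predicates, and scoring are illustrative stand-ins, not the dissertation's actual metrics or heuristics.

```python
# Toy table of records and a target subset we would like to describe.
records = [
    {"country": "IT", "category": "wine", "year": 2018},
    {"country": "IT", "category": "wine", "year": 2019},
    {"country": "FR", "category": "cheese", "year": 2019},
    {"country": "IT", "category": "oil", "year": 2018},
]
target = records[:2]

# A description: a list of conjunctions, each a dict of equality predicates.
description = [{"country": "IT", "category": "wine"}]

def matches(record, conjunction):
    return all(record.get(attr) == value for attr, value in conjunction.items())

def covers(record, description):
    return any(matches(record, conj) for conj in description)

covered = [r for r in target if covers(r, description)]
false_hits = [r for r in records if covers(r, description) and r not in target]
print(f"coverage: {len(covered)}/{len(target)}, false hits: {len(false_hits)}")
```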
32. Motivating Introductory Computing Students with Pedagogical Datasets / Bart, Austin Cory, 03 May 2017
Computing courses struggle to retain introductory students, especially as learner demographics have expanded to include more diverse majors, backgrounds, and career interests. Motivational contexts for these courses must extend beyond short-term interest to empower students and connect to learners' long-term goals, while maintaining a scaffolded experience. To address ongoing problems such as student retention, we should explore methods that engage and motivate students.
I propose Data Science as an introductory context that can appeal to a wide range of learners. To test this hypothesis, my work uses two educational theories — the MUSIC Model of Academic Motivation and Situated Learning Theory — to evaluate different components of a student's learning experience for their contribution to the student's motivation. I analyze existing contexts that are used in introductory computing courses, such as game design and media computation, and their limitations in regard to educational theories. I also review how Data Science has been used as a context, and its associated affordances and barriers.
Next, I describe two research projects that make it simple to integrate Data Science into introductory classes. The first project, RealTimeWeb, was a prototypical exploration of how real-time web APIs could be scaffolded into introductory projects and problems. RealTimeWeb evolved into the CORGIS Project, an extensible framework populated by a diverse collection of freely available "Pedagogical Datasets" designed specifically for novices. These datasets are available in easy-to-use libraries for multiple languages, various file formats, and also through accessible web-based tools. While developing these datasets, I identified and systematized a number of design issues, opportunities, and concepts involved in the preparation of Pedagogical Datasets.
With the completed technology, I staged a number of interventions to evaluate Data Science as an introductory context and to better understand the relationship between student motivation and course outcomes. I present findings that show evidence for the potential of a Data Science context to motivate learners. While I found evidence that the course content naturally has a stronger influence on course outcomes, the course context is a valuable component of the course's learning experience. / Ph. D. / Introductory computing courses struggle to keep students. This has become worse as students with more diverse majors take introductory courses. In prior research, introducing fun and interesting material into courses improved student engagement. This material provides a compelling context for the students, beyond the primary material. But instead of only relying on fun material, courses should also rely on material that is useful. This means connecting to students' long-term career goals and empowering learners. Crucial to this is not making the material too difficult for the diverse audience. To keep more students, we need to explore new methods of teaching computing.
I propose data science as a computing context that can appeal to a wide range of learners. This work tests this hypothesis using theories of academic motivation and learning theory. The components of a learning experience contribute to students’ motivation. I analyze how the components of other existing contexts can motivate students. These existing contexts include material like game design or media manipulation. I also analyze how good data science is as a context.
Next, I describe two projects that make it simple to use data science in introductory classes. The first project was RealTimeWeb. This system made it easy to use real-time web APIs in introductory problems. RealTimeWeb evolved into the CORGIS Project. This is a diverse collection of free “Pedagogical Datasets” designed for novices. These datasets are suitable for many kinds of introductory computing courses. While developing this collection, I identified many design issues involved in pedagogical datasets. I also made tools that made it easy to manage and update the data.
I used both projects in real introductory computing courses. First, I evaluated the projects' suitability for students. I also evaluated data science as a learning experience. Finally, I studied the relationship between student motivation and course outcomes. These outcomes include students' interest in learning more computing and their retention rate. I present evidence for the potential of a data science context to motivate learners. However, the primary material has a stronger relationship with course outcomes than the data science context. In other words, students are more interested in continuing computing if they like computing, not if they like data science. Still, the results show that data science is an effective learning experience.
33. Optimising Machine Learning Models for Imbalanced Swedish Text Financial Datasets: A Study on Receipt Classification: Exploring Balancing Methods, Naive Bayes Algorithms, and Performance Tradeoffs / Hu, Li Ang and Ma, Long, January 2023
This thesis investigates imbalanced Swedish-text financial datasets, specifically receipt classification with machine learning models. The study explores the effectiveness of under-sampling and over-sampling methods for Naive Bayes algorithms, in a controlled experiment carried out in collaboration with Fortnox. Evaluation metrics compare the balancing methods with respect to accuracy, Matthews correlation coefficient (MCC), F1 score, precision, and recall. The findings contribute to Swedish text classification by providing insights into balancing methods. The report examines balancing methods and parameter tuning for machine learning models on imbalanced datasets. Multinomial Naive Bayes (MultiNB) algorithms for natural language processing (NLP) are studied, with a potential application in image classification for assessing the deformation of thin industrial components. Experiments show that balancing methods significantly affect MCC and recall, with a recall-MCC-accuracy tradeoff. Smaller alpha values generally improve accuracy. Two resampling techniques are combined: the Synthetic Minority Oversampling Technique (SMOTE) and Tomek links, an algorithm for removing borderline links developed in 1976 by Ivan Tomek. Applying Tomek links first and then SMOTE (TomekSMOTE) yields promising accuracy improvements. Due to time constraints, training with over-sampling using SMOTE followed by cleaning with Tomek links (SMOTETomek) is incomplete. The best MCC on the imbalanced datasets is achieved when alpha is 0.01.
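A minimal sketch of this kind of pipeline is shown below, assuming scikit-learn and imbalanced-learn are available; the synthetic count matrix stands in for vectorised receipt text, and the alpha value and samplers are illustrative rather than the thesis's exact configuration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import matthews_corrcoef, recall_score, f1_score
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
from imblearn.combine import SMOTETomek

rng = np.random.default_rng(0)
# Synthetic, imbalanced bag-of-words counts standing in for vectorised receipts:
# 1,800 majority-class and 200 minority-class documents over 300 terms.
X = np.vstack([rng.poisson(1.0, size=(1800, 300)),
               rng.poisson(1.5, size=(200, 300))]).astype(float)
y = np.array([0] * 1800 + [1] * 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

samplers = {"none": None, "SMOTE": SMOTE(random_state=0),
            "Tomek": TomekLinks(), "SMOTETomek": SMOTETomek(random_state=0)}
for name, sampler in samplers.items():
    Xb, yb = (X_tr, y_tr) if sampler is None else sampler.fit_resample(X_tr, y_tr)
    clf = MultinomialNB(alpha=0.01).fit(Xb, yb)        # small alpha, as in the study
    pred = clf.predict(X_te)
    print(f"{name:10s} MCC={matthews_corrcoef(y_te, pred):.3f} "
          f"recall={recall_score(y_te, pred):.3f} F1={f1_score(y_te, pred):.3f}")
```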
34. On the discovery of relevant structures in dynamic and heterogeneous data / Preti, Giulia, 22 October 2019
We are witnessing an explosion of available data coming from a huge number of sources and domains, which is leading to the creation of ever larger and richer datasets. Understanding, processing, and extracting useful information from those datasets requires specialized algorithms that take into consideration both the dynamism and the heterogeneity of the data they contain. Although several pattern mining techniques have been proposed in the literature, most of them fall short in providing interesting structures when the data can be interpreted differently from user to user, when it can change from time to time, and when it has different representations. In this thesis, we propose novel approaches that go beyond traditional pattern mining algorithms and can effectively and efficiently discover relevant structures in dynamic and heterogeneous settings. In particular, we address pattern mining in multi-weighted graphs, pattern mining in dynamic graphs, and pattern mining in heterogeneous temporal databases.

For pattern mining in multi-weighted graphs, we consider the problem of mining patterns in a new category of graphs called multi-weighted graphs. In these graphs, nodes and edges can carry multiple weights that represent, for example, the preferences of different users or applications, and that are used to assess the relevance of the patterns. We introduce a novel family of scoring functions that assign a score to each pattern based on both the weights of its appearances and their number, and that respect the anti-monotone property, which is pivotal for efficient implementations. We then propose a centralized and a distributed algorithm that solve the problem both exactly and approximately; the approximate solution scales better with the number of edge-weighting functions while achieving good accuracy. An extensive experimental study shows the advantages and disadvantages of our strategies and proves their effectiveness.
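To make the role of the anti-monotone property concrete, the toy sketch below uses plain itemset support as a stand-in for the weighted pattern scores: once a pattern falls below the threshold it is never extended, which is exactly the pruning that anti-monotonicity licenses. The transactions, threshold, and score are all made up.

```python
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
items = sorted(set().union(*transactions))
threshold = 3                                   # minimum acceptable score

def score(pattern):
    # Stand-in score: plain support, i.e. in how many transactions the pattern appears.
    return sum(pattern <= t for t in transactions)

frequent, level = [], {frozenset([i]) for i in items}
while level:
    survivors = {p for p in level if score(p) >= threshold}
    frequent.extend(survivors)
    # Anti-monotone pruning: patterns below the threshold are never extended,
    # because with this score no super-pattern can score higher than they do.
    level = {p | {i} for p in survivors for i in items if i not in p}
print(sorted(sorted(p) for p in frequent))
```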
For pattern mining in dynamic graphs, we focus on the task of discovering structures that are both well connected and correlated over time, in graphs where nodes and edges can change over time. These structures represent edges that are topologically close and exhibit a similar behavior of appearance and disappearance across the snapshots of the graph. To this aim, we introduce two measures for computing the density of a subgraph whose edges change in time, and a measure to compute their correlation. The density measures are able to detect subgraphs that are silent in some periods of time but highly connected in others, and thus they can reveal events or anomalies that happened in the network. The correlation measure can identify groups of edges that tend to appear together, as well as edges that are characterized by similar levels of activity. For both variants of the density measure, we provide an effective solution that enumerates all the maximal subgraphs whose density and correlation exceed given minimum thresholds, but that can also return a more compact subset of representative subgraphs exhibiting high pairwise dissimilarity. Furthermore, we propose an approximate algorithm that scales well with the size of the network while achieving high accuracy. We evaluate our framework with an extensive set of experiments on both real and synthetic datasets, and compare its performance with the main competitor algorithm. The results confirm the correctness of the exact solution, the high accuracy of the approximate one, and the superiority of our framework over existing solutions. In addition, they demonstrate the scalability of the framework and its applicability to networks of different kinds.
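The sketch below shows one plausible instantiation of these two ingredients on a toy dynamic graph: per-snapshot density of a candidate node set, and the average pairwise Pearson correlation of the edges' appearance indicators. The concrete formulas are stand-ins, not the two density variants or the correlation measure defined in the thesis.

```python
from itertools import combinations
from statistics import mean, pstdev

snapshots = [
    {("a", "b"), ("b", "c")},
    {("a", "b"), ("b", "c"), ("a", "c")},
    set(),
    {("a", "b"), ("a", "c")},
]
candidate_nodes = {"a", "b", "c"}
candidate_edges = [("a", "b"), ("b", "c"), ("a", "c")]

def density(snapshot, nodes):
    # Fraction of possible edges among `nodes` that are present in this snapshot.
    possible = len(nodes) * (len(nodes) - 1) / 2
    present = sum(1 for e in snapshot if set(e) <= nodes)
    return present / possible

def pearson(x, y):
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    return mean((a - mean(x)) * (b - mean(y)) for a, b in zip(x, y)) / (sx * sy)

activity = {e: [1 if e in s else 0 for s in snapshots] for e in candidate_edges}
densities = [density(s, candidate_nodes) for s in snapshots]
correlation = mean(pearson(activity[e], activity[f])
                   for e, f in combinations(candidate_edges, 2))
print(f"avg density={mean(densities):.2f}, max density={max(densities):.2f}, "
      f"avg correlation={correlation:.2f}")
```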
Finally, we address the problem of entity resolution in heterogeneous temporal databases, which contain records that describe the status of real-world entities at different periods of time and are therefore characterized by different sets of attributes that can change over time. Detecting records that refer to the same entity in such a scenario requires a record similarity measure that takes the temporal information into account and that is aware of the absence of a common fixed schema among the records. However, existing record matching approaches either ignore the dynamism in the attribute values of the records or assume that all the records share the same set of attributes throughout time. In this thesis, we propose a novel time-aware and schema-agnostic similarity measure for temporal records to find pairs of matching records, and we integrate it into an exact and an approximate algorithm. The exact algorithm can find all the maximal groups of pairwise-similar records in the database. The approximate algorithm, on the other hand, achieves higher scalability with the size of the dataset and the number of attributes by relying on a technique called meta-blocking, and finds a good-quality approximation of the actual groups of similar records by adopting an effective and efficient clustering algorithm.
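As a rough illustration of what "time-aware and schema-agnostic" can mean, the sketch below compares two records through their value tokens only (so renamed attributes still contribute) and discounts the overlap by the gap between their timestamps; the tokenization, Jaccard overlap, and exponential decay are illustrative choices, not the measure proposed in the thesis.

```python
def value_tokens(record):
    # Schema-agnostic view: ignore attribute names, keep the words in the values.
    words = set()
    for key, value in record.items():
        if key != "year":
            words.update(str(value).lower().split())
    return words

def similarity(r1, r2, half_life=5.0):
    t1, t2 = value_tokens(r1), value_tokens(r2)
    overlap = len(t1 & t2) / len(t1 | t2) if t1 | t2 else 0.0   # Jaccard on value tokens
    gap = abs(r1["year"] - r2["year"])
    return overlap * 0.5 ** (gap / half_life)                   # time-aware discount

# Same entity described with different attribute sets at different times.
r_2005 = {"year": 2005, "name": "ACME Corp", "ceo": "J. Doe", "hq": "Boston"}
r_2012 = {"year": 2012, "company": "ACME Corp", "chief executive": "J. Doe", "employees": "1200"}
print(f"similarity = {similarity(r_2005, r_2012):.3f}")
```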
35. ESSAYS ON SCALABLE BAYESIAN NONPARAMETRIC AND SEMIPARAMETRIC MODELS / Chenzhong Wu (18275839), 29 March 2024
In this thesis, we delve into the exploration of several nonparametric and semiparametric econometric models within the Bayesian framework, highlighting their applicability across a broad spectrum of microeconomic and macroeconomic issues. Positioned in the big data era, where data collection and storage expand at an unprecedented rate, the complexity of the economic questions we aim to address is similarly escalating. This dual challenge necessitates leveraging increasingly large datasets, thereby underscoring the critical need for designing flexible Bayesian priors and developing scalable, efficient algorithms tailored for high-dimensional datasets.

The initial two chapters, Chapters 2 and 3, are dedicated to crafting Bayesian priors suited for environments laden with a vast array of variables. These priors, alongside their corresponding algorithms, are optimized for computational efficiency, scalability to extensive datasets, and, ideally, distributability. We aim for these priors to accommodate varying levels of dataset sparsity. Chapter 2 assesses nonparametric additive models, employing a smoothing prior alongside a band matrix for each additive component; utilizing the Bayesian backfitting algorithm significantly alleviates the computational load. In Chapter 3, we address multiple linear regression settings by adopting a flexible scale mixture of normal priors for the coefficient parameters, thus allowing data-driven determination of the necessary amount of shrinkage. The use of a conjugate prior enables a closed-form solution for the posterior, markedly enhancing computational speed.

The subsequent chapters, Chapters 4 and 5, pivot towards time series dataset modeling and Bayesian algorithms. A semiparametric modeling approach dissects the stochastic volatility in macro time series into persistent and transitory components, the latter, additional component addressing outliers. Utilizing a Dirichlet process mixture prior for the transitory part and a collapsed Gibbs sampling algorithm, we devise a method capable of efficiently processing over 10,000 observations and 200 variables. Chapter 4 introduces a simple univariate model, while Chapter 5 presents comprehensive Bayesian VARs. Our algorithms, more efficient and effective in managing outliers than existing ones, are adept at handling extensive macro datasets with hundreds of variables.
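As a pointer to why conjugacy matters computationally, the sketch below computes the closed-form Gaussian posterior for regression coefficients under a conjugate normal prior with known noise variance. The prior settings, noise level, and data are illustrative; the chapter's actual prior is a scale mixture of normals with data-driven shrinkage.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, -1.5, 0.0, 0.5])
sigma2 = 1.0                                          # assumed known noise variance
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

m0 = np.zeros(p)                                      # prior mean
V0 = 10.0 * np.eye(p)                                 # prior covariance (weak shrinkage)

# Conjugate Gaussian prior => Gaussian posterior with closed-form moments.
V0_inv = np.linalg.inv(V0)
Vn = np.linalg.inv(V0_inv + X.T @ X / sigma2)         # posterior covariance
mn = Vn @ (V0_inv @ m0 + X.T @ y / sigma2)            # posterior mean
print("posterior mean:", np.round(mn, 2))
print("posterior sd:  ", np.round(np.sqrt(np.diag(Vn)), 2))
```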
36. Networks and multivariate statistics as applied to biological datasets and wine-related omics / Netwerke en meerveranderlike statistiek toegepas op biologiese datastelle en wyn-verwante omika / Jacobson, Daniel A., 12 1900
Thesis (PhD)--Stellenbosch University, 2013.

ENGLISH ABSTRACT: Introduction: Wine production is a complex biotechnological process aiming at productively coordinating the interactions and outputs of several biological systems, including grapevine and many microorganisms such as wine yeast and wine bacteria. High-throughput data-generating tools in the fields of genomics, transcriptomics, proteomics, metabolomics and microbiomics are being applied both locally and globally in order to better understand complex biological systems. As such, the datasets available for analysis and mining include de novo datasets created by collaborators as well as publicly available datasets which one can use to get further insight into the systems under study. In order to model the complexity inherent in and across these datasets it is necessary to develop methods and approaches based on network theory and multivariate data analysis, as well as to explore the intersections between these two approaches to data modelling, mining and interpretation.

Networks: The traditional reductionist paradigm of analysing single components of a biological system has not provided tools with which to adequately analyse data sets that attempt to capture systems-level information. Network theory has recently emerged as a new discipline with which to model and analyse complex systems, arising from the study of real and often quite large networks derived empirically from the large volumes of data that have been collected from communications, internet, financial and biological systems. This is in stark contrast to previous theoretical approaches to understanding complex systems such as complexity theory, synergetics, chaos theory, self-organised criticality, and fractals, which were all sweeping theoretical constructs based on small toy models that proved unable to address the complexity of real-world systems.

Multivariate Data Analysis: Principal component analysis (PCA) and Partial Least Squares (PLS) regression are commonly used to reduce the dimensionality of a matrix (and amongst matrices in the case of PLS) in which there are a considerable number of potentially related variables. PCA and PLS are variance-focused approaches in which components are ranked by the amount of variance they each explain. Components are, by definition, orthogonal to one another and, as such, uncorrelated.

Aims: This thesis explores the development of Computational Biology tools that are essential to fully exploit the large data sets that are being generated by systems-based approaches, in order to gain a better understanding of wine-related organisms such as grapevine (and tobacco as a laboratory-based plant model), plant pathogens, microbes and their interactions. The broad aim of this thesis is therefore to develop computational methods that can be used in an integrated systems-based approach to model and describe different aspects of the wine-making process from a biological perspective. To achieve this aim, computational methods have been developed and applied in the areas of transcriptomics, phylogenomics, chemiomics and microbiomics.

Summary: The primary approaches taken in this thesis have been the use of networks and multivariate data analysis methods to analyse highly dimensional data sets. Furthermore, several of the approaches have started to explore the intersection between networks and multivariate data analysis. This would seem to be a logical progression, as both networks and multivariate data analysis are focused on matrix-based data modelling and therefore have many of their roots in linear algebra.
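The PCA properties described above (variance-ranked, mutually orthogonal, uncorrelated components) can be checked in a few lines of linear algebra; the random matrix below merely stands in for a samples-by-variables omics table.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8)) @ rng.normal(size=(8, 8))   # correlated variables
Xc = X - X.mean(axis=0)                                  # centre each variable

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)                          # variance explained, ranked
scores = U * s                                           # sample coordinates on the components

print("variance explained:", np.round(explained, 3))
print("scores uncorrelated:", np.allclose(np.corrcoef(scores[:, :3], rowvar=False),
                                          np.eye(3), atol=1e-8))
```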
37. Classification de bases de données déséquilibrées par des règles de décomposition / Handling imbalanced datasets by reconstruction rules in decomposition schemes / D'Ambrosio, Roberto, 07 March 2014
Disproportion among class priors is encountered in a large number of domains, making conventional learning algorithms less effective in predicting samples that belong to the minority classes. We aim at developing a reconstruction rule suited to multiclass skewed data. In performing this task we use the classification reliability, which conveys useful information on the goodness of classification acts. In the framework of the One-per-Class decomposition scheme, we design a novel reconstruction rule, Reconstruction Rule by Selection, which uses classifier reliabilities, crisp labels and a-priori distributions to compute the final decision. Tests show that system performance improves using this rule rather than well-established reconstruction rules. We also investigate rules in the Error Correcting Output Code (ECOC) decomposition framework. Inspired by a statistical reconstruction rule designed for the One-per-Class and Pair-Wise Coupling decomposition approaches, we have developed a rule that applies softmax regression to the reliability outputs in order to estimate the final classification. Results show that this choice improves performance with respect to the existing statistical rule and to well-established reconstruction rules. On the topic of reliability estimation, we notice that little attention has been given to efficient posterior estimation in the boosting framework. For this reason, we develop an efficient posterior estimator by boosting Nearest Neighbors. Using the Universal Nearest Neighbours classifier, we prove that a sub-class of surrogate losses exists whose minimization brings simple and statistically efficient estimators of Bayes posteriors.
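The sketch below illustrates the one-per-class setting this work operates in: each binary classifier emits a reliability for its own class, and a reconstruction rule turns those scores into a single decision, here via a softmax that can optionally correct for skewed priors. This is a deliberately simple stand-in; how reliabilities, crisp labels and priors are actually combined is the subject of the thesis's Reconstruction Rule by Selection.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Reliability outputs of three one-vs-rest classifiers for a single sample,
# and a skewed a-priori class distribution (class 1 is rare).
reliabilities = np.array([1.0, 0.8, 0.3])
priors = np.array([0.70, 0.05, 0.25])

plain = softmax(reliabilities)
prior_corrected = softmax(reliabilities - np.log(priors))   # discount abundant classes

print("plain rule picks class          ", int(np.argmax(plain)))
print("prior-corrected rule picks class", int(np.argmax(prior_corrected)))
```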
38. Fast and accurate estimation of large-scale phylogenetic alignments and trees / Liu, Kevin Jensen, 06 July 2011
Phylogenetics is the study of evolutionary relationships. Phylogenetic trees and alignments play important roles in a wide range of biological research, including reconstruction of the Tree of Life - the evolutionary history of all organisms on Earth - and the development of vaccines and antibiotics. Today's phylogenetic studies seek to reconstruct trees and alignments on a greater number and variety of organisms than ever before, primarily due to exponential growth in affordable sequencing and computing power. The importance of phylogenetic trees and alignments motivates the need for methods to reconstruct them accurately and efficiently on large-scale datasets.

Traditionally, phylogenetic studies proceed in two phases: first, an alignment is produced from biomolecular sequences with differing lengths, and, second, a tree is produced using the alignment. My dissertation presents the first empirical performance study of leading two-phase methods on datasets with up to hundreds of thousands of sequences. Relatively accurate alignments and trees were obtained using methods with high computational requirements on datasets with a few hundred sequences, but as datasets grew past 1000 sequences and up to tens of thousands of sequences, the set of methods capable of analyzing a dataset diminished, and only the methods with the lowest computational requirements and lowest accuracy remained.

Alternatively, methods have been developed to simultaneously estimate phylogenetic alignments and trees. Methods that optimize treelength - the most widely used approach for simultaneous estimation - have not been shown to return more accurate trees and alignments than two-phase approaches. I demonstrate that treelength optimization under a particular class of optimization criteria represents a promising means for inferring accurate trees and alignments. The other methods for simultaneous estimation are not known to support analyses of datasets with a few hundred sequences due to their high computational requirements.

The main contribution of my dissertation is SATe, the first fast and accurate method for simultaneous estimation of alignments and trees on datasets with up to several thousand nucleotide sequences. SATe improves upon the alignment and topological accuracy of all existing methods, especially on the most difficult-to-align datasets, while retaining reasonable computational requirements.
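As a toy illustration of the treelength criterion mentioned above, the sketch below scores a fixed tree whose nodes (including made-up ancestral nodes) carry aligned sequences, summing a per-edge count of differing positions. Real treelength criteria use edit distances with (affine) gap penalties and must infer the ancestral sequences, so this only shows the shape of the objective.

```python
# All sequences and the tree below are invented for illustration.
tree_edges = [("root", "anc1"), ("root", "C"), ("anc1", "A"), ("anc1", "B")]
sequences = {
    "root": "ACGT-A",
    "anc1": "ACGTTA",
    "A":    "ACGTTA",
    "B":    "ACCTTA",
    "C":    "ACGT-G",
}

def edge_cost(s1, s2):
    # Simplified per-edge distance: count aligned positions that differ.
    return sum(a != b for a, b in zip(s1, s2))

treelength = sum(edge_cost(sequences[u], sequences[v]) for u, v in tree_edges)
print("treelength =", treelength)  # smaller is better under this criterion
```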
39. Point cloud classification for water surface identification in Lidar datasets / Sangireddy, Harish, 07 July 2011
Light Detection and Ranging (Lidar) is a remote sensing technique that provides high-resolution range measurements between the laser scanner and Earth's topography. These range measurements are mapped as a 3D point cloud with high accuracy (< 0.1 meters). Depending on the geometry of the illuminated surfaces on Earth, one or more backscattered echoes are recorded for every pulse emitted by the laser scanner. Lidar has the advantage of being able to create elevation surfaces in 3D while also recording the intensity of the returned pulse at each point, so it can be treated as both a spatial and a spectral data system. The 3D elevation attributes of Lidar data are used in this study to identify possible water-surface points quickly and efficiently. The approach incorporates Laplacian curvature computed via wavelets, where the wavelets are the first- and second-order derivatives of a Gaussian kernel. In computer science, a kd-tree is a space-partitioning data structure used for organizing points in a k-dimensional space. The 3D point cloud is segmented by using a kd-tree, and following this segmentation the neighborhood of each point is identified and the Laplacian curvature is computed at each point record. A combination of positive curvature values and elevation measures is used to determine the threshold for identifying possible water-surface points in the point cloud. The efficiency and accurate localization of the extracted water-surface points are demonstrated using the Lidar data for Williamson County in Texas. Six different test sites are identified and the results are compared against high-resolution imagery. The resulting point features map accurately onto streams and other water surfaces in the test sites. The combination of curvature and elevation filtering allowed the procedure to omit roads and bridges in the test sites and to identify only points that belong to streams, small ponds and floodplains. This procedure shows the capability of Lidar data for water-surface mapping, thus providing valuable datasets for a number of applications in geomorphology, hydrology and hydraulics.
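A minimal sketch of this kind of pipeline is given below: index the cloud with a kd-tree, estimate a flatness/curvature proxy from each point's neighbourhood, and keep low-elevation, low-curvature points as water candidates. The local-PCA "surface variation" proxy, the synthetic cloud, and the thresholds are stand-ins for the wavelet-based Laplacian curvature and the tuning described in the thesis.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
# Synthetic cloud: a flat "water" patch at low elevation plus rough terrain.
water = np.column_stack([rng.uniform(0, 50, 500), rng.uniform(0, 50, 500),
                         rng.normal(1.0, 0.02, 500)])
terrain = np.column_stack([rng.uniform(0, 50, 500), rng.uniform(0, 50, 500),
                           rng.normal(8.0, 1.5, 500)])
points = np.vstack([water, terrain])

tree = cKDTree(points)
_, neighbours = tree.query(points, k=16)          # 15 nearest neighbours + the point itself

def surface_variation(pts):
    # Smallest eigenvalue share of the local covariance: near zero on flat patches.
    eigvals = np.linalg.eigvalsh(np.cov(pts.T))
    return eigvals[0] / eigvals.sum()

curvature = np.array([surface_variation(points[idx]) for idx in neighbours])
elevation = points[:, 2]
is_water = (curvature < 0.005) & (elevation < np.percentile(elevation, 40))
print(f"candidate water points: {is_water.sum()} of {len(points)}")
```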
40. Developing and Evaluating Methods for Mitigating Sample Selection Bias in Machine Learning / Pelayo Ramirez, Lourdes, Unknown Date
No description available.