31.
Computational Methods for Large Spatio-temporal Datasets and Functional Data Ranking. Huang, Huang. 16 July 2017.
This thesis focuses on two topics, computational methods for large spatial datasets and functional data ranking, both of which tackle the challenges of big and high-dimensional data.
The first topic is motivated by the prohibitive computational burden of fitting Gaussian process models to large, irregularly spaced spatial datasets. Various approximation methods have been introduced to reduce the computational cost, but many rely on unrealistic assumptions about the process, and retaining statistical efficiency remains an issue. We propose a new scheme to approximate the maximum likelihood estimator and the kriging predictor when exact computation is infeasible. The proposed method provides different types of hierarchical low-rank approximations that are both computationally and statistically efficient. We study the quality of the approximation theoretically and investigate its performance by simulation. For real applications, we analyze a soil moisture dataset of two million measurements with the hierarchical low-rank approximation and apply the proposed fast kriging to fill gaps in satellite images.
The second topic is motivated by rank-based outlier detection methods for functional data. Compared to magnitude outliers, shape outliers are more challenging to detect because they are often masked among the samples. We develop a new notion of functional data depth by integrating a univariate depth function. Because it takes the form of an integrated depth, it shares many desirable features of that class. Furthermore, the novel formulation leads to a useful decomposition for detecting both shape and magnitude outliers. Our simulation studies show that the proposed outlier detection procedure outperforms competitors under various outlier models. We also illustrate our methodology using real datasets of curves, images, and video frames. Finally, we bring the functional data ranking technique to spatio-temporal statistics for visualizing and assessing covariance properties, such as separability and full symmetry. We formulate test functions as functions of temporal lag for each pair of spatial locations and develop a rank-based testing procedure, induced by functional data depth, for assessing these properties. The method is illustrated using simulated data from widely used spatio-temporal covariance models, as well as real datasets from weather stations and climate model outputs.
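A minimal sketch of the integrated-depth construction may help: a univariate depth is computed pointwise along the curves and then averaged over the domain. This is in the spirit of the classical Fraiman-Muniz depth, not the thesis's exact definition; the univariate depth used here is D(x) = 1 - |1/2 - F(x)|, with F the pointwise empirical CDF, and all names are illustrative.

```python
import numpy as np

def integrated_depth(curves, grid):
    """Integrated functional depth: average a univariate depth
    over the domain (Fraiman-Muniz style illustration).

    curves : (n_curves, n_points) array of functional observations
    grid   : (n_points,) common evaluation grid (uniform)
    """
    n, m = curves.shape
    # Pointwise ranks of each curve value within the sample
    ranks = curves.argsort(axis=0).argsort(axis=0)  # 0..n-1 per column
    ecdf = (ranks + 1) / n
    # Univariate depth at each grid point: deepest at the median
    pointwise = 1.0 - np.abs(0.5 - ecdf)
    # Average over the domain (equals the normalized integral
    # on a uniform grid)
    return pointwise.mean(axis=1)

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 101)
sample = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal((20, t.size))
sample[0] += 1.5  # a magnitude outlier
depths = integrated_depth(sample, t)
print(depths.argmin())  # the outlier receives the lowest depth -> 0
```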
32.
Learning from small data set for object recognition in mobile platforms. Liu, Siyuan. 05 1900.
Have you ever stood at a door with a bunch of keys, trying to find the right one to unlock it? Have you ever held a flower and wondered what it is called? The need for object recognition can arise anytime and anywhere in our daily lives. With the development of mobile devices, object recognition applications that provide immediate assistance become possible. However, performing such complex tasks on even the most advanced mobile platforms still faces great challenges due to limited computing resources and computing power.
In this thesis, we present an object recognition system that resides and executes within a mobile device and can efficiently extract image features and perform learning and classification. To account for the computing constraints, a novel feature extraction method that minimizes the data size and maintains data consistency is proposed. The system leverages principal component analysis and is able to update the trained classifier when new examples become available. It relieves users from creating a large number of examples, making it user-friendly.
The experimental results demonstrate that a learning method trained with a very small number of examples can achieve recognition accuracy above 90% under various acquisition conditions. In addition, the system is able to perform learning efficiently.
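As a hedged illustration of this kind of pipeline, compact PCA features plus a classifier that can absorb new user-supplied examples incrementally, the sketch below uses scikit-learn on random stand-in data; the feature dimension, model choices, and update rule are assumptions, not the thesis's implementation.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(1)
# Stand-in for flattened image features: 40 samples, 3 classes
X = rng.standard_normal((40, 256))
y = rng.integers(0, 3, 40)

# Compress features to keep on-device memory small
ipca = IncrementalPCA(n_components=16, batch_size=20)
Z = ipca.fit_transform(X)

# A linear classifier that supports incremental updates
clf = SGDClassifier(loss="log_loss", random_state=0)
clf.fit(Z, y)

# When a user supplies a new labeled example, update in place
x_new = rng.standard_normal((1, 256))
clf.partial_fit(ipca.transform(x_new), [2])
print(clf.predict(ipca.transform(x_new)))
```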
33.
Anotace NetFlow dat z pohledu bezpečnosti / Annotation of NetFlow Data from Perspective of Network Security. Kadletz, Lukáš. January 2016.
This thesis describes the design and implementation of an application for offline NetFlow data annotation from the perspective of network security. The thesis explains the NetFlow architecture in detail, along with methods for detecting security incidents in captured data. The application design is based on an analysis of manual annotation and is supported by several UML diagrams. The Nemea system is used for detecting security events, and the Warden system serves as a source of information about reported security incidents on the network. The application uses technologies such as PHP 5, the Nette framework, the jQuery library, and the Bootstrap framework. The CESNET association provided NetFlow data for testing the application. The results of this thesis can be used for the analysis and annotation of NetFlow data, and the resulting data set can be used to verify the proper functionality of detection tools.
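The core annotation step, matching captured flows against reported incidents by endpoint address and time overlap, can be sketched generically as follows; the record fields and the matching rule are illustrative assumptions and do not reflect the actual Nemea or Warden interfaces.

```python
from dataclasses import dataclass

@dataclass
class Flow:
    src_ip: str
    dst_ip: str
    start: float  # UNIX timestamps
    end: float

@dataclass
class Incident:
    ip: str
    start: float
    end: float
    label: str  # e.g. "portscan", "bruteforce"

def annotate(flows, incidents):
    """Attach incident labels to flows whose endpoints and time
    range overlap a reported incident (illustrative rule only)."""
    annotated = []
    for f in flows:
        labels = [i.label for i in incidents
                  if i.ip in (f.src_ip, f.dst_ip)
                  and f.start <= i.end and i.start <= f.end]
        annotated.append((f, labels or ["benign"]))
    return annotated

flows = [Flow("10.0.0.5", "192.0.2.1", 100, 160)]
incidents = [Incident("10.0.0.5", 90, 150, "portscan")]
print(annotate(flows, incidents))
```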
34.
Representación geoespacial como medio para mejorar visibilidad de las tesis: caso de la Universidad Peruana de Ciencias Aplicadas (UPC). Huaroto, Libio. 23 October 2018.
VIII Conferencia Internacional BIREDIAL – ISTEC 2018, 22-25 October 2018, organized by the Pontificia Universidad Católica del Perú, Lima, Peru.

Since late 2017, the Universidad Peruana de Ciencias Aplicadas (UPC) has pursued several initiatives to build a geospatial services infrastructure for its theses and other intellectual output, with the following aims: improving accessibility; implementing thematic maps; identifying new forms of dissemination within the framework of academic repositories; and promoting the exchange of spatial data and information services locally and internationally through standards: ISO 19115, the Content Standard for Digital Geospatial Metadata (CSDGM), and the Open Geospatial Consortium (OGC) standards.

This initiative aligns with various actions by the Peruvian State to strengthen a national spatial data infrastructure. As part of that effort, Ministerial Resolution No. 126-2003-PCM established the Coordinating Committee for the Spatial Data Infrastructure of Peru (CC-IDEP), and Supreme Decree 133-2013-PCM made the access to and exchange of spatial information between public administration entities mandatory and promoted the creation of institutional spatial data infrastructures.

In this context, UPC's initiatives have focused on:
1. Generating thematic maps in Psychology and Architecture from the UPC Academic Repository.
2. Amending the thesis regulations to add geospatial information (geographic location and/or UTM coordinates) to the metadata of the UPC Academic Repository.
3. Carrying out a qualitative analysis of Psychology theses through thematic maps drawn from three Peruvian institutional repositories.
4. Using geographic information system (GIS) software to generate thematic map reports.
5. Developing Search Engine Optimization (SEO) strategies to improve the visibility of geospatial information.

Planned next steps are aimed at:
1. Generating thematic maps of theses across all academic programs from the Academic Repository.
2. Evaluating the visibility of the UPC Academic Repository based on the generated thematic maps.
3. Establishing recommendations for offering geospatial services in libraries.
4. Establishing agreements with organizations specialized in geospatial information at the local and international levels.
35.
PARALLEL 3D IMAGE SEGMENTATION BY GPU-AMENABLE LEVEL SET SOLUTION. Hagan, Aaron M. 17 June 2009.
No description available.
36.
Proximity curves for potential-based clustering. Csenki, Attila; Neagu, Daniel; Torgunov, Denis; Micic, Natasha. 11 January 2020.
The concept of proximity curve and a new algorithm are proposed for obtaining clusters in a finite set of data points in the finite dimensional Euclidean space. Each point is endowed with a potential constructed by means of a multi-dimensional Cauchy density, contributing to an overall anisotropic potential function. Guided by the steepest descent algorithm, the data points are successively visited and removed one by one, and at each stage the overall potential is updated and the magnitude of its local gradient is calculated. The result is a finite sequence of tuples, the proximity curve, whose pattern is analysed to give rise to a deterministic clustering. The finite set of all such proximity curves, in conjunction with a simulation study of their distribution, results in a probabilistic clustering represented by a distribution on the set of dendrograms. A two-dimensional synthetic data set is used to illustrate the proposed potential-based clustering idea. The results achieved are shown to be plausible, since both the 'geographic distribution' of the data points and the 'topographic features' imposed by the potential function are well reflected in the suggested clustering. Experiments using the Iris data set are conducted for validation purposes on classification and clustering benchmark data. The results are consistent with the proposed theoretical framework and data properties, and open new approaches and applications for considering data processing from different perspectives and interpreting the contribution of data attributes to patterns.
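A toy sketch of the proximity-curve idea follows: each point contributes a Cauchy-type potential, points are visited and removed one at a time (here simply choosing the remaining point with the smallest gradient magnitude, a stand-in for the paper's steepest-descent visiting scheme), and the recorded magnitudes form the curve. The kernel width, removal rule, and all names are illustrative assumptions.

```python
import numpy as np

def grad_magnitude(x, points, gamma=1.0):
    """Magnitude of the gradient at x of a summed Cauchy-type
    potential phi(r) = -1 / (1 + ||r||^2 / gamma^2)."""
    diffs = x - points                      # (n, d)
    r2 = (diffs ** 2).sum(axis=1)
    coeff = 2.0 / gamma**2 / (1.0 + r2 / gamma**2) ** 2
    return np.linalg.norm((coeff[:, None] * diffs).sum(axis=0))

def proximity_curve(points, gamma=1.0):
    remaining = list(range(len(points)))
    curve = []
    while remaining:
        # Gradient magnitude at each remaining point, induced
        # by the potentials of the other remaining points
        mags = []
        for i in remaining:
            others = points[[j for j in remaining if j != i]]
            mags.append(grad_magnitude(points[i], others, gamma)
                        if len(others) else 0.0)
        k = int(np.argmin(mags))            # visit flattest point first
        curve.append((remaining[k], mags[k]))
        del remaining[k]
    return curve  # jumps in the magnitude pattern hint at clusters

rng = np.random.default_rng(2)
pts = np.vstack([rng.normal(0, 0.3, (10, 2)),
                 rng.normal(4, 0.3, (10, 2))])
for idx, mag in proximity_curve(pts)[:5]:
    print(idx, round(mag, 3))
```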
37.
Predicting The Risks of Recurrent Stroke and Post-Infection Seizure in Residents of Skilled Nursing Facilities - A Machine Learning Approach. Madeleine Gwynn Stanik (18422118). 22 April 2024.
<p dir="ltr">Recurrent stroke, infection, and seizure are some of the most common complications in stroke survivors. Recurrent stroke leads to death in 38.6% of survivors, and infections are the most common risk factor for seizures, with stroke survivors that experience an infection being at greater risk of experiencing a seizure. Two predictive models were generated, recurrent stroke and post-infection seizure, to determine stroke survivors at greatest risk to help providers focus on prevention in higher risk residents.</p><p dir="ltr">Predictive models were generated from a retrospective study of the Long-Term Care Minimum Data Set (MDS) 3.0 (2014-2018, n=262,301). Techniques included three data balancing methods (SMOTE for up sampling, ENN for down sampling, and SMOTEENN for up and down sampling) and three feature selection methods (LASSO, RFE, and PCA). The resulting datasets were then trained on four machine learning models (Logistic Regression, Random Forest, XGBoost, and Neural Network). Model performance was evaluated with AUC and accuracy, and interpretation used SHapley Addictive exPlanations.</p><p dir="ltr">Using data balancing methods improved the prediction performances of the machine learning models, but feature selection did not remove any features or affect performance. With all models having a high accuracy (78.6% to 99.9%), interpretation on all four models yielded the most holistic view. For recurrent stroke, SHAP values indicated that treatment combinations of occupational therapy, physical therapy, antidepressants, non-medical intervention for pain, therapeutic diet, anticoagulants, and diuretics contributed more to reducing recurrent stroke risk in the model when compared to individual treatments. For post-infection seizure, SHAP values indicated that therapy (speech, physical, occupational, and respiratory), independence (activities of daily living for walking, mobility, eating, dressing, and toilet use), and mood (severity score, anti-anxiety medications, antidepressants, and antipsychotics) features contributed the most. Meaning, stroke survivors who received fewer therapy hours, were less independent, and had a worse overall mood were at a greater risk of having a post-infection seizure.</p><p dir="ltr">The development of a tool to predict recurrent stroke and post-infection seizure in stroke survivors can be interpreted by providers to guide treatment and rehabilitation to prevent complications long-term. This promotes individualized plans that can increase the quality of resident care.</p>
38.
Strategy for construction of polymerized volume data sets. Aragonda, Prathyusha. 12 April 2006.
This thesis develops a strategy for polymerized volume data set construction. Given a volume data set defined over a regular three-dimensional grid, a polymerized volume data set (PVDS) can be defined as follows: edges between adjacent vertices of the grid are labeled 1 (active) or 0 (inactive) to indicate the likelihood that an edge is contained in (or spans the boundary of) a common underlying object, adding information not in the original volume data set. This edge labeling "polymerizes" adjacent voxels (those sharing a common active edge) into connected components, facilitating segmentation of embedded objects in the volume data set. Polymerization of the volume data set also aids real-time data compression, geometric modeling of the embedded objects, and their visualization.

To construct a polymerized volume data set, an adjacency class within the grid system is selected. Edges belonging to this adjacency class are labeled as interior, exterior, or boundary edges using discriminant functions whose functional forms are derived for three local adjacency classes. The discriminant function parameter values are determined by supervised learning. Training sets are derived from an initial segmentation on a homogeneous sample of the volume data set, using an existing segmentation method.

The strategy of constructing polymerized volume data sets is initially tested on synthetic data sets which resemble neuronal volume data obtained by three-dimensional microscopy. The strategy is then illustrated on volume data sets of mouse brain microstructure at a neuronal level of detail. Visualization and validation of the resulting PVDS are shown in both cases. Finally, the procedures of polymerized volume data set construction are generalized to apply to any Bravais lattice over the regular 3D orthogonal grid. Further development of this latter topic is left to future work.
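To make the polymerization idea concrete, here is a toy sketch in which a simple intensity-difference threshold stands in for the learned discriminant functions: 6-adjacency edges are marked active, and voxels sharing an active edge are merged into connected components with union-find. The threshold and all names are illustrative, not the thesis's method.

```python
import numpy as np

def polymerize(volume, tau=0.2):
    """Label 6-adjacency edges active when neighboring voxel values
    differ by less than tau, then union voxels across active edges."""
    idx = np.arange(volume.size).reshape(volume.shape)
    parent = np.arange(volume.size)

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for axis in range(3):  # one pass per edge orientation
        a = idx.take(range(volume.shape[axis] - 1), axis=axis).ravel()
        b = idx.take(range(1, volume.shape[axis]), axis=axis).ravel()
        active = np.abs(volume.ravel()[a] - volume.ravel()[b]) < tau
        for u, v in zip(a[active], b[active]):
            parent[find(u)] = find(v)  # merge across the active edge

    labels = np.array([find(i) for i in range(volume.size)])
    return labels.reshape(volume.shape)

vol = np.zeros((4, 4, 4))
vol[:2] = 1.0  # two homogeneous "objects" stacked along one axis
components = polymerize(vol)
print(len(np.unique(components)))  # -> 2 connected components
```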
39.
Using Statistical Methods to Determine Geolocation Via Twitter. Wright, Christopher M. 01 May 2014.
With the ever-expanding usage of social media websites such as Twitter, it is possible to use statistical inquiry to estimate a person's geographic location solely from the content of their tweets. In a 2010 study, Zhiyuan Cheng was able to place a Twitter user within 100 miles of their actual location 51% of the time. While this may seem a significant result, the study was done while Twitter was still finding its footing: in 2010, Twitter had 75 million registered users, whereas as of March 2013 it has around 500 million. In this thesis, my own dataset was collected and, using Excel macros, my results were compared to Cheng's to see whether the results have changed in the three years since his study. If Cheng's 51% accuracy can be achieved more efficiently using a simpler methodology, this could have a significant impact on Homeland Security and cyber security measures.
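Content-based geolocation of the kind Cheng studied can be sketched as scoring candidate cities by how well a user's words match each city's word distribution; the naive-Bayes scoring and made-up counts below are a generic illustration, not Cheng's exact estimator.

```python
import math
from collections import Counter

# Toy per-city word counts (stand-ins for counts learned from
# geotagged training tweets)
city_words = {
    "Houston": Counter({"rockets": 30, "rodeo": 25, "humidity": 20}),
    "Boston": Counter({"celtics": 30, "wicked": 25, "chowder": 20}),
}

def geolocate(tweet_words, city_words, alpha=1.0):
    """Score each city by the Laplace-smoothed log-likelihood
    of the observed words and return the best match."""
    best, best_score = None, -math.inf
    for city, counts in city_words.items():
        total = sum(counts.values())
        vocab = len(counts)
        score = sum(
            math.log((counts[w] + alpha) / (total + alpha * vocab))
            for w in tweet_words
        )
        if score > best_score:
            best, best_score = city, score
    return best

print(geolocate(["wicked", "chowder", "traffic"], city_words))  # Boston
```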
40.
The Evaluation of Well-known Effort Estimation Models based on Predictive Accuracy Indicators. Khan, Khalid. January 2010.
Accurate and reliable effort estimation is still one of the most challenging processes in software engineering. There have been numerous attempts to develop cost estimation models; however, evaluating the accuracy and reliability of those models has gained interest only in the last decade. A model can be finely tuned to specific data, but the issue of selecting the most appropriate model remains. A model's predictive accuracy is judged by comparing various accuracy measures, and the one with the minimum relative error is considered the best fit. For such a choice to be sound, the difference in predictive accuracy must be statistically significant; this practice evolved into model evaluation. Predictive accuracy indicators therefore need to be statistically tested before deciding to use a model for estimation. The aim of this thesis is to statistically evaluate well-known effort estimation models according to their predictive accuracy indicators using two new approaches: bootstrap confidence intervals and permutation tests. In this thesis, the significance of the differences between various accuracy indicators was empirically tested on projects obtained from the International Software Benchmarking Standards Group (ISBSG) data set. We selected projects measured in Unadjusted Function Points (UFP) of quality A. Analysis of variance (ANOVA) and regression were then used to form the least squares (LS) set, and estimation by analogy (EbA) was used to form the EbA set. Stepwise ANOVA was used to form the parametric model, and the k-NN algorithm was employed to obtain analogue projects for effort estimation in EbA. It was found that estimation reliability increased with statistical pre-processing of the data; moreover, the significance of the accuracy indicators was tested not only with standard statistics but also with more complex inferential statistical methods. The decision to select the non-parametric methodology (EbA) for generating project estimates is thus not arbitrary but statistically supported.
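The two evaluation techniques named, bootstrap confidence intervals and permutation tests, can be sketched for comparing two models on a per-project accuracy indicator; the magnitude of relative error (MRE) is used below on synthetic data, and all numbers are placeholders.

```python
import numpy as np

rng = np.random.default_rng(4)
actual = rng.uniform(100, 1000, 40)           # synthetic project efforts
pred_a = actual * rng.normal(1.00, 0.15, 40)  # model A predictions
pred_b = actual * rng.normal(1.10, 0.25, 40)  # model B, biased & noisier

mre_a = np.abs(pred_a - actual) / actual      # magnitude of relative error
mre_b = np.abs(pred_b - actual) / actual
observed = mre_b.mean() - mre_a.mean()

# Bootstrap 95% CI for the difference in mean MRE (resample projects)
boot = [
    (mre_b[i] - mre_a[i]).mean()
    for i in (rng.integers(0, 40, 40) for _ in range(5000))
]
print("95% CI:", np.percentile(boot, [2.5, 97.5]).round(3))

# Paired permutation test: randomly swap A/B labels per project
diffs = mre_b - mre_a
perm = [(diffs * rng.choice([-1, 1], 40)).mean() for _ in range(5000)]
p = np.mean(np.abs(perm) >= abs(observed))
print("observed diff:", round(observed, 3), "p-value:", round(p, 4))
```

If the confidence interval excludes zero and the p-value is small, the accuracy difference is statistically significant rather than an artifact of the particular project sample, which is exactly the kind of check the thesis advocates before committing to a model.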