11

Bayesian Spatiotemporal Modeling with Gaussian Processes

He, Qing 01 January 2022 (has links) (PDF)
Bayesian spatiotemporal models have been successfully applied to various fields of science, such as ecology and epidemiology. The complicated nature of spatiotemporal patterns can be well represented through priors such as Gaussian processes. This dissertation focuses on two applications of Bayesian spatiotemporal models: a) anomaly detection for spatiotemporal data with missingness and b) zero-inflated spatiotemporal count data analysis. Missingness in spatiotemporal data prevents anomaly detection algorithms from learning characteristic rules and patterns because most of the data are unavailable. This project is motivated by a challenge posed by the National Science Foundation (NSF) and the National Geospatial-Intelligence Agency (NGA). The proposed model uses traffic patterns at nearby hours of the same day, and at the same hour on different days of the week, to recover the complete data. We compare the proposed model with the baseline and other models on the given dataset; it is also tested on a new dataset provided by the challenge organizer. In the zero-inflated spatiotemporal data analysis, a set of latent variables drawn from Pólya-Gamma distributions is introduced into the Bayesian zero-inflated negative binomial model. The parameters of interest have conjugate priors conditional on the latent variables, which facilitates efficient posterior sampling by Markov chain Monte Carlo. Spatially and temporally varying random effects are accommodated through Gaussian processes. To overcome the computational bottleneck that Gaussian processes suffer when the sample size is large, a nearest-neighbor Gaussian process approach is implemented by constructing a sparse covariance matrix. The proposed Bayesian zero-inflated nearest-neighbor Gaussian process model has been applied to simulated data and to COVID-19 data.
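As a rough sketch of the nearest-neighbor Gaussian process construction mentioned above (assuming 1-D locations, an exponential covariance, and illustrative parameter values, none of which are taken from the dissertation), each point is conditioned on its m nearest previously ordered neighbors, which yields a sparse precision matrix:

```python
import numpy as np

def exp_cov(x1, x2, sigma2=1.0, phi=1.0):
    # Exponential covariance on 1-D locations (an illustrative kernel choice).
    return sigma2 * np.exp(-np.abs(x1[:, None] - x2[None, :]) / phi)

def nngp_factors(coords, m=5, sigma2=1.0, phi=1.0):
    # Conditionals w_i | w_N(i) ~ N(B[i] @ w_N(i), F[i]), where N(i) holds the
    # m nearest previously ordered neighbors; B is sparse in practice.
    n = len(coords)
    order = np.argsort(coords)
    B, F = np.zeros((n, n)), np.zeros(n)
    for k, i in enumerate(order):
        if k == 0:
            F[i] = sigma2
            continue
        prev = order[:k]
        nb = prev[np.argsort(np.abs(coords[prev] - coords[i]))[:m]]
        C_nn = exp_cov(coords[nb], coords[nb], sigma2, phi)
        C_in = exp_cov(coords[[i]], coords[nb], sigma2, phi).ravel()
        b = np.linalg.solve(C_nn, C_in)
        B[i, nb] = b
        F[i] = sigma2 - C_in @ b   # conditional variance
    return B, F

coords = np.random.default_rng(0).uniform(0, 10, 200)
B, F = nngp_factors(coords)
# Implied sparse precision of the NNGP approximation to the full GP:
Q = (np.eye(len(coords)) - B).T @ np.diag(1.0 / F) @ (np.eye(len(coords)) - B)
```

Because each conditional involves at most m neighbors, working with Q avoids the O(n^3) cost of factorizing a dense n-by-n covariance matrix.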
12

Template for preparing a Data Science thesis / Programa de Maestría en Data Science, Escuela de Postgrado

Dirección de Gestión del Conocimiento 02 1900 (has links)
Template for preparing a Master's thesis in Data Science, for obtaining the academic degree of Master in Data Science in the Programa de Maestría en Data Science, Escuela de Postgrado, Universidad Peruana de Ciencias Aplicadas.
13

A Pedagogical Approach to Create and Assess Domain-Specific Data Science Learning Materials in the Biomedical Sciences

Chen, Daniel 01 February 2022 (has links)
This dissertation explores creating a set of domain-specific learning materials for the biomedical sciences to address the educational gap in biomedical informatics, while also answering the call for statisticians to advocate for process improvements in other disciplines. Data science educational materials are now plentiful enough to have become a commodity. This provides the opportunity to create domain-specific learning materials that better motivate learning through real-world examples while also capturing the intricacies of working with data in a specific domain. This dissertation shows how persona methodology can be combined with a backward-design approach to creating domain-specific learning materials. The work is divided into three major steps: (1) create and validate a learner self-assessment survey that can identify learner personas by clustering; (2) combine the information from the persona methodology with a backward-design approach, using formative and summative assessments to curate, plan, and assess domain-specific data science workshop materials for short-term and long-term efficacy; and (3) pilot and identify how to manage real-time feedback within a data coding teaching session to drive better learner motivation and engagement. The key findings from this dissertation suggest that using a structured framework to plan and curate learning materials is an effective way to identify key concepts in data science. However, just creating and teaching learning materials is not enough for long-term retention of knowledge. More effort on long-term lesson maintenance and long-term strategies for practice will help retain the concepts learned from live instruction. Finally, it is essential that we be careful and purposeful in our content creation so as not to overwhelm learners, and that we integrate their needs into the materials as a primary focus. Overall, this work contributes to the growing need for data science education in the biomedical sciences to train future clinicians to use and work with data and improve patient outcomes. / Doctor of Philosophy / Regardless of the field and domain you are in, we are all inundated with data. The more agency we can give individuals to work with data, the better equipped they will be to bring their own expertise to complex problems and to work in multidisciplinary teams. There already exists a plethora of data science learning materials to help learners work with data; however, many are not domain-focused and can be overwhelming to new learners. By integrating domain specificity into data science education, we hypothesize that we can help learners learn and retain knowledge by keeping them more engaged and motivated. This dissertation focuses on the domain of the biomedical sciences, using best practices to improve data science education and impact the field. Specifically, we explore how to address major gaps in data education in the biomedical field and create a set of domain-specific learning materials (e.g., workshops) for the biomedical sciences. We use best educational practices to curate these learning materials and assess how effective they are.
This assessment was performed in three major steps: (1) identifying who the learners are and what they already know in the context of using a programming language to work with data; (2) planning and curating a learning path for the learners and assessing the materials created for short- and long-term effectiveness; and (3) piloting and identifying how to manage real-time feedback within a data coding teaching session to drive better learner motivation and engagement. The key findings from this dissertation suggest that using a structured framework to plan and curate learning materials is an effective way to identify key concepts in data science. However, just creating the materials and teaching them is not enough for long-term retention of knowledge. More effort on long-term lesson maintenance and long-term strategies for practice will help retain the concepts learned from live instruction. Finally, it is essential that we be careful and purposeful in our content creation so as not to overwhelm learners, and that we integrate their needs into the materials as a primary focus. Overall, this work contributes to the growing need for data science education in the biomedical sciences to train future clinicians to use and work with data and improve patient outcomes.
14

Data Science for Small Businesses

January 2016 (has links)
abstract: This report investigates the general day-to-day problems faced by small businesses, particularly small vendors, in the areas of marketing and general management. Due to a lack of manpower, internet availability, and properly documented data, small businesses cannot optimize their operations. The aim of this research is to address these problems and offer a solution in the form of a tool that utilizes data science. The tool will have features that help vendors mine the data they record themselves and find useful information that benefits their businesses. Since properly documented data are scarce, one-class classification using a Support Vector Machine (SVM) is used to build a model that returns positive values for the audience most likely to respond to a marketing strategy. Market basket analysis is used to choose products from the inventory so that patterns are found among them, increasing the chance that a marketing strategy will attract an audience. Higher-selling products can also be used to the vendor's advantage, and lower-selling products can be paired with them for an overall profit to the business. The tool, as envisioned, meets all the requirements it was set out to have and can be used as a stand-alone application to bring the power of data mining into the hands of a small vendor. / Dissertation/Thesis / Masters Thesis Engineering 2016
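As a minimal, hypothetical sketch of the one-class classification idea the abstract describes (the feature names and data below are invented for illustration, not taken from the report):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
# Hypothetical features (visit frequency, average basket size) for customers
# who responded to a past campaign -- the only "class" the vendor observes.
responders = rng.normal(loc=[5.0, 30.0], scale=[1.0, 5.0], size=(200, 2))

# Train on positives only; nu bounds the fraction treated as outliers.
model = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(responders)

# Score new customers: +1 = similar to past responders, -1 = unlikely to respond.
new_customers = rng.normal(loc=[4.0, 25.0], scale=[2.0, 10.0], size=(10, 2))
print(model.predict(new_customers))
```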
15

Web Conference Summarization Through a System of Flags

Ankola, Annirudh M 01 March 2020 (has links)
In today’s world, we are always trying to find new ways to advance. This era has given rise to a global, distributed workforce, since technology allows people to access and communicate with individuals all over the world. With the rise of remote workers, the need for quality communication tools has risen significantly. These communication tools come in many forms, and web-conference apps are among the most prominent. Developing a system to automatically summarize web-conferences will save companies time and money, leading to more efficient meetings. Current approaches to summarizing multi-speaker web-conferences tend to yield poor or incoherent results, since conversations do not flow in the same manner that monologues or well-structured articles do. This thesis proposes a system of flags used to extract information from sentences, where the flags are fed into machine learning models to determine the importance of the sentence with which they are associated. The system of flags shows promise for multi-speaker conference summaries.
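A minimal sketch of how such a flag system might look (the specific flags, sentences, and labels below are hypothetical stand-ins, not the thesis's actual flag set):

```python
from sklearn.linear_model import LogisticRegression

def sentence_flags(sentence, speaker_changed):
    # Hypothetical flags; the thesis's actual flag set is not specified here.
    return [
        int("?" in sentence),                 # question flag
        int(len(sentence.split()) > 15),      # long-sentence flag
        int(speaker_changed),                 # speaker-change flag
        int(any(w in sentence.lower() for w in ("deadline", "action", "decide"))),
    ]

# Toy training data: flag vectors labeled 1 if the sentence belongs in the summary.
X = [
    sentence_flags("We need to decide on the launch deadline today.", True),
    sentence_flags("Anyway, how was everyone's weekend?", False),
    sentence_flags("Action item: Priya will send the revised budget.", True),
    sentence_flags("Ha, good one.", False),
]
y = [1, 0, 1, 0]

clf = LogisticRegression().fit(X, y)
print(clf.predict([sentence_flags("Please decide who owns this action item.", True)]))
```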
16

Matrix factorization framework for simultaneous data (co-)clustering and embedding

Allab, Kais 15 November 2016 (has links)
Advances in computer technology, together with recent advances in sensing and storage, have created many high-volume, high-dimensional data sets. This increase in both the volume and the variety of data calls for advances in methodology to understand, process, summarize, and extract information from such data. From a more technical point of view, understanding the structure of large data sets arising from the data explosion is of fundamental importance in data mining and machine learning. Unlike supervised learning, unsupervised learning can provide generic tools for analyzing and summarizing these data sets when there is no well-defined notion of classes.
In this thesis, we focus on three important techniques of unsupervised learning for data analysis: dimensionality reduction, clustering, and co-clustering. Our major contribution is a novel way to address clustering (resp. co-clustering) and dimensionality reduction simultaneously. The main idea is to consider an objective function that can be decomposed into two terms, where one performs the dimensionality reduction while the other returns the clustering (resp. co-clustering) of the data in the projected space. We further introduce regularized versions of our approaches with graph Laplacian embedding in order to better preserve the local geometry of the data. Experimental results on synthetic as well as real data demonstrate that the proposed algorithms provide good low-dimensional representations of the data while improving the clustering (resp. co-clustering) results. Motivated by the good results obtained by graph-regularized clustering (resp. co-clustering) methods, we developed a new algorithm based on multi-manifold learning: we approximate the intrinsic manifold using a subset of candidate manifolds that better reflect the local geometrical structure by making use of the graph Laplacian matrices. Finally, we investigated the integration of selected instance-level constraints into the graph Laplacians of both data samples and data features, showing how the addition of prior knowledge can assist data co-clustering and improve the quality of the obtained co-clusters.
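To make the decomposition concrete, one illustrative form such a two-term objective can take (an assumption for exposition, not necessarily the thesis's exact formulation) is:

```latex
\min_{W,\,G,\,S}\;
\underbrace{\lVert X - XWW^{\top}\rVert_F^2}_{\text{dimensionality reduction}}
\;+\;
\lambda\,\underbrace{\lVert XW - GS\rVert_F^2}_{\text{clustering in the projected space}}
\;+\;
\mu\,\operatorname{Tr}\!\left(G^{\top} L\, G\right)
```

where W is the projection matrix, G a cluster-indicator matrix, S holds the centroids in the reduced space, and L is a graph Laplacian; minimizing the first term performs the embedding, the second clusters in the projected space, and the trace term is the Laplacian regularization that preserves local geometry. A co-clustering variant would add a symmetric term over the feature space.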
17

Template for preparing a Data Science research project / Programa de Maestría en Data Science, Escuela de Postgrado

Dirección de Gestión del Conocimiento 02 1900 (has links)
Template for preparing a Master's research project in Data Science, for obtaining the academic degree of Master in Data Science in the Programa de Maestría en Data Science, Escuela de Postgrado, Universidad Peruana de Ciencias Aplicadas.
18

Graph Neural Networks for Improved Interpretability and Efficiency

Pho, Patrick 01 January 2022 (has links) (PDF)
Attributed graphs are a powerful tool for modeling real-life systems in many domains, such as social science, biology, and e-commerce. The behaviors of those systems are largely defined by, or dependent on, their corresponding network structures. Graph analysis has become an important line of research due to the rapid integration of such systems into every aspect of human life and the profound impact they have on human behavior. Graph-structured data contain a rich amount of information from the network connectivity and the supplementary input features of nodes. Traditional machine learning algorithms and network science tools are limited in their capability to make use of both network topology and node features. Graph Neural Networks (GNNs) provide an efficient framework that combines both sources of information to produce accurate predictions for a wide range of tasks, including node classification and link prediction. The exponential growth of graph datasets drives the development of complex GNN models, raising concerns about processing time and interpretability of the results. Another issue arises from the cost and difficulty of collecting large amounts of annotated data for training deep GNN models. Apart from sampling issues, the existence of anomalous entities in the data can degrade the quality of the fitted models. In this dissertation, we propose novel techniques and strategies to overcome these challenges. First, we present a flexible regularization scheme applied to the Simple Graph Convolution (SGC). The proposed framework inherits the fast and efficient properties of SGC while yielding a sparse set of fitted parameter vectors, facilitating the identification of important input features. Next, we examine efficient procedures for collecting training samples and develop indicative measures, as well as quantitative guidelines, to assist practitioners in choosing the optimal sampling strategy for obtaining data. We then improve upon an existing GNN model for the anomaly detection task; our proposed framework achieves better accuracy and reliability. Lastly, we experiment with adapting the flexible regularization mechanism to the link prediction task.
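For reference, SGC reduces a GNN to K hops of feature propagation through the normalized adjacency followed by a linear classifier. The sketch below pairs it with an off-the-shelf L1 penalty as a simple stand-in for a sparsity-inducing regularizer; the dissertation's actual regularization scheme is not reproduced here, and the graph and features are toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sgc_features(A, X, K=2):
    # Simple Graph Convolution: propagate features K hops, no nonlinearity.
    n = A.shape[0]
    A_hat = A + np.eye(n)                      # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    S = d_inv_sqrt @ A_hat @ d_inv_sqrt        # symmetric normalization
    for _ in range(K):
        X = S @ X
    return X

# Toy graph: 6 nodes, 2 communities, random node features.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
X = rng.normal(size=(6, 4))
y = np.array([0, 0, 0, 1, 1, 1])

Z = sgc_features(A, X, K=2)
# The L1 penalty zeroes out unhelpful features -- a simple stand-in for the
# sparsity-inducing regularization scheme studied in the dissertation.
clf = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(Z, y)
print(clf.coef_)
```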
19

Change Point Detection for Streaming Data Using Support Vector Methods

Harrison, Charles 01 January 2022 (has links) (PDF)
Sequential multiple change point detection concerns the identification of multiple points in time where the systematic behavior of a statistical process changes. A special case of this problem, called online anomaly detection, occurs when the goal is to detect the first change and then signal an alert to an analyst for further investigation. This dissertation concerns the use of methods based on kernel functions and support vectors to detect changes. A variety of support vector-based methods are considered, but the primary focus is Least Squares Support Vector Data Description (LS-SVDD). LS-SVDD constructs a hypersphere in a kernel space to bound a set of multivariate vectors using a closed-form solution. The mathematical tractability of LS-SVDD facilitates closed-form updates for the LS-SVDD Lagrange multipliers. The update formulae cover adding a block of observations to, or removing a block from, an existing LS-SVDD description, so the LS-SVDD can be constructed or updated sequentially, which makes it attractive for online problems with sequential data streams. LS-SVDD is applied to a variety of scenarios, including online anomaly detection and sequential multiple change point detection.
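A minimal sketch of one common LS-SVDD formulation (the kernel choice, parameter values, and this particular KKT system are assumptions for illustration; the dissertation's block update formulae are not reproduced): the Lagrange multipliers solve a single linear system, and the anomaly score is the squared kernel distance to the hypersphere center.

```python
import numpy as np

def rbf(X1, X2, gamma=0.5):
    # Gaussian RBF kernel matrix between row sets X1 and X2.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def ls_svdd_fit(X, C=10.0, gamma=0.5):
    # One common LS-SVDD KKT system:
    #   (2K + I/(2C)) alpha + b * 1 = diag(K),   1^T alpha = 1
    n = len(X)
    K = rbf(X, X, gamma)
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = 2 * K + np.eye(n) / (2 * C)
    A[:n, n] = 1.0
    A[n, :n] = 1.0
    sol = np.linalg.solve(A, np.append(np.diag(K), 1.0))
    return sol[:n], K                       # alpha, training kernel matrix

def ls_svdd_score(X_train, alpha, K, X_new, gamma=0.5):
    # Squared kernel distance from each new point to the hypersphere center.
    k_xx = np.ones(len(X_new))              # rbf(x, x) = 1
    k_xn = rbf(X_new, X_train, gamma)
    return k_xx - 2 * k_xn @ alpha + alpha @ K @ alpha

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))               # in-control observations
alpha, K = ls_svdd_fit(X)
print(ls_svdd_score(X, alpha, K, np.array([[0.0, 0.0], [6.0, 6.0]])))
```

Because the fit reduces to one linear solve, adding or removing a block of observations amounts to updating that system, which is what makes the sequential setting tractable.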
20

2D Jupyter: Design and Evaluation of 2D Computational Notebooks

Christman, Elizabeth 12 June 2023 (has links)
Computational notebooks are a popular tool for data analysis. However, the 1D linear structure used by many computational notebooks can lead to challenges and pain points in data analysis, including messiness, tedious navigation, inefficient use of screen space, and presentation of non-linear narratives. To address these problems, we designed a prototype Jupyter Notebooks extension called 2D Jupyter that enables a 2D organization of code cells in a multi-column layout, as well as freeform cell placement. We conducted a user study using this extension to evaluate the usability of 2D computational notebooks and understand the advantages and disadvantages that it provides over a 1D layout. As a result of this study, we found evidence that the 2D layout provides enhanced usability and efficiency in computational notebooks. Additionally, we gathered feedback on the design of the prototype that can be used to inform future work. Overall, 2D Jupyter was positively received and users not only enjoyed using the extension, but also expressed a desire to use 2D notebook environments in the future. / Master of Science / Computational notebooks are a tool commonly used by data analysts that allows them to construct computational narratives through a combination of code, text and visualizations. Many computational notebooks use a 1D linear layout; however data analysis work is often conducted in a non-linear fashion due to the need to debug code, test new theories, and evaluate and compare results. In this work, we present a prototype extension for Jupyter Notebooks called 2D Jupyter that enables the user to arrange their notebook in a 2D multi-column layout. A user study was conducted to evaluate the usability of this extension and understand the benefits that a 2D layout may provide. Feedback on the extension's design was also collected to inform future design opportunities. The prototype received a positive reaction overall and users expressed a desire to use 2D computational notebooks in their future work.
