  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Template for preparing a Data Science thesis / Master's Program in Data Science, Graduate School

Dirección de Gestión del Conocimiento 02 1900 (has links)
Template for preparing a Master's thesis in Data Science, toward the academic degree of Master in Data Science in the Master's Program in Data Science, Graduate School, Universidad Peruana de Ciencias Aplicadas.
2

Data Science for Small Businesses

January 2016 (has links)
abstract: This report investigates the day-to-day problems faced by small businesses, particularly small vendors, in marketing and general management. Due to a lack of manpower, internet availability, and properly documented data, small businesses cannot optimize their operations. The aim of the research is to address these problems with a tool that applies data science. The tool offers features that help vendors mine the data they record themselves and extract information that benefits their businesses. Because properly documented data is scarce, one-class classification with a Support Vector Machine (SVM) is used to build a model that returns positive values for the audience most likely to respond to a marketing strategy. Market basket analysis is used to select products from the inventory so that patterns are found among them, increasing the chance that a marketing strategy will attract customers. Higher-selling products can also be used to the vendor's advantage, and lower-selling products can be paired with them to improve overall profit. The tool, as envisioned, meets all the requirements it was set out to have and can be used as a stand-alone application to bring the power of data mining into the hands of a small vendor. / Dissertation/Thesis / Masters Thesis Engineering 2016
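
As an illustration of the two techniques named above, the sketch below fits scikit-learn's OneClassSVM on a small synthetic table of customer features and then counts product co-occurrences as a rudimentary market-basket step. The feature names, products, and thresholds are invented for the example; this is a minimal sketch, not the tool described in the thesis.

```python
import numpy as np
from itertools import combinations
from collections import Counter
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler

# Synthetic customer features (e.g. visits per month, average spend);
# in practice these would come from the vendor's own records.
rng = np.random.default_rng(0)
responders = rng.normal(loc=[8, 40], scale=[2, 5], size=(50, 2))   # known responders
candidates = rng.normal(loc=[5, 30], scale=[4, 15], size=(20, 2))  # customers to score

# One-class classification: learn the "responder" region from positive
# examples only, then flag candidates that fall inside it.
scaler = StandardScaler().fit(responders)
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
model.fit(scaler.transform(responders))
likely_to_respond = model.predict(scaler.transform(candidates)) == 1
print("candidates worth targeting:", np.flatnonzero(likely_to_respond))

# Rudimentary market-basket step: count how often pairs of products
# appear in the same transaction and surface the most frequent pairs.
transactions = [{"rice", "beans", "oil"}, {"rice", "oil"}, {"tea", "sugar"},
                {"rice", "beans"}, {"tea", "sugar", "biscuits"}]
pair_counts = Counter(pair for t in transactions
                      for pair in combinations(sorted(t), 2))
print(pair_counts.most_common(3))
```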
3

Web Conference Summarization Through a System of Flags

Ankola, Annirudh M 01 March 2020 (has links)
In today’s world, we are always trying to find new ways to advance. This era has given rise to a global, distributed workforce, since technology allows people to access and communicate with individuals all over the world. With the rise of remote workers, the need for quality communication tools has risen significantly. These communication tools come in many forms, and web-conference apps are among the most prominent for the task. Developing a system that automatically summarizes a web conference will save companies time and money, leading to more efficient meetings. Current approaches to summarizing multi-speaker web conferences tend to yield poor or incoherent results, since conversations do not flow in the same manner that monologues or well-structured articles do. This thesis proposes a system of flags used to extract information from sentences, where the flags are fed into machine learning models to determine the importance of the sentence with which they are associated. The system of flags shows promise for multi-speaker conference summaries.
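
A minimal sketch of the flag idea, assuming a handful of hand-picked binary flags and toy labels (the specific flags, sentences, and model below are invented for illustration and are not the ones developed in the thesis):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical flags extracted from each sentence of a meeting transcript.
def sentence_flags(sentence: str) -> list[int]:
    tokens = sentence.lower().split()
    return [
        int(any(w in tokens for w in ("decide", "decided", "action", "deadline"))),  # decision cue
        int("?" in sentence),                                # question
        int(len(tokens) > 15),                               # long sentence
        int(any(w in tokens for w in ("i", "we", "you"))),   # personal reference
    ]

sentences = [
    "We decided to move the launch deadline to Friday.",
    "Um, yeah, I think so.",
    "Can you send the updated budget before the next call?",
    "The weather was terrible this morning.",
]
labels = [1, 0, 1, 0]  # 1 = important for the summary (toy labels)

# The flag vectors become features for a simple importance classifier.
X = np.array([sentence_flags(s) for s in sentences])
clf = LogisticRegression().fit(X, labels)

new = "We need an action item for the security review."
print("importance score:", clf.predict_proba([sentence_flags(new)])[0, 1])
```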
4

Matrix factorization framework for simultaneous data (co-)clustering and embedding

Allab, Kais 15 November 2016 (has links)
Advances in computer technology and recent advances in sensing and storage technology have created many high-volume, high-dimensional data sets. This increase in both the volume and the variety of data calls for advances in methodology to understand, process, summarize and extract information from such data. From a more technical point of view, understanding the structure of large data sets arising from the data explosion is of fundamental importance in data mining and machine learning. Unlike supervised learning, unsupervised learning can provide generic tools for analyzing and summarizing these data sets when there is no well-defined notion of classes.
In this thesis, we focus on three important techniques of unsupervised learning for data analysis, namely dimensionality reduction, data clustering and data co-clustering. Our major contribution is a novel way to consider the clustering (resp. co-clustering) and the reduction of dimensionality simultaneously. The main idea is to consider an objective function that can be decomposed into two terms, where one performs the dimensionality reduction while the other simultaneously returns the clustering (resp. co-clustering) of the data in the projected space. We further introduce regularized versions of our approaches with graph Laplacian embedding in order to better preserve the local geometry of the data. Experimental results on synthetic as well as real data demonstrate that the proposed algorithms provide good low-dimensional representations of the data while improving the clustering (resp. co-clustering) results. Motivated by the good results obtained by these graph-regularized clustering (resp. co-clustering) methods, we develop a new algorithm based on multi-manifold learning: we approximate the intrinsic manifold using a subset of candidate manifolds that better reflect the local geometric structure, making use of the graph Laplacian matrices. Finally, we investigate the integration of selected instance-level constraints into the graph Laplacians of both data samples and data features. In doing so, we show how the addition of prior knowledge can assist data co-clustering and improve the quality of the obtained co-clusters.
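
In symbols, one plausible form of such a decomposed objective is the following (the notation and trade-off weights are illustrative, not necessarily those used in the thesis): given a data matrix X, an embedding Z, a projection W, a cluster-indicator matrix G, centroids F in the embedded space, and a graph Laplacian L,

```latex
\min_{\mathbf{Z},\,\mathbf{W},\,\mathbf{G},\,\mathbf{F}}\;
\underbrace{\lVert \mathbf{X} - \mathbf{Z}\mathbf{W}^{\top} \rVert_F^2}_{\text{dimensionality reduction}}
\;+\;
\lambda\,\underbrace{\lVert \mathbf{Z} - \mathbf{G}\mathbf{F} \rVert_F^2}_{\text{clustering in the embedded space}}
\;+\;
\mu\,\operatorname{Tr}\!\left( \mathbf{Z}^{\top} \mathbf{L}\, \mathbf{Z} \right)
```

Minimizing the first two terms jointly couples the embedding and the cluster assignments, while the trace term is the graph Laplacian regularization that keeps points that are neighbors in the original data close together in the embedded space.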
5

Template for preparing a Data Science research project / Master's Program in Data Science, Graduate School

Dirección de Gestión del Conocimiento 02 1900 (has links)
Template for preparing a Master's research project in Data Science, toward the academic degree of Master in Data Science in the Master's Program in Data Science, Graduate School, Universidad Peruana de Ciencias Aplicadas.
6

Graph Neural Networks for Improved Interpretability and Efficiency

Pho, Patrick 01 January 2022 (has links) (PDF)
Attributed graphs are a powerful tool for modeling real-life systems in many domains, such as social science, biology, and e-commerce. The behavior of these systems is largely defined by, or dependent on, their underlying network structures. Graph analysis has become an important line of research due to the rapid integration of such systems into every aspect of human life and the profound impact they have on human behavior. Graph-structured data contains a rich amount of information from the network connectivity and the supplementary input features of nodes. Traditional machine learning algorithms and network science tools are limited in their capability to make use of both network topology and node features. Graph Neural Networks (GNNs) provide an efficient framework that combines both sources of information to produce accurate predictions for a wide range of tasks, including node classification and link prediction. The exponential growth of graph datasets drives the development of complex GNN models, raising concerns about processing time and the interpretability of results. Another issue arises from the cost and difficulty of collecting large amounts of annotated data for training deep GNN models. Apart from sampling issues, the existence of anomalous entities in the data can degrade the quality of the fitted models. In this dissertation, we propose novel techniques and strategies to overcome these challenges. First, we present a flexible regularization scheme applied to the Simple Graph Convolution (SGC). The proposed framework inherits the fast and efficient properties of SGC while yielding a sparse set of fitted parameter vectors, facilitating the identification of important input features. Next, we examine efficient procedures for collecting training samples and develop indicative measures as well as quantitative guidelines to assist practitioners in choosing the optimal sampling strategy. We then improve upon an existing GNN model for the anomaly detection task; our proposed framework achieves better accuracy and reliability. Lastly, we experiment with adapting the flexible regularization mechanism to the link prediction task.
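
A compact sketch of the first contribution's general recipe, assuming the standard Simple Graph Convolution steps (propagate node features through powers of the normalized adjacency, then fit a linear classifier), with a plain L1 penalty standing in for the thesis's flexible regularization scheme; the graph, features, and labels are toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy graph: adjacency matrix and node features.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
X = np.random.default_rng(0).normal(size=(5, 8))   # node features
y = np.array([0, 0, 0, 1, 1])                       # node labels

# SGC propagation: symmetrically normalized adjacency with self-loops,
# raised to the K-th power, applied to the features (no nonlinearities).
A_hat = A + np.eye(A.shape[0])
d = A_hat.sum(axis=1)
S = A_hat / np.sqrt(np.outer(d, d))
K = 2
X_prop = np.linalg.matrix_power(S, K) @ X

# Linear classifier on the propagated features; the L1 penalty yields a
# sparse weight vector, highlighting which input features matter.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X_prop, y)
print("nonzero feature weights:", np.flatnonzero(clf.coef_))
```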
7

Change Point Detection for Streaming Data Using Support Vector Methods

Harrison, Charles 01 January 2022 (has links) (PDF)
Sequential multiple change point detection concerns the identification of multiple points in time where the systematic behavior of a statistical process changes. A special case of this problem, called online anomaly detection, occurs when the goal is to detect the first change and then signal an alert to an analyst for further investigation. This dissertation concerns the use of methods based on kernel functions and support vectors to detect changes. A variety of support vector-based methods are considered, but the primary focus is Least Squares Support Vector Data Description (LS-SVDD). LS-SVDD constructs a hypersphere in a kernel space to bound a set of multivariate vectors using a closed-form solution. The mathematical tractability of LS-SVDD facilitates closed-form updates for its Lagrange multipliers: the update formulae cover both adding a block of observations to and removing a block from an existing LS-SVDD description, so the description can be constructed or updated sequentially, which makes it attractive for online problems with sequential data streams. LS-SVDD is applied to a variety of scenarios, including online anomaly detection and sequential multiple change point detection.
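
The full LS-SVDD update formulae are beyond a short example, but the sketch below illustrates the underlying idea of a kernel-space description of normal data: it scores streaming points by their RBF-kernel distance to the centroid of a reference window and raises an alarm when the score exceeds a threshold. This is a simplified stand-in for the fitted hypersphere, not the LS-SVDD solution itself; all data and parameters are synthetic.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kernel_center_distance(X_ref, X_new, gamma=0.5):
    """Squared distance, in RBF feature space, from each new point to the
    centroid of the reference data (a simplified data description)."""
    K_rr = rbf_kernel(X_ref, X_ref, gamma=gamma)
    K_nr = rbf_kernel(X_new, X_ref, gamma=gamma)
    k_nn = np.ones(len(X_new))                 # k(x, x) = 1 for the RBF kernel
    return k_nn - 2 * K_nr.mean(axis=1) + K_rr.mean()

rng = np.random.default_rng(1)
reference = rng.normal(0.0, 1.0, size=(200, 3))           # pre-change regime
stream = np.vstack([rng.normal(0.0, 1.0, size=(50, 3)),
                    rng.normal(4.0, 1.0, size=(50, 3))])  # change at index 50

scores = kernel_center_distance(reference, stream)
threshold = np.quantile(kernel_center_distance(reference, reference), 0.99)
alarms = np.flatnonzero(scores > threshold)
print("first alarm at stream index:", alarms[0] if alarms.size else None)
```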
8

2D Jupyter: Design and Evaluation of 2D Computational Notebooks

Christman, Elizabeth 12 June 2023 (has links)
Computational notebooks are a popular tool for data analysis. However, the 1D linear structure used by many computational notebooks can lead to challenges and pain points in data analysis, including messiness, tedious navigation, inefficient use of screen space, and presentation of non-linear narratives. To address these problems, we designed a prototype Jupyter Notebooks extension called 2D Jupyter that enables a 2D organization of code cells in a multi-column layout, as well as freeform cell placement. We conducted a user study using this extension to evaluate the usability of 2D computational notebooks and to understand the advantages and disadvantages they provide over a 1D layout. As a result of this study, we found evidence that the 2D layout provides enhanced usability and efficiency in computational notebooks. Additionally, we gathered feedback on the design of the prototype that can be used to inform future work. Overall, 2D Jupyter was positively received: users not only enjoyed using the extension but also expressed a desire to use 2D notebook environments in the future. / Master of Science / Computational notebooks are a tool commonly used by data analysts that allows them to construct computational narratives through a combination of code, text, and visualizations. Many computational notebooks use a 1D linear layout; however, data analysis work is often conducted in a non-linear fashion due to the need to debug code, test new theories, and evaluate and compare results. In this work, we present a prototype extension for Jupyter Notebooks called 2D Jupyter that enables the user to arrange their notebook in a 2D multi-column layout. A user study was conducted to evaluate the usability of this extension and understand the benefits that a 2D layout may provide. Feedback on the extension's design was also collected to inform future design opportunities. The prototype received a positive reaction overall, and users expressed a desire to use 2D computational notebooks in their future work.
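
Purely as an illustration of how a freeform layout could be persisted (the metadata keys and file names below are hypothetical, not 2D Jupyter's actual schema), notebook cells already carry a metadata dictionary that an extension can read and write with nbformat:

```python
import nbformat

# Load an existing notebook (the path is just an example).
nb = nbformat.read("analysis.ipynb", as_version=4)

# Hypothetical per-cell layout metadata a 2D-layout extension could store;
# the key names below are invented for illustration.
for i, cell in enumerate(nb.cells):
    cell.metadata["layout_2d"] = {"column": i % 3, "row": i // 3}

nbformat.write(nb, "analysis_2d.ipynb")
```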
9

Active provenance for data intensive research

Spinuso, Alessandro January 2018 (has links)
The role of provenance information in data-intensive research is a significant topic of discussion among technical experts and scientists. Typical use cases addressing traceability, versioning and reproducibility of research findings are extended with more interactive scenarios in support, for instance, of computational steering and results management. In this thesis we investigate the impact that lineage records can have on the early phases of analysis, for instance when performed through near-real-time systems and Virtual Research Environments (VREs) tailored to the requirements of a specific community. By positioning provenance at the centre of the computational research cycle, we highlight the importance of having mechanisms on the data scientists' side that, by integrating with the abstractions offered by the processing technologies, such as scientific workflows and data-intensive tools, facilitate the experts' contribution to the lineage at runtime. Ultimately, by encouraging tuning and use of provenance for rapid feedback, the thesis aims at improving the synergy between different user groups to increase productivity and understanding of their processes. We present a model of provenance, called S-PROV, that uses and further extends PROV and ProvONE. The relationships and properties characterising the workflow's abstractions and their concrete executions are re-elaborated to include aspects related to delegation, distribution and steering of stateful streaming operators. The model is supported by the Active framework for tuneable and actionable lineage, ensuring the user's engagement by fostering rapid exploitation. Here, concepts such as provenance types, configuration and explicit state management allow users to capture complex provenance scenarios and activate selective controls based on domain and user-defined metadata. We outline how the traces are recorded in a new comprehensive system, called S-ProvFlow, enabling different classes of consumers to explore the provenance data with services and tools for monitoring, in-depth validation and comprehensive visual analytics. The work of this thesis is discussed in the context of an existing computational framework and the experience gained in implementing provenance-aware tools for seismology and climate VREs. It will continue to evolve through newly funded projects, providing generic and user-centred solutions for data-intensive research.
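
As a purely illustrative data structure, not the actual S-PROV vocabulary, a lineage record for one invocation of a streaming operator might carry fields for delegation, explicit state, and user-defined metadata along the lines sketched below:

```python
from dataclasses import dataclass, field

@dataclass
class LineageRecord:
    """Hypothetical sketch of a provenance record for one operator invocation;
    the field names are illustrative, not the S-PROV schema."""
    operator: str                                        # workflow component that ran
    invocation_id: str
    worker: str                                          # where execution was delegated
    inputs: list[str] = field(default_factory=list)      # ids of consumed data items
    outputs: list[str] = field(default_factory=list)     # ids of produced data items
    state_snapshot: dict = field(default_factory=dict)   # explicit operator state
    user_metadata: dict = field(default_factory=dict)    # domain terms added at runtime

record = LineageRecord(
    operator="bandpass_filter",
    invocation_id="inv-0042",
    worker="node-7",
    inputs=["trace-001"],
    outputs=["trace-001-filtered"],
    user_metadata={"station": "XY.ABC", "quality": "reviewed"},
)
print(record.operator, "->", record.outputs)
```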
10

The Scope and Value of Healthcare Data Science Applications

Huerta, Jose Oscar 05 1900 (has links)
Health disparities are a recognized public health concern, and addressing them calls for new methods that help close the gap. This research examined the effectiveness of data science in highlighting health disparities and conveyed the value of data science applications in related healthcare settings. The goal of this research was accomplished through a multi-phased, multi-method approach, presented in three individual essays. In essay one, a systematic literature review assessed the state of data science applications used to explore health disparities in the current academic literature and determined their applicability. The systematic review was guided by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. Essay two assessed the capacity of data science software to examine health disparities data, using KDnuggets data on analytics, data science, and machine-learning software. The research in this essay demonstrated the potential utility of leading software to perform the kinds of data science operations that can improve care in healthcare networks by addressing health disparities. Essay three provided a case study to showcase the value data science brings to the healthcare space. This study used a geographic information system to create and analyze choropleth maps of the distribution of prostate cancer in Texas. SPSS software was used to assess the social determinants of health that may explain prostate cancer mortality.
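
A minimal sketch of the GIS step, assuming a county boundary file already joined with a mortality-rate column (the file name and column name below are placeholders, not the study's actual data):

```python
import geopandas as gpd
import matplotlib.pyplot as plt

# Placeholder input: a Texas county boundary file joined beforehand with
# age-adjusted prostate cancer mortality rates per county.
counties = gpd.read_file("texas_counties_with_rates.shp")

ax = counties.plot(column="mortality_rate",   # hypothetical column name
                   cmap="OrRd",
                   legend=True,
                   edgecolor="grey",
                   linewidth=0.2,
                   figsize=(8, 8))
ax.set_axis_off()
ax.set_title("Prostate cancer mortality by county (illustrative)")
plt.savefig("choropleth.png", dpi=200)
```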
