121 |
[en] ENABLING AUTONOMOUS DATA ANNOTATION: A HUMAN-IN-THE-LOOP REINFORCEMENT LEARNING APPROACH / [pt] HABILITANDO ANOTAÇÕES DE DADOS AUTÔNOMOS: UMA ABORDAGEM DE APRENDIZADO POR REFORÇO COM HUMANO NO LOOP
LEONARDO CARDIA DA CRUZ, 10 November 2022 (has links)
[pt] As técnicas de aprendizado profundo têm mostrado contribuições significativas em vários campos, incluindo a análise de imagens. A grande maioria
dos trabalhos em visão computacional concentra-se em propor e aplicar
novos modelos e algoritmos de aprendizado de máquina. Para tarefas de
aprendizado supervisionado, o desempenho dessas técnicas depende de uma
grande quantidade de dados de treinamento, bem como de dados rotulados. No entanto, a rotulagem é um processo caro e demorado. Uma recente
área de exploração são as reduções dos esforços na preparação de dados,
deixando-os sem inconsistências, ruídos, para que os modelos atuais possam obter um maior desempenho. Esse novo campo de estudo é chamado
de Data-Centric IA. Apresentamos uma nova abordagem baseada em Deep
Reinforcement Learning (DRL), cujo trabalho é voltado para a preparação
de um conjunto de dados em problemas de detecção de objetos, onde as anotações de caixas delimitadoras são feitas de modo autônomo e econômico.
Nossa abordagem consiste na criação de uma metodologia para treinamento
de um agente virtual a fim de rotular automaticamente os dados, a partir do
auxílio humano como professor desse agente. Implementamos o algoritmo
Deep Q-Network para criar o agente virtual e desenvolvemos uma abordagem de aconselhamento para facilitar a comunicação do humano professor
com o agente virtual estudante. Para completar nossa implementação, utilizamos o método de aprendizado ativo para selecionar casos onde o agente
possui uma maior incerteza, necessitando da intervenção humana no processo de anotação durante o treinamento. Nossa abordagem foi avaliada
e comparada com outros métodos de aprendizado por reforço e interação
humano-computador, em diversos conjuntos de dados, onde o agente virtual precisou criar novas anotações na forma de caixas delimitadoras. Os
resultados mostram que o emprego da nossa metodologia impacta positivamente para obtenção de novas anotações a partir de um conjunto de dados
com rótulos escassos, superando métodos existentes. Desse modo, apresentamos a contribuição no campo de Data-Centric IA, com o desenvolvimento
de uma metodologia de ensino para criação de uma abordagem autônoma
com aconselhamento humano para criar anotações econômicas a partir de
anotações escassas. / [en] Deep learning techniques have shown significant contributions in various
fields, including image analysis. The vast majority of work in computer
vision focuses on proposing and applying new machine learning models
and algorithms. For supervised learning tasks, the performance of these
techniques depends on a large amount of training data, in particular labeled data.
However, labeling is an expensive and time-consuming process.
A recent area of exploration is the reduction of efforts in data preparation,
leaving it without inconsistencies and noise so that current models can
obtain greater performance. This new field of study is called Data-Centric
AI. We present a new approach based on Deep Reinforcement Learning
(DRL), focused on preparing datasets for object detection
problems, where bounding box annotations are produced autonomously and
economically. Our approach consists of creating a methodology for training
a virtual agent in order to automatically label the data, using human
assistance as a teacher of this agent.
We implemented the Deep Q-Network algorithm to create the virtual agent
and developed an advising approach to facilitate communication between the
human teacher and the virtual student agent. To complete our implementation,
we used active learning to select the cases in which the agent is most
uncertain and therefore requires human intervention in the annotation process
during training. Our approach was evaluated and compared with other
reinforcement learning and human-computer interaction methods on several
datasets, where the virtual agent had to create new annotations
in the form of bounding boxes. The results show that the use of our
methodology has a positive impact on obtaining new annotations from
a dataset with scarce labels, surpassing existing methods. In this way,
we present a contribution to the field of Data-Centric AI: a teaching
methodology for building an autonomous, human-advised approach that produces
cost-effective annotations from scarce ones.
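To illustrate the uncertainty-driven division of labour between the agent and the human teacher described above, the following Python sketch routes low-confidence proposals to a person while keeping the rest autonomous. It is only a simplified illustration, not the thesis implementation: `agent_propose`, `human_label` and the confidence threshold are hypothetical stand-ins for the trained Deep Q-Network agent, the human teacher and the active-learning selection criterion.

```python
import random
from dataclasses import dataclass

@dataclass
class Annotation:
    box: tuple          # (x, y, w, h) bounding box proposed for an image
    confidence: float   # agent's confidence in its own proposal
    source: str         # "agent" or "human"

def agent_propose(image_id: int) -> Annotation:
    """Stand-in for the DQN agent: proposes a box with a confidence score."""
    box = (random.randint(0, 50), random.randint(0, 50), 64, 64)
    return Annotation(box=box, confidence=random.random(), source="agent")

def human_label(image_id: int) -> Annotation:
    """Stand-in for the human teacher providing a ground-truth box."""
    return Annotation(box=(10, 10, 64, 64), confidence=1.0, source="human")

def annotate_dataset(image_ids, uncertainty_threshold=0.4):
    """Route low-confidence proposals to the human; keep the rest autonomous."""
    labeled, human_queries = [], 0
    for image_id in image_ids:
        proposal = agent_propose(image_id)
        if proposal.confidence < uncertainty_threshold:
            proposal = human_label(image_id)   # active-learning query
            human_queries += 1
        labeled.append((image_id, proposal))
    return labeled, human_queries

if __name__ == "__main__":
    random.seed(0)
    annotations, queries = annotate_dataset(range(100))
    print(f"{queries} of {len(annotations)} images required human intervention")
```

In the actual methodology the uncertainty comes from the agent's learned policy and the human advice also feeds back into training; the loop above only shows how querying the human can be restricted to the cases the agent handles worst.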
|
122 |
Cyber Threat Detection using Machine Learning on Graphs : Continuous-Time Temporal Graph Learning on Provenance Graphs / Detektering av cyberhot med hjälp av maskininlärning på grafer : Inlärning av kontinuerliga tidsdiagram på härkomstgrafer
Reha, Jakub January 2023 (has links)
Cyber attacks are ubiquitous and increasingly prevalent in industry, society, and governmental departments. They affect the economy, politics, and individuals. Ever-increasingly skilled, organized, and funded threat actors combined with ever-increasing volumes and modalities of data require increasingly sophisticated and innovative cyber defense solutions. Current state-of-the-art security systems conduct threat detection on dynamic graph representations of computer systems and enterprise communication networks known as provenance graphs. Most of these security systems are statistics-based, based on rules defined by domain experts, or discard temporal information, and as such come with a set of drawbacks (e.g., incapability to pinpoint the attack, incapability to adapt to evolving systems, reduced expressibility due to lack of temporal information). At the same time, there is little research in the machine learning community on graphs such as provenance graphs, which are a form of large-scale, heterogeneous, and continuous-time dynamic graphs, as most research on graph learning has been devoted to static homogeneous graphs to date. Therefore, this thesis aims to bridge these two fields and investigate the potential of learning-based methods operating on continuous-time dynamic provenance graphs for cyber threat detection. Without loss of generality, this work adopts the general Temporal Graph Networks framework for learning representations and detecting anomalies in such graphs. This method explicitly addresses the drawbacks of current security systems by considering the temporal setting and bringing the adaptability of learning-based methods. In doing so, it also introduces and releases two large-scale, continuous-time temporal, heterogeneous benchmark graph datasets with expert-labeled anomalies to foster future research on representation learning and anomaly detection on complex real-world networks. To the best of the author’s knowledge, these are among the first datasets of their kind. Extensive experimental analyses of modules, datasets, and baselines validate the potency of continuous-time graph neural network-based learning, endorsing its practical applicability to the detection of cyber threats and possibly other semantically meaningful anomalies in similar real-world systems. / Cyberattacker är allestädes närvarande och blir allt vanligare inom industrin, samhället och statliga myndigheter. De påverkar ekonomin, politiken och enskilda individer. Allt skickligare, organiserade och finansierade hotaktörer i kombination med ständigt ökande volymer och modaliteter av data kräver alltmer sofistikerade och innovativa cyberförsvarslösningar. Dagens avancerade säkerhetssystem upptäcker hot på dynamiska grafrepresentationer (proveniensgrafer) av datorsystem och företagskommunikationsnät. De flesta av dessa säkerhetssystem är statistikbaserade, baseras på regler som definieras av domänexperter eller bortser från temporär information, och som sådana kommer de med en rad nackdelar (t.ex. oförmåga att lokalisera attacken, oförmåga att anpassa sig till system som utvecklas, begränsad uttrycksmöjlighet på grund av brist på temporär information). Samtidigt finns det lite forskning inom maskininlärning om grafer som proveniensgrafer, som är en form av storskaliga, heterogena och dynamiska grafer med kontinuerlig tid, eftersom den mesta forskningen om grafinlärning hittills har ägnats åt statiska homogena grafer.
Därför syftar denna avhandling till att överbrygga dessa två områden och undersöka potentialen hos inlärningsbaserade metoder som arbetar med dynamiska proveniensgrafer med kontinuerlig tid för detektering av cyberhot. Utan att för den skull göra avkall på generaliserbarheten använder detta arbete det allmänna Temporal Graph Networks-ramverket för inlärning av representationer och upptäckt av anomalier i sådana grafer. Denna metod tar uttryckligen itu med nackdelarna med nuvarande säkerhetssystem genom att beakta den temporala induktiva inställningen och ge anpassningsförmågan hos inlärningsbaserade metoder. I samband med detta introduceras och släpps också två storskaliga, kontinuerliga temporala, heterogena referensgrafdatauppsättningar med expertmärkta anomalier för att främja framtida forskning om representationsinlärning och anomalidetektering i komplexa nätverk i den verkliga världen. Såvitt författaren vet är detta en av de första datamängderna i sitt slag. Omfattande experimentella analyser av moduler, dataset och baslinjer validerar styrkan i induktiv inlärning baserad på kontinuerliga grafneurala nätverk, vilket stöder dess praktiska tillämpbarhet för att upptäcka cyberhot och eventuellt andra semantiskt meningsfulla avvikelser i liknande verkliga system.
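As a rough illustration of continuous-time, memory-based scoring of provenance events, the Python sketch below keeps a small per-node interaction memory and scores each incoming edge by how rare the (source, target, event type) combination is and how long the source has been dormant. It is a deliberately simplified stand-in, not the Temporal Graph Networks model used in the thesis; the event tuples, weights and scoring rule are invented for illustration.

```python
from collections import defaultdict

class TemporalMemory:
    """Keeps a per-entity summary of past interactions, updated as events stream in.

    This mimics, in a very reduced form, the 'memory' idea of Temporal Graph
    Networks: each node carries state derived from its interaction history.
    """
    def __init__(self):
        self.last_seen = {}                  # node -> timestamp of its last event
        self.pair_counts = defaultdict(int)  # (src, dst, etype) -> frequency

    def score(self, src, dst, etype, ts):
        """Higher score = more anomalous (rare pair and/or long-dormant source)."""
        rarity = 1.0 / (1 + self.pair_counts[(src, dst, etype)])
        dormancy = ts - self.last_seen.get(src, ts)
        return rarity + 0.01 * dormancy

    def update(self, src, dst, etype, ts):
        self.pair_counts[(src, dst, etype)] += 1
        self.last_seen[src] = ts
        self.last_seen[dst] = ts

# Toy provenance events: (timestamp, source process, target file, event type)
events = [
    (1, "bash", "/etc/passwd", "read"),
    (2, "bash", "/tmp/log", "write"),
    (3, "bash", "/tmp/log", "write"),
    (50, "bash", "/etc/shadow", "read"),   # rare pair after a long gap
]

memory = TemporalMemory()
for ts, src, dst, etype in events:
    anomaly = memory.score(src, dst, etype, ts)
    print(f"t={ts:3d} {src} -{etype}-> {dst}  anomaly score={anomaly:.2f}")
    memory.update(src, dst, etype, ts)
```

In the actual framework the per-node memory is a learned embedding updated by neural message and memory functions, and anomaly scores come from a trained decoder rather than hand-written heuristics.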
|
123 |
Identification of Important Cell Cycle Regulators and Novel Genes in Specific Tissues using Microarray Analysis, Bioinformatics and Molecular Tools
Zhang, Jibin 19 May 2015 (has links)
No description available.
|
124 |
Comparative Gene Expression Analysis To Identify Common Factors In Multiple Cancers
Rybaczyk, Leszek A. 29 July 2008 (has links)
No description available.
|
125 |
Sur la génération d'exemples pour réduire le coût d'annotation
Piedboeuf, Frédéric 03 1900 (has links)
L'apprentissage machine moderne s'appuie souvent sur l'utilisation de jeux de données massifs, mais il existe de nombreux contextes où l'acquisition et la manipulation de grandes données n'est pas possible, et le développement de techniques d'apprentissage avec de petites données est donc essentiel. Dans cette thèse, nous étudions comment diminuer le nombre de données nécessaires à travers deux paradigmes d'apprentissage : l'augmentation de données et l'apprentissage par requête synthétisée.
La thèse s'organise en quatre volets, chacun démontrant une nouvelle facette concernant la génération d'exemples pour réduire le coût d'annotation. Le premier volet regarde l'augmentation de données pour des textes en anglais, ce qui nous permet d'établir une comparaison objective des techniques et de développer de nouveaux algorithmes. Le deuxième volet regarde ensuite l'augmentation de données dans les langues autres que l'anglais, et le troisième pour la tâche de génération de mots-clés en français. Finalement, le dernier volet s'intéresse à l'apprentissage par requête synthétisée, où les exemples générés sont annotés, en contraste à l'augmentation de données qui produit des exemples sans coût d'annotation supplémentaire. Nous montrons que cette technique permet de meilleures performances, particulièrement lorsque le jeu de données est large et l'augmentation de données souvent inefficace. / Modern machine learning often relies on the use of massive datasets, but there are many contexts where acquiring and handling large data is not feasible, making the development of techniques for learning with small data essential. In this thesis, we investigate how to reduce the amount of data required through two learning paradigms: data augmentation and membership query synthesis.
The thesis is organized into four parts, each demonstrating a new aspect of generating examples to reduce annotation costs. The first part examines data augmentation for English text, allowing us to make an objective comparison of techniques and develop new algorithms. The second one then explores data augmentation in languages other than English, and the third focuses on the task of keyword generation in French. Finally, the last part delves into membership query synthesis, where generated examples are annotated, in contrast to data augmentation, which produces examples without additional annotation costs. We show that this technique leads to better performance, especially when the dataset is large and data augmentation is often ineffective.
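To make the contrast between the two paradigms concrete, the Python sketch below shows that data augmentation perturbs an already-labeled example and inherits its label for free, while membership query synthesis generates a new example that still has to be sent to an annotator. The word-dropout transform, the crude text-splicing synthesizer and the keyword-based oracle are hypothetical placeholders, not the techniques evaluated in the thesis.

```python
import random

def augment(text: str) -> str:
    """Data augmentation: perturb an existing example, keep its label.

    Simple word dropout as a stand-in for the augmentation techniques
    compared in the thesis (paraphrasing, back-translation, etc.).
    """
    words = text.split()
    kept = [w for w in words if random.random() > 0.15] or words
    return " ".join(kept)

def synthesize_query(seed_texts) -> str:
    """Membership query synthesis: build a new example to send to an annotator."""
    a, b = random.sample(seed_texts, 2)
    half_a = a.split()[: len(a.split()) // 2]
    half_b = b.split()[len(b.split()) // 2 :]
    return " ".join(half_a + half_b)

def oracle_label(text: str) -> str:
    """Stand-in for the human annotator (this is where the annotation cost lies)."""
    return "positive" if "great" in text else "negative"

random.seed(1)
labeled = [("the movie was great fun", "positive"), ("a dull and slow plot", "negative")]

# Augmentation: labels are inherited for free.
augmented = [(augment(t), y) for t, y in labeled]

# Query synthesis: the generated example must be labeled by the oracle.
query = synthesize_query([t for t, _ in labeled])
synthesized = (query, oracle_label(query))

print(augmented)
print(synthesized)
```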
|
126 |
Workload- and Data-based Automated Design for a Hybrid Row-Column Storage Model and Bloom Filter-Based Query Processing for Large-Scale DICOM Data Management / Conception automatisée basée sur la charge de travail et les données pour un modèle de stockage hybride ligne-colonne et le traitement des requêtes à l’aide de filtres de Bloom pour la gestion de données DICOM à grande échelle
Nguyen, Cong-Danh 04 May 2018 (has links)
Dans le secteur des soins de santé, les données d'images médicales toujours croissantes, le développement de technologies d'imagerie, la conservation à long terme des données médicales et l'augmentation de la résolution des images entraînent une croissance considérable du volume de données. En outre, la variété des dispositifs d'acquisition et la différence de préférences des médecins ou d'autres professionnels de la santé ont conduit à une grande variété de données. Bien que la norme DICOM (Digital Imaging and Communication in Medicine) soit aujourd'hui largement adoptée pour stocker et transférer les données médicales, les données DICOM ont toujours les caractéristiques 3V du Big Data: volume élevé, grande variété et grande vélocité. En outre, il existe une variété de charges de travail, notamment le traitement transactionnel en ligne (en anglais Online Transaction Processing, abrégé en OLTP), le traitement analytique en ligne (en anglais Online Analytical Processing, abrégé en OLAP) et les charges de travail mixtes. Les systèmes existants ont des limites concernant ces caractéristiques des données et des charges de travail. Dans cette thèse, nous proposons de nouvelles méthodes efficaces pour stocker et interroger des données DICOM. Nous proposons un modèle de stockage hybride des magasins de lignes et de colonnes, appelé HYTORMO, ainsi que des stratégies de stockage de données et de traitement des requêtes. Tout d'abord, HYTORMO est conçu et mis en œuvre pour être déployé sur un environnement à grande échelle afin de permettre la gestion de grandes données médicales. Deuxièmement, la stratégie de stockage de données combine l'utilisation du partitionnement vertical et un stockage hybride pour créer des configurations de stockage de données qui peuvent réduire la demande d'espace de stockage et augmenter les performances de la charge de travail. Pour réaliser une telle configuration de stockage de données, l'une des deux approches de conception de stockage de données peut être appliquée: (1) conception basée sur des experts et (2) conception automatisée. Dans la première approche, les experts créent manuellement des configurations de stockage de données en regroupant les attributs des données DICOM et en sélectionnant une disposition de stockage de données appropriée pour chaque groupe de colonnes. Dans la dernière approche, nous proposons un cadre de conception automatisé hybride, appelé HADF. HADF dépend des mesures de similarité (entre attributs) qui prennent en compte les impacts des informations spécifiques à la charge de travail et aux données pour générer automatiquement les configurations de stockage de données: Hybrid Similarity (combinaison pondérée de similarité d'accès d'attribut et de similarité de densité d'attribut) est utilisée pour regrouper les attributs dans les groupes de colonnes; Inter-Cluster Access Similarity est utilisée pour déterminer si deux groupes de colonnes seront fusionnés ou non (pour réduire le nombre de jointures supplémentaires); et Intra-Cluster Access Similarity est appliquée pour décider si un groupe de colonnes sera stocké dans une ligne ou un magasin de colonnes. Enfin, nous proposons une stratégie de traitement des requêtes adaptée et efficace construite sur HYTORMO. Elle considère l'utilisation des jointures internes et des jointures externes gauche pour empêcher la perte de données si vous utilisez uniquement des jointures internes entre des tables partitionnées verticalement.
De plus, une intersection de filtres Bloom (intersection of Bloom filters) est appliquée pour supprimer les données non pertinentes des tables d'entrée des opérations de jointure; cela permet de réduire les coûts d'E/S réseau. (...) / In the health care industry, the ever-increasing medical image data, the development of imaging technologies, the long-term retention of medical data and the increase of image resolution are causing a tremendous growth in data volume. In addition, the variety of acquisition devices and the difference in preferences of physicians or other health-care professionals have led to a high variety in data. Although today the DICOM (Digital Imaging and Communication in Medicine) standard has been widely adopted to store and transfer the medical data, DICOM data still has the 3Vs characteristics of Big Data: high volume, high variety and high velocity. Besides, there is a variety of workloads including Online Transaction Processing (OLTP), Online Analytical Processing (OLAP) and mixed workloads. Existing systems have limitations dealing with these characteristics of data and workloads. In this thesis, we propose new efficient methods for storing and querying DICOM data. We propose a hybrid storage model of row and column stores, called HYTORMO, together with data storage and query processing strategies. First, HYTORMO is designed and implemented to be deployed on a large-scale environment to make it possible to manage big medical data. Second, the data storage strategy combines the use of vertical partitioning and a hybrid store to create data storage configurations that can reduce storage space demand and increase workload performance. To achieve such a data storage configuration, one of two data storage design approaches can be applied: (1) expert-based design and (2) automated design. In the former approach, experts manually create data storage configurations by grouping attributes and selecting a suitable data layout for each column group. In the latter approach, we propose a hybrid automated design framework, called HADF. HADF depends on similarity measures (between attributes) that can take into consideration the combined impact of both workload- and data-specific information to generate data storage configurations: Hybrid Similarity (a weighted combination of Attribute Access and Density Similarity measures) is used to group the attributes into column groups; Inter-Cluster Access Similarity is used to determine whether two column groups will be merged together or not (to reduce the number of joins); and Intra-Cluster Access Similarity is applied to decide whether a column group will be stored in a row or a column store. Finally, we propose a suitable and efficient query processing strategy built on top of HYTORMO. It considers the use of both inner joins and left-outer joins. Furthermore, an Intersection Bloom filter is applied to reduce network I/O cost. We provide experimental evaluations to validate the benefits of the proposed methods over real DICOM datasets. Experimental results show that the mixed use of both row and column stores outperforms a pure row store and a pure column store. The combined impact of both workload- and data-specific information is helpful for HADF to be able to produce good data storage configurations. Moreover, the query processing strategy with the use of the Intersection Bloom filter can improve the execution time of an experimental query by up to 50% when compared to the case where no such filter is applied.
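To make the join-pruning idea concrete, here is a minimal, self-contained Python sketch of an intersection of Bloom filters: one filter is built per table over the join key, the filters are intersected bitwise, and rows whose keys cannot appear in the intersection are discarded before the join. The table contents, key values and filter sizes are invented for illustration; the thesis applies the same idea to vertically partitioned DICOM attribute tables in a distributed setting, which is where the network I/O savings come from.

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size, self.num_hashes, self.bits = size, num_hashes, [False] * size

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

def intersect(filters):
    """Bitwise AND of several Bloom filters built with identical parameters."""
    result = BloomFilter(filters[0].size, filters[0].num_hashes)
    result.bits = [all(f.bits[i] for f in filters) for i in range(result.size)]
    return result

# Two vertically partitioned DICOM-like tables sharing a join key (a study UID).
patients = [("uid1", "CT"), ("uid2", "MR"), ("uid3", "CT")]
images   = [("uid2", 512), ("uid3", 256), ("uid4", 1024)]

bf_patients, bf_images = BloomFilter(), BloomFilter()
for uid, _ in patients:
    bf_patients.add(uid)
for uid, _ in images:
    bf_images.add(uid)

ibf = intersect([bf_patients, bf_images])

# Rows whose key cannot be in the intersection are dropped before the join,
# which is what saves network I/O in a distributed execution.
pruned_patients = [row for row in patients if ibf.might_contain(row[0])]
pruned_images   = [row for row in images if ibf.might_contain(row[0])]
joined = [(p, i) for p in pruned_patients for i in pruned_images if p[0] == i[0]]
print(joined)   # inner join over the pruned inputs
```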
|
127 |
Metadados nas instruções de governos para publicadores de dados / Metadata in the government instructions for data publishers / Metadatos en las instrucciones de gobiernos para publicadores de datos
Camperos Reyes, Jacquelin Teresa [UNESP] 29 January 2018 (has links)
Previous issue date: 2018-01-29 / Outra / Gerar valor para a sociedade a partir da abundância de dados governamentais tornou-se imperativo nas estratégias de disponibilização de dados que estão sendo publicados por meio de conjuntos de dados ou datasets. Os datasets, dados tabulados com certa estrutura, constituem um exemplo de reunião de bases de dados que pretendem obter sucesso erguendo-se como catálogos centrais do ponto de vista dos cidadãos, ampliando a visibilidade sobre e das ações da gestão pública. Atingir a estruturação desses recursos informacionais, de forma que auxilie na sua revalorização, é um dos desafios da Ciência da Informação. A questão de investigação é: Como está sendo abordada a aderência ao uso de metadados nas instruções entregues aos publicadores de dados em governos? O objetivo é descrever a aderência ao uso de metadados nos datasets de governos de países, tomando como base o contexto e o marco conceitual apresentados nas instruções para publicadores de dados, encontradas nos sites de dados abertos oficiais dos países analisados. Acredita-se que estudos como este podem fornecer elementos que atuem como subsídios para as estratégias governamentais, atendendo dimensões sociais a partir dos profissionais da informação. Trata-se de pesquisa descritiva, de natureza qualitativa, focada na observação crítica dos documentos que abordam o tratamento descritivo dos datasets governamentais. Utilizam-se como procedimentos a análise bibliográfica e documental, e a definição de estudos de caso nos países Colômbia, Brasil, Espanha e Portugal, abordando o volume de dados e informações mediante a técnica de análise de conteúdo. Percebe-se o esforço realizado pelos detentores dos documentos disponibilizados nos quatro países analisados pela ampla abordagem de conteúdo temático relacionado com o uso experimental dos metadados, dando assim maior importância ao aspecto prático em relação ao teórico, sem desconsiderar a relevância das explanações teóricas. Acredita-se na importância da criação e implementação de perfis de aplicação entre comunidades de países, como o caso do DCAT-AP, criado e recomendado pela comunidade europeia de nações e sugerido pelos sites de dados dos países estudados. Admitem-se inquietações referentes aos processos de publicação de dados de governo e suas relações com outros tópicos de interesse socioeconômico, tais como possíveis vínculos com indicadores de desenvolvimento em países e regiões, sob o prisma de pesquisas originadas a partir da Ciência da Informação. / Generating value to society from the abundance of government data, has become imperative in the strategies of data availability that are being published through datasets. The datasets, tabulated data with a certain structure, are an example of a meeting of databases that aim to be successful setting up as central catalogs from the point of view of the citizens, increasing the visibility on and of the actions of the public management. Accomplishing the structuring of these informational resources, so that this helps in their revaluation, is one of the challenges of Information Science. The research question is: How is the adherence to the use of metadata in the instructions given to data publishers in South American governments being addressed? 
The objective is to describe the adherence to the use of metadata in the datasets of governments of South American countries, based on the context and conceptual framework presented in the instructions for data publishers, found on the official open data sites of the analyzed countries. It is believed that studies such as this one can provide elements that act as subsidies for government strategies, addressing social dimensions deriving out of the information professionals. It is a descriptive research, of qualitative nature, focused on the critical observation of the documents that approach the descriptive treatment of the governmental datasets. Bibliographic and documentary analyses are used as methodological procedures, and the definition of case studies in the countries Colombia, Brazil, Spain and Portugal, addressing the volume of data and information through the technique of content analysis. The effort made by the holders of the documents available in the four analyzed countries by the broad thematic content related to the experimental use of the metadata is noticed, thus giving greater importance to the practical aspect in relation to the theoretical, without disregarding the relevance of the theoretical explanations. It is believed that creating and implementing application profiles among communities of countries, such as the DCAT-AP, created and recommended by the European community of nations and suggested by the data sites of the analyzed countries, is important. Concerns referent to the publication processes of government data and their relations with other topics of socio-economic interest are admitted, such as possible linkages with indicators of development in countries and regions, under the prism of research originated from the Information Science.
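As a concrete illustration of the kind of metadata the analyzed instructions ask publishers to provide, the Python sketch below assembles a minimal dataset record using terms from the W3C DCAT vocabulary, on which the DCAT-AP application profile mentioned above builds. The title, publisher, keywords, license and URLs are invented examples and the record is reduced to a handful of properties; it is not taken from any of the national data portals studied.

```python
import json

# Namespaces follow the W3C DCAT vocabulary; the publisher, URLs and
# identifiers below are invented for illustration only.
record = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
        "foaf": "http://xmlns.com/foaf/0.1/",
    },
    "@type": "dcat:Dataset",
    "dct:title": "Municipal budget execution 2017",
    "dct:description": "Quarterly budget execution figures published as open data.",
    "dct:publisher": {"@type": "foaf:Agent", "foaf:name": "Example Ministry of Finance"},
    "dcat:keyword": ["budget", "public spending", "open government"],
    "dct:license": "https://creativecommons.org/licenses/by/4.0/",
    "dcat:distribution": [
        {
            "@type": "dcat:Distribution",
            "dct:format": "text/csv",
            "dcat:accessURL": "https://data.example.gov/budget-2017.csv",
        }
    ],
}

print(json.dumps(record, indent=2, ensure_ascii=False))
```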
|
128 |
Multi-fidelity Machine Learning for Perovskite Band Gap Predictions
Panayotis Thalis Manganaris (16384500) 16 June 2023 (has links)
A wide range of optoelectronic applications demand semiconductors optimized for purpose.
My research focused on data-driven identification of ABX3 Halide perovskite compositions for optimum photovoltaic absorption in solar cells.
I trained machine learning models on previously reported datasets of halide perovskite band gaps based on first principles computations performed at different fidelities.
Using these, I identified mixtures of candidate constituents at the A, B or X sites of the perovskite supercell which leveraged how mixed perovskite band gaps deviate from the linear interpolations predicted by Vegard's law of mixing to obtain a selection of stable perovskites with band gaps in the ideal range of 1 to 2 eV for visible light spectrum absorption.
These models predict the perovskite band gap using the composition and inherent elemental properties as descriptors.
This enables accurate, high fidelity prediction and screening of the much larger chemical space from which the data samples were drawn.

I utilized a recently published density functional theory (DFT) dataset of more than 1300 perovskite band gaps from four different levels of theory, added to an experimental perovskite band gap dataset of ~100 points, to train random forest regression (RFR), Gaussian process regression (GPR), and Sure Independence Screening and Sparsifying Operator (SISSO) regression models, with data fidelity added as one-hot encoded features.
I found that RFR yields the best model with a band gap root mean square error of 0.12 eV on the total dataset and 0.15 eV on the experimental points.
SISSO provided compound features and functions for direct prediction of band gap, but errors were larger than from RFR and GPR.
Additional insights gained from Pearson correlation and Shapley additive explanation (SHAP) analysis of learned descriptors suggest the RFR models performed best because of (a) their focus on identifying and capturing relevant feature interactions and (b) their flexibility to represent nonlinear relationships between such interactions and the band gap.
The best model was deployed for predicting experimental band gap of 37785 hypothetical compounds.
Based on this, we identified 1251 stable compounds with band gap predicted to be between 1 and 2 eV at experimental accuracy, successfully narrowing the candidates to about 3% of the screened compositions.
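The passage above describes folding data fidelity into the feature vector as one-hot columns; the sketch below shows what that setup looks like with scikit-learn's RandomForestRegressor on synthetic data. The descriptors, the fidelity coding (0, 1, 2 for hypothetical levels of theory and experiment) and the target function are invented stand-ins for the real perovskite dataset, so the printed error has no relation to the 0.12 eV figure reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic stand-in for the perovskite dataset: composition-derived
# descriptors plus a fidelity label for each band gap record.
n = 400
descriptors = rng.random((n, 5))            # e.g., averaged elemental properties
fidelity = rng.integers(0, 3, size=n)       # 0, 1, 2 = assumed fidelity levels
one_hot = np.eye(3)[fidelity]               # one-hot encode the fidelity

# Toy target: a nonlinear function of descriptors with a fidelity-dependent shift.
band_gap = (1.5 * descriptors[:, 0] + descriptors[:, 1] ** 2
            + 0.3 * fidelity + rng.normal(0, 0.05, n))

X = np.hstack([descriptors, one_hot])
X_train, X_test, y_train, y_test = train_test_split(X, band_gap, random_state=0)

model = RandomForestRegressor(n_estimators=300, random_state=0)
model.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"test RMSE: {rmse:.3f} eV")
```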
|
129 |
Taxonomy of datasets in graph learning : a data-driven approach to improve GNN benchmarking
Cantürk, Semih 12 1900 (has links)
The core research of this thesis, mostly comprising chapter four, has been accepted to the Learning on Graphs (LoG) 2022 conference for a spotlight presentation as a standalone paper, under the title "Taxonomy of Benchmarks in Graph Representation Learning", and is to be published in the Proceedings of Machine Learning Research (PMLR) series. As a main author of the paper, my specific contributions to this paper cover problem formulation, design and implementation of our taxonomy framework and experimental pipeline, collation of our results and of course the writing of the article. / L'apprentissage profond sur les graphes a atteint des niveaux de succès sans précédent ces dernières années grâce aux réseaux de neurones de graphes (GNN), des architectures de réseaux de neurones spécialisées qui ont sans équivoque surpassé les approches antérieurs d'apprentissage définies sur des graphes. Les GNN étendent le succès des réseaux de neurones aux données structurées en graphes en tenant compte de leur géométrie intrinsèque. Bien que des recherches approfondies aient été effectuées sur le développement de GNN avec des performances supérieures à celles des modèles références d'apprentissage de représentation graphique, les procédures d'analyse comparative actuelles sont insuffisantes pour fournir des évaluations justes et efficaces des modèles GNN. Le problème peut-être le plus répandu et en même temps le moins compris en ce qui concerne l'analyse comparative des graphiques est la "couverture de domaine": malgré le nombre croissant d'ensembles de données graphiques disponibles, la plupart d'entre eux ne fournissent pas d'informations supplémentaires et au contraire renforcent les biais potentiellement nuisibles dans le développement d’un modèle GNN. Ce problème provient d'un manque de compréhension en ce qui concerne les aspects d'un modèle donné qui sont sondés par les ensembles de données de graphes. Par exemple, dans quelle mesure testent-ils la capacité d'un modèle à tirer parti de la structure du graphe par rapport aux fonctionnalités des nœuds? Ici, nous développons une approche fondée sur des principes pour taxonomiser les ensembles de données d'analyse comparative selon un "profil de sensibilité" qui est basé sur la quantité de changement de performance du GNN en raison d'une collection de perturbations graphiques. Notre analyse basée sur les données permet de mieux comprendre quelles caractéristiques des données de référence sont exploitées par les GNN. Par conséquent, notre taxonomie peut aider à la sélection et au développement de repères graphiques adéquats et à une évaluation mieux informée des futures méthodes GNN. Enfin, notre approche et notre implémentation dans le package GTaxoGym (https://github.com/G-Taxonomy-Workgroup/GTaxoGym) sont extensibles à plusieurs types de tâches de prédiction de graphes et à des futurs ensembles de données. / Deep learning on graphs has attained unprecedented levels of success in recent years thanks to Graph Neural Networks (GNNs), specialized neural network architectures that have unequivocally surpassed prior graph learning approaches. GNNs extend the success of neural networks to graph-structured data by accounting for their intrinsic geometry. While extensive research has been done on developing GNNs with superior performance according to a collection of graph representation learning benchmarks, current benchmarking procedures are insufficient to provide fair and effective evaluations of GNN models. 
Perhaps the most prevalent and at the same time least understood problem with respect to graph benchmarking is "domain coverage": Despite the growing number of available graph datasets, most of them do not provide additional insights and on the contrary reinforce potentially harmful biases in GNN model development. This problem stems from a lack of understanding with respect to what aspects of a given model are probed by graph datasets. For example, to what extent do they test the ability of a model to leverage graph structure vs. node features? Here, we develop a principled approach to taxonomize benchmarking datasets according to a "sensitivity profile" that is based on how much GNN performance changes due to a collection of graph perturbations. Our data-driven analysis provides a deeper understanding of which benchmarking data characteristics are leveraged by GNNs. Consequently, our taxonomy can aid in selection and development of adequate graph benchmarks, and better informed evaluation of future GNN methods. Finally, our approach and implementation in the GTaxoGym package (https://github.com/G-Taxonomy-Workgroup/GTaxoGym) are extendable to multiple graph prediction task types and future datasets.
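To give a flavour of the sensitivity-profile idea described above, the short Python sketch below turns per-dataset GNN accuracies, measured on original and perturbed versions of each benchmark, into profiles of performance drops and compares datasets by the distance between their profiles. All accuracy numbers and perturbation names are invented; the actual taxonomy relies on a broader set of perturbations and on real GNN evaluations produced with the GTaxoGym pipeline.

```python
import numpy as np

# Hypothetical GNN accuracies per dataset, on the original graphs and after
# two perturbations (numbers are invented; in the thesis these come from
# actually evaluating GNNs on perturbed benchmarks).
results = {
    #                       original  no-node-features  rewired-edges
    "dataset_A": np.array([0.85,      0.84,             0.60]),  # structure-sensitive
    "dataset_B": np.array([0.91,      0.70,             0.90]),  # feature-sensitive
    "dataset_C": np.array([0.78,      0.77,             0.55]),  # structure-sensitive
}

def sensitivity_profile(scores: np.ndarray) -> np.ndarray:
    """Performance drop caused by each perturbation, relative to the original."""
    return scores[0] - scores[1:]

profiles = {name: sensitivity_profile(s) for name, s in results.items()}

# A trivial 'taxonomy': datasets whose profiles are close probe similar model abilities.
names = list(profiles)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        distance = np.linalg.norm(profiles[a] - profiles[b])
        print(f"{a} vs {b}: profile distance = {distance:.2f}")
```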
|
130 |
Prediction Models for TV Case Resolution Times with Machine Learning / Förutsägelsemodeller för TV-fall Upplösningstid med maskininlärning
Javierre I Moyano, Borja January 2023 (has links)
TV distribution and streaming of video content over the Internet rely on complex networks of Content Delivery Networks (CDNs), cables and end-point user devices, and are therefore prone to issues at different levels of the network that end up affecting the final customer's TV services. When a problem affects the customer and prevents a proper TV delivery service on the devices used for streaming, the issue is reported through a call, a TV case is opened and the company's customer handling agents start supervising it to solve the problem as soon as possible. The goal of this research work is to present an ML-based solution that predicts the Resolution Times (RTs) of the TV cases in each TV delivery service type, that is, how long the cases will take to be solved. The approach taken to provide meaningful results consisted of using four Machine Learning (ML) algorithms to create 480 models for each of the two scenarios. The results revealed that Random Forest (RF) and, especially, Gradient Boosting Machine (GBM) performed exceptionally well. Surprisingly, hyperparameter tuning did not improve the RT predictions as much as expected. Some challenges included the initial data preprocessing and some uncertainty in the hyperparameter tuning approaches. Thanks to these predicted times, the company is now able to better inform its customers of how long the problem is expected to last until it is resolved. This real case scenario also considers how the company processes the available data and manages the problem. The research work consists of, first, a literature review on the prediction of RTs of Trouble Tickets (TTs) and customer churn in telecommunication companies, as well as a study of the company's available data for the problem. The research then focuses on analysing the provided dataset for the experimentation, the preprocessing of this data according to industry standards and, finally, the predictions and analysis of the obtained performance metrics. The proposed solution is designed to offer an improved resolution of the company's specified task. Future work could involve increasing the number of TV cases per service to improve the results and exploring the link between resolution times and customer churn decisions. / TV-distribution och leverans av strömningsinnehåll via internet består av komplexa nätverk, inklusive CDNs, kablar och slutanvändarutrustning. Detta gör det känsligt för problem på olika nätverksnivåer som kan påverka slutkundens TV-tjänster. När ett problem påverkar kunden och hindrar en korrekt TV-leveranstjänst rapporteras det genom ett samtal. Ett ärende öppnas, och företagets kundhanteringsagenter övervakar det för att lösa problemet så snabbt som möjligt. Målet med detta forskningsarbete är att presentera en maskininlärningsbaserad lösning som förutsäger löstiderna (RTs) för TV-ärenden inom varje TV-leveranstjänsttyp, det vill säga hur lång tid ärendena kommer att ta att lösa. För att få meningsfulla resultat användes fyra maskininlärningsalgoritmer för att skapa 480 modeller för var och en av de två scenarierna. Resultaten visade att Random Forest (RF) och framför allt Gradient Boosting Machine (GBM) presterade exceptionellt bra. Överraskande nog förbättrade inte finjusteringen av hyperparametrar RT som förväntat. Vissa utmaningar inkluderade den initiala dataförbehandlingen och osäkerhet i metoder för hyperparametertuning.
Tack vare dessa förutsagda tider kan företaget nu bättre informera sina kunder om hur länge problemet förväntas vara olöst. Denna verkliga fallstudie tar också hänsyn till hur företaget hanterar tillgängliga data och problemet. Forskningsarbetet börjar med en litteraturgenomgång om förutsägelse av RT för Trouble Ticket (TT) och kundavhopp inom telekommunikationsföretag samt studier av företagets tillgängliga data för problemet. Därefter fokuserar forskningen på att analysera den tillhandahållna datamängden för experiment, förbehandling av datan enligt branschstandarder och till sist förutsägelser och analys av de erhållna prestandamätvärdena. Den föreslagna lösningen är utformad för att erbjuda en förbättrad lösning för företagets angivna uppgift. Framtida arbete kan innebära att öka antalet TV-ärenden per tjänst för att förbättra resultaten och utforska sambandet mellan löstider och kundavhoppbeslut.
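To show what the RF versus GBM comparison described above boils down to in code, here is a small scikit-learn sketch that fits both regressors on synthetic ticket features and compares their errors. The features, target function and error metric are invented for illustration; they are not the company's data, the 480-model setup or the exact evaluation used in the thesis.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)

# Synthetic stand-in for TV-case features: service type, hour of report,
# number of affected devices, and a priority flag.
n = 2000
service_type = rng.integers(0, 4, n)
hour = rng.integers(0, 24, n)
devices = rng.poisson(2, n)
priority = rng.integers(0, 2, n)
resolution_hours = (5 + 3 * service_type + 0.2 * hour + 1.5 * devices
                    - 2 * priority + rng.exponential(2, n))

X = np.column_stack([service_type, hour, devices, priority])
X_train, X_test, y_train, y_test = train_test_split(X, resolution_hours, random_state=42)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name}: MAE = {mae:.2f} hours")
```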
|