301 |
Multimodal Data Management in Open-world Environment
K M A Solaiman (16678431), 02 August 2023
<p>The availability of abundant multimodal data, including textual, visual, and sensor-based information, holds the potential to improve decision-making in diverse domains. Extracting decision-relevant information from heterogeneous and changing datasets in real-world data-centric applications requires complementary functionalities: multimodal data integration, knowledge extraction and mining, situation-aware data recommendation to different users, and uncertainty management in the open-world setting. To build a system that encompasses all of these functionalities, several challenges need to be addressed: (1) How to represent and analyze heterogeneous source contents and application context for multimodal data recommendation? (2) How to predict and fulfill current and future needs as new information streams in, without user intervention? (3) How to integrate disconnected data sources and learn information relevant to specific mission needs? (4) How to scale from processing petabytes of data to exabytes? (5) How to deal with uncertainties in the open world that stem from changes in data sources and user requirements?</p>
<p>This dissertation tackles these challenges by proposing novel frameworks, learning-based data integration and retrieval models, and algorithms that empower decision-makers to extract valuable insights from diverse multimodal data sources. The contributions of this dissertation can be summarized as follows: (1) We developed SKOD, a novel multimodal knowledge querying framework that overcomes data representation, scalability, and data completeness issues while utilizing streaming brokers and RDBMS capabilities, with entity-centric semantic features as an effective representation of content and context. Additionally, as part of the framework, we developed HART, a novel text attribute recognition model that leverages language models and syntactic properties of large unstructured texts. (2) Within the SKOD framework, we incrementally proposed three approaches for integrating the disconnected sources through their semantic features to build a common knowledge base with the user information need: (i) EARS: a mediator approach that uses schema mapping of the semantic features and SQL joins to address scalability challenges in data integration; (ii) FemmIR: a data integration approach for more susceptible and flexible applications that utilizes neural network-based graph matching techniques to learn coordinated graph representations of the data. It introduces a novel approach for creating graphs from the features and a novel similarity metric among data sources; (iii) WeSJem: an approach that allows zero-shot similarity matching and data discovery by using contrastive learning to embed data samples and query examples in a high-dimensional space, using features as a novel source of supervision instead of relevance labels. (3) Finally, to manage uncertainties in multimodal data management for open-world environments, we characterized novelties in multimodal information retrieval based on data drift. Moreover, we proposed a novelty detection and adaptation technique as an augmentation to WeSJem.</p>
<p>The effectiveness of the proposed frameworks, models, and algorithms was demonstrated through real-world system prototypes that solved open problems requiring large-scale human endeavors and computational resources. Specifically, these prototypes assisted law enforcement officers in automating investigations and finding missing persons.</p>
|
302 |
LEVERAGING MACHINE LEARNING FOR ENHANCED SATELLITE TRACKING TO BOLSTER SPACE DOMAIN AWARENESS
Charles William Grey (16413678), 23 June 2023
<p>Our modern society is more dependent on its assets in space than ever before. For example, the Global Positioning System (GPS) that many rely on for navigation uses data from a 24-satellite constellation. Additionally, our current infrastructure for gas pumps, cell phones, ATMs, traffic lights, weather data, etc. depends on satellite data from various constellations. As a result, it is increasingly necessary to accurately track and predict the space domain. In this thesis, after discussing how space object tracking and position prediction are currently done, I propose a machine learning-based approach to improving space object position prediction over the standard SGP4 method, whose prediction accuracy is limited to a time horizon of about 24 hours. Using this approach, I show that meaningful improvements over the standard SGP4 model can be achieved with a machine learning model based on a type of recurrent neural network called a long short-term memory (LSTM) network. I also provide distance predictions for 4 different space objects over time frames of 15 and 30 days. Future work in this area is likely to include extending and validating this approach on additional satellites to construct a more general model, testing a wider range of models to determine limits on accuracy across a broad range of time horizons, and proposing similar methods less dependent on antiquated data formats like the TLE.</p>
|
303 |
PROGRAM ANOMALY DETECTION FOR INTERNET OF THINGS
Akash Agarwal (13114362), 01 September 2022
<p>Program anomaly detection, which models normal program executions to detect deviations at runtime as cues for possible exploits, has become a popular approach to software security. To achieve high-performance modeling and complete tracing, however, existing techniques focus on subsets of applications, e.g., system calls or calls to predefined libraries. Because of this limited scope, they are insufficient to detect subtle control-oriented and data-oriented attacks that introduce new illegal call relationships at the application level. Such techniques are also hard to apply on devices that lack a clear separation between the OS and the application layer. This dissertation advances the design and implementation of program anomaly detection techniques by providing application context for library and system calls, making them powerful for detecting advanced attacks that manipulate intra- and inter-procedural control flow and decision variables.</p>
<p>This dissertation has two main parts. The first part describes LANCET, a statically initialized, generic calling-context program anomaly detection technique based on hidden Markov modeling that provides security against control-oriented attacks at program runtime. It also establishes an efficient execution tracing mechanism facilitated through source code instrumentation of applications. The second part describes EDISON, a program anomaly detection framework that provides security against data-oriented attacks using graph representation learning and language models for intra- and inter-procedural behavioral modeling, respectively.</p>
<p>This dissertation makes three high-level contributions. First, the concise descriptions demonstrate the design, implementation, and extensive evaluation of an aggregation-based anomaly detection technique using fine-grained, generic calling context-sensitive modeling that allows detection to scale over entire applications. Second, the precise descriptions show the design, implementation, and extensive evaluation of a detection technique that maps runtime traces to the program’s control-flow graph and leverages graphical feature representation to learn dynamic program behavior. Finally, this dissertation provides details and experience for designing program anomaly detection frameworks, from high-level concepts and design to low-level implementation techniques.</p>
|
304 |
VISUAL ANALYTICS OF BIG DATA FROM MOLECULAR DYNAMICS SIMULATION
Catherine Jenifer Rajam Rajendran (5931113), 03 February 2023
<p>Protein malfunction can cause human diseases, which makes proteins targets in the process of drug discovery. In-depth knowledge of how proteins function can contribute widely to understanding the mechanisms of these diseases. Protein functions are determined by protein structures and their dynamic properties. Protein dynamics refers to the constant physical movement of atoms in a protein, which may result in transitions between different conformational states of the protein. These conformational transitions are critically important for proteins to function. Understanding protein dynamics can help us understand and interfere with the conformational states and transitions, and thus with the function of the protein. If we can understand the mechanism of a protein's conformational transition, we can design molecules to regulate this process and thereby regulate protein function for new drug discovery. Protein dynamics can be simulated by Molecular Dynamics (MD) simulations.</p>
<p>The MD simulation data generated are spatio-temporal and therefore very high-dimensional. Analyzing the data depends heavily on distinguishing the various atomic interactions within a protein by interpreting their 3D coordinate values. Because the data are extremely large, the essential step is to make them interpretable by developing more efficient algorithms to reduce the dimensionality and user-friendly visualization tools to find patterns and trends that are not usually attainable by traditional data processing methods. Because of the typically allosteric, long-range nature of the interactions that lead to large conformational transitions, pinpointing the underlying forces and pathways responsible for the global conformational transition at the atomic level is very challenging. To address these problems, we developed a new program called Probing Long-distance interactions by Tapping into Paired-Distances (PLITIP) and used it to perform various analyses of the simulation data to better understand the mechanism of protein dynamics at the atomic level. PLITIP contains a set of new tools based on the analysis of paired distances, which removes the interference of the translation and rotation of the protein itself and can therefore capture the absolute changes within the protein.</p>
<p>Firstly, we developed a tool called Decomposition of Paired Distances (DPD). This tool generates a distance matrix of all paired residues from our simulation data. This paired distance matrix is therefore not subject to the interference of the translation or rotation of the protein and can capture the absolute changes within the protein. The matrix is then decomposed by DPD using Principal Component Analysis (PCA) to reduce dimensionality and to capture the largest structural variation. To showcase how DPD works, we applied it to two protein systems, HIV-1 protease and 14-3-3σ, both of which undergo large structural changes and conformational transitions, as displayed by their MD simulation trajectories. The largest structural variation and conformational transition were captured by the first principal component (PC1) in both cases. In addition, structural clustering and ranking of representative frames by their PC1 values revealed the long-distance nature of the conformational transition and pinpointed the key candidate regions that might be responsible for the large conformational transitions.</p>
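<p>The paired-distance decomposition described above can be illustrated with a minimal sketch. It is not the DPD implementation: the three-residue toy trajectory, the function names (<code>paired_distances</code>, <code>rigid_move</code>, <code>first_principal_component</code>), and the use of power iteration in place of a full PCA routine are all assumptions made for illustration. The sketch builds one vector of residue-pair distances per frame, shows that a rigid rotation and translation of the frame leaves those distances unchanged, and extracts PC1 from the resulting matrix.</p>

```python
import math


def paired_distances(frame):
    """Flatten the upper triangle of the residue-residue distance matrix."""
    n = len(frame)
    return [math.dist(frame[i], frame[j]) for i in range(n) for j in range(i + 1, n)]


def rigid_move(frame, angle, shift):
    """Rotate a frame about the z-axis and translate it (a rigid-body motion)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in frame]


def first_principal_component(rows, iterations=200):
    """Return (loadings, scores) of PC1 for mean-centred rows via power iteration."""
    n_rows, n_cols = len(rows), len(rows[0])
    means = [sum(r[c] for r in rows) / n_rows for c in range(n_cols)]
    centred = [[r[c] - means[c] for c in range(n_cols)] for r in rows]
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # arbitrary unit start vector
    for _ in range(iterations):
        # Apply X^T X to v without forming the covariance matrix explicitly.
        proj = [sum(x * w for x, w in zip(row, v)) for row in centred]
        v_new = [sum(proj[k] * centred[k][c] for k in range(n_rows))
                 for c in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in v_new))
        if norm == 0.0:
            break  # no variance at all in the data
        v = [x / norm for x in v_new]
    scores = [sum(x * w for x, w in zip(row, v)) for row in centred]
    return v, scores


# Hypothetical toy trajectory: residue 2 drifts away from residues 0 and 1,
# while the whole "protein" also tumbles and translates from frame to frame.
reference_frame = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
trajectory = []
for k in range(6):
    open_frame = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0 + 0.5 * k, 0.0)]
    trajectory.append(rigid_move(open_frame, angle=0.7 * k, shift=(1.0 * k, -2.0 * k, 3.0)))

# One row per frame; columns are the pairs (0,1), (0,2), (1,2).
rows = [paired_distances(frame) for frame in trajectory]
loadings, scores = first_principal_component(rows)
```

<p>In this toy case the distance between residues 0 and 1 never changes, so its PC1 loading is essentially zero, while the PC1 scores order the frames by how far residue 2 has moved, regardless of the tumbling. A real analysis would more likely use a numerical library for the PCA step.</p>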
<p>Secondly, to facilitate further identification of the long-distance path, we developed a tool called Pearson Coefficient Spiral (PCP), which computes and visualizes the Pearson coefficient to measure the linear correlation between any two sets of residue pairs. PCP allows users to fix one residue pair and examine the correlation of its change with other residue pairs.</p>
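<p>The idea of fixing one residue pair and correlating its distance changes with those of other pairs can be sketched as follows. This is an illustrative example only, not the PCP tool: the residue labels, the distance values, the helper names (<code>pearson</code>, <code>correlate_with_reference</code>), and the choice to report 0.0 for a pair whose distance never changes are assumptions.</p>

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0.0 or spread_y == 0.0:
        return 0.0  # assumption: a constant series is reported as uncorrelated
    return cov / (spread_x * spread_y)


def correlate_with_reference(distances, reference):
    """Correlate the reference pair's distance series with every other pair's."""
    ref_series = distances[reference]
    return {pair: pearson(ref_series, series)
            for pair, series in distances.items() if pair != reference}


# Hypothetical per-frame distances (in angstroms) for four residue pairs.
distances = {
    ("A10", "B50"): [5.0, 6.0, 7.0, 8.0],
    ("A12", "B48"): [10.0, 12.0, 14.0, 16.0],  # opens together with the reference
    ("A30", "B30"): [9.0, 8.0, 7.0, 6.0],      # closes as the reference opens
    ("A70", "B05"): [4.0, 4.0, 4.0, 4.0],      # does not move at all
}

correlations = correlate_with_reference(distances, ("A10", "B50"))
```

<p>Pairs with coefficients near +1 or -1 change in step with, or opposite to, the fixed pair, which is the kind of signal one would inspect when tracing a long-distance path.</p>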
<p>Thirdly, we developed a set of visualization tools that generate paired atomic distances for the shortlisted candidate residues and capture significant interactions among them. The first tool is the Residue Interaction Network Graph for Paired Atomic Distances (NG-PAD), which not only generates paired atomic distances for the shortlisted candidate residues but also displays significant interactions as a network graph for convenient visualization. The second, the Chord Diagram for Interaction Mapping (CD-IP), maps the interactions to protein secondary structural elements to further narrow down important interactions. The third, Distance Plotting for Direct Comparison (DP-DC), plots any two paired distances of the user’s choice, at either the residue or the atomic level, to facilitate identification of similar or opposite patterns of distance change over the simulation time. Together, these PLITIP tools enabled us to identify critical residues contributing to the large conformational transitions in both the HIV-1 protease and 14-3-3σ proteins.</p>
<p>Besides the above major project, a side project that developed tools to study protein pseudo-symmetry is also reported. It has been proposed that symmetry provides protein stability, opportunities for allosteric regulation, and even functionality. These tools help us answer the questions of why there is a deviation from perfect symmetry in proteins and how to quantify it.</p>
|
305 |
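Synthetic Data Generation for Unsupervised Out-of-Domain Adaptation in Question Answering: Rule-Based Methods versus Neural Networks (original title: Génération de données synthétiques pour l'adaptation hors-domaine non-supervisée en réponse aux questions : méthodes basées sur des règles contre réseaux de neurones)
Duran, Juan Felipe 02 1900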
Question answering models have shown impressive results on several question answering datasets and tasks. However, when they are tested on out-of-domain datasets, performance decreases. To avoid manually annotating training data from the new domain, question-answer pairs can be generated synthetically from unannotated data. In this work, we are interested in the generation of synthetic data, and we test different natural language processing methods for the two steps of dataset creation: question generation and answer generation. We use the generated datasets to train the QA models UnifiedQA and Bert-QA, and we test them on SCIQ, an out-of-domain dataset about physics, chemistry, and biology for multiple-choice question answering (MCQA), and on HotpotQA, TriviaQA, NatQ and SearchQA, four out-of-domain datasets for question answering (QA). This procedure allows us to evaluate and compare rule-based methods with neural network methods. We show that rule-based methods yield superior results for the multiple-choice question-answering task, but neural network methods generally produce better results for the question-answering task. However, we also observe that, occasionally, rule-based methods can complement neural network methods and produce competitive results when Bert-QA is trained with synthetic datasets derived from both methods.
|
306 |
Malicious Intent Detection Framework for Social Networks
Fausak, Andrew Raymond 05 1900
Many, if not all, people have online social accounts (OSAs) on an online community (OC) such as Facebook (Meta), Twitter (X), Instagram (Meta), Mastodon, and Nostr. OCs enable quick and easy interaction with friends, family, and even online communities to share information. There is also a dark side to OCs, where users with malicious intent join OC platforms for the purpose of criminal activities such as spreading fake news and misinformation, cyberbullying, propaganda, phishing, stealing, and unjust enrichment. These criminal activities are especially concerning when they harm minors. Detection and mitigation are needed to protect and help OCs and to stop these criminals from harming others. Many solutions exist; however, they are typically focused on a single category of malicious intent detection rather than an all-encompassing solution. To answer this challenge, we propose the first steps of a framework for analyzing and identifying malicious intent in OCs, which we refer to as the malicious intent detection framework (MIDF). MIDF is an extensible proof-of-concept that uses machine learning techniques to enable detection and mitigation. The framework will first be used to detect malicious users using solely relationships and can then be leveraged to create a suite of malicious intent vector detection models, including phishing, propaganda, scams, cyberbullying, racism, spam, and bots, for open-source online social networks such as Mastodon and Nostr.
|
307 |
Reparametrization in deep learning
Dinh, Laurent 02 1900
No description available.
|
308 |
Dynamics of Forest Ecosystems Under Global Change: Applications of Artificial Intelligence in Mapping, Classification, and Projection
Akane Ota Abbasi (17123185), 10 October 2023
<p dir="ltr">Global forest ecosystems provide essential ecosystem services that contribute to water and climate regulation, food production, recreation, and raw materials. They also serve as crucial habitats for numerous terrestrial species of amphibians, birds, and mammals worldwide. However, recent decades have witnessed unprecedented changes in forest ecosystems due to climate change, shifts in species distribution patterns, increased planted forest areas, and various disturbances such as forest fires, insect infestations, and urbanization. These changes can have far-reaching impacts on ecological networks, human well-being, and the well-being of global forest ecosystems. To address these challenges, I present four studies to quantify forest dynamics through mapping, classification, and projection, using artificial intelligence tools in combination with a vast amount of training data. (I) I present a spatially continuous map of planted forest distribution across East Asia, produced by integrating multiple sources of planted and natural forest data. I found that China contributed 87% of the total planted forest areas in East Asia, most of which are located in the lowland tropical/subtropical regions and Sichuan Basin. I also estimated the dominant genus in each planted forest location. (II) I used continent-wide forest inventory data to compare the range shifts of forest types and their constituent tree species in North America in the past 50 years. I found that forest types shifted more than three times as fast as the average of their constituent tree species. This marked difference was attributable to a predominant positive covariance between tree species ranges and the change of species relative abundance. (III) Based on individual-level field surveys of trees and breeding birds across North America, I characterized New World wood-warbler (<i>Parulidae</i>) species richness and its potential drivers. 
I identified forest type as the most powerful predictor of New World wood-warbler species richness, which adds valuable evidence to the ongoing physiognomy versus composition debate among ornithologists. (IV) In the appendix, I used continent-wide forest inventory data from North America and South America and a combination of supervised and unsupervised machine learning algorithms to produce the first data-driven map of forest types in the Americas. The resulting distribution of forest types is useful for cost-effective forest and biodiversity management and planning. Taken together, these studies provide insight into the dynamics of forest ecosystems at a large geographic scale and have implications for effective decision-making in conservation, management, and global restoration programs in the midst of ongoing global change.</p>
|
309 |
EXPLORING GRAPH NEURAL NETWORKS FOR CLUSTERING AND CLASSIFICATION
Fattah Muhammad Tahabi (14160375), 03 February 2023
<p><strong>Graph Neural Networks</strong> (GNNs) have become exceedingly popular and prominent deep learning techniques for analyzing graph-structured data because of their ability to solve complex real-world problems. Because graphs provide an efficient way to represent abstract concepts, modern research seeks to overcome the limitations of classical graph theory, which requires prior knowledge of the graph structure before traditional algorithms can be employed. GNNs, an impressive framework for representation learning on graphs, have already produced many state-of-the-art techniques for node classification, link prediction, and graph classification tasks. GNNs can learn meaningful representations of graphs by incorporating topological structure, node attributes, and neighborhood aggregation to solve supervised, semi-supervised, and unsupervised graph-based problems. In this study, the usefulness of GNNs has been analyzed primarily from two aspects: <strong>clustering and classification</strong>. We focus on these two techniques because they are the most popular strategies in data mining for making sense of collected data and performing predictive analysis.</p>
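<p>As a rough illustration of the neighborhood aggregation mentioned above, the sketch below performs one round of mean aggregation on a tiny graph. It is a simplification under stated assumptions rather than any model from the thesis: the graph, the feature values, and the function name <code>mean_aggregate</code> are hypothetical, and the learned weight matrices and nonlinearities that a real GNN layer applies are deliberately left out.</p>

```python
def mean_aggregate(features, adjacency):
    """One round of neighbourhood aggregation: each node's new vector is the
    mean of its own features and those of its direct neighbours."""
    updated = {}
    for node, own in features.items():
        group = [own] + [features[nbr] for nbr in adjacency.get(node, [])]
        updated[node] = [sum(column) / len(group) for column in zip(*group)]
    return updated


# Hypothetical path graph a - b - c with two-dimensional node attributes.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
features = {"a": [1.0, 0.0], "b": [0.0, 0.0], "c": [0.0, 1.0]}

hidden = mean_aggregate(features, adjacency)        # information from 1 hop away
hidden_2 = mean_aggregate(hidden, adjacency)        # information from 2 hops away
```

<p>After one round, node b already mixes the attributes of a and c; after two rounds, node a carries a trace of c's attribute even though the two are not directly connected. Stacking such rounds is how topology and node attributes are combined into a single representation.</p>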
|