  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.

Higher-Ordered Feedback Architectures : a Comparison

Jason, Henrik January 2002
The aim of this dissertation is to investigate the application of higher-ordered feedback architectures, used as control systems for an autonomous robot, to delayed response task problems in the area of evolutionary robotics. For the two architectures of interest, a theoretical study and a practical experimental study are conducted to examine how these architectures cope with the road-sign problem and extended versions of it. The theoretical study focuses on the features of the architectures and on how they behaved in different kinds of road-sign problem environments in earlier work. Based on this study, two problem environments are chosen for practical experiments: the three-way and the multiple stimuli road-sign problems. Both architectures seem to cope with the three-way road-sign problem. However, both architectures are shown to have difficulties solving the multiple stimuli road-sign problem in the experimental setting used. This work leads to two insights into the way these architectures cope with and behave in the three-way road-sign problem environment and delayed response tasks. The robot seems to learn to explicitly relate its actions to the different stimuli settings that it is exposed to. Firstly, both architectures form higher abstracted representations of the inputs from the environment. These representations are used to guide the robot's actions in those situations where the raw input was not enough to select the correct actions. Secondly, it seems to be enough to have two internal representations of stimuli settings and to offload the remaining stimuli settings, relying on the raw input from the environment, to solve the three-way road-sign problem. The dissertation serves as an overview for researchers new to the area and as a starting point for further investigations of higher-ordered feedback architectures.

Locally Optimized Mapping of Slum Conditions in a Sub-Saharan Context: A Case Study of Bamenda, Cameroon

Anchang, Julius 18 November 2016
Despite being an indicator of modernization and macro-economic growth, urbanization in regions such as Sub-Saharan Africa is tightly interwoven with poverty and deprivation. This has manifested physically as slums, which represent the worst residential urban areas, marked by lack of access to good quality housing and basic services. To effectively combat the slum phenomenon, local slum conditions must be captured in quantitative and spatial terms. However, there are significant hurdles to this. Slum detection and mapping requires readily available and reliable data, as well as a proper conceptualization of measurement and scale. Using Bamenda, Cameroon, as a test case, this dissertation research was designed as a three-pronged attack on the slum mapping problematic. The overall goal was to investigate locally optimized slum mapping strategies and methods that utilize high resolution satellite image data, household survey data, simple machine learning and regionalization theory. The first major objective of the study was to tackle a "measurement" problem. The aim was to explore a multi-index approach to measure and map local slum conditions. The rationale behind this was that prior sub-Saharan slum research too often used simplified measurement techniques such as a single unweighted composite index to represent diverse local slum conditions. In this study six household indicators relevant to the United Nations criteria for defining slums were extracted from a 2013 Bamenda household survey data set and aggregated for 63 local statistical areas. The extracted variables were the percent of households having the following attributes: more than two residents per room, non-owner, occupying a single room or studio, having no flush toilet, having no piped water, having no drainage. Hierarchical variable clustering was used as a surrogate for exploratory factor analysis to determine fewer latent slum factors from these six variables. 
Variable groups were classified such that the most correlated variables fell in the same group while non-correlated variables fell in separate groups. Each group membership was then examined to see if the group suggested a conceptually meaningful slum factor which could be quantified as a stand-alone "high" and "low" binary slum index. Results showed that the slum indicators in the study area could be replaced by at least two meaningful and statistically uncorrelated latent factors. One factor reflected the home occupancy conditions (tenancy status, overcrowding and living space conditions) and was quantified using K-means clustering of units as an ‘occupancy disadvantage index’ (Occ_D). The other reflected the state of utilities access (piped water and flush toilet) and was quantified as a ‘utilities disadvantage index’ (UT_D). Location attributes were used to examine and validate both indices. Independent t-tests showed that units with high Occ_D were on average closer to the nearest town markets and major roads than units with low Occ_D. This was consistent with theory, as typical slum residents (in this case overcrowded and non-owner households) are expected to favor accessibility to areas of high economic activity. However, UT_D showed no such strong pattern. The second major objective was to tackle a "learning" problem. The purpose was to explore the potential of unsupervised machine learning to detect or "learn" slum conditions from image data. The rationale was that such an approach would be efficient and less reliant on prior knowledge and expertise. A 2012 GeoEye image scene of the study area was subjected to image classification, from which the following physical settlement attributes were quantified for each of the 63 statistical areas: percent roof area, percent open space area, percent bare soil, percent paved road surface, percent dirt road surface, and building shadow-to-roof area ratio. 
The shadow-to-roof ratio was an innovative measure used to capture the size and density attributes of buildings. In addition to the 6 image-derived variables, the mean slope of each area was calculated from a digital elevation dataset. All 7 attributes were subjected to principal component analysis, from which the first 2 components were extracted and used for hierarchical clustering of statistical areas to derive physical types. Results show that area units could be optimally classified into 4 physical types, labelled generically as Categories 1 to 4, each with at least one defining physical characteristic. Kruskal-Wallis tests comparing physical types in terms of household and location attributes showed that at least two physical types differed in aggregated household slum conditions and location attributes. Category 4 areas, located on steep slopes and having a high shadow-to-roof ratio, had the highest distribution of non-owner households. They were also located close to the nearest town markets. They were thus the most likely candidates for slums in the city. Category 1 units, on the other hand, located at the outskirts and having abundant open space, were least likely to have slum conditions. The third major objective was to tackle the problem of "spatial scale". Neighborhoods, by their very nature of contiguity and homogeneity, represent an ideal scale for urban spatial analysis and mapping. Unfortunately, in most areas, neighborhoods are not objectively defined, and slum mapping often relies on arbitrary spatial units which do not capture the true extent of the phenomenon. The objective was thus to explore the use of analytic regionalization to quantitatively derive the neighborhood unit for mapping slums. Analytic neighborhoods were created by spatially constrained clustering of statistical areas using the minimum spanning tree algorithm. 
Unlike previous studies that relied on socio-economic and/or demographic information, this study innovatively used multiple land cover and terrain attributes as neighborhood homogenizing factors. Five analytic neighborhoods (labeled Regions 1-5) were created this way and compared using Kruskal-Wallis tests for differences in household slum attributes. This was to determine the largest possible contiguous areas that could be labeled as slum or non-slum neighborhoods. The results revealed that at least two analytic regions were significantly different in terms of aggregated household indicators. Region 1 stood apart as having significantly higher distributions of overcrowded and non-owner households. It could thus be viewed as the largest potential slum neighborhood in the city. In contrast, Region 3 (located at higher elevation and separated from the rest of the city by a steep escarpment) was generally associated with a low distribution of household slum attributes and could be considered the strongest model of a non-slum or formal neighborhood. Both Regions 1 and 3 were also qualitatively correlated with two locally recognized (vernacular) neighborhoods. These neighborhoods, "Sisia" (for Region 1) and "Up Station" (for Region 3), are commonly perceived by local folk as occupying opposite ends of the socio-economic spectrum. The results obtained by successfully carrying out the three major objectives have major implications for future research and policy. The multi-index analysis of slum conditions affirms the notion that the slum phenomenon is diverse in the local context and that remediation efforts must be compartmentalized to be effective. The results of unsupervised slum mapping from imagery show that it is a tool with high potential for rapid slum assessment even when there is no supporting field data. 
Finally, the results of analytic regionalization showed that the true extent of contiguous slum neighborhoods can be delineated objectively using land cover and terrain attributes. It thus presents an opportunity for local planning and policy actors to consider redesigning the city neighborhood districts as analytic units. Quantitatively derived neighborhoods are likely to be more useful in the long term, be it for spatial sampling, mapping or planning purposes.
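
The spatially constrained clustering mentioned in this abstract can be illustrated with a small sketch. It is not the dissertation's implementation: the function name `mst_regions`, the input layout, and the rule of cutting the heaviest tree edges are assumptions made for illustration (SKATER-style methods typically choose cuts by a within-region homogeneity criterion instead). The idea is to build a minimum spanning tree over contiguous areas only, weight each edge by attribute dissimilarity, and then remove edges so that the tree falls apart into contiguous regions.

```python
import math


def mst_regions(attrs, adjacency, n_regions):
    """Group contiguous areas into regions by cutting a minimum spanning tree.

    attrs:     dict mapping area id -> tuple of attribute values (hypothetical layout)
    adjacency: list of (area, area) pairs that share a border
    n_regions: number of regions wanted
    """
    def find(parent, x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Only contiguous pairs become edges; weight = Euclidean attribute distance.
    edges = sorted((math.dist(attrs[a], attrs[b]), a, b) for a, b in adjacency)

    # Kruskal's algorithm: accept the lightest edges that do not close a cycle.
    parent = {a: a for a in attrs}
    mst = []
    for w, a, b in edges:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb
            mst.append((w, a, b))

    # Simplification (assumption): drop the n_regions - 1 heaviest tree edges.
    kept = mst[: max(len(mst) - (n_regions - 1), 0)]

    # The connected components of the remaining forest are the regions.
    parent = {a: a for a in attrs}
    for _, a, b in kept:
        parent[find(parent, a)] = find(parent, b)
    groups = {}
    for a in attrs:
        groups.setdefault(find(parent, a), []).append(a)
    return sorted(sorted(g) for g in groups.values())


# Toy example: four areas in a chain, with two clearly different attribute profiles.
areas = {"A": (0.0, 0.0), "B": (0.1, 0.0), "C": (5.0, 5.0), "D": (5.1, 5.0)}
borders = [("A", "B"), ("B", "C"), ("C", "D")]
print(mst_regions(areas, borders, 2))
```

Because edges exist only between bordering areas, every resulting region is spatially contiguous by construction, which is the property that distinguishes regionalization from ordinary clustering.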

Sélection de corpus en traduction automatique statistique / Efficient corpus selection for statistical machine translation

Abdul Rauf, Sadaf 17 January 2012
In our world of international communications, machine translation has become an essential key technology. Several approaches exist, but in recent years so-called Statistical Machine Translation (SMT) has been considered the most promising. In this approach, all knowledge is automatically extracted from examples of translations, called parallel texts, and from monolingual data in the target language. Statistical machine translation is a data-driven process. This is commonly put forward as a great advantage of statistical approaches, since no intervention by bilingual humans is required, but it can also turn into a problem when the data needed to develop the system are not available, are too small, or come from an inappropriate domain. The research presented in this thesis is an attempt to overcome one of the barriers to massive deployment of statistical machine translation systems: the lack of parallel corpora. A parallel corpus is a collection of sentences in source and target languages that are aligned at the sentence level. Most existing parallel corpora were produced by professional translators. This is an expensive task in terms of money, human resources and time. This thesis provides methods to overcome this need by exploiting the huge, easily available collections of comparable and monolingual data. We present two effective architectures to achieve this. In the first part of this thesis, we worked on the use of comparable corpora to improve statistical machine translation systems. A comparable corpus is a collection of texts in multiple languages, collected independently, but often containing parts that are mutual translations. The size and quality of the parallel content may vary considerably from one comparable corpus to another, depending on various factors, including the method used to construct the corpus. In any case, it is not easy to automatically identify the parallel parts. As part of this thesis, we developed an approach which is entirely based on freely available tools. 
The main idea of our approach is to use a statistical machine translation system to translate all source-language sentences of the comparable corpus into the target language. Each of these translations is then used as a query to identify potentially parallel sentences in the target-language side of the comparable corpus. This search is carried out using an information retrieval toolkit. In the second step, the retrieved sentences are compared to the automatic translation to determine whether they are parallel to the corresponding sentence in the source language. Several criteria were evaluated, such as word error rate, the translation edit rate (TER) and TERp. We conducted a very detailed experimental analysis to demonstrate the value of our approach. We worked on comparable corpora from the news domain, more specifically, news wires from multilingual news agencies such as "Agence France-Presse (AFP)", "Associated Press" and "Xinhua News". These agencies publish daily news in several languages. We were able to extract parallel texts from large collections of over three hundred million words for the French-English and Arabic-English language pairs. These parallel texts significantly improved our statistical translation systems. We also present a theoretical comparison of the model developed in this thesis with another approach presented in the literature. Various extensions are also discussed: automatic extraction of unknown words and the creation of a dictionary, detection and suppression of extra information, etc. In the second part of this thesis, we examined the possibility of using monolingual data to improve the translation model of a statistical system. The idea here is to replace parallel data by monolingual source or target language data. This research is thus placed in the context of unsupervised learning, since missing translations are produced by an automatic translation system and, after various filtering steps, reinjected into the system...
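
The second step of the approach in this abstract, comparing retrieved sentences with the automatic translation, can be sketched with word error rate. This is an illustrative sketch under stated assumptions, not the thesis's code: the names `word_error_rate` and `filter_parallel`, whitespace tokenization, and the 0.5 threshold are all hypothetical choices.

```python
def word_error_rate(hypothesis: str, reference: str) -> float:
    """Word-level edit distance between the two sentences, divided by
    the number of words in `reference`."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] holds the edit distance between the hypothesis words
    # processed so far and the first j reference words.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # drop a hypothesis word
                            curr[j - 1] + 1,      # add a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1] / len(ref)


def filter_parallel(translation, candidates, max_wer=0.5):
    """Keep retrieved candidates that are close enough to the machine
    translation of the source sentence (threshold is an assumption)."""
    return [c for c in candidates
            if word_error_rate(translation, c) <= max_wer]


mt_output = "the president visited paris today"
retrieved = ["the president visited paris yesterday",
             "stock markets fell sharply"]
print(filter_parallel(mt_output, retrieved))
```

A low threshold favors precision (fewer but cleaner sentence pairs), while a higher one favors recall; the abstract reports that several such criteria were evaluated, so the single cut-off here is only a stand-in.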

Avaliação de métodos não-supervisionados de seleção de atributos para mineração de textos / Evaluation of unsupervised feature selection methods for Text Mining

Bruno Magalhães Nogueira 27 March 2009
Feature selection is sometimes a necessary activity for obtaining good results in machine learning tasks. In Text Mining, reducing the number of features in a text base is essential for the effectiveness of the process and the comprehensibility of the extracted knowledge, since one deals with sparse, high-dimensional spaces. When the text collection is not labeled, unsupervised methods for feature reduction have to be used. However, there is no general predefined way of obtaining feature quality measures for unsupervised methods, so applying them demands greater effort. This work therefore addresses unsupervised feature selection through an exploratory study of methods of this kind, comparing their efficacy in reducing the number of features in Text Mining applications. Ten methods are compared: Ranking by Term Frequency, Ranking by Document Frequency, Term Frequency-Inverse Document Frequency, Term Contribution, Term Variance, Term Variance Quality, Luhn's Method, LuhnDF Method, Salton's Method and Zone-Scored Term Frequency. Two of them, the LuhnDF Method and Zone-Scored Term Frequency, are proposed in this work. 
The evaluation is carried out in two ways: supervised, through the accuracy of four classifiers (C4.5, SVM, KNN and Naïve Bayes), and unsupervised, using the Expected Mutual Information Measure. The Kruskal-Wallis statistical test is applied to the evaluation results in order to determine the statistical significance of the performance differences between the feature selection methods compared. Six text bases are used in the experimental evaluation, each one related to one broad domain and containing subdomains, which correspond to the classes used for supervised evaluation. Through this study, this work aims to contribute to a Text Mining application that extracts topic taxonomies from unlabeled text collections by selecting the most representative features in a text collection. The evaluation results show that there is no statistically significant difference between the unsupervised feature selection methods compared. Moreover, comparisons of these unsupervised methods with supervised ones (Gain Ratio and Information Gain) show that it is possible to use unsupervised methods in supervised Text Mining activities, obtaining performance comparable to that of supervised methods, since no statistical difference was detected in these comparisons, and at a lower computational cost.
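
Two of the simpler ranking methods named in this abstract can be sketched as follows. This is an illustration rather than the author's implementation: the function name `rank_terms` and the whitespace tokenizer are hypothetical, and the Term Variance score is assumed here to be the sum of squared deviations of a term's per-document frequency from its mean, which is the definition commonly given in the literature.

```python
from collections import Counter


def rank_terms(docs, top_k, method="df"):
    """Return the top_k terms of an unlabeled collection.

    method="df": Document Frequency, the number of documents containing the term.
    method="tv": Term Variance, assumed to be sum_d (tf(t, d) - mean_tf(t))^2.
    """
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    n_docs = len(docs)
    scores = {}
    for term in vocabulary:
        freqs = [c[term] for c in counts]  # a Counter returns 0 for missing terms
        if method == "df":
            scores[term] = sum(1 for f in freqs if f > 0)
        elif method == "tv":
            mean = sum(freqs) / n_docs
            scores[term] = sum((f - mean) ** 2 for f in freqs)
        else:
            raise ValueError(f"unknown method: {method}")
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(vocabulary, key=lambda t: (-scores[t], t))[:top_k]


collection = ["apple apple apple banana", "banana cherry", "banana"]
print(rank_terms(collection, 2, method="df"))
print(rank_terms(collection, 2, method="tv"))
```

On this toy collection the two methods disagree: a term that appears once in every document has the highest document frequency but zero variance, which shows why the choice of ranking criterion can matter even though neither uses class labels.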

Resolução de correferência em múltiplos documentos utilizando aprendizado não supervisionado / Co-reference resolution in multiples documents through unsupervised learning

Jefferson Fontinele da Silva 05 May 2011
One of the problems found in Natural Language Processing (NLP) systems is the difficulty of identifying which textual elements refer to the same entity. This phenomenon, in which a set of textual elements refers to a single entity, is called coreference. Coreference resolution systems can improve the performance of various NLP applications, such as automatic summarization, information extraction and question answering. Recently, research in NLP has explored the possibility of identifying coreferent elements across multiple documents. In this context, this work focuses on the development of an unsupervised learning method for coreference resolution in multiple documents, using Portuguese as the target language. To date, no system with this purpose is known for Portuguese. The results of the experiments with the system suggest that the developed method is superior to methods based on string matching.

Graph neural networks for spatial gene expression analysis of the developing human heart

Yuan, Xiao January 2020
Single-cell RNA sequencing and in situ sequencing were combined in a recent study of the developing human heart to explore the transcriptional landscape at three developmental stages. However, the method used in the study to create the spatial cellular maps has some limitations. It relies on image segmentation of the nuclei and cell types defined in advance by single-cell sequencing. In this study, we applied a new unsupervised approach based on graph neural networks on the in situ sequencing data of the human heart to find spatial gene expression patterns and detect novel cell and sub-cell types. In this thesis, we first introduce some relevant background knowledge about the sequencing techniques that generate our data, machine learning in single-cell analysis, and deep learning on graphs. We have explored several graph neural network models and algorithms to learn embeddings for spatial gene expression. Dimensionality reduction and cluster analysis were performed on the embeddings for visualization and identification of biologically functional domains. Based on the cluster gene expression profiles, locations of the clusters in the heart sections, and comparison with cell types defined in the previous study, the results of our experiments demonstrate that graph neural networks can learn meaningful representations of spatial gene expression in the human heart. We hope further validations of our clustering results could give new insights into cell development and differentiation processes of the human heart.

Three Facets of Online Political Networks: Communities, Antagonisms, and Polarization

January 2019
Millions of users leave digital traces of their political engagements on social media platforms every day. Users form networks of interactions, produce textual content, and like and share each other's content. This creates an invaluable opportunity to better understand the political engagements of internet users. In this proposal, I present three algorithmic solutions to three facets of online political networks: detection of communities, detection of antagonisms, and the impact of certain types of accounts on political polarization. First, I develop a multi-view community detection algorithm to find politically pure communities. I find that, among the other content types (i.e. hashtags, URLs), word usage complements user interactions best in accurately detecting communities. Second, I focus on detecting negative linkages between politically motivated social media users. Major social media platforms do not provide their users with built-in negative interaction options. However, many political network analysis tasks rely not only on positive but also on negative linkages. Here, I present the SocLSFact framework to detect negative linkages among social media users. It utilizes three pieces of information: sentiment cues of textual interactions, positive interactions, and socially balanced triads. I evaluate the contribution of each of the three aspects to negative link detection performance on multiple tasks. Third, I propose an experimental setup that quantifies the polarization impact of automated accounts on Twitter retweet networks. I focus on a dataset of the tragic Parkland shooting event and its aftermath. I show that when automated accounts are removed from the retweet network, network polarization decreases significantly, whereas when the same number of accounts is removed at random, the difference is not significant. 
I also find that the prominent predictors of engagement with automatically generated content are not very different from those that previous studies identify for engaging content on social media in general. Last but not least, I identify accounts which self-disclose their automated nature in their profile by using expressions such as bot, chat-bot, or robot. I find that human engagement with self-disclosing accounts is much smaller than with non-disclosing automated accounts. This observational finding can motivate further efforts in automated account detection research to prevent their unintended impact. / Dissertation/Thesis / Doctoral Dissertation Computer Science 2019


Junhan Zhao (11024559) 25 June 2021
Enabling human understanding of high-dimensional (HD) data is critical for scientific research but highly challenging. For large datasets, probabilistic non-linear dimensionality reduction (DR) models such as UMAP and t-SNE lead in performance at reducing high dimensionality. However, given the trade-off between global and local structure preservation and the random initialization used in the computation, applying non-linear models with different parameter settings to data of unknown high-dimensional structure may return different 2D visual forms. Many critical neighborhood relationships may be falsely imposed, and uncertainty may be introduced into the low-dimensional embedding visualizations; this is called distortion. In this work, a survey is conducted of state-of-the-art layout enrichment work for interpreting dimensionality reduction methods and results. Responding to the lack of visual interpretation techniques for probabilistic DR methods, we propose a visualization technique called ManiGraph, which helps users explore multi-view 2D embeddings via mesoscopic structure graphs. A dynamic mesoscopic structure first subsets the HD data with a hexagonal grid in the visual space of a non-linear embedding (e.g., UMAP). Then, it measures regionally adapted trustworthiness and continuity, and visualizes the restored missing connections and highlighted false connections between subsets, from the high-dimensional space to the low-dimensional one, in a node-link manner. The visualization helps users understand and interpret distortion arising at both the visualization and model stages. We further demonstrate use cases tested on intuitive 3D toy datasets, Fashion-MNIST, and single-cell RNA sequencing data with domain experts in unsupervised scenarios. This work will potentially benefit the data science community, from toolkit users to DR algorithm developers.
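
The trustworthiness measure mentioned in this abstract can be sketched in its standard global form. The abstract describes a regionally adapted variant, which is not reproduced here; the function names and the brute-force neighbor search below are assumptions for illustration. With $n$ points, neighborhood size $k$, $U_k(i)$ the points that are among the $k$ nearest neighbors of $i$ in the embedding but not in the original space, and $r(i, j)$ the rank of $j$ among the neighbors of $i$ in the original space, the usual definition is $T(k) = 1 - \frac{2}{nk(2n - 3k - 1)} \sum_i \sum_{j \in U_k(i)} (r(i, j) - k)$.

```python
import math


def _neighbour_order(points, i):
    """Indices of all other points, nearest to point i first."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: (math.dist(points[i], points[j]), j))


def trustworthiness(high, low, k):
    """Global trustworthiness of the embedding `low` of the data `high`.

    A value of 1.0 means no point has a 'false' neighbor in the embedding,
    i.e. one that is not also among its k nearest neighbors originally.
    """
    n = len(high)
    if not 0 < k < n / 2:
        raise ValueError("k must satisfy 0 < k < n / 2")
    penalty = 0
    for i in range(n):
        high_order = _neighbour_order(high, i)
        rank = {j: r for r, j in enumerate(high_order, start=1)}
        high_knn = set(high_order[:k])
        for j in _neighbour_order(low, i)[:k]:
            if j not in high_knn:
                # False neighbor: penalize by how far away it really was.
                penalty += rank[j] - k
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))


original = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
faithful = [(0,), (1,), (2,), (3,), (4,)]
scrambled = [(0,), (4,), (1,), (3,), (2,)]
print(trustworthiness(original, faithful, 1))
print(trustworthiness(original, scrambled, 1))
```

Continuity, the companion measure, penalizes true neighbors that the embedding separates; with this definition it can be obtained by swapping the arguments, as in `trustworthiness(low, high, k)`.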


Bracci, Lorenzo, Namazi, Amirhossein January 2021 (has links)
With the advancement of the Internet of Things and the digitization of societies, sensors recording time series data can be found in an ever-increasing number of places, including proximity sensors on cars, temperature sensors in manufacturing plants, and motion sensors inside smart homes. Society's ever-increasing reliance on these devices leads to a need to detect unusual behaviour, which could be caused by a malfunctioning sensor or by the detection of an uncommon event. Such unusual behaviour is often referred to as an anomaly. To detect anomalous behaviour, advanced technologies combining mathematics and computer science, often grouped under the umbrella of machine learning, are frequently used. To help machines learn valuable patterns, human supervision is often needed, which in this case would mean using recordings that a person has already classified as anomalies or normal points. Unfortunately, labelling data is time-consuming, especially for the large datasets created from sensor recordings. Therefore, this thesis evaluates techniques that require no supervision for anomaly detection. Several different machine learning models are trained on different datasets to better understand which techniques perform better under different requirements, such as a smaller dataset or stricter limits on inference time. Of the models evaluated, OCSVM gave the best overall performance, achieving an accuracy of 85%, and K-means was the fastest model, taking 0.04 milliseconds to run inference on one sample. Furthermore, LSTM-based models showed the greatest potential for improvement with larger datasets.
/ With the development of the Internet of Things and the digitization of society, time series data can be recorded in more and more places, including through proximity sensors on cars, temperature sensors in manufacturing plants, and motion sensors in smart homes. Society's ever-increasing dependence on these devices leads to a need to detect unusual behaviour, which may be caused by a sensor malfunction or by the detection of an unusual event. The unusual behaviour in question is often called an anomaly. To detect anomalous behaviour, advanced techniques combining mathematics and computer science, often called machine learning, are used. To help machines learn valuable patterns, human supervision is often needed, which in this case would correspond to using recordings that a person has already classified as anomalies or normal points. Unfortunately, labelling data is time-consuming, especially the large datasets created from sensor recordings. Therefore, this thesis evaluates techniques that require no supervision for performing anomaly detection. Several different machine learning models are trained on different datasets to gain a better understanding of which techniques work better when different requirements matter, e.g. the presence of a smaller dataset or stricter requirements on inference time. Of the models evaluated, OCSVM gave the best overall performance, achieving an accuracy of 85%, and K-means was the fastest model, with an inference time of 0.04 milliseconds. In addition, LSTM-based models showed the greatest potential for improvement with larger datasets.
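
One common way to use K-means for unsupervised anomaly detection is to score each sample by its distance to the nearest cluster centroid. The sketch below illustrates that idea only; the abstract does not say how the thesis configured its K-means model, so the seeding, the threshold rule, and the toy temperature readings are all assumptions.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def anomaly_score(point, centroids):
    """Distance from the point to its nearest centroid; larger means more unusual."""
    return min(math.dist(point, c) for c in centroids)


# Hypothetical unlabeled training readings with two normal operating levels.
train = [(20.0,), (20.5,), (19.5,), (30.0,), (30.5,), (29.5,)]
centroids = kmeans(train, k=2)

# Assumed threshold rule: the largest score seen on the training data.
threshold = max(anomaly_score(p, centroids) for p in train)


def is_anomaly(point):
    return anomaly_score(point, centroids) > threshold
```

Inference here is a handful of distance computations per sample, which is consistent with K-means being the fastest model reported, although the timing in the abstract comes from the thesis's own setup.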

Identifying Machine States and Sensor Properties for a Digital Machine Template : Automatically recognize states in a machine using multivariate time series cluster analysis

Viking, Jakob January 2021 (has links)
Digital twins have become a large part of new cyber-physical systems, as they allow a physical object to be simulated in the digital world. Alongside new approaches to digital twins, machines have become more intelligent, allowing them to produce more data than ever before. Within the area of digital twins, there is a need for an approach less complex than a fully optimised digital twin, one that is more like a digital shadow of the physical object. The focus of this thesis is therefore to study machine states and statistical distributions for all sensors in a machine. Whereas the majority of studies in the literature focus on generating data from a digital twin, this study focuses on what characteristics a digital twin has. The solution is to define a term, the digital machine template, which contains the states and statistical properties of each sensor in a given machine. The primary approach is to create a proof-of-concept application that uses traditional data mining technologies and clustering to analyze how many states a machine has and how its sensor data is structured. The result is a digital machine template containing all the states a machine might have and the possible statistical distributions of each sensor in each state. The digital machine template can serve as a basis for creating digital twins, with a shorter development time than a regular digital twin. More research is still needed, as the less complex approach may lead to missing information or to information being interpreted incorrectly. It nevertheless shows promise as a less complex way of looking at digital twins, which may become necessary as digital twins grow more complex by the day.
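
As a rough sketch of what a digital machine template might hold, the example below groups sensor readings by machine state and summarizes each sensor with a mean and a standard deviation. It assumes the state labels have already been produced by a clustering step, and it stands in for the statistical distributions with two summary statistics. The function name, the sensor names, and the data are hypothetical and not taken from the thesis.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_template(samples, states):
    """Summarize each sensor per state.

    samples: list of dicts mapping sensor name to reading
    states:  list of state labels, one per sample (assumed to come from clustering)
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for sample, state in zip(samples, states):
        for sensor, value in sample.items():
            grouped[state][sensor].append(value)
    return {
        state: {
            sensor: {"mean": mean(values), "std": pstdev(values)}
            for sensor, values in sensors.items()
        }
        for state, sensors in grouped.items()
    }


# Hypothetical readings from a machine with two states.
samples = [
    {"temp": 20.0, "rpm": 0.0},
    {"temp": 22.0, "rpm": 0.0},
    {"temp": 60.0, "rpm": 1500.0},
    {"temp": 62.0, "rpm": 1700.0},
]
states = ["idle", "idle", "running", "running"]
template = build_template(samples, states)
```

A fuller template would fit a distribution family per sensor and state instead of storing only the mean and standard deviation.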
