11 |
Using dynamic time warping for multi-sensor fusion
Ko, Ming Hsiao January 2009
Fusion is a fundamental human process that occurs in some form at all levels, from the sense organs, where visual and auditory information is received by the eyes and ears respectively, to the highest levels of decision making, where the brain fuses that information to make decisions. Multi-sensor data fusion is concerned with gaining information from multiple sensors by fusing across raw data, features, or decisions. Traditional frameworks for multi-sensor data fusion are concerned only with fusion at specific points in time. However, many real-world situations change over time. When a multi-sensor system is used for situation awareness, it is useful not only to know the state or event of the situation at a point in time but, more importantly, to understand the causes of those states or events changing over time. / Hence, we proposed a multi-agent framework for temporal fusion, which emphasises the time dimension of the fusion process, that is, fusion of the multi-sensor data or events derived over a period of time. The proposed multi-agent framework has three major layers: hardware, agents, and users. There are three different fusion architectures for organising the group of agents: centralized, hierarchical, and distributed. The temporal fusion process of the proposed framework is elaborated using the information graph. Finally, the core of the proposed temporal fusion framework, the Dynamic Time Warping (DTW) temporal fusion agent, is described in detail. / Fusing multisensory data over a period of time is a challenging task, since the data to be fused consists of complex sequences that are multi-dimensional, multimodal, interacting, and time-varying in nature. Additionally, performing temporal fusion efficiently in real time is another challenge because of the large amount of data to be fused.
To address these issues, we proposed the DTW temporal fusion agent, which includes four major modules: data pre-processing, DTW recogniser, class templates, and decision making. The DTW recogniser is extended in various ways to deal with the variability of multimodal sequences acquired from multiple heterogeneous sensors, the problem of unknown start and end points, multimodal sequences of the same class that have different lengths locally and/or globally, and the challenges of online temporal fusion. / We evaluate the performance of the proposed DTW temporal fusion agent on two real-world datasets: 1) accelerometer data acquired from performing two hand gestures, and 2) a benchmark dataset acquired from carrying a mobile device and performing pre-defined user scenarios. Performance results of the DTW-based system are compared with those of a Hidden Markov Model (HMM) based system. The experimental results from both datasets demonstrate that the proposed DTW temporal fusion agent outperforms HMM-based systems and can perform online temporal fusion efficiently and accurately in real time.
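The abstract above names a DTW recogniser that matches incoming sequences against class templates. As a minimal sketch of that idea only, assuming one-dimensional sequences, an absolute-difference local cost, and nearest-template decision making (the thesis extends DTW in ways not reproduced here; the function names and the toy templates are hypothetical):

```python
from math import inf


def dtw_distance(a, b):
    """Classic DTW cost between two 1-D sequences (absolute-difference local cost)."""
    n, m = len(a), len(b)
    # cost[i][j] is the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def classify(sequence, templates):
    """Return the label of the class template with the smallest DTW distance."""
    return min(templates, key=lambda label: dtw_distance(sequence, templates[label]))


# Hypothetical class templates, standing in for recorded gesture signals.
templates = {
    "swipe": [0, 1, 2, 3, 2, 1, 0],
    "shake": [0, 2, 0, 2, 0, 2, 0],
}
# A slower "swipe": same shape, different length.
query = [0, 0, 1, 1, 2, 3, 3, 2, 1, 0]
print(classify(query, templates))
```

Because DTW lets one sample align with several samples of the other sequence, the slower query still matches its template despite the length difference, which is the property the abstract relies on for sequences of the same class with different lengths.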
|
12 |
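Att hitta en nål i en höstack: Metoder och tekniker för att sålla och gradera stora mängder ostrukturerad textdata [Finding a needle in a haystack: Methods and techniques for sifting and grading large amounts of unstructured text data]
Pettersson, Emeli, Carlson, Albin January 2019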
Big Data is currently a popular topic and can be used for many purposes. It can, for instance, be used to analyse data available online in the hope of identifying human rights violations. By applying techniques from areas such as Artificial Intelligence (AI), Information Retrieval (IR), and Visual Analytics, the company Globalworks AB aims to identify individual voices in social media that express grievances about oppression and such violations. Artificial Intelligence and Information Retrieval are broad topics, however, and have been active areas of research for a long time. We have therefore chosen to conduct a systematic literature review to map existing research in these areas. By presenting a compilation of the literature, we provide an ontological overview of how an information system that uses these techniques is structured, which methods and technologies can be used to develop such a system, and how they can be combined.
|
13 |
A Framework for Fashion Data Gathering, Hierarchical-Annotation and Analysis for Social Media and Online Shop : TOOLKIT FOR DETAILED STYLE ANNOTATIONS FOR ENHANCED FASHION RECOMMENDATION
Wara, Ummul January 2018
As recommendation systems shift from content-based to hybrid, cross-domain approaches, there is a need for a social-network dataset that provides sufficient data as well as detailed annotations drawn from a predefined hierarchical vocabulary of clothing categories and attributes, and that takes user interactions into account. However, existing fashion datasets lack either a hierarchical category representation or social-network user interactions. This thesis presents two datasets: one from the photo-sharing platform Instagram, which gathers fashionistas' photos and text together with all available user interactions, and another from the online shop Zalando, with full clothing details. We present the design of a customized crawler that lets the user crawl data by category or attribute. Moreover, an efficient, collaborative web solution is designed and implemented to facilitate large-scale, hierarchical, category-based, detailed annotation of Instagram data. By considering all user interactions, the solution provides a detailed annotation facility that reflects the user's preferences. The web solution was evaluated by the team as well as through the Amazon Mechanical Turk service. The annotated output from different users demonstrates the usability of the web solution in terms of availability and clarity. In addition to data crawling and development of the annotation web solution, this project analyzes the distribution of the Instagram and Zalando data in terms of clothing category, subcategory, and pattern to provide meaningful insight into the data. The research community will benefit from these datasets when working on tasks that need a richly annotated dataset representing a social network together with detailed clothing information.
|
14 |
Comparision of Machine Learning Algorithms on Identifying Autism Spectrum Disorder
Aravapalli, Naga Sai Gayathri, Palegar, Manoj Kumar January 2023
Background: Autism Spectrum Disorder (ASD) is a complex neurodevelopmental disorder that affects social communication, behavior, and cognitive development. Patients with autism have a variety of difficulties, such as sensory impairments, attention issues, learning disabilities, mental health issues like anxiety and depression, and motor issues. The World Health Organization (WHO) estimates that one in 100 children has ASD. Although ASD cannot be completely treated, early identification of its symptoms might lessen its impact and can significantly improve the outcome of interventions and therapies, so it is important to identify the disorder early. Machine learning algorithms can help in predicting ASD. In this thesis, Support Vector Machine (SVM) and Random Forest (RF) are the algorithms used to predict ASD.
Objectives: The main objective of this thesis is to build and train models using machine learning (ML) algorithms, first with default parameters and then with hyperparameter tuning, and to identify the most accurate model by comparing the two experiments, in order to predict whether or not a person has ASD.
Methods: Experimentation is the method chosen to answer the research questions. The data were prepared by splitting the dataset and applying feature selection. Two experiments followed: in the first the models were trained with default parameters, and in the second with hyperparameter tuning, and the performance metrics were recorded for each. Based on the comparison, the most accurate model was applied to predict ASD.
Results: Two algorithms, SVM and RF, were chosen to train the models. With hyperparameter tuning, SVM obtained the highest accuracy and F1 score on the test data (96% and 97%, respectively), outperforming RF.
Conclusions: The models were trained using two ML algorithms, SVM and RF, in two experiments. In experiment 1 the models were trained with default parameters, and in experiment 2 with hyperparameter tuning; in both, accuracy and F1 score were obtained on the test data. Comparing these performance metrics, we conclude that SVM is the most accurate algorithm for predicting ASD.
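The comparison above rests on two metrics, accuracy and F1 score. As a small worked illustration of how they are computed and compared for two models, assuming binary labels with 1 as the positive class (the label vectors are made up for illustration and are not the thesis data; the function names are hypothetical):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical test labels and the predictions of two hypothetical models.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
predictions = {
    "model_a": [1, 1, 1, 0, 0, 1, 0, 0],
    "model_b": [1, 0, 1, 0, 1, 1, 0, 0],
}

for name, y_pred in predictions.items():
    print(name, accuracy(y_true, y_pred), f1_score(y_true, y_pred))

# Pick the model with the higher F1 score, mirroring the comparison step.
best = max(predictions, key=lambda name: f1_score(y_true, predictions[name]))
print("best:", best)
```

Reporting F1 alongside accuracy matters when the classes are unevenly represented, since accuracy alone can look high for a model that rarely predicts the positive class.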
|
15 |
Transformer-Based Networks for Fault Detection and Diagnostics of Rotating Machinery
Wong, Jonathan January 2024
Machine health and condition monitoring are billion-dollar concerns for industry. Quality control and continuous improvement are some of the most important factors for manufacturers to consider in order to maintain a successful business. When work floor interruptions occur, engineers frequently employ “Band-Aid” fixes due to resource, timing, or technical constraints without solving for the root cause. Thus, quick, reliable, and accurate fault detection and diagnosis methods are required.
Within complex rotating machinery, a very basic yet crucial element accounts for large amounts of downtime and failure: the rolling-element bearing. Worn-out bearings account for some of the most drastic failures in any mechanical system, alongside electrical failures associated with stator windings. The cyclical motion allows measurements to be taken with vibration sensors and analyzed through signal processing techniques. Methods will be discussed to transform these acquired signals into usable input data for neural network training in order to classify the type of fault present in the system.
With the widespread adoption of neural networks, we turn our attention to the growing field of sequence-to-sequence deep learning architectures. Language-based models have since been adapted to a multitude of tasks beyond text translation and word prediction. Powerful Transformers are now used for generative modeling, computer vision, and anomaly detection across all industries.
This research aims to determine the efficacy of the Transformer neural network for detecting and classifying faults in 3-phase induction motors for the automotive industry. The need for a quick turnaround often leads to small datasets, so methods such as data augmentation will be employed to improve training on our time-series signals. / Thesis / Master of Applied Science (MASc)
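The abstract mentions data augmentation for small time-series datasets without naming a specific method. One possible illustration, assuming two common techniques that the thesis may or may not use (overlapping sliding windows and additive Gaussian jitter; all function names and parameter values are hypothetical):

```python
import random


def sliding_windows(signal, size, step):
    """Cut a long signal into fixed-length windows; step < size gives overlap."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]


def jitter(window, sigma, rng):
    """Return a noisy copy of a window (additive zero-mean Gaussian noise)."""
    return [x + rng.gauss(0.0, sigma) for x in window]


def augment(signal, size, step, copies, sigma, seed=0):
    """Build a training set: each window plus `copies` jittered variants."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    samples = []
    for window in sliding_windows(signal, size, step):
        samples.append(window)
        samples.extend(jitter(window, sigma, rng) for _ in range(copies))
    return samples


# Hypothetical vibration signal: 10 samples, windows of 4 with 50% overlap.
signal = [float(v) for v in range(10)]
samples = augment(signal, size=4, step=2, copies=2, sigma=0.05)
print(len(samples))
```

Overlapping windows multiply the number of training examples drawn from one recording, and jitter adds variation without changing the fault label; both are sketched here only to make the idea concrete.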
|
16 |
Tracking domain knowledge based on segmented textual sources
Kalledat, Tobias 11 May 2009
This research aims to understand how pre-processing influences the results of knowledge generation and to give concrete recommendations for suitable pre-processing of text corpora in Text Data Mining (TDM) projects. The focus is on extracting and tracking concepts within specific knowledge domains using an approach based on horizontal (timeline) and vertical (persistence of terms) segmentation of corpora. The result is a set of time-segmented sub-corpora that reflect the persistence of the terms they contain. Within each time segment, two clusters of terms can be formed: one containing the terms that are not persistent with respect to the whole corpus, and the other containing the terms that occur in every time segment. Based on simple frequency measures, it can be shown that the statistical quality of a single corpus is enough to measure pre-processing quality; comparison corpora are not needed. The time series of the frequency measures show significant negative correlations between the cluster of terms that occur permanently and the cluster of terms that are not persistent across all time segments. This held only for the optimally pre-processed corpus and was not found in the other test sets, whose pre-processing quality was low. When the most frequent terms were grouped into concepts using domain-specific taxonomies, a significant negative correlation was found between the number of distinct terms per time segment and the number of terms assigned to a taxonomy, again only for the corpus with high pre-processing quality. A semantic analysis of data prepared with a threshold-based TDM method yielded significantly different generated knowledge depending on the quality of pre-processing. The methods and measures introduced in this research make it possible to measure both the quality of the source corpora and the quality of the applied taxonomies. Based on these findings, indicators for measuring and assessing corpora and taxonomies are developed, and recommendations are given for pre-processing that suits the goal of the subsequent analysis.
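The segmentation described in the abstract above can be sketched on a toy corpus: split the corpus by time segment, separate terms that occur in every segment from those that do not, and correlate the per-segment frequency series of the two clusters. This is a simplified sketch under stated assumptions (raw token counts as the frequency measure, Pearson correlation, made-up tokens); the exact measures used in the thesis are not given in the abstract, and all names here are hypothetical.

```python
from collections import Counter
from math import sqrt


def persistence_clusters(segments):
    """Split the vocabulary into terms present in every segment and the rest."""
    vocabularies = [set(tokens) for tokens in segments.values()]
    persistent = set.intersection(*vocabularies)
    non_persistent = set.union(*vocabularies) - persistent
    return persistent, non_persistent


def frequency_series(segments, cluster):
    """Per-segment count of tokens that belong to the given cluster."""
    series = []
    for tokens in segments.values():
        counts = Counter(tokens)
        series.append(sum(counts[term] for term in cluster))
    return series


def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    covariance = sum(a * b for a, b in zip(dx, dy))
    return covariance / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


# Hypothetical yearly corpus segments (already tokenised and pre-processed).
segments = {
    "2006": ["data", "data", "text", "mining", "grid"],
    "2007": ["data", "text", "cloud", "cloud", "web"],
    "2008": ["data", "text", "text", "mining", "web", "web"],
}

persistent, non_persistent = persistence_clusters(segments)
r = pearson(
    frequency_series(segments, persistent),
    frequency_series(segments, non_persistent),
)
print(sorted(persistent), sorted(non_persistent), r)
```

Raw counts are used rather than relative frequencies because the two clusters partition each segment, so their relative frequencies would always sum to one and correlate at exactly -1 regardless of pre-processing quality.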
|
17 |
Task Load Modelling for LTE Baseband Signal Processing with Artificial Neural Network Approach
Wang, Lu January 2014
This thesis investigates the development of an automatic or guided-automatic tool to predict hardware (HW) resource occupation, namely task load, as a function of the software (SW) application algorithm parameters in an LTE base station. For signal processing in an LTE base station, it is important to know how many HW resources will be used when a SW algorithm runs on a specific platform. This information helps one understand the system and platform better, which facilitates a reasonable use of the available resources. Developing the tool is treated as building a mathematical model between HW task load and SW parameters, which is defined as a function approximation problem. According to the universal approximation theorem, the problem can be solved with artificial neural networks (ANNs). The theorem indicates that a two-layer neural network can approximate any continuous function on a bounded domain to arbitrary precision, provided the activation function is suitable and there are enough hidden neurons. The thesis documents a workflow for building the model with the ANN method, as well as research on data subset selection with mathematical methods such as Partial Correlation and Sequential Searching as a data pre-processing step for the ANN approach. To make the data selection method suitable for ANNs, the Sequential Searching method was modified, which gives a better result. The results show that it is possible to develop such a guided-automatic tool for prediction in LTE baseband signal processing under specific precision constraints. Compared to other approaches, this model tool has a higher precision level and better adaptivity, meaning that it can be used in any part of the platform even when the transmission channels differ.
|
18 |
Machine Learning Based Prediction and Classification for Uplift Modeling / Maskininlärningsbaserad prediktion och klassificering för inkrementell responsanalys
Börthas, Lovisa, Krange Sjölander, Jessica January 2020
The desire to model the true gain from targeting an individual for marketing purposes has led to the common use of uplift modeling. Uplift modeling requires a treatment group as well as a control group, and the objective is to estimate the difference between the success probabilities in the two groups. Statistical machine learning methods are efficient at estimating the probabilities in uplift models. This project investigates three uplift modeling approaches: Subtraction of Two Models, Modeling Uplift Directly, and the Class Variable Transformation. The statistical machine learning methods applied are Random Forests and Neural Networks, along with the standard method Logistic Regression. The data were collected from a well-established retail company, and the purpose of the project is to investigate which uplift modeling approach and statistical machine learning method yields the best performance on this data. The variable selection step proved to be a crucial component of the modeling process, as did the amount of control data in each data set. For the uplift modeling to be successful, the method of choice should be either Modeling Uplift Directly using Random Forests or the Class Variable Transformation using Logistic Regression. Neural network based approaches are sensitive to uneven class distributions and were therefore unable to produce stable models on this data. Furthermore, the Subtraction of Two Models did not perform well because each model tended to focus too much on modeling the class in its own data set separately instead of modeling the difference between the class probabilities. The conclusion is to use an approach that models the uplift directly and to use a relatively large amount of control data in each data set.
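Two of the approaches named in the abstract can be made concrete with a toy calculation. The sketch below replaces the Random Forest and Logistic Regression models of the thesis with plain per-segment frequency estimates, purely for illustration, and assumes equally sized treatment and control groups in each segment, which is the condition under which the class variable transformation identity uplift = 2·P(Z=1) − 1 holds. The records and function names are hypothetical.

```python
from collections import defaultdict


def uplift_two_models(records):
    """Subtraction of two models: P(success | treated) - P(success | control)."""
    # stats[segment][treated] = [successes, total]
    stats = defaultdict(lambda: {0: [0, 0], 1: [0, 0]})
    for segment, treated, outcome in records:
        stats[segment][treated][0] += outcome
        stats[segment][treated][1] += 1
    return {
        segment: groups[1][0] / groups[1][1] - groups[0][0] / groups[0][1]
        for segment, groups in stats.items()
    }


def uplift_class_transformation(records):
    """Class variable transformation: Z = 1 for treated successes and control failures."""
    z_stats = defaultdict(lambda: [0, 0])  # segment -> [sum of Z, total]
    for segment, treated, outcome in records:
        z = 1 if treated == outcome else 0
        z_stats[segment][0] += z
        z_stats[segment][1] += 1
    # Valid when treatment and control groups are the same size in each segment.
    return {segment: 2 * z_sum / n - 1 for segment, (z_sum, n) in z_stats.items()}


def make_records(segment, treated_outcomes, control_outcomes):
    """Build (segment, treated, outcome) tuples for a toy customer segment."""
    return ([(segment, 1, y) for y in treated_outcomes]
            + [(segment, 0, y) for y in control_outcomes])


# Hypothetical data: segment A responds to the campaign, segment B is put off by it.
records = (make_records("A", [1, 1, 1, 0], [1, 0, 0, 0])
           + make_records("B", [1, 0, 0, 0], [1, 1, 0, 0]))

print(uplift_two_models(records))
print(uplift_class_transformation(records))
```

With balanced groups the two estimates agree on this toy data. The transformation turns uplift estimation into a single classification problem on Z, which is what allows a standard classifier such as Logistic Regression to model the difference directly rather than fitting each group separately.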
|