Global ETD Search

61	Genomförbarhetsstudie av att känna igen två tankemönster i följd med EEG / Feasibility study of recognizing two subsequent thought patterns with EEG Wilhelmsson, Oskar, Wikén, Victor January 2015 (has links) Studien implementerade ett hjärna-dator-gränssnitt med hjälp av EEG-instrumentet MindWave Mobile Headset. Vi undersökte om det var möjligt att utföra fyra operationer genom att använda tankemönster. Fyra försökspersoner deltog i studien. Deras uppgift var att tänka i två tankemönster i följd som resulterade i en operation. EEG-signalen förbehandlas så att en mönsterigenkänningsmetod (k-NN) lättare kunde urskilja två tankemönster ur signalen. Denna undersökning har till vår vetskap inte tidigare utförts och är därmed kunskapsluckan vi ämnar fylla. Att fylla denna kunskapslucka är av intresse för bland annat användargrupperna: rörelsehindrade, spelintresserade och Virtual Reality-användare. Vi tog fram en modell som modellerade det bästa möjliga utfallet av metodiken i föreliggande studie. Undersökningens resultat kunde inte användas för att göra slutsatser angående frågeställningen då detta skulle vara att post hoc-teoretisera. I modellen visades dock tre av fyra operationer vara genomförbara, med en indikation om att även den fjärde var möjlig att utföra. Resultatet indikerar att det finns anledning att utföra en fortsatt studie. Den föreslagna fortsatta studien bör innefatta nya mätningar som testas av modellen för att fullt ut besvara problemformuleringen. / This study implements a Brain-Computer-Interface using the EEG-instrument MindWave Mobile Headset. We studied the feasibility of performing four operations using thought patterns. Four test subjects participated in the study. Their task was to think in two subsequent thought patterns that resulted in an operation. The EEG-signal was pre-processed in such a way that a pattern recognition algorithm (k-NN) more easily could recognize two thought patterns in the signal. 
This study has to our knowledge not been done before and thus aims to fill this lack of knowledge in the scientific community. User groups that have an interest in filling this gap are, amongst others; disabled people, gamers, and Virtual Reality users. We created a model that modeled the best possible outcome of the method used in this study. Conclusions drawn from the result can not be used to fully answer the problem statement, since it would be to post hoc-theorize. However, three out of four operations were possible to perform in the model, with an indication that the fourth also was possible to perform. These results indicate that there are grounds to continue this study. The proposed continued study should include new measurements that are tested by the model to determine if it is feasible to distinguish all four operations. EEG MindWave BCI feature vector pre-processing k-NN dimensionality reduction classification algorithm EEG MindWave BCI egenskapsvektor förbehandling k-NN dimensionalitetsreducering klassificeringsalgoritm Media and Communication Technology Medieteknik
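The k-NN classification step named in the abstract above can be illustrated with a minimal sketch. It assumes the pre-processed EEG signal has already been reduced to short numeric feature vectors. The feature values, the labels `pattern_A` and `pattern_B`, and the function name `knn_predict` are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Return the majority label among the k training vectors closest to query.

    train is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional feature vectors (illustration only).
train = [
    ((0.90, 0.10), "pattern_A"),
    ((0.80, 0.20), "pattern_A"),
    ((0.85, 0.15), "pattern_A"),
    ((0.10, 0.90), "pattern_B"),
    ((0.20, 0.80), "pattern_B"),
    ((0.15, 0.95), "pattern_B"),
]

print(knn_predict(train, (0.70, 0.30)))
```

An odd `k` with two classes avoids tied votes; how the study itself chose `k` or the distance measure is not stated in the abstract.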
62	Att hitta en nål i en höstack: Metoder och tekniker för att sålla och gradera stora mängder ostrukturerad textdata Pettersson, Emeli, Carlson, Albin January 2019 (has links) Big Data är i dagsläget ett populärt ämne som kan användas för en mängd olika syften. Bland annat kan det användas för att analysera data på webben i hopp om att identifiera brott mot mänskliga rättigheter. Genom att tillämpa tekniker inom områden som Artificiell Intelligens (AI), Information Retrieval (IR) samt data- visualisering, hoppas företaget Globalworks AB kunna identifiera röster vilka uttrycker sig om förtryck och kränkningar i social media. Artificiell intelligens och informationshämtning är dock breda områden och forskning som behandlar dem kan finnas långt tillbaka i tiden. Vi har därför valt att utföra en systematisk litteraturstudie i syfte att kartlägga existerande forskning inom dessa områden. Med en litterär sammanställning bistår vi med en ontologisk överblick i hur ett system som använder dessa tekniker är strukturerat, med vilka metoder och teknologier ett sådant system kan utvecklas, samt hur dessa kan kombineras. / Big Data is a popular topic these days which can be utilized for numerous purposes. It can, for instance, be used in order to analyse data made available online in hopes of identifying violations against human rights. By applying techniques within such areas as Artificial Intelligence (AI), Information Retrieval (IR), and Visual Analytics, the company Globalworks Ltd. aims to identify single voices in social media expressing grievances concerning such violations. Artificial Intelligence and Information Retrieval are broad topics however, and have been an active area of research for quite some time. We have therefore chosen to conduct a systematic literature review in hopes of mapping together existing research covering these areas. 
By presenting a literary compilation, we provide an ontological view of how an information system utilizing techniques within these areas could be structured, in addition to how such a system could deploy said techniques. Information Retrieval Artificial Intelligence Data pre-processing Data transformation Data scraping Machine learning Data visualisation Data storage Big data Insights generation Engineering and Technology Teknik och teknologier
63	Characterization of components of water supply systems from GPR images and tools of intelligent data analysis Ayala Cabrera, David 29 December 2015 (has links) [EN] Over time, due to multiple operational and maintenance activities, the networks of water supply systems (WSSs) undergo interventions, modifications or even are closed. In many cases, these activities are not properly registered. Knowledge of the paths and characteristics (status and age, etc.) of the WSS pipes is obviously necessary for efficient and dynamic management of such systems. This problem is greatly augmented by considering the detection and control of leaks. Access to reliable leakage information is a complex task. In many cases, leaks are detected when the damage is already considerable, which brings high social and economic costs. In this sense, non-destructive methods (e.g., ground penetrating radar - GPR) may be a constructive response to these problems, since they allow, as evidenced in this thesis, to ascertain paths of pipes, identify component characteristics, and detect primordial water leaks. Selection of GPR in this work is justified by its characteristics as non-destructive technique that allows studying both metallic and non-metallic objects. Although the capture of information with GPR is usually successful, such aspects as the capture settings, the large volume of generated information, and the use and interpretation of such information require high level of skill and experience. This dissertation may be seen as a step forward towards the development of tools able to tackle the problem of lack of knowledge on the WSS buried assets. The main objective of this doctoral work is thus to generate tools and assess their feasibility of application to the characterization of components of WSSs from GPR images. In this work we have carried out laboratory tests specifically designed to propose, develop and evaluate methods for the characterization of the WSS buried components. 
Additionally, we have conducted field tests, which have enabled us to determine the feasibility of implementing such methodologies under uncontrolled conditions. The methodologies developed are based on techniques of intelligent data analysis. The basic principle of this work has involved the processing of data obtained through the GPR to look for useful information about WSS components, with special emphasis on the pipes. After performing numerous activities, one can conclude that, using GPR images, it is feasible to obtain more information than the typical identification of hyperbolae currently performed. In addition, this information can be observed directly, e.g. more simply, using the methodologies proposed in this doctoral work. These methodologies also prove that it is feasible to identify patterns (especially with the preprocessing algorithm termed Agent race) that provide fairly good approximation of the location of leaks in WSSs. Also, in the case of pipes, one can obtain such other characteristics as diameter and material. The main outcomes of this thesis consist in a series of tools we have developed to locate, identify and visualize WSS components from GPR images. Most interestingly, the data are synthesized and reduced so that the characteristics of the different components of the images recorded in GPR are preserved. The ultimate goal is that the developed tools facilitate decision-making in the technical management of WSSs, and that such tools can even be operated by personnel with limited experience in handling non-destructive methodologies, specifically GPR. / [ES] Con el paso del tiempo, y debido a múltiples actividades operacionales y de mantenimiento, las redes de los sistemas de abastecimiento de agua (SAAs) sufren intervenciones, modificaciones o incluso, son clausuradas, sin que, en muchos casos, estas actividades sean correctamente registradas. 
El conocimiento de los trazados y características (estado y edad, entre otros) de las tuberías en los SAAs es obviamente necesario para una gestión eficiente y dinámica de tales sistemas. A esta problemática se suma la detección y el control de las fugas de agua. El acceso a información fiable sobre las fugas es una tarea compleja. En muchos casos, las fugas son detectadas cuando los daños en la red son ya considerables, lo que trae consigo altos costes sociales y económicos. En este sentido, los métodos no destructivos (por ejemplo, ground penetrating radar - GPR), pueden ser una respuesta a estas problemáticas, ya que permiten, como se pone de manifiesto en esta tesis, localizar los trazados de las tuberías, identificar características de los componentes y detectar las fugas de agua cuando aún no son significativas. La selección del GPR, en este trabajo se justifica por sus características como técnica no destructiva, que permite estudiar tanto objetos metálicos como no metálicos. Aunque la captura de información con GPR suele ser exitosa, la configuración de la captura, el gran volumen de información, y el uso y la interpretación de la información requieren de alto nivel de habilidad y experiencia por parte del personal. Esta tesis doctoral se plantea como un avance hacia el desarrollo de herramientas que permitan responder a la problemática del desconocimiento de los activos enterrados de los SAAs. El objetivo principal de este trabajo doctoral es, pues, generar herramientas y evaluar la viabilidad de su aplicación en la caracterización de componentes de un SAA, a partir de imágenes GPR. En este trabajo hemos realizado ensayos de laboratorio específicamente diseñados para plantear, elaborar y evaluar metodologías para la caracterización de los componentes enterrados de los SAAs. Adicionalmente, hemos realizado ensayos de campo, que han permitido determinar la viabilidad de aplicación de tales metodologías bajo condiciones no controladas. 
Las metodologías elaboradas están basadas en técnicas de análisis inteligentes de datos. El principio básico de este trabajo ha consistido en el tratamiento adecuado de los datos obtenidos mediante el GPR, a fin de buscar información de utilidad para los SAAs respecto a sus componentes, con especial énfasis en las tuberías. Tras la realización de múltiples actividades, se puede concluir que es viable obtener más información de las imágenes de GPR que la que actualmente se obtiene con la típica identificación de hipérbolas. Esta información, además, puede ser observada directamente, de manera más sencilla, mediante las metodologías planteadas en este trabajo doctoral. Con estas metodologías se ha probado que también es viable la identificación de patrones (especialmente el pre-procesado con el algoritmo Agent race) que proporcionan aproximación bastante acertada de la localización de las fugas de agua en los SAAs. También, en el caso de las tuberías, se puede obtener otro tipo de características tales como el diámetro y el material. Como resultado de esta tesis se han desarrollado una serie de herramientas que permiten visualizar, identificar y localizar componentes de los SAAs a partir de imágenes de GPR. El resultado más interesante es que los resultados obtenidos son sintetizados y reducidos de manera que preservan las características de los diferentes componentes registrados en las imágenes de GPR. El objetivo último es que las herramientas desarrolladas faciliten la toma de decisiones en la gestión técnica de los SAAs y que tales herramientas puedan ser operadas incluso por personal con una experiencia limitada en el manejo / [CA] Amb el temps, a causa de les múltiples activitats d'operació i manteniment, les xarxes de sistemes d'abastament d'aigua (SAAs) se sotmeten a intervencions, modificacions o fins i tot estan tancades. En molts casos, aquestes activitats no estan degudament registrats. El coneixement dels camins i característiques (estat i edat, etc.) 
de les canonades d'aigua i sanejament fa evident la necessitat d'una gestió eficient i dinàmica d'aquests sistemes. Aquest problema es veu augmentat en gran mesura tenint en compte la detecció i control de fuites. L'accés a informació fiable sobre les fuites és una tasca complexa. En molts casos, les fugues es detecten quan el dany ja és considerable, el que porta costos socials i econòmics. En aquest sentit, els mètodes no destructius (per exemple, ground penetrating radar - GPR) poden ser una resposta constructiva a aquests problemes, ja que permeten, com s'evidencia en aquesta tesi, per determinar rutes de canonades, identificar les característiques dels components, i detectar les fuites d'aigua quan encara no són significatives. La selecció del GPR en aquest treball es justifica per les seves característiques com a tècnica no destructiva que permet estudiar tant objectes metàl·lics i no metàl·lics. Tot i que la captura d'informació amb GPR sol ser reeixida, aspectes com ara la configuració de captura, el gran volum d'informació que es genera, i l'ús i la interpretació d'aquesta informació requereix alt nivell d'habilitat i experiència. Aquesta tesi pot ser vista com un pas endavant cap al desenvolupament d'eines capaces d'abordar el problema de la manca de coneixement sobre els actius d'aigua i sanejament enterrat. L'objectiu principal d'aquest treball doctoral és, doncs, generar eines i avaluar la seva factibilitat d'aplicació a la caracterització dels components de los SAAs, a partir d'imatges GPR. En aquest treball s'han dut a terme proves de laboratori específicament dissenyats per proposar, desenvolupar i avaluar mètodes per a la caracterització dels components d'aigua i sanejament soterrat. A més, hem dut a terme proves de camp, que ens han permès determinar la viabilitat de la implementació d'aquestes metodologies en condicions no controlades. Les metodologies desenvolupades es basen en tècniques d'anàlisi intel·ligent de dades. 
El principi bàsic d'aquest treball ha consistit en el tractament de dades obtingudes a través del GPR per buscar informació útil sobre els components d'SAA, amb especial èmfasi en la canonades. Després de realitzar nombroses activitats, es pot concloure que, amb l'ús d'imatges de GPR, és factible obtenir més informació que la identificació típica d'hipèrboles realitzat actualment. A més, aquesta informació pot ser observada directament, per exemple, més simplement, utilitzant les metodologies proposades en aquest treball doctoral. Aquestes metodologies també demostren que és factible per identificar patrons (especialment el pre-processat amb l'algoritme Agent race) que proporcionen bastant bona aproximació de la localització de fuites en SAAs. També, en el cas de tubs, es pot obtenir altres característiques com ara el diàmetre i el material. Els principals resultats d'aquesta tesi consisteixen en una sèrie d'eines que hem desenvolupat per localitzar, identificar i visualitzar els components dels SAAS a partir d'imatges GPR. El resultat més interessant és que els resultats obtinguts són sintetitzats i reduïts de manera que preserven les característiques dels diferents components registrats en les imatges de GPR. L'objectiu final és que les eines desenvolupades faciliten la presa de decisions en la gestió tècnica de SAA, i que tals eines poden fins i tot ser operades per personal amb poca experiència en el maneig de metodologies no destructives, específicament GPR. / Ayala Cabrera, D. (2015). Characterization of components of water supply systems from GPR images and tools of intelligent data analysis [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/59235 / TESIS / Premios Extraordinarios de tesis doctorales Water supply systems Ground penetrating radar (GPR) Intelligent data analysis Pre-processing of GPR images Signal and image processing and analysis MATEMATICA APLICADA INGENIERIA HIDRAULICA
64	PTC Creo Simulate 4 Roadmap Coronado, Jose 22 July 2016 (has links) This presentation is intended to inform about the enhancements to Creo Simulate 4.0 and the Roadmap for the future (5.0 +) info:eu-repo/classification/ddc/629 ddc:629 Creo Simulate Simulation, Creo Simulate 4.0
65	Multi-User Methods for FEA Pre-Processing Weerakoon, Prasad 13 June 2012 (has links) (PDF) Collaboration in engineering product development leads to shorter product development times and better products. In product development, considerable time is spent preparing the CAD model or assembly for Finite Element Analysis (FEA). In general Computer-Aided Applications (CAx) such as FEA deter collaboration because they allow only a single user to check out and make changes to the model at a given time. Though most of these software applications come with some collaborative tools, they are limited to simple tasks such as screen sharing and instant messaging. This thesis discusses methods to convert a current commercial FEA pre-processing program into a multi-user program, where multiple people are allowed to work on a single FEA model simultaneously. This thesis discusses a method for creating a multi-user FEA pre-processor and a robust, stable multi-user FEA program with full functionality has been developed using CUBIT. A generalized method for creating a networking architecture for a multi-user FEA pre-processor is discussed and the chosen client-server architecture is demonstrated. Furthermore, a method for decomposing a model/assembly using geometry identification tags is discussed. A working prototype which consists of workspace management Graphical User Interfaces (GUI) is demonstrated. A method for handling time-consuming tasks in an asynchronous multi-user environment is presented using Central Processing Unit (CPU) time as a time indicator. Due to architectural limitations of CUBIT, this is not demonstrated. Moreover, a method for handling undo sequences in a multi-user environment is discussed. Since commercial FEA pre-processors do not allow mesh related actions to be undone using an undo option, this undo handling method is not demonstrated. 
multi-user collaboration collaborative design multi-user decomposition multi-user architectures collaborative architectures CAx multi-user FEA pre-processing CUBIT CUBIT Connect workspace assignment workspace decomposition Mechanical Engineering
66	A Confidence-Prioritization Approach to Data Processing in Noisy Data Sets and Resulting Estimation Models for Predicting Streamflow Diel Signals in the Pacific Northwest Gustafson, Nathaniel Lee 09 August 2012 (has links) (PDF) Streams in small watersheds are often known to exhibit diel fluctuations, in which streamflow oscillates on a 24-hour cycle. Streamflow diel fluctuations, which we investigate in this study, are an informative indicator of environmental processes. However, in Environmental Data sets, as well as many others, there is a range of noise associated with individual data points. Some points are extracted under relatively clear and defined conditions, while others may include a range of known or unknown confounding factors, which may decrease those points' validity. These points may or may not remain useful for training, depending on how much uncertainty they contain. We submit that in situations where some variability exists in the clarity or 'Confidence' associated with individual data points – Notably environmental data – an approach that factors this confidence into account during the training phase is beneficial. We propose a methodological framework for assigning confidence to individual data records and augmenting training with that information. We then exercise this methodology on two separate datasets: A simulated data set, and a real-world, Environmental Science data set with a focus on streamflow diel signals. The simulated data set provides integral understanding of the nature of the data involved, and the Environmental Science data set provides a real-world case study of an application of this methodology against noisy data. Both studies' results indicate that applying and utilizing confidence in training increases performance and assists in the Data Mining Process. 
machine learning data mining data data processing pre-processing confidence prioritization environmental science hydrology diel diel fluctuation diel signal streamflow hydrogeology watershed Computer Sciences
67	A Framework for Fashion Data Gathering, Hierarchical-Annotation and Analysis for Social Media and Online Shop : TOOLKIT FOR DETAILED STYLE ANNOTATIONS FOR ENHANCED FASHION RECOMMENDATION Wara, Ummul January 2018 (has links) As recommendation systems shift from content-based to hybrid cross-domain-based approaches, there is a need for a social-network dataset that provides sufficient data as well as detail-level annotation, drawn from a predefined hierarchical vocabulary of clothing categories and attributes and taking user interactions into account. However, existing fashion-based datasets lack either a hierarchical category-based representation or social-network user interactions. The thesis presents two datasets: one from the photo-sharing platform Instagram, which gathers fashionistas' images with all possible user interactions, and another from the online shop Zalando, with full clothing details. We present the design of a customized crawler that enables the user to crawl data based on category or attributes. Moreover, an efficient and collaborative web solution is designed and implemented to facilitate large-scale, hierarchical, category-based, detail-level annotation of Instagram data. By considering all user interactions, the developed solution provides a detail-level annotation facility that reflects the user’s preference. The web solution is evaluated by the team as well as through the Amazon Turk Service. The annotated output from different users demonstrates the usability of the web solution in terms of availability and clarity. In addition to data crawling and development of the annotation web solution, this project analyzes the Instagram and Zalando data distributions in terms of clothing category, subcategory and pattern to provide meaningful insight into the data. The research community will benefit from these datasets when working on a richly annotated dataset that represents a social network and contains detailed clothing information.
/ Med tanke på trenden inom forskning av rekommendationssystem, där allt fler rekommendationssystem blir hybrida och designade för flera domäner, så finns det ett behov att framställa en datamängd från sociala medier som innehåller detaljerad information om klädkategorier, klädattribut, samt användarinteraktioner. Nuvarande datasets med inriktning mot mode saknar antingen en hierarkisk kategoristruktur eller information om användarinteraktion från sociala nätverk. Detta projekt har syftet att ta fram två dataset, ett dataset som insamlats från fotodelningsplattformen Instagram, som innehåller foton, text och användarinteraktioner från fashionistas, samt ett dataset som insamlats från klädutbutdet som ges av onlinebutiken Zalando. Vi presenterar designen av en webbcrawler som är anpassad för att kunna hämta data från de nämnda domänerna och är optimiserad för mode och klädattribut. Vi presenterar även en effektiv webblösning som är designad och implementerad för att möjliggöra annotering av stora mängder data från Instagram med väldigt detaljerad information om kläder. Genom att vi inkluderar användarinteraktioner i applikationen så kan vår webblösning ge användaranpassad annotering av data. Webblösningen har utvärderats av utvecklarna samt genom AmazonTurk tjänsten. Den annoterade datan från olika användare demonstrerar användarvänligheten av webblösningen. Utöver insamling av data och utveckling av ett system för webb-baserad annotering av data så har datadistributionerna i två modedomäner, Instagram och Zalando, analyserats. Datadistributionerna analyserades utifrån klädkategorier och med syftet att ge datainsikter. Forskning inom detta område kan dra nytta av våra resultat och våra datasets. Specifikt så kan våra datasets användas i domäner som kräver information om detaljerad klädinformation och användarinteraktioner. Computer and Information Sciences Data- och informationsvetenskap
68	Comparision of Machine Learning Algorithms on Identifying Autism Spectrum Disorder Aravapalli, Naga Sai Gayathri, Palegar, Manoj Kumar January 2023 (has links) Background: Autism Spectrum Disorder (ASD) is a complex neurodevelopmental disorder that affects social communication, behavior, and cognitive development. Patients with autism have a variety of difficulties, such as sensory impairments, attention issues, learning disabilities, mental health issues like anxiety and depression, as well as motor and learning issues. The World Health Organization (WHO) estimates that one in 100 children has ASD. Although ASD cannot be completely treated, early identification of its symptoms might lessen its impact. Early identification of ASD can significantly improve the outcome of interventions and therapies, so it is important to identify the disorder early. Machine learning algorithms can help in predicting ASD. In this thesis, Support Vector Machine (SVM) and Random Forest (RF) are the algorithms used to predict ASD. Objectives: The main objective of this thesis is to build and train models using machine learning (ML) algorithms, first with the default parameters and then with hyperparameter tuning, and to find the most accurate model for predicting whether a person has ASD by comparing the two experiments. Methods: Experimentation is the method chosen to answer the research questions, and it helped in finding the most accurate model to predict ASD. Data preparation involved splitting the data and applying feature selection to the dataset. In the two experiments, the models were trained first with the default parameters and then with hyperparameter tuning, and the performance metrics were recorded in each case. Based on the comparison, the most accurate model was applied to predict ASD.
Results: In this thesis, we chose two algorithms, SVM and RF, to train the models. After training the models with hyperparameter tuning, SVM obtained the highest accuracy and F1 scores on the test data, 96% and 97%, compared with the other model, RF. Conclusions: The models were trained using two ML algorithms, SVM and RF, in two experiments. In experiment 1 the models were trained using default parameters, and in experiment 2 they were trained using hyperparameter tuning; in both, accuracy and F1 score were obtained for the test data. By comparing the performance metrics, we came to the conclusion that SVM is the most accurate algorithm for predicting ASD. Autism Spectrum Disorder (ASD) Classification Data pre-processing Feature selection Machine learning algorithms Random Forest Classifier Support Vector Classifier Computer Engineering Datorteknik Computer Sciences Datavetenskap (datalogi)
69	Viewership forecast on a Twitch broadcast : Using machine learning to predict viewers on sponsored Twitch streams Malm, Jonas, Friberg, Martin January 2022 (has links) Today, the video game industry is larger than the sports and film industries combined, and the largest streaming platform Twitch with an average of 2.8 million concurrent viewers offers the possibility for gaming and non-gaming brands to market their products. Estimating streamers’ viewership is central in these marketing campaigns, but no large-scale studies have been conducted to predict viewership previously. This paper evaluates three different machine learning algorithms with regard to the three different error metrics MAE, MAPE and RMSE; and presents novel features for predicting viewership. Different models are chosen through recursive feature elimination using k-fold cross-validation with respect to both MAE and MAPE separately. The models are evaluated on an independent test and show promising results, on par with manual expert predictions. None of the models can be said to be significantly better than another. XGBoost optimized for MAPE obtained the lowest MAE error score of 282.54 and lowest MAPE error score of 41.36% on the test set, in comparison to expert predictions with 288.06 MAE and 83.05% MAPE. Furthermore, the study illustrates the importance of past viewership and streamer variety to predict future viewership. twitch viewership prediction regression machine learning XGBoost streaming distance metrics feature selection cross-validation feature engineering pre-processing Computer and Information Sciences Data- och informationsvetenskap
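The three error metrics named in the abstract above (MAE, MAPE and RMSE) follow standard definitions, sketched here in plain Python. The viewer counts are invented for illustration and are not data from the study; the function names are likewise only illustrative.

```python
import math


def mae(actual, predicted):
    """Mean absolute error, in the same unit as the data (here: viewers)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent; undefined if any actual value is 0."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error; penalizes large errors more heavily than MAE."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Invented viewer counts (illustration only).
actual = [100, 200, 400]
predicted = [110, 180, 400]

print(mae(actual, predicted), mape(actual, predicted), rmse(actual, predicted))
```

Because MAPE scales each error by the actual value, errors on streams with few viewers weigh heavily, which is one way two predictors can have similar MAE but very different MAPE.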
70	O algoritmo de aprendizado semi-supervisionado co-training e sua aplicação na rotulação de documentos / The semi-supervised learning algorithm co-training applied to label text documents Matsubara, Edson Takashi 26 May 2004 (has links) Em Aprendizado de Máquina, a abordagem supervisionada normalmente necessita de um número significativo de exemplos de treinamento para a indução de classificadores precisos. Entretanto, a rotulação de dados é freqüentemente realizada manualmente, o que torna esse processo demorado e caro. Por outro lado, exemplos não-rotulados são facilmente obtidos se comparados a exemplos rotulados. Isso é particularmente verdade para tarefas de classificação de textos que envolvem fontes de dados on-line tais como páginas de internet, email e artigos científicos. A classificação de textos tem grande importância dado o grande volume de textos disponível on-line. Aprendizado semi-supervisionado, uma área de pesquisa relativamente nova em Aprendizado de Máquina, representa a junção do aprendizado supervisionado e não-supervisionado, e tem o potencial de reduzir a necessidade de dados rotulados quando somente um pequeno conjunto de exemplos rotulados está disponível. Este trabalho descreve o algoritmo de aprendizado semi-supervisionado co-training, que necessita de duas descrições de cada exemplo. Deve ser observado que as duas descrições necessárias para co-training podem ser facilmente obtidas de documentos textuais por meio de pré-processamento. Neste trabalho, várias extensões do algoritmo co-training foram implementadas. Ainda mais, foi implementado um ambiente computacional para o pré-processamento de textos, denominado PreTexT, com o objetivo de utilizar co-training em problemas de classificação de textos. Os resultados experimentais foram obtidos utilizando três conjuntos de dados. Dois conjuntos de dados estão relacionados com classificação de textos e o outro com classificação de páginas de internet. 
Os resultados, que variam de excelentes a ruins, mostram que co-training, similarmente a outros algoritmos de aprendizado semi-supervisionado, é afetado de maneira bastante complexa pelos diferentes aspectos na indução dos modelos. / In Machine Learning, the supervised approach usually requires a large number of labeled training examples to learn accurately. However, labeling is often performed manually, making this process costly and time-consuming. By contrast, unlabeled examples are often inexpensive and easier to obtain than labeled examples. This is especially true for text classification tasks involving on-line data sources, such as web pages, email and scientific papers. Text classification is of great practical importance today given the massive volume of online text available. Semi-supervised learning, a relatively new area in Machine Learning, represents a blend of supervised and unsupervised learning, and has the potential to reduce the need for expensive labeled data whenever only a small set of labeled examples is available. This work describes the semi-supervised learning algorithm co-training, which requires the description of each example to be partitioned into two distinct views. It should be observed that the two different views required by co-training can be easily obtained from textual documents through pre-processing. In this work, several extensions of the co-training algorithm have been implemented. Furthermore, we have also implemented a computational environment for text pre-processing, called PreTexT, in order to apply the co-training algorithm to text classification problems. Experimental results using co-training on three data sets are described. Two data sets are related to text classification and the other one to web-page classification. Results, which range from excellent to poor, show that co-training, similarly to other semi-supervised learning algorithms, is affected by modelling assumptions in a rather complicated way.
aprendizado de máquina aprendizado multi-visão aprendizado semi-supervisionado co-training co-training machine learning mineração de textos multi-view learning pré-processamento de textos semi-supervised learning text mining text pre-processing
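The idea of co-training described in the abstract above, two views per example with each view's classifier labeling unlabeled examples for the shared training set, can be sketched as follows. This is a simplified illustration and not the thesis's implementation: a nearest-centroid classifier stands in for whatever base learners were actually used, confidence is taken to be the distance margin between the two nearest centroids, and all feature values, labels and function names are hypothetical.

```python
import math


def centroids(labeled, view):
    """Mean feature vector per label, computed from one view (0 or 1) only."""
    sums, counts = {}, {}
    for views, label in labeled:
        x = views[view]
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict(cents, x):
    """Return (label, confidence) where confidence is the margin between the two nearest centroids."""
    ranked = sorted((math.dist(c, x), lab) for lab, c in cents.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else 0.0
    return ranked[0][1], margin


def co_train(labeled, unlabeled, rounds=5, per_round=1):
    """Each round, the classifier of each view labels its most confident unlabeled examples."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not unlabeled:
                return labeled
            cents = centroids(labeled, view)
            ranked = sorted(unlabeled,
                            key=lambda views: predict(cents, views[view])[1],
                            reverse=True)
            for views in ranked[:per_round]:
                label, _ = predict(cents, views[view])
                labeled.append((views, label))  # shared set, so the other view benefits too
                unlabeled.remove(views)
    return labeled


# Each example is (view_0_vector, view_1_vector); the values are hypothetical.
labeled = [(((1.0, 0.0), (0.9,)), "sport"), (((0.0, 1.0), (0.1,)), "tech")]
unlabeled = [((0.9, 0.1), (0.8,)), ((0.1, 0.9), (0.2,)),
             ((0.8, 0.3), (0.7,)), ((0.2, 0.7), (0.3,))]

result = co_train(labeled, unlabeled)
print(len(result))
```

A mislabeled example added early is never revisited in this sketch, which is one way the sensitivity to modelling assumptions noted in the abstract can arise.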
