  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
201

Maskininlärning: avvikelseklassificering på sekventiell sensordata. En jämförelse och utvärdering av algoritmer för att klassificera avvikelser i en miljövänlig IoT produkt med sekventiell sensordata

Heidfors, Filip, Moltedo, Elias January 2019
Ett företag har tagit fram en miljövänlig IoT produkt med sekventiell sensordata och vill genom maskininlärning kunna klassificera avvikelser i sensordatan. Det har genom åren utvecklats ett flertal väl fungerande algoritmer för klassificering men det finns emellertid ingen algoritm som fungerar bäst för alla olika problem. Syftet med det här arbetet var därför att undersöka, jämföra och utvärdera olika klassificerare inom "supervised machine learning" för att ta reda på vilken klassificerare som ger högst träffsäkerhet att klassificera avvikelser i den typ av IoT produkt som företaget tagit fram. Genom en litteraturstudie tog vi först reda på vilka klassificerare som vanligtvis använts och fungerat bra i tidigare vetenskapliga arbeten med liknande applikationer. Vi kom fram till att jämföra och utvärdera Random Forest, Naïve Bayes klassificerare och Support Vector Machines ytterligare. Vi skapade sedan ett dataset på 513 exempel som vi använde för träning och validering för respektive klassificerare. Resultatet visade att Random Forest hade betydligt högre träffsäkerhet med 95,7% jämfört med Naïve Bayes klassificerare (81,5%) och Support Vector Machines (78,6%). Slutsatsen för arbetet är att Random Forest med sina 95,7% ger en tillräckligt hög träffsäkerhet så att företaget kan använda maskininlärningsmodellen för att förbättra sin produkt. Resultatet pekar också på att Random Forest, för det här arbetets specifika klassificeringsproblem, är den klassificerare som fungerar bäst inom "supervised machine learning" men att det eventuellt finns möjlighet att få ännu högre träffsäkerhet med andra tekniker som till exempel "unsupervised machine learning" eller "semi-supervised machine learning". / A company has developed an environmentally friendly IoT device that produces sequential sensor data and wants to use machine learning to classify anomalies in that data. Over the years, several well-performing classification algorithms have been developed. 
However, no single algorithm is optimal for every problem. The purpose of this work was therefore to investigate, compare and evaluate different classifiers within supervised machine learning to find out which classifier gives the highest accuracy when classifying anomalies in the kind of IoT device that the company has developed. Through a literature review, we first identified which classifiers are commonly used and have worked well in related work with similar applications. We chose to further compare and evaluate Random Forest, Naïve Bayes and Support Vector Machines. We then created a dataset of 513 examples that we used for training and validation of each classifier. The results showed that Random Forest had considerably higher accuracy, at 95.7%, than Naïve Bayes (81.5%) and Support Vector Machines (78.6%). The conclusion of this work is that Random Forest, at 95.7%, gives a high enough accuracy for the company to use the machine learning model to improve its product. The results also indicate that Random Forest, for this thesis's specific classification problem, is the best-performing classifier within supervised machine learning, but that it may be possible to achieve even higher accuracy with other techniques such as unsupervised or semi-supervised machine learning.
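The comparison described above, several classifiers scored by accuracy on the same data, can be sketched as a small k-fold harness. This is only an illustration under stated assumptions: the two classifiers below (a majority-class baseline and a nearest-centroid rule) are simple stand-ins rather than the Random Forest, Naïve Bayes and SVM models of the thesis, and the sensor readings, function names and fold count are all hypothetical.

```python
import random
from collections import Counter

def k_fold_accuracy(fit, X, y, k=5, seed=0):
    """Mean held-out accuracy over k folds; `fit` returns a predict function."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / k

def fit_majority(X, y):
    """Baseline stand-in: always predict the most common training label."""
    label = Counter(y).most_common(1)[0][0]
    return lambda x: label

def fit_nearest_centroid(X, y):
    """Stand-in classifier: predict the class whose mean vector is closest."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    centroids = {c: [sum(col) / len(rows) for col in zip(*rows)]
                 for c, rows in groups.items()}

    def predict(x):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))
    return predict

# Hypothetical two-feature sensor readings: "normal" near 0, "anomaly" near 5.
rng = random.Random(1)
X = ([[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(60)]
     + [[rng.gauss(5, 1), rng.gauss(5, 1)] for _ in range(40)])
y = ["normal"] * 60 + ["anomaly"] * 40

results = {name: k_fold_accuracy(fit, X, y)
           for name, fit in [("majority", fit_majority),
                             ("nearest_centroid", fit_nearest_centroid)]}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

Evaluating every candidate on the same folds keeps the comparison fair, since a difference in accuracy then reflects the classifier and not the split.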
202

Noun categorisation in North Halmahera

Asplund, Leif January 2015
The languages spoken on northern Halmahera and surrounding small islands constitute a group of related ‘Papuan’ languages called North Halmahera. They are also, together with other Papuan and Austronesian languages, included in a proposed sprachbund which is called East Nusantara. Neuter gender and numeral classifiers have both been proposed to characterize the sprachbund. Consequently, an investigation of the noun categorisation systems in the North Halmahera languages, which is the subject of this study, can be of interest for the characterization of the sprachbund. The method for the investigation is to search for information about seven languages in existing grammatical descriptions, complemented with information which can be culled from published texts in the languages. There are mainly two categorisation systems in all the investigated languages: genders and numeral classifiers. The numerals often contain fossilized prefixes. Among the numeral classifiers, the human classifiers are special because of their origin from pronominal undergoer prefixes and the limitations of their use in some languages. Except in West Makian, there is a default classifier and a classifier for trees, and secondarily for houses, in all languages. A classifier for two-dimensional objects is also quite common. The other classifiers are used with a very limited number of nouns. / Språken som talas på norra Halmahera och omkringliggande småöar utgör en grupp av besläktade ’papuanska’ språk som kallas Nord-Halmahera-språk. De ingår också, ihop med andra papuanska och austronesiska språk, i ett antaget sprachbund som kallas för Östra Nusantara. Neutrum-genus och numeriska klassificerare har båda föreslagits karakterisera sprachbundet. Således kan en undersökning av substantivklassificering från ett historiskt och typologiskt perspektiv i Nord-Halmahera-språken, som är ämnet för den här studien, vara av intresse för karakteriseringen av sprachbundet. 
Metoden för undersökningen är att söka efter information för sju språk i existerande grammatiska beskrivningar, kompletterat med information som kan fås från publicerade texter på språken. Det förekommer huvudsakligen två klassificeringssystem i alla de undersökta språken: genus och numeriska klassificerare. Räkneorden innehåller ofta fossiliserade prefix. Bland de numeriska klassificerarna är människo-klassificerarna speciella genom sitt ursprung i pronominella undergoer-prefix och den begränsade användnings-möjligheten i vissa språk. Utom i västmakianska, förekommer en allmän klassificerare och en klassificerare för träd, och sekundärt för hus, i alla språk. En klassificerare för två-dimensionella objekt är också ganska vanlig. Övriga klassificerare används oftast med ett mycket begränsat antal substantiv.
203

Applying Natural Language Processing to document classification / Tillämpning av Naturlig Språkbehandling för dokumentklassificering

Kragbé, David January 2022
In today's digital world, we produce and use more electronic documents than ever before, and this trend is far from slowing down. In particular, more and more companies now need to process a considerable number of documents to handle their clients' requests. Scaling this process often requires building an automatic document processing pipeline. Since the treatment of a document depends on its content, such pipelines rely heavily on an automatic document classifier to process the received documents correctly. Such a document classifier should be able to receive a document of any type and output its class based on the document's text content. In this thesis, we designed and implemented a machine learning pipeline for automated classification of insurance claim documents. To find the best pipeline, we created several combinations of different classifiers (a logistic regressor and a random forest classifier) and embedding models (Fasttext and Doc2vec). We then compared the performance of all the pipelines using the precision and accuracy metrics. We found that a pipeline composed of a Fasttext embedding model combined with a logistic regressor classifier performed best, yielding a precision of 85% and an accuracy of 86% on our dataset. / I dagens digitala värld, producerar och använder vi fler elektroniska dokument än någonsin tidigare. Denna trend är långt ifrån att sakta ner sig. Särskilt fler och fler företag behöver nu behandla en stor mängd dokument för att hantera sina kunders önskemål. Att skala denna process kräver ofta att man bygger en pipeline för automatisk dokumentbehandling. Eftersom behandlingen av ett dokument beror på dess innehåll, är dessa pipelines starkt beroende av en automatisk dokumentklassificerare för att korrekt bearbeta de mottagna dokumenten. En sådan dokumentklassificerare skall kunna ta emot ett dokument av vilken typ som helst och mata ut dess klass baserat på dokumentets textinnehåll. 
I detta examensarbete, designade och implementerade vi en maskininlärningspipeline för automatiserad klassificering av försäkringskrav-dokument. För att hitta den bästa pipelinen, skapade vi flera kombinationer av olika klassificerare (logistisk regressor och random forest klassificerare) och inbäddningsmodeller (Fasttext och Doc2vec). Vi jämförde sedan prestandan för alla pipelines med hjälp av precisions- och noggrannhetsmåtten. Vi fann att en pipeline bestående av en Fasttext-inbäddningsmodell kombinerad med en logistisk regressorklassificerare var den mest presterande, vilket gav en precision på 85% och en noggrannhet på 86% på vår datauppsättning.
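The two metrics used above to rank the pipelines can be sketched as follows. The abstract does not say how precision was averaged over document classes, so the macro average here is an assumption, and the class labels and function names are hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_precision(y_true, y_pred):
    """Unweighted mean of per-class precision (assumed averaging scheme).

    Only classes that were predicted at least once are averaged, which
    avoids dividing by zero for classes the model never outputs.
    """
    predicted = Counter(y_pred)
    hits = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / predicted[c] for c in predicted) / len(predicted)

# Hypothetical document classes, not taken from the thesis dataset.
y_true = ["invoice", "invoice", "report", "report", "form", "form"]
y_pred = ["invoice", "report", "report", "report", "form", "invoice"]

print(f"accuracy:  {accuracy(y_true, y_pred):.3f}")
print(f"precision: {macro_precision(y_true, y_pred):.3f}")
```

Accuracy counts every correct prediction equally, while precision asks how trustworthy each predicted class is, so the two can diverge when some classes are predicted much more often than others.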
204

[en] INTELLIGENT SYSTEM FOR THE IDENTIFICATION OF FRAUD SUSPECTS IN WATER CONSUMPTION / [pt] SISTEMA INTELIGENTE PARA IDENTIFICAÇÃO DE SUSPEITOS DE FRAUDE NO CONSUMO DE ÁGUA

GUILHERME VINICIUS LIMA DOS ANJOS 11 January 2023
[pt] Um dos maiores problemas de todas as empresas prestadoras de serviço de saneamento e distribuição de água é o de perdas oriundas de irregularidades (comerciais). Dentre os países com mais de 20 milhões de habitantes que mais sofrem desse tipo de perdas, o Brasil ocupa a 14º posição com 40% de perdas na distribuição. A Empresa A, estudo de caso deste trabalho, é uma companhia brasileira que atua no setor de saneamento e distribuição de água e, atua, principalmente, em 3 regiões, com valores de médias percentuais de perdas, em 2021, de 19%, 30% e 43%, respectivamente. Essas perdas são derivadas de muitos problemas, mas as principais são oriundas das fraudes nas ligações dos medidores de água, por exemplo: ligações clandestinas, by-pass e derivação de ramal. A principal forma de combater esse tipo de fraude é através de inspeções nos clientes. Geralmente utiliza-se um conjunto de heurísticas para identificar o suspeito de tal fraude ou irregularidade, porém esses métodos não retornam boas precisões. Na Empresa A, a precisão alcançada através das inspeções varia de 3% a 17% de região para região. Com isso, conclui-se que o procedimento não é eficaz. Sendo assim, o objetivo deste trabalho é desenvolver um sistema inteligente que possa identificar, com maior exatidão, o perfil de consumo do cliente que possui a fraude. O sistema desenvolvido é composto por duas metodologias baseadas em diversos algoritmos supervisionados de aprendizado de máquina. A primeira utiliza um filtro com intuito de agrupar os clientes com perfis similares. A segunda faz uso de um algoritmo evolutivo inspirado em computação quântica para a busca de hiperparâmetros e atributos. Além disso, ambas consideram comitês e exploram a utilização de variáveis históricas e exógenas pertinentes ao contexto. Os resultados obtidos mostraram-se superiores nas avaliações, quando comparadas aos verificados na Empresa A, alcançando até 44% de taxa de acerto. 
/ [en] One of the biggest problems faced by all companies that provide sanitation and water distribution services is losses arising from (commercial) irregularities. Among the countries with more than 20 million inhabitants that suffer the most from this type of loss, Brazil ranks 14th, with 40% losses in distribution. Company A, the case study of this work, is a Brazilian company in the sanitation and water distribution sector that operates mainly in 3 regions, whose average percentage losses in 2021 were 19%, 30% and 43%, respectively. These losses derive from many problems, but the main ones arise from fraud in water meter connections, for example clandestine connections, by-passes and branch diversions. The main way to combat this type of fraud is through customer inspections. Generally, a set of heuristics is used to identify customers suspected of such fraud or irregularity, but these methods do not achieve good precision. At Company A, the precision achieved through inspections varies from 3% to 17% depending on the region, from which it is concluded that the procedure is not effective. The objective of this work is therefore to develop an intelligent system that can identify, with greater accuracy, the consumption profile of customers who commit fraud. The developed system is composed of two methodologies based on several supervised machine learning algorithms. The first uses a filter to group customers with similar profiles. The second uses an evolutionary algorithm inspired by quantum computing to search for hyperparameters and attributes. In addition, both consider committees and explore the use of historical and exogenous variables relevant to the context. The results obtained were superior to those observed at Company A, reaching a success rate of up to 44%.
205

Реконфигурабилне архитектуре за хардверску акцелерацију предиктивних модела машинског учења / Rekonfigurabilne arhitekture za hardversku akceleraciju prediktivnih modela mašinskog učenja / Reconfigurable Architectures for Hardware Acceleration of Machine Learning Classifiers

Vranjković Vuk 02 July 2015
У овој дисертацији представљене су универзалне реконфигурабилне архитектуре грубог степена гранулације за хардверску имплементацију DT (decision trees), ANN (artificial neural networks) и SVM (support vector machines) предиктивних модела као и хомогених и хетерогених ансамбала. Коришћењем ових архитектура реализоване су две врсте DT модела, две врсте ANN модела, две врсте SVM модела и седам врста ансамбала на FPGA (field programmable gate arrays) чипу. Експерименти, засновани на скуповима из стандардне UCI базе скупова за машинско учење, показују да FPGA имплементација омогућава значајно убрзање (од 1 до 6 редова величине) просечног времена потребног за предикцију, у поређењу са софтверским решењима. / U ovoj disertaciji predstavljene su univerzalne rekonfigurabilne arhitekture grubog stepena granulacije za hardversku implementaciju DT (decision trees), ANN (artificial neural networks) i SVM (support vector machines) prediktivnih modela kao i homogenih i heterogenih ansambala. Korišćenjem ovih arhitektura realizovane su dve vrste DT modela, dve vrste ANN modela, dve vrste SVM modela i sedam vrsta ansambala na FPGA (field programmable gate arrays) čipu. Eksperimenti, zasnovani na skupovima iz standardne UCI baze skupova za mašinsko učenje, pokazuju da FPGA implementacija omogućava značajno ubrzanje (od 1 do 6 redova veličine) prosečnog vremena potrebnog za predikciju, u poređenju sa softverskim rešenjima. / This thesis proposes universal coarse-grained reconfigurable computing architectures for hardware implementation of decision trees (DTs), artificial neural networks (ANNs), support vector machines (SVMs), and homogeneous and heterogeneous ensemble classifiers (HHESs). Using these universal architectures, two versions of DTs, two versions of SVMs, two versions of ANNs, and seven versions of HHES machine learning classifiers have been implemented in field programmable gate arrays (FPGAs). Experimental results, based on datasets from the standard UCI machine learning repository, show that the FPGA implementation provides a significant improvement (1–6 orders of magnitude) in the average instance classification time compared with software implementations.
206

L’extraction de phrases en relation de traduction dans Wikipédia

Rebout, Lise 06 1900
Afin d'enrichir les données de corpus bilingues parallèles, il peut être judicieux de travailler avec des corpus dits comparables. En effet dans ce type de corpus, même si les documents dans la langue cible ne sont pas l'exacte traduction de ceux dans la langue source, on peut y retrouver des mots ou des phrases en relation de traduction. L'encyclopédie libre Wikipédia constitue un corpus comparable multilingue de plusieurs millions de documents. Notre travail consiste à trouver une méthode générale et endogène permettant d'extraire un maximum de phrases parallèles. Nous travaillons avec le couple de langues français-anglais mais notre méthode, qui n'utilise aucune ressource bilingue extérieure, peut s'appliquer à tout autre couple de langues. Elle se décompose en deux étapes. La première consiste à détecter les paires d’articles qui ont le plus de chance de contenir des traductions. Nous utilisons pour cela un réseau de neurones entraîné sur un petit ensemble de données constitué d'articles alignés au niveau des phrases. La deuxième étape effectue la sélection des paires de phrases grâce à un autre réseau de neurones dont les sorties sont alors réinterprétées par un algorithme d'optimisation combinatoire et une heuristique d'extension. L'ajout des quelque 560 000 paires de phrases extraites de Wikipédia au corpus d'entraînement d'un système de traduction automatique statistique de référence permet d'améliorer la qualité des traductions produites. Nous mettons les données alignées et le corpus extrait à la disposition de la communauté scientifique. / Working with comparable corpora can be useful to enhance bilingual parallel corpora. In fact, in such corpora, even if the documents in the target language are not the exact translation of those in the source language, one can still find translated words or sentences. The free encyclopedia Wikipedia is a multilingual comparable corpus of several million documents. 
Our task is to find a general endogenous method for extracting a maximum of parallel sentences from this source. We are working with the English-French language pair but our method, which uses no external bilingual resources, can be applied to any other language pair. It can best be described in two steps. The first one consists of detecting article pairs that are most likely to contain translations. This is achieved through a neural network trained on a small data set composed of sentence-aligned articles. The second step is to perform the selection of sentence pairs through another neural network whose outputs are then re-interpreted by a combinatorial optimization algorithm and an extension heuristic. The addition of the 560,000 pairs of sentences extracted from Wikipedia to the training set of a baseline statistical machine translation system improves the quality of the resulting translations. We make both the aligned data and the extracted corpus available to the scientific community.
207

Técnica experimental para quantificar a eficiência de distribuidores de líquidos industriais do tipo tubos perfurados paralelos. / Liquid aspersion efficiency quantification experiment: application in ladder type distributors.

Moraes, Marlene Silva de 07 July 2008
O presente texto descreve um método experimental simples para comparar a eficiência de distribuidores de líquido empregados nas indústrias de tratamento de minérios em lavadores, classificadores e moinhos e nas indústrias de processos químicos. A técnica consiste basicamente em analisar a dispersão pelo desvio padrão da massa do líquido coletado em tubos verticais dispostos em arranjo quadrático colocados abaixo do distribuidor. Como exemplo de aplicação, empregou-se para a coleta da massa de líquido uma unidade piloto, montada no Laboratório de Engenharia Química da Universidade Santa Cecília em Santos, com um banco de 21 tubos verticais de 52 mm de diâmetro interno e 800 mm de comprimento. Uma manta acrílica que não dispersa o líquido com 50 mm de espessura foi fixada entre o distribuidor e o banco de tubos para evitar respingos. Foram realizados ensaios com nove distribuidores do tipo espinha de peixe de 4 tubos paralelos cada, para uma coluna piloto com 400 mm de diâmetro. A literatura é discordante no que concerne aos parâmetros de projeto e eficiência destes distribuidores. Variaram-se o número (n) de orifícios (95, 127 e 159 furos/m², 12, 16 e 20 furos por distribuidor) o diâmetro (d) dos orifícios (2, 3 e 4 mm) e as vazões de entrada indicadas por rotâmetro nos distribuidores (q) de 1,2; 1,4 e 1,6 m³/h. A melhor eficiência de espalhamento pelo menor desvio padrão (0,302) foi obtida com n de 159 furos/m², d de 2 mm e q de 1,4 m³/h indicando as limitações dos parâmetros de projeto da literatura. A pressão (p), na entrada do distribuidor para esta condição, foi de apenas 0,51 kgf/cm². A relação adimensional entre a área da seção do tubo de alimentação e a somatória da área dos furos foi de 5,81, a vazão volumétrica total por unidade de área da seção da coluna para esta melhor condição foi de 11,32 m³/(h.m²) e a velocidade média (v) em cada orifício foi de 6,31 m/s. 
Portanto, o método proposto permite comparar e quantificar a eficiência de distribuidores além de demonstrar a não validade de alguns parâmetros de projeto recomendados pela literatura. / This text describes a simple experimental method for comparing the efficiency of liquid distributors used in the ore treatment industry (in washers, classifiers and mills) and in the chemical process industries. The technique basically consists of analysing the dispersion, via the standard deviation, of the mass of liquid collected in vertical tubes arranged in a square pattern below the distributor. As an example application, the liquid mass was collected in a pilot unit, installed at Santa Cecília University's Chemical Engineering Laboratory in Santos, with a bank of 21 vertical tubes of 52 mm internal diameter and 800 mm length. A 50 mm thick acrylic blanket, which does not disperse the liquid, was fixed between the distributor and the tube bank to avoid splashing. Tests were carried out with nine ladder-type distributors of 4 parallel tubes each, for a pilot column 400 mm in diameter. The literature disagrees on the design parameters and efficiency of these distributors. The variables were the number of holes (n): 95, 127 and 159 holes/m² (12, 16 and 20 holes per distributor); the hole diameter (d): 2, 3 and 4 mm; and the inlet flow rate to the distributors (q), as indicated by a rotameter: 1.2, 1.4 and 1.6 m³/h. The best spreading efficiency, given by the lowest standard deviation (0.302), was obtained with n = 159 holes/m², d = 2 mm and q = 1.4 m³/h, indicating the limitations of the design parameters in the literature. The pressure (p) at the distributor inlet for this condition was only 0.51 kgf/cm². The dimensionless ratio between the cross-sectional area of the feed pipe and the total area of the holes was 5.81, the total volumetric flow rate per unit cross-sectional area of the column for this best condition was 11.32 m³/(h·m²), and the average velocity (v) in each hole was 6.31 m/s. The proposed method therefore makes it possible to compare and quantify the efficiency of distributors, and also demonstrates that some design parameters recommended in the literature are not valid.
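The efficiency criterion described above, the standard deviation of the liquid mass collected per tube, can be sketched in a few lines. The masses below are hypothetical, the function names are illustrative, and the use of the population standard deviation is an assumption, since the abstract does not say whether the sample or population form (or any normalisation) was used.

```python
from statistics import mean, pstdev

def spread(masses):
    """Return (mean, population standard deviation) of the mass per tube."""
    return mean(masses), pstdev(masses)

def best_distributor(trials):
    """Name of the trial whose tube masses have the smallest standard deviation."""
    return min(trials, key=lambda name: pstdev(trials[name]))

# Hypothetical collected masses in kg, one value per collecting tube.
trials = {
    "A": [1.0, 1.1, 0.9, 1.0],  # even spreading
    "B": [0.4, 1.6, 0.7, 1.3],  # uneven spreading
}

for name, masses in trials.items():
    m, s = spread(masses)
    print(f"{name}: mean={m:.3f} kg, std={s:.3f} kg")
print("most uniform:", best_distributor(trials))
```

A lower standard deviation means the liquid reached the tubes more evenly, which is the sense in which the abstract ranks one distributor configuration above another.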
208

Hardware Acceleration of Nonincremental Algorithms for the Induction of Decision Trees and Decision Tree Ensembles / Хардверска акцелерација неинкременталних алгоритама за формирање стабала одлуке и њихових ансамбала / Hardverska akceleracija neinkrementalnih algoritama za formiranje stabala odluke i njihovih ansambala

Vukobratović Bogdan 22 February 2017
The thesis proposes novel full decision tree and decision tree ensemble induction algorithms, EFTI and EEFTI, and explores various possibilities for their implementation. The experiments show that the proposed EFTI algorithm is able to infer much smaller DTs on average, without significant loss in accuracy, when compared to top-down incremental DT inducers. On the other hand, when compared to other full tree induction algorithms, it was able to produce more accurate DTs of similar size in shorter times. Hardware architectures for accelerating these algorithms (EFTIP and EEFTIP) are also proposed, and experiments show that they can offer substantial speedups. / У овој дисертацији, представљени су нови алгоритми EFTI и EEFTI за формирање стабала одлуке и њихових ансамбала неинкременталном методом, као и разне могућности за њихову имплементацију. Експерименти показују да је предложени EFTI алгоритам у могућности да произведе драстично мања стабла без губитка тачности у односу на постојеће top-down инкременталне алгоритме, а стабла знатно веће тачности у односу на постојеће неинкременталне алгоритме. Такође су предложене хардверске архитектуре за акцелерацију ових алгоритама (EFTIP и EEFTIP) и показано је да је уз помоћ ових архитектура могуће остварити знатна убрзања. / U ovoj disertaciji, predstavljeni su novi algoritmi EFTI i EEFTI za formiranje stabala odluke i njihovih ansambala neinkrementalnom metodom, kao i razne mogućnosti za njihovu implementaciju. Eksperimenti pokazuju da je predloženi EFTI algoritam u mogućnosti da proizvede drastično manja stabla bez gubitka tačnosti u odnosu na postojeće top-down inkrementalne algoritme, a stabla znatno veće tačnosti u odnosu na postojeće neinkrementalne algoritme. Takođe su predložene hardverske arhitekture za akceleraciju ovih algoritama (EFTIP i EEFTIP) i pokazano je da je uz pomoć ovih arhitektura moguće ostvariti znatna ubrzanja.
209

Uma investigação empírica e comparativa da aplicação de RNAs ao problema de mineração de opiniões e análise de sentimentos

Moraes, Rodrigo de 26 March 2013
A área de Mineração de Opiniões e Análise de Sentimentos surgiu da necessidade de processamento automatizado de informações textuais referentes a opiniões postadas na web. Como principal motivação está o constante crescimento do volume desse tipo de informação, proporcionado pelas tecnologias trazidas pela Web 2.0, que torna inviável o acompanhamento e análise dessas opiniões úteis tanto para usuários com pretensão de compra de novos produtos quanto para empresas para a identificação de demanda de mercado. Atualmente, a maioria dos estudos em Mineração de Opiniões e Análise de Sentimentos que fazem o uso de mineração de dados se voltam para o desenvolvimento de técnicas que procuram uma melhor representação do conhecimento e acabam utilizando técnicas de classificação comumente aplicadas, não explorando outras que apresentam bons resultados em outros problemas. Sendo assim, este trabalho tem como objetivo uma investigação empírica e comparativa da aplicação do modelo clássico de Redes Neurais Artificiais (RNAs), o multilayer perceptron, no problema de Mineração de Opiniões e Análise de Sentimentos. Para isso, bases de dados de opiniões são definidas e técnicas de representação de conhecimento textual são aplicadas sobre essas objetivando uma igual representação dos textos para os classificadores através de unigramas. 
A partir dessa representação, os classificadores Support Vector Machines (SVM), Naïve Bayes (NB) e RNAs são aplicados considerando três diferentes contextos de base de dados: (i) bases de dados balanceadas, (ii) bases com diferentes níveis de desbalanceamento e (iii) bases em que a técnica para o tratamento do desbalanceamento undersampling randômico é aplicada. A investigação do contexto desbalanceado e de outros originados dele se mostra relevante uma vez que bases de opiniões disponíveis na web normalmente apresentam mais opiniões positivas do que negativas. Para a avaliação dos classificadores são utilizadas métricas tanto para a mensuração de desempenho de classificação quanto para a de tempo de execução. Os resultados obtidos sobre o contexto balanceado indicam que as RNAs conseguem superar significativamente os resultados dos demais classificadores e, apesar de apresentarem um grande custo computacional para treinamento, proporcionam tempos de classificação significantemente inferiores aos do classificador que apresentou os resultados de classificação mais próximos aos dos resultados das RNAs. Já para o contexto desbalanceado, as RNAs se mostram sensíveis ao aumento de ruído na representação dos dados e ao aumento do desbalanceamento, se destacando nestes experimentos, o classificador NB. Com a aplicação de undersampling as RNAs conseguem ser equivalentes aos demais classificadores apresentando resultados competitivos. Porém, podem não ser o classificador mais adequado de se adotar nesse contexto quando considerados os tempos de treinamento e classificação, e também a diferença pouco expressiva de acerto de classificação. / The area of Opinion Mining and Sentiment Analysis arose from the need for automated processing of textual information about reviews posted on the web. 
The main motivation of this area is the constant growth in the volume of such information, enabled by the technologies brought by Web 2.0, which makes it infeasible to monitor and analyse these reviews manually, even though they are useful both for users who intend to purchase new products and for companies seeking to identify market demand. Currently, most studies in Opinion Mining and Sentiment Analysis that make use of data mining focus on developing techniques that seek a better knowledge representation and end up using commonly applied classification techniques, without exploring other classifiers that work well on other problems. Thus, this work presents a comparative empirical investigation of the application of the classical Artificial Neural Network (ANN) model, the multilayer perceptron, to the Opinion Mining and Sentiment Analysis problem. To this end, review datasets are defined and textual knowledge representation techniques are applied to them, so that all classifiers receive the same unigram-based representation of the texts. Using this representation, the Support Vector Machines (SVM), Naïve Bayes (NB) and ANN classifiers are applied in three dataset contexts: (i) balanced datasets, (ii) datasets with different levels of imbalance and (iii) datasets to which random undersampling is applied to handle the imbalance. Investigating the unbalanced context, and others derived from it, is relevant because review datasets available on the web usually contain more positive opinions than negative ones. The classifiers are evaluated with metrics for both classification performance and run time. The results obtained in the balanced context indicate that ANNs significantly outperform the other classifiers and, although they have a high computational cost for training, they provide classification times significantly lower than those of the classifier whose classification results came closest to those of the ANNs. 
In the unbalanced context, ANNs are sensitive to increased noise in the data representation and to increased imbalance, and the NB classifier stands out in these experiments. With undersampling applied, ANNs are on par with the other classifiers, attaining competitive results. However, they may not be the most appropriate classifier for this context when the training and classification times, and the small difference in classification accuracy, are taken into account.
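Random undersampling, the imbalance-handling technique named in context (iii), can be sketched as below. This is a minimal illustration assuming every class is reduced to the size of the smallest one; the dissertation may have used a different target ratio, and the review strings, labels and function name are hypothetical.

```python
import random
from collections import Counter

def random_undersample(samples, labels, seed=0):
    """Randomly drop examples until every class has as many as the rarest one."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = min(len(items) for items in by_class.values())
    kept = []
    for label, items in by_class.items():
        # rng.sample draws `target` distinct items without replacement.
        kept.extend((sample, label) for sample in rng.sample(items, target))
    rng.shuffle(kept)  # avoid leaving the classes in contiguous blocks
    return [s for s, _ in kept], [l for _, l in kept]

# Hypothetical review set with four positive reviews per negative one.
reviews = [f"review {i}" for i in range(10)]
labels = ["pos"] * 8 + ["neg"] * 2

X_bal, y_bal = random_undersample(reviews, labels)
print(Counter(y_bal))
```

The trade-off is that balance is bought by discarding majority-class examples, so the classifier trains on less data than was collected.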
210

Aspects of Online Learning

Harrington, Edward January 2004
Online learning algorithms have several key advantages compared to their batch learning counterparts: they are generally more memory efficient and computationally more efficient; they are simpler to implement; and they are able to adapt to changes where the learning model is time varying. Because of their simplicity, online algorithms are very appealing to practitioners. This thesis investigates several online learning algorithms and their application. The thesis has an underlying theme of combining several simple algorithms to give better performance. In this thesis we investigate: combining weights, combining hypotheses, and (sort of) hierarchical combining. Firstly, we propose a new online variant of the Bayes point machine (BPM), called the online Bayes point machine (OBPM). We study the theoretical and empirical performance of the OBPM algorithm. We show that the empirical performance of the OBPM algorithm is comparable with other large margin classifier methods such as the approximately large margin algorithm (ALMA) and methods which maximise the margin explicitly, like the support vector machine (SVM). The OBPM algorithm, when used with a parallel architecture, offers potential computational savings compared to ALMA. We compare the test error performance of the OBPM algorithm with other online algorithms: the Perceptron, the voted-Perceptron, and Bagging. We demonstrate that the combination of the voted-Perceptron algorithm and the OBPM algorithm, called the voted-OBPM algorithm, has better test error performance than the voted-Perceptron and Bagging algorithms. We investigate the use of various online voting methods on the problem of ranking and the problem of collaborative filtering of instances. We look at the application of online Bagging and OBPM algorithms to the telecommunications problem of channel equalization. 
We show that both online methods were successful at reducing the effect on the test error of label flipping and additive noise. Secondly, we introduce a new mixture of experts algorithm, the fixed-share hierarchy (FSH) algorithm. The FSH algorithm is able to track the mixture of experts when the switching rate between the best experts may not be constant. We study the theoretical aspects of the FSH and the practical application of it to adaptive equalization. Using simulations we show that the FSH algorithm is able to track the best expert, or mixture of experts, in both the case where the switching rate is constant and the case where the switching rate is time varying.
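The voted-Perceptron, one of the baseline online algorithms named in the abstract, illustrates the idea of combining simple hypotheses. The sketch below follows the standard Freund and Schapire formulation and is not the thesis's OBPM or voted-OBPM algorithm; the data points, the constant bias feature and the function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sign(z):
    return 1 if z > 0 else -1

def train_voted_perceptron(data, epochs=5):
    """Process examples one at a time, keeping every intermediate weight
    vector together with the number of examples it classified correctly."""
    v = [0.0] * len(data[0][0])
    c = 0
    history = []
    for _ in range(epochs):
        for x, y in data:              # labels y are +1 or -1
            if y * dot(v, x) > 0:
                c += 1                 # current vector survives this example
            else:
                if c > 0:
                    history.append((v, c))
                v = [vi + y * xi for vi, xi in zip(v, x)]  # perceptron update
                c = 1
    history.append((v, c))
    return history

def predict_voted(history, x):
    """Each stored vector votes with weight equal to its survival count."""
    return sign(sum(c * sign(dot(v, x)) for v, c in history))

# Hypothetical linearly separable points; the trailing 1 is a bias feature.
data = [([2, 1, 1], 1), ([1, 3, 1], 1), ([3, 2, 1], 1),
        ([-1, -2, 1], -1), ([-2, -1, 1], -1), ([-3, -2, 1], -1)]

history = train_voted_perceptron(data)
print([predict_voted(history, x) for x, _ in data])
```

Weighting each vector by how long it survived lets vectors that were reliable for many examples dominate the vote, which is the kind of hypothesis combination the thesis builds on.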
