131 |
[pt] CLASSIFICAÇÃO DE FALHAS DE EQUIPAMENTOS DE UNIDADE DE INTERVENÇÃO EM CONSTRUÇÃO DE POÇOS MARÍTIMOS POR MEIO DE MINERAÇÃO TEXTUAL / [en] TEXT CLASSIFICATION OF OFFSHORE RIG EQUIPMENT FAILURE. 07 April 2020 (has links)
[pt] A construção de poços marítimos tem se mostrado uma atividade complexa
e de alto risco. Para efetuar esta atividade as empresas se valem principalmente
das unidades de intervenção de poços, também conhecidas como sondas. Estas
possuem altos valores de taxas diárias de uso devido à manutenção preventiva da
unidade em si, mas também por falhas as quais seus equipamentos estão sujeitos.
No cenário específico da Petrobras, em junho de 2011, foi implantado no banco de
dados da empresa um maior detalhamento na classificação das falhas de
equipamentos de sonda. Com isso gerou-se uma descontinuidade nos registros da
empresa e a demanda para adequar estes casos menos detalhados à classificação
atual, mais completa. Os registros são compostos basicamente de informação
textual. Para um passivo de 3384 registros, seria inviável alocar uma pessoa para
classificá-los. Com isso vislumbrou-se uma ferramenta que pudesse efetuar esta
classificação da forma mais automatizada possível, utilizando os registros feitos
após junho de 2011 como base. O objetivo principal deste trabalho é de sanar esta
descontinuidade nos registros de falha de equipamentos de sonda. Os dados foram
tratados e transformados por meio de ferramentas de mineração textual bem como
processados pelo algoritmo de aprendizado supervisionado SVM (Support Vector
Machines). Ao final, após obter a melhor configuração do modelo, este foi
aplicado às informações textuais do passivo de anormalidades, atribuindo suas
classes de acordo com o novo sistema de classificação. / [en] Off-shore well construction has shown to be a complex and risky activity. In
order to build off-shore wells, operators rely mainly on off-shore rigs. These rigs
have an expensive day rate, related to their rental and maintenance, but also due to
their equipment failure. At off-shore Petrobras scenario, on June of 2011, was
implemented at the company database a better detailing on the classification of rig
equipment failure. That brought a discontinuity to the database records and
created a demand for adequacy of the former classification to the new
classification structure. Basically, rig equipment failure records are based on
textual information. For a liability of 3384 records, it was unable for one person to
manage the task. Therefore, an urge came for a tool that could classify these
records automatically, using database records already classified under the new
labels. The main purpose of this work is to overcome this database discontinuity.
Data was treated and transformed through text mining tools and then processed by
supervised learning algorithm SVM (Support Vector Machines). After obtaining
the best model configuration, the old records were submitted under this model and
were classified according to the new classification structure.
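The abstract above describes its text-mining transformations only at a high level. As a hedged illustration of one common such transformation, TF-IDF term weighting, here is a minimal sketch; the failure-record snippets and function names are invented for the example and are not taken from the thesis:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each term by term frequency times smoothed inverse document frequency."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                      # in how many documents each term appears
    for toks in tokenized:
        df.update(set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log((1 + n) / (1 + df[t])) for t in tf})
    return vectors

# Invented failure-record snippets: rare terms get higher weights than common ones.
records = ["pump seal failure", "pump motor failure", "crane wire damage"]
vecs = tfidf_vectors(records)
```

In a record like "pump seal failure", the rare term "seal" ends up with a higher weight than the ubiquitous "failure", which is what lets a downstream classifier such as an SVM focus on discriminative vocabulary.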
|
132 |
Sledovač aktuálního dění / Actual Events Tracker. Odstrčilík, Martin. January 2013 (has links)
The goal of this master thesis project was to develop an application for tracking current events in the users' surrounding area. The application should allow users to view events, create new events and add comments to existing ones. Beyond the implementation of the application, this project deals with an analysis of the presented problem. The analysis includes a comparison with existing solutions and a survey of available technologies and frameworks applicable to the implementation. Another part of this work describes the theory behind the data classification that is used internally for event and comment analysis. The work also includes the design of the application, covering the user interface, software architecture, database, communication protocol and data classifiers. The main part of the project, the implementation, is described afterwards. At the end of the work, there is a summary of the whole process together with some ideas for enhancing the application in the future.
|
133 |
Deep Learning för klassificering av kundsupport-ärenden. Jonsson, Max. January 2020 (has links)
Företag och organisationer som tillhandahåller kundsupport via e-post kommer över tid att samla på sig stora mängder textuella data. Tack vare kontinuerliga framsteg inom Machine Learning ökar ständigt möjligheterna att dra nytta av tidigare insamlat data för att effektivisera organisationens framtida supporthantering. Syftet med denna studie är att analysera och utvärdera hur Deep Learning kan användas för att automatisera processen att klassificera supportärenden. Studien baseras på ett svenskt företags domän där klassificeringarna sker inom företagets fördefinierade kategorier. För att bygga upp ett dataset extraherades supportärenden inkomna via e-post (par av rubrik och meddelande) från företagets supportdatabas, där samtliga ärenden tillhörde en av nio distinkta kategorier. Utvärderingen gjordes genom att analysera skillnaderna i systemets uppmätta precision då olika metoder för datastädning användes, samt då de neurala nätverken byggdes upp med olika arkitekturer. En avgränsning gjordes att endast undersöka olika typer av Convolutional Neural Networks (CNN) samt Recurrent Neural Networks (RNN) i form av både enkel- och dubbelriktade Long Short Time Memory (LSTM) celler. Resultaten från denna studie visar ingen ökning i precision för någon av de undersökta datastädningsmetoderna. Dock visar resultaten att en begränsning av den använda ordlistan heller inte genererar någon negativ effekt. En begränsning av ordlistan kan fortfarande vara användbar för att minimera andra effekter så som exempelvis träningstiden, och eventuellt även minska risken för överanpassning. Av de undersökta nätverksarkitekturerna presterade CNN bättre än RNN på det använda datasetet. Den mest gynnsamma nätverksarkitekturen var ett nätverk med en konvolution per pipeline som för två olika test-set genererade precisioner på 79,3 respektive 75,4 procent. 
Resultaten visar också att några kategorier är svårare för nätverket att klassificera än andra, eftersom dessa inte är tillräckligt distinkta från resterande kategorier i datasetet. / Companies and organizations providing customer support via email will over time accumulate a large corpus of text documents. With the advances made in Machine Learning, the possibilities to use this data to improve customer support efficiency are steadily increasing. The aim of this study is to analyze and evaluate the use of Deep Learning methods for automating the process of classifying support errands. The study is based on a Swedish company's domain, where the classification is made within the company's predefined categories. A dataset was built by obtaining email support errands (subject and body pairs) from the company's support database. Each errand belonged to one of nine separate categories. The evaluation was done by analyzing the change in classification accuracy when different methods for data cleaning were used and when the neural networks were built with different architectures. A delimitation was set to only examine Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) in the shape of both unidirectional and bidirectional Long Short-Term Memory (LSTM) cells. The results of this study show no increase in classification accuracy for any of the examined data cleaning methods. However, a feature reduction of the used vocabulary proved to have no negative impact on accuracy either. A feature reduction might still be beneficial to minimize other side effects, such as the time required to train a network, and possibly to help prevent overfitting. Among the examined network architectures, CNN outperformed RNN on the used dataset. The most accurate architecture was a network with a single convolution per pipeline, which reached classification rates of 79.3 and 75.4 percent on two different test sets. The results also show that some categories are harder to classify than others, because they are not distinct enough from the rest of the categories in the dataset.
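The vocabulary reduction discussed above is easy to picture as truncating the token list to the most frequent entries before sequences are fed to a CNN or LSTM. A minimal sketch follows; the email snippets, special-token ids and size limits are invented for illustration and are not the thesis's code:

```python
from collections import Counter

def build_vocab(texts, max_size):
    """Keep only the max_size most frequent tokens; all others map to <unk>."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, _ in counts.most_common(max_size):
        vocab[tok] = len(vocab)
    return vocab

def encode(text, vocab, seq_len):
    """Map a text to a fixed-length id sequence, padding or truncating as needed."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]
    return (ids + [vocab["<pad>"]] * seq_len)[:seq_len]

# Invented support-email snippets; only the 3 most common tokens survive.
emails = ["faktura saknas", "faktura fel belopp", "inloggning fungerar inte"]
vocab = build_vocab(emails, max_size=3)
encoded = encode("faktura saknas helt", vocab, seq_len=5)
```

Tokens outside the kept vocabulary all collapse to `<unk>`, which shrinks the embedding table and training time while, as the study found, not necessarily hurting accuracy.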
|
134 |
Multilingual identification of offensive content in social media. Pàmies Massip, Marc. January 2020 (has links)
In today's society there is a large number of social media users who are free to express their opinion on shared platforms. The socio-cultural differences between the people behind those accounts (in terms of ethnicity, gender, sexual orientation, religion, politics, etc.) give rise to a significant share of online discussions that make use of offensive language, which often negatively affects the psychological well-being of the victims. To address the problem, the endless stream of user-generated content creates a need for an accurate and scalable solution that detects offensive language using automated methods. This thesis explores different approaches to the offensiveness detection task, focusing on five languages: Arabic, Danish, English, Greek and Turkish. The results obtained using Support Vector Machines (SVM), Convolutional Neural Networks (CNN) and Bidirectional Encoder Representations from Transformers (BERT) are compared, with some of the tested methods achieving state-of-the-art results. The effects of the embeddings used, the dataset size, the class imbalance percentage and the addition of sentiment features are studied and analysed, as well as the cross-lingual capabilities of pre-trained multilingual models.
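For multilingual tasks like the one above, character n-grams are a common language-agnostic feature choice for the SVM baseline, since they assume neither word boundaries nor a specific alphabet. The thesis does not specify its feature set here, so the following is an illustrative sketch only:

```python
def char_ngrams(text, n_min=2, n_max=3):
    """Extract overlapping character n-grams, a language-agnostic text feature."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]
```

The same extractor works unchanged on Arabic, Danish, English, Greek or Turkish strings, which is one reason character n-grams travel well across languages.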
|
135 |
The past, present or future? : A comparative NLP study of Naive Bayes, LSTM and BERT for classifying Swedish sentences based on their tense. Navér, Norah. January 2021 (has links)
Natural language processing (NLP) is a field in computer science that is becoming increasingly important. One important part of NLP is the ability to sort text into the past, present or future, depending on when the described event occurred or will occur. The objective of this thesis was to use text classification to classify Swedish sentences based on their tense: past, present or future. Furthermore, the objective was also to compare how lemmatisation would affect the performance of the models. The problem was tackled by implementing three machine learning models on both lemmatised and non-lemmatised data. The machine learning models were Naive Bayes, LSTM and BERT. The results showed that the overall performance was affected negatively when the data was lemmatised. The best performing model was BERT, with an accuracy of 96.3%. The result was useful, as the best performing model had very high accuracy and performed well on newly constructed sentences. / Språkteknologi är ett område inom datavetenskap som har blivit allt viktigare. En viktig del av språkteknologi är förmågan att sortera texter till det förflutna, nuet eller framtiden, beroende på när en händelse skedde eller kommer att ske. Syftet med denna avhandling var att använda textklassificering för att klassificera svenska meningar baserat på deras tempus, antingen dåtid, nutid eller framtid. Vidare var syftet även att jämföra hur lemmatisering skulle påverka modellernas prestanda. Problemet hanterades genom att implementera tre maskininlärningsmodeller på både lemmatiserade och icke lemmatiserade data. Maskininlärningsmodellerna var Naive Bayes, LSTM och BERT. Resultatet var att den övergripande prestandan påverkades negativt när datan lemmatiserades. Den bäst presterande modellen var BERT med en träffsäkerhet på 96,3 %. Resultatet var användbart eftersom den bäst presterande modellen hade mycket hög träffsäkerhet och fungerade bra på nykonstruerade meningar.
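The negative effect of lemmatisation reported above is easy to see in miniature: a lemmatiser collapses inflected verb forms, and in Swedish the inflection is precisely where tense is encoded. The toy lemma table below is an invented stand-in for a real Swedish lemmatiser, used only to make the point:

```python
# Invented toy lemma table standing in for a real Swedish lemmatiser.
LEMMAS = {"köpte": "köpa", "köper": "köpa", "sprang": "springa", "springer": "springa"}

def lemmatise(tokens):
    """Map each token to its base form when the table knows it."""
    return [LEMMAS.get(tok, tok) for tok in tokens]

past = "hon köpte boken".split()      # past tense: "she bought the book"
present = "hon köper boken".split()   # present tense: "she buys the book"
```

Before lemmatisation the two sentences differ in the verb form; after it they become identical, so any tense cue carried by the inflection is lost to the classifier.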
|
136 |
Leveraging Sequential Nature of Conversations for Intent Classification. Gotteti, Shree. January 2021 (has links)
No description available.
|
137 |
Klassificering av kvitton med hjälp av maskininlärning. Enerstrand, Simon. January 2019 (has links)
Maskininlärning nyttjas inom fler och fler områden. Det har potential att ersätta många repetitiva arbetsuppgifter, eller åtminstone förenkla dem. Dokumenthantering inom ekonomisystem är ett område maskininlärning kan hjälpa till med. Det behövs ofta mycket manuell input i olika fält genom att avläsa fakturor eller kvitton. Målet med projektet är att skapa en applikation som nyttjar maskininlärning åt företaget Centsoft AB. Applikationen ska ta emot OCR-tolkad textmassa från en bild på ett kvitto och sedan, med hög säkerhet, kunna avgöra vilken kategori kvittot tillhör. Den här rapporten syftar till att visa utvecklingen av maskininlärningsmodellen i applikationen. Rapporten svarar på frågeställningen: ”Hur kan kvitton klassificeras med hjälp av maskininlärning?”. Undersökningsmetoden fallstudie och projektmetoden MoSCoW tillämpas i projektet. Projektet tar även hänsyn till åtagandetriangeln. Maskininlärningsramverk används för att utvärdera den upptränade modellen. Den tränade modellen klarar av att, med hög säkerhet, tolka kvitton den inte stött på tidigare. För att få en meningsfull tolkning måste kvitton ha i avsikt att tillhöra någon av de åtta tränade kategorierna. Valet av metoder passade bra till projektet för att besvara frågeställningen. Applikationen kan utvecklas vidare och implementeras i fakturahanteringssystemet. Genomförandet av projektet ger kunskap att arbeta med maskininlärningslösningar. Tekniken kan i framtiden appliceras på flera områden. / Machine learning is used in more and more areas. It has the potential to replace many repetitive tasks, or at least simplify them. Document management within financial systems is one area where machine learning can help. A lot of manual input is often needed in various fields when reading invoices or receipts. The goal of the project is to create an application that uses machine learning for the company Centsoft AB. The application should receive OCR-interpreted text from an image of a receipt and then, with high certainty, be able to determine which category the receipt belongs to. This report aims to show the development of the machine learning model in the application. The report answers the question: "How can receipts be classified using machine learning?". The case study methodology and the MoSCoW project method are applied during the project. The project also considers the commitment triangle described by Eklund. Machine learning frameworks are used to evaluate the trained model. The trained model can, with high certainty, interpret receipts it has not encountered before. In order to get a meaningful interpretation, a receipt must be intended to belong to one of the eight trained categories. The chosen methods suited the project well and made it possible to answer the question. The application can be developed further and implemented in the invoice management system. Carrying out the project gives knowledge about working with machine learning solutions. In the future, the technology can be applied in several areas.
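The abstract does not name the model behind the receipt categoriser, so as a minimal illustration of classifying OCR text into categories, here is a multinomial naive Bayes sketch; the receipt strings and the two category names are invented for the example:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesText:
    """Multinomial naive Bayes with add-one smoothing (illustrative sketch only)."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)   # per-label word frequencies
        self.label_counts = Counter(labels)       # class priors
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab_size = len({w for c in self.word_counts.values() for w in c})
        return self

    def predict(self, text):
        words = text.lower().split()
        total_docs = sum(self.label_counts.values())

        def log_prob(label):
            counts = self.word_counts[label]
            total = sum(counts.values())
            score = math.log(self.label_counts[label] / total_docs)
            for w in words:
                score += math.log((counts[w] + 1) / (total + self.vocab_size))
            return score

        return max(self.label_counts, key=log_prob)

# Invented OCR'd receipt snippets with two hypothetical categories.
receipts = ["taxi stockholm arlanda", "lunch restaurang",
            "taxi göteborg", "middag restaurang pizza"]
labels = ["travel", "food", "travel", "food"]
clf = NaiveBayesText().fit(receipts, labels)
```

Unseen words (such as "till" below) contribute the same smoothed probability to every class, so the prediction is driven by the words the model has actually seen.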
|
138 |
[pt] APLICANDO APRENDIZADO DE MÁQUINA À SUPERVISÃO DO MERCADO DE CAPITAIS: CLASSIFICAÇÃO E EXTRAÇÃO DE INFORMAÇÕES DE DOCUMENTOS FINANCEIROS / [en] APPLYING MACHINE LEARNING TO CAPITAL MARKETS SUPERVISION: CLASSIFICATION AND INFORMATION EXTRACTION FROM FINANCIAL DOCUMENTS. FREDERICO SHU. 06 January 2022 (has links)
[pt] A análise de documentos financeiros não estruturados é uma atividade essencial para a supervisão do mercado de capitais realizada pela Comissão de Valores Mobiliários (CVM). Formas de automatização que reduzam o esforço humano despendido no processo de triagem de documentos são vitais para a CVM lidar com a escassez de recursos humanos e a expansão do mercado de valores mobiliários. Nesse contexto, a dissertação compara sistematicamente diversos algoritmos de aprendizado de máquina e técnicas de processamento de texto, a partir de sua aplicação em duas tarefas de processamento de linguagem natural – classificação de documentos e extração de informações – desempenhadas em ambiente real de supervisão de mercados. Na tarefa de classificação, os algoritmos clássicos proporcionaram melhor desempenho que as redes neurais profundas, o qual foi potencializado pela aplicação de técnicas de subamostragem e comitês de máquinas (ensembles). A precisão atual, estimada entre 20 por cento e 40 por cento, pode ser aumentada para mais de 90 por cento com a aplicação dos algoritmos testados. A arquitetura BERT foi capaz de extrair informações sobre aumento de capital e incorporação societária de documentos financeiros. Os resultados satisfatórios obtidos em ambas as tarefas motivam a implementação futura em regime de produção dos modelos estudados, sob a forma de um sistema de apoio à decisão. Outra contribuição da dissertação é o CVMCorpus, um corpus constituído para o escopo deste trabalho com documentos financeiros entregues por companhias abertas brasileiras à CVM entre 2009 e 2019, que abre possibilidades de pesquisa futura linguística e de finanças.
/ [en] The analysis of unstructured financial documents is key to the capital markets supervision performed by the Comissão de Valores Mobiliários (Brazilian SEC or CVM). Systems capable of reducing the human effort involved in screening documents and outlining relevant information for further manual review are important tools for the CVM to deal with the shortage of human resources and the expansion of the Brazilian securities market. In this regard, this dissertation presents and discusses the application of several machine learning algorithms and text processing techniques to perform two natural language processing tasks (document classification and information extraction) in a real market supervision environment. In the classification exercise, classic algorithms achieved better performance than deep neural networks, which was further enhanced by applying undersampling techniques and ensembles. Using the tested algorithms can improve the current precision rate from between 20 and 40 percent to more than 90 percent. The BERT network architecture was able to extract information on capital increases and mergers from financial documents. The successful results obtained in both tasks encourage the future implementation of the studied models in the form of a decision support system. Another contribution of this work is the CVMCorpus, a corpus built to produce datasets for these tasks from financial documents released between 2009 and 2019 by Brazilian companies, which opens possibilities for future linguistic and finance research.
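The undersampling mentioned above as one of the performance boosters can be sketched as randomly discarding majority-class examples until the classes are balanced. This is a generic sketch of the technique, not the dissertation's implementation, and the class names are invented:

```python
import random

def undersample(samples, labels, seed=0):
    """Randomly discard majority-class examples until every class is equally sized."""
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    smallest = min(len(group) for group in by_label.values())
    balanced = [(sample, label)
                for label, group in by_label.items()
                for sample in rng.sample(group, smallest)]
    rng.shuffle(balanced)
    return balanced

# Invented skewed dataset: 8 "irrelevant" documents versus 2 "relevant" ones.
train = undersample(list(range(10)), ["irrelevant"] * 8 + ["relevant"] * 2)
```

Balancing this way trades training data for a classifier that no longer learns to favour the majority class, which is one reason it can lift precision on the minority class.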
|
139 |
Multilabel text classification of public procurements using deep learning intent detection / Textklassificering av offentliga upphandlingar med djupa artificiella neuronnät och avsiktsdetektering. Suta, Adin. January 2019 (has links)
Textual data is one of the most widespread forms of data, and the amount of such data available in the world increases at a rapid rate. Text can be understood as either a sequence of characters or a sequence of words, where the latter approach is the most common. With the breakthroughs within applied artificial intelligence in recent years, more and more tasks are aided by automatic processing of text in various applications. The models introduced in the following sections rely on deep-learning sequence processing in order to process text and, by means of regression, classify what the text input refers to. We investigate and compare the performance of several model architectures along with different hyperparameters. The data set was provided by e-Avrop, a Swedish company which hosts a web platform for posting and bidding on public procurements. It consists of titles and descriptions of Swedish public procurements posted on the website of e-Avrop, along with the respective category or categories of each text. When the texts are described by several categories (the multi-label case), we suggest a deep-learning sequence-processing regression algorithm in which a set of deep-learning classifiers is used. Each model uses one of the several labels, along with the text input, to produce a set of text-label observation pairs. The goal is to investigate whether these classifiers can express different levels of intent, an intent which should theoretically be imposed by the different training data sets used by each of the individual deep-learning classifiers. / Data i form av text är en av de mest utbredda formerna av data och mängden tillgänglig textdata runt om i världen ökar i snabb takt. Text kan tolkas som en följd av bokstäver eller ord, där tolkning av text i form av ordföljder är absolut vanligast. Genombrott inom artificiell intelligens under de senaste åren har medfört att fler och fler arbetsuppgifter med koppling till text assisteras av automatisk textbearbetning. Modellerna som introduceras i denna uppsats är baserade på djupa artificiella neuronnät med sekventiell bearbetning av textdata, som med hjälp av regression förutspår tillhörande ämnesområde för den inmatade texten. Flera modeller och tillhörande hyperparametrar utreds och jämförs enligt prestanda. Datamängden som använts är tillhandahållen av e-Avrop, ett svenskt företag som erbjuder en webbtjänst för offentliggörande och budgivning av offentliga upphandlingar. Datamängden består av titlar, beskrivningar samt tillhörande ämneskategorier för offentliga upphandlingar inom Sverige, tagna från e-Avrops webbtjänst. När texterna är märkta med ett flertal kategorier föreslås en algoritm baserad på ett djupt artificiellt neuronnät med sekventiell bearbetning, där en mängd klassificeringsmodeller används. Varje sådan modell använder en av de märkta kategorierna tillsammans med den tillhörande texten, vilket skapar en mängd av text-kategori-par. Målet är att utreda huruvida dessa klassificerare kan uppvisa olika former av avsikt som teoretiskt sett borde vara medförd från de olika datamängder modellerna mottagit.
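The per-label classifier scheme described above amounts to turning one multi-label corpus into one binary dataset per label (one-vs-rest). A minimal sketch of that transformation follows; the procurement texts and label names are invented for the example:

```python
def one_vs_rest_datasets(texts, label_sets):
    """Turn a multi-label corpus into one binary (text, 0/1) dataset per label."""
    all_labels = sorted({label for labels in label_sets for label in labels})
    return {label: [(text, int(label in labels))
                    for text, labels in zip(texts, label_sets)]
            for label in all_labels}

# Invented procurement titles, each tagged with one or more categories.
texts = ["snöröjning av vägar", "it-drift och support", "vägunderhåll och it-system"]
label_sets = [{"roads"}, {"it"}, {"roads", "it"}]
datasets = one_vs_rest_datasets(texts, label_sets)
```

Each resulting dataset trains one classifier that only decides membership in its own label, which is the setup in which the per-classifier "intent" discussed above could arise.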
|
140 |
ML enhanced interpretation of failed test result. Pechetti, Hiranmayi. January 2023 (has links)
This master thesis addresses the problem of classifying test failures in Ericsson AB’s BAIT test framework, specifically distinguishing between environment faults and product faults. The project aims to automate the initial defect classification process, reducing manual work and facilitating faster debugging. The significance of this problem lies in the potential time and cost savings it offers to Ericsson and other companies utilizing similar test frameworks. By automating the classification of test failures, developers can quickly identify the root cause of an issue and take appropriate action, leading to improved efficiency and productivity. To solve this problem, the thesis employs machine learning techniques. A dataset of test logs is utilized to evaluate the performance of six classification models: logistic regression, support vector machines, k-nearest neighbors, naive Bayes, decision trees, and XGBoost. Precision and macro F1 scores are used as evaluation metrics to assess the models’ performance. The results demonstrate that all models perform well in classifying test failures, achieving high precision values and macro F1 scores. The decision tree and XGBoost models exhibit perfect precision scores for product faults, while the naive Bayes model achieves the highest macro F1 score. These findings highlight the effectiveness of machine learning in accurately distinguishing between environment faults and product faults within the Bait framework. Developers and organizations can benefit from the automated defect classification system, reducing manual effort and expediting the debugging process. The successful application of machine learning in this context opens up opportunities for further research and development in automated defect classification algorithms. / Detta examensarbete tar upp problemet med att klassificera testfel i Ericsson AB:s BAIT-testramverk, där man specifikt skiljer mellan miljöfel och produktfel. 
Projektet syftar till att automatisera den initiala defektklassificeringsprocessen, vilket minskar manuellt arbete och underlättar snabbare felsökning. Betydelsen av detta problem ligger i de potentiella tids- och kostnadsbesparingar som det erbjuder Ericsson och andra företag som använder liknande testramverk. Genom att automatisera klassificeringen av testfel kan utvecklare snabbt identifiera grundorsaken till ett problem och vidta lämpliga åtgärder, vilket leder till förbättrad effektivitet och produktivitet. För att lösa detta problem använder avhandlingen maskininlärningstekniker. En datauppsättning av testloggar används för att utvärdera prestandan för sex klassificeringsmodeller: logistisk regression, stödvektormaskiner, k-närmaste grannar, naiva Bayes, beslutsträd och XGBoost. Precision och makro-F1-poäng används som utvärderingsmått för att bedöma modellernas prestanda. Resultaten visar att alla modeller presterar bra vid klassificering av testfel och uppnår höga precisionsvärden och makro-F1-poäng. Beslutsträds- och XGBoost-modellerna uppvisar perfekta precisionspoäng för produktfel, medan den naiva Bayes-modellen uppnår den högsta makro-F1-poängen. Dessa resultat belyser effektiviteten hos maskininlärning när det gäller att exakt särskilja mellan miljöfel och produktfel inom BAIT-ramverket. Utvecklare och organisationer kan dra nytta av det automatiska defektklassificeringssystemet, vilket minskar manuellt arbete och påskyndar felsökningsprocessen. Den framgångsrika tillämpningen av maskininlärning i detta sammanhang öppnar möjligheter för vidare forskning och utveckling inom automatiserade defektklassificeringsalgoritmer.
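The evaluation metrics named above, per-class precision and macro F1, can be computed directly from the label vectors. The sketch below uses invented environment/product-fault predictions, not the thesis's data:

```python
def precision_recall_f1(y_true, y_pred, label):
    """Per-class precision, recall and F1 for one label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == label and t == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == label and t != label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != label and t == label)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over the labels seen in y_true."""
    labels = sorted(set(y_true))
    return sum(precision_recall_f1(y_true, y_pred, l)[2] for l in labels) / len(labels)

# Invented fault labels: "env" = environment fault, "prod" = product fault.
y_true = ["env", "prod", "env", "prod"]
y_pred = ["env", "prod", "prod", "prod"]
```

Macro averaging weights each class equally regardless of its frequency, which matters here because environment and product faults need not be balanced in real test logs.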
|