Spelling suggestions: "subject:"naive bayes""
1 |
Simultaneous prediction of symptom severity and cause in data from a test battery for Parkinson patients, using machine learning methodsKhan, Imran Qayyum January 2009 (has links)
The main purpose of this thesis project is to prediction of symptom severity and cause in data from test battery of the Parkinson’s disease patient, which is based on data mining. The collection of the data is from test battery on a hand in computer. We use the Chi-Square method and check which variables are important and which are not important. Then we apply different data mining techniques on our normalize data and check which technique or method gives good results.The implementation of this thesis is in WEKA. We normalize our data and then apply different methods on this data. The methods which we used are Naïve Bayes, CART and KNN. We draw the Bland Altman and Spearman’s Correlation for checking the final results and prediction of data. The Bland Altman tells how the percentage of our confident level in this data is correct and Spearman’s Correlation tells us our relationship is strong. On the basis of results and analysis we see all three methods give nearly same results. But if we see our CART (J48 Decision Tree) it gives good result of under predicted and over predicted values that’s lies between -2 to +2. The correlation between the Actual and Predicted values is 0,794in CART. Cause gives the better percentage classification result then disability because it can use two classes.
|
2 |
Automatic Document Classification Applied to Swedish NewsBlein, Florent January 2005 (has links)
<p>The first part of this paper presents briefly the ELIN[1] system, an electronic newspaper project. ELIN is a framework that stores news and displays them to the end-user. Such news are formatted using the xml[2] format. The project partner Corren[3] provided ELIN with xml articles, however the format used was not the same. My first task has been to develop a software that converts the news from one xml format (Corren) to another (ELIN).</p><p>The second and main part addresses the problem of automatic document classification and tries to find a solution for a specific issue. The goal is to automatically classify news articles from a Swedish newspaper company (Corren) into the IPTC[4] news categories.</p><p>This work has been carried out by implementing several classification algorithms, testing them and comparing their accuracy with existing software. The training and test documents were 3 weeks of the Corren newspaper that had to be classified into 2 categories.</p><p>The last tests were run with only one algorithm (Naïve Bayes) over a larger amount of data (7, then 10 weeks) and categories (12) to simulate a more real environment.</p><p>The results show that the Naïve Bayes algorithm, although the oldest, was the most accurate in this particular case. An issue raised by the results is that feature selection improves speed but can seldom reduce accuracy by removing too many features.</p>
|
3 |
Automatic Document Classification Applied to Swedish NewsBlein, Florent January 2005 (has links)
The first part of this paper presents briefly the ELIN[1] system, an electronic newspaper project. ELIN is a framework that stores news and displays them to the end-user. Such news are formatted using the xml[2] format. The project partner Corren[3] provided ELIN with xml articles, however the format used was not the same. My first task has been to develop a software that converts the news from one xml format (Corren) to another (ELIN). The second and main part addresses the problem of automatic document classification and tries to find a solution for a specific issue. The goal is to automatically classify news articles from a Swedish newspaper company (Corren) into the IPTC[4] news categories. This work has been carried out by implementing several classification algorithms, testing them and comparing their accuracy with existing software. The training and test documents were 3 weeks of the Corren newspaper that had to be classified into 2 categories. The last tests were run with only one algorithm (Naïve Bayes) over a larger amount of data (7, then 10 weeks) and categories (12) to simulate a more real environment. The results show that the Naïve Bayes algorithm, although the oldest, was the most accurate in this particular case. An issue raised by the results is that feature selection improves speed but can seldom reduce accuracy by removing too many features.
|
4 |
Predictive models for chronic renal disease using decision trees, naïve bayes and case-based methodsKhan, Saqib Hussain January 2010 (has links)
Data mining can be used in healthcare industry to “mine” clinical data to discover hidden information for intelligent and affective decision making. Discovery of hidden patterns and relationships often goes intact, yet advanced data mining techniques can be helpful as remedy to this scenario. This thesis mainly deals with Intelligent Prediction of Chronic Renal Disease (IPCRD). Data covers blood, urine test, and external symptoms applied to predict chronic renal disease. Data from the database is initially transformed to Weka (3.6) and Chi-Square method is used for features section. After normalizing data, three classifiers were applied and efficiency of output is evaluated. Mainly, three classifiers are analyzed: Decision Tree, Naïve Bayes, K-Nearest Neighbour algorithm. Results show that each technique has its unique strength in realizing the objectives of the defined mining goals. Efficiency of Decision Tree and KNN was almost same but Naïve Bayes proved a comparative edge over others. Further sensitivity and specificity tests are used as statistical measures to examine the performance of a binary classification. Sensitivity (also called recall rate in some fields) measures the proportion of actual positives which are correctly identified while Specificity measures the proportion of negatives which are correctly identified. CRISP-DM methodology is applied to build the mining models. It consists of six major phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
|
5 |
Automated invoice handling with machine learning and OCR / Automatiserad fakturahantering med maskininlärning och OCRLarsson, Andreas, Segerås, Tony January 2016 (has links)
Companies often process invoices manually, therefore automation could reduce manual labor. The aim of this thesis is to evaluate which OCR-engine, Tesseract or OCRopus, performs best at interpreting invoices. This thesis also evaluates if it is possible to use machine learning to automatically process invoices based on previously stored data. By interpreting invoices with the OCR-engines, it results in the output text having few spelling errors. However, the invoice structure is lost, making it impossible to interpret the corresponding fields. If Naïve Bayes is chosen as the algorithm for machine learning, the prototype can correctly classify recurring invoice lines after a set of data has been processed. The conclusion is, neither of the two OCR-engines can interpret the invoices to plain text making it understandable. Machine learning with Naïve Bayes works on invoices if there is enough previously processed data. The findings in this thesis concludes that machine learning and OCR can be utilized to automatize manual labor. / Företag behandlar oftast fakturor manuellt och en automatisering skulle kunna minska fysiskt arbete. Målet med examensarbetet var att undersöka vilken av OCR-läsarna, Tesseract och OCRopus som fungerar bäst på att tolka en inskannad faktura. Även undersöka om det är möjligt med maskininlärning att automatiskt behandla fakturor utifrån tidigare sparad data. Genom att tolka text med hjälp av OCR-läsarna visade resultaten att den producerade texten blev språkligt korrekt, men att strukturen i fakturan inte behölls vilket gjorde det svårt att tolka vilka fält som hör ihop. Naïve Bayes valdes som algoritm till maskininlärningen och resultatet blev en prototyp som korrekt kunde klassificera återkommande fakturarader, efter att en mängd träningsdata var behandlad. Slutsatsen är att ingen av OCR-läsarna kunde tolka fakturor så att resultatet kunde användas vidare, och att maskininlärning med Naïve Bayes fungerar på fakturor om tillräckligt med tidigare behandlad data finns. Utfallet av examensarbetet är att maskininlärning och OCR kan användas för att automatisera fysiskt arbete.
|
6 |
Uma comparação da aplicação de métodos computacionais de classificação de dados aplicados ao consumo de cinema no Brasil / A comparison of the application of data classification computational methods to the consumption of film at theaters in BrazilNieuwenhoff, Nathalia 13 April 2017 (has links)
As técnicas computacionais de aprendizagem de máquina para classificação ou categorização de dados estão sendo cada vez mais utilizadas no contexto de extração de informações ou padrões em bases de dados volumosas em variadas áreas de aplicação. Em paralelo, a aplicação destes métodos computacionais para identificação de padrões, bem como a classificação de dados relacionados ao consumo dos bens de informação é considerada uma tarefa complexa, visto que tais padrões de decisão do consumo estão relacionados com as preferências dos indivíduos e dependem de uma composição de características individuais, variáveis culturais, econômicas e sociais segregadas e agrupadas, além de ser um tópico pouco explorado no mercado brasileiro. Neste contexto, este trabalho realizou o estudo experimental a partir da aplicação do processo de Descoberta do conhecimento (KDD), o que inclui as etapas de seleção e Mineração de Dados, para um problema de classificação binária, indivíduos brasileiros que consomem e não consomem um bem de informação, filmes em salas de cinema, a partir dos dados obtidos na Pesquisa de Orçamento Familiar (POF) 2008-2009, pelo Instituto Brasileiro de Geografia e Estatística (IBGE). O estudo experimental resultou em uma análise comparativa da aplicação de duas técnicas de aprendizagem de máquina para classificação de dados, baseadas em aprendizado supervisionado, sendo estas Naïve Bayes (NB) e Support Vector Machine (SVM). Inicialmente, a revisão sistemática realizada com o objetivo de identificar estudos relacionados a aplicação de técnicas computacionais de aprendizado de máquina para classificação e identificação de padrões de consumo indica que a utilização destas técnicas neste contexto não é um tópico de pesquisa maduro e desenvolvido, visto que não foi abordado em nenhum dos trabalhos estudados. Os resultados obtidos a partir da análise comparativa realizada entre os algoritmos sugerem que a escolha dos algoritmos de aprendizagem de máquina para Classificação de Dados está diretamente relacionada a fatores como: (i) importância das classes para o problema a ser estudado; (ii) balanceamento entre as classes; (iii) universo de atributos a serem considerados em relação a quantidade e grau de importância destes para o classificador. Adicionalmente, os atributos selecionados pelo algoritmo de seleção de variáveis Information Gain sugerem que a decisão de consumo de cultura, mais especificamente do bem de informação, filmes em cinema, está fortemente relacionada a aspectos dos indivíduos relacionados a renda, nível de educação, bem como suas preferências por bens culturais / Machine learning techniques for data classification or categorization are increasingly being used for extracting information or patterns from volumous databases in various application areas. Simultaneously, the application of these computational methods to identify patterns, as well as data classification related to the consumption of information goods is considered a complex task, since such decision consumption paterns are related to the preferences of individuals and depend on a composition of individual characteristics, cultural, economic and social variables segregated and grouped, as well as being not a topic explored in the Brazilian market. In this context, this study performed an experimental study of application of the Knowledge Discovery (KDD) process, which includes data selection and data mining steps, for a binary classification problem, Brazilian individuals who consume and do not consume a information good, film at theaters in Brazil, from the microdata obtained from the Brazilian Household Budget Survey (POF), 2008-2009, performed by the Brazilian Institute of Geography and Statistics (IBGE). The experimental study resulted in a comparative analysis of the application of two machine-learning techniques for data classification, based on supervised learning, such as Naïve Bayes (NB) and Support Vector Machine (SVM). Initially, a systematic review with the objective of identifying studies related to the application of computational techniques of machine learning to classification and identification of consumption patterns indicates that the use of these techniques in this context is not a mature and developed research topic, since was not studied in any of the papers analyzed. The results obtained from the comparative analysis performed between the algorithms suggest that the choice of the machine learning algorithms for data classification is directly related to factors such as: (i) importance of the classes for the problem to be studied; (ii) balancing between classes; (iii) universe of attributes to be considered in relation to the quantity and degree of importance of these to the classifiers. In addition, the attributes selected by the Information Gain variable selection algorithm suggest that the decision to consume culture, more specifically information good, film at theaters, is directly related to aspects of individuals regarding income, educational level, as well as preferences for cultural goods
|
7 |
Método de entrada de texto baseada em gestos para dispositivos com telas sensíveis ao toque / Text input method based gestures for devices with touch screensNascimento, Thamer Horbylon 13 October 2015 (has links)
Submitted by Cláudia Bueno (claudiamoura18@gmail.com) on 2016-03-03T20:22:51Z
No. of bitstreams: 2
Dissertação - Thamer Horbylon Nascimento - 2015.pdf: 3404276 bytes, checksum: 8b96b87a1ad94259e867e586b0cdde9b (MD5)
license_rdf: 23148 bytes, checksum: 9da0b6dfac957114c6a7714714b86306 (MD5) / Approved for entry into archive by Luciana Ferreira (lucgeral@gmail.com) on 2016-03-04T11:20:16Z (GMT) No. of bitstreams: 2
Dissertação - Thamer Horbylon Nascimento - 2015.pdf: 3404276 bytes, checksum: 8b96b87a1ad94259e867e586b0cdde9b (MD5)
license_rdf: 23148 bytes, checksum: 9da0b6dfac957114c6a7714714b86306 (MD5) / Made available in DSpace on 2016-03-04T11:20:16Z (GMT). No. of bitstreams: 2
Dissertação - Thamer Horbylon Nascimento - 2015.pdf: 3404276 bytes, checksum: 8b96b87a1ad94259e867e586b0cdde9b (MD5)
license_rdf: 23148 bytes, checksum: 9da0b6dfac957114c6a7714714b86306 (MD5)
Previous issue date: 2015-10-13 / Fundação de Amparo à Pesquisa do Estado de Goiás - FAPEG / This paper proposes a method for gesture-based text input to be used in devices with touch
screens. The steps required for this are: recognizing gestures, identify actions needed
to enter letters and recognize letters with up to two interactions. To the recognition of
gestures, we used the incremental recognition algorithm, which works so that it is not
necessary to finish a gesture for this to be recognized, that is, works with continuous
gesture recognition. Furthermore, a template is used as a reference for the recognition
of gestures. The algorithm identifies the likelihood of the gesture being a running the
template. As the incremental recognition algorithm did not have curves in templates using
the reduced equation of the circle, was created a template with curves and straight lines
were also added in it. The template created served to create a database for entering the
letters of the alphabet from A to Z, using user gestures. This base was used for training
of a Naïve Bayes classifier, which identifies the probability of inserting a letter entered
by the user based on the gestures. Three experiments were conducted to test the method
developed. It was found that most users entered a letter using up to two interactions when
inserted the five vowels and the five most frequent consonants. When inserting the five
vowels and five consonants less frequent, users were also able to enter the letters with up
to two interactions. Thus, there is evidence that the method to solve the problem which
has been proposed as a solution. / Este trabalho propõe um método para entrada de texto baseado em gestos para ser
usado em dispositivos com telas sensíveis ao toque. Os passos necessários para isso são:
reconhecer gestos, identificar gestos necessários para inserir letras e reconhecer letras com
até duas interações. Para fazer o reconhecimento dos gestos, foi utilizado o algoritmo de
reconhecimento incremental, o qual trabalha de forma que não seja necessário terminar
um gesto para que este seja reconhecido, ou seja, trabalha com reconhecimento contínuo
de gestos. Além disso, um template é utilizado como referência para o reconhecimento
dos gestos, assim, o algoritmo identifica a probabilidade de o gesto em execução ser um do
template. Como o algoritmo de reconhecimento incremental não possuía curvas em seus
templates, utilizando a equação reduzida da circunferência, foi criado um template com
curvas e foram também adicionadas retas nele. O template criado serviu para a criação
de uma base de dados para a inserção das letras do alfabeto de A a Z, utilizando gestos
de usuários. Essa base foi utilizada para treinamento de um classificador Naïve Bayes,
que identifica a probabilidade de inserção de uma letra baseado nos gestos inseridos pelo
usuário. Foram realizados três experimentos para testar o método desenvolvido. Verificouse
que a maioria dos usuários inseriu uma letra utilizando até duas interações, quando
inseriram as cinco vogais e as cinco consoantes mais frequentes. Ao inserir as cinco vogais
e cinco consoantes menos frequentes, os usuários também conseguiram inserir as letras
com até duas interações. Assim, há evidências de que o método resolva o problema para
qual foi proposto como solução.
|
8 |
Uma comparação da aplicação de métodos computacionais de classificação de dados aplicados ao consumo de cinema no Brasil / A comparison of the application of data classification computational methods to the consumption of film at theaters in BrazilNathalia Nieuwenhoff 13 April 2017 (has links)
As técnicas computacionais de aprendizagem de máquina para classificação ou categorização de dados estão sendo cada vez mais utilizadas no contexto de extração de informações ou padrões em bases de dados volumosas em variadas áreas de aplicação. Em paralelo, a aplicação destes métodos computacionais para identificação de padrões, bem como a classificação de dados relacionados ao consumo dos bens de informação é considerada uma tarefa complexa, visto que tais padrões de decisão do consumo estão relacionados com as preferências dos indivíduos e dependem de uma composição de características individuais, variáveis culturais, econômicas e sociais segregadas e agrupadas, além de ser um tópico pouco explorado no mercado brasileiro. Neste contexto, este trabalho realizou o estudo experimental a partir da aplicação do processo de Descoberta do conhecimento (KDD), o que inclui as etapas de seleção e Mineração de Dados, para um problema de classificação binária, indivíduos brasileiros que consomem e não consomem um bem de informação, filmes em salas de cinema, a partir dos dados obtidos na Pesquisa de Orçamento Familiar (POF) 2008-2009, pelo Instituto Brasileiro de Geografia e Estatística (IBGE). O estudo experimental resultou em uma análise comparativa da aplicação de duas técnicas de aprendizagem de máquina para classificação de dados, baseadas em aprendizado supervisionado, sendo estas Naïve Bayes (NB) e Support Vector Machine (SVM). Inicialmente, a revisão sistemática realizada com o objetivo de identificar estudos relacionados a aplicação de técnicas computacionais de aprendizado de máquina para classificação e identificação de padrões de consumo indica que a utilização destas técnicas neste contexto não é um tópico de pesquisa maduro e desenvolvido, visto que não foi abordado em nenhum dos trabalhos estudados. Os resultados obtidos a partir da análise comparativa realizada entre os algoritmos sugerem que a escolha dos algoritmos de aprendizagem de máquina para Classificação de Dados está diretamente relacionada a fatores como: (i) importância das classes para o problema a ser estudado; (ii) balanceamento entre as classes; (iii) universo de atributos a serem considerados em relação a quantidade e grau de importância destes para o classificador. Adicionalmente, os atributos selecionados pelo algoritmo de seleção de variáveis Information Gain sugerem que a decisão de consumo de cultura, mais especificamente do bem de informação, filmes em cinema, está fortemente relacionada a aspectos dos indivíduos relacionados a renda, nível de educação, bem como suas preferências por bens culturais / Machine learning techniques for data classification or categorization are increasingly being used for extracting information or patterns from volumous databases in various application areas. Simultaneously, the application of these computational methods to identify patterns, as well as data classification related to the consumption of information goods is considered a complex task, since such decision consumption paterns are related to the preferences of individuals and depend on a composition of individual characteristics, cultural, economic and social variables segregated and grouped, as well as being not a topic explored in the Brazilian market. In this context, this study performed an experimental study of application of the Knowledge Discovery (KDD) process, which includes data selection and data mining steps, for a binary classification problem, Brazilian individuals who consume and do not consume a information good, film at theaters in Brazil, from the microdata obtained from the Brazilian Household Budget Survey (POF), 2008-2009, performed by the Brazilian Institute of Geography and Statistics (IBGE). The experimental study resulted in a comparative analysis of the application of two machine-learning techniques for data classification, based on supervised learning, such as Naïve Bayes (NB) and Support Vector Machine (SVM). Initially, a systematic review with the objective of identifying studies related to the application of computational techniques of machine learning to classification and identification of consumption patterns indicates that the use of these techniques in this context is not a mature and developed research topic, since was not studied in any of the papers analyzed. The results obtained from the comparative analysis performed between the algorithms suggest that the choice of the machine learning algorithms for data classification is directly related to factors such as: (i) importance of the classes for the problem to be studied; (ii) balancing between classes; (iii) universe of attributes to be considered in relation to the quantity and degree of importance of these to the classifiers. In addition, the attributes selected by the Information Gain variable selection algorithm suggest that the decision to consume culture, more specifically information good, film at theaters, is directly related to aspects of individuals regarding income, educational level, as well as preferences for cultural goods
|
9 |
Application of Machine Learning Techniques for Real-time Classification of Sensor Array DataLi, Sichu 15 May 2009 (has links)
There is a significant need to identify approaches for classifying chemical sensor array data with high success rates that would enhance sensor detection capabilities. The present study attempts to fill this need by investigating six machine learning methods to classify a dataset collected using a chemical sensor array: K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Classification and Regression Trees (CART), Random Forest (RF), Naïve Bayes Classifier (NB), and Principal Component Regression (PCR). A total of 10 predictors that are associated with the response from 10 sensor channels are used to train and test the classifiers. A training dataset of 4 classes containing 136 samples is used to build the classifiers, and a dataset of 4 classes with 56 samples is used for testing. The results generated with the six different methods are compared and discussed. The RF, CART, and KNN are found to have success rates greater than 90%, and to outperform the other methods.
|
10 |
Using Machine Learning to Categorize Documents in a Construction ProjectBjörkendal, Nicklas January 2019 (has links)
Automation of document handling in the construction industries could save large amounts of time, effort and money and classifying a document is an important step in that automation. In the field of machine learning, lots of research have been done on perfecting the algorithms and techniques, but there are many areas where those techniques could be used that has not yet been studied. In this study I looked at how effectively the machine learning algorithm multinomial Naïve-Bayes would be able to classify 1427 documents split up into 19 different categories from a construction project. The experiment achieved an accuracy of 92.7% and the paper discusses some of the ways that accuracy can be improved. However, data extraction proved to be a bottleneck and only 66% of the original documents could be used for testing the classifier.
|
Page generated in 0.029 seconds