Patents have a great research value and it is also beneficial to the community of industrial, commercial, legal and policymaking. Effective analysis of patent literature can reveal important technical details and relationships, and it can also explain business trends, propose novel industrial solutions, and make crucial investment decisions. Therefore, we should carefully analyze patent documents and use the value of patents. Generally, patent analysts need to have a certain degree of expertise in various research fields, including information retrieval, data processing, text mining, field-specific technology, and business intelligence. In real life, it is difficult to find and nurture such an analyst in a relatively short period of time, enabling him or her to meet the requirement of multiple disciplines. Patent classification is also crucial in processing patent applications because it will empower people with the ability to manage and maintain patent texts better and more flexible. In recent years, the number of patents worldwide has increased dramatically, which makes it very important to design an automatic patent classification system. This system can replace the time-consuming manual classification, thus providing patent analysis managers with an effective method of managing patent texts. This paper designs a patent classification system based on data mining methods and machine learning techniques and use KNIME software to conduct a comparative analysis. This paper will research by using different machine learning methods and different parts of a patent. The purpose of this thesis is to use text data processing methods and machine learning techniques to classify patents automatically. It mainly includes two parts, the first is data preprocessing and the second is the application of machine learning techniques. The research questions include: Which part of a patent as input data performs best in relation to automatic classification? And which of the implemented machine learning algorithms performs best regarding the classification of IPC keywords? This thesis will use design science research as a method to research and analyze this topic. It will use the KNIME platform to apply the machine learning techniques, which include decision tree, XGBoost linear, XGBoost tree, SVM, and random forest. The implementation part includes collection data, preprocessing data, feature word extraction, and applying classification techniques. The patent document consists of many parts such as description, abstract, and claims. In this thesis, we will feed separately these three group input data to our models. Then, we will compare the performance of those three different parts. Based on the results obtained from these three experiments and making the comparison, we suggest using the description part data in the classification system because it shows the best performance in English patent text classification. The abstract can be as the auxiliary standard for classification. However, the classification based on the claims part proposed by some scholars has not achieved good performance in our research. Besides, the BoW and TFIDF methods can be used together to extract efficiently the features words in our research. In addition, we found that the SVM and XGBoost techniques have better performance in the automatic patent classification system in our research.
Identifer | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:hj-49594 |
Date | January 2020 |
Creators | Yehe, Nala |
Source Sets | DiVA Archive at Upsalla University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |
Page generated in 0.0022 seconds