21 |
Sledovač aktuálního dění / Actual Events Tracker
Odstrčilík, Martin. January 2013
The goal of this master's thesis was to develop an application for tracking current events in the users' surroundings. The application should allow users to view events, create new events, and comment on existing ones. Beyond the implementation, the thesis analyses the problem, including a comparison with existing solutions and a survey of technologies and frameworks suitable for the implementation. It also describes the theory behind the data classification that is used internally to analyse events and comments. The design of the application is covered as well, including the user interface, software architecture, database, communication protocol, and data classifiers. The implementation, the main part of the project, is described afterwards. The thesis closes with a summary of the whole process and some ideas for enhancing the application in the future.
|
22 |
Metody klasifikace webových stránek / Methods of Web Page Classification
Nachtnebl, Viktor. January 2012
This work deals with methods of web page classification. It explains the concept of classification and the features of web pages used to classify them. It then analyses how a page is represented and describes in detail a classification method that works with a hierarchical category model and can create new categories dynamically. The second half presents an implementation of the chosen method and describes the results.
|
23 |
Odvození slovníku pro nástroj Process Inspector na platformě SharePoint / Derivation of Dictionary for Process Inspector Tool on SharePoint Platform
Pavlín, Václav. January 2012
This master's thesis presents methods for mining important pieces of information from text. It analyses the problem of term extraction from a large document collection and describes an implementation in C# with Microsoft SQL Server. The system uses stemming and a number of statistical methods for term extraction. The thesis also compares the methods used and proposes a process for deriving the dictionary.
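The abstract does not name the statistical methods it uses. TF-IDF is one widely used weighting for ranking candidate terms, and the following is only a minimal sketch of it, not the thesis's implementation. The function name `tf_idf`, the unsmoothed formula (relative term frequency times the natural log of inverse document frequency), and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed formula (one of several common variants):
    weight = (count / document length) * ln(N / document frequency)
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already tokenized (and stemmed) documents.
docs = [
    ["process", "inspector", "process"],
    ["sharepoint", "process"],
    ["sharepoint", "dictionary"],
]
scores = tf_idf(docs)
```

In this toy collection, "inspector" outranks "process" in the first document even though "process" occurs more often there, because "process" also appears in another document. A term that occurs in every document gets weight zero under this variant.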
|
24 |
Finding Relevant PDF Medical Journal Articles by the Content of Their Figures as well as Their Text
Christiansen, Ammon J. 17 April 2007
This work addresses the need for an alternative to keyword-based search for sifting through large collections of PDF medical journal articles for literature review purposes. Despite users' best efforts to form precise and accurate queries, it is often difficult to guess the right keywords to find all the related articles while finding a minimum number of unrelated ones. Failing to find relevant, related research during a literature review wastes research time and effort and risks missing significant work in the area, which could affect the quality of the research being conducted. The purpose of this work is to explore the benefits of a retrieval system for professional journal articles in PDF format that supports hybrid queries composed of both text and images. PDF medical journal articles contain formatting and layout information that implies the structure and organization of the document. They also contain figures and tables rich with content and meaning. Stripping a PDF into “full-text” for indexing purposes disregards these important features. Specifically, this work investigated the following: (1) what effect the incorporation of a document's embedded figures into the query (in addition to its text) has on retrieval performance (precision) compared to plain keyword-based search; (2) how current text-based document-query similarity methods can be enhanced by using formatting and font-size information as a structure and organization model for a PDF document; (3) whether to use the standard Euclidean distance function or the matrix distance function for content-based image retrieval; (4) how to convert a PDF into a structured, formatted, reflowable XML representation given a pure-layout PDF document; (5) what document views (such as a term frequency cloud, a document outline, or a document's figures) would help users wade through search results to quickly select those that are worth a closer look.
While the results of the experiments were unexpectedly worse than their baselines of comparison (see the conclusion for a summary), the experimental methods are very valuable in showing others what directions have already been pursued and why they did not work and what remaining problems need to be solved in order to achieve the goal of improving literature review through use of a hybrid text and image retrieval system.
|
25 |
Sentiment Analysis Of IMDB Movie Reviews: A comparative study of Lexicon based approach and BERT Neural Network model
Domadula, Prashuna Sai Surya Vishwitha; Sayyaparaju, Sai Sumanwita. January 2023
Background: Movies have become an important marketing and advertising tool that can influence consumer behaviour and trends. Reading film reviews is an important part of watching a movie, as it can help viewers gain a general understanding of the film and give filmmakers feedback on how their work is being received. Sentiment analysis is a method of determining whether a review has positive or negative sentiment, and this study investigates a machine learning method for classifying sentiment from film reviews.
Objectives: This thesis aims to perform comparative sentiment analysis on textual IMDb movie reviews using lexicon-based and BERT neural network models. Different performance evaluation metrics are then used to identify the most effective learning model.
Methods: This thesis employs a quantitative research technique, with data analysed using traditional machine learning. The labelled data set comes from Kaggle (https://www.kaggle.com/datasets) and contains movie review information. The lexicon-based approach and the BERT neural network are trained on the chosen IMDb movie reviews data set. To discover which model predicts sentiment best, the constructed models are assessed on the test set using evaluation metrics such as accuracy, precision, recall and F1 score.
Results: In the experiments, the BERT neural network model is the most efficient algorithm for classifying the IMDb movie reviews into positive and negative sentiment. It achieved the highest accuracy, 90.67% over the trained data set, followed by the BoW model with 79.15% and the TF-IDF model with 78.98%. The BERT model also has the best precision and recall, 0.88 and 0.92 respectively. The BoW model has a precision and recall of 0.79, and the TF-IDF model has a precision of 0.79 and a recall of 0.78. The BERT model has the highest F1 score, 0.88, followed by the BoW model with 0.79 and the TF-IDF model with 0.78.
Conclusions: Among the two models evaluated, the lexicon-based approach and the BERT transformer neural network, the BERT neural network is the most efficient, performing well on the measured performance criteria.
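The four evaluation metrics named in the abstract can be computed from true and predicted labels as in the sketch below. It is a generic illustration for binary sentiment labels, not the thesis's evaluation code. The function name `binary_metrics`, the choice of 1 as the positive class, and the toy label lists are assumptions, and the thesis may have averaged its scores differently.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

With these toy labels there are three true positives, one false positive and one false negative, so all four metrics come out to 0.75.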
|
26 |
Vyhledávání informací TRECVid Search / TRECVid Search Information Retrieval
Čeloud, David. January 2010
The master's thesis deals with information retrieval. It summarizes the theory of the field and gives an overview of the models used in information retrieval, the data, and current issues and their possible solutions. The practical part focuses on implementing information retrieval methods for textual data. The last part is dedicated to experiments that validate the implementation and explore possible improvements.
|
27 |
關鍵查核事項與會計師事務所特性 / The Relationship between Key Audit Matters and Audit Firm Characteristics
陳品芊. Unknown Date
The objective of this thesis is to investigate the relationship between key audit matters (KAMs) and audit firm characteristics. In this study, audit firm characteristics refer to independence and professional competence, which are measured by tenure and industry expertise, respectively.
The empirical results can be summarized as follows. Firstly, lead partners' tenure has little effect on KAMs. Secondly, only some of the tests show a positive association between firm-level industry specialist auditors and the quantity and quality of KAMs. Thirdly, partner-level industry specialist auditors have positive effects on both the quantity and quality of KAMs.
In further examinations, the results are as follows. Firstly, tenure has positive effects on KAMs when the auditors are partner-level industry specialists. Secondly, industry experts at both firm and partner levels have stronger positive effects on KAMs than industry experts at firm level alone. Lastly, auditors' professional competence allows them to present KAMs more concisely.
|
28 |
Multi-label klasifikace textových dokumentů / Multi-Label Classification of Text Documents
Průša, Petr. January 2012
The master's thesis deals with automatic classification of text documents. It explains the basic terms and problems of text mining, introduces clustering, and presents some basic clustering algorithms. It also presents several classification methods and examines matrix regression in detail. An application that uses matrix regression for classification was designed and developed. The experiments focused on normalization and thresholding.
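Thresholding in multi-label classification means turning per-category scores into a set of assigned labels. The sketch below shows one simple scheme, a single global cut-off with a fallback to the best-scoring category. It is an illustration only: the abstract does not say which thresholding strategies the thesis tested, and the function name `assign_labels`, the default cut-off of 0.5 and the example scores are assumptions.

```python
def assign_labels(scores, threshold=0.5):
    """Return every label whose score reaches the threshold.

    If no label qualifies, fall back to the single best-scoring label
    so that a document is never left without a category.
    """
    chosen = [label for label, score in scores.items() if score >= threshold]
    if not chosen and scores:
        chosen = [max(scores, key=scores.get)]
    return sorted(chosen)


# Hypothetical classifier scores for two documents.
doc_a = assign_labels({"sports": 0.8, "politics": 0.55, "culture": 0.1})
doc_b = assign_labels({"sports": 0.2, "politics": 0.3})
```

Here `doc_a` receives both "politics" and "sports", while `doc_b` has no score above the cut-off and falls back to "politics". Raising the threshold trades recall for precision, which is why the cut-off is usually tuned on held-out data.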
|
29 |
Reprezentace textu a její vliv na kategorizaci / Representation of Text and Its Influence on Categorization
Šabatka, Ondřej. January 2010
The thesis deals with machine processing of textual data. The theoretical part describes issues related to natural language processing and introduces different ways of pre-processing and representing text. It also focuses on the use of N-grams as features for document representation and describes some algorithms used to extract them. The next part outlines the classification methods used. In the practical part, an application for pre-processing text and creating different textual data representations is designed and implemented. The experiments analyse the influence of these representations on the accuracy of classification algorithms.
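As an illustration of N-grams as document features, the sketch below extracts word-level N-grams with a sliding window and counts them. It is not one of the extraction algorithms described in the thesis. The function name `word_ngrams`, the lowercasing, and whitespace tokenization are simplifying assumptions.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of a text as a list of tuples, in order."""
    tokens = text.lower().split()
    # Slide a window of n tokens across the text; too-short texts yield [].
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = word_ngrams("The cat sat on the mat", 2)
# Counting the n-grams gives a simple feature vector for the document.
features = Counter(word_ngrams("to be or not to be", 2))
```

For the second sentence, the bigram `("to", "be")` gets a count of 2 and every other bigram a count of 1. Character N-grams work the same way if the window slides over characters instead of tokens.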
|
30 |
Semantic Topic Modeling and Trend Analysis
Mann, Jasleen Kaur. January 2021
This thesis focuses on finding an end-to-end unsupervised solution to a two-step problem: extracting semantically meaningful topics from a large temporal text corpus and analysing the trends of these topics. To achieve this, the focus is on using the latest developments in Natural Language Processing (NLP) related to pre-trained language models like Google’s Bidirectional Encoder Representations from Transformers (BERT) and other BERT-based models. These transformer-based pre-trained language models provide word and sentence embeddings based on the context of the words. The results are then compared with traditional machine learning techniques for topic modeling. This is done to evaluate whether the quality of topic models has improved and how dependent the techniques are on manually defined model hyperparameters and data preprocessing. These topic models provide a good mechanism for summarizing and organizing a large text corpus and give an overview of how the topics evolve with time. In the context of research publications or scientific journals, such analysis of the corpus can give an overview of research/scientific interest areas and how these interests have evolved over the years. The dataset used for this thesis consists of research articles and papers from one journal, the Journal of Cleaner Production. This journal had more than 24000 research articles at the time of working on this project. We started with implementing Latent Dirichlet Allocation (LDA) topic modeling. In the next step, we implemented LDA along with document clustering to get topics within these clusters. This gave us an idea of the dataset and also gave us a benchmark. After having some base results, we explored transformer-based contextual word and sentence embeddings to evaluate whether this leads to more meaningful, contextual, and semantic topics. For document clustering, we used K-means clustering.
In this thesis, we also discuss methods to optimally visualize the topics and the trend changes of these topics over the years. Finally, we conclude with a method for leveraging contextual embeddings using BERT and Sentence-BERT to solve this problem and achieve semantically meaningful topics. We also discuss the results from traditional machine learning techniques and their limitations.
|