Spelling suggestions: "subject:"pinocchio"" "subject:"batocchio""
1 |
Automatic Document Classification Applied to Swedish NewsBlein, Florent January 2005 (has links)
<p>The first part of this paper presents briefly the ELIN[1] system, an electronic newspaper project. ELIN is a framework that stores news and displays them to the end-user. Such news are formatted using the xml[2] format. The project partner Corren[3] provided ELIN with xml articles, however the format used was not the same. My first task has been to develop a software that converts the news from one xml format (Corren) to another (ELIN).</p><p>The second and main part addresses the problem of automatic document classification and tries to find a solution for a specific issue. The goal is to automatically classify news articles from a Swedish newspaper company (Corren) into the IPTC[4] news categories.</p><p>This work has been carried out by implementing several classification algorithms, testing them and comparing their accuracy with existing software. The training and test documents were 3 weeks of the Corren newspaper that had to be classified into 2 categories.</p><p>The last tests were run with only one algorithm (Naïve Bayes) over a larger amount of data (7, then 10 weeks) and categories (12) to simulate a more real environment.</p><p>The results show that the Naïve Bayes algorithm, although the oldest, was the most accurate in this particular case. An issue raised by the results is that feature selection improves speed but can seldom reduce accuracy by removing too many features.</p>
|
2 |
Automatic Document Classification Applied to Swedish NewsBlein, Florent January 2005 (has links)
The first part of this paper presents briefly the ELIN[1] system, an electronic newspaper project. ELIN is a framework that stores news and displays them to the end-user. Such news are formatted using the xml[2] format. The project partner Corren[3] provided ELIN with xml articles, however the format used was not the same. My first task has been to develop a software that converts the news from one xml format (Corren) to another (ELIN). The second and main part addresses the problem of automatic document classification and tries to find a solution for a specific issue. The goal is to automatically classify news articles from a Swedish newspaper company (Corren) into the IPTC[4] news categories. This work has been carried out by implementing several classification algorithms, testing them and comparing their accuracy with existing software. The training and test documents were 3 weeks of the Corren newspaper that had to be classified into 2 categories. The last tests were run with only one algorithm (Naïve Bayes) over a larger amount of data (7, then 10 weeks) and categories (12) to simulate a more real environment. The results show that the Naïve Bayes algorithm, although the oldest, was the most accurate in this particular case. An issue raised by the results is that feature selection improves speed but can seldom reduce accuracy by removing too many features.
|
3 |
De l'usage de la sémantique dans la classification supervisée de textes : application au domaine médical / On the use of semantics in supervised text classification : application in the medical domainAlbitar, Shereen 12 December 2013 (has links)
Cette thèse porte sur l’impact de l’usage de la sémantique dans le processus de la classification supervisée de textes. Cet impact est évalué au travers d’une étude expérimentale sur des documents issus du domaine médical et en utilisant UMLS (Unified Medical Language System) en tant que ressource sémantique. Cette évaluation est faite selon quatre scénarii expérimentaux d’ajout de sémantique à plusieurs niveaux du processus de classification. Le premier scénario correspond à la conceptualisation où le texte est enrichi avant indexation par des concepts correspondant dans UMLS ; le deuxième et le troisième scénario concernent l’enrichissement des vecteurs représentant les textes après indexation dans un sac de concepts (BOC – bag of concepts) par des concepts similaires. Enfin le dernier scénario utilise la sémantique au niveau de la prédiction des classes, où les concepts ainsi que les relations entre eux, sont impliqués dans la prise de décision. Le premier scénario est testé en utilisant trois des méthodes de classification: Rocchio, NB et SVM. Les trois autres scénarii sont uniquement testés en utilisant Rocchio qui est le mieux à même d’accueillir les modifications nécessaires. Au travers de ces différentes expérimentations nous avons tout d’abord montré que des améliorations significatives pouvaient être obtenues avec la conceptualisation du texte avant l’indexation. Ensuite, à partir de représentations vectorielles conceptualisées, nous avons constaté des améliorations plus modérées avec d’une part l’enrichissement sémantique de cette représentation vectorielle après indexation, et d’autre part l’usage de mesures de similarité sémantique en prédiction. / The main interest of this research is the effect of using semantics in the process of supervised text classification. This effect is evaluated through an experimental study on documents related to the medical domain using the UMLS (Unified Medical Language System) as a semantic resource. This evaluation follows four scenarios involving semantics at different steps of the classification process: the first scenario incorporates the conceptualization step where text is enriched with corresponding concepts from UMLS; both the second and the third scenarios concern enriching vectors that represent text as Bag of Concepts (BOC) with similar concepts; the last scenario considers using semantics during class prediction, where concepts as well as the relations between them are involved in decision making. We test the first scenario using three popular classification techniques: Rocchio, NB and SVM. We choose Rocchio for the other scenarios for its extendibility with semantics. According to experiment, results demonstrated significant improvement in classification performance using conceptualization before indexing. Moderate improvements are reported using conceptualized text representation with semantic enrichment after indexing or with semantic text-to-text semantic similarity measures for prediction.
|
4 |
Rocchio, Ide, Okapi och BIM : En komparativ studie av fyra metoder för relevance feedback / Rocchio, Ide, Okapi and BIM : A comparative study of four methods for relevance feedbackEriksen, Martin January 2008 (has links)
This thesis compares four relevance feedback methods. The Rocchio and Ide dec-hi algorithms for the vector space model and the binary independence model and Okapi BM25 within the probabilistic framework. This is done in a custom-made Information Retrieval system utilizing a collection containing 131 896 LA-Times articles which is part of the TREC ad-hoc collection. The methods are compared on two grounds, using only the relevance information from the 20 highest ranked documents from an initial search and also by using all available relevance information. Although a significant effect of choice of method could be found on the first ground, post-hoc analysis could not determine any statistically significant differences between the methods where Rocchio, Ide dec-hi and Okapi BM25 performed equivalent. All methods except the binary independence model performed significantly better than using no relevance feedback. It was also revealed that although the binary independence model performed far worse on average than the other methods it did outperform them on nearly 20 % of the topics. Further analysis argued that this depends on the lack of query expansion in the binary independence model which is advantageous for some topics although has a negative effect on retrieval efficiency in general. On the second ground Okapi BM25 performed significantly better than the other methods with the binary independence model once again being the worst performer. It was argued that the other methods have problems scaling to large amounts of relevance information where Okapi BM25 has no such issues. / Uppsatsnivå: D
|
5 |
Text-Based Information Retrieval Using Relevance FeedbackKrishnan, Sharenya January 2011 (has links)
Europeana, a freely accessible digital library with an idea to make Europe's cultural and scientific heritage available to the public was founded by the European Commission in 2008. The goal was to deliver a semantically enriched digital content with multilingual access to it. Even though they managed to increase the content of data they slowly faced the problem of retrieving information in an unstructured form. So to complement the Europeana portal services, ASSETS (Advanced Search Service and Enhanced Technological Solutions) was introduced with services that sought to improve the usability and accessibility of Europeana. My contribution is to study different text-based information retrieval models, their relevance feedback techniques and to implement one simple model. The thesis explains a detailed overview of the information retrieval process along with the implementation of the chosen strategy for relevance feedback that generates automatic query expansion. Finally, the thesis concludes with the analysis made using relevance feedback, discussion on the model implemented and then an assessment on future use of this model both as a continuation of my work and using this model in ASSETS.
|
Page generated in 0.0575 seconds