Spelling suggestions: "subject:"[een] SUMMARIZATION"" "subject:"[enn] SUMMARIZATION""
181 |
CorrefSum: revisão da coesão referencial em sumários extrativosGonçalves, Patrícia Nunes 28 February 2008 (has links)
Made available in DSpace on 2015-03-05T13:59:43Z (GMT). No. of bitstreams: 0
Previous issue date: 28 / Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / Com o avanço da Internet, cada vez mais convivemos com a sobrecarga de informação. É nesse contexto que a área de sumarização automática de textos tem se tornado uma área proeminente de pesquisa. A sumarização é o processo de discernir as informações mais importantes dos textos para produzir uma versão resumida. Sumarizadores extrativos escolhem as sentenças mais relevantes do texto e as reagrupam para formar o sumário. Muitas vezes, as frases selecionadas do texto não preservam a coesão referencial necessária para o entendimento do texto. O foco deste trabalho é, portanto, na análise e recuperação da coesão referencial desses sumários. O objetivo é desenvolver
um sistema que realiza a manutenção da coesão referencial dos sumários extrativos usando como fonte de informação as cadeias de correferência presentes no texto-fonte.
Para experimentos e avaliação dos resultados foram utilizados dois sumarizadores: Gist-Summ e SuPor-2. Foram utilizadas duas formas de avaliação: automática e subjetiva. Os
resultados / With the advance of Internet technology we see the problem of information overload. In this context, automatic summarization is an important research area. Summarization
is the process of identifying the most relevant information brought about in a text and on that basis to rewrite a short version of it. Extractive summarizers choose the most relevant sentences in a text and regroup them to form the summary. Usually the juxtaposition of the selected sentences violate the referential cohesion that is needed for the interpretation of the text. This work focuses on the analysis and recovery of referential cohesion of extractive summaries on the basis of knowledge about correference chains as presented in the source text. Some experiments were undertaken considering the summarizers GistSumm and SuPor-2. Evaluation was done in two ways, automatically and subjectively. The results indicate that this is a promising area of work and ways of advancing in this research are discussed
|
182 |
Resource Lean and Portable Automatic Text SummarizationHassel, Martin January 2007 (has links)
Today, with digitally stored information available in abundance, even for many minor languages, this information must by some means be filtered and extracted in order to avoid drowning in it. Automatic summarization is one such technique, where a computer summarizes a longer text to a shorter non-rendundant form. Apart from the major languages of the world there are a lot of languages for which large bodies of data aimed at language technology research to a high degree are lacking. There might also not be resources available to develop such bodies of data, since it is usually time consuming and requires substantial manual labor, hence being expensive. Nevertheless, there will still be a need for automatic text summarization for these languages in order to subdue this constantly increasing amount of electronically produced text. This thesis thus sets the focus on automatic summarization of text and the evaluation of summaries using as few human resources as possible. The resources that are used should to as high extent as possible be already existing, not specifically aimed at summarization or evaluation of summaries and, preferably, created as part of natural literary processes. Moreover, the summarization systems should be able to be easily assembled using only a small set of basic language processing tools, again, not specifically aimed at summarization/evaluation. The summarization system should thus be near language independent as to be quickly ported between different natural languages. The research put forth in this thesis mainly concerns three computerized systems, one for near language independent summarization – The HolSum summarizer; one for the collection of large-scale corpora – The KTH News Corpus; and one for summarization evaluation – The KTH eXtract Corpus. These three systems represent three different aspects of transferring the proposed summarization method to a new language. One aspect is the actual summarization method and how it relates to the highly irregular nature of human language and to the difference in traits among language groups. This aspect is discussed in detail in Chapter 3. This chapter also presents the notion of “holistic summarization”, an approach to self-evaluative summarization that weighs the fitness of the summary as a whole, by semantically comparing it to the text being summarized, before presenting it to the user. This approach is embodied as the text summarizer HolSum, which is presented in this chapter and evaluated in Paper 5. A second aspect is the collection of large-scale corpora for languages where few or none such exist. This type of corpora is on the one hand needed for building the language model used by HolSum when comparing summaries on semantic grounds, on the other hand a large enough set of (written) language use is needed to guarantee the randomly selected subcorpus used for evaluation to be representative. This topic briefly touched upon in Chapter 4, and detailed in Paper 1. The third aspect is, of course, the evaluation of the proposed summarization method on a new language. This aspect is investigated in Chapter 4. Evaluations of HolSum have been run on English as well as on Swedish, using both well established data and evaluation schemes (English) as well as with corpora gathered “in the wild” (Swedish). During the development of the latter corpora, which is discussed in Paper 4, evaluations of a traditional sentence ranking text summarizer, SweSum, have also been run. These can be found in Paper 2 and 3. This thesis thus contributes a novel approach to highly portable automatic text summarization, coupled with methods for building the needed corpora, both for training and evaluation on the new language. / Idag, med ett överflöd av digitalt lagrad information även för många mindre språk, är det nära nog omöjligt att manuellt sålla och välja ut vilken information man ska ta till sig. Denna information måste istället filteras och extraheras för att man inte ska drunkna i den. En teknik för detta är automatisk textsammanfattning, där en dator sammanfattar en längre text till en kortare icke-redundant form. Vid sidan av de stora världsspråken finns det många små språk för vilka det saknas stora datamängder ämnade för språkteknologisk forskning. För dessa saknas det också ofta resurser för att bygga upp sådana datamängder då detta är tidskrävande och ofta dessutom kräver en ansenlig mängd manuellt arbete. Likväl behövs automatisk textsammanfattning för dessa språk för att tämja denna konstant ökande mängd elektronsikt producerad text. Denna avhandling sätter således fokus på automatisk sammanfattning av text med så liten mänsklig insats som möjligt. De använda resurserna bör i så hög grad som möjligt redan existera, inte behöva vara skapade för automatisk textsammanfattning och helst även ha kommit till som en naturlig del av en litterär process. Vidare, sammanfattningssystemet bör utan större ansträngning kunna sättas samman med hjälp av ett mindre antal mycket grundläggande språkteknologiska verktyg, vilka inte heller de är specifikt ämnade för textsammanfattning. Textsammanfattaren bör således vara nära nog språkoberoende för att det med enkelhet kunna att flyttas mellan ett språk och ett annat. Den forskning som läggs fram i denna avhandling berör i huvudsak tre datorsystem, ett för nära nog språkoberoende sammanfattning – HolSum; ett för insamlande av stora textmängder – KTH News Corpus; och ett för utvärdering av sammanfattning – KTH eXtract Corpus. Dessa tre system representerar tre olika aspekter av att föra över den framlagda sammanfattningsmetoden till ett nytt språk. En aspekt är den faktiska sammanfattningsmetoden och hur den påverkas av mänskliga språks högst oregelbundna natur och de skillnader som uppvisas mellan olika språkgrupper. Denna aspekt diskuteras i detalj i kapitel tre. I detta kapitel presenteras också begreppet “holistisk sammanfattning”, en ansats tillsjälvutvärderande sammanfattning vilken gör en innehållslig bedömning av sammanfattningen som en helhet innan den presenteras för användaren. Denna ansats förkroppsligas i textsammanfattaren HolSum, som presenteras i detta kapitel samt utvärderas i artikel fem. En andra aspekt är insamlandet av stora textmängder för språk där sådana saknas. Denna typ av datamängder behövs dels för att bygga den språkmodell som HolSum använder sig av när den gör innehållsliga jämförelser sammanfattningar emellan, dels behövs dessa för att ha en tillräckligt stor mängd text att kunna slumpmässigt extrahera en representativ delmängd lämpad för utvärdering ur. Denna aspekt berörs kortfattat i kapitel fyra och i mer önskvärd detalj i artikel ett. Den tredje aspekten är, naturligtvis, utvärdering av den framlagda sammanfattningsmetoden på ett nytt språk. Denna aspekt ges en översikt i kapitel 4. Utvärderingar av HolSum har utförts både med väl etablerade datamängder och utvärderingsmetoder (för engelska) och med data- och utvärderingsmängder insamlade specifikt för detta ändamål (för svenska). Under sammanställningen av denna senare svenska datamängd, vilken beskrivs i artikel fyra, så utfördes även utvärderingar av en traditionell meningsextraherande textsammanfattare, SweSum. Dessa återfinns beskrivna i artikel två och tre. Denna avhandling bidrar således med ett nydanande angreppssätt för nära nog språkoberoende textsammanfattning, uppbackad av metoder för sammansättning av erforderliga datamängder för såväl modellering av som utvärdering på ett nytt språk. / QC 20100712
|
183 |
Association Pattern Analysis for Pattern Pruning, Clustering and SummarizationLi, Chung Lam 12 September 2008 (has links)
Automatic pattern mining from databases and the analysis of the discovered patterns for useful information are important and in great demand in science, engineering and business. Today, effective pattern mining methods, such as association rule mining and pattern discovery, have been developed and widely used in various challenging industrial and business applications. These methods attempt to uncover the valuable information trapped in large collections of raw data. The patterns revealed provide significant and useful information for decision makers. Paradoxically, pattern mining itself can produce such huge amounts of data that poses a new knowledge management problem: to tackle thousands or even more patterns discovered and held in a data set. Unlike raw data, patterns often overlap, entangle and interrelate to each other in the databases. The relationship among them is usually complex and the notion of distance between them is difficult to qualify and quantify. Such phenomena pose great challenges to the existing data mining discipline. In this thesis, the analysis of patterns after their discovery by existing pattern mining methods is referred to as pattern post-analysis since the patterns to be analyzed are first discovered.
Due to the overwhelmingly huge volume of discovered patterns in pattern mining, it is virtually impossible for a human user to manually analyze them. Thus, the valuable trapped information in the data is shifted to a large collection of patterns. Hence, to automatically analyze the patterns discovered and present the results in a user-friendly manner such as pattern post-analysis is badly needed. This thesis attempts to solve the problems listed below. It addresses 1) the important factors contributing to the interrelating relationship among patterns and hence more accurate measurements of distances between them; 2) the objective pruning of redundant patterns from the discovered patterns; 3) the objective clustering of the patterns into coherent pattern clusters for better organization; 4) the automatic summarization of each pattern cluster for human interpretation; and 5) the application of pattern post-analysis to large database analysis and data mining.
In this thesis, the conceptualization, theoretical formulation, algorithm design and system development of pattern post-analysis of categorical or discrete-valued data is presented. It starts with presenting a natural dual relationship between patterns and data. The relationship furnishes an explicit one-to-one correspondence between a pattern and its associated data and provides a base for an effective analysis of patterns by relating them back to the data. It then discusses the important factors that differentiate patterns and formulates the notion of distances among patterns using a formal graphical approach. To accurately measure the distances between patterns and their associated data, both the samples and the attributes matched by the patterns are considered. To achieve this, the distance measure between patterns has to account for the differences of their associated data clusters at the attribute value (i.e. item) level. Furthermore, to capture the degree of variation of the items matched by patterns, entropy-based distance measures are developed. It attempts to quantify the uncertainty of the matched items. Such distances render an accurate and robust distance measurement between patterns and their associated data. To understand the properties and behaviors of the new distance measures, the mathematical relation between the new distances and the existing sample-matching distances is analytically derived.
The new pattern distances based on the dual pattern-data relationship and their related concepts are used and adapted to pattern pruning, pattern clustering and pattern summarization to furnish an integrated, flexible and generic framework for pattern post-analysis which is able to meet the challenges of today’s complex real-world problems. In pattern pruning, the system defines the amount of redundancy of a pattern with respect to another pattern at the item level. Such definition generalizes the classical closed itemset pruning and maximal itemset pruning which define redundancy at the sample level. A new generalized itemset pruning method is developed using the new definition. It includes the closed and maximal itemsets as two extreme special cases and provides a control parameter for the user to adjust the tradeoff between the number of patterns being pruned and the amount of information loss after pruning. The mathematical relation between the proposed generalized itemsets and the existing closed and maximal itemsets are also given. In pattern clustering, a dual clustering method, known as simultaneous pattern and data clustering, is developed using two common yet very different types of clustering algorithms: hierarchical clustering and k-means clustering. Hierarchical clustering generates the entire clustering hierarchy but it is slow and not scalable. K-means clustering produces only a partition so it is fast and scalable. They can be used to handle most real-world situations (i.e. speed and clustering quality). The new clustering method is able to simultaneously cluster patterns as well as their associated data while maintaining an explicit pattern-data relationship. Such relationship enables subsequent analysis of individual pattern clusters through their associated data clusters. One important analysis on a pattern cluster is pattern summarization. In pattern summarization, to summarize each pattern cluster, a subset of the representative patterns will be selected for the cluster. Again, the system measures how representative a pattern is at the item level and takes into account how the patterns overlap each other. The proposed method, called AreaCover, is extended from the well-known RuleCover algorithm. The relationship between the two methods is given. AreaCover is less prone to yield large, trivial patterns (large patterns may cause summary that is too general and not informative enough), and the resulting summary is more concise (with less duplicated attribute values among summary patterns) and more informative (describing more attribute values in the cluster and have longer summary patterns).
The thesis also covers the implementation of the major ideas outlined in the pattern post-analysis framework in an integrated software system. It ends with a discussion on the experimental results of pattern post-analysis on both synthetic and real-world benchmark data. Compared with the existing systems, the new methodology that this thesis presents stands out, possessing significant and superior characteristics in pattern post-analysis and decision support.
|
184 |
Association Pattern Analysis for Pattern Pruning, Clustering and SummarizationLi, Chung Lam 12 September 2008 (has links)
Automatic pattern mining from databases and the analysis of the discovered patterns for useful information are important and in great demand in science, engineering and business. Today, effective pattern mining methods, such as association rule mining and pattern discovery, have been developed and widely used in various challenging industrial and business applications. These methods attempt to uncover the valuable information trapped in large collections of raw data. The patterns revealed provide significant and useful information for decision makers. Paradoxically, pattern mining itself can produce such huge amounts of data that poses a new knowledge management problem: to tackle thousands or even more patterns discovered and held in a data set. Unlike raw data, patterns often overlap, entangle and interrelate to each other in the databases. The relationship among them is usually complex and the notion of distance between them is difficult to qualify and quantify. Such phenomena pose great challenges to the existing data mining discipline. In this thesis, the analysis of patterns after their discovery by existing pattern mining methods is referred to as pattern post-analysis since the patterns to be analyzed are first discovered.
Due to the overwhelmingly huge volume of discovered patterns in pattern mining, it is virtually impossible for a human user to manually analyze them. Thus, the valuable trapped information in the data is shifted to a large collection of patterns. Hence, to automatically analyze the patterns discovered and present the results in a user-friendly manner such as pattern post-analysis is badly needed. This thesis attempts to solve the problems listed below. It addresses 1) the important factors contributing to the interrelating relationship among patterns and hence more accurate measurements of distances between them; 2) the objective pruning of redundant patterns from the discovered patterns; 3) the objective clustering of the patterns into coherent pattern clusters for better organization; 4) the automatic summarization of each pattern cluster for human interpretation; and 5) the application of pattern post-analysis to large database analysis and data mining.
In this thesis, the conceptualization, theoretical formulation, algorithm design and system development of pattern post-analysis of categorical or discrete-valued data is presented. It starts with presenting a natural dual relationship between patterns and data. The relationship furnishes an explicit one-to-one correspondence between a pattern and its associated data and provides a base for an effective analysis of patterns by relating them back to the data. It then discusses the important factors that differentiate patterns and formulates the notion of distances among patterns using a formal graphical approach. To accurately measure the distances between patterns and their associated data, both the samples and the attributes matched by the patterns are considered. To achieve this, the distance measure between patterns has to account for the differences of their associated data clusters at the attribute value (i.e. item) level. Furthermore, to capture the degree of variation of the items matched by patterns, entropy-based distance measures are developed. It attempts to quantify the uncertainty of the matched items. Such distances render an accurate and robust distance measurement between patterns and their associated data. To understand the properties and behaviors of the new distance measures, the mathematical relation between the new distances and the existing sample-matching distances is analytically derived.
The new pattern distances based on the dual pattern-data relationship and their related concepts are used and adapted to pattern pruning, pattern clustering and pattern summarization to furnish an integrated, flexible and generic framework for pattern post-analysis which is able to meet the challenges of today’s complex real-world problems. In pattern pruning, the system defines the amount of redundancy of a pattern with respect to another pattern at the item level. Such definition generalizes the classical closed itemset pruning and maximal itemset pruning which define redundancy at the sample level. A new generalized itemset pruning method is developed using the new definition. It includes the closed and maximal itemsets as two extreme special cases and provides a control parameter for the user to adjust the tradeoff between the number of patterns being pruned and the amount of information loss after pruning. The mathematical relation between the proposed generalized itemsets and the existing closed and maximal itemsets are also given. In pattern clustering, a dual clustering method, known as simultaneous pattern and data clustering, is developed using two common yet very different types of clustering algorithms: hierarchical clustering and k-means clustering. Hierarchical clustering generates the entire clustering hierarchy but it is slow and not scalable. K-means clustering produces only a partition so it is fast and scalable. They can be used to handle most real-world situations (i.e. speed and clustering quality). The new clustering method is able to simultaneously cluster patterns as well as their associated data while maintaining an explicit pattern-data relationship. Such relationship enables subsequent analysis of individual pattern clusters through their associated data clusters. One important analysis on a pattern cluster is pattern summarization. In pattern summarization, to summarize each pattern cluster, a subset of the representative patterns will be selected for the cluster. Again, the system measures how representative a pattern is at the item level and takes into account how the patterns overlap each other. The proposed method, called AreaCover, is extended from the well-known RuleCover algorithm. The relationship between the two methods is given. AreaCover is less prone to yield large, trivial patterns (large patterns may cause summary that is too general and not informative enough), and the resulting summary is more concise (with less duplicated attribute values among summary patterns) and more informative (describing more attribute values in the cluster and have longer summary patterns).
The thesis also covers the implementation of the major ideas outlined in the pattern post-analysis framework in an integrated software system. It ends with a discussion on the experimental results of pattern post-analysis on both synthetic and real-world benchmark data. Compared with the existing systems, the new methodology that this thesis presents stands out, possessing significant and superior characteristics in pattern post-analysis and decision support.
|
185 |
Multilayer background modeling under occlusions for spatio-temporal scene analysisAzmat, Shoaib 21 September 2015 (has links)
This dissertation presents an efficient multilayer background modeling approach to distinguish among midground objects, the objects whose existence occurs over varying time scales between the extremes of short-term ephemeral appearances (foreground) and long-term stationary persistences (background). Traditional background modeling separates a given scene into foreground and background regions. However, the real world can be much more complex than this simple classification, and object appearance events often occur over varying time scales. There are situations in which objects appear on the scene at different points in time and become stationary; these objects can get occluded by one another, and can change positions or be removed from the scene. Inability to deal with such scenarios involving midground objects results in errors, such as ghost objects, miss-detection of occluding objects, aliasing caused by the objects that have left the scene but are not removed from the model, and new objects’ detection when existing objects are displaced. Modeling temporal layers of multiple objects allows us to overcome these errors, and enables the surveillance and summarization of scenes containing multiple midground objects.
|
186 |
Προσωποποιημένη προβολή περιεχομένου του Διαδικτύου με τεχνικές προ-επεξεργασίας, αυτόματης κατηγοριοποίησης και αυτόματης εξαγωγής περίληψηςΠουλόπουλος, Βασίλειος 22 November 2007 (has links)
Σκοπός της Μεταπτυχιακής Εργασίας είναι η επέκταση και αναβάθμιση του μηχανισμού που είχε δημιουργηθεί στα πλαίσια της Διπλωματικής Εργασίας που εκπόνησα με τίτλο «Δημιουργία Πύλης Προσωποποιημένης Πρόσβασης σε Περιεχόμενο του WWW».
Η παραπάνω Διπλωματική εργασία περιλάμβανε τη δημιουργία ενός μηχανισμού που ξεκινούσε με ανάκτηση πληροφορίας από το Διαδίκτυο (HTML σελίδες από news portals), εξαγωγή χρήσιμου κειμένου και προεπεξεργασία της πληροφορίας, αυτόματη κατηγοριοποίηση της πληροφορίας και τέλος παρουσίαση στον τελικό χρήστη με προσωποποίηση με στοιχεία που εντοπίζονταν στις επιλογές του χρήστη.
Στην παραπάνω εργασία εξετάστηκαν διεξοδικά θέματα που είχαν να κάνουν με τον τρόπο προεπεξεργασίας της πληροφορίας καθώς και με τον τρόπο αυτόματης κατηγοριοποίησης ενώ υλοποιήθηκαν αλγόριθμοι προεπεξεργασίας πληροφορίας τεσσάρων σταδίων και αλγόριθμος αυτόματης κατηγοριοποίησης βασισμένος σε πρότυπες κατηγορίες.
Τέλος υλοποιήθηκε portal το οποίο εκμεταλλευόμενο την επεξεργασία που έχει πραγματοποιηθεί στην πληροφορία παρουσιάζει το περιεχόμενο στους χρήστες προσωποποιημένο βάσει των επιλογών που αυτοί πραγματοποιούν.
Σκοπός της μεταπτυχιακής εργασίας είναι η εξέταση περισσοτέρων αλγορίθμων για την πραγματοποίηση της παραπάνω διαδικασίας αλλά και η υλοποίησή τους προκειμένου να γίνει σύγκριση αλγορίθμων και παραγωγή ποιοτικότερου αποτελέσματος.
Πιο συγκεκριμένα αναβαθμίζονται όλα τα στάδια λειτουργίας του μηχανισμού. Έτσι, το στάδιο λήψης πληροφορίας βασίζεται σε έναν απλό crawler λήψης HTML σελίδων από αγγλόφωνα news portals. Η διαδικασία βασίζεται στο γεγονός πως για κάθε σελίδα υπάρχουν RSS feeds. Διαβάζοντας τα τελευταία νέα που προκύπτουν από τις εγγραφές στα RSS feeds μπορούμε να εντοπίσουμε όλα τα URL που περιέχουν HTML σελίδες με τα άρθρα. Οι HTML σελίδες φιλτράρονται προκειμένου από αυτές να γίνει εξαγωγή μόνο του κειμένου και πιο αναλυτικά του χρήσιμου κειμένου ούτως ώστε το κείμενο που εξάγεται να αφορά αποκλειστικά άρθρα. Η τεχνική εξαγωγής χρήσιμου κειμένου βασίζεται στην τεχνική web clipping. Ένας parser, ελέγχει την HTML δομή προκειμένου να εντοπίσει τους κόμβους που περιέχουν μεγάλη ποσότητα κειμένου και βρίσκονται κοντά σε άλλους κόμβους που επίσης περιέχουν μεγάλες ποσότητες κειμένου.
Στα εξαγόμενα άρθρα πραγματοποιείται προεπεξεργασία πέντε σταδίων με σκοπό να προκύψουν οι λέξεις κλειδιά που είναι αντιπροσωπευτικές του άρθρου. Πιο αναλυτικά, αφαιρούνται όλα τα σημεία στίξης, όλοι οι αριθμοί, μετατρέπονται όλα τα γράμματα σε πεζά, αφαιρούνται όλες οι λέξεις που έχουν λιγότερους από 4 χαρακτήρες, αφαιρούνται όλες οι κοινότυπες λέξεις και τέλος εφαρμόζονται αλγόριθμοι εύρεσης της ρίζας μίας λέξεις. Οι λέξεις κλειδιά που απομένουν είναι stemmed το οποίο σημαίνει πως από τις λέξεις διατηρείται μόνο η ρίζα.
Από τις λέξεις κλειδιά ο μηχανισμός οδηγείται σε δύο διαφορετικά στάδια ανάλυσης. Στο πρώτο στάδιο υπάρχει μηχανισμός ο οποίος αναλαμβάνει να δημιουργήσει μία αντιπροσωπευτική περίληψη του κειμένου ενώ στο δεύτερο στάδιο πραγματοποιείται αυτόματη κατηγοριοποίηση του κειμένου βασισμένη σε πρότυπες κατηγορίες που έχουν δημιουργηθεί από επιλεγμένα άρθρα που συλλέγονται καθ’ όλη τη διάρκεια υλοποίησης του μηχανισμού. Η εξαγωγή περίληψης βασίζεται σε ευρεστικούς αλγορίθμους. Πιο συγκεκριμένα προσπαθούμε χρησιμοποιώντας λεξικολογική ανάλυση του κειμένου αλλά και γεγονότα για τις λέξεις του κειμένου αν δημιουργήσουμε βάρη για τις προτάσεις του κειμένου. Οι προτάσεις με τα μεγαλύτερη βάρη μετά το πέρας της διαδικασίας είναι αυτές που επιλέγονται για να διαμορφώσουν την περίληψη. Όπως θα δούμε και στη συνέχεια για κάθε άρθρο υπάρχει μία γενική περίληψη αλλά το σύστημα είναι σε θέση να δημιουργήσει προσωποποιημένες περιλήψεις για κάθε χρήστη. Η διαδικασία κατηγοριοποίησης βασίζεται στη συσχέτιση συνημίτονου συγκριτικά με τις πρότυπες κατηγορίες. Η κατηγοριοποίηση δεν τοποθετεί μία ταμπέλα σε κάθε άρθρο αλλά μας δίνει τα αποτελέσματα συσχέτισης του άρθρου με κάθε κατηγορία.
Ο συνδυασμός των δύο παραπάνω σταδίων δίνει την πληροφορία που εμφανίζεται σε πρώτη φάση στο χρήστη που επισκέπτεται το προσωποποιημένο portal. Η προσωποποίηση στο portal βασίζεται στις επιλογές που κάνουν οι χρήστες, στο χρόνο που παραμένουν σε μία σελίδα αλλά και στις επιλογές που δεν πραγματοποιούν προκειμένου να δημιουργηθεί προφίλ χρήστη και να είναι εφικτό με την πάροδο του χρόνου να παρουσιάζεται στους χρήστες μόνο πληροφορία που μπορεί να τους ενδιαφέρει. / The scope of this MsC thesis is the extension and upgrade of the mechanism that was constructed during my undergraduate studies under my undergraduate thesis entitled “Construction of a Web Portal with Personalized Access to WWW content”.
The aforementioned thesis included the construction of a mechanism that would begin with information retrieval from the WWW and would conclude to representation of information through a portal after applying useful text extraction, text pre-processing and text categorization techniques.
The scope of the MsC thesis is to locate the problematic parts of the system and correct them with better algorithms and also include more modules on the complete mechanism.
More precisely, all the modules are upgraded while more of them are constructed in every aspect of the mechanism. The information retrieval module is based on a simple crawler. The procedure is based on the fact that all the major news portals include RSS feeds. By locating the latest articles that are added to the RSS feeds we are able to locate all the URLs of the HTML pages that include articles. The crawler then visits every simple URL and downloads the HTML page. These pages are filtered by the useful text extraction mechanism in order to extract only the body of the article from the HTML page. This procedure is based on the web-clipping technique. An HTML parser analyzes the DOM model of HTML and locates the nodes (leafs) that include large amounts of text and are close to nodes with large amounts of text. These nodes are considered to include the useful text.
In the extracted useful text we apply a 5 level preprocessing technique in order to extract the keywords of the article. More analytically, we remove the punctuation, the numbers, the words that are smaller than 4 letters, the stopwords and finally we apply a stemming algorithm in order to produce the root of the word.
The keywords are utilized into two different interconnected levels. The first is the categorization subsystem and the second is the summarization subsystem. During the summarization stage the system constructs a summary of the article while the second stage tries to label the article. The labeling is not unique but the categorization applies multi-labeling techniques in order to detect the relation with each of the standard categories of the system. The summarization technique is based on heuristics. More specifically, we try, by utilizing language processing and facts that concern the keywords, to create a score for each of the sentences of the article. The more the score of a sentence, the more the probability of it to be included to the summary which consists of sentences of the text.
The combination of the categorization and summarization provides the information that is shown to our web portal called perssonal. The personalization issue of the portal is based on the selections of the user, on the non-selections of the user, on the time that the user remains on an article, on the time that spends reading similar or identical articles. After a short period of time, the system is able to adopt on the user’s needs and is able to present articles that match the preferences of the user only.
|
187 |
Supervised Learning Approaches for Automatic Structuring of Videos / Méthodes d'apprentissage supervisé pour la structuration automatique de vidéosPotapov, Danila 22 July 2015 (has links)
L'Interprétation automatique de vidéos est un horizon qui demeure difficile a atteindre en utilisant les approches actuelles de vision par ordinateur. Une des principales difficultés est d'aller au-delà des descripteurs visuels actuels (de même que pour les autres modalités, audio, textuelle, etc) pour pouvoir mettre en oeuvre des algorithmes qui permettraient de reconnaitre automatiquement des sections de vidéos, potentiellement longues, dont le contenu appartient à une certaine catégorie définie de manière sémantique. Un exemple d'une telle section de vidéo serait une séquence ou une personne serait en train de pêcher; un autre exemple serait une dispute entre le héros et le méchant dans un film d'action hollywoodien. Dans ce manuscrit, nous présentons plusieurs contributions qui vont dans le sens de cet objectif ambitieux, en nous concentrant sur trois tâches d'analyse de vidéos: le résumé automatique, la classification, la localisation temporelle.Tout d'abord, nous introduisons une approche pour le résumé automatique de vidéos, qui fournit un résumé de courte durée et informatif de vidéos pouvant être très longues, résumé qui est de plus adapté à la catégorie de vidéos considérée. Nous introduisons également une nouvelle base de vidéos pour l'évaluation de méthodes de résumé automatique, appelé MED-Summaries, ou chaque plan est annoté avec un score d'importance, ainsi qu'un ensemble de programmes informatiques pour le calcul des métriques d'évaluation.Deuxièmement, nous introduisons une nouvelle base de films de cinéma annotés, appelée Inria Action Movies, constitué de films d'action hollywoodiens, dont les plans sont annotés suivant des catégories sémantiques non-exclusives, dont la définition est suffisamment large pour couvrir l'ensemble du film. Un exemple de catégorie est "course-poursuite"; un autre exemple est "scène sentimentale". Nous proposons une approche pour localiser les sections de vidéos appartenant à chaque catégorie et apprendre les dépendances temporelles entre les occurrences de chaque catégorie.Troisièmement, nous décrivons les différentes versions du système développé pour la compétition de détection d'événement vidéo TRECVID Multimédia Event Detection, entre 2011 et 2014, en soulignant les composantes du système dont l'auteur du manuscrit était responsable. / Automatic interpretation and understanding of videos still remains at the frontier of computer vision. The core challenge is to lift the expressive power of the current visual features (as well as features from other modalities, such as audio or text) to be able to automatically recognize typical video sections, with low temporal saliency yet high semantic expression. Examples of such long events include video sections where someone is fishing (TRECVID Multimedia Event Detection), or where the hero argues with a villain in a Hollywood action movie (Inria Action Movies). In this manuscript, we present several contributions towards this goal, focusing on three video analysis tasks: summarization, classification, localisation.First, we propose an automatic video summarization method, yielding a short and highly informative video summary of potentially long videos, tailored for specified categories of videos. We also introduce a new dataset for evaluation of video summarization methods, called MED-Summaries, which contains complete importance-scorings annotations of the videos, along with a complete set of evaluation tools.Second, we introduce a new dataset, called Inria Action Movies, consisting of long movies, and annotated with non-exclusive semantic categories (called beat-categories), whose definition is broad enough to cover most of the movie footage. Categories such as "pursuit" or "romance" in action movies are examples of beat-categories. We propose an approach for localizing beat-events based on classifying shots into beat-categories and learning the temporal constraints between shots.Third, we overview the Inria event classification system developed within the TRECVID Multimedia Event Detection competition and highlight the contributions made during the work on this thesis from 2011 to 2014.
|
188 |
A storytelling machine ? : automatic video summarization : the case of TV series / Une machine à raconter des histoires ? : Analyse et modélisation des processus de ré-éditorialisation de vidéosBost, Xavier 23 November 2016 (has links)
Ces dix dernières années, les séries télévisées sont devenues de plus en plus populaires. Par opposition aux séries TV classiques composées d’épisodes autosuffisants d’un point de vue narratif, les séries TV modernes développent des intrigues continues sur des dizaines d’épisodes successifs. Cependant, la continuité narrative des séries TV modernes entre directement en conflit avec les conditions usuelles de visionnage : en raison des technologies modernes de visionnage, les nouvelles saisons des séries TV sont regardées sur de courtes périodes de temps. Par conséquent, les spectateurs sur le point de visionner de nouvelles saisons sont largement désengagés de l’intrigue, à la fois d’un point de vue cognitif et affectif. Une telle situation fournit au résumé de vidéos des scénarios d’utilisation remarquablement réalistes, que nous détaillons dans le Chapitre 1. De plus, le résumé automatique de films, longtemps limité à la génération de bande-annonces à partir de descripteurs de bas niveau, trouve dans les séries TV une occasion inédite d’aborder dans des conditions bien définies ce qu’on appelle le fossé sémantique : le résumé de médias narratifs exige des approches orientées contenu, capables de jeter un pont entre des descripteurs de bas niveau et le niveau humain de compréhension. Nous passons en revue dans le Chapitre 2 les deux principales approches adoptées jusqu’ici pour aborder le problème du résumé automatique de films de fiction. Le Chapitre 3 est consacré aux différentes sous-tâches requises pour construire les représentations intermédiaires sur lesquelles repose notre système de génération de résumés : la Section 3.2 se concentre sur la segmentation de vidéos,tandis que le reste du chapitre est consacré à l’extraction de descripteurs de niveau intermédiaire,soit orientés saillance (échelle des plans, musique de fond), soit en relation avec le contenu (locuteurs). Dans le Chapitre 4, nous utilisons l’analyse des réseaux sociaux comme une manière possible de modéliser l’intrigue des séries TV modernes : la dynamique narrative peut être adéquatement capturée par l’évolution dans le temps du réseau des personnages en interaction. Cependant, nous devons faire face ici au caractère séquentiel de la narration lorsque nous prenons des vues instantanées de l’état des relations entre personnages. Nous montrons que les approches classiques par fenêtrage temporel ne peuvent pas traiter convenablement ce cas, et nous détaillons notre propre méthode pour extraire des réseaux sociaux dynamiques dans les médias narratifs.Le Chapitre 5 est consacré à la génération finale de résumés orientés personnages,capables à la fois de refléter la dynamique de l’intrigue et de ré-engager émotionnellement les spectateurs dans la narration. Nous évaluons notre système en menant à une large échelle et dans des conditions réalistes une enquête auprès d’utilisateurs. / These past ten years, TV series became increasingly popular. In contrast to classicalTV series consisting of narratively self-sufficient episodes, modern TV seriesdevelop continuous plots over dozens of successive episodes. However, thenarrative continuity of modern TV series directly conflicts with the usual viewing conditions:due to modern viewing technologies, the new seasons of TV series are beingwatched over short periods of time. As a result, viewers are largely disengaged fromthe plot, both cognitively and emotionally, when about to watch new seasons. Sucha situation provides video summarization with remarkably realistic use-case scenarios,that we detail in Chapter 1. Furthermore, automatic movie summarization, longrestricted to trailer generation based on low-level features, finds with TV series a unprecedentedopportunity to address in well-defined conditions the so-called semanticgap: summarization of narrative media requires content-oriented approaches capableto bridge the gap between low-level features and human understanding. We review inChapter 2 the two main approaches adopted so far to address automatic movie summarization.Chapter 3 is dedicated to the various subtasks needed to build the intermediaryrepresentations on which our summarization framework relies: Section 3.2focuses on video segmentation, whereas the rest of Chapter 3 is dedicated to the extractionof different mid-level features, either saliency-oriented (shot size, backgroundmusic), or content-related (speakers). In Chapter 4, we make use of social network analysisas a possible way to model the plot of modern TV series: the narrative dynamicscan be properly captured by the evolution over time of the social network of interactingcharacters. Nonetheless, we have to address here the sequential nature of thenarrative when taking instantaneous views of the state of the relationships between thecharacters. We show that standard time-windowing approaches can not properly handlethis case, and we detail our own method for extracting dynamic social networksfrom narrative media. Chapter 5 is dedicated to the final generation and evaluation ofcharacter-oriented summaries, both able to reflect the plot dynamics and to emotionallyre-engage viewers into the narrative. We evaluate our framework by performing alarge-scale user study in realistic conditions.
|
189 |
Aplicação de conhecimento léxico-conceitual na sumarização multidocumento multilíngueTosta, Fabricio Elder da Silva 27 February 2014 (has links)
Made available in DSpace on 2016-06-02T20:25:23Z (GMT). No. of bitstreams: 1
6554.pdf: 2657931 bytes, checksum: 11403ad2acdeafd11148154c92757f20 (MD5)
Previous issue date: 2014-02-27 / Financiadora de Estudos e Projetos / Traditionally, Multilingual Multi-document Automatic Summarization (MMAS) is a computational application that, from a single collection of source-texts on the same subject/topic in at least two languages, produces an informative and generic summary (extract) in one of these languages. The simplest methods automatically translate the source-texts and, from a monolingual collection, apply content selection strategies based on shallow and/or deep linguistic knowledge. Therefore, the MMAS applications need to identify the main information of the collection, avoiding the redundancy, but also treating the problems caused by the machine translation (MT) of the full source-texts. Looking for alternatives to the traditional scenario of MMAS, we investigated two methods (Method 1 and 2) that once based on deep linguistic knowledge of lexical-conceptual level avoid the full MT of the sourcetexts, generating informative and cohesive/coherent summaries. In these methods, the content selection starts with the score and the ranking of the original sentences based on the frequency of occurrence of the concepts in the collection, expressed by their common names. In Method 1, only the most well-scored and non redundant sentences from the user s language are selected to compose the extract, until it reaches the compression rate. In Method 2, the original sentences which are better ranked and non redundant are selected to the summary without privileging the user s language; in cases which sentences that are not in the user s language are selected, they are automatically translated. In order to producing automatic summaries according to Methods 1 and 2 and their subsequent evaluation, the CM2News corpus was built. The corpus has 20 collections of news texts, 1 original text in English and 1 original text in Portuguese, both on the same topic. The common names of CM2News were identified through morphosyntactic annotation and then it was semiautomatically annotated with the concepts in Princeton WordNet through the Mulsen graphic editor, which was especially developed for the task. For the production of extracts according to Method 1, only the best ranked sentences in Portuguese were selected until the compression rate was reached. For the production of extracts according to Method 2, the best ranked sentences were selected, without privileging the language of the user. If English sentences were selected, they were automatically translated into Portuguese by the Bing translator. The Methods 1 and 2 were evaluated intrinsically considering the linguistic quality and informativeness of the summaries. To evaluate linguistic quality, 15 computational linguists analyzed manually the grammaticality, non-redundancy, referential clarity, focus and structure / coherence of the summaries and to evaluate the informativeness of the sumaries, they were automatically compared to reference sumaries by ROUGE measures. In both evaluations, the results have shown the better performance of Method 1, which might be explained by the fact that sentences were selected from a single source text. Furthermore, we highlight the best performance of both methods based on lexicalconceptual knowledge compared to simpler methods of MMAS, which adopted the full MT of the source-texts. Finally, it is noted that, besides the promising results on the application of lexical-conceptual knowledge, this work has generated important resources and tools for MMAS, such as the CM2News corpus and the Mulsen editor. / Tradicionalmente, a Sumarização Automática Multidocumento Multilíngue (SAMM) é uma aplicação que, a partir de uma coleção de textos sobre um mesmo assunto em ao menos duas línguas distintas, produz um sumário (extrato) informativo e genérico em uma das línguas-fonte. Os métodos mais simples realizam a tradução automática (TA) dos textos-fonte e, a partir de uma coleção monolíngue, aplicam estratégias superficiais e/ou profundas de seleção de conteúdo. Dessa forma, a SAMM precisa não só identificar a informação principal da coleção para compor o sumário, evitando-se a redundância, mas também lidar com os problemas causados pela TA integral dos textos-fonte. Buscando alternativas para esse cenário, investigaram-se dois métodos (Método 1 e 2) que, uma vez pautados em conhecimento profundo do tipo léxico-conceitual, evitam a TA integral dos textos-fonte, gerando sumários informativos e coesos/coerentes. Neles, a seleção do conteúdo tem início com a pontuação e o ranqueamento das sentenças originais em função da frequência de ocorrência na coleção dos conceitos expressos por seus nomes comuns. No Método 1, apenas as sentenças mais bem pontuadas na língua do usuário e não redundantes entre si são selecionadas para compor o sumário até que se atinja a taxa de compressão. No Método 2, as sentenças originais mais bem ranqueadas e não redundantes entre si são selecionadas para compor o sumário sem que se privilegie a língua do usuário; caso sentenças que não estejam na língua do usuário sejam selecionadas, estas são automaticamente traduzidas. Para a produção dos sumários automáticos segundo os Métodos 1 e 2 e subsequente avaliação dos mesmos, construiu-se o corpus CM2News, que possui 20 coleções de notícias jornalísticas, cada uma delas composta por 1 texto original em inglês e 1 texto original em português sobre um mesmo assunto. Os nomes comuns do CM2News foram identificados via anotação morfossintática e anotados com os conceitos da WordNet de Princeton de forma semiautomática, ou seja, por meio do editor gráfico MulSen desenvolvido para a tarefa. Para a produção dos sumários segundo o Método 1, somente as sentenças em português mais bem pontuadas foram selecionadas até que se atingisse determinada taxa de compressão. Para a produção dos sumários segundo o Método 2, as sentenças mais pontuadas foram selecionadas sem privilegiar a língua do usuário. Caso as sentenças selecionadas estivessem em inglês, estas foram automaticamente traduzidas para o português pelo tradutor Bing. Os Métodos 1 e 2 foram avaliados de forma intrínseca, considerando-se a qualidade linguística e a informatividade dos sumários. Para avaliar a qualidade linguística, 15 linguistas computacionais analisaram manualmente a gramaticalidade, a não-redundância, a clareza referencial, o foco e a estrutura/coerência dos sumários e, para avaliar a informatividade, os sumários foram automaticamente comparados a sumários de referência pelo pacote de medidas ROUGE. Em ambas as avaliações, os resultados evidenciam o melhor desempenho do Método 1, o que pode ser justificado pelo fato de que as sentenças selecionadas são provenientes de um mesmo texto-fonte. Além disso, ressalta-se o melhor desempenho dos dois métodos baseados em conhecimento léxico-conceitual frente aos métodos mais simples de SAMM, os quais realizam a TA integral dos textos-fonte. Por fim, salienta-se que, além dos resultados promissores sobre a aplicação de conhecimento léxico-conceitual, este trabalho gerou recursos e ferramentas importantes para a SAMM, como o corpus CM2News e o editor MulSen.
|
190 |
Aide à l'analyse de traces d'exécution dans le contexte des microcontrôleurs 32 bits / Assit to execution trace analysis in the microcontrollers 32 bits contextAmiar, Azzeddine 27 November 2013 (has links)
Souvent, dû à l'aspect cyclique des programmes embarqués, les traces de microcontrôleurs contiennent beaucoup de données. De plus, dans notre contexte de travail, pour l'analyse du comportement, une seule trace se terminant sur une défaillance est disponible. L'objectif du travail présenté dans cette thèse est d'aider à l'analyse de trace de microcontrôleurs. La première contribution de cette thèse concerne l'identification de cycles, ainsi que la génération d'une description pertinente de la trace. La détection de cycles repose sur l'identification du loop-header. La description proposée à l'ingénieur est produite en utilisant la compression basée sur la génération d'une grammaire. Cette dernière permet la détection de répétitions dans la trace. La seconde contribution concerne la localisation de faute(s). Elle est basée sur l'analogie entre les exécutions du programme et les cycles. Ainsi, pour aider dans l'analyse de la trace, nous avons adapté des techniques de localisation de faute(s) basée sur l'utilisation de spectres. Nous avons aussi défini un processus de filtrage permettant de réduire le nombre de cycles à utiliser pour la localisation de faute(s). Notre troisième contribution concerne l'aide à l'analyse des cas où les multiples cycles d'une même exécution interagissent entre eux. Ainsi, pour faire de la localisation de faute(s) pour ce type de cas, nous nous intéressons à la recherche de règles d'association. Le groupement des cycles en deux ensembles (cycles suspects et cycles corrects) pour la recherche de règles d'association, permet de définir les comportements jugés correctes et ceux jugés comme suspects. Ainsi, pour la localisation de faute(s), nous proposons à l'ingénieur un diagnostic basé sur l'analyse des règles d'association selon leurs degrés de suspicion. Cette thèse présente également les évaluations menées, permettant de mesurer l'efficacité de chacune des contributions discutées, et notre outil CoMET. Les résultats de ces évaluations montrent l'efficacité de notre travail d'aide à l'analyse de traces de microcontrôleurs. / The microcontroller traces contain a huge amount of information. This is mainly due to the cyclic aspect of embedded programs. In addition, in our context, a single trace that ends at the failure is used to analyze the behavior of the microcontroller . The work presented in this thesis aims to assit in analysis of microcontroller traces. The first contribution of this thesis concerns the identification of cycles and the generation of a relevant description of the trace. The detection of cycles is based on the identification of the loop-header. The description of the trace is generated using Grammar-Based Compression, which allows the detection of repetitions in the trace. The second contribution concerns the fault localization. Our approach is based on the analogy between executions and cycles. Thus, this contribution is an adaptation of some spectrum-based fault localization techniques. This second contribution also defines a filtering process, which aims to reduce the number of cycles used by the fault localization. The third contribution considers that the multiple cycles of a same execution can interact together. Our fault localization for this type of cases is based on the use of association rules. Grouping cycles in two sets (suspect cycles and correct cycles), and searching for association rules using those two sets, helps to define the behaviors considered as corrects and those considered as suspects. This thesis presents the experimental evaluations concerning our contributions, and our tool CoMET.
|
Page generated in 0.0601 seconds