41 |
Minerador WEB: um estudo sobre mecanismos de descoberta de informações na WEB. / Minerador WEB: a study on mechanisms of discovery of information in the WEB. Toscano, Wagner, 10 July 2003 (has links)
The Web (WWW - World Wide Web) holds a large quantity and variety of information, which makes it very attractive for people looking for some desired piece of information. On the other hand, this large amount of information raises the fundamental problem of how to discover, in an effective way, whether the desired information is present on the Web and how to reach it. A body of information that cannot be accessed easily, or whose access lacks effective search tools, becomes unusable. Adding to the difficulties of the search process, the lack of structure of the information on the Web hinders the application of systematic procedures to search for information. This work presents a study of alternative information-search techniques, applying several concepts related to information retrieval and knowledge representation. More specifically, the objectives are to analyse the efficiency obtained by using complementary information-search techniques, in particular mechanisms that extract information from explicit excerpts of HTML documents and the use of the Naive Bayes method to classify sites, and to analyse the effectiveness of a process that stores information extracted from the Web in a knowledge base (described in first-order logic) which, combined with background knowledge, allows answering queries more complex than those possible with expressions based on keywords and logical connectives. / The World Wide Web (Web) has a huge amount and a large diversity of information, which strongly attracts people to navigate the Web in search of a desired piece of information. On the other hand, due to this huge amount of data, we face the fundamental problems of how to discover and how to reach the desired information in an efficient way. If there are no efficient mechanisms to find information, the use of the Web as a useful source of information becomes very restricted. Another important problem to overcome is the lack of a regular structure in the information on the Web, which makes it difficult to apply the usual information-search methods. This work presents a study of alternative techniques for information search, applying several concepts of information retrieval and knowledge representation. A primary goal is to analyse the efficiency of information-retrieval methods that use the analysis of extensional information and probabilistic methods such as Naive Bayes to classify sites among a set of pre-defined classes. Another goal is to design a logic-based knowledge base that enables a user to pose queries more complex than those based simply on expressions of keywords and logical connectives.
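As an illustration of the site-classification step mentioned in the abstract, the following is a minimal sketch of multinomial Naive Bayes over bag-of-words features; the site classes, training snippets, and tokenizer are hypothetical and not taken from the thesis.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase and split on non-letters; a crude stand-in for real HTML text extraction.
    return re.findall(r"[a-z]+", text.lower())

def train_naive_bayes(labeled_docs):
    """Estimate per-class priors and word counts for Laplace-smoothed likelihoods."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for w in tokenize(text):
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def classify(text, class_counts, word_counts, vocab):
    """Return the class with the highest log posterior probability."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for w in tokenize(text):
            # Laplace smoothing avoids zero probabilities for unseen words.
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

if __name__ == "__main__":
    # Hypothetical training pages for two site classes.
    training = [
        ("buy laptops phones online store checkout cart", "e-commerce"),
        ("breaking news politics economy report today", "news"),
        ("special offers discount shipping order now", "e-commerce"),
        ("journalists publish articles headline coverage", "news"),
    ]
    model = train_naive_bayes(training)
    print(classify("discount phones order today", *model))  # expected: e-commerce
```

In a real setting the training texts would come from labelled pages and the tokenizer would operate on text extracted from HTML, but the smoothed log-probability scoring shown here is the core of the method.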
|
43 |
WebKnox: Web Knowledge Extraction. Urbansky, David, 26 January 2009 (has links)
This thesis focuses on entity and fact extraction from the web. Different knowledge representations and techniques for information extraction are discussed before the design of a knowledge extraction system, called WebKnox, is introduced. The main contributions of this thesis are the trust ranking of extracted facts with a self-supervised learning loop and the extraction system itself, with its composition of known and refined extraction algorithms. The techniques used show an improvement in precision and recall for most entity and fact extraction tasks compared to the chosen baseline approaches.
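The "trust ranking of extracted facts with a self-supervised learning loop" can be pictured as a mutual-reinforcement iteration between sources and facts. The sketch below only illustrates that general idea; it is not WebKnox's actual scoring formula, and the example extractions and source names are invented.

```python
# Facts are (entity, attribute, value) triples, each reported by one or more sources.
# Source trust and fact confidence reinforce each other over a few iterations.
extractions = {
    ("jupiter", "number_of_moons", "95"): ["site_a", "site_b"],
    ("jupiter", "number_of_moons", "63"): ["site_c"],
    ("mars", "number_of_moons", "2"):     ["site_a", "site_c"],
}

sources = {s for srcs in extractions.values() for s in srcs}
trust = {s: 1.0 for s in sources}  # every source starts equally trusted

confidence = {}
for _ in range(10):
    # A fact gains confidence when several trusted sources report it.
    raw = {fact: sum(trust[s] for s in srcs) for fact, srcs in extractions.items()}
    top = max(raw.values())
    confidence = {fact: score / top for fact, score in raw.items()}
    # A source gains trust when the facts it reports are confirmed elsewhere.
    for s in sources:
        reported = [confidence[f] for f, srcs in extractions.items() if s in srcs]
        trust[s] = sum(reported) / len(reported)

for fact in sorted(confidence, key=confidence.get, reverse=True):
    print(round(confidence[fact], 3), fact)
```

Facts corroborated by several sources end up with higher confidence, and sources that tend to report corroborated facts end up with higher trust, without any manually labelled training data.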
|
44 |
Entity extraction, animal disease-related event recognition and classification from web. Volkova, Svitlana, January 1900 (has links)
Master of Science / Department of Computing and Information Sciences / William H. Hsu / Global epidemic surveillance is an essential task for national biosecurity management
and bioterrorism prevention. The main goal is to protect the public from major health
threats. To perform this task effectively, one requires reliable, timely and accurate medical
information from a wide range of sources. Towards this goal, we present a framework for epidemiological
analytics that can be used to extract and visualize infectious disease outbreaks
from a variety of unstructured web sources automatically. More precisely, in this thesis, we
consider several research tasks including document relevance classification, entity extraction
and animal disease-related event recognition in the veterinary epidemiology domain. First,
we crawl web sources and classify collected documents by topical relevance using supervised
learning algorithms. Next, we propose a novel approach for automated ontology construction
in the veterinary medicine domain. Our approach is based on semantic relationship
discovery using syntactic patterns. We then apply our automatically-constructed ontology
for the domain-specific entity extraction task. Moreover, we compare our ontology-based
entity extraction results with an alternative sequence labeling approach. We introduce a
sequence labeling method for entity tagging that relies on syntactic feature extraction
using a sliding window. Finally, we present our novel sentence-based event recognition
approach that includes three main steps: entity extraction of animal diseases, species, locations,
dates and the confirmation status n-grams; event-related sentence classification into
two categories - suspected or confirmed; automated event tuple generation and aggregation.
We show that our document relevance classification results as well as entity extraction
and disease-related event recognition results are significantly better compared to the results
reported by other animal disease surveillance systems.
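The sliding-window feature extraction mentioned above can be sketched as follows; the window size, feature set, and toy sentence are assumptions for illustration, not the exact configuration used in the thesis.

```python
def window_features(tokens, i, size=2):
    """Features for token i drawn from a window of `size` tokens on each side."""
    feats = {
        "word": tokens[i].lower(),
        "is_capitalized": tokens[i][0].isupper(),
        "suffix3": tokens[i][-3:].lower(),
    }
    for offset in range(-size, size + 1):
        if offset == 0:
            continue
        j = i + offset
        # Pad positions outside the sentence so every token gets the same feature keys.
        feats[f"word@{offset:+d}"] = tokens[j].lower() if 0 <= j < len(tokens) else "<pad>"
    return feats

if __name__ == "__main__":
    sentence = "Avian influenza confirmed in Germany on Tuesday".split()
    for i, tok in enumerate(sentence):
        print(tok, window_features(sentence, i))
```

Feature dictionaries of this kind would then be fed, token by token, to a sequence-labeling classifier that assigns entity tags such as disease, species, or location.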
|
45 |
Análise de métodos para programação de contextualização. / Analysis of methods for programming of page context classification. Marangon, Sílvio Luís, 26 October 2006 (has links)
Locating relevant pages on the Internet for activities such as news clipping, detection of trademark misuse, or anti-phishing services becomes increasingly complex due to several factors, such as the ever-growing number of pages on the Web and the large number of irrelevant pages returned by search engines. In many cases the traditional techniques used by Internet search engines, that is, locating terms in pages and ranking them by relevance, are not enough to solve the problem of finding specific pages for activities such as those mentioned above. Page contextualization, that is, the classification of pages according to a context defined by the user based on the needs of a specific activity, should allow a more efficient search for pages on the Internet. This work studies the use of Web mining methods to compose page-contextualization methods that make it possible to define more sophisticated contexts, such as a page's subject or some form of relationship. Page contextualization should help solve several problems in the search for pages on the Internet through the composition of methods that locate pages by a set of their features, unlike traditional search engines, which only locate pages containing one or more specified terms. / Internet services such as news clipping, anti-phishing, and anti-plagiarism services, and others that require intensive searching on the Internet, face a difficult task because of the huge number of existing pages. Search engines try to address this problem, but their methods retrieve many irrelevant pages, sometimes thousands of them, and more powerful methods are needed. Page content, subject, hyperlinks, or location can be used to define a page's context and to create a more powerful method that retrieves more relevant pages, improving precision. Classification of page context is defined as the classification of a page by a set of its features. This work presents a study of Web mining, search engines, and the application of Web-mining techniques to classify page context. Page context classification applied to search engines should solve the problem of the flood of irrelevant pages by allowing search engines to retrieve only pages belonging to a given context.
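To make the idea of classifying a page by a set of its features concrete, here is a minimal sketch that scores a page against a user-defined context using terms from its URL, title, and body and the hosts it links to; the context definition, weights, threshold, and example page are hypothetical and not taken from the dissertation.

```python
# A "context" is defined here as weighted evidence over several page features,
# rather than a single keyword match as in a traditional search engine query.
context = {
    "url_terms":   ({"news", "noticias"}, 1.0),
    "title_terms": ({"election", "government"}, 2.0),
    "body_terms":  ({"minister", "parliament", "vote"}, 1.0),
    "link_hosts":  ({"reuters.com", "apnews.com"}, 1.5),
}

def score_page(page, context, threshold=3.0):
    """Sum the weight of every feature group in which the page matches the context."""
    score = 0.0
    for feature, (terms, weight) in context.items():
        if terms & set(page.get(feature, [])):
            score += weight
    return score, score >= threshold

example_page = {
    "url_terms":   ["example", "news", "2006"],
    "title_terms": ["election", "results"],
    "body_terms":  ["vote", "count", "minister"],
    "link_hosts":  ["reuters.com", "example.org"],
}

print(score_page(example_page, context))  # (5.5, True): the page belongs to the context
```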
|
46 |
Web Usage Mining - popis, metody a nástroje, možné aplikace, konkrétní řešení / Web Usage Mining - description, methods and tools, possible applications, concrete solution. Dvořák, Jan, Bc., January 2007 (has links)
A general description of Web Mining. Characteristics and use of Web Usage Mining techniques. A detailed description of the methods and tools covered by the term "Web Usage Mining". Software tools and existing solutions for Web Usage Mining. A practical design of a concrete solution using the described Web Usage Mining methods: an analysis of the log files of the web server of the Faculty of Management of the University of Economics, Prague.
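A first practical step in this kind of log-file analysis is parsing access-log lines and grouping requests into per-visitor sessions. The snippet below is a generic sketch over the Common Log Format with an assumed 30-minute session timeout; it is not the actual procedure used in the thesis.

```python
import re
from datetime import datetime, timedelta

LOG_LINE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(?:GET|POST) (\S+) [^"]*" (\d{3})')
TIMEOUT = timedelta(minutes=30)

def parse(line):
    """Extract host, timestamp, requested path, and status code from one log line."""
    m = LOG_LINE.match(line)
    if not m:
        return None
    host, ts, path, status = m.groups()
    when = datetime.strptime(ts.split()[0], "%d/%b/%Y:%H:%M:%S")
    return host, when, path, int(status)

def sessionize(records):
    """Group successful requests by host and split each host's requests on 30-minute gaps."""
    by_host = {}
    for host, when, path, status in sorted(records, key=lambda r: r[1]):
        if status != 200:
            continue
        sessions = by_host.setdefault(host, [])
        if not sessions or when - sessions[-1][-1][0] > TIMEOUT:
            sessions.append([])
        sessions[-1].append((when, path))
    return by_host

if __name__ == "__main__":
    lines = [
        '10.0.0.1 - - [12/Mar/2007:10:00:00 +0100] "GET /index.html HTTP/1.0" 200 512',
        '10.0.0.1 - - [12/Mar/2007:10:05:00 +0100] "GET /courses.html HTTP/1.0" 200 400',
        '10.0.0.1 - - [12/Mar/2007:11:30:00 +0100] "GET /index.html HTTP/1.0" 200 512',
    ]
    records = [r for r in (parse(l) for l in lines) if r]
    for host, sessions in sessionize(records).items():
        print(host, [[p for _, p in s] for s in sessions])
```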
|
48 |
Using event sequence alignment to automatically segment web users for prediction and recommendation / Alignement de séquences d'évènements pour la segmentation automatique d'internautes, la prédiction et la recommandation. Luu, Vinh Trung, 16 December 2016 (has links)
Website operators collect a large amount of data every day about the visitors who use their services. The purpose of collecting these data is to better understand usage and to gain knowledge about visitor behavior. Based on this knowledge, site operators can decide to modify their site or to offer visitors personalized content. However, the volume of data collected, together with the complexity of representing the interactions between a visitor and a website, requires the development of new data mining tools. In this thesis, we explored the use of sequence alignment methods for extracting knowledge about website usage (web mining). These methods are the basis for automatically grouping web users into segments, which makes it possible to discover groups with similar behavior. In addition, we also studied how these groups could be used for page prediction and recommendation. These topics are particularly important given the very rapid development of online commerce, which produces a large volume of data (big data) that is impossible to process manually. / This thesis explored the application of sequence alignment in web usage mining, including user clustering and web prediction and recommendation. This topic was chosen because online business has developed rapidly and gathered a huge volume of information, while the use of sequence alignment in the field is still limited. In this context, researchers are required to build models that rely on sequence alignment methods and to empirically assess their relevance in mining user behavior. This thesis presents a novel methodological point of view in the area and shows applicable approaches in our quest to improve previous related work. Web usage behavior analysis has been central to a large number of investigations aimed at maintaining the relation between users and web services. Extracting useful information has been addressed by web content providers to understand users' needs, so that their content can be adapted accordingly. One of the promising approaches to this end is pattern discovery using clustering, which groups users who show similar behavioral characteristics. Our research goal is to perform user clustering, in real time, based on session similarity.
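As an illustration of how sequence alignment can quantify session similarity before clustering, the following computes a global-alignment (Needleman-Wunsch-style) score between page-visit sequences; the match, mismatch, and gap values and the example sessions are assumptions, not the thesis's actual parameters.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score between two sequences of page ids."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, cols):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[-1][-1]

def similarity(a, b):
    """Scale the alignment score into [0, 1] by the length of the longer session."""
    longest = max(len(a), len(b))
    return max(0.0, alignment_score(a, b) / longest) if longest else 1.0

# Hypothetical page-visit sessions for three users.
sessions = {
    "u1": ["home", "shoes", "cart", "checkout"],
    "u2": ["home", "shoes", "shoes_red", "cart", "checkout"],
    "u3": ["home", "blog", "about"],
}

for x in sessions:
    for y in sessions:
        if x < y:
            print(x, y, round(similarity(sessions[x], sessions[y]), 2))
```

The resulting pairwise similarities could then feed any standard clustering algorithm to form the user segments discussed above.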
|
49 |
Goal Attainment On Long Tail Web Sites: An Information Foraging Approach. Mccart, James A., 13 October 2009 (has links)
This dissertation sought to explain goal achievement at limited traffic “long tail” Web sites using
Information Foraging Theory (IFT). The central thesis of IFT is that individuals are driven by a
metaphorical sense of smell that guides them through patches of information in their environment.
An information patch is an area of the search environment with similar information. Information
scent is the driving force behind why a person makes a navigational selection amongst a group
of competing options. As foragers are assumed to be rational, scent is a mechanism for reducing
search costs by increasing the accuracy of judging which option leads to the information of value.
IFT was originally developed to be used in a “production rule” environment, where a user would
perform an action when the conditions of a rule were met. However, the use of IFT in clickstream
research required conceptualizing the ideas of information scent and patches in a non-production
rule environment. To meet such an end this dissertation asked three research questions regarding
(1) how to learn information patches, (2) how to learn trails of scent, and finally (3) how to combine
both concepts to create a Clickstream Model of Information Foraging (CMIF).
The learning of patches and trails was accomplished by using contrast sets, which distinguished
between individuals who did or did not achieve a goal. User-centric and site-centric versions of the CMIF,
which extended and operationalized IFT, presented and evaluated hypotheses. The user-centric
version had four hypotheses and examined product purchasing behavior from panel data, whereas
the site-centric version had nine hypotheses and predicted contact form submission using data
from a Web hosting company.
In general, the results show that patches and trails exist on several Web sites, and the majority
of hypotheses were supported in each version of the CMIF. This dissertation contributed to the literature
by providing a theoretically-grounded model which tested and extended IFT; introducing
a methodology for learning patches and trails; detailing a methodology for preprocessing clickstream
data for long tail Web sites; and focusing on traditionally under-studied long tail Web sites.
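A hedged sketch of the contrast-set idea used for learning patches and trails: compare how often a navigation pattern occurs in sessions that reached the goal versus those that did not, and keep the patterns whose support difference is large. The example sessions, page categories, and threshold below are hypothetical.

```python
from collections import Counter

# Each session is the ordered list of page categories a visitor saw;
# the label says whether the visitor reached the goal (e.g., submitted a contact form).
sessions = [
    (["home", "pricing", "contact"], True),
    (["home", "features", "pricing", "contact"], True),
    (["home", "blog", "blog"], False),
    (["home", "features", "blog"], False),
    (["home", "pricing", "features", "contact"], True),
]

def supports(seqs, patterns):
    """Fraction of sessions in which each candidate pattern (a page bigram) occurs."""
    counts = Counter()
    for seq in seqs:
        bigrams = set(zip(seq, seq[1:]))
        for p in patterns:
            if p in bigrams:
                counts[p] += 1
    return {p: counts[p] / len(seqs) for p in patterns}

goal = [s for s, ok in sessions if ok]
no_goal = [s for s, ok in sessions if not ok]
candidates = {bg for s, _ in sessions for bg in zip(s, s[1:])}

pos, neg = supports(goal, candidates), supports(no_goal, candidates)
for p in sorted(candidates, key=lambda p: pos[p] - neg[p], reverse=True):
    if abs(pos[p] - neg[p]) >= 0.4:  # keep only clearly contrasting patterns
        print(p, "goal:", round(pos[p], 2), "no goal:", round(neg[p], 2))
```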
|
50 |
Finding Finding Aids on the World Wide Web. Tibbo, Helen R.; Meho, Lokman I., January 2001 (has links)
Reports results of a study to explore how well six popular Web search engines performed in retrieving specific electronic finding aids mounted on the World Wide Web. A random sample of online finding aids was selected and then searched using AltaVista, Excite, Fast Search, Google, Hotbot and Northern Light, employing both word and phrase searching. As of February 2000, approximately 8 percent of repositories listed at the 'Repositories of Primary Resources' Web site had mounted at least four full finding aids on the Web. The most striking finding of this study was the importance of using phrase searches whenever possible, rather than word searches. Also of significance was the fact that if a finding aid were to be found using any search engine, it was generally found in the first ten or twenty items at most. The study identifies the best performers among the six chosen search engines. Combinations of search engines often produced much better results than did the search engines individually, evidence that there may be little overlap among the top hits provided by individual engines.
|