  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
61

Predicting Day-Zero Review Ratings: A Social Web Mining Approach

John, Zubin R. January 2015 (has links)
No description available.
62

Intelligent Event Focused Crawling

Farag, Mohamed Magdy Gharib 23 September 2016 (has links)
There is a need for an integrated event focused crawling system to collect Web data about key events. When an event occurs, many users try to locate the most up-to-date information about it. Yet, there is little systematic collecting and archiving anywhere of information about events. We propose intelligent event focused crawling for automatic event tracking and archiving, as well as effective access. We extend traditional focused (topical) crawling techniques in two directions: modeling and representing events, and modeling and representing webpage source importance. We developed an event model that can capture key event information (topical, spatial, and temporal) and incorporated it into the focused crawler algorithm. For the focused crawler to leverage the event model in predicting a webpage's relevance, we developed a function that measures the similarity between two event representations based on textual content. Although textual content provides a rich set of features, we proposed an additional source of evidence that allows the focused crawler to better estimate the importance of a webpage by considering its website. We estimated webpage source importance by the ratio of relevant to non-relevant webpages found while crawling a website, and combined the textual content information and source importance into a single relevance score. For the focused crawler to work well, it needs a diverse set of high-quality seed URLs (URLs of relevant webpages that link to other relevant webpages). Although manual curation of seed URLs guarantees quality, it requires exhaustive manual labor. We therefore proposed an automated approach for curating seed URLs, leveraging the richness of social media content about events to extract URLs that can serve as seeds for further focused crawling.
We evaluated our system through four series of experiments, using recent events: Orlando shooting, Ecuador earthquake, Panama papers, California shooting, Brussels attack, Paris attack, and Oregon shooting. In the first experiment series our proposed event model representation, used to predict webpage relevance, outperformed the topic-only approach, showing better results in precision, recall, and F1-score. In the second series, using harvest ratio to measure ability to collect relevant webpages, our event model-based focused crawler outperformed the state-of-the-art focused crawler (best-first search). The third series evaluated the effectiveness of our proposed webpage source importance for collecting more relevant webpages. The focused crawler with webpage source importance managed to collect roughly the same number of relevant webpages as the focused crawler without webpage source importance, but from a smaller set of sources. The fourth series provides guidance to archivists regarding the effectiveness of curating seed URLs from social media content (tweets) using different methods of selection. / Ph. D.
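The combination of event-model textual similarity and website source importance described in this abstract can be sketched roughly as follows. This is an illustrative pure-Python sketch, not the dissertation's actual implementation: the tokenizer, the neutral prior for unseen hosts, and the weight `alpha` are all assumptions.

```python
from collections import Counter
from math import sqrt
from urllib.parse import urlparse

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def event_similarity(event_model, page_text):
    """Textual similarity between an event model (topical, spatial,
    temporal keywords) and a candidate webpage's text."""
    model_bag = Counter(event_model["topical"] + event_model["spatial"]
                        + event_model["temporal"])
    page_bag = Counter(page_text.lower().split())
    return cosine(model_bag, page_bag)

def source_importance(url, stats):
    """Ratio of relevant to total pages seen so far on the page's website.
    `stats` maps hostname -> (relevant_count, non_relevant_count)."""
    rel, nonrel = stats.get(urlparse(url).netloc, (0, 0))
    total = rel + nonrel
    return rel / total if total else 0.5  # neutral prior for unseen hosts

def relevance_score(event_model, page_text, url, stats, alpha=0.7):
    """Single relevance score combining content similarity and source importance."""
    return (alpha * event_similarity(event_model, page_text)
            + (1 - alpha) * source_importance(url, stats))
```

In a crawler loop, URLs would be enqueued in a priority queue ordered by this score, and `stats` updated as each fetched page is judged relevant or not.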
63

Recherche d’entités nommées complexes sur le web : propositions pour l’extraction et pour le calcul de similarité / Retrieval of Complex Named Entities on the web : proposals for extraction and similarity computation

Fotsoh Tawaofaing, Armel 27 February 2018 (has links)
Recent developments in information and communication technologies have made the Web a rich source of information. However, Web pages are largely unstructured, so it is difficult for a machine to process them automatically and extract information relevant to a given task. Research on Information Extraction (IE) from Web pages is therefore growing quickly. Querying the extracted information, generally structured and stored in indexes to answer precise information needs, corresponds to Information Retrieval (IR). Our work lies at the crossroads of these two fields. Its main goal is to design and implement strategies for crawling the Web to extract complex Named Entities (NEs), i.e., NEs composed of several properties that may be text or other NEs, such as companies or events. We then propose indexing and querying services to answer information needs. This work was carried out within the T2I team of the LIUPPA laboratory, in response to a request from Cogniteev, a company whose core business is the analysis of Web content. The issues addressed are, on the one hand, the extraction of complex NEs from the Web and, on the other hand, indexing and information retrieval over these complex NEs.
Our first contribution concerns the extraction of complex NEs from text. Here we tackle several problems, in particular the noisy context characterizing some properties (the Web page describing an event, for example, may contain more than one date: the date of the event and the opening date of ticket sales). For this problem we introduce a block-detection module that focuses property extraction on relevant text blocks; our experiments show a clear performance improvement due to this approach. We also address the extraction of postal addresses, where the main difficulty is that no standard has really established itself as a reference model; we therefore propose an extended address model and a pattern-based extraction approach built on freely available resources. Our second contribution deals with similarity computation between complex NEs. In the state of the art, this computation is generally performed in two steps: (i) similarities between properties are computed, and (ii) the obtained scores are aggregated into an overall similarity. For the first step, we propose a similarity function between spatial NEs where one is represented by a point and the other by a polygon, complementing the state of the art. Our main proposals concern the second step: we introduce three techniques for aggregating the intermediate scores, the first two based on a weighted sum of property scores (linear combination and logistic regression) and the third using decision trees. Finally, we propose a last approach, based on clustering and Salton's vector space model, whose originality is that it computes similarity between complex NEs without requiring intermediate property similarity scores.
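The two-step similarity scheme described above can be sketched in a few lines. This is an illustrative sketch only: the property set, the toy similarity functions, and the weights are invented for the example, not taken from the thesis.

```python
def property_similarities(ne_a, ne_b, sim_funcs):
    """Step (i): per-property similarity scores between two complex NEs."""
    return {p: f(ne_a.get(p), ne_b.get(p)) for p, f in sim_funcs.items()}

def linear_aggregate(scores, weights):
    """Step (ii), variant 1: weighted linear combination of property scores."""
    total = sum(weights.values())
    return sum(weights[p] * scores[p] for p in scores) / total

# Toy property similarity functions (assumptions, not the thesis's own).
def name_sim(a, b):
    """Jaccard overlap of name tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def date_sim(a, b):
    """Exact-match similarity for dates."""
    return 1.0 if a == b else 0.0

sim_funcs = {"name": name_sim, "date": date_sim}
weights = {"name": 0.7, "date": 0.3}

e1 = {"name": "jazz festival pau", "date": "2018-02-27"}
e2 = {"name": "pau jazz festival", "date": "2018-02-27"}
overall = linear_aggregate(property_similarities(e1, e2, sim_funcs), weights)
```

The logistic-regression and decision-tree variants mentioned in the abstract would replace `linear_aggregate` with a model trained on labeled match/non-match pairs.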
64

Abordagem simbólica de aprendizado de máquina na recuperação automática de artigos científicos a partir de web / A symbolic machine learning approach to the automatic retrieval of scientific articles from the Web

Brasil, Christiane Regina Soares 07 April 2006 (has links)
Due to the ever-increasing number of scientific documents available on the World Wide Web, search tools have become an important aid for information retrieval from the Internet for researchers and users in all fields of knowledge. However, currently available search tools generally return a huge list of pages, leaving users with the final task of choosing those that are actually relevant to their research. It is therefore important to develop techniques and tools that not only return a list of documents related to the user's query, but also organize that information according to the content of the documents and present the result in a meaningful graphical representation that aids exploration and overall understanding of the retrieved documents. In this context, the project of an Intelligent Tool for Research Supporting (FIP) was proposed, of which this MSc work is part. The objective of this work is to analyze strategies for automatic retrieval of scientific articles in a specific field from the Web, suitable for adoption by the retrieval module of the FIP. Articles written in English, in PDF format, covering the fields of Computer Science were considered. Training and test corpora were used to evaluate symbolic Machine Learning approaches for inducing rules that could be embedded in an intelligent crawler for automatic retrieval of articles in those fields. Several experiments were carried out to define domain-appropriate preprocessing parameters (such as attribute weights, cut-off points, and stopwords), as well as the best strategy for applying the induced rules and the best symbolic induction algorithm.
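Induced symbolic rules of the kind evaluated in this work can be applied by a crawler roughly as follows. The rule set shown is invented for illustration; real induced rules would come from a symbolic learner trained on the corpora.

```python
# Each hypothetical induced rule is a set of required terms plus a class label.
RULES = [
    ({"abstract", "references"}, "scientific-article"),
    ({"introduction", "bibliography"}, "scientific-article"),
]

def classify(text, rules=RULES, default="other"):
    """Apply induced rules in order; the first rule whose terms all
    occur in the document fires and assigns its label."""
    tokens = set(text.lower().split())
    for terms, label in rules:
        if terms <= tokens:
            return label
    return default
```

A crawler would run `classify` on the extracted text of each fetched PDF and keep only documents labeled as scientific articles.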
66

Word Space Models for Web User Clustering and Page Prefetching

Sundin, Albin January 2012 (has links)
This study evaluates methods for clustering web users via vector space models, for the purpose of web page prefetching, with possible applications in server optimization. An experiment using Latent Semantic Analysis (LSA) is deployed to investigate whether LSA can reproduce the encouraging results obtained in previous research with Random Indexing (RI) and a chaos-based optimization algorithm (CAS-C). This is motivated not only by LSA being yet another vector space model, but also by a study indicating that LSA outperforms RI in a task similar to the web user clustering and prefetching task. The prefetching task was used to verify the applicability of LSA, where both RI and CAS-C have shown promising results. The original data set from the RI web user clustering and prefetching task was modeled using weighted (tf-idf) LSA. Clusters were defined using a common clustering algorithm (k-means). The least scattered cluster configuration for the model was identified by combining an internal validity measure (SSE) and a relative criterion validity measure (SD index). The assumed optimal cluster configuration was used for the web page prefetching task.

Precision and recall of the LSA-based method are found to be on par with RI and CAS-C, insofar as it solves the web user clustering and prefetching task with characteristics similar to unweighted RI. The hypothesized inherent gains in precision and recall from using LSA were neither confirmed nor conclusively disproved. The effects of different weighting functions for RI are discussed, and a number of methodological factors are identified for further research concerning LSA-based clustering and prefetching.
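The pipeline evaluated above (tf-idf weighting, LSA by truncated SVD, then k-means clustering) can be sketched with numpy. This is a minimal illustration under stated assumptions: the toy users-by-pages visit matrix, the choice of two latent dimensions, and the deterministic farthest-point initialization are all invented for the example.

```python
import numpy as np

def tfidf(counts):
    """tf-idf weighting of a users-by-pages count matrix."""
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    df = (counts > 0).sum(axis=0)
    idf = np.log(counts.shape[0] / np.maximum(df, 1)) + 1.0
    return tf * idf

def lsa(matrix, k):
    """Project rows onto the top-k latent dimensions via truncated SVD."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

def kmeans(points, k, iters=50):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        d = np.min([((points - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(points[int(np.argmax(d))])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((points[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return labels

# Toy data: 4 users x 5 pages, with two obvious interest groups.
visits = np.array([[5, 4, 0, 0, 0],
                   [4, 5, 1, 0, 0],
                   [0, 0, 0, 5, 4],
                   [0, 1, 0, 4, 5]], dtype=float)
labels = kmeans(lsa(tfidf(visits), 2), 2)
```

In the study's setting, each cluster's most-visited pages would then be the prefetching candidates for users assigned to that cluster.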
67

Plataforma para la Extracción y Almacenamiento del Conocimiento Extraído de los Web Data / A platform for the extraction and storage of knowledge mined from Web data

Rebolledo Lorca, Víctor January 2008 (has links)
No description available.
68

Towards a Hybrid Imputation Approach Using Web Tables

Lehner, Wolfgang, Ahmadov, Ahmad, Thiele, Maik, Eberius, Julian, Wrembel, Robert 12 January 2023 (has links)
Data completeness is one of the most important data quality dimensions and an essential premise in data analytics. With emerging Big Data trends such as the data lake concept, which provides a low-cost data preparation repository instead of moving curated data into a data warehouse, the problem of data completeness is further exacerbated. While traditionally the process of filling in missing values is addressed by the data imputation community using statistical techniques, we complement these approaches by using external data sources from the data lake or even the Web to look up missing values. In this paper we propose a novel hybrid data imputation strategy that takes into account the characteristics of an incomplete dataset and, based on these, chooses the best imputation approach, i.e., a statistical approach such as regression analysis, a Web-based lookup, or a combination of both. We formalize and implement both imputation approaches, including a Web table retrieval and matching system, and evaluate them extensively using a corpus of 125M Web tables. We show that applying statistical techniques in conjunction with external data sources leads to an imputation system that is robust, accurate, and has high coverage at the same time.
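The hybrid choice described above (statistical imputation when the dataset supports it, external lookup otherwise) might be sketched like this. The decision rule (a minimum number of complete pairs), the simple linear regression, and the in-memory stand-in for a retrieved Web table are illustrative assumptions, not the paper's actual system.

```python
def fit_line(xs, ys):
    """Least-squares fit y = a*x + b over complete (x, y) pairs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var if var else 0.0
    return a, my - a * mx

def impute(rows, key, target, predictor, web_table, min_pairs=3):
    """Fill missing `target` values: regression on `predictor` when enough
    complete pairs exist, otherwise a lookup by `key` in an external table."""
    complete = [(r[predictor], r[target]) for r in rows
                if r.get(target) is not None and r.get(predictor) is not None]
    use_regression = len(complete) >= min_pairs
    if use_regression:
        a, b = fit_line(*zip(*complete))
    for r in rows:
        if r.get(target) is None:
            if use_regression and r.get(predictor) is not None:
                r[target] = a * r[predictor] + b   # statistical imputation
            else:
                r[target] = web_table.get(r.get(key))  # Web-based lookup
    return rows
```

In the paper's setting, `web_table` would be the result of retrieving and matching Web tables against the incomplete dataset rather than a hard-coded dictionary.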
69

Program Transformations for Information Personalization

Perugini, Saverio 01 July 2004 (has links)
Personalization constitutes the mechanisms and technologies necessary to customize information access to the end-user. It can be defined as the automatic adjustment of information content, structure, and presentation. The central thesis of this dissertation is that modeling interaction explicitly in a representation, and studying how partial information can be harnessed in it by program transformations to direct the flow of the interaction, can provide insight into, reveal opportunities for, and define a model for personalized interaction. To evaluate this thesis, a formal modeling methodology is developed for personalizing interactions with information systems, especially hierarchical hypermedia, based on program transformations. The predominant form of personalized interaction developed in this dissertation is out-of-turn interaction, a technique which empowers the user to take the initiative in a user-system dialog by providing unsolicited, but relevant, information out of turn. Out-of-turn interaction helps flexibly bridge any mismatch between the user's model of information seeking and the system's hardwired hyperlink structure, in a manner fundamentally different from extant solutions such as multiple faceted browsing classifications and search tools. This capability is showcased through two interaction interfaces using alternate modalities to capture and communicate out-of-turn information to the underlying system: a toolbar embedded into a traditional browser for out-of-turn textual input, and voice-enabled content pages for out-of-turn speech input. The specific research issues addressed involve identifying and developing representations and transformations suitable for general classes of hierarchical hypermedia, providing supplemental interactions for improving the personalized experience, and studying users' (out-of-turn) interactions with the resulting systems. / Ph. D.
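The core idea, transforming a hierarchical site in response to unsolicited input, can be illustrated on a toy hierarchy. The site structure and the pruning rule below are invented for illustration; the dissertation formalizes this with program transformations over representations of the hypermedia, not with a dictionary walk.

```python
def mentions(tree, term):
    """True if `term` appears in this subtree's labels."""
    if isinstance(tree, dict):
        return any(term in label or mentions(child, term)
                   for label, child in tree.items())
    return False

def prune(tree, term):
    """Transformation: keep only branches that can still lead to `term`,
    simplifying the dialog the way partial evaluation simplifies a program
    when one of its inputs becomes known."""
    if not isinstance(tree, dict):
        return tree
    return {label: prune(child, term)
            for label, child in tree.items()
            if term in label or mentions(child, term)}

site = {
    "senators": {"by-state": {"virginia": "warner.html", "ohio": "brown.html"},
                 "by-party": {"democrat": {"virginia": "warner.html"}}},
    "representatives": {"by-state": {"texas": "doggett.html"}},
}
# The user says "virginia" out of turn, before choosing senators vs. representatives;
# every browsing path that cannot reach "virginia" is removed.
personalized = prune(site, "virginia")
```

The remaining hierarchy still offers the user their in-turn choices (by-state vs. by-party), but every dead end with respect to the volunteered term has been partially evaluated away.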
70

Transforming user data into user value by novel mining techniques for extraction of web content, structure and usage patterns : the development and evaluation of new Web mining methods that enhance information retrieval and improve the understanding of users' Web behavior in websites and social blogs

Ammari, Ahmad N. January 2010 (has links)
The rapid growth of the World Wide Web in the last decade has made it the largest publicly accessible data source in the world and one of the most significant and influential information revolutions of modern times. The Web has impacted almost every aspect of human life and activity, causing paradigm shifts and transformational changes in business, governance, and education. Moreover, the rapid evolution of Web 2.0 and the Social Web in the past few years, including social blogs and friendship networking sites, has dramatically transformed the Web from a raw environment for information consumption into a dynamic and rich platform for information production and sharing worldwide. However, this growth and transformation has resulted in an uncontrollable explosion and abundance of textual content, creating a serious challenge for any user trying to find and retrieve the relevant information they truly seek on the Web. Finding a relevant Web page within a website easily and efficiently has become very difficult. This creates challenges for researchers developing new mining techniques to improve the user experience on the Web, as well as for organizations seeking to understand the true informational interests and needs of their customers, so as to improve their targeted services by providing the products, services, and information that truly match the requirements of every online customer. With these challenges in mind, Web mining aims to extract hidden patterns and discover useful knowledge from Web page contents, Web hyperlinks, and Web usage logs.
Based on the primary kinds of Web data used in the mining process, Web mining tasks can be categorized into three main types: Web content mining, which extracts knowledge from Web page contents using text mining techniques; Web structure mining, which extracts patterns from the hyperlinks that represent the structure of a website; and Web usage mining, which mines users' navigational patterns from Web server logs that record the page accesses made by every user, representing the interactions between users and the pages of a website. The main goal of this thesis is to contribute toward addressing the challenges that have resulted from information explosion and overload on the Web, by proposing and developing novel Web mining-based approaches. Toward achieving this goal, the thesis presents, analyzes, and evaluates three major contributions: first, an integrated Web structure and usage mining approach that recommends a collection of hyperlinks to be placed on the homepage of a website for its visitors; second, an integrated Web content and usage mining approach to improve the understanding of users' Web behavior and discover user group interests in a website; third, a supervised classification model based on recent Social Web concepts, such as Tag Clouds, to improve the retrieval of relevant articles and posts from Web social blogs.
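Web usage mining as described above starts from server logs. A minimal sessionization step, grouping each user's page accesses into visits separated by idle gaps, might look like the following sketch; the 30-minute timeout is a common convention but an assumption here, as is the simplified (user, timestamp, url) log format.

```python
from datetime import datetime, timedelta

def sessionize(log_entries, timeout=timedelta(minutes=30)):
    """Group (user, timestamp, url) log entries into per-user sessions,
    starting a new session after `timeout` of inactivity."""
    sessions = {}   # user -> list of sessions, each a list of urls
    last_seen = {}  # user -> timestamp of their most recent access
    for user, ts, url in sorted(log_entries, key=lambda e: (e[0], e[1])):
        if user not in sessions or ts - last_seen[user] > timeout:
            sessions.setdefault(user, []).append([])  # open a new session
        sessions[user][-1].append(url)
        last_seen[user] = ts
    return sessions
```

The resulting per-user sessions are the raw navigational patterns that the structure- and usage-mining contributions above would then cluster or mine for hyperlink recommendations.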