About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
21

Comquest: an Adaptive Crawler for User Comments on the Web

Chen, Zhijia (ORCID 0009-0005-7866-4549), 05 1900
This thesis introduces Comquest, an adaptive framework designed for the large-scale collection and integration of user comments from the Web. User comments are featured on many websites, and there is growing interest in mining and studying them in applications such as opinion mining and information diffusion. However, crawling user comments generally requires hard-coded solutions tethered to specific websites, which are hard to scale and maintain. To achieve a generalizable and scalable comment-crawling solution, Comquest employs two website-agnostic approaches: Web API querying and HTML data extraction. When the target Web page is integrated with a third-party commenting system whose Web API is in Comquest’s knowledge base, Comquest retrieves comments by sending HTTP requests to the API’s URL, with parameters extracted from the target Web page. This approach faces several challenges. First, extracting accurate parameter values to construct HTTP requests is difficult, since the values, if present at all, are buried deep within the HTML source of Web documents. Second, the solution needs to generalize both vertically (within a website) and horizontally (across unseen websites). To tackle these challenges, parameter extraction is treated as a variant of the multiclass Named Entity Recognition (NER) problem, where the entities represent parameter values, and Comquest leverages a sequential-labeling deep learning model to identify parameter values within HTML source code. When the commenting system is native to the website or unknown, Comquest instead detects and extracts user comments from fully rendered Web pages. However, comments are often hidden until triggered by a specific user interaction, such as clicking on a designated page element among many other clickable elements. Furthermore, comments are typically presented as structured, record-like Web data with high structural variation, making them difficult to detect and extract from the target Web page among other record-like Web data. Comquest utilizes deep learning models and Web record extraction algorithms to automate the process of triggering, extracting, and classifying comments. Comquest has been implemented as a comprehensive system consisting of an administration web portal, a task controller, and a crawler backend. It provides a useful tool for collecting comments that represent a wide range of opinions, stances, and sentiments from websites on a global scale. / Computer and Information Science
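To make the API-querying path concrete, the following is a minimal Python sketch of how a crawler of this kind could look up a commenting system in a knowledge base, pull parameter values out of a page, and query the API. The endpoint, parameter names, and regular expressions are all invented for illustration; Comquest itself identifies parameter values with a sequential-labeling deep learning model, not regexes.

```python
import re
import requests

# Hypothetical knowledge-base entry for a third-party commenting system.
# The API URL and parameter names are illustrative, not Comquest's actual data.
COMMENT_API = {
    "endpoint": "https://api.examplecomments.io/v1/threads",
    "params": ["site_id", "thread_id"],
}

def extract_parameters(html: str) -> dict:
    """Naive stand-in for the NER model: pull candidate parameter values
    out of the raw HTML with regular expressions."""
    patterns = {
        "site_id": r'data-site-id="([^"]+)"',
        "thread_id": r'data-thread-id="([^"]+)"',
    }
    values = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, html)
        if match:
            values[name] = match.group(1)
    return values

def fetch_comments(page_url: str) -> list:
    html = requests.get(page_url, timeout=10).text
    params = extract_parameters(html)
    if set(params) != set(COMMENT_API["params"]):
        return []  # fall back to HTML data extraction instead
    response = requests.get(COMMENT_API["endpoint"], params=params, timeout=10)
    response.raise_for_status()
    return response.json().get("comments", [])
```

The empty-list fallback mirrors the thesis's two-pronged design: when API parameters cannot be recovered, the crawler switches to extraction from the fully rendered page.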
22

Distributed data management with a declarative rule-based language webdamlog / Gestion des données distribuées avec le langage de règles Webdamlog

Antoine, Emilien, 05 December 2013
Our goal is to enable a Web user to easily specify distributed data management tasks in place, i.e. without centralizing the data to a single provider. Our system is therefore not a replacement for Facebook, or any centralized system, but an alternative that allows users to launch their own peers on their machines, processing their own local personal data and possibly collaborating with Web services. We introduce Webdamlog, a datalog-style language for managing distributed data and knowledge. The language extends datalog in a number of ways, notably with a novel feature, namely delegation, allowing peers to exchange not only facts but also rules. We present a user study that demonstrates the usability of the language. We describe a Webdamlog engine that extends a distributed datalog engine, namely Bud, with the support of delegation and of a number of other novelties of Webdamlog, such as the possibility to have variables denoting peers or relations. We mention novel optimization techniques, notably one based on the provenance of facts and rules. We exhibit experiments that demonstrate that the rich features of Webdamlog can be supported at reasonable cost and that the engine scales to large volumes of data. Finally, we discuss the implementation of a Webdamlog peer system that provides an environment for the engine. In particular, a peer supports wrappers to exchange Webdamlog data with non-Webdamlog peers. We illustrate these peers by presenting a picture management application that we used for demonstration purposes.
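Delegation, the exchange of rules rather than only facts, is the distinguishing feature of the language. The following toy Python sketch illustrates the idea under heavy simplification: the peer names, relations, and rule encoding are invented, and real Webdamlog performs full unification, which is elided here.

```python
# Toy model of Webdamlog-style delegation (illustrative only).
peers = {}

class Peer:
    def __init__(self, name):
        self.name = name
        self.facts = set()   # facts: (relation, value)
        peers[name] = self

    def install(self, head, body):
        """head = (peer, relation); body = list of (peer, relation) atoms."""
        owner, relation = body[0]
        if owner != self.name:
            # The first body atom lives at another peer: ship the rule there.
            peers[owner].install(head, body)
            return
        # Evaluate the local atom against local facts.
        for rel, value in list(self.facts):
            if rel == relation:
                if len(body) == 1:
                    head_peer, head_rel = head
                    peers[head_peer].facts.add((head_rel, value))
                else:
                    # Ship the residual rule to the next atom's owner:
                    # this exchange of *rules*, not just facts, is delegation.
                    peers[body[1][0]].install(head, body[1:])

# Example: alice derives friends from bob's contacts via delegation.
alice, bob = Peer("alice"), Peer("bob")
bob.facts.add(("contacts", "carol"))
alice.install(("alice", "friends"), [("bob", "contacts")])
print(peers["alice"].facts)   # {('friends', 'carol')}
```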
24

Scalable view-based techniques for web data: algorithms and systems

Katsifodimos, Asterios, 03 July 2013
XML was recommended by W3C in 1998 as a markup language to be used by device- and system-independent methods of representing information. XML is nowadays used as a data model for storing and querying large volumes of data in database systems. In spite of significant research and systems development, many performance problems are raised by processing very large amounts of XML data. Materialized views have long been used in databases to speed up queries. Materialized views can be seen as precomputed query results that can be re-used to evaluate (part of) another query, and have been a topic of intensive research, in particular in the context of relational data warehousing. This thesis investigates the applicability of materialized view techniques to optimize the performance of Web data management tools, in particular in distributed settings, considering XML data and queries. We make three contributions.

We first consider the problem of choosing the best views to materialize within a given space budget in order to improve the performance of a query workload. Our work is the first to address the view selection problem for a rich subset of XQuery. The challenges we face stem from the expressive power and features of both the query and view languages, and from the size of the search space of candidate views to materialize. While the general problem has prohibitive complexity, we propose and study a heuristic algorithm and demonstrate its superior performance compared to the state of the art.

Second, we consider the management of large XML corpora in peer-to-peer networks based on distributed hash tables (DHTs, in short). We consider a platform leveraging distributed materialized XML views, defined by arbitrary XML queries, filled in with data published anywhere in the network, and exploited to efficiently answer queries issued by any network peer. This thesis has contributed important scalability-oriented optimizations, as well as a comprehensive set of experiments deployed in a country-wide WAN. These experiments exceed those of similar competitor systems by orders of magnitude in terms of data volumes and data dissemination throughput, and are thus the most advanced in understanding the performance behavior of DHT-based XML content management in real settings.

Finally, we present a novel approach for scalable content-based publish/subscribe (pub/sub, in short) in the presence of constraints on the available computational resources of data publishers. We achieve scalability by off-loading subscriptions from the publisher and leveraging view-based query rewriting to feed these subscriptions from the data accumulated in others. Our main contribution is a novel algorithm for organizing subscriptions in a multi-level dissemination network in order to serve large numbers of subscriptions, respect capacity constraints, and minimize latency. The efficiency and effectiveness of our algorithm are confirmed through extensive experiments and a large deployment in a WAN.
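The first contribution, choosing views under a space budget, can be illustrated with a greedy benefit-density heuristic. The sketch below is a generic stand-in with invented candidate views and cost estimates, not the thesis's actual algorithm, which must also handle interactions among XQuery views.

```python
from dataclasses import dataclass

@dataclass
class View:
    name: str
    size: int       # estimated materialization size
    benefit: float  # estimated workload cost saved if materialized

def select_views(candidates: list[View], budget: int) -> list[View]:
    """Greedy knapsack-style heuristic: pick views by benefit density
    until the space budget is exhausted. A simplified stand-in that
    ignores interactions between views."""
    chosen, used = [], 0
    for v in sorted(candidates, key=lambda v: v.benefit / v.size, reverse=True):
        if used + v.size <= budget:
            chosen.append(v)
            used += v.size
    return chosen

# Hypothetical candidates extracted from an XQuery workload.
views = [View("v1", 40, 10.0), View("v2", 25, 9.0), View("v3", 50, 12.0)]
print([v.name for v in select_views(views, budget=70)])  # ['v2', 'v1']
```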
25

Easing information extraction on the web through automated rules discovery

Ortona, Stefano, January 2016
The advent of the era of big data on the Web has made automatic web information extraction an essential tool in data acquisition processes. Unfortunately, automated solutions are in most cases more error-prone than those created by humans, resulting in dirty and erroneous data. Automatic repair and cleaning of the extracted data is thus a necessary complement to information extraction on the Web. This thesis investigates the problem of inducing cleaning rules on web-extracted data in order to (i) repair and align the data w.r.t. an original target schema, and (ii) produce repairs that are as generic as possible, so that different instances can benefit from them. The problem is addressed from three different angles: replace cross-site redundancy with an ensemble of entity recognisers; produce general repairs that can be encoded in the extraction process; and exploit entity-wide relations to infer common knowledge on extracted data. First, we present ROSeAnn, an unsupervised approach to integrate semantic annotators and produce a unified and consistent annotation layer on top of them. Both the diversity in vocabulary and the widely varying accuracy of annotators justify the need for a middleware that reconciles different annotator opinions. Considering annotators as "black boxes" that do not require per-domain supervision allows us to recognise semantically related content in web-extracted data in a scalable way. Second, we show in WADaR how annotators can be used to discover rules to repair web-extracted data. We study the problem of computing joint repairs for web data extraction programs and their extracted data, providing an approximate solution that requires no per-source supervision and proves effective across a wide variety of domains and sources. The proposed solution is effective not only in repairing the extracted data, but also in encoding such repairs in the original extraction process. Third, we investigate how relationships among entities can be exploited to discover inconsistencies and additional information. We present RuDiK, a disk-based, scalable solution to discover first-order logic rules over RDF knowledge bases built from web sources. Our approach does not limit its search space to rules that rely on "positive" relationships between entities, as is the case with traditional mining of constraints. On the contrary, it extends the search space to also discover negative rules, i.e., patterns that lead to contradictions in the data.
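To make the positive/negative distinction concrete, here is a hypothetical pair of rules of the kind RuDiK mines over an RDF knowledge base; the predicates are illustrative, not taken from the thesis.

```latex
% Positive rule: two people married to each other share children.
\mathit{spouse}(x, y) \wedge \mathit{hasChild}(x, z) \Rightarrow \mathit{hasChild}(y, z)

% Negative rule: a parent-child pair asserted as a marriage is a contradiction.
\mathit{hasChild}(x, y) \Rightarrow \neg\, \mathit{spouse}(x, y)
```

Positive rules of this form add missing facts, while negative rules flag extracted triples that cannot all be true at once, which is precisely what makes them useful for cleaning dirty web-extracted data.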
26

Um modelo de pontuação na busca de competências acadêmicas de pesquisadores / A score-based model for assessing academic researchers competences

Rech, Rodrigo Octavio, January 2007
This research describes a model for discovering and scoring the academic competences of researchers, based on the combination of quantitative indicators that make it possible to measure scientists' academic production. A distinguishing feature of the model is the inclusion of quantitative indicators related to the importance of the researchers' bibliographic production. These indicators allow the production to be evaluated in terms of both its impact in the academic community and the quality level of the venues where it was published. The study also presents a flexible and extensible architecture specification based on Web data extraction techniques and on approximate data matching (carried out through similarity functions). The architecture has been implemented in a web system whose main feature is the integration of several open-source technologies. The developed system allows any researcher to evaluate his or her own scientific production in quantitative terms, automating many aspects of the evaluation task, such as obtaining the indicators and integrating the different information databases.
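A minimal sketch of the kind of weighted indicator combination such a scoring model might use; the indicator names and weights below are invented for illustration and are not the thesis's actual formula.

```python
# Hypothetical indicators for a researcher's production; the names and
# weights are illustrative placeholders, not the thesis's model.
WEIGHTS = {
    "publications": 1.0,    # count of published papers
    "citations": 2.0,       # citations received
    "venue_quality": 5.0,   # aggregate venue-ranking score
}

def competence_score(indicators: dict) -> float:
    """Combine quantitative indicators into a single competence score."""
    return sum(WEIGHTS[name] * indicators.get(name, 0) for name in WEIGHTS)

print(competence_score({"publications": 12, "citations": 80, "venue_quality": 3}))
# 12*1.0 + 80*2.0 + 3*5.0 = 187.0
```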
29

Plataforma para la Extracción y Almacenamiento del Conocimiento Extraído de los Web Data / Platform for the extraction and storage of knowledge extracted from Web data

Rebolledo Lorca, Víctor, January 2008
No description available.
30

Distributed data management with access control: social networks and data of the Web / Gestion de données distribuées avec contrôle d’accès : réseaux sociaux et données du Web

Galland, Alban, 28 September 2011
The amount of information on the Web is growing very rapidly. Users as well as companies bring data to the network and are willing to share it with others. They quickly reach a situation where their information is hosted on many machines they own and on a large number of autonomous systems where they have accounts. Management of all this information rapidly moves beyond human expertise. We introduce WebdamExchange, a novel distributed knowledge-base model that includes logical statements for specifying information, access control, secrets, distribution, and knowledge about other peers. These statements can be communicated, replicated, queried, and updated, while keeping track of time and provenance. The resulting knowledge guides distributed data management. The WebdamExchange model is based on WebdamLog, a new rule-based language for distributed data management that combines, in a formal setting, deductive rules as in Datalog with negation (to specify intensional data) and active rules as in Datalog:: (for updates and communications). The model provides a novel setting with a strong emphasis on dynamicity and interactions, in a Web 2.0 style. Because the model is powerful, it provides a clean basis for the specification of complex distributed applications. Because it is simple, it provides a formal framework for studying many facets of the problem, such as distribution, concurrency, and expressivity in the context of distributed autonomous peers. We also discuss an implementation of a proof-of-concept system that handles all the components of the knowledge base, and experiments with a lighter system designed for smartphones. We believe that these contributions are a good foundation to overcome the problems of Web data management, in particular with respect to access control.
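A rough sketch of the flavor of statements such a knowledge base might hold; the Python encoding, statement forms, and peer names below are invented for illustration, and WebdamExchange's actual statements are richer (secrets, time, and full provenance tracking in a formal logic).

```python
from dataclasses import dataclass

# Illustrative statement forms for a WebdamExchange-style knowledge base.
@dataclass(frozen=True)
class Fact:
    relation: str
    args: tuple
    owner: str        # peer asserting the fact
    provenance: str   # where the statement came from

@dataclass(frozen=True)
class AccessGrant:
    relation: str
    grantee: str      # peer allowed to read the relation

kb = {
    Fact("photo", ("sunset.jpg",), owner="alice", provenance="alice"),
    AccessGrant("photo", grantee="bob"),
}

def can_read(peer: str, fact: Fact, kb: set) -> bool:
    """A peer may read a fact it owns, or one whose relation was granted to it."""
    if fact.owner == peer:
        return True
    return any(isinstance(s, AccessGrant)
               and s.relation == fact.relation
               and s.grantee == peer
               for s in kb)

photo = next(s for s in kb if isinstance(s, Fact))
print(can_read("bob", photo, kb))   # True: granted
print(can_read("eve", photo, kb))   # False: no grant, not the owner
```

The point of the model is that access-control statements like the grant above live in the same knowledge base as the data and are queried and updated by the same rule machinery, rather than being bolted on externally.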
