11 |
Distributed data management with access control : social Networks and Data of the Web
Galland, Alban 28 September 2011
The amount of information on the Web is growing very rapidly. Users as well as companies bring data to the network and are willing to share it with others. They quickly reach a situation where their information is hosted on many machines they own and on a large number of autonomous systems where they have accounts. Managing all this information is rapidly moving beyond human expertise. We introduce WebdamExchange, a novel distributed knowledge-base model that includes logical statements for specifying information, access control, secrets, distribution, and knowledge about other peers. These statements can be communicated, replicated, queried, and updated, while keeping track of time and provenance. The resulting knowledge guides distributed data management. The WebdamExchange model is based on WebdamLog, a new rule-based language for distributed data management that combines, in a formal setting, deductive rules as in Datalog with negation (to specify intensional data) and active rules as in Datalog¬¬ (for updates and communications). The model provides a novel setting with a strong emphasis on dynamicity and interactions (in a Web 2.0 style). Because the model is powerful, it provides a clean basis for the specification of complex distributed applications. Because it is simple, it provides a formal framework for studying many facets of the problem, such as distribution, concurrency, and expressivity in the context of distributed autonomous peers. We also discuss an implementation of a proof-of-concept system that handles all the components of the knowledge base, and experiments with a lighter system designed for smartphones. We believe that these contributions are a good foundation for overcoming the problems of Web data management, in particular with respect to access control.
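To make the notion of deductive rules specifying intensional data more concrete, the following is a minimal sketch of naive bottom-up evaluation of two Datalog-style rules in Python. It is not WebdamLog syntax and does not model negation, distribution, access control, or active rules; the relation names `friend` and `reachable` and the function `derive_reachable` are hypothetical and chosen only for illustration.

```python
def derive_reachable(friend_facts):
    """Naive fixpoint evaluation of two Datalog-style deductive rules:

        reachable(X, Y) :- friend(X, Y).
        reachable(X, Z) :- reachable(X, Y), friend(Y, Z).

    friend_facts is the extensional (stored) relation; the returned set
    is the intensional (derived) relation.
    """
    reachable = set(friend_facts)  # first rule: copy every friend fact
    while True:
        # second rule: join reachable with friend on the shared variable Y
        new_facts = {
            (x, z)
            for (x, y) in reachable
            for (y2, z) in friend_facts
            if y == y2
        }
        if new_facts <= reachable:  # nothing new was derived: fixpoint reached
            return reachable
        reachable |= new_facts


# Hypothetical extensional data
friend = {("alice", "bob"), ("bob", "carol")}
reachable = derive_reachable(friend)
```

Here `("alice", "carol")` is never stored; it exists only because the rules derive it, which is the sense in which deductive rules define intensional data.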
|
12 |
Interaktivní procházení webu a extrakce dat / Interactive web crawling and data extraction
Fejfar, Petr January 2018
Title: Interactive crawling and data extraction
Author: Bc. Petr Fejfar
Author's e-mail address: pfejfar@gmail.com
Department: Department of Distributed and Dependable Systems
Supervisor: Mgr. Pavel Ježek, Ph.D., Department of Distributed and Dependable Systems
Abstract: The subject of this thesis is Web crawling and data extraction from Rich Internet Applications (RIAs). The thesis starts with an analysis of modern Web pages along with the techniques used for crawling and data extraction. Based on this analysis, we designed a tool which crawls RIAs according to instructions defined by the user via a graphical interface. In contrast with other currently popular tools for RIAs, our solution is targeted at users with no programming experience, including business and analyst users. The designed solution itself is implemented in the form of an RIA, using the WebDriver protocol to automate multiple browsers according to user-defined instructions. Our tool allows the user to inspect browser sessions by displaying pages that are being crawled simultaneously. This feature enables the user to troubleshoot the crawlers. The outcome of this thesis is a fully designed and implemented tool enabling business users to extract data from RIAs. This opens new opportunities for this type of user to collect data from Web pages for use...
|
13 |
Descoberta de ruído em páginas da web oculta através de uma abordagem de aprendizagem supervisionada / A supervised learning approach for noise discovery in web pages found in the hidden web
Lutz, João Adolfo Froede January 2013
One of the problems of data extraction from web pages is the identification of noise in pages. This task aims at identifying non-informative elements in pages, such as headers, menus, or advertisements. The presence of noise may seriously hinder the performance of search engines and web mining tasks. In this paper we tackle the problem of discovering noise in web pages found in the hidden web, i.e., the part of the web that is only accessible by filling in web forms. In hidden web processing, data extraction is usually preceded by a form-filling step, in which the query forms that give access to the hidden web pages are automatically or semi-automatically filled. During form filling, relevant data about the queried domain are collected, such as field labels and field values. Our proposal combines this type of data with syntactic information about the nodes that compose the page. We show empirically that this combination achieves better results than an approach that is based solely on syntactic information.
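The idea of combining form-derived domain data with syntactic information can be sketched as a feature-building step. The following Python fragment is only an illustration under stated assumptions: the abstract does not list the actual features or the classifier, so `PageNode`, the feature names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PageNode:
    """Hypothetical, simplified stand-in for a node of the page."""
    tag: str
    text: str
    link_count: int


def syntactic_features(node):
    # Features computed from the page structure alone (assumed examples)
    words = node.text.split()
    return {
        "is_layout_tag": node.tag in {"nav", "ul", "header", "footer"},
        "word_count": len(words),
        "links_per_word": node.link_count / max(len(words), 1),
    }


def domain_features(node, form_terms):
    # Features derived from labels and values collected during form filling
    words = {w.lower() for w in node.text.split()}
    terms = {t.lower() for t in form_terms}
    return {"form_term_overlap": len(words & terms)}


def combined_features(node, form_terms):
    # The combination described in the abstract: both feature groups together
    return {**syntactic_features(node), **domain_features(node, form_terms)}


# Hypothetical form data: one field label and one submitted value
form_terms = ["author", "Tolkien"]
menu = PageNode("nav", "Home Products Contact", 3)
result = PageNode("div", "The Hobbit by Tolkien", 0)

menu_features = combined_features(menu, form_terms)
result_features = combined_features(result, form_terms)
```

A supervised classifier would then be trained on such combined vectors labeled as noise or content; in this toy case the menu node shares no terms with the form data while the result node does.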
|
16 |
A comparison of HTML-aware tools for Web Data extraction
Boronat, Xavier Azagra 20 October 2017
Nowadays we live in a world where information is present everywhere in our daily life. In recent years the amount of information that we receive has grown, and the media through which it is distributed have changed: from conventional newspapers and radio to mobile phones, digital television, and the Web. In this document we focus on the information that can be found on the Web, a very large source of data that is still developing.
|
17 |
iFuice - Information Fusion utilizing Instance Correspondences and Peer Mappings
Rahm, Erhard, Thor, Andreas, Aumüller, David, Do, Hong-Hai, Golovin, Nick, Kirsten, Toralf 04 February 2019
We present a new approach to information fusion of web data sources. It is based on peer-to-peer mappings between sources and utilizes correspondences between their instances. Such correspondences are already available between many sources, e.g. in the form of web links, and help combine the information about specific objects and support a high quality data fusion. Sources and mappings relate to a domain model to support a semantically focused information fusion. The iFuice architecture incorporates a mapping mediator offering both an interactive and a script-driven, workflow-like access to the sources and their mappings. The script programmer can use powerful generic operators to execute and manipulate mappings and their results. The paper motivates the new approach and outlines the architecture and its main components, in particular the domain model, source and mapping model, and the script operators and their usage.
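As a rough illustration of what manipulating instance-level mappings can look like, the sketch below models a mapping as a set of correspondences between instance identifiers and composes two mappings. This is an assumption-laden example, not the iFuice script language: the function name `compose`, the source names, and all identifiers are hypothetical.

```python
def compose(mapping_ab, mapping_bc):
    """Chain two instance mappings.

    Each mapping is a set of (source_id, target_id) correspondences.
    The result relates an instance of A to an instance of C whenever
    some instance of B is linked to both.
    """
    return {
        (a, c)
        for (a, b) in mapping_ab
        for (b2, c) in mapping_bc
        if b == b2
    }


# Hypothetical correspondences between three sources
publication_to_author = {("pub1", "auth7"), ("pub2", "auth7"), ("pub3", "auth9")}
author_to_affiliation = {("auth7", "univ_a")}

publication_to_affiliation = compose(publication_to_author, author_to_affiliation)
```

Composition of this kind lets information about an object in one source be combined with information reachable only through an intermediate source; instances with no counterpart, such as `auth9` here, simply drop out of the result.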
|
18 |
Analytics-as-a-Service in a Multi-Cloud Environment through Semantically-enabled Hierarchical Data Processing
Jayaraman, P.P., Perera, C., Georgakopoulos, D., Dustdar, S., Thakker, Dhaval, Ranjan, R. 16 August 2016
A large number of cloud middleware platforms and tools are deployed to support a variety of Internet of Things (IoT) data analytics tasks. It is common practice that such cloud platforms are used only by their owners to achieve their primary and predefined objectives, where raw and processed data are consumed only by them. However, allowing third parties to access processed data to achieve their own objectives significantly increases integration and cooperation, and can also lead to innovative use of the data. Multi-cloud, privacy-aware environments facilitate such data access, allowing different parties to share processed data and collectively reduce the consumption of computation resources. However, there are interoperability issues in such environments, which involve heterogeneous data and analytics-as-a-service providers. There is a lack of both architectural blueprints that can support such diverse, multi-cloud environments and corresponding empirical studies that show the feasibility of such architectures. In this paper, we outline an innovative hierarchical data processing architecture that utilises semantics at all levels of the IoT stack in multi-cloud environments. We demonstrate the feasibility of such an architecture by building a system based on it, using OpenIoT as middleware and Google Cloud and Microsoft Azure as cloud environments. The evaluation shows that the system is scalable and has no significant limitations or overheads.
|
19 |
An Empirical Study of Novel Approaches to Dimensionality Reduction and Applications
Nsang, Augustine S. 23 September 2011
No description available.
|
20 |
SEEDEEP: A System for Exploring and Querying Deep Web Data Sources
Wang, Fan 27 September 2010
No description available.
|