  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
41

Explorando dados provindos da internet em dispositivos móveis: uma abordagem baseada em visualização de informação / Exploring web data on mobile devices: an approach based on information visualization

Felipe Simões Lage Gomes Duarte 12 February 2015 (has links)
With the development of computers and the increasing popularity of the Internet, our society has entered the information age. This era is marked by the way we produce and deal with information. Every day, thousands of gigabytes are stored, but their value is reduced if the data cannot be transformed into knowledge. Concomitantly, computing is moving towards the miniaturization and affordability of mobile devices, which are changing users' behavior: passive readers are becoming content generators. In this context, this master's thesis proposes and develops a data visualization technique, called NMap, and a web data visualization tool for mobile devices, called SPCloud. NMap uses all available visual space to convey information while preserving the distance-similarity metaphor between elements. Comparative evaluations against state-of-the-art techniques showed that NMap produces better neighborhood preservation with a significant improvement in processing time, placing NMap among the leading space-filling techniques. SPCloud, a tool that uses NMap to present news available on the web, was developed taking into account the inherent characteristics of mobile devices. Informal user tests revealed that SPCloud performs well in summarizing large amounts of news in a small visual space.
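The abstract describes NMap only at a high level. Below is a minimal sketch of the underlying idea of a space-filling, distance-similarity-preserving layout, assuming points already projected to 2D; the recursive-bisection scheme, function names and parameters are illustrative simplifications, not the thesis' actual algorithm.

    # Sketch of a space-filling layout in the spirit of NMap: 2D points are
    # recursively bisected and the display rectangle is split proportionally,
    # so nearby points receive nearby cells. Illustrative only.

    def layout(points, rect, horizontal=True):
        """points: list of (x, y, label); rect: (x0, y0, x1, y1).
        Returns {label: cell_rect}, with the cells tiling rect."""
        if len(points) == 1:
            return {points[0][2]: rect}
        axis = 0 if horizontal else 1
        pts = sorted(points, key=lambda p: p[axis])
        half = len(pts) // 2
        left, right = pts[:half], pts[half:]
        # split the rectangle proportionally to the number of points on each side
        x0, y0, x1, y1 = rect
        frac = len(left) / len(pts)
        if horizontal:
            cut = x0 + frac * (x1 - x0)
            r1, r2 = (x0, y0, cut, y1), (cut, y0, x1, y1)
        else:
            cut = y0 + frac * (y1 - y0)
            r1, r2 = (x0, y0, x1, cut), (x0, cut, x1, y1)
        cells = layout(left, r1, not horizontal)
        cells.update(layout(right, r2, not horizontal))
        return cells

    if __name__ == "__main__":
        pts = [(0.1, 0.2, "a"), (0.8, 0.3, "b"), (0.4, 0.9, "c"), (0.7, 0.7, "d")]
        for label, cell in layout(pts, (0.0, 0.0, 1.0, 1.0)).items():
            print(label, cell)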
42

Extração de informação não-supervisionada por segmentação de texto / Unsupervised information extraction by text segmentation

Vilarinho, Eli Cortez Custódio 14 December 2012 (has links)
CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / In this work we propose, implement and evaluate a new unsupervised approach to the problem of Information Extraction by Text Segmentation (IETS). Our approach relies on information available in pre-existing data to learn how to associate segments in the input string with attributes of a given domain, relying on a very effective set of content-based features. The effectiveness of the content-based features is also exploited to learn structure-based features directly from test data, with no previous human-driven training, a feature unique to our approach. Based on this approach, we have produced a number of results to address the IETS problem in an unsupervised fashion. In particular, we have developed, implemented and evaluated distinct IETS methods, namely ONDUX, JUDIE and iForm. ONDUX (On Demand Unsupervised Information Extraction) is an unsupervised probabilistic approach for IETS that relies on content-based features to bootstrap the learning of structure-based features. Structure-based features are exploited to disambiguate the extraction of certain attributes through a reinforcement step, which relies on the sequencing and positioning of attribute values learned on demand directly from the input texts. JUDIE (Joint Unsupervised Structure Discovery and Information Extraction) aims at automatically extracting several semi-structured data records that appear as continuous text with no explicit delimiters between them. In comparison with other IETS methods, including ONDUX, JUDIE faces a considerably harder task: extracting information while simultaneously uncovering the underlying structure of the implicit records containing it. In spite of that, it achieves results comparable to state-of-the-art methods. iForm applies our approach to the task of Web form filling: it extracts segments from a data-rich input text and associates these segments with fields of a target Web form. The extraction process relies on content-based features learned from data previously submitted to the Web form. All of these methods were evaluated on different experimental datasets, over which we performed a large set of experiments to validate our approach and methods. These experiments indicate that our approach yields high-quality results when compared to state-of-the-art approaches and that it can properly support IETS methods in a number of real applications.
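As a rough illustration of the content-based features the abstract refers to, the sketch below labels segments of an input record by their overlap with attribute values harvested from pre-existing data. The attribute names, value sets and scoring rule are hypothetical; this is not the ONDUX/JUDIE/iForm implementation.

    # Sketch of content-based matching for unsupervised IETS: each segment of
    # an input string is associated with the attribute whose known values it
    # overlaps most. Illustrative only.

    KNOWN_VALUES = {  # values harvested from a pre-existing dataset
        "author": {"john", "maria", "silva", "smith"},
        "title": {"databases", "information", "extraction", "systems"},
        "year": {"2010", "2011", "2012"},
    }

    def score(segment, attribute):
        """Fraction of the segment's tokens seen among the attribute's known values."""
        tokens = segment.lower().split()
        if not tokens:
            return 0.0
        hits = sum(1 for t in tokens if t in KNOWN_VALUES[attribute])
        return hits / len(tokens)

    def label_segments(segments):
        """Assign each segment the best-scoring attribute (or None if nothing matches)."""
        labeled = []
        for seg in segments:
            best = max(KNOWN_VALUES, key=lambda a: score(seg, a))
            labeled.append((seg, best if score(seg, best) > 0 else None))
        return labeled

    if __name__ == "__main__":
        record = "Silva Smith , Information Extraction Systems , 2011"
        print(label_segments([s.strip() for s in record.split(",")]))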
43

Extrakce dat z webu / Web Data Extraction

Novella, Tomáš January 2016 (has links)
Creation of web wrappers (i.e., programs that extract data from the web) is a subject of study in the field of web data extraction. Designing a domain-specific language for a web wrapper is a challenging task, because it involves trade-offs between the expressiveness of the wrapper language and safety. In addition, little attention has been paid to executing wrappers in restricted environments. In this thesis, we present a new wrapping language -- Serrano -- designed with three goals in mind: (1) the ability to run in restricted environments, such as a browser extension; (2) extensibility, to balance the trade-off between the expressiveness of the command set and safety; and (3) processing capabilities, to eliminate the need for additional programs to clean the extracted data. Serrano has been successfully deployed in a number of projects and has provided encouraging results.
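To make the notion of a web wrapper concrete, the sketch below interprets a small declarative rule set (an extraction pattern plus cleaning steps per field) over a page, echoing Serrano's combination of extraction commands and processing capabilities. The syntax shown is not Serrano's; the page snippet, field names and patterns are invented for illustration.

    # Minimal sketch of a declarative web wrapper: extraction rules plus
    # post-processing steps are interpreted over a page. Not Serrano's syntax.
    import re

    PAGE = '<div class="item"><h2>Laptop</h2><span class="price"> 999 EUR </span></div>'

    WRAPPER = {
        "name":  {"extract": r"<h2>(.*?)</h2>",             "clean": [str.strip]},
        "price": {"extract": r'class="price">(.*?)</span>', "clean": [str.strip,
                                                                      lambda s: s.replace(" EUR", "")]},
    }

    def run_wrapper(wrapper, page):
        result = {}
        for field, rule in wrapper.items():
            m = re.search(rule["extract"], page)
            value = m.group(1) if m else None
            for step in rule.get("clean", []):   # cleaning happens inside the wrapper itself
                value = step(value) if value is not None else value
            result[field] = value
        return result

    print(run_wrapper(WRAPPER, PAGE))  # {'name': 'Laptop', 'price': '999'}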
44

Query-Time Data Integration

Eberius, Julian 16 December 2015 (has links) (PDF)
Today, data is collected at ever-increasing scale and variety, opening up enormous potential for new insights and data-centric products. However, in many cases the volume and heterogeneity of new data sources preclude up-front integration using traditional ETL processes and data warehouses. In some cases, it is even unclear if and in what context the collected data will be utilized. Therefore, there is a need for agile methods that defer the effort of integration until the usage context is established. This thesis introduces Query-Time Data Integration as an alternative concept to traditional up-front integration. It aims at enabling users to issue ad-hoc queries on their own data as if all potential other data sources were already integrated, without declaring specific sources and mappings to use. Automated data search and integration methods are then coupled directly with query processing on the available data. The ambiguity and uncertainty introduced through fully automated retrieval and mapping methods is compensated for by answering those queries with ranked lists of alternative results. Each result is based on different data sources or query interpretations, allowing users to pick the result most suitable to their information need. To this end, this thesis makes three main contributions. Firstly, we introduce a novel method for Top-k Entity Augmentation, which constructs a top-k list of consistent integration results from a large corpus of heterogeneous data sources. It improves on the state of the art by producing individually consistent but mutually diverse alternative solutions, while minimizing the number of data sources used. Secondly, based on this novel augmentation method, we introduce the DrillBeyond system, which is able to process Open World SQL queries, i.e., queries referencing arbitrary attributes not defined in the queried database. The original database is then augmented at query time with Web data sources providing those attributes. Its hybrid augmentation/relational query processing enables the use of ad-hoc data search and integration in data analysis queries, and improves both performance and quality compared to using separate systems for the two tasks. Finally, we study the management of large-scale dataset corpora such as data lakes or Open Data platforms, which are used as data sources for our augmentation methods. We introduce Publish-time Data Integration as a new technique for data curation systems managing such corpora, which aims at improving the individual reusability of datasets without requiring up-front global integration. This is achieved by automatically generating metadata and format recommendations, allowing publishers to enhance their datasets with minimal effort. Collectively, these three contributions form the foundation of a Query-time Data Integration architecture that enables ad-hoc data search and integration queries over large heterogeneous dataset collections.
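A toy sketch of the top-k entity augmentation idea follows: candidate sources each cover some entities, and alternative results are assembled greedily, preferring sources not yet used so that the k results stay diverse. The source names, data and scoring rule are hypothetical and much simpler than the thesis' method.

    # Sketch of top-k entity augmentation: assemble k alternative results,
    # each covering all entities, preferring unused sources (diversity) and
    # sources that cover many missing entities. Illustrative only.

    ENTITIES = ["Germany", "France", "Italy"]
    SOURCES = {  # source name -> partial mapping from entity to attribute value
        "tableA": {"Germany": 83.0, "France": 67.4},
        "tableB": {"Italy": 59.0, "France": 67.8},
        "tableC": {"Germany": 83.2, "Italy": 58.9, "France": 67.5},
    }

    def augment_top_k(entities, sources, k=2):
        results, used = [], set()
        for _ in range(k):
            solution, covered = {}, set()
            while not set(entities) <= covered:
                candidates = [s for s in sources if set(sources[s]) - covered]
                if not candidates:
                    break  # remaining entities cannot be covered
                # prefer sources not used in earlier results, then larger coverage gain
                best = max(candidates,
                           key=lambda s: (s not in used, len(set(sources[s]) - covered)))
                for e in set(sources[best]) - covered:
                    solution[e] = (sources[best][e], best)
                covered |= set(sources[best])
                used.add(best)
            results.append(solution)
        return results

    for i, r in enumerate(augment_top_k(ENTITIES, SOURCES), 1):
        print(f"result {i}:", r)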
45

Vizualizace dat pro Ansible Automation Analytics / Chart Builder Ansible Automation Analytics

Berky, Levente January 2021 (has links)
This thesis focuses on creating a web component for rendering charts from a structured data format (hereafter, the schema) and on creating a user interface for editing that schema for Ansible Automation Analytics. The thesis examines the current implementation of Ansible Automation Analytics and the corresponding API, surveys suitable chart-rendering libraries, and describes the fundamentals of the technologies used. The practical part specifies the requirements for the component and describes the development and implementation of the plugin, followed by the testing process and plans for the plugin's future development.
47

The One Spider To Rule Them All : Web Scraping Simplified: Improving Analyst Productivity and Reducing Development Time with A Generalized Spider / Spindeln som härskar över dom alla : Webbskrapning förenklat: förbättra analytikerproduktiviteten och minska utvecklingstiden med generaliserade spindlar

Johansson, Rikard January 2023 (has links)
This thesis addresses the process of developing a generalized spider for web scraping, which can be applied to multiple sources, thereby reducing the time and cost involved in creating and maintaining individual spiders for each website or URL. The project aims to improve analyst productivity, reduce development time for developers, and ensure high-quality and accurate data extraction. The research involves investigating web scraping techniques and developing a more efficient and scalable approach to report retrieval. The problem statement emphasizes the inefficiency of the current method, with one customized spider per source, and the need for a more streamlined approach to web scraping. The research question focuses on identifying patterns in the web scraping process and the functions required for specific publication websites in order to create a more generalized web scraper. The objective is to reduce manual effort, improve scalability, and maintain high-quality data extraction. The problem is addressed using a quantitative approach that involves the analysis and implementation of spiders for each data source. This enables a comprehensive understanding of all potential scenarios and provides the necessary knowledge to develop a general spider. These spiders are then grouped based on their similarity and, through the application of simple logic, consolidated into a single general spider capable of handling all the sources. To construct the general spider, a utility library is created, equipped with the essential tools for extracting relevant information such as title, description, date, and PDF links. All source-specific details are then moved into configuration files, enabling the execution of the general spider. The findings demonstrate the successful integration of multiple sources and spiders into a unified general spider. Due to the limited time frame of the project, there is potential for further improvement: enhancements could include better structuring of the configuration files, expansion of the utility library, or the integration of AI capabilities to improve the general spider's performance. Nevertheless, the current solution is deemed suitable for automated article retrieval and ready to be used.
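The configuration-file idea can be sketched as follows: each source gets a small config with extraction patterns, and one generic routine runs over all of them. The source names, sample pages and regex patterns are invented; a real spider would fetch pages over HTTP and use the thesis' utility library rather than the inline strings used here.

    # Sketch of "one general spider + per-source configuration": every source
    # is described by a config, and a single routine handles them all.
    import re

    SOURCE_CONFIGS = {
        "publisher_a": {
            "page": '<a class="report" href="/r/1.pdf">Annual report</a> <em>2023-05-01</em>',
            "patterns": {"title": r'class="report"[^>]*>(.*?)</a>',
                         "pdf":   r'href="(\S+\.pdf)"',
                         "date":  r"<em>(.*?)</em>"},
        },
        "publisher_b": {
            "page": '<h3>Quarterly update</h3> <span class="when">01/06/2023</span> <a href="/q2.pdf">PDF</a>',
            "patterns": {"title": r"<h3>(.*?)</h3>",
                         "pdf":   r'href="(\S+\.pdf)"',
                         "date":  r'class="when">(.*?)</span>'},
        },
    }

    def general_spider(configs):
        """Run the same extraction loop over every configured source."""
        for source, cfg in configs.items():
            record = {"source": source}
            for field, pattern in cfg["patterns"].items():
                m = re.search(pattern, cfg["page"])
                record[field] = m.group(1) if m else None
            yield record

    for rec in general_spider(SOURCE_CONFIGS):
        print(rec)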
48

A Flexible Graph-Based Data Model Supporting Incremental Schema Design and Evolution

Braunschweig, Katrin, Thiele, Maik, Lehner, Wolfgang 26 January 2023 (has links)
Web data is characterized by great structural diversity as well as frequent changes, which poses a great challenge for web applications based on that data. We want to address this problem by developing a schema-optional and flexible data model that supports the integration of heterogeneous and volatile web data. To this end, we rely on graph-based models that allow the schema to be incrementally extended with various information and constraints. Inspired by the ongoing Web 2.0 trend, we want users to participate in the design and management of the schema. By incrementally adding structural information, users can enhance the schema to meet their very specific requirements.
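A minimal sketch of such a schema-optional model with incremental constraints is given below: records start as free-form property sets, and constraints added later are checked against the data already stored. Class and method names are illustrative assumptions, not the paper's model.

    # Sketch of a schema-optional store with incremental schema design:
    # nodes are free-form, and constraints can be added (and validated) later.

    class FlexibleStore:
        def __init__(self):
            self.nodes = []        # free-form property dictionaries
            self.constraints = []  # (description, predicate) pairs added over time

        def add_node(self, **properties):
            self.nodes.append(properties)

        def add_constraint(self, description, predicate):
            """Incrementally extend the schema; report existing nodes that violate it."""
            self.constraints.append((description, predicate))
            return [n for n in self.nodes if not predicate(n)]

    store = FlexibleStore()
    store.add_node(title="Web Mining", year=2005)
    store.add_node(title="Query-Time Data Integration")  # no year, acceptable at first

    # later, a user tightens the schema
    violations = store.add_constraint("year must be present and an int",
                                      lambda n: isinstance(n.get("year"), int))
    print(violations)  # [{'title': 'Query-Time Data Integration'}]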
49

Web Mining - die Fallstudie Swarovski : theoretische Grundlagen und praktische Anwendungen / Web mining - the Swarovski case study: theoretical foundations and practical applications

Linder, Alexander, Wehrli, Hans Peter. January 2005 (has links) (PDF)
Also a doctoral dissertation in economics, Zürich, 2004. / Trade edition: Wiesbaden: Deutscher Universitäts-Verlag. Includes a bibliography.
50

View-Based techniques for the efficient management of web data / Techniques fondées sur des vues matérialisées pour la gestion efficace des données du web

Karanasos, Konstantinos 29 June 2012 (has links)
Data is being published in digital formats at very high rates nowadays. A large share of this data has complex structure, typically organized as trees (Web documents such as HTML and XML being the most representative) or graphs (in particular, graph-structured Semantic Web databases, expressed in RDF). There is great interest in exploiting such complex data, whether in an Open Data access model or within the companies owning it, and doing so efficiently for large data volumes remains challenging. Materialized views have long been used to obtain significant performance improvements when processing queries. The principle is that a view stores pre-computed results that can be used to evaluate (possibly part of) a query. Adapting materialized view techniques to the Web data setting we consider is particularly challenging due to the structural and semantic complexity of the data. This thesis tackles two problems in the broad context of materialized view-based management of Web data. First, we focus on the problem of view selection for RDF query workloads. We present a novel algorithm which, based on a query workload, proposes the most appropriate views to materialize in the database, in order to minimize the combined cost of query evaluation, view maintenance and view storage. Although RDF query workloads typically feature many joins, hampering the view selection process, our algorithm scales to hundreds of queries, a number unattained by existing approaches. Furthermore, we propose new techniques to account for the implicit data that can be derived from RDF Schemas, which further complicates the view selection process. The second contribution of our work concerns query rewriting based on materialized XML views. We start by identifying an expressive dialect of XQuery, corresponding to tree patterns with value joins, and study some important properties of these queries, such as containment and minimization. Based on these notions, we consider the problem of finding minimal equivalent rewritings of a query expressed in this dialect, using materialized views expressed in the same dialect, and provide a sound and complete algorithm for that purpose. Our work extends the state of the art by allowing each pattern node to return a set of attributes, supporting value joins in the patterns, and considering rewritings which combine many views. Finally, we show how our view-based query rewriting algorithm can be applied in a distributed setting, in order to efficiently disseminate corpora of XML documents carrying RDF annotations.
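The workload-driven view selection step can be illustrated with a toy sketch: each query is reduced to a set of triple patterns, and the patterns shared by the most queries are proposed as views, up to a budget. This ignores the evaluation, maintenance and storage cost model the thesis actually optimizes; the workload and names are invented.

    # Sketch of workload-driven view selection for RDF: propose as views the
    # triple patterns shared by the most queries in the workload. Illustrative only.
    from collections import Counter

    WORKLOAD = [  # each query = set of triple patterns (subject, predicate, object)
        {("?p", "type", "Person"), ("?p", "worksAt", "?org")},
        {("?p", "type", "Person"), ("?p", "name", "?n")},
        {("?p", "worksAt", "?org"), ("?org", "locatedIn", "?city")},
    ]

    def select_views(workload, budget=2):
        """Return up to `budget` triple patterns most shared across the workload."""
        counts = Counter(pattern for query in workload for pattern in query)
        # keep only patterns that benefit more than one query
        shared = [(p, c) for p, c in counts.most_common() if c > 1]
        return [p for p, _ in shared[:budget]]

    print(select_views(WORKLOAD))
    # e.g. [('?p', 'type', 'Person'), ('?p', 'worksAt', '?org')]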
