About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations. Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
21

Extrakcia štruktúrovaných dát z neštruktúrovaného textu / Structured Data Extraction from Unstructured Text

Kóša, Peter January 2013 (has links)
Title: Structured Data Extraction from Unstructured Text
Author: Bc. Peter Kóša
Department: Department of Software Engineering
Supervisor: Mgr. Martin Nečaský, Ph.D., Department of Software Engineering
Abstract: Over the last 20 years there has been an ever-growing amount of information available on the Internet and in published texts. However, this information is often in an unstructured format, which causes various problems such as the inability to search efficiently in diverse collections of texts (medical reports, ads, etc.). To overcome these problems, we need efficient tools capable of automatically processing texts, extracting the important information and storing the results in some form for later reuse. The purpose of this thesis is to compare existing solutions with each other and with our own solution, which was created within the scope of the software project SemJob. The SemJob project is introduced so that the reader can learn about its inner structure and workings.
Keywords: structured data extraction, extraction rules, ontologies, (semi)automatic wrapper induction
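To make the idea of hand-written extraction rules concrete, here is a minimal Python sketch (not taken from the thesis or from SemJob; the rules, field names and example ad are invented) that applies regular-expression rules to unstructured ad text and collects the matches as a structured record:

```python
import re

# Hypothetical extraction rules: each rule maps a target field to a
# regular expression with one capturing group.
RULES = {
    "price": re.compile(r"(\d[\d\s]*)\s*(?:EUR|€)", re.IGNORECASE),
    "rooms": re.compile(r"(\d+)\s*(?:room|rooms)", re.IGNORECASE),
    "city":  re.compile(r"located in\s+([A-Z][a-z]+)"),
}

def extract(text: str) -> dict:
    """Apply every rule to the text and keep the first match per field."""
    record = {}
    for field, pattern in RULES.items():
        match = pattern.search(text)
        if match:
            record[field] = match.group(1).strip()
    return record

ad = "Sunny flat with 3 rooms located in Prague, 250 000 EUR."
print(extract(ad))  # {'price': '250 000', 'rooms': '3', 'city': 'Prague'}
```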
22

Anonymization of directory-structured sensitive data / Anonymisering av katalogstrukturerad känslig data

Folkesson, Carl January 2019 (has links)
Data anonymization is a relevant and important field within data privacy, which tries to find a good balance between utility and privacy in data. The field is especially relevant since the GDPR came into force, because the GDPR does not regulate anonymous data. This thesis focuses on anonymization of directory-structured data, meaning data organized into a tree of directories. Four of the most common models for anonymization of tabular data, k-anonymity, ℓ-diversity, t-closeness and differential privacy, are adapted to directory-structured data by means of three anonymization approaches: SingleTable, DirectoryWise and RecursiveDirectoryWise. The models and approaches are compared and evaluated using five metrics and three attack scenarios. The results show that there is always a trade-off between utility and privacy when anonymizing data. In particular, the differential privacy model combined with the RecursiveDirectoryWise approach gives the highest privacy but also the highest information loss. Conversely, the k-anonymity model with the SingleTable approach, or the t-closeness model with the DirectoryWise approach, gives the lowest information loss but also the lowest privacy. The differential privacy model with the RecursiveDirectoryWise approach was also shown to give the best protection against the chosen attacks, and was concluded to be the most suitable combination when anonymizing directory-structured data under the GDPR.
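For readers unfamiliar with these models, the following Python sketch shows the basic k-anonymity condition on a flattened table of records; it is only an illustration of the standard definition, not the thesis's SingleTable/DirectoryWise/RecursiveDirectoryWise machinery, and the record attributes and quasi-identifiers are invented:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    occurs in at least k records (the standard k-anonymity condition)."""
    groups = Counter(
        tuple(record[attr] for attr in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())

# Illustrative records; attribute names are made up for the example.
records = [
    {"zip": "583", "age": "30-40", "diagnosis": "flu"},
    {"zip": "583", "age": "30-40", "diagnosis": "cold"},
    {"zip": "581", "age": "20-30", "diagnosis": "flu"},
]
print(is_k_anonymous(records, ["zip", "age"], k=2))  # False: the (581, 20-30) group has size 1
```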
23

Internet-Scale Information Monitoring: A Continual Query Approach

Tang, Wei 08 December 2003 (has links)
Information monitoring systems are publish-subscribe systems that continuously track information changes and notify users (or programs acting on their behalf) of relevant updates according to specified thresholds. Internet-scale information monitoring presents a number of new challenges. First, automated change detection is harder when sources are autonomous and updates are performed asynchronously. Second, information source heterogeneity makes the problem of modelling and representing changes harder than ever. Third, efficient and scalable mechanisms are needed to handle a large and growing number of users and thousands or even millions of monitoring triggers fired at multiple sources. In this dissertation, we model users' monitoring requests as continual queries (CQs) and present a suite of efficient and scalable solutions to large-scale information monitoring over structured or semi-structured data sources. A CQ is a standing query that monitors information sources for interesting events (triggers) and notifies users when new information changes meet specified thresholds. We first present the system-level facilities for building an Internet-scale continual query system, including the design and development of two operational CQ monitoring systems, OpenCQ and WebCQ, the engineering issues involved, and our solutions. We then describe a number of research challenges that are specific to large-scale information monitoring and the techniques developed in the context of OpenCQ and WebCQ to address them. Example issues include how to efficiently process a large number of continual queries, what mechanisms are effective for building a scalable distributed trigger system capable of handling tens of thousands of triggers firing at hundreds of data sources, and how to disseminate fresh information to the right users at the right time. We have developed a suite of techniques to optimize the processing of continual queries, including an effective CQ grouping scheme, an auxiliary data structure to support group-based indexing of CQs, and a differential CQ evaluation algorithm (DRA). The third contribution is the design of an experimental evaluation model and testbed to validate the solutions; the evaluation uses both measurements on the real OpenCQ/WebCQ systems and a simulation-based approach. To our knowledge, the research documented in this dissertation is, to date, the first focused study of the research and engineering issues in building large-scale information monitoring systems using continual queries.
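A continual query can be pictured as a standing (query, trigger threshold, notification) triple that is re-evaluated whenever its source changes. The toy Python sketch below illustrates that idea only; it is not the OpenCQ/WebCQ implementation, and the source state, threshold semantics and callback names are assumptions:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ContinualQuery:
    """A standing query: re-evaluated on every source update, the user is
    notified only when the change crosses the specified threshold."""
    query: Callable[[Any], float]           # extracts the monitored value
    threshold: float                        # minimum change that is "interesting"
    notify: Callable[[float, float], None]  # called with (old, new) when triggered
    last_value: float = 0.0

    def on_update(self, source_state: Any) -> None:
        new_value = self.query(source_state)
        if abs(new_value - self.last_value) >= self.threshold:
            self.notify(self.last_value, new_value)
            self.last_value = new_value

# Example: monitor one numeric field and alert on changes of 5.0 or more.
cq = ContinualQuery(
    query=lambda state: state["price"],
    threshold=5.0,
    notify=lambda old, new: print(f"changed from {old} to {new}"),
)
cq.on_update({"price": 10.0})   # triggers: 0.0 -> 10.0
cq.on_update({"price": 12.0})   # no trigger: change of 2.0 < 5.0
```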
24

Parametric kernels for structured data analysis

Shin, Young-in 04 May 2015 (has links)
Structured representation of input physical patterns as a set of local features has been useful for a variety of robotics and human-computer interaction (HCI) applications, because it enables a stable interpretation of highly variable inputs. However, this representation does not fit conventional machine learning algorithms and distance metrics, which assume vector inputs, so learning from input patterns with variable structure is challenging. To address this problem, I propose a general and systematic method to design distance metrics between structured inputs that can be used in conventional learning algorithms. Based on the observation that the geometric distribution of local features over a physical pattern is stable across similar inputs, this is done by combining the local similarities with the conformity of the geometric relationships between local features. The resulting distance metrics, called "parametric kernels", are positive semi-definite and require almost linear time to compute. To demonstrate the general applicability and efficacy of this approach, I designed and applied parametric kernels to handwritten character recognition, on-line face recognition, and object detection from laser range finder sensor data. Parametric kernels achieve recognition rates competitive with state-of-the-art approaches in these tasks.
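The general idea of combining local feature similarity with geometric conformity can be illustrated with a generic sum kernel over sets of located features. This is a sketch only: the thesis's parametric kernels are constructed differently and run in near-linear rather than quadratic time, and all feature values below are invented.

```python
import math

def rbf(u, v, gamma=1.0):
    """Gaussian (RBF) similarity between two small numeric vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def set_kernel(X, Y, gamma_feat=1.0, gamma_pos=1.0):
    """Sum over all feature pairs of (descriptor similarity x positional
    similarity). A sum of products of PSD kernels is itself PSD."""
    total = 0.0
    for pos_x, feat_x in X:
        for pos_y, feat_y in Y:
            total += rbf(feat_x, feat_y, gamma_feat) * rbf(pos_x, pos_y, gamma_pos)
    return total

# Each input is a set of (position, descriptor) local features.
stroke_a = [((0.0, 0.0), (1.0, 0.2)), ((1.0, 0.5), (0.3, 0.9))]
stroke_b = [((0.1, 0.0), (0.9, 0.3)), ((1.1, 0.4), (0.4, 0.8))]
print(set_kernel(stroke_a, stroke_b))
```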
25

[en] A MODEL FOR EXPLORATION OF SEMI-STRUCTURED DATASETS / [pt] UM MODELO PARA EXPLORAÇÃO DE DADOS SEMIESTRUTURADOS

THIAGO RIBEIRO NUNES 05 February 2018 (has links)
Information exploration processes are usually recognized by their inherent complexity and by uncertainty and lack of knowledge concerning both the domain and the solution strategies. Even though there has been much work on computational systems supporting exploration tasks, such as faceted search and set-oriented interfaces, the lack of a formal understanding of the exploration process and the absence of a proper separation-of-concerns approach in the design phase are the cause of many expressivity issues and serious limitations. This work proposes a novel design approach for exploration tools based on a formal framework for representing exploration actions and processes. Moreover, we present a new exploration system that generalizes the majority of the state-of-the-art exploration tools. The evaluation of the proposed framework is guided by case studies and comparisons with state-of-the-art tools. The results show the relevance of our approach both for the design of new exploration tools with higher expressiveness and for formal assessments and comparisons between different tools.
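As a rough illustration of what set-oriented, faceted exploration actions look like (a sketch under assumed data, not the formal model proposed in the thesis; items and attribute names are invented), each action maps one result set to another, and an exploration process is a chain of such actions:

```python
# Toy item collection; attributes are invented for the example.
ITEMS = [
    {"id": 1, "genre": "jazz", "year": 1959},
    {"id": 2, "genre": "jazz", "year": 1970},
    {"id": 3, "genre": "rock", "year": 1970},
]

def refine(result_set, facet, value):
    """Faceted refinement: keep only items whose facet matches the value."""
    return [item for item in result_set if item[facet] == value]

def union(set_a, set_b):
    """Set-oriented composition of two intermediate result sets."""
    seen = {item["id"] for item in set_a}
    return set_a + [item for item in set_b if item["id"] not in seen]

# One possible exploration process over the collection.
jazz = refine(ITEMS, "genre", "jazz")
from_1970 = refine(ITEMS, "year", 1970)
print(union(jazz, from_1970))  # items with ids 1, 2 and 3
```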
26

A Data Layout Descriptor Language (LADEL).

Jeelani, Ashfaq Ahmed 01 May 2001 (has links) (PDF)
To transfer data between devices and main memory, standard C block I/O interfaces use block buffers of type char. C++ programs that perform block I/O commonly use typecasting to move data between structures and block buffers. The subject of this thesis, the layout description language (LADEL), represents a high-level solution to the problem of block buffer management. LADEL provides operators that hide the casting ordinarily required to pack and unpack buffers and that guard against overflow of the virtual fields. LADEL also allows a programmer to dynamically define a structured view of a block buffer's contents. This view includes variable-length field specifiers, which support the development of a general specification for an I/O block that optimizes the use of preset buffers. The need for optimizing buffer use arises in file-processing algorithms that perform optimally when I/O buffers are filled to capacity. Packing a buffer to capacity can require reasonably complex C++ code, and LADEL reduces this complexity to a considerable extent. C++ programs written using LADEL are less complex, easier to maintain, and easier to read than equivalent programs written without LADEL. This increase in maintainability is achieved at a cost of approximately 11% additional time in comparison to programs that use casting to manipulate block buffer data.
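The underlying idea of a declarative layout description over a raw byte buffer has a well-known analogue in Python's standard struct module. The sketch below is only that analogue, not LADEL itself (which is a C++-oriented language); the record layout and field names are invented:

```python
import struct

# Declarative layout of one fixed-size record inside a block buffer:
# a 4-byte little-endian int id, an 8-byte double, and a 16-byte name field.
RECORD_LAYOUT = struct.Struct("<i d 16s")

def pack_record(record_id: int, value: float, name: str) -> bytes:
    """Pack typed fields into raw bytes without any manual casting."""
    return RECORD_LAYOUT.pack(record_id, value, name.encode("utf-8")[:16])

def unpack_record(buffer: bytes):
    """Recover the typed view of the raw block buffer."""
    record_id, value, raw_name = RECORD_LAYOUT.unpack(buffer)
    return record_id, value, raw_name.rstrip(b"\0").decode("utf-8")

block = pack_record(42, 3.14, "example")
print(len(block), unpack_record(block))  # 28 (42, 3.14, 'example')
```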
27

Towards Efficient Data Analysis and Management of Semi-structured Data

Tatikonda, Shirish 08 September 2010 (has links)
No description available.
28

Automated Extraction of Data from Insurance Websites / Automatiserad Datautvinning från Försäkringssidor

Hodzic, Amar January 2022 (has links)
Websites have become a critical source of information for many organizations in today's digital era. However, extracting and organizing semi-structured data from web pages across multiple websites poses challenges, especially when a high level of automation is desired while maintaining generality. A natural progression in the quest for automation is to extend web data extraction methods from handling a single website to handling multiple ones, usually within the same domain. Although such websites share a domain, the structure of the data can vary greatly, so a key question is how general a system can be made to cover a large number of websites while maintaining adequate accuracy. This thesis examines the efficiency of automated web data extraction on multiple Swedish insurance company websites. Previous work showed that good results can be achieved on a known English data set containing web pages from a number of domains. The state-of-the-art model MarkupLM was chosen and fine-tuned with supervised learning, starting from two pre-trained models (one Swedish and one English), on a labeled training set of consumer car insurance web data; the model was then evaluated on websites that were not part of the training set (zero-shot with respect to unseen sites). The results show that such a model can achieve good accuracy at domain scale with Swedish as the source language, even with a relatively small data set, by leveraging pre-trained models.
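For orientation, the following is a minimal sketch of applying a pre-trained MarkupLM checkpoint to an HTML page via the Hugging Face transformers library. The checkpoint name, the three node labels, and the HTML snippet are illustrative assumptions and are not the models, label schema, or data used in the thesis:

```python
import torch
from transformers import MarkupLMProcessor, MarkupLMForTokenClassification

# Illustrative setup: checkpoint name and label count are assumptions.
processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base")
model = MarkupLMForTokenClassification.from_pretrained(
    "microsoft/markuplm-base", num_labels=3  # e.g. OTHER / PRICE / COVERAGE
)

html = "<html><body><h1>Car insurance</h1><p>Premium: 199 SEK/month</p></body></html>"

# The processor parses the HTML, extracts node texts and XPaths, and tokenizes.
encoding = processor(html, return_tensors="pt")
with torch.no_grad():
    outputs = model(**encoding)

predictions = outputs.logits.argmax(-1)
print(predictions.shape)  # one predicted label per token
```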
29

Semantisk interoperabilitet för hantering av XML

Lindgren, Ida, Norman, Isabelle January 2014 (has links)
Today, Business Analytics is becoming increasingly popular and is utilized by organizations to analyze data that supports decision-making. Business Analytics requires interoperability between the data sources used to gather and compile data for analysis, so that the data can be correctly interpreted. The aim of this study is therefore to investigate the possibility of creating an IT-artifact for querying several XML documents with varying structures in order to achieve semantic interoperability and thus enable Business Analytics. The structural differences considered in this report focus on cases where XML tags have been given different names that have essentially the same semantic meaning. The research strategy Design Science was used to create the solution, so the knowledge contribution is an IT-artifact. The IT-artifact is a proof of concept that demonstrates a possible implementation of a solution handling the semantic problems identified in this report. The result of the development is a flexible application that users can employ to gather data from XML files with different structures. This is made possible by letting the user create an ontology containing the tag names from the XML files. By using ontologies in this way, we show that it is possible to achieve semantic interoperability between XML files with different structures. The conclusion that can be drawn from the development of the IT-artifact is that it is possible to create a general solution for the identified problem.
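The core idea can be sketched in a few lines of Python: a small ontology maps differently named XML tags with the same meaning to one canonical concept, so documents with different structures can be queried uniformly. The ontology entries, tag names, and sample documents below are invented for illustration and are not the IT-artifact built in the study:

```python
import xml.etree.ElementTree as ET

# Assumed mini-ontology: tag names from different XML sources that share
# the same semantic meaning are mapped to one canonical concept.
ONTOLOGY = {
    "price": "price", "pris": "price", "cost": "price",
    "customer": "customer", "kund": "customer", "client": "customer",
}

def extract_concepts(xml_string: str) -> dict:
    """Collect text values keyed by canonical concept, regardless of tag names."""
    result = {}
    for element in ET.fromstring(xml_string).iter():
        concept = ONTOLOGY.get(element.tag.lower())
        if concept and element.text:
            result.setdefault(concept, []).append(element.text.strip())
    return result

doc_sv = "<order><kund>Anna</kund><pris>120</pris></order>"
doc_en = "<purchase><client>Bob</client><cost>95</cost></purchase>"
print(extract_concepts(doc_sv))  # {'customer': ['Anna'], 'price': ['120']}
print(extract_concepts(doc_en))  # {'customer': ['Bob'], 'price': ['95']}
```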
30

Supervised metric learning with generalization guarantees / Apprentissage supervisé de métriques avec garanties en généralisation

Bellet, Aurélien 11 December 2012 (has links)
In recent years, the crucial importance of metrics in machine learning algorithms has led to an increasing interest in optimizing distance and similarity functions using knowledge from training data to make them suitable for the problem at hand. This area of research is known as metric learning. Existing methods typically aim at optimizing the parameters of a given metric with respect to local constraints over the training sample. The learned metrics are generally used in nearest-neighbor and clustering algorithms. When data consist of feature vectors, a large body of work has focused on learning a Mahalanobis distance, which is parameterized by a positive semi-definite matrix; recent methods offer good scalability to large datasets. Less work has been devoted to metric learning from structured objects (such as strings or trees), because it often involves complex procedures. Most of that work has focused on optimizing a notion of edit distance, which measures (in terms of the number of operations) the cost of turning one object into another. We identify two important limitations of current supervised metric learning approaches. First, they make it possible to improve the performance of local algorithms such as k-nearest neighbors, but metric learning for global algorithms (such as linear classifiers) has not really been studied so far. Second, and perhaps more importantly, the question of the generalization ability of metric learning methods has been largely ignored. In this thesis, we propose theoretical and algorithmic contributions that address these limitations. Our first contribution is the derivation of a new kernel function built from learned edit probabilities. Unlike other string kernels, it is guaranteed to be valid and parameter-free. Our second contribution is a novel framework for learning string and tree edit similarities inspired by the recent theory of (epsilon,gamma,tau)-good similarity functions and formulated as a convex optimization problem. Using uniform stability arguments, we establish theoretical guarantees for the learned similarity that give a bound on the generalization error of a linear classifier built from that similarity. In our third contribution, we extend the same ideas to metric learning from feature vectors by proposing a bilinear similarity learning method that efficiently optimizes the (epsilon,gamma,tau)-goodness. The similarity is learned under global constraints that are more appropriate to linear classification. Generalization guarantees are derived for our approach, highlighting that our method minimizes a tighter bound on the generalization error of the classifier. Our last contribution is a framework for establishing generalization bounds for a large class of existing metric learning algorithms. It is based on a simple adaptation of the notion of algorithmic robustness and allows the derivation of bounds for various loss functions and regularizers.
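For reference, the two families of metrics mentioned in this abstract are conventionally written as below. These are the standard textbook formulations, not equations copied from the thesis:

```latex
% Mahalanobis distance, parameterized by a positive semi-definite matrix M
d_M(\mathbf{x}, \mathbf{x}') = \sqrt{(\mathbf{x} - \mathbf{x}')^{\top} M \,(\mathbf{x} - \mathbf{x}')}, \qquad M \succeq 0

% Bilinear similarity, parameterized by a matrix A (learned under global constraints)
K_A(\mathbf{x}, \mathbf{x}') = \mathbf{x}^{\top} A \, \mathbf{x}'
```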
