Global ETD Search

1	DescribeX: A Framework for Exploring and Querying XML Web Collections Rizzolo, Flavio Carlos 26 February 2009 (has links) The nature of semistructured data in web collections is evolving. Even when XML web documents are valid with regard to a schema, the actual structure of such documents exhibits significant variations across collections for several reasons: an XML schema may be very lax (e.g., to accommodate the flexibility needed to represent collections of documents in RSS feeds), a schema may be large and different subsets used for different documents (e.g., this is common in industry standards like UBL), or open content models may allow arbitrary schemas to be mixed (e.g., RSS extensions like those used for podcasting). A schema alone may not provide sufficient information for many data management tasks that require knowledge of the actual structure of the collection. Web applications (such as processing RSS feeds or web service messages) rely on XPath-based data manipulation tools. Web developers need to use XPath queries effectively on increasingly larger web collections containing hundreds of thousands of XML documents. Even when tasks only need to deal with a single document at a time, developers benefit from understanding the behaviour of XPath expressions across multiple documents (e.g., what will a query return when run over the thousands of hourly feeds collected during the last few months?). Dealing with the (highly variable) structure of such web collections poses additional challenges. This thesis introduces DescribeX, a powerful framework that is capable of describing arbitrarily complex XML summaries of web collections, providing support for more efficient evaluation of XPath workloads. DescribeX permits the declarative description of document structure using all axes and language constructs in XPath, and generalizes many of the XML indexing and summarization approaches in the literature. 
DescribeX supports the construction of heterogeneous summaries where different document elements sharing a common structure can be declaratively defined and refined by means of path regular expressions on axes, or axis path regular expressions (AxPREs). DescribeX can significantly help in the understanding of both the structure of complex, heterogeneous XML collections and the behaviour of XPath queries evaluated on them. Experimental results demonstrate the scalability of DescribeX summary refinements and stabilizations (the key enablers for tailoring summaries) with multi-gigabyte web collections. A comparative study suggests that using a DescribeX summary created from a given workload can produce query evaluation times orders of magnitude better than using existing summaries. DescribeX’s lightweight approach of combining summaries with a file-at-a-time XPath processor can be a very competitive alternative, in terms of performance, to conventional fully-fledged XML query engines that provide DB-like functionality such as security, transaction processing, and native storage. XML Summaries Framework Semistructured Web XPath 0984
3	A formal framework for linguistic tree query Lai, Catherine Unknown Date (has links) (PDF) The analysis of human communication, in all its forms, increasingly depends on large collections of texts and transcribed recordings. These collections, or corpora, are often richly annotated with structural information. These datasets are extremely large, so manual analysis is only successful up to a point. As such, significant effort has recently been invested in automatic techniques for extracting and analyzing these massive data sets. However, further progress on analytical tools is confronted by three major challenges. First, we need the right data model. Second, we need to understand the theoretical foundations of query languages on that data model. Finally, we need to know the expressive requirements for a general-purpose query language with respect to linguistics. This thesis has addressed all three of these issues. / Specifically, this thesis studies formalisms used by linguists and database theorists to describe tree-structured data, namely propositional dynamic logic and monadic second-order logic. These formalisms have been used to reason about a number of tree querying languages and their applicability to the linguistic tree query problem. We identify a comprehensive set of linguistic tree query requirements and the level of expressiveness needed to implement them. The main result of this study is that the required level of expressiveness of linguistic tree query is that of the first-order predicate calculus over trees. / This formal approach has resulted in a convergence between two seemingly disparate fields of study. Further work in the intersection of linguistics and database theory should also pave the way for theoretically well-founded future work in this area. This, in turn, will lead to better tools for linguistic analysis and data management, and more comprehensive theories of human language.
4	Probabilistic Databases and Their Applications Zhao, Wenzhong 01 January 2004 (has links) Probabilistic reasoning in databases has been an active area of research during the last two decades. However, the previously proposed database approaches, including the probabilistic relational approach and the probabilistic object approach, are not good fits for storing and managing diverse probability distributions along with their auxiliary information. The work in this dissertation extends significantly the initial semistructured probabilistic database framework proposed by Dekhtyar, Goldsmith and Hawkes in [20]. We extend the formal Semistructured Probabilistic Object (SPO) data model of [20]. Accordingly, we also extend the Semistructured Probabilistic Algebra (SP-algebra), the query algebra proposed for the SPO model. Based on the extended framework, we have designed and implemented a Semistructured Probabilistic Database Management System (SPDBMS) on top of a relational DBMS. The SPDBMS is flexible enough to meet the need of storing and manipulating diverse probability distributions along with their associated information. Its query language supports standard database queries as well as queries specific to probabilities, such as conditionalization and marginalization. Currently the SPDBMS serves as a storage backbone for the project Decision Making and Planning under Uncertainty with Constraints, that involves managing large quantities of probabilistic information. (This project is partially supported by the National Science Foundation under Grant No. ITR-0325063.) We also report our experimental results evaluating the performance of the SPDBMS. We describe an extension of the SPO model for handling interval probability distributions. The Extended Semistructured Probabilistic Object (ESPO) framework improves the flexibility of the original semistructured data model in two important features: (i) support for interval probabilities and (ii) association of context and conditionals with individual random variables. 
An extended SPO (ESPO) data model has been developed, and an extended query algebra for ESPO has also been introduced to manipulate probability distributions for probability intervals. The Bayesian Network Development Suite (BaNDeS), a system which builds Bayesian networks with full data management support of the SPDBMS, has been described. It allows experts with particular expertise to work only on specific subsystems during the Bayesian network construction process independently and asynchronously while updating the model in real-time. There are three major foci of our ongoing and future work: (1) implementation of a query optimizer and performance evaluation of query optimization, (2) extension of the SPDBMS to handle interval probability distributions, and (3) incorporation of machine learning techniques into the BaNDeS.
5	SEMISTRUCTURED PROBABILISTIC OBJECT QUERY LANGUAGE (A Query Language for Semistructured Probabilistic Data) Gutti, Praveen 01 January 2007 (has links) This work presents SPOQL, a structured query language for the Semistructured Probabilistic Object (SPO) model [4]. The original query language for the semistructured probabilistic database management system [20], SP-Algebra [4], has limitations such as complex functional notation and unfamiliarity to application programmers. SPOQL alleviates these problems by providing a user-friendly and familiar SQL-like declarative syntax for writing queries against the SPDBMS. We show that parsing SPOQL queries is a more involved task than parsing SQL queries. We describe the evaluation algorithm for SPOQL queries that we have implemented.
6	Querying Large Collections of Semistructured Data Kamali, Shahab 05 September 2013 (has links) An increasing amount of data is published as semistructured documents formatted with presentational markup. Examples include data objects such as mathematical expressions encoded with MathML or web pages encoded with XHTML. Our intention is to improve the state of the art in retrieving, manipulating, or mining such data. We focus first on mathematics retrieval, which is appealing in various domains, such as education, digital libraries, engineering, patent documents, and medical sciences. Capturing the similarity of mathematical expressions also greatly enhances document classification in such domains. Unlike text retrieval, where keywords carry enough semantics to distinguish text documents and rank them, math symbols do not contain much semantic information on their own. Unfortunately, considering the structure of mathematical expressions to calculate relevance scores of documents results in ranking algorithms that are computationally more expensive than the typical ranking algorithms employed for text documents. As a result, current math retrieval systems either limit themselves to exact matches, or they ignore the structure completely; they sacrifice either recall or precision for efficiency. We propose instead an efficient end-to-end math retrieval system based on a structural similarity ranking algorithm. We describe novel optimization techniques to reduce the index size and the query processing time. Thus, with the proposed optimizations, mathematical contents can be fully exploited to rank documents in response to mathematical queries. We demonstrate the effectiveness and the efficiency of our solution experimentally, using a special-purpose testbed that we developed for evaluating math retrieval systems. We finally extend our retrieval system to accommodate rich queries that consist of combinations of math expressions and textual keywords. 
As a second focal point, we address the problem of recognizing structural repetitions in typical web documents. Most web pages use presentational markup standards, in which the tags control the formatting of documents rather than semantically describing their contents. Hence, their structures typically contain more irregularities than descriptive (data-oriented) markup languages. Even though applications would greatly benefit from a grammar inference algorithm that captures structure to make it explicit, the existing algorithms for XML schema inference, which target data-oriented markup, are ineffective in inferring grammars for web documents with presentational markup. There is currently no general-purpose grammar inference framework that can handle irregularities commonly found in web documents and that can operate with only a few examples. Although inferring grammars for individual web pages has been partially addressed by data extraction tools, the existing solutions rely on simplifying assumptions that limit their application. Hence, we describe a principled approach to the problem by defining a class of grammars that can be inferred from very small sample sets and can capture the structure of most web documents. The effectiveness of this approach, together with a comparison against various classes of grammars including DTDs and XSDs, is demonstrated through extensive experiments on web documents. We finally use the proposed grammar inference framework to extend our math retrieval system and to optimize it further. Mathematics retrieval Search Semistructured data Query Language XML Grammar inference Computer Science
8	The role of Ulwaluko in the construction of masculinity in men at the University of the Western Cape Magodyo, Tapiwa C. January 2013 (has links) Magister Artium (Psychology) - MA(Psych) / Ulwaluko is a Xhosa word that refers to male circumcision, an initiation ritual performed to transform boys into men. The ritual is supposed to instill good moral and social values. Research has demonstrated that the practice of Ulwaluko has undergone many changes, primarily because of urbanization, acculturation and the emergence of back-door circumcision schools, amongst other things. This has culminated in instances of moral decline such as criminal activity, drug abuse, risky sexual behaviour and inhumane behaviour among some of the initiates. There has been a recent upsurge in research on Ulwaluko in South Africa. However, lacking in this body of scholarship is a focus on how Ulwaluko constructs masculinities. This served as the motivation for my study. Given the above, my study explored the role of Ulwaluko in the construction of masculinity in men at the University of the Western Cape (UWC). Hegemonic masculinity (Connell, 1994; Connell & Messerschmidt, 2005) was used as a theoretical framework conceptualizing this study. The study utilised a qualitative framework and data was collected using in-depth semi-structured interviews. Seven participants, aged from 19 to 32, consented to be part of the study. These were recruited using purposive sampling. The ethical considerations of the study adhered to the guidelines stipulated by UWC. Data was transcribed, and analysed using thematic decomposition analysis. The findings of this study indicate that Ulwaluko constructs masculinity in hegemonic ways. Through hegemony it establishes, maintains and retains control over young men, boys and women. It constructs an idealised masculine identity that is morally upright, faced with ritual challenges and burdened by a prescriptive set of masculine role expectations. 
This study also shows the self-reflexive, critical and imaginative engagement by men as they negotiated Ulwaluko's ideal masculinity. Such contestations resulted in the creation of rival masculinities. It also demonstrates how subject position(s) impact understandings and constructions of masculinities. This study provided a richer and more nuanced contextual understanding of the psychosocial realities of men who underwent Ulwaluko. Ulwaluko masculinities social constructionism hegemonic masculinity Xhosa men understandings of masculinity qualitative exploratory study purposive sampling semistructured interviews thematic decomposition analysis
9	Efficient Storage and Domain-Specific Information Discovery on Semistructured Documents Farfan, Fernando R 12 November 2009 (has links) The increasing amount of available semistructured data demands efficient mechanisms to store, process, and search an enormous corpus of data to encourage its global adoption. Current techniques to store semistructured documents either map them to relational databases, or use a combination of flat files and indexes. These two approaches result in a mismatch between the tree-structure of semistructured data and the access characteristics of the underlying storage devices. Furthermore, the inefficiency of XML parsing methods has slowed down the large-scale adoption of XML into actual system implementations. The recent development of lazy parsing techniques is a major step towards improving this situation, but lazy parsers still have significant drawbacks that undermine the massive adoption of XML. Once the processing (storage and parsing) issues for semistructured data have been addressed, another key challenge to leverage semistructured data is to perform effective information discovery on such data. Previous works have addressed this problem in a generic (i.e. domain independent) way, but this process can be improved if knowledge about the specific domain is taken into consideration. This dissertation had two general goals: The first goal was to devise novel techniques to efficiently store and process semistructured documents. This goal had two specific aims: We proposed a method for storing semistructured documents that maps the physical characteristics of the documents to the geometrical layout of hard drives. We developed a Double-Lazy Parser for semistructured documents which introduces lazy behavior in both the pre-parsing and progressive parsing phases of the standard Document Object Model's parsing mechanism. 
The second goal was to construct a user-friendly and efficient engine for performing Information Discovery over domain-specific semistructured documents. This goal also had two aims: We presented a framework that exploits domain-specific knowledge to improve the quality of the information discovery process by incorporating domain ontologies. We also proposed meaningful evaluation metrics to compare the results of search systems over semistructured documents. Semistructured documents XML storage parsing information retrieval semisequential access lazy parsing ontologies Data Storage Systems Other Computer Engineering
10	Informovanost osob bez domova o dostupných sociálních službách / Awareness of homeless people about available social services Petrová, Angelika Nelly January 2021 (has links) The presented diploma thesis focuses on the awareness of homeless people about available social services that could help improve their current life situation. The main goal of this work is to find out which social services for the homeless these people know, which of them they have used, what their experience with them has been, and whether, in their opinion, these services can lead to a return to the majority society or only help the homeless to survive. This is a case study that took place in the capital city of Prague. Data were collected in the form of semi-structured in-depth interviews directly with homeless people. A total of 29 interviews took place. In the theoretical part, the diploma thesis first describes the definition of homelessness, possible causes of homelessness and the social system of the Czech Republic. In the methodological part of the work, I describe the target population, the technique of selecting informants, the method of data collection, the advantages and disadvantages of semi-structured interviews, the method of data processing and analysis, and the ethics of research. This is followed by a presentation of the results, their analysis, answering research questions, discussion of the methods used, reflection on the shortcomings of my own work, and recommendations...
