  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
361

Reengineering PhysNet in the uPortal framework

Zhou, Ye 11 July 2003 (has links)
A Digital Library (DL) is an electronic information storage system focused on meeting the information-seeking needs of its constituents. Because modern DLs often track the latest technological progress in many fields, interoperability among DLs is often hard to achieve. With the advent of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and Open Digital Libraries (ODL), lightweight protocols show a promising future in promoting DL interoperability. Furthermore, a DL is envisaged as a network of independent components working collaboratively through simple standardized protocols. Prior work with ODL shows the feasibility of building componentized DLs with techniques that are a precursor to web services designs. In our study, we elaborate on the feasibility of applying web services to DL design. DL services are modeled as a set of web services offering information dissemination through the Simple Object Access Protocol (SOAP). Additionally, a flexible DL user interface assembly framework is offered in order to build DLs with customizations and personalizations. Our hypothesis is proven and demonstrated in the PhysNet reengineering project. / Master of Science
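Since the interoperability layer above rests on OAI-PMH, a minimal harvesting sketch may make the protocol concrete. The endpoint URL and record identifiers below are hypothetical; only the `ListRecords` verb, the `metadataPrefix` argument, and the response structure come from the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# OAI-PMH 2.0 response elements live in this XML namespace.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def build_listrecords_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Construct an OAI-PMH ListRecords request URL (protocol v2.0)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)

def parse_identifiers(xml_text):
    """Extract record identifiers from a ListRecords response document."""
    root = ET.fromstring(xml_text)
    return [h.findtext(OAI_NS + "identifier")
            for h in root.iter(OAI_NS + "header")]
```

A harvester would fetch the URL returned by `build_listrecords_url`, feed the body to `parse_identifiers`, and then follow resumption tokens for large result sets (omitted here for brevity).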
362

Statistical Learning for Sequential Unstructured Data

Xu, Jingbin 30 July 2024 (has links)
Unstructured data, such as texts, human behavior status, and system logs, cannot be organized into predefined structures and is often presented in a sequential format with inherent dependencies. Probabilistic models are commonly used to capture these dependencies in the data generation process through latent parameters, and can naturally extend into hierarchical forms. However, these models rely on the correct specification of assumptions about the sequential data generation process, which often limits their scalable learning abilities. The emergence of neural network tools has enabled scalable learning for high-dimensional sequential data. From an algorithmic perspective, efforts are directed towards reducing dimensionality and representing unstructured data units as dense vectors in low-dimensional spaces, learned from unlabeled data, a practice often referred to as numerical embedding. While these representations offer measures of similarity, automated generalizations, and semantic understanding, they frequently lack the statistical foundations required for explicit inference. This dissertation aims to develop statistical inference techniques tailored for the analysis of unstructured sequential data, with applications in the field of transportation safety. The first part of the dissertation presents a two-stage method. It adopts numerical embedding to map large-scale unannotated data into numerical vectors. Subsequently, a kernel test using maximum mean discrepancy is employed to detect abnormal segments within a given time period. Theoretical results showed that learning from the numerical vectors is equivalent to learning directly from the raw data. A real-world example illustrates how mismatched driver visual behavior occurred during a lane change. The second part of the dissertation introduces a two-sample test for comparing text generation similarity.
The hypothesis tested is whether the probabilistic mapping measures that generate textual data are identical for two groups of documents. The proposed test compares the likelihood of text documents, estimated through neural network-based language models under the autoregressive setup. The test statistic is derived from an estimation and inference framework that first approximates the data likelihood with an estimation set before performing inference on the remaining part. The theoretical result indicates that the test statistic's asymptotic behavior approximates a normal distribution under mild conditions. Additionally, a multiple data-splitting strategy is utilized, combining p-values into a unified decision to enhance the test's power. The third part of the dissertation develops a method to measure differences in text generation between a benchmark dataset and a comparison dataset, focusing on word-level generation variations. This method uses the sliced-Wasserstein distance to compute a contextual discrepancy score. A resampling method establishes a threshold to screen the scores. Crash report narratives are analyzed to compare crashes involving vehicles equipped with level 2 advanced driver assistance systems and those involving human drivers. / Doctor of Philosophy / Unstructured data, such as texts, human behavior records, and system logs, cannot be neatly organized. This type of data often appears in sequences with natural connections. Traditional methods use models to understand these connections, but these models depend on specific assumptions, which can limit their effectiveness. New tools using neural networks have made it easier to work with large and complex data. These tools help simplify data by turning it into smaller, manageable pieces, a process known as numerical embedding. While this helps in understanding the data better, such representations often lack the statistical foundation required for subsequent inferential analysis.
This dissertation aims to develop statistical inference techniques for analyzing unstructured sequential data, focusing on transportation safety. The first part of the dissertation introduces a two-step method. First, it transforms large-scale unorganized data into numerical vectors. Then, it uses a statistical test to detect unusual patterns over a period. For example, it can identify when a driver's visual behavior does not properly align with the driving attention demand during lane changes. The second part of the dissertation presents a method to compare the similarity of text generation. It tests whether the way texts are generated is the same for two groups of documents. This method uses neural network-based models to estimate the likelihood of text documents. Theoretical results show that as more data are observed, the distribution of the test statistic gets closer to the desired distribution under certain conditions. Additionally, combining multiple data splits improves the test's power. The third part of the dissertation constructs a score to measure differences in text generation processes, focusing on word-level differences. This score is based on a specific distance measure. To check that the difference is not a false discovery, a screening threshold is established using a resampling technique. If the score exceeds the threshold, the difference is considered significant. An application of this method compares crash reports from vehicles with advanced driver assistance systems to those from human-driven vehicles.
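The kernel two-sample test described in the first part can be sketched as follows, assuming the documents or behavior records have already been embedded as numerical vectors. This is a generic biased MMD estimate with a Gaussian kernel plus a permutation calibration, not the dissertation's exact procedure; the kernel width, sample data, and permutation count are illustrative.

```python
import math
import random

def rbf(u, v, gamma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq)

def mmd2(X, Y, gamma=1.0):
    """Biased estimate of squared maximum mean discrepancy between samples."""
    kxx = sum(rbf(a, b, gamma) for a in X for b in X) / (len(X) ** 2)
    kyy = sum(rbf(a, b, gamma) for a in Y for b in Y) / (len(Y) ** 2)
    kxy = sum(rbf(a, b, gamma) for a in X for b in Y) / (len(X) * len(Y))
    return kxx + kyy - 2.0 * kxy

def permutation_pvalue(X, Y, n_perm=200, gamma=1.0, seed=0):
    """Calibrate the observed MMD by re-splitting the pooled sample."""
    rng = random.Random(seed)
    observed = mmd2(X, Y, gamma)
    pooled = X + Y
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mmd2(pooled[:len(X)], pooled[len(X):], gamma) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)  # add-one correction avoids p = 0
```

A small p-value indicates the two embedded samples (for instance, visual-behavior segments before and during a lane change) were unlikely to come from the same distribution.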
363

Getransformeer : van jeugverhaal tot dramateks / J.J. de Beer

De Beer, Judith Jacoba January 2003 (has links)
This research comprises a comparative examination of the transformation of Afrikaans and Dutch youth narratives into drama texts. Attention has been paid to the story elements embodied in various narratives and dramas, and, in addition, to aspects related to narrative and drama. By means of the comparison of the constants and variants with regard to the four texts, the possibility of creating a transformation model has been examined. The transformation model derived from the research, is applicable, firstly, to the narratives and drama texts upon which this study has been based. It is therefore presented as a conception for the conversion of a narrative text into a drama text, but the uniqueness of each separate narrative is taken into consideration; hence the model is not prescriptive, and it is assumed that the model may be adjusted in line with each adaptation. The comparison is effected between Afrikaans and Dutch texts, in view of the existence in the Low Countries of an established culture of bookshops, publishers and theatrical companies, focused on youth literature and theatre. Some publishers and bookshops, moreover, exclusively publish and sell youth narratives and dramas. Theatre productions aimed at children and young adults are plentiful, and attract a large percentage of young people. Should the fact that some theatres specialise in youth theatre productions be taken into account, also, the contrast and the gaps pertaining to the Afrikaans literary system are marked. The research in respect of the transformation of prose texts into drama texts has identified those procedures employed to adapt the narrative aspects (narrator, focalization, character, event, time and space) in such a way that it is reconcilable with the unique nature of the dramatic aspects (didascalia, dialogue, character, action, time and space). 
By virtue of the transformation of youth narratives into drama texts (with the purpose of the eventual performance thereof), the adolescent reader is made aware in a different manner of the value of narrative. / Thesis (M.A. (Afrikaans and Dutch))--Potchefstroom University for Christian Higher Education, 2003
365

The ’tail’ of Alice’s tale : A case study of Swedish translations of puns in Alice’s Adventures in Wonderland

My, Linderholt January 2016 (has links)
This study investigates the use of different strategies for translating puns in Alice’s Adventures in Wonderland. The material chosen for this study consists of the two Swedish translations by Nonnen (1870/1984) and Westman (2009). Six puns were selected for the analysis, which relies largely on Delabastita’s (1996) eight strategies for translating puns and Newmark’s (1988) translation methods. The analysis shows that Westman empathises with the readers of the TT while Nonnen empathises with the ST. This entails that Westman tends to use a more ‘free’ translation and is more inclined to adapt the ST puns to make them more visible for the readership of the TT. The priority for Nonnen, on the other hand, is to remain faithful to the contextual meaning of the ST. Paradoxically, to be faithful to the ST does not necessarily entail that the translator respects the semantic aspects of the ST, but rather that they adapt the culture of the ST to better fit the cultural and linguistic framework of the TL. Since Westman adapts the ST puns so that they are still recognised by the reader of the TT, her translation appears to be more suitable for the TL readership than Nonnen’s.
366

The Ancient Egyptian Demonology Project

Weber, Felicitas 20 April 2016 (has links) (PDF)
“The Ancient Egyptian Demonology Project: Second Millennium BCE” was intended and funded as a three-year project (2013-2016) to explore the world of Ancient Egyptian demons in the 2nd millennium BC. It intends to create a classification and ontology of benevolent and malevolent demons. Although ancient Egyptians did not use a specific term denoting “demons”, liminal beings known from various other cultures, such as δαίμονες, ghosts, angels, Mischwesen, genies, etc., were nevertheless described in texts and illustrations. The project aims to collect philological, iconographical and archaeological evidence to understand the religious beliefs, practices, interactions and knowledge not only of the ancient Egyptians’ daily life but also of their perception of the afterlife. To date, scholars as well as interested laymen have had no resource to consult for specific examples of those beings, except for rather general encyclopaedias that include all kinds of divine beings, or the ongoing Iconography of Deities and Demons (IDD) project. Neither, however, provides a searchable platform for both texts and images. The database created by the Demonology Project: 2K is designed to remedy this gap. The idea is to provide scholars and the public with a database that allows statistical analyses and innovative data visualisation, accessible and augmentable from all over the world, to stimulate dialogue and open communication not only within Egyptology but also with neighbouring disciplines. For the time-span of the three-year project, a pilot database was planned as a foundation for further data collection and analysis. The data chosen date to the 2nd millennium BCE and originate from objects of daily life (headrests and ivory wands) as well as from objects related to the afterlife (coffins and ‘Book of the Dead’ manuscripts). This material, connected by its religious purposes, nevertheless provides a cross-section through ancient Egyptian religious practice.
The project is funded by the Leverhulme Trust and includes Kasia Szpakowska (director), who supervises the work of the two participating PhD students in Egyptology. The project does not include funds for computer scientists or specialists in digital humanities. Therefore, the database is designed, developed and populated by the members of the team only. The focus of my presentation will be the structure of the database, which faces the challenge of including both textual and iconographical evidence. I will explain the organisation of the data, search patterns, and the opportunities for their visualisation and possible research outcomes. Furthermore, I will discuss the potential the database already possesses, and might generate in the future, for scholars and the public alike. Since the evidence belongs to numerous collections from all over the world, I would like to address the problems of intellectual property and copyright, together with the solution we pursue for releasing the database for registered usage onto the internet.
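As one way to picture a database holding both textual and iconographical evidence, here is a minimal relational sketch using SQLite. All table and column names are my assumptions for illustration, not the project's actual schema.

```python
import sqlite3

def build_schema(conn):
    """Create a minimal (hypothetical) schema linking demon entities to
    textual and iconographic attestations on physical source objects."""
    conn.executescript("""
        CREATE TABLE demon (
            id         INTEGER PRIMARY KEY,
            name       TEXT,     -- conventional designation
            benevolent INTEGER   -- 1 benevolent, 0 malevolent, NULL unknown
        );
        CREATE TABLE source_object (
            id          INTEGER PRIMARY KEY,
            object_type TEXT,    -- e.g. 'headrest', 'ivory wand', 'coffin'
            collection  TEXT     -- holding institution
        );
        CREATE TABLE attestation (
            id            INTEGER PRIMARY KEY,
            demon_id      INTEGER REFERENCES demon(id),
            object_id     INTEGER REFERENCES source_object(id),
            evidence_kind TEXT   -- 'text' or 'image'
        );
    """)

def demons_with_both_kinds(conn):
    """Names of demons attested both textually and iconographically."""
    rows = conn.execute("""
        SELECT d.name FROM demon d
        WHERE EXISTS (SELECT 1 FROM attestation a
                      WHERE a.demon_id = d.id AND a.evidence_kind = 'text')
          AND EXISTS (SELECT 1 FROM attestation a
                      WHERE a.demon_id = d.id AND a.evidence_kind = 'image')
        ORDER BY d.name
    """)
    return [r[0] for r in rows]
```

Joining both evidence kinds through a single attestation table is what makes cross-cutting queries (and the statistical analyses the abstract mentions) straightforward.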
367

Multi Domain Semantic Information Retrieval Based on Topic Model

Lee, Sanghoon 07 May 2016 (has links)
Over the last decades, there have been remarkable shifts in the area of Information Retrieval (IR) as a huge amount of information is increasingly accumulated on the Web. This gigantic information explosion increases the need for discovering new tools that retrieve meaningful knowledge from various complex information sources. Thus, techniques primarily used to search and extract important information from numerous database sources have been a key challenge in current IR systems. Topic modeling is one of the most recent techniques that discover hidden thematic structures from large data collections without human supervision. Several topic models have been proposed in various fields of study and have been utilized extensively for many applications. Latent Dirichlet Allocation (LDA) is the most well-known topic model that generates topics from large corpora of resources, such as text, images, and audio. It has been widely used in many areas of information retrieval and data mining, providing an efficient way of identifying latent topics among document collections. However, LDA has a drawback: topic cohesion within a concept is attenuated when estimating infrequently occurring words. Moreover, LDA seems not to consider the meaning of words, but rather to infer hidden topics based on a statistical approach. As a result, LDA can cause either a reduction in the quality of topic words or an increase in loose relations between topics. In order to solve these problems, we propose a domain-specific topic model that combines domain concepts with LDA. Two domain-specific algorithms are suggested for solving the difficulties associated with LDA. The main strength of our proposed model comes from the fact that it narrows semantic concepts from broad domain knowledge to a specific one, which solves the unknown-domain problem. Our proposed model is extensively tested on various applications (query expansion, classification, and summarization) to demonstrate its effectiveness. Experimental results show that the proposed model significantly increases the performance of these applications.
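To make the LDA inference loop concrete, here is a minimal collapsed Gibbs sampler for vanilla LDA in pure Python. It is a generic textbook sketch, not the domain-specific model proposed above; the hyperparameters, iteration count, and toy corpus are illustrative.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for vanilla LDA over tokenized documents.
    Returns per-token topic assignments and topic-word count tables."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})            # vocabulary size
    n_dk = [[0] * n_topics for _ in docs]            # doc-topic counts
    n_kw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    n_k = [0] * n_topics                             # tokens per topic
    z = []                                           # topic of each token
    for d, doc in enumerate(docs):                   # random initialization
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k)
            n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
        z.append(zd)
    for _ in range(n_iter):                          # resample each token
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                          # remove current assignment
                n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
                # full conditional: (doc-topic prior) * (topic-word likelihood)
                weights = [(n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                           / (n_k[t] + beta * V) for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k                          # record new assignment
                n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
    return z, n_kw
```

The drawback the abstract describes is visible in the full conditional: for a rare word `w`, `n_kw[t][w]` stays near zero for every topic, so its assignment is dominated by the smoothing term `beta` rather than by any semantic signal.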
368

New data-driven approaches to text simplification

Štajner, Sanja January 2016 (has links)
No description available.
369

Word based off-line handwritten Arabic classification and recognition : design of automatic recognition system for large vocabulary offline handwritten Arabic words using machine learning approaches

AlKhateeb, Jawad Hasan Yasin January 2010 (has links)
The design of a machine which reads unconstrained words still remains an unsolved problem. For example, automatic interpretation of handwritten documents by a computer is still under research. Most systems attempt to segment words into letters and read words one character at a time. However, segmenting handwritten words is very difficult, so to avoid this, words are treated as a whole. This research investigates a number of features computed from whole words for the recognition of handwritten words in particular. Arabic text classification and recognition is a complicated process compared to Latin and Chinese text recognition systems. This is due to the cursive nature of Arabic text. The work presented in this thesis is proposed for word-based recognition of handwritten Arabic scripts. This work is divided into three main stages to provide a recognition system. The first stage is pre-processing, which applies efficient pre-processing methods that are essential for automatic recognition of handwritten documents. In this stage, techniques for detecting the baseline and segmenting words in handwritten Arabic text are presented. Then connected components are extracted, and distances between different components are analyzed. The statistical distribution of these distances is then obtained to determine an optimal threshold for word segmentation. The second stage is feature extraction. This stage makes use of the normalized images to extract features that are essential in recognizing the images. Various methods of feature extraction are implemented and examined. The third and final stage is classification. Various classifiers are used, such as the k-nearest neighbour classifier (k-NN), the neural network classifier (NN), hidden Markov models (HMMs), and the dynamic Bayesian network (DBN). To test this concept, the particular pattern recognition problem studied is the classification of 32,492 words using the IFN/ENIT database.
The results were promising and very encouraging in terms of improved baseline detection and word segmentation for further recognition. Moreover, several feature subsets were examined, and a best recognition performance of 81.5% was achieved.
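The threshold-based word segmentation step can be sketched as follows. The thesis derives its threshold from the statistical distribution of inter-component distances; the Otsu-style between-class-variance criterion below is one concrete stand-in for that idea, not the author's exact method, and the component coordinates are invented.

```python
def otsu_threshold(gaps):
    """Pick the gap threshold maximizing between-class variance over the
    observed inter-component distances (small intra-word gaps vs. large
    inter-word gaps)."""
    values = sorted(set(gaps))
    best_t, best_score = values[0], -1.0
    for t in values[:-1]:
        left = [g for g in gaps if g <= t]
        right = [g for g in gaps if g > t]
        if not left or not right:
            continue
        w1, w2 = len(left) / len(gaps), len(right) / len(gaps)
        m1, m2 = sum(left) / len(left), sum(right) / len(right)
        score = w1 * w2 * (m1 - m2) ** 2   # between-class variance
        if score > best_score:
            best_t, best_score = t, score
    return best_t

def segment_words(component_positions, threshold):
    """Group connected components into words: a horizontal gap larger than
    the threshold starts a new word. Positions are (start, end) x-intervals,
    sorted along the writing line."""
    words, current = [], [component_positions[0]]
    for prev, cur in zip(component_positions, component_positions[1:]):
        if cur[0] - prev[1] > threshold:
            words.append(current)
            current = []
        current.append(cur)
    words.append(current)
    return words
```

In a full pipeline the gap sample would come from the connected-component analysis of many text lines, so the learned threshold adapts to the writer's spacing habits.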
370

Mathematical modelling of some aspects of stressing a Lithuanian text / Kai kurių lietuvių kalbos teksto kirčiavimo aspektų matematinis modeliavimas

Anbinderis, Tomas 02 July 2010 (has links)
The present dissertation deals with one of the components of a speech synthesizer, the automatic stressing of text, and two related tasks: disambiguation of homographs (words that can be stressed in several ways) and the search for clitics (unstressed words). A method which uses decision trees to find sequences of letters that unambiguously define a word's stressing was applied to stress Lithuanian text. The decision trees were created using a large corpus of stressed words. Stressing rules based on sequences of letters at the beginning, end and middle of a word have been formulated. The algorithm proposed reaches an accuracy of about 95.5%. The homograph disambiguation algorithm proposed by the author is based on frequencies of lexemes and morphological features obtained from a corpus containing about one million words. Such methods had not been used for the Lithuanian language so far. The proposed algorithm selects the correct variant of stressing with an accuracy of 85.01%. In addition, the author proposes four types of methods to search for clitics in a Lithuanian text: methods based on recognising combinational forms, on the statistical stressed/unstressed frequency of a word, on grammar rules, and on the stressing of adjacent words. It is explained how to unite all the methods into a single algorithm. On the testing data, an error rate of 4.1% was obtained among all words, and the ratio of errors to unstressed words was 18.8%. / [Lithuanian summary, in translation:] The dissertation examines one of the components of a speech synthesizer, the automatic stressing of text, together with other stress-related tasks: the disambiguation of homographs (words written identically but pronounced differently) and the search for clitics (unstressed words that lean on an adjacent word). For text stressing, a method was adapted that uses decision trees to find letter sequences that unambiguously determine a word's stressing.
A large corpus of stressed words was used to build the decision trees. Stressing rules were formulated based on letter sequences at the beginning, end and middle of words. The proposed stressing algorithm reaches an accuracy of about 95.5%. For homograph disambiguation, methods not previously used for the Lithuanian language were applied, based on the usage frequencies of lexemes and morphological tags obtained from a one-million-word corpus. The work shows that the frequencies of morphological tags are more important than those of lexemes. The proposed methods allowed homographs to be disambiguated with an accuracy of 85.01%. For the clitic search, the proposed methods rely on: 1) recognition of combinational forms, 2) the statistical stressed/unstressed frequency of a word, 3) certain grammar rules, and 4) the distribution of stresses among adjacent words (rhythmics). It is explained how to combine all the methods into a single algorithm. Applied to the testing data, the ratio of errors to all words was 4.1%, and the ratio of errors to unstressed words was 18.8%.
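The idea of letter sequences that unambiguously define stressing can be sketched with a toy suffix-rule learner. The lexicon and stress-pattern labels below are invented for illustration; the dissertation's actual rules are learned with decision trees from a large stressed corpus and cover word beginnings and middles as well as endings.

```python
from collections import defaultdict

def build_suffix_rules(stressed_lexicon, max_len=4):
    """Collect word-final letter sequences that map to a single stress
    pattern across the whole lexicon (the 'unambiguous sequence' idea)."""
    candidates = defaultdict(set)
    for word, stress in stressed_lexicon.items():
        for n in range(1, min(max_len, len(word)) + 1):
            candidates[word[-n:]].add(stress)
    # keep only suffixes whose stress pattern is unambiguous
    return {suf: pats.pop() for suf, pats in candidates.items()
            if len(pats) == 1}

def predict_stress(word, rules, max_len=4):
    """Apply the longest matching suffix rule; None if no rule fires."""
    for n in range(min(max_len, len(word)), 0, -1):
        rule = rules.get(word[-n:])
        if rule is not None:
            return rule
    return None
```

Preferring the longest matching suffix mirrors how a decision tree would keep splitting on further letters until the remaining examples all share one stress pattern.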
