11

Bayesian Text Analytics for Document Collections

Walker, Daniel David 15 November 2012 (has links) (PDF)
Modern document collections are too large to annotate and curate manually. As increasingly large amounts of data become available, historians, librarians and other scholars increasingly need to rely on automated systems to efficiently and accurately analyze the contents of their collections and to find new and interesting patterns therein. Modern techniques in Bayesian text analytics are becoming widespread and have the potential to revolutionize the way that research is conducted. Much work has been done in the document modeling community towards this end, though most of it is focused on modern, relatively clean text data. We present research for improved modeling of document collections that may contain textual noise or that may include real-valued metadata associated with the documents. This class of documents includes many historical document collections. Indeed, our specific motivation for this work is to help improve the modeling of historical documents, which are often noisy and/or have historical context represented by metadata. Many historical documents are digitized by means of Optical Character Recognition (OCR) from document images of old and degraded original documents. Historical documents also often include associated metadata, such as timestamps, which can be incorporated in an analysis of their topical content. Many techniques, such as topic models, have been developed to automatically discover patterns of meaning in large collections of text. While these methods are useful, they can break down in the presence of OCR errors. We show the extent to which this performance breakdown occurs. The specific types of analyses covered in this dissertation are document clustering, feature selection, unsupervised and supervised topic modeling for documents with and without OCR errors, and a new supervised topic model that uses Bayesian nonparametrics to improve the modeling of document metadata. We present results in each of these areas, with an emphasis on studying the effects of noise on the performance of the algorithms and on modeling the metadata associated with the documents. In this research we: improve the state of the art in both document clustering and topic modeling; introduce a useful synthetic dataset for historical document researchers; and present analyses that empirically show how existing algorithms break down in the presence of OCR errors.
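The breakdown described above is easy to demonstrate in miniature. The following sketch, which is illustrative rather than the dissertation's actual code, injects synthetic OCR-style character confusions into a toy two-topic corpus and scores K-means clusters against the true labels with the Adjusted Rand Index; the confusion table, corpus, and noise rates are all invented for the example.

```python
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Hypothetical table of common OCR character confusions.
OCR_CONFUSIONS = {"e": "c", "l": "1", "o": "0", "m": "rn", "h": "b"}

def add_ocr_noise(text, rate, rng):
    """Corrupt characters with OCR-style confusions at the given rate."""
    return "".join(
        OCR_CONFUSIONS[ch] if ch in OCR_CONFUSIONS and rng.random() < rate else ch
        for ch in text
    )

# Toy two-topic corpus with known cluster labels.
docs = [
    "the telescope observed the comet and the nebula",
    "astronomers measured the orbit of the comet",
    "the nebula glowed near the star cluster",
    "the bank raised interest rates on loans",
    "investors moved money into government bonds",
    "the loan market reacted to the rate change",
]
true_labels = [0, 0, 0, 1, 1, 1]

rng = random.Random(13)
for rate in (0.0, 0.1, 0.3):
    noisy = [add_ocr_noise(d, rate, rng) for d in docs]
    X = TfidfVectorizer().fit_transform(noisy)
    pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(f"noise rate {rate:.1f}: ARI = {adjusted_rand_score(true_labels, pred):.2f}")
```

As the noise rate grows, corrupted tokens stop matching their clean counterparts, the TF-IDF space fragments, and the clustering score drops, which is the effect the dissertation quantifies at scale.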
12

Bridging the Digital Disparities in Sweden : A Discursive Analysis of Swedish Policy Reports on Digital Inclusion

Gültekin, Nur January 2023 (has links)
This study investigates the construction of discourse on digital inclusion in Sweden by closely analyzing policy reports from various governmental entities responsible for the digitalization agenda, spanning the years 2017 to 2023. The research adopts a three-dimensional approach, focusing on the discursive motivations for bridging the digital divide, the perceived access prerequisites for achieving this goal, and the primary target group of digital inclusion efforts within the policy discourse. Drawing upon van Dijk's Resources and Appropriation theory, the mezzo-scale analysis explores how properties of digital divides related to resource inequalities and adaptation are expressed within the discourse, forming the core framework of this thesis. Fairclough's critical discourse analysis (CDA) guides the macro-scale analysis; however, this large-scale view, with its focus on power relations, is not the key framework of the study and is instead drawn upon in the discussion section when evaluating the key findings. The methodology combines CDA through close reading with exploratory text mining techniques from the Digital Humanities, revealing three key discursive motivations: 1) social participation, 2) democracy and social equality, and 3) economic prosperity. Material/physical and skills access are identified as the primary prerequisites, with a particular focus on people with disabilities. A critical evaluation of these findings provides significant implications for future research on the digital divide, particularly with regard to Swedish policymaking.
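As a flavor of how exploratory text mining can complement close reading in such a study, the sketch below counts occurrences of a small seed lexicon of motivation terms across invented stand-ins for the policy reports; the excerpts and the term list are hypothetical, not the thesis's data.

```python
import re
from collections import Counter

# Invented stand-ins for excerpts from the 2017-2023 policy reports.
reports = {
    "2017": "Broadband access for all citizens supports participation and economic growth.",
    "2020": "Digital skills and accessible services safeguard democracy and participation.",
    "2023": "Digital inclusion of people with disabilities strengthens equality and participation.",
}

# A small seed lexicon mirroring the three discursive motivations
# (participation, democracy/equality, economic prosperity); hypothetical.
MOTIVATION_TERMS = {"participation", "democracy", "equality", "growth", "skills"}

def keyword_counts(text):
    """Lowercase tokenisation (Swedish letters included) and term counting."""
    return Counter(re.findall(r"[a-zåäö]+", text.lower()))

for year, text in sorted(reports.items()):
    counts = keyword_counts(text)
    hits = {term: counts[term] for term in sorted(MOTIVATION_TERMS) if counts[term]}
    print(year, hits)
```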
13

An Integrated Framework of Text and Visual Analytics to Facilitate Information Retrieval towards Biomedical Literature

Ji, Xiaonan 18 September 2018 (has links)
No description available.
14

Knowledge Driven Search Intent Mining

Jadhav, Ashutosh 31 May 2016 (has links)
No description available.
15

Event recognition in epizootic domains

Bujuru, Swathi January 1900 (has links)
Master of Science / Department of Computing and Information Sciences / William H. Hsu / In addition to named entities such as persons, locations, organizations, and quantities, which convey factual information, there are other entities and attributes that relate identifiable objects in the text and can provide valuable additional information. In the field of epizootics, these include specific properties of diseases such as their name, location, species affected, and current confirmation status. These are important for compiling the spatial and temporal statistics and other information needed to track diseases, leading to applications such as the detection and prevention of bioterrorism. Toward this objective, we present a system (Rule Based Event Extraction System in Epizootic Domains) that automatically extracts infectious disease outbreaks from unstructured data using pattern matching. In addition to extracting events, the components of this system can provide structured and summarized data that can be used to differentiate confirmed events from suspected ones, answer questions regarding when and where a disease was prevalent, develop a model for predicting future disease outbreaks, and support visualization through interfaces such as Google Maps. In developing this system, we considered research issues including document relevance classification, entity extraction, recognition of outbreak events in the disease domain, and support for event visualization. We present a sentence-based approach for extracting outbreak events from the epizootic domain, with tasks that include extracting the disease name, location, species, confirmation status, and date, and classifying events into two confirmation-status categories: confirmed or suspected. The approach shows how confirmation status matters when extracting disease-based events from unstructured data, and a pyramid approach using reference summaries is employed to evaluate the extracted events.
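A minimal illustration of the pattern-matching idea follows; the regular expression and example sentences are hypothetical and far simpler than the thesis's actual rule set, but they show how a single rule can pull out the disease, species, location, and confirmation status of an outbreak event.

```python
import re

# One hypothetical extraction rule in the spirit of the rule-based system.
EVENT_PATTERN = re.compile(
    r"(?P<status>confirmed|suspected)\s+outbreak\s+of\s+"
    r"(?P<disease>[\w\s]+?)\s+in\s+(?P<species>\w+)\s+"
    r"(?:near|in)\s+(?P<location>[A-Z]\w+)",
    re.IGNORECASE,
)

sentences = [
    "Officials reported a confirmed outbreak of avian influenza in poultry near Salina.",
    "A suspected outbreak of foot and mouth disease in cattle in Manhattan is under review.",
]

for sent in sentences:
    m = EVENT_PATTERN.search(sent)
    if m:
        # The confirmation status field drives whether the event is treated
        # as confirmed or merely suspected downstream.
        print(m.groupdict())
```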
16

A framework for exploiting electronic documentation in support of innovation processes

Uys, J. W. 03 1900 (has links)
Thesis (PhD (Industrial Engineering))--University of Stellenbosch, 2010. / The crucial role of innovation in creating sustainable competitive advantage is widely recognised in industry today. Likewise, the importance of having the required information accessible to the right employees at the right time is well appreciated. More specifically, the dependency of effective, efficient innovation processes on the availability of information has been pointed out in the literature. A great challenge is countering the effects of the information overload phenomenon in organisations so that employees can find the information appropriate to their needs without having to wade through excessively large quantities of information to do so. The initial stages of the innovation process, which are characterised by free association, semi-formal activities, conceptualisation, and experimentation, have already been identified as a key focus area for improving the effectiveness of the entire innovation process. The dependency on information during these early stages of the innovation process is especially high. Any organisation requires a strategy for innovation, together with a number of well-defined, implemented processes and measures, to be able to innovate in an effective and efficient manner and to drive its innovation endeavours. In addition, the organisation requires certain enablers to support its innovation efforts, including certain core competencies, technologies and knowledge. Most importantly for this research, enablers are required to more effectively manage and utilise innovation-related information. Information residing inside and outside the boundaries of the organisation is required to feed the innovation process. The specific sources of such information are numerous, and the information may be structured or unstructured in nature. However, an ever-increasing proportion of available innovation-related information is of the unstructured type. Examples include the textual content of reports, books, e-mail messages and web pages. This research explores the innovation landscape and typical sources of innovation-related information. In addition, it explores the landscape of text analytical approaches and techniques in search of ways to deal more effectively and efficiently with unstructured, textual information. A framework that can be used to provide a unified, dynamic view of an organisation's innovation-related information, both structured and unstructured, is presented. Once implemented, this framework will constitute an innovation-focused knowledge base that organises such innovation-related information and makes it accessible to the stakeholders of the innovation process. Two novel, complementary text analytical techniques, Latent Dirichlet Allocation and the Concept-Topic Model, were identified for application with the framework. The potential value of these techniques as part of the information systems that would embody the framework is illustrated. The resulting knowledge base would cause a quantum leap in the accessibility of information and may significantly improve the way innovation is done and managed in the target organisation.
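Of the two techniques named, Latent Dirichlet Allocation has widely available implementations, so a minimal sketch of applying it to a toy set of innovation-related snippets is given below; the corpus and parameter choices are illustrative only, and the Concept-Topic Model is not shown.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-ins for the unstructured sources mentioned above
# (reports, e-mails, web pages); purely illustrative.
docs = [
    "prototype testing revealed new product design opportunities",
    "the design team sketched product concepts during the ideation workshop",
    "patent filings and market research inform the technology strategy",
    "competitor analysis and market trends guide the innovation strategy",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Two latent topics: roughly 'design/ideation' and 'market/strategy'.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-4:][::-1]]
    print(f"topic {k}: {top}")
```

In the framework's setting, the inferred topics would act as an automatically maintained index over the organisation's unstructured innovation-related documents.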
17

Word Clustering in an Interactive Text Analysis Tool / Klustring av ord i ett interaktivt textanalysverktyg

Gränsbo, Gustav January 2019 (has links)
A central operation of users of the text analysis tool Gavagai Explorer is to look through a list of words and arrange them in groups. This thesis explores the use of word clustering to automatically arrange the words in groups intended to help users. A new word clustering algorithm is introduced, which attempts to produce word clusters tailored to be small enough for a user to quickly grasp the common theme of the words. The proposed algorithm computes similarities among words using word embeddings, and clusters them using hierarchical graph clustering. Multiple variants of the algorithm are evaluated in an unsupervised manner by analysing the clusters they produce when applied to 110 data sets previously analysed by users of Gavagai Explorer. A supervised evaluation is performed to compare clusters to the groups of words previously created by users of Gavagai Explorer. Results show that it was possible to choose a set of hyperparameters deemed to perform well across most data sets in the unsupervised evaluation. These hyperparameters also performed among the best in the supervised evaluation. It was concluded that the choice of word embedding and graph clustering algorithm had little impact on the behaviour of the algorithm. Rather, limiting the maximum size of clusters and filtering out weak similarities between words had a much larger impact on behaviour.
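A simplified sketch of the core idea follows, using toy vectors in place of real word embeddings and a greedy merge in place of the hierarchical graph clustering used in the thesis; it does, however, expose the two knobs the evaluation found most influential, a similarity filter and a maximum cluster size.

```python
import math

# Toy vectors standing in for real word embeddings.
embeddings = {
    "price":   [0.9, 0.1, 0.0],
    "cost":    [0.85, 0.15, 0.05],
    "cheap":   [0.8, 0.2, 0.1],
    "support": [0.1, 0.9, 0.2],
    "helpful": [0.05, 0.95, 0.15],
    "slow":    [0.2, 0.3, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

MIN_SIM = 0.9      # filter out weak similarities
MAX_CLUSTER = 3    # keep clusters small enough to grasp at a glance

# Candidate merges: word pairs ranked by similarity.
words = list(embeddings)
pairs = sorted(
    ((cosine(embeddings[a], embeddings[b]), a, b)
     for i, a in enumerate(words) for b in words[i + 1:]),
    reverse=True,
)

cluster_of = {w: {w} for w in words}
for sim, a, b in pairs:
    ca, cb = cluster_of[a], cluster_of[b]
    # Merge greedily, but never past the size cap or below the filter.
    if sim >= MIN_SIM and ca is not cb and len(ca) + len(cb) <= MAX_CLUSTER:
        merged = ca | cb
        for w in merged:
            cluster_of[w] = merged

print({frozenset(c) for c in cluster_of.values()})
```

With these toy vectors the sketch yields a pricing cluster, a support cluster, and a singleton, illustrating how the cap and the filter, rather than the embedding choice, shape the output.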
18

Extrakce sémantických vztahů z nestrukturovaných dat v komerční sféře / Semantic relation extraction from unstructured data in the business domain

Rampula, Ilana January 2016 (has links)
Text analytics in the business domain is a growing field in both research and practical applications. We chose to concentrate on relation extraction from unstructured data provided by a corporate partner. Analyzing text from this domain requires a different approach, one that accounts for irregularities and domain-specific attributes. In this thesis, we present two methods for relation extraction: the Snowball system and the Distant Supervision method, both adapted to this unique data. The methods were implemented to use both structured and unstructured data from the company's database. Keywords: Information Retrieval, Relation Extraction, Text Analytics, Distant Supervision, Snowball
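The distant supervision idea can be sketched in a few lines: sentences mentioning an entity pair already related in a knowledge base become (noisy) training examples, and the text between the entities can be harvested as a Snowball-style pattern. The knowledge-base triple and sentences below are invented, and the real systems are considerably more elaborate.

```python
import re

# Hypothetical knowledge base of known relations.
KB = {("Acme Corp", "supplier_of", "Widget Ltd")}

sentences = [
    "Acme Corp has been the main supplier of Widget Ltd since 2010.",
    "Widget Ltd announced a partnership unrelated to Acme Corp pricing.",
]

patterns = []
for e1, rel, e2 in KB:
    for sent in sentences:
        # A sentence mentioning both entities in order is a candidate example.
        m = re.search(re.escape(e1) + r"\s+(.{1,60}?)\s+" + re.escape(e2), sent)
        if m:
            # The middle context becomes a candidate extraction pattern.
            patterns.append((rel, m.group(1)))

print(patterns)  # [('supplier_of', 'has been the main supplier of')]
```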
19

Um estudo sobre o papel de medidas de similaridade em visualização de coleções de documentos / A study on the role of similarity measures in visual text analytics

Salazar, Frizzi Alejandra San Roman 27 September 2012 (has links)
Information visualization techniques, such as similarity-based point placement, are used to generate visual representations of data that evidence certain patterns. These techniques are sensitive to data quality, which in turn depends on a very influential preprocessing step. This step involves cleaning the text and, in some cases, detecting terms and their weights, as well as defining a (dis)similarity function. There are few studies on how these (dis)similarity calculations affect the quality of visual representations of textual data. This work presents a study on the role of various (dis)similarity measures between pairs of texts in generating visual maps. We focus primarily on two types of distance functions, those computed from vector representations of the text (Vector Space Model (VSM)) and measures obtained from direct comparison of text strings, comparing their effect on the visual maps obtained with point placement techniques. For this, objective measures were employed to compare the visual quality of the generated maps, such as the Neighborhood Hit (NH) and the Silhouette Coefficient (SC). We found that both approaches have strengths, but in general the VSM showed better results as far as class discrimination is concerned. However, the conventional VSM is not incremental, i.e., new additions to the collection force recalculation of the data space and of previously computed dissimilarities. Thus, a new incremental model based on the VSM (Incremental Vector Space Model (iVSM)) was also considered in our comparative studies. The iVSM showed the best quantitative and qualitative results in several of the configurations tested. The evaluation results are presented, and recommendations are offered on the application of different text similarity measures in visual analysis tasks.
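The contrast between the two families of measures studied here can be sketched quickly: a Vector Space Model cosine over term frequencies ignores word order entirely, while a direct string comparison does not. The example below, with invented sentences and difflib's ratio standing in for the string measures actually evaluated, makes the difference visible.

```python
from collections import Counter
import difflib
import math

def vsm_cosine(a, b):
    """Cosine similarity between term-frequency vectors of two texts."""
    ta, tb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ta[w] * tb[w] for w in set(ta) & set(tb))
    na = math.sqrt(sum(c * c for c in ta.values()))
    nb = math.sqrt(sum(c * c for c in tb.values()))
    return dot / (na * nb)

def string_sim(a, b):
    """Direct character-level similarity of the raw strings."""
    return difflib.SequenceMatcher(None, a, b).ratio()

d1 = "visual maps reveal patterns in document collections"
d2 = "document collections reveal patterns in visual maps"  # same words, reordered
d3 = "interest rates shape the corporate bond market"

for x, y in ((d1, d2), (d1, d3)):
    print(f"VSM cosine={vsm_cosine(x, y):.2f}  string={string_sim(x, y):.2f}")
```

The reordered pair scores a perfect VSM cosine but a much lower string similarity, which is exactly the kind of behavioural difference that shows up in the resulting visual maps.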
