351

Fatoração de matrizes no problema de coagrupamento com sobreposição de colunas / Matrix factorization for overlapping columns coclustering

Lucas Fernandes Brunialti 31 August 2016 (has links)
Coclustering is a data analysis strategy able to discover groups of data, called coclusters, that are formed by considering different subsets of the descriptive features of the data. Application contexts characterized by subjectivity, such as text mining, are natural candidates for coclustering: the flexibility to associate documents according to partial features is an adequate treatment of that subjectivity. Matrix factorization is one method for implementing coclustering that can handle this type of data. In this master's thesis, two coclustering strategies based on non-negative matrix factorization are proposed, capable of finding coclusters organized with column overlap in a matrix of positive real values. The strategies are presented in terms of their formal definitions and their implementation algorithms. Quantitative and qualitative experimental results are provided for problems based on synthetic datasets and on real datasets from the text-mining domain. The results are analyzed in terms of space quantization and reconstruction ability; clustering ability, using the external metrics Rand index and normalized mutual information; and generated information (interpretability of the models). The results confirm the hypothesis that the proposed strategies discover overlapping coclusters naturally, and that this organization of coclusters provides detailed, and therefore distinctly valuable, information for cluster analysis and text mining.
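As a rough illustration of the factorization machinery such strategies build on, here is a minimal sketch: a plain two-factor non-negative matrix factorization with multiplicative updates, with column memberships read off the factor H by thresholding. The threshold read-out is an assumed stand-in for illustration only; it is not the overlap formulation proposed in the dissertation.

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Plain NMF with multiplicative updates: X (n x m, non-negative)
    is approximated by W @ H with W (n x k) and H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)  # update column factor
        W *= (X @ H.T) / (W @ H @ H.T + eps)  # update row factor
    return W, H

# Toy usage: rows as documents, columns as terms.
X = np.random.default_rng(1).random((8, 6))
W, H = nmf(X, k=2)
row_clusters = W.argmax(axis=1)        # hard row (document) assignment
col_weights = H / (H.sum(axis=0) + 1e-12)
col_memberships = col_weights > 0.4    # a column above the threshold in more
                                       # than one row of H belongs to several
                                       # coclusters, i.e. column overlap
```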
352

Utilização de técnicas de dados não estruturados para desenvolvimento de modelos aplicados ao ciclo de crédito / Using unstructured-data techniques to develop models applied to the credit cycle

Andrade Junior, Valter Lacerda de 13 August 2014 (has links)
The need for specialized data-mining analysis of textual fields and other unstructured information is increasingly present in public- and private-sector institutions. Probabilistic models and analytical studies make it possible to broaden the understanding of a given information source. In recent years, technological progress has driven exponential growth in the information produced and accessed in virtual media (web and private). It is estimated that by 2003 humanity had historically generated a total of 5 exabytes of content; today that volume can be produced in a few days. Given this growing demand, this project works with probabilistic models related to the financial market in order to check whether the textual fields, or unstructured information, found within the business environment can predict certain customer behaviors. It starts from the assumption that the corporate environment and the web hold information of great value that, owing to its complexity and lack of structure, is rarely considered in probabilistic studies. Such material can represent a competitive and strategic advantage for the business: by analyzing unstructured information, one can learn about user behavior and modes of interaction in a given environment, yielding data such as psychographic profiles and degrees of satisfaction. The corpus of this study consists of the results of experiments carried out in the trading environment of a financial company in São Paulo. Statistical concepts with a semiotic bias were applied to the analyses. Among the outcomes of this study is a critical and thorough understanding of the processes of textual data assessment.
353

Análise de dados por meio de agrupamento fuzzy semi-supervisionado e mineração de textos / Data analysis using semisupervised fuzzy clustering and text mining

Medeiros, Debora Maria Rossi de 08 December 2010 (has links)
This thesis presents a set of techniques designed to improve the data clustering process. The main goal is to provide the scientific community with a tool set for complete analysis of the implicit structures in datasets, from the discovery of those structures, allowing the use of prior knowledge about the data, to the analysis of their meaning in context. The tool set has two main components. The first is the semi-supervised fuzzy clustering algorithm SSL+P and its upgraded version SSL+P*, which can take into account available knowledge about the data in two forms: class labels and pairwise proximity levels, both referred to here as hints. These algorithms can also adapt the distance metric to the data and the available hints. The SSL+P* algorithm additionally estimates the ideal number of clusters for a dataset, considering the available hints. Both SSL+P and SSL+P* minimize an objective function with a population-based optimization (PBO) algorithm. This thesis also provides tools that can be employed directly at this step: two modified versions of the Particle Swarm Optimization (PSO) algorithm, DPSO-1 and DPSO-2, and four different methods for initializing a population of solutions. The second main component of the tool set concerns the analysis of clusters resulting from a clustering process applied to a domain-specific dataset. A text-mining-based approach is proposed to search digital repositories for textual information related to the entities represented by the data. A set of words associated with each cluster is then presented to the researcher, suggesting information that can support the identification of the relations shared by objects assigned to the same cluster.
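To make the optimization step concrete, here is a minimal sketch of a population-based optimizer: a plain PSO minimizing a fuzzy c-means-style objective over cluster centers. It is an assumed, simplified stand-in; the DPSO-1/DPSO-2 variants, the hints, and the metric adaptation of SSL+P/SSL+P* are not reproduced.

```python
import numpy as np

def pso_minimize(f, dim, n_particles=30, n_iter=100, seed=0,
                 w=0.7, c1=1.5, c2=1.5):
    """Plain particle swarm optimization over R^dim."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, (n_particles, dim))    # positions
    v = np.zeros_like(x)                          # velocities
    pbest, pbest_val = x.copy(), np.array([f(p) for p in x])
    g = pbest[pbest_val.argmin()].copy()          # global best
    for _ in range(n_iter):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        vals = np.array([f(p) for p in x])
        better = vals < pbest_val
        pbest[better], pbest_val[better] = x[better], vals[better]
        g = pbest[pbest_val.argmin()].copy()
    return g, pbest_val.min()

# Toy objective: fuzzy c-means-style cost for k=2 centers in 2-D,
# encoded as a flat parameter vector of length k*2.
data = np.random.default_rng(1).random((50, 2))
def fcm_cost(theta, m=2.0):
    centers = theta.reshape(-1, 2)
    d = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(-1) + 1e-12
    u = (1.0 / d) ** (1 / (m - 1))
    u /= u.sum(axis=1, keepdims=True)             # fuzzy memberships
    return float((u ** m * d).sum())

best_centers, cost = pso_minimize(fcm_cost, dim=4)
```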
354

Refinamento interativo de mapas de documentos apoiado por extração de tópicos / Interactive refinement of document maps supported by topic extraction

Silva, Renato Rodrigues Oliveira da 15 December 2010 (has links)
Content-based document maps are visual representations that help users identify and explore relationships among the documents of a collection. Multidimensional projection techniques can be employed to create similarity-based maps that reflect content similarity, favoring the identification of groups of documents with similar content. This work extends the generic framework offered by multidimensional projection techniques in the PEx visualization platform to support interactive analysis of textual data. Interaction functions and visual representations were proposed and implemented that allow users to interact with document maps aided by topics automatically extracted from the corpus itself. By exploring topics and maps in an integrated manner, users can gradually refine the visual representations to better reflect their needs and interests, improving support for exploratory tasks. The proposed interaction functions were evaluated with a usability inspection technique aimed at detecting the main interface problems users face. In addition, two case studies were conducted to evaluate the usefulness of the functions, based on typical user tasks defined over document maps. The results show that, aided by the visualizations, the tasks could be carried out satisfactorily, allowing thousands of documents to be handled efficiently without reading each text individually.
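A minimal sketch of how a content-based document map arises: TF-IDF vectors projected to the plane. PCA is used here as an assumed stand-in for the projection step; PEx offers dedicated multidimensional projection techniques, which are not reproduced.

```python
import numpy as np
from collections import Counter

docs = ["text mining of documents", "clustering documents by content",
        "interactive maps of text collections"]

# TF-IDF matrix over a small vocabulary.
vocab = sorted({w for d in docs for w in d.split()})
tf = np.array([[Counter(d.split())[w] for w in vocab] for d in docs], float)
df = (tf > 0).sum(axis=0)
X = tf * np.log(len(docs) / df)             # tf-idf weights

# Project to 2-D with PCA (top-2 principal components via SVD).
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = Xc @ Vt[:2].T                      # one (x, y) map point per document
for d, (x, y) in zip(docs, coords):
    print(f"({x:+.2f}, {y:+.2f})  {d}")     # nearby points = similar content
```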
355

Indirect Relatedness, Evaluation, and Visualization for Literature Based Discovery

Henry, Sam 01 January 2019 (has links)
The exponential growth of scientific literature is creating an increased need for systems to process and assimilate knowledge contained within text. Literature Based Discovery (LBD) is a well-established field that seeks to synthesize new knowledge from existing literature, but it has remained primarily in the theoretical realm rather than in real-world application. This lack of real-world adoption is due in part to the difficulty of LBD, but also to several solvable problems present in LBD today. Of these problems, the ones in most critical need of improvement are: (1) the over-generation of knowledge by LBD systems, (2) a lack of meaningful evaluation standards, and (3) the difficulty of interpreting LBD output. We address each of these problems by: (1) developing indirect relatedness measures for ranking and filtering LBD hypotheses; (2) developing a representative evaluation dataset and applying meaningful evaluation methods to individual components of LBD; (3) developing an interactive visualization system that allows a user to explore LBD output in its entirety. In addressing these problems, we make several contributions, most importantly: (1) state-of-the-art results for estimating direct semantic relatedness, (2) development of set association measures, (3) development of indirect association measures, (4) development of a standard LBD evaluation dataset, (5) division of LBD into discrete components with well-defined evaluation methods, (6) development of automatic functional group discovery, and (7) integration of indirect relatedness measures and automatic functional group discovery into a comprehensive LBD visualization system. Our results inform the future development of LBD and contribute to creating more effective LBD systems.
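As an illustration of what an indirect relatedness measure does, here is a minimal sketch that scores two terms by the cosine of their co-occurrence profiles, so terms that never co-occur directly can still score highly through shared neighbors. The terms and counts are invented stand-ins (echoing the classic fish-oil/Raynaud's LBD example); the measures developed in this dissertation are not reproduced.

```python
import numpy as np

# Term-term co-occurrence counts (rows/columns indexed by vocabulary).
terms = ["fish_oil", "blood_viscosity", "raynauds", "platelet"]
C = np.array([[0, 8, 0, 5],
              [8, 0, 6, 2],
              [0, 6, 0, 4],
              [5, 2, 4, 0]], float)

def indirect_relatedness(C, a, b):
    """Cosine of two terms' co-occurrence profiles: high when a and b
    share many linking terms, even if they never co-occur directly."""
    u, v = C[a], C[b]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# fish_oil and raynauds never co-occur (C[0, 2] == 0), yet their shared
# neighbors (blood_viscosity, platelet) give them a high indirect score,
# which is what makes them a candidate LBD hypothesis.
print(indirect_relatedness(C, terms.index("fish_oil"), terms.index("raynauds")))
```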
356

Multi-scale analysis of languages and knowledge through complex networks / Análise multi-escala de línguas e conhecimento por meio de redes complexas

Arruda, Henrique Ferraz de 24 January 2019 (has links)
There are many aspects of natural languages and their related dynamics that have been studied. In the case of languages, some quantitative analyses have been done using stochastic models. Furthermore, natural languages can be understood as complex systems, so the set of tools developed to analyse complex networks, which are computationally represented by graphs, can also be applied to natural languages, and can likewise represent and analyse related dynamics taking place on the networks. Note that knowledge is intrinsically related to language, because language is the vehicle human beings use to transmit discoveries, and language itself is also a type of knowledge. This thesis is divided into two types of analysis: (i) texts and (ii) dynamical aspects. In the first part, we propose network representations of text at different scales of analysis, starting from the analysis of writing style with word adjacency (co-occurrence) networks that capture local patterns of words, up to a mesoscopic representation, created from chunks of text, that grasps information about the unfolding of the story. In the second part, we consider structure and dynamics related to knowledge and language, starting from the largest scale, at which we study the connectivity between applied and theoretical physics. Next, we simulate knowledge acquisition by researchers in a multi-agent dynamics, and by an intelligent problem-solving machine that is represented by a network. At the smallest considered scale, we simulate the transmission of networks, treating the data as a series of organized symbols obtained from a dynamics. To improve transmission speed, the series can be compacted; for that, we employ information theory and Huffman coding. The proposed network-based approaches were found to be suitable for the employed analyses at all of the tested scales.
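To make the compaction step concrete, here is a minimal sketch of a standard Huffman code built over symbol frequencies. The symbol series below is an invented stand-in; the network-serialization dynamics described in the thesis is not reproduced.

```python
import heapq
from collections import Counter

def huffman_code(series):
    """Build a prefix code with shorter codewords for frequent symbols."""
    freq = Counter(series)
    # Heap of (weight, tiebreak, {symbol: codeword}) entries.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)   # two lightest subtrees
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# Stand-in symbol series, e.g. a serialized description of a network.
series = "AAAABBBCCD"
code = huffman_code(series)
encoded = "".join(code[s] for s in series)
print(code)
print(len(encoded), "bits vs", 2 * len(series), "bits for a fixed 2-bit code")
```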
357

應用文字探勘技術於臺灣上市公司重大訊息對股價影響之研究 / The study on impact of material information of public listed company to its stock price by using text mining approach

吳漢瑞, Wu, Han Ruei Unknown Date (has links)
Taiwan's stock market is shallow, so outside information readily moves stock prices; it is also a market dominated by individual retail investors, whose decisions respond to such information. The effect of material-information disclosures on a company's stock price is therefore worth closer study. This study used text-mining techniques to classify the material information of listed companies and to analyze how its disclosure affects the market, so that stock-price movements can be predicted from disclosures and used as a reference for investment. The Market Observation Post System was chosen as the data source, and UNI-PRESIDENT ENTERPRISES CORP, Chunghwa Telecom Co., Ltd, EVA AIRWAYS CORPORATION and Taiwan Business Bank were selected for their strong records of information disclosure. We collected 1,382 material-information announcements published between 2005 and 2009 and, for better performance, used the kNN algorithm to group them, analyzing how strongly each announcement affects the stock price, identifying the upward or downward tendency of each group, and tracking the two-day cumulative return after disclosure as a reference for selecting investment targets. The results show significantly abnormal trading volume from two days before to two days after disclosure, confirming that the announcements do affect the companies' stocks. Different kinds of material information fall into different groups, each with its own price tendency: classification of the test data averaged 65% accuracy overall, rising to 80% for the "up" category, and the average rate of correct investment decisions based on post-disclosure cumulative returns exceeded 60%. Through systematic analysis and prediction, the system spares investors the time spent searching for and interpreting material information and provides a reference basis for investment.
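To make the classification step concrete, here is a minimal sketch of kNN over bag-of-words announcement vectors with cosine similarity. The vocabulary, vectors, and labels are invented stand-ins; the study's actual features and Chinese-text processing are not reproduced.

```python
import numpy as np
from collections import Counter

def knn_predict(train_vecs, train_labels, query, k=3):
    """Label a query by majority vote among its k nearest neighbors
    under cosine similarity."""
    norms = np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(query) + 1e-12
    sims = train_vecs @ query / norms
    top = np.argsort(sims)[-k:]           # indices of the k most similar
    return Counter(train_labels[i] for i in top).most_common(1)[0][0]

# Toy bag-of-words vectors for announcements, labeled by price move.
vocab = ["dividend", "lawsuit", "contract", "loss"]
train = np.array([[2, 0, 1, 0],    # up
                  [1, 0, 2, 0],    # up
                  [0, 2, 0, 1],    # down
                  [0, 1, 0, 2]],   # down
                 float)
labels = ["up", "up", "down", "down"]
query = np.array([1, 0, 1, 0], float)     # mentions dividend and contract
print(knn_predict(train, labels, query))  # -> "up"
```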
358

On text mining to identify gene networks with a special reference to cardiovascular disease / Identifiering av genetiska nätverk av betydelse för kärlförkalkning med hjälp av automatisk textsökning i Medline, en medicinsk litteraturdatabas

Strandberg, Per Erik January 2005 (has links)
<p>The rate at which articles gets published grows exponentially and the possibility to access texts in machine-readable formats is also increasing. The need of an automated system to gather relevant information from text, text mining, is thus growing. </p><p>The goal of this thesis is to find a biologically relevant gene network for atherosclerosis, themain cause of cardiovascular disease, by inspecting gene cooccurrences in abstracts from PubMed. In addition to this gene nets for yeast was generated to evaluate the validity of using text mining as a method. </p><p>The nets found were validated in many ways, they were for example found to have the well known power law link distribution. They were also compared to other gene nets generated by other, often microbiological, methods from different sources. In addition to classic measurements of similarity like overlap, precision, recall and f-score a new way to measure similarity between nets are proposed and used. The method uses an urn approximation and measures the distance from comparing two unrelated nets in standard deviations. The validity of this approximation is supported both analytically and with simulations for both Erd¨os-R´enyi nets and nets having a power law link distribution. The new method explains that very poor overlap, precision, recall and f-score can still be very far from random and also how much overlap one could expect at random. The cutoff was also investigated. </p><p>Results are typically in the order of only 1% overlap but with the remarkable distance of 100 standard deviations from what one could have expected at random. Of particular interest is that one can only expect an overlap of 2 edges with a variance of 2 when comparing two trees with the same set of nodes. The use of a cutoff at one for cooccurrence graphs is discussed and motivated by for example the observation that this eliminates about 60-70% of the false positives but only 20-30% of the overlapping edges. This thesis shows that text mining of PubMed can be used to generate a biologically relevant gene subnet of the human gene net. A reasonable extension of this work is to combine the nets with gene expression data to find a more reliable gene net.</p>
359

Matching Vehicle License Plate Numbers Using License Plate Recognition and Text Mining Techniques

Oliveira Neto, Francisco Moraes 01 August 2010 (has links)
License plate recognition (LPR) technology has been widely applied in many transportation applications such as enforcement, vehicle monitoring, and access control. In most applications involving enforcement (e.g. cashless toll collection, congestion charging) and access control (e.g. car parking), a plate is recognized at one location (or checkpoint) and compared against a list of authorized vehicles. In this research I dealt with applications where a vehicle is detected at two locations and there is no reference list for vehicle identification. Little effort seems to have been made in the past to exploit all the information generated by LPR systems. Nowadays, LPR machines can recognize most characters on vehicle plates even under the harshest practical conditions. So although the equipment is not perfect at plate reading, it is still possible to judge with some confidence whether a pair of imperfect readings, in the form of character sequences (strings), most likely belongs to the same vehicle. The challenge is to design a matching procedure that decides whether or not they do. In view of this problem, this research designed and assessed a matching procedure that exploits a similarity measure called the edit distance (ED) between two strings. The ED measures the minimum editing cost to convert one string into another. The study first assessed a simple dual-LPR setup using the traditional ED formulation with 0-or-1 cost assignments (i.e. 0 if a pair of characters is the same, and 1 otherwise). For this dual setup, the research further proposed a symbol-based weight function using a probabilistic approach whose input parameters are the conditional probability matrix of character association. This new formulation outperformed the original ED formulation. Lastly, the research incorporated passage-time information into the procedure, which improved the matching performance considerably, resulting in a high positive-matching rate and a much lower (about 2%) false-matching rate.
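To make the matching machinery concrete, here is a minimal sketch of edit distance with a pluggable substitution-cost function. The confusion set and the 0.2 cost are invented stand-ins for the conditional character-association probabilities estimated in the study.

```python
def edit_distance(a, b, sub_cost=lambda x, y: 0.0 if x == y else 1.0,
                  indel_cost=1.0):
    """Dynamic-programming edit distance with a pluggable substitution cost."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * indel_cost
    for j in range(1, n + 1):
        d[0][j] = j * indel_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + indel_cost,           # deletion
                          d[i][j - 1] + indel_cost,           # insertion
                          d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))
    return d[m][n]

# Hypothetical confusion-based costs: visually similar characters that LPR
# machines often swap are cheap to substitute, all others cost 1.
CONFUSABLE = {("8", "B"), ("B", "8"), ("0", "O"), ("O", "0"),
              ("1", "I"), ("I", "1")}
def lpr_sub_cost(x, y):
    if x == y:
        return 0.0
    return 0.2 if (x, y) in CONFUSABLE else 1.0

print(edit_distance("ABC1230", "ABC123O"))                        # 1.0 (plain)
print(edit_distance("ABC1230", "ABC123O", sub_cost=lpr_sub_cost))  # 0.2 (weighted)
```

With weighted costs, two readings that differ only by a typical recognition confusion score as far more likely to be the same vehicle than readings differing by an arbitrary character.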
360

Concept Based Knowledge Discovery from Biomedical Literature.

Radovanovic, Aleksandar. January 2009 (has links)
<p>This thesis describes and introduces novel methods for knowledge discovery and presents a software system that is able to extract information from biomedical literature, review interesting connections between various biomedical concepts and in so doing, generates new hypotheses. The experimental results obtained by using methods described in this thesis, are compared to currently published results obtained by other methods and a number of case studies are described. This thesis shows how the technology&nbsp / resented can be integrated with the researchers&rsquo / own knowledge, experimentation and observations for optimal progression of scientific research.</p>
