161.
Direct Demonstration of Self-Similarity in a Hydrodynamic Treatment of Polymer Self-Diffusion
Merriam, Susan Carol, 01 May 2002
The self-diffusion coefficient of a polymer in solution may be expanded in the concentration of the polymer, as seen in equation 1. The linear term represents a perturbation due to the presence of another polymer; the c^2 term represents a perturbation due to interactions of trios of polymers. Phillies determined the c^2 term of a virial expansion of the self-diffusion coefficient for trios of polymers interacting via a ring. Here I determine a correction to the c^2 term due to trios of polymers interacting via a figure-eight scattering diagram: the equivalent of four polymers interacting in a ring where the second polymer and the fourth polymer are the same.

    D_s(c) = D_0 (1 + \alpha D_0 c + \beta D_0^2 c^2 + ...)    (1)

    D_s(c) = D_0 (1 + \alpha D_s(c) c)    (2)

A D_0 may be replaced by D_s(c) in equation 1 to arrive at equation 2. The left-hand side of equation 2 is the final self-diffusion coefficient, and the D_s(c) on the right-hand side is the one subject to the question of self-similarity. If the D_s(c) on the right-hand side is given by equation 1, resulting in \beta = \alpha^2, the system may be said to exhibit self-similarity. I demonstrate self-similarity quantitatively for a polymer solution using a generalized Kirkwood-Riseman model of polymer dynamics. The major physical assumption of the model I use to derive equation 2 is that, in solution, polymer motions are dominantly governed by hydrodynamic interactions between the chains. First, I review the Kirkwood-Riseman model for intrachain hydrodynamic interactions. I then discuss Phillies' extension of this model to interchain interactions for duos or trios of polymers in a ring. I analytically calculate the hydrodynamic interaction tensor T_{54321} from a multiple-scattering picture for five polymers in solution and verify this tensor by numerical differentiation. Finally, I perform the ensemble average of the self-interaction tensor b_{1232} appropriate to the figure-eight scattering diagram, both analytically and with a Monte Carlo routine, thereby verifying equation 2 to second order in concentration.
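The self-similarity condition can be checked symbolically: substituting equation 1 into the right-hand side of equation 2 and matching the c^2 coefficients forces beta = alpha^2. A minimal sketch of that check (mine, not part of the thesis):

```python
# Symbolic check that self-similarity (Eq. 2) forces beta = alpha**2 in Eq. 1.
import sympy as sp

c, D0, alpha, beta = sp.symbols('c D0 alpha beta', positive=True)

# Eq. 1: virial expansion of the self-diffusion coefficient, truncated at c**2.
Ds_series = D0 * (1 + alpha * D0 * c + beta * D0**2 * c**2)

# Eq. 2: the self-similar form, with Eq. 1 substituted on the right-hand side.
Ds_selfsim = D0 * (1 + alpha * Ds_series * c)

# Match the coefficients of c**2 in the two expansions.
diff = sp.expand(Ds_selfsim - Ds_series)
coeff_c2 = diff.coeff(c, 2)
print(sp.solve(sp.Eq(coeff_c2, 0), beta))   # -> [alpha**2]
```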
162.
Study and development of spherical harmonics-based methods for ligand structural similarity analysis
Caires, Fernando Ribeiro, 19 October 2016
Molecular descriptors are essential in many applications of computational physics and chemistry, such as structure-based ligand similarity analysis. Spherical harmonics have been used as molecular surface descriptors because they provide a compact geometric description and admit a rotation-invariant descriptor. This work proposes a method for structural similarity analysis between ligands in which the surface of a molecule is modeled by a spherical harmonics expansion computed by the program LIRA. The resulting coefficients are used to search the DUD-E database, whose descriptors are precomputed, using Euclidean distance and several cutoff values to select the most similar compounds. The potential of the method is evaluated against Ultrafast Shape Recognition (USR) as the reference method, since USR is an excellent and fast metric for ligand similarity analysis. Fifty molecules of varying size and composition were selected so as to represent all molecular groups present in DUD-E. Each molecule was then submitted to a similarity search, varying the LIRA cutoff values, and the set of selected molecules was compared with the set selected by USR through binary classification and the construction and interpretation of ROC curves. Beyond the benchmarking, a principal component analysis was performed to determine which descriptors are the most important and carry the most useful information for describing the molecular surface. Based on the principal components, two further studies were carried out: the use of weight functions, assigning more importance to the most informative descriptors, and dimensionality reduction, in which a new subset of eigenvectors forms the basis of the vector space and the molecules are re-described in that space; each variant was evaluated in a new benchmarking. LIRA proved to be as fast as USR and showed strong potential for selecting similar molecules for most of the molecules tested, as the ROC curves presented points above the random line. Both dimensionality reduction and weight functions added value to the metric, making it faster (in the case of the reduced number of descriptors) and more selective (in both cases). The proposed method thus proved efficient at measuring ligand similarity selectively and quickly using only information about the molecular surface.
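The screening step described above reduces to nearest-neighbor search under a distance cutoff. A minimal sketch of that step, under stated assumptions: descriptor vectors are precomputed, and the array names and dimensions below are invented for illustration, not LIRA's actual interface.

```python
# Rank a precomputed descriptor database by Euclidean distance to a query
# and keep the compounds under a cutoff; most similar first.
import numpy as np

def screen(query_desc: np.ndarray, db_desc: np.ndarray, cutoff: float):
    """query_desc: (d,) descriptor vector; db_desc: (n, d) database matrix."""
    dists = np.linalg.norm(db_desc - query_desc, axis=1)   # Euclidean distance
    hits = np.where(dists <= cutoff)[0]
    return hits[np.argsort(dists[hits])]

# Toy usage: random vectors stand in for spherical-harmonic coefficients.
rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 25))      # e.g., 25 rotation-invariant coefficients
query = db[42] + rng.normal(scale=0.05, size=25)
print(screen(query, db, cutoff=1.0))  # indices of compounds passing the cutoff
```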
163.
Uncovering Features in Behaviorally Similar Programs
Su, Fang-Hsiang, January 2018
The detection of similar code can support many software engineering tasks such as program understanding and program classification. Many excellent approaches have been proposed to detect programs having similar syntactic features. However, these approaches are unable to identify programs that are dynamically or statistically close to each other, which we call behaviorally similar programs. We believe the detection of behaviorally similar programs can enhance or even automate tasks relevant to program classification. In this thesis, we discuss our current approaches to identifying programs having similar behavioral features from multiple perspectives.
We first discuss how to detect programs having similar functionality. While the definition of a program's functionality is undecidable, we use inputs and outputs (I/Os) of programs as a proxy for their functionality. We then use I/Os of programs as a behavioral feature to detect which programs are functionally similar: two programs are functionally similar if they share similar inputs and outputs. This approach has been studied and developed for the C language to detect functionally equivalent programs having equivalent I/Os. Nevertheless, some problems natural to object-oriented languages, such as input generation and comparisons between application-specific data types, hinder the development of this approach. We propose a new technique, in-vivo detection, which uses existing and meaningful inputs to drive applications systematically and then applies a novel similarity model, considering both inputs and outputs of programs, to detect functionally similar programs. We develop a tool, HitoshiIO, based on our in-vivo detection. In the subjects that we study, HitoshiIO correctly detects 68.4% of functionally similar programs, with a false positive rate of only 16.6%.
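As a toy illustration of the I/O proxy (not HitoshiIO's actual similarity model), two code fragments can be scored by the fraction of shared inputs on which their outputs agree:

```python
# Drive two functions with the same inputs and score how often their
# outputs match; syntactically different but functionally similar code
# scores high.
def io_similarity(f, g, inputs) -> float:
    matches = sum(1 for x in inputs if f(x) == g(x))
    return matches / len(inputs)

def sum_loop(xs):                 # iterative sum
    total = 0
    for x in xs:
        total += x
    return total

def sum_builtin(xs):              # same functionality, different syntax
    return sum(xs)

tests = [[1, 2, 3], [], [5], [-1, 1], list(range(100))]
print(io_similarity(sum_loop, sum_builtin, tests))   # 1.0 -> functionally similar
```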
In addition to functional I/Os of programs, we attempt to discover programs having similar execution behavior. Again, the execution behavior of a program can be undecidable, so we use instructions executed at run-time as a behavioral feature of a program. We create DyCLINK, which observes program executions and encodes them in dynamic instruction graphs. A vertex in a dynamic instruction graph is an instruction and an edge is a type of dependency between two instructions. The problem of detecting which programs have similar executions can then be reduced to a problem of inexact graph isomorphism. We propose a link-analysis-based algorithm, LinkSub, which vectorizes each dynamic instruction graph by the importance of every instruction, to solve this graph isomorphism problem efficiently. In a k-Nearest-Neighbor (KNN) based program classification experiment, DyCLINK achieves over 90% precision.
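A hedged sketch of the LinkSub idea, under my own simplifying assumptions (not DyCLINK's implementation): instruction importance is computed with PageRank and pooled per opcode into a fixed-length vector, so two executions can be compared by cosine similarity rather than by solving inexact graph isomorphism directly. The opcode vocabulary and graphs below are invented toys.

```python
import networkx as nx
import numpy as np

OPCODES = ['load', 'store', 'add', 'mul', 'cmp', 'jmp']

def graph_vector(g: nx.DiGraph) -> np.ndarray:
    rank = nx.pagerank(g, alpha=0.85)          # importance of each instruction
    vec = np.zeros(len(OPCODES))
    for node, score in rank.items():           # pool importance by opcode
        vec[OPCODES.index(g.nodes[node]['op'])] += score
    return vec

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy graphs: nodes are executed instructions, edges are dependencies.
g1 = nx.DiGraph()
g1.add_nodes_from([(0, {'op': 'load'}), (1, {'op': 'add'}), (2, {'op': 'store'})])
g1.add_edges_from([(0, 1), (1, 2)])
g2 = nx.DiGraph()
g2.add_nodes_from([(0, {'op': 'load'}), (1, {'op': 'add'}),
                   (2, {'op': 'add'}), (3, {'op': 'store'})])
g2.add_edges_from([(0, 1), (1, 2), (2, 3)])
print(cosine(graph_vector(g1), graph_vector(g2)))   # near 1.0: similar executions
```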
Because HitoshiIO and DyCLINK both rely on dynamic analysis to expose program behavior, they are better at locating and searching for behaviorally similar programs than traditional static analysis tools. However, they suffer from common problems of dynamic analysis, such as input generation and run-time overhead, which may make our approaches challenging to scale. Thus, we create the system Macneto, which integrates static analysis with topic modeling and deep learning to approximate program behaviors from their binaries without actually executing the programs. In our deobfuscation experiments, considering two commercial obfuscators that alter lexical information and syntax in programs, Macneto achieves over 90% precision, where the ground truth is that the behavior of a program before and after obfuscation should be the same.
In this thesis, we offer a more extensive view of similar programs than the traditional definitions. While the traditional definitions of similar programs mostly use static features, such as syntax and lexical information, we propose to leverage the power of dynamic analysis and machine learning models to trace and collect behavioral features of programs. These behavioral features can then be applied to detect behaviorally similar programs. We believe the techniques invented in this thesis to detect behaviorally similar programs can improve the development of software engineering and security applications, such as code search and deobfuscation.
164.
Hypothesis formulation in medical records space
Ba-Dhfari, Thamer Omer Faraj, January 2017
Patient medical records are a valuable resource that can be used for many purposes, including managing and planning for future health needs as well as clinical research. Health databases such as the Clinical Practice Research Datalink (CPRD) and many similar initiatives can provide researchers with a useful data source on which to test their medical hypotheses. However, this is only the case when researchers have a good set of hypotheses to test on the data. Conversely, the data may have other, equally important areas that remain unexplored, and some important signals in the data could be missed. Therefore, further analysis is required to make such hidden areas more obvious and attainable for future exploration and investigation. Data mining techniques can be effective tools for discovering patterns and signals in large-scale patient data sets, and they have been widely applied to different areas in the medical domain. Analysing patient data using such techniques therefore has the potential to explore the data and provide a better understanding of the information in patient records. However, the heterogeneity and complexity of medical data can be an obstacle to applying data mining techniques, and much of the potential value of this data goes untapped. This thesis describes a novel methodology that reduces the dimensionality of primary care data to make it more amenable to visualisation, mining and clustering. The methodology employs a combination of ontology-based semantic similarity and principal component analysis (PCA) to map the data into an appropriate and informative low-dimensional space. The aim of this thesis is to develop a novel methodology that provides a visualisation of patient records; this visualisation offers a systematic method for formulating new and testable hypotheses which can be fed to researchers to carry out the subsequent phases of research. In a small-scale study based on Salford Integrated Record (SIR) data, I have demonstrated that this mapping provides informative views of patient phenotypes across a population and allows the construction of clusters of patients sharing common diagnoses and treatments. The next phase of the research was to develop this methodology and explore its application using larger patient cohorts, whose data contain more precise relationships between features than small-scale data and support the understanding of distinct population patterns and the extraction of common features. For these reasons, I applied the mapping methodology to patient records from the CPRD database; the study data set consisted of anonymised patient records for a population of 2.7 million patients. This analysis showed that the methodology scales as O(n) without requiring large computing resources. The low-dimensional visualisation of high-dimensional patient data allowed the identification of different subpopulations of patients across the study data set, where each subpopulation consisted of patients sharing similar characteristics such as age, gender and certain types of disease. A key finding of this research is the wealth of data that can be produced: in the first use case, looking at the stratification of patients with falls, the methodology yielded important hypotheses; however, this work has barely scratched the surface of how this mapping could be used.
It opens up the possibility of applying a wide range of data mining strategies that have not yet been explored. What the thesis has shown is one strategy that works, but there could be many more. Furthermore, there is no aspect of the implementation of this methodology that restricts it to medical data. The same methodology could equally be applied to the analysis and visualisation of many other sources of data that are described using terms from taxonomies or ontologies.
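A minimal sketch of the mapping pipeline described above, assuming a simple set-overlap similarity in place of true ontology-based semantic similarity (the patient codes below are invented):

```python
# Build a patient-by-patient similarity matrix (Jaccard overlap of coded
# terms stands in for ontology-based semantic similarity) and project it
# to a 2-D "patient map" with PCA.
import numpy as np
from sklearn.decomposition import PCA

patients = {
    'p1': {'diabetes', 'hypertension'},
    'p2': {'diabetes', 'neuropathy'},
    'p3': {'fracture', 'fall'},
    'p4': {'fall', 'hypertension'},
}

ids = sorted(patients)
n = len(ids)
S = np.zeros((n, n))
for i, a in enumerate(ids):
    for j, b in enumerate(ids):
        S[i, j] = len(patients[a] & patients[b]) / len(patients[a] | patients[b])

coords = PCA(n_components=2).fit_transform(S)   # low-dimensional patient map
for pid, (x, y) in zip(ids, coords):
    print(pid, round(x, 3), round(y, 3))        # nearby points share diagnoses
```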
165.
Coordinated and multiple views for exploration of similarity maps
Danilo Medeiros Eler, 18 March 2011
Currently, various fields of application need effective mechanisms to analyse data differing in nature. Typically these data are abstract, unstructured and multidimensional (e.g. document collections). Data that are not multidimensional by nature can be represented as such by means of feature extraction algorithms (e.g. image collections). Thus, information visualization techniques designed to interpret multidimensional data sets can be employed to analyse unstructured data. This thesis employed information visualization techniques that build similarity maps from multidimensional data as a form of data representation, since the techniques to construct such maps have evolved along with their expanding fields of application. Novel techniques for the coordination of multiple views were developed that allow the exploration of data sets through similarity maps generated by different map construction techniques, different parameters or even different data sets. The coordination techniques developed are based on identity relationships, on distance relationships, on topic coverage (for text or other annotated data) and on the evolution of topic coverage over time (also for text); an approach to combine different coordination techniques was also developed. This thesis also reports on applications of the coordination techniques developed, and on tools built for the analysis of image, text and volumetric data employing coordinated similarity maps. The techniques developed in this work are supported by a coordination model that extends a model previously proposed in the literature. The extended model allows coordination techniques to be defined and configured during coordination tasks and supports various types of mappings. An important feature of the model is its support for dynamic mappings, that is, mappings that may change behavior according to user interaction. As a result of this thesis, a framework is available for the coordinated visualization of multiple similarity maps, composed of a model, a set of techniques and a set of implemented tools that effectively support the visual analysis of multidimensional data sets.
166.
Study and implementation of a network traffic generator with long-range dependence
Fernando Lemos de Mello, 10 November 2006
Measurements have shown that multiservice network traffic has fractal properties such as self-similarity and long memory, or long-range dependence (LRD). Long memory is characterized by the existence of a pole at the origin of the power spectral density function (a 1/f shape). It has also been observed that traffic may present short-range dependence (SRD) at some time scales. The use of a realistic aggregated traffic generator, one that synthesizes fractal time series, is fundamental to the validation of traffic control algorithms. In this work, approximate realizations of two kinds of self-similar random processes are synthesized via the wavelet transform: Fractional Gaussian Noise (fGN) and the Multifractal Wavelet Model (MWM). The proposed method is also capable of synthesizing Gaussian (fGN) and non-Gaussian (MWM) time series with more general spectra than 1/f, that is, series that also exhibit short-range dependence. Generation proceeds in two stages. The first generates an approximate realization of fGN or MWM via the Discrete Wavelet Transform (DWT); the second introduces SRD by applying an Infinite Impulse Response (IIR) filter to the output of the first stage. A detailed characterization of the resulting series was carried out, using second-, third- and fourth-order statistical moments as well as statistical tests specific to self-similar series. Additionally, two conversion alternatives are presented for transforming the generated time series into packet series, the format suitable for transmission by a packet generator module. The packet series are analyzed again to determine whether the conversion method introduces distortion into the self-similar characteristics of the synthesized series. It is shown that the self-similar packet series can be used in network simulation software or, alternatively, to inject packets into a test network. Using resources of the NS-2 simulator, the synthesized packet series were introduced into appropriate simulation scenarios. The results (average delay, packet loss for the traffic of interest, and queue length) of scenarios whose interfering traffic corresponded to packet series based on the fGN and MWM models were compared with results obtained in scenarios whose interfering traffic was generated with a Poisson model.
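A rough illustration of the two-stage design, with spectral shaping standing in for the thesis's wavelet-based stage and an invented AR(1) filter for the SRD stage:

```python
# Stage 1: synthesize an approximate long-range-dependent (1/f-type) Gaussian
# series by shaping white noise in the frequency domain (a stand-in for the
# wavelet-based fGN/MWM generators). Stage 2: add short-range dependence
# with an IIR filter.
import numpy as np
from scipy.signal import lfilter

def lrd_series(n: int, hurst: float, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    spec = np.fft.rfft(rng.normal(size=n))
    freqs = np.fft.rfftfreq(n)
    freqs[0] = freqs[1]                      # avoid division by zero at DC
    spec *= freqs ** (-(2 * hurst - 1) / 2)  # PSD ~ 1/f^(2H-1), fGN-like
    return np.fft.irfft(spec, n)

x = lrd_series(4096, hurst=0.8)              # stage 1: LRD series
y = lfilter([1.0], [1.0, -0.5], x)           # stage 2: AR(1)-type IIR adds SRD
print(x.std(), y.std())
```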
167.
Detection, management and querying of replicas and versions of XML documents
Saccol, Deise de Brum, January 2008
The overall goals of this thesis are the detection, management and querying of replicas and versions of XML documents. We denote by replica an identical copy of a real-world object, and by version a different but very similar representation of that object. Previous works focus on version management and querying rather than on version detection, i.e., the problem of detecting that two or more apparently distinct objects are variations (versions) of the same object. However, the version detection problem is critical in many scenarios, such as plagiarism detection, Web page ranking, software clone identification, and peer-to-peer (P2P) search. In this thesis, we assume that several replicas of an XML document may exist. XML documents can also be modified over time, causing the creation of versions. Replica detection is relatively simple and can be achieved by using hash functions. Version detection relies on similarity concepts, which can be assessed by metrics such as content similarity, structure similarity, and subject similarity. Besides the similarity analysis among files, it is also necessary to define a version detection mechanism that allows the management and subsequent querying of the detected replicas and versions. To achieve the goals of the thesis, we defined a set of similarity functions for XML files, the replica and version detection mechanism, a framework in which this mechanism can be embedded, and the framework components that allow managing and querying the detected replicas and versions. We performed a set of experiments validating the proposed mechanism and implemented tool prototypes that demonstrate the accuracy of the framework components. As its main distinguishing point, this thesis treats the version detection problem as a classification problem, for which the use of thresholds is not necessary; this approach is achieved by using Naïve Bayesian classifiers. Results demonstrate the good quality obtained with the proposed mechanism.
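A hedged sketch of the two detection steps just described: replicas via hashing, and versions via a Naïve Bayes classifier over similarity features, avoiding fixed thresholds. The feature values and training pairs below are invented for illustration.

```python
# Replica detection: identical bytes hash to the same digest.
# Version detection: classify document pairs from similarity features
# (content, structure, subject) with Gaussian Naive Bayes.
import hashlib
import numpy as np
from sklearn.naive_bayes import GaussianNB

def is_replica(doc_a: bytes, doc_b: bytes) -> bool:
    return hashlib.sha256(doc_a).digest() == hashlib.sha256(doc_b).digest()

# Each row: [content_sim, structure_sim, subject_sim]; label 1 = versions.
X_train = np.array([[0.9, 0.8, 0.9], [0.8, 0.9, 0.7],
                    [0.2, 0.3, 0.1], [0.1, 0.2, 0.3]])
y_train = np.array([1, 1, 0, 0])

clf = GaussianNB().fit(X_train, y_train)
print(is_replica(b'<a/>', b'<a/>'))      # True: identical copies
print(clf.predict([[0.85, 0.7, 0.8]]))   # [1]: classified as versions
```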
168.
TOWARD A TWO-STAGE MODEL OF FREE CATEGORIZATION
Smith, Gregory J, 01 September 2015
This research examines how the comparison of objects underlies free categorization, an essential component of human cognition. Previous results using our binomial labeling task have shown that classification probabilities are affected in a graded manner as a function of similarity, i.e., the number of features shared by two objects. In a similarity rating task, people also rated objects sharing more features as more similar. However, the effect of matching features was approximately linear in the similarity task but superadditive (exponential) in the labeling task. We hypothesize that this difference arises because people must select specific objects to compare before deciding whether to put them in the same category in the labeling task, whereas they were given specific pairs to compare in the rating task. Thus, the number of features shared by two objects could affect both stages (selection and comparison) in the labeling task, which might explain their superadditive effect, whereas it affected only the later comparison stage in the similarity rating task. In this experiment, participants saw visual displays consisting of 16 objects from three novel superordinate artificial categories, and were asked to generate binomial (letter-number) labels for each object to indicate their superordinate and subordinate category membership. Only one object could be viewed at a time, and the objects could be viewed in any order. This made it possible to record which objects people examine when labeling a given object, which in turn permits separate assessment of stage 1 (selection) versus stage 2 (comparison/decision). Our primary objective in this experiment was to determine whether the increase in category labeling probabilities as a function of level of match (similarity) can be explained by increased sampling alone (stage 1 model), an increased perception of similarity following sampling (stage 2 model), or some combination (mixed model). The results were consistent with earlier studies in showing that the number of matching discrete features shared by two objects affected the probability of same-category label assignment. However, there was no effect of the level of match on the probability of visiting the first matching object while labeling the second. This suggests that the labeling effect is not due to differences in the likelihood of comparing matching objects (stage 1) as a function of the level of match. Thus, the present data provide support for a stage-2-only model, in which the evaluation of similarity is the primary component underlying the level-of-match effect on free categorization.
169.
Grouping Biological Data
Rundqvist, David, January 2006
Today, scientists in various biomedical fields rely on biological data sources in their research. Large amounts of information concerning, for instance, genes, proteins and diseases are publicly available on the internet, and are used daily for acquiring knowledge. Typically, biological data is spread across multiple sources, which has led to heterogeneity and redundancy.

The current thesis suggests grouping as one way of computationally managing biological data. A conceptual model for this purpose is presented, which takes properties specific to biological data into account. The model defines sub-tasks and key issues where multiple solutions are possible, and describes the approaches to these that have been used in earlier work. Further, an implementation of this model is described, as well as test cases which show that the model is indeed useful.

Since the use of ontologies is relatively new in the management of biological data, the main focus of the thesis is on how semantic similarity of ontological annotations can be used for grouping. The results of the test cases show, for example, that the implementation of the model, using Gene Ontology, is capable of producing groups of data entries with similar molecular functions.
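A minimal sketch of annotation-based grouping, where plain Jaccard overlap of GO term sets and an arbitrary 0.5 threshold stand in for a real ontology-aware semantic similarity measure:

```python
# Group genes whose Gene Ontology annotation sets overlap strongly; a real
# implementation would weight terms by their position in the ontology graph.
annotations = {
    'geneA': {'GO:0003824', 'GO:0016787'},   # catalytic / hydrolase activity
    'geneB': {'GO:0003824', 'GO:0016787'},
    'geneC': {'GO:0005215'},                 # transporter activity
}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

groups = []
for gene, terms in annotations.items():
    for group in groups:
        if jaccard(terms, annotations[group[0]]) >= 0.5:
            group.append(gene)
            break
    else:
        groups.append([gene])

print(groups)   # [['geneA', 'geneB'], ['geneC']]: similar molecular functions
```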
170.
Efficient Semantic-based Content Search in P2P Network
Shen, Heng Tao; Shu, Yan Feng; Yu, Bei
Most existing Peer-to-Peer (P2P) systems support only title-based searches and are limited in functionality when compared to today's search engines. In this paper, we present the design of a distributed P2P information sharing system that supports semantic-based content searches for relevant documents. First, we propose a general and extensible framework for searching for similar documents in a P2P network, based on the novel concept of a Hierarchical Summary Structure. Second, based on this framework, we develop an efficient document searching system by effectively summarizing and maintaining all documents within the network at different granularities. Finally, an experimental study is conducted on a real P2P prototype, and a large-scale network is further simulated. The results show the effectiveness, efficiency and scalability of the proposed system.
Singapore-MIT Alliance (SMA)
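A toy sketch of summary-based query routing, the idea underlying such frameworks (this is my illustration, not the paper's Hierarchical Summary Structure): each peer publishes a compact summary of its documents, and a query is forwarded only to the peers whose summaries lie closest, instead of flooding the network.

```python
# Each peer's summary is the centroid of its document vectors; a query is
# routed to the k peers with the nearest summaries.
import numpy as np

rng = np.random.default_rng(1)
peers = {f'peer{i}': rng.normal(size=(20, 8)) for i in range(5)}   # docs per peer
summaries = {p: docs.mean(axis=0) for p, docs in peers.items()}    # one vector each

def route(query: np.ndarray, k: int = 2):
    ranked = sorted(summaries, key=lambda p: np.linalg.norm(summaries[p] - query))
    return ranked[:k]   # only these peers receive the query

query = rng.normal(size=8)
print(route(query))     # e.g., ['peer3', 'peer0']
```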