1131 |
Measuring Semantic Distance using Distributional Profiles of Concepts
Mohammad, Saif 01 August 2008
Semantic distance is a measure of how close or distant in meaning two units of language are. A large number of important natural language problems, including machine translation and word sense disambiguation, can be viewed as semantic distance problems.
The two dominant approaches to estimating semantic distance are the WordNet-based semantic measures and the corpus-based distributional measures. In this thesis, I compare them, both qualitatively and quantitatively, and identify the limitations of each.
This thesis argues that semantic distance is essentially a property of concepts (rather than words) and that two concepts are semantically close if they occur in similar contexts.
Instead of identifying the co-occurrence (distributional) profiles of words (distributional hypothesis), I argue that distributional profiles of concepts (DPCs) can be used to infer the semantic properties of concepts and indeed to estimate semantic distance more accurately. I propose a new hybrid approach to calculating semantic distance that combines corpus statistics and a published thesaurus (Macquarie Thesaurus).
The algorithm determines estimates of the DPCs using the categories in the thesaurus as very coarse concepts and, notably, without requiring any sense-annotated data. Even though the use of only about 1000 concepts to represent the vocabulary of a language seems drastic, I show that the method achieves results better than the state-of-the-art in a number of natural language tasks.
I show how cross-lingual DPCs can be created by combining text in one language with a thesaurus from another. Using these cross-lingual DPCs, we can solve problems in one, possibly resource-poor, language using a knowledge source from another, possibly resource-rich, language. I show that the approach is also useful in tasks that inherently involve two or more languages, such as machine translation and multilingual text summarization.
The proposed approach is computationally inexpensive, can estimate both semantic relatedness and semantic similarity, and can be applied to all parts of speech.
Extensive experiments on ranking word pairs by semantic distance, real-word spelling correction, solving Reader's Digest word choice problems, determining word sense dominance, word sense disambiguation, and word translation show that the new approach is markedly superior to previous ones.
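As a rough illustration of the idea in this abstract (not the thesis's actual algorithm), the sketch below builds a distributional profile for each coarse thesaurus category from raw text and compares categories by cosine similarity. The toy thesaurus, corpus, stopword list, window size and all function names are assumptions made up for the example, and every occurrence of an ambiguous word is simply counted towards all of its categories.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy thesaurus: category -> words listed under it.
THESAURUS = {
    "FINANCE": {"bank", "money", "loan"},
    "RIVER": {"bank", "water", "shore"},
    "CURRENCY": {"money", "coin", "cash"},
}

# Hypothetical toy corpus and stopword list.
CORPUS = [
    "the bank approved the loan",
    "she deposited cash at the bank",
    "he paid the loan with money",
    "the river water reached the shore",
]
STOPWORDS = {"the", "at", "with", "she", "he", "a"}


def concept_profiles(sentences, thesaurus, window=2):
    """Count the words that co-occur with any word of each category."""
    profiles = {cat: Counter() for cat in thesaurus}
    for sentence in sentences:
        toks = sentence.lower().split()
        for i, tok in enumerate(toks):
            # Context = up to `window` tokens on each side, minus stopwords.
            ctx = toks[max(0, i - window): i] + toks[i + 1: i + 1 + window]
            ctx = [w for w in ctx if w not in STOPWORDS]
            for cat, words in thesaurus.items():
                if tok in words:
                    profiles[cat].update(ctx)
    return profiles


def cosine(p, q):
    """Cosine similarity between two sparse count profiles."""
    dot = sum(p[w] * q[w] for w in p if w in q)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


PROFILES = concept_profiles(CORPUS, THESAURUS)
```

Calling `cosine(PROFILES["FINANCE"], PROFILES["CURRENCY"])` gives a closeness score for two categories; no sense-annotated data is involved, only the thesaurus listing and unlabelled text.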
1132 |
Studying the effectiveness of dynamic analysis for fingerprinting Android malware behavior / En studie av effektivitet hos dynamisk analys för kartläggning av beteenden hos Android malware
Regard, Viktor January 2019
Android is the second most targeted operating system for malware authors, and countering the development of Android malware requires more knowledge about its behavior. There are two main approaches to analyzing Android malware: static and dynamic analysis. In 2017, a study and a well-labeled dataset named AMD (Android Malware Dataset), consisting of over 24,000 malware samples, was released. The dataset is divided into 135 varieties based on similar malicious behavior, retrieved through static analysis of the file classes.dex in the APK of each malware sample, whereas the labeled features were determined by manual inspection of three samples in each variety. However, static analysis is known to be weak against obfuscation techniques, such as repackaging or dynamic loading, which can be exploited to evade the analysis. In this study the second approach is used, and all malware in the dataset is analyzed at run-time in order to monitor its dynamic behavior. Run-time analysis has known weaknesses as well, since it can be evaded through, for instance, anti-emulator techniques. The study therefore aimed to explore the available sandbox environments for dynamic analysis, to study the effectiveness of fingerprinting Android malware using one of the tools, and to investigate whether the static features from AMD and the dynamic analysis correlate. This was done, for instance, by attempting to classify the samples based on similar dynamic features and by calculating the Pearson Correlation Coefficient (r) for all combinations of features from AMD and the dynamic analysis. The comparison of tools for dynamic analysis showed a need for further development, as most popular tools were released a long time ago and share a lack of continuous maintenance. As a result, Droidbox was chosen as the sandbox environment for this study because it is easy to install and use and easy to adapt for large-scale analysis.
Based on the dynamic features extracted with Droidbox, it could be shown that Android malware samples are most similar to the varieties to which they belong. The best metric for classifying samples into varieties, out of the four investigated, turned out to be Cosine Similarity, which achieved an accuracy of 83.6% on the entire dataset. The high accuracy indicated a correlation between the dynamic features and the static features on which the varieties are based. Furthermore, the Pearson Correlation Coefficient confirmed that the manually extracted features used to describe the varieties and the dynamic features are correlated to some extent, which was partially confirmed by a manual inspection at the end of the study.
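The two measures named in this abstract can be sketched in a few lines. The example below assumes each sample is a vector of dynamic feature counts and assigns it to the variety whose mean vector is closest by cosine similarity; the nearest-centroid rule, the feature layout and the function names are illustrative assumptions, since the abstract does not spell out the classification procedure.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def classify(sample, centroids):
    """Return the variety whose centroid is most similar to the sample."""
    return max(centroids, key=lambda name: cosine(sample, centroids[name]))


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


# Hypothetical feature columns: [file_writes, sms_sent, net_connections].
VARIETIES = {
    "variety_a": [[5, 0, 1], [6, 0, 2]],
    "variety_b": [[0, 4, 1], [1, 5, 0]],
}
CENTROIDS = {name: centroid(vs) for name, vs in VARIETIES.items()}
```

With this toy data, `classify([4, 1, 1], CENTROIDS)` picks the variety dominated by file writes, and `pearson_r` can be applied to any pair of static and dynamic feature columns.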
1133 |
PDF document search within a very large database
Wang, Lizhong January 2017
A digital search engine, which takes a search request from a user and returns a matching result, is indispensable for modern people who are used to surfing the Internet. At the same time, the PDF document format is accepted by more and more people and has become widely used because of its convenience and effectiveness. As a result, the traditional library has already started to be replaced by the digital one. Taken together, these two factors create an urgent need for a document-based search engine that can query a digital document database with an input file. This thesis is a software development project that aims to design and implement a prototype of such a search engine and to propose potential optimization methods for Loredge. The research is divided into two parts: prototype development and optimization analysis. It involves an analytical study of sample documents provided by Loredge and a multi-perspective performance analysis. The prototype consists of reading, preprocessing and similarity measurement. The reading part reads in a PDF file using the Java library Apache PDFBox. The preprocessing part processes the document that has been read in and generates a document fingerprint. The similarity measurement is the final stage, which measures the similarity between the input fingerprint and all the document fingerprints in the database. The optimization analysis aims to balance resource consumption, covering response time, accuracy rate and memory consumption. According to the performance analysis, the shorter the document fingerprint, the better the search program performs. Moreover, a permanent feature database and a similarity-based filtration mechanism are proposed to further optimize the program. This project has laid a solid foundation for further study of document-based search engines by providing a feasible prototype and sufficient relevant experimental data.
The study concludes that future work should focus mainly on improving the effectiveness of database access, which involves data entry labeling and search algorithm optimization. / Digital sökmotor, som tar en sökfråga från användaren och sedan returnerar ett resultat som svarar på den begäran tillbaka till användaren, är oumbärligt för moderna människor som brukar surfa på Internet. Å andra sidan, det digitala dokumentets format PDF accepteras av fler och fler människor, och det används i stor utsträckning i denna tidsålder på grund av bekvämlighet och effektivitet. Det följer att det traditionella biblioteket redan har börjat bytas ut av det digitala biblioteket. När dessa två faktorer kombineras, framgår det att det brådskande behövs en dokumentbaserad sökmotor, som har förmåga att fråga en digital databas om en viss fil. Den här uppsatsen är en mjukvaruutveckling som syftar till att designa och implementera en prototyp av en sådan sökmotor, och föreslå relevant optimeringsmetod för Loredge. Den här undersökningen kan huvudsakligen delas in i två kategorier, prototyputveckling och optimeringsanalys. Arbeten involverar en analytisk forskning om exempeldokument som kommer från Loredge och en prestandaanalys utifrån flera perspektiv. Prototypen innehåller läsning, förbehandling och likhetsmätning. Läsningsdelen läser in en PDF-fil med hjälp av en importerad Java bibliotek, Apache PDFBox. Förbehandlingsdelen bearbetar det inlästa dokumentet och genererar ett dokumentfingeravtryck. Likhetsmätningen är det sista steget, som mäter likheten mellan det inlästa fingeravtrycket och fingeravtryck av alla dokument i Loredge databas. Målet med optimeringsanalysen är att balansera resursförbrukningen, som involverar responstid, noggrannhet och minnesförbrukning. Ju kortare ett dokuments fingeravtryck är, desto bättre prestanda visar sökprogram enligt resultat av prestandaanalysen. 
Dessutom föreslås en permanent databas med fingeravtryck, och en likhetsbaserad filtreringsmekanism för att ytterligare optimera sökprogrammet. Det här projektet har lagt en solid grund för vidare studier om dokumentbaserad sökmotorn, genom att tillhandahålla en genomförbar prototyp och tillräckligt relevanta experimentella data. Den här studie visar att kommande forskning bör huvudsakligen inriktas på att förbättra effektivitet i databasåtkomsten, vilken innefattar data märkning och optimering av sökalgoritm.
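The preprocessing and similarity-measurement stages described in this record can be sketched as follows. The abstract does not say how the fingerprint is built, so this example assumes a fingerprint made of the smallest hashed word shingles, compared by Jaccard similarity; PDF text extraction (done with Apache PDFBox in the thesis) is left out and plain strings stand in for extracted text. All names and parameters are hypothetical.

```python
import hashlib


def fingerprint(text, shingle=3, size=64):
    """Keep the `size` smallest hashes of all word shingles in the text.

    A smaller `size` gives a shorter fingerprint and cheaper comparisons.
    """
    words = text.lower().split()
    count = max(1, len(words) - shingle + 1)
    shingles = {" ".join(words[i:i + shingle]) for i in range(count)}
    hashes = sorted(
        int(hashlib.sha1(s.encode("utf-8")).hexdigest()[:12], 16)
        for s in shingles
    )
    return frozenset(hashes[:size])


def jaccard(a, b):
    """Jaccard similarity between two fingerprints (sets of hashes)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def search(query_text, database, top=3):
    """Rank stored fingerprints by similarity to the query document."""
    query = fingerprint(query_text)
    scored = [(jaccard(query, fp), name) for name, fp in database.items()]
    return sorted(scored, reverse=True)[:top]
```

Storing the fingerprints once and reusing them across queries corresponds to the permanent feature database proposed in the abstract; only the query document has to be fingerprinted at search time.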
1134 |
Three unknown Carthusian liturgical manuscripts with music of the 14th to the 16th centuries in the Grey Collection, South African Library, Cape Town
Steyn, Frances Caroline 11 1900
Of the three manuscripts that form the basis of this thesis, MS Cape Town, South African Library, Grey 4c7 is, in musicological terms, the most important. It is a complete Carthusian Antiphonary of the late 14th century, written for the Charterhouse of Champmol, near Dijon, the mausoleum of the Dukes of Burgundy. It also contains an extensive Tonary, a Hymnary and a Kyriale. The two didactic verses which form part of the Tonary are of particular importance, since MS 4c7 is one of the few manuscripts in the world intended for musical performance to contain the Ter terni by William of Hirsau; furthermore, it is apparently the only Carthusian manuscript of any kind to contain the Dyapente et dyatessaron by Hucbald. The manuscript is placed in the context of the Carthusian liturgy of the 12th to the 16th centuries and is compared with 33 manuscripts of this period. It is shown that, although a marked textual similarity exists between the manuscripts, there are variant melodies. The conclusion is therefore drawn that the Carthusians did not have a single exemplar for the melodies in their liturgical books. It is shown that MS 4c7 and MS Dijon, Bibliotheque municipale 118, also written for Champmol, were copied from the same exemplar and that they are closely related to MSS Beaune, Bibliotheque municipale 27, 34 and 41, of the neighbouring Charterhouse of Fontenay.
The second manuscript, MS Grey 3c23, an Antiphonary for nuns, for Lauds and Vespers, written for the Charterhouse of Mont-Sainte-Marie at Gosnay, near Arras, has been dated 1538 by the original scribe. This manuscript is almost identical to MS AGC C II 817. The presence of a Sequence, foreign to the Carthusian tradition, is, however, unique to MS 3c23.
The third manuscript, MS Grey 6b3, is an Evangeliary, signed by the scribe, Amelontius de Ercklems, in 1520. Its provenance is the Charterhouse of Our Lady of the Twelve Apostles at Mont-Cornillon, near Liege. The musicological features of the manuscript discussed are the Hymn 'Te decet laus' and the accent neumes at the ends of pericopes. / Art History, Visual Arts & Musicology / D.Mus. (Musicology)
1135 |
An investigation of the level of selected trace metals in plant species within the vicinity of tantalum mining area in Gatumba, Ngororero District, Rwanda
Gakwerere, François 02 April 2013
Due to mining activities, the natural vegetation cover in the Gatumba area was removed and replaced either by crops or by bare wasteland, reducing the available arable land. The main aim of the study was to assess the impact of the mining activities on plant mineral uptake and on the dynamics of the vegetation. The vegetation in the area under investigation was diversified and heterogeneous. Trace element concentrations in soils were similar to those in plant parts, but some elements were more highly concentrated in soils than in plants. According to the bioaccumulation factors of the analyzed trace elements in plant parts, two categories of plants were identified: excluders and accumulators. No toxic levels of the evaluated trace elements were found in the analyzed plant samples. As a recommendation for the adaptation of plants to the Gatumba mining environment, the most useful plant species for the revegetation/restitution of the technosols should be Sesbania sesban, Crotalaria dewildemaniana and Tithonia diversifolia, subject to further experiments on trace element bioaccumulation and organic matter production. / Environmental Sciences / M.A. Science (Environmental Sciences)
1136 |
多項分配之分類方法比較與實證研究 / An empirical study of classification on multinomial data
高靖翔, Kao, Ching Hsiang Unknown Date
由於電腦科技的快速發展,網際網路(World Wide Web;簡稱WWW)使得資料共享及搜尋更為便利,其中的網路搜尋引擎(Search Engine)更是尋找資料的利器,最知名的「Google」公司就是藉由搜尋引擎而發跡。網頁搜尋多半依賴各網頁的特徵,像是熵(Entropy)即是最為常用的特徵指標,藉由使用者選取「關鍵字詞」,找出與使用者最相似的網頁,換言之,找出相似指標函數最高的網頁。藉由相似指標函數分類也常見於生物學及生態學,但多半會計算兩個社群間的相似性,再判定兩個社群是否相似,與搜尋引擎只計算單一社群的想法不同。
本文的目標在於研究若資料服從多項分配,特別是似幾何分配的多項分配(許多生態社群都滿足這個假設),單一社群的指標、兩個社群間的相似指標,何者會有較佳的分類正確性。本文考慮的指標包括單一社群的熵及Simpson指標、兩社群間的熵及相似指標(Yue and Clayton, 2005)、支持向量機(Support Vector Machine)、邏輯斯迴歸等方法,透過電腦模擬及交叉驗證(cross-validation)比較方法的優劣。本文發現單一社群熵指標之表現,在本文的模擬研究有不錯的分類結果,甚至普遍優於支持向量機,但單一社群熵指標分類法的結果並不穩定,為該分類方法之主要缺點。 / With the rapid development of computer technology, the World Wide Web has made it much easier to share and search for information. Search engines help us find the target information conveniently; Google, the best-known example, built its success on its search engine. Web search mostly depends on characteristics of the web pages; entropy, for example, is one commonly used characteristic index. The target web pages are found by combining the index with the keywords given by the user; in other words, the search finds the web pages most similar to the user’s request. In biology and ecology, similarity index functions are also commonly used for classification problems. In practice, however, a pairwise similarity between two communities is usually computed to check whether the two communities are similar, which differs from the search-engine approach of computing an index for a single community.
This research investigates which gives better classification results, a single-community index or a pairwise index, when the data follow a multinomial distribution, especially one resembling a geometric distribution. This assumption is often satisfied in ecology. The following classification methods are considered: single-community indices, including entropy and the Simpson index; pairwise indices, including pairwise entropy and the similarity index (Yue and Clayton, 2005); and support vector machines and logistic regression. The methods are compared through computer simulation and cross-validation. In the simulation study, the single-community entropy index gives good classification results, sometimes even better than a support vector machine applied to the raw data. However, classification by entropy is not very robust, which is the main weakness to be improved in future work.
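The two single-community indices named in this abstract have standard definitions, sketched below for a vector of category counts or proportions. The helper that generates geometric-like proportions is a hypothetical stand-in for the thesis's simulation setup, which the abstract does not describe in detail.

```python
from math import log


def shannon_entropy(counts):
    """Shannon entropy -sum(p_i * ln p_i) of a vector of counts or proportions."""
    total = sum(counts)
    return -sum((c / total) * log(c / total) for c in counts if c > 0)


def simpson_index(counts):
    """Simpson index sum(p_i ** 2): the chance two random draws share a category."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)


def geometric_like(p, k):
    """Hypothetical helper: k category proportions decaying like p * (1 - p) ** i."""
    raw = [p * (1 - p) ** i for i in range(k)]
    total = sum(raw)
    return [r / total for r in raw]
```

A community with a steeper decay (larger `p`) is more dominated by its first few categories, so its entropy is lower and its Simpson index higher; a single-index classifier separates communities on that one number.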
1137 |
Extracting group relationships within changing software using text analysis
Green, Pamela Dilys January 2013
This research looks at identifying and classifying changes in evolving software by making simple textual comparisons between groups of source code files. The two areas investigated are software origin analysis and collusion detection. Textual comparison is attractive because it can be used in the same way for many different programming languages. The research includes the first major study using machine learning techniques in the domain of software origin analysis, which looks at the movement of code in an evolving system. The training set for this study, which focuses on restructured files, is created by analysing 89 software systems. Novel features, which capture abstract patterns in the comparisons between source code files, are used to build models which classify restructured files from unseen systems with a mean accuracy of over 90%. The unseen code is not only in C, the language of the training set, but also in Java and Python, which helps to demonstrate the language independence of the approach. As well as generating features for the machine learning system, textual comparisons between groups of files are used in other ways throughout the system: in filtering to find potentially restructured files, in ranking the possible destinations of the code moved from the restructured files, and as the basis for a new file comparison tool. This tool helps in the demanding task of manually labelling the training data, is valuable to the end user of the system, and is applicable to other file comparison tasks. These same techniques are used to create a new text-based visualisation for use in collusion detection, and to generate a measure which focuses on the unusual similarity between submissions. This measure helps to overcome problems in detecting collusion in data where files are of uneven size, where there is high incidental similarity or where more than one programming language is used. 
The visualisation highlights interesting similarities between files, making the task of inspecting the texts easier for the user.
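A language-independent textual comparison between groups of files can be sketched with Python's standard `difflib`. This is an illustrative stand-in, not the comparison method of the thesis: it scores each pair of files by the proportion of matching lines and, for every file in an old release, ranks the files of a new release as possible destinations of its code. The function names are assumptions.

```python
from difflib import SequenceMatcher


def line_similarity(src_a, src_b):
    """Similarity in [0, 1] based on matching non-blank, stripped lines."""
    a = [ln.strip() for ln in src_a.splitlines() if ln.strip()]
    b = [ln.strip() for ln in src_b.splitlines() if ln.strip()]
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def best_matches(old_files, new_files):
    """For each old file, return (score, name) of the most similar new file."""
    result = {}
    for name, text in old_files.items():
        scored = [(line_similarity(text, t), n) for n, t in new_files.items()]
        result[name] = max(scored) if scored else (0.0, None)
    return result
```

Because the comparison only looks at lines of text, the same code works unchanged on C, Java or Python sources, which is the property the abstract highlights.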
1138 |
Les effets de la similarité physique dans l’observation d’actions : études comportementales et neurophysiologiques
Désy, Marie-Christine 06 1900
Il a été suggéré que la similarité physique entre un observateur et une action observée facilite la perception et la compréhension d’action. Par exemple, l’observation d’un acteur exécutant des gestes de la main ayant une signification culturelle est associée à une augmentation de l’excitabilité corticospinale lorsque les deux individus sont de la même ethnicité (Molnar-Szakacs et al., 2007). La perception tactile serait également facilitée lorsqu’un individu regarde un modèle de sa propre race être touché (Serino et al., 2009), tandis que des études en imagerie cérébrale fonctionnelle suggèrent la présence d’activations plus importantes dans le cortex cingulaire lorsqu’un sujet observe une personne de son propre groupe racial ressentir de la douleur (Xu et al., 2009). Certaines études ont lié ces résultats à un mécanisme de résonance motrice, possiblement associé au système des neurones miroirs (SNM), suggérant que la représentation de l’action dans les aires motrices est facilitée par la similarité physique. Toutefois, la grande majorité des stimuli utilisés dans ces études comportent une composante émotionnelle ou culturelle pouvant masquer les effets purement moteurs liant la similarité physique à un mécanisme de résonance motrice. De plus, la sélectivité de l’activation du SNM face à des stimuli biologiques a récemment été remise en question en raison de biais méthodologiques.
La présente thèse présente trois études visant à évaluer l’effet de la similarité physique et des caractéristiques biologiques d’un mouvement sur la résonance motrice à l’aide de mesures comportementales et neurophysiologiques. À cet effet, l’imitation automatique de mouvements de la main, l’excitabilité corticospinale et la désynchronisation du rythme électroencéphalographique mu ont servi de marqueurs de l’activité du SNM. Dans les trois études présentées, la couleur de la peau et l’aspect biologique du stimulus observé ou imité ont été systématiquement manipulés.
Nos données confirment la sélectivité du SNM pour le mouvement biologique en démontrant une réponse imitative plus rapide et une désynchronisation du rythme mu plus prononcée lors de la présentation de stimuli biologiques comparativement à des stimuli non-biologiques répliquant les aspects physiques du mouvement humain. Les deux mêmes mesures montrent une réponse neurophysiologique et comportementale équivalente lorsque l’action est exécutée par un agent de couleur similaire ou dissimilaire au participant. Nous rapportons aussi un effet surprenant de la similarité physique sur l’excitabilité corticospinale, où l’observation d’une action exécutée par un agent de couleur différente est associée à une activation plus grande du cortex moteur primaire droit de participants de sexe féminin.
Prises dans leur ensemble, ces données suggèrent que la similarité physique avec une action observée ne module généralement pas l’activité du SNM au niveau des aires sensorimotrices en l’absence de composantes culturelles et émotionnelles. De plus, les résultats présentés suggèrent que le SNM est sélectif au mouvement biologique plutôt qu’à l’aspect kinématique du mouvement. / It has been suggested that physical similarity with an observed model facilitates action perception and understanding. For example, increased corticospinal excitability is found in participants observing actors of their own ethnicity performing culture-specific hand movements (Molnar-Szakacs et al., 2007). Tactile perception is also said to be increased when individuals watch a model of the same race being touched (Serino et al., 2009). Moreover, imaging data suggest that stronger activations are observed in the cingulate cortex when a subject observes a person of their own race feeling pain (Xu et al., 2009). Some studies have linked these findings with a motor resonance mechanism, possibly associated with the mirror neuron system (MNS), suggesting that action representation in motor areas is facilitated by physical similarity. However, most of the observed stimuli in those studies include emotional or cultural components, which may blur the link between physical similarity and motor resonance per se.
The present thesis is comprised of three studies that aimed at evaluating the effect of physical similarity on motor resonance using stimuli that are purely motor in nature. The effect of physical similarity on motor responses during action observation was assessed with behavioral and electrophysiological measures. To this end, imitation of simple finger movements, as well as corticospinal excitability and mu rhythm desynchronization during passive observation of simple finger movements was evaluated, using stimuli that were similar or dissimilar to the participant in terms of skin color. In line with previous results, observation of biological movement resulted in faster reaction times and greater mu desynchronization compared to non-biological movement. Physical similarity with the imitated or observed hand did not affect imitation speed or mu desynchronization. It did, however, have a surprising effect on corticospinal excitability, where the amplitude of transcranial magnetic stimulation-induced motor evoked potentials was greater in the right hemisphere of female participants observing hand movement executed by hands of a different color.
These data suggest that physical similarity with an observed action in terms of skin color does not modulate MNS activity in sensorimotor cortex when cultural and emotional components are absent. The present results also strengthen the notion that the motor cortex node of the MNS is tuned to the biological nature of an observed action.
1139 |
Performance analysis of management techniques for SONET/SDH telecommunications networks
Ng, Hwee Ping. 03 1900
Approved for public release, distribution is unlimited / The performance of network management tools for SONET/SDH networks subject to the load conditions is studied and discussed in this thesis. Specifically, a SONET network which consists of four CISCO ONS 15454s, managed by a CISCO Transport Manager, is set up in the Advanced Network Laboratory of the Naval Postgraduate School. To simulate a realistic data transfer environment for the analysis, Smartbits Avalanche software is deployed to simulate multiple client-server scenarios in the SONET network. Traffic from the management channel is then captured using a packet sniffer. Queuing analysis on the captured data is performed with particular emphasis on properties of self-similarity. In particular, the Hurst parameter which determines the captured traffic's degree of self-similarity is estimated using the Variance-Index plot technique. Link utilization is also derived from the computation of first-order statistics of the captured traffic distribution. The study shows that less management data was exchanged when the SONET network was fully loaded. In addition, it is recommended that CTM 4.6 be used to manage not more than 1552 NEs for safe operation. The results presented in this thesis will aid network planners to optimize the management of their SONET/SDH networks. / Civilian, Ministry of Defense, Singapore
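Assuming the "Variance-Index plot" mentioned here refers to the usual aggregated-variance (variance-time) method, the Hurst parameter can be estimated roughly as sketched below. For a self-similar series the variance of block means scales as m^(2H-2) with block size m, so the slope of log variance against log m gives H = 1 + slope/2. The block sizes and function names are illustrative choices, not values from the thesis.

```python
from math import log


def variance(xs):
    """Population variance of a list of numbers."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def hurst_variance_time(series, block_sizes=(1, 2, 4, 8, 16, 32, 64)):
    """Estimate the Hurst parameter from the slope of the variance-time plot."""
    points = []
    for m in block_sizes:
        # Average the series over non-overlapping blocks of length m.
        blocks = [sum(series[i:i + m]) / m
                  for i in range(0, len(series) - m + 1, m)]
        if len(blocks) < 2:
            continue
        v = variance(blocks)
        if v > 0:
            points.append((log(m), log(v)))
    if len(points) < 2:
        raise ValueError("series too short for the chosen block sizes")
    # Least-squares slope of log(variance) against log(block size).
    mx = sum(x for x, _ in points) / len(points)
    my = sum(y for _, y in points) / len(points)
    slope = (sum((x - mx) * (y - my) for x, y in points)
             / sum((x - mx) ** 2 for x, _ in points))
    return 1 + slope / 2
```

Uncorrelated traffic gives an estimate near 0.5, while values approaching 1 indicate strong self-similarity; in practice the input would be packet or byte counts per time slot taken from the captured management traffic.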
1140 |
Neue Indexingverfahren für die Ähnlichkeitssuche in metrischen Räumen über großen Datenmengen / New indexing techniques for similarity search in metric spaces
Guhlemann, Steffen 06 July 2016
Ein zunehmend wichtiges Thema in der Informatik ist der Umgang mit Ähnlichkeit in einer großen Anzahl unterschiedlicher Domänen. Derzeit existiert keine universell verwendbare Infrastruktur für die Ähnlichkeitssuche in allgemeinen metrischen Räumen. Ziel der Arbeit ist es, die Grundlage für eine derartige Infrastruktur zu legen, die in klassische Datenbankmanagementsysteme integriert werden könnte.
Im Rahmen einer Analyse des State of the Art wird der M-Baum als am besten geeignete Basisstruktur identifiziert. Dieser wird anschließend zum EM-Baum erweitert, wobei strukturelle Kompatibilität mit dem M-Baum erhalten wird. Die Abfragealgorithmen werden im Hinblick auf eine Minimierung notwendiger Distanzberechnungen optimiert. Aufbauend auf einer mathematischen Analyse der Beziehung zwischen Baumstruktur und Abfrageaufwand werden Freiheitsgrade in Baumänderungsalgorithmen genutzt, um Bäume so zu konstruieren, dass Ähnlichkeitsanfragen mit einer minimalen Anzahl an Anfrageoperationen beantwortet werden können. / A topic of growing importance in computer science is the handling of similarity in multiple heterogeneous domains. Currently there is no common infrastructure to support this for general metric spaces. The goal of this work is to lay the foundation for such an infrastructure, which could be integrated into classical database management systems.
After an analysis of the state of the art, the M-Tree is identified as the most suitable base structure and is extended in multiple ways to the EM-Tree, retaining structural compatibility. The query algorithms are optimized to reduce the number of necessary distance calculations. On the basis of a mathematical analysis of the relation between tree structure and query performance, degrees of freedom in the tree edit algorithms are used to build trees optimized for answering similarity queries with a minimal number of distance calculations.
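Metric indexes save distance calculations by exploiting the triangle inequality. The sketch below is neither an M-Tree nor the EM-Tree of this thesis; it is a minimal single-pivot index, with hypothetical names, that shows the underlying principle: distances to a pivot are stored at build time, and a range query skips every item whose stored distance already rules it out.

```python
class PivotIndex:
    """Single-pivot metric index for range queries (illustrative only)."""

    def __init__(self, items, dist, pivot=None):
        self.dist = dist
        self.pivot = items[0] if pivot is None else pivot
        # Precompute d(pivot, x) once for every stored item.
        self.entries = [(dist(self.pivot, x), x) for x in items]

    def range_query(self, query, radius):
        """Return (hits, number of distance computations performed)."""
        d_qp = self.dist(query, self.pivot)
        computed = 1
        hits = []
        for d_px, x in self.entries:
            # Triangle inequality: d(q, x) >= |d(q, p) - d(p, x)|.
            # If that lower bound exceeds the radius, x cannot be a hit.
            if abs(d_qp - d_px) > radius:
                continue
            computed += 1
            if self.dist(query, x) <= radius:
                hits.append(x)
        return hits, computed
```

The only requirement on `dist` is that it is a metric, which is why the same structure works for numbers, strings under edit distance, or feature vectors; tree-based indexes apply the same pruning rule hierarchically with several routing objects.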