Global ETD Search

311	Effiziente MapReduce-Parallelisierung von Entity Resolution-Workflows / Efficient MapReduce Parallelization of Entity Resolution Workflows Kolb, Lars 08 December 2014 (has links) In recent years, the newly emerged Infrastructure as a Service paradigm has massively changed the IT world. The provisioning of computing infrastructure by external service providers makes it possible to acquire large amounts of computing power, storage, and bandwidth on demand and at short notice, without upfront investment. At the same time, both the volume of freely available data and the volume of data that companies must manage are growing dramatically. The need to manage and analyze these data volumes efficiently required existing IT technologies to evolve and led to the emergence of new research areas and a multitude of innovative systems. A typical characteristic of these systems is distributed storage and data processing in large compute clusters built from commodity hardware. The MapReduce programming model in particular has gained increasing importance over the past ten years. It enables distributed processing of large data volumes and abstracts from the details of distributed computing and the handling of hardware failures. This dissertation focuses on using the MapReduce concept to automatically parallelize compute-intensive entity resolution tasks. Entity resolution is an important subfield of information integration whose goal is to discover records, from one or more data sources, that describe the same real-world object. The dissertation presents, step by step, methods that solve different subproblems of the MapReduce-based execution of entity resolution workflows. To detect duplicates, entity resolution methods typically compare pairs of records using several similarity measures. 
Evaluating the Cartesian product of n records leads to a quadratic complexity of O(n²) and is therefore practical only for small to medium-sized data sources. For data sources with more than 100,000 records, runtimes of several hours arise even with distributed execution. For this reason, so-called blocking techniques are used to reduce the search space. The underlying assumption is that records falling below a certain minimum similarity need not be compared with one another. The thesis presents a MapReduce-based implementation of the evaluation of the Cartesian product as well as of several well-known blocking methods. After the records have been compared, the compared candidate pairs are finally classified as match or non-match. As the number of attribute values and similarity measures in use grows, manually defining a high-quality strategy for combining the resulting similarity values becomes barely manageable. The thesis therefore investigates the integration of machine learning methods into MapReduce-based entity resolution workflows. Implementing blocking methods with MapReduce requires partitioning the set of pairs to be compared and assigning the partitions to available processes. The assignment is based on a semantic key that is derived from the attribute values of the records according to the concrete blocking strategy. When deduplicating product records, for example, one could compare only products of the same manufacturer. Having a single process handle all records with the same key leads to considerable load-balancing problems under data skew, which are aggravated by the inherent quadratic complexity. 
This drastically reduces the runtime efficiency and scalability of the corresponding MapReduce programs, because a large share of a cluster's resources sits idle while a few processes have to do most of the work. Providing various methods for evenly utilizing the available resources is a further focus of the thesis. Blocking strategies must always trade off efficiency against data quality. A large reduction of the search space promises a significant speedup, but it means that similar records are not compared with one another, e.g. because of erroneous attribute values. For this reason, it is helpful to generate several semantic keys for each record, derived from different attributes. This in turn causes similar records to be compared unnecessarily multiple times with respect to different keys. The thesis therefore presents algorithms for avoiding such redundant similarity computations. As the result of this work, the entity resolution framework Dedoop is presented, which abstracts from the developed MapReduce algorithms and enables a high-level specification of complex entity resolution workflows. Dedoop combines all techniques and optimizations presented in this thesis in a user-friendly system. The prototype automatically translates user-defined workflows into a set of MapReduce jobs and manages their parallel execution in MapReduce clusters. Because the cloud services Amazon EC2 and Amazon S3 are fully integrated into Dedoop, and because Dedoop has been made available, end users without MapReduce knowledge can run complex entity resolution workflows in private or dynamically created external MapReduce clusters. info:eu-repo/classification/ddc/500 ddc:500
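The blocking idea in record 311 (compare only records that share a semantic key, for instance products of the same manufacturer) can be illustrated with a small single-machine sketch. This is not Dedoop and not a MapReduce job; the names `block_records`, `candidate_pairs`, `pairs_per_block`, and the sample `products` are hypothetical and exist only for illustration. The sketch also shows why skewed keys cause load imbalance: a block of size k produces k(k-1)/2 comparisons.

```python
from collections import defaultdict
from itertools import combinations


def block_records(records, key_func):
    """Group records by a blocking key (key_func is a hypothetical key extractor)."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_func(rec)].append(rec)
    return blocks


def candidate_pairs(blocks):
    """Yield only those record pairs that share a blocking key."""
    for members in blocks.values():
        yield from combinations(members, 2)


def pairs_per_block(blocks):
    """Comparisons per block: k * (k - 1) / 2 for a block with k records."""
    return {key: len(m) * (len(m) - 1) // 2 for key, m in blocks.items()}


# Made-up sample data: one large block and one small block (data skew).
products = [
    {"id": 1, "maker": "acme", "name": "Widget A"},
    {"id": 2, "maker": "acme", "name": "Widget-A"},
    {"id": 3, "maker": "acme", "name": "Gadget"},
    {"id": 4, "maker": "acme", "name": "Gizmo"},
    {"id": 5, "maker": "bolt", "name": "Spanner"},
    {"id": 6, "maker": "bolt", "name": "Spaner"},
]

blocks = block_records(products, key_func=lambda r: r["maker"])
pairs = list(candidate_pairs(blocks))
load = pairs_per_block(blocks)
all_pairs = len(products) * (len(products) - 1) // 2  # full Cartesian product
```

With this sample data, blocking cuts the comparisons from 15 to 7, but 6 of those 7 fall into a single block. If each key were handled by one process, that process would do most of the work, which is the load-balancing problem the abstract describes.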
312	Komparativer Ähnlichkeitsalgorithmus: Algorithmus zur komparativen Bewertung der Ähnlichkeiten von Objekten anhand von kollaborativen Priorisierungen / Comparative similarity algorithm: an algorithm for the comparative assessment of object similarities based on collaborative prioritizations Schwartz, Eva-Maria 15 January 2010 (has links) The need to use software that has not been individually developed arises in business and working environments as a result of developments in this field. Companies must respond to constantly changing requirements in their business environment. With ever-increasing competition, they need to concentrate on their own core competencies and enter into temporary cooperations or relationships with other organizations. To do justice to these relationships and requirements, software and software components must be obtainable flexibly and temporarily. To give users of such software the best possible support in selecting components that fit their needs, they are to be offered suggestions for objects based on the decisions of existing customers. Depending on the system, these objects can be, for example, configuration properties, content modules, or layout presentations. It is assumed that similar users also need similar objects. For this reason, users are to be compared with one another. The problem here lies in describing a user. A user can be characterized by a multitude of features, which carry different importance for the decision depending on the object. For this reason, the individual features must be considered independently of one another. When an object is evaluated, corresponding weights for each feature are then to be integrated. The comparison becomes possible only because the context, and thus the user's task, is known. Only with this information can targeted recommendations be produced. A method is presented that incorporates the prioritized evaluation of individual features. 
Building on this method, an algorithm is presented that compares users on the basis of their features and, from this, outputs recommendations for objects. The algorithm is intended to be integrated into a recommender system. info:eu-repo/classification/ddc/004 ddc:004
313	Framställning av en GIS-metod samt analys av ingående parametrar för att lokalisera representativa delområden av ett avrinningsområde för snödjupsmätningar / Development of a GIS method and analysis of input parameters to locate representative sub-areas of a catchment area for snow depth measurements Kaplin, Jennifer, Leierdahl, Lisa January 2022 (has links) Vattenkraft är en stor källa till energi i Sverige, främst i de norra delarna av landet. För att få ut maximal potential från vattenkraftverken behövs information om hur mycket vatten eller snö det finns uppströms från kraftverken. Genom att få fram tillförlitliga värden av snömängd är det möjligt att minska osäkerheten i uppskattningarna. Eftersom det är svårt att kartera större avrinningsområden via markbundna observationer, både praktiskt och ekonomiskt, har drönarobservationer utvecklats. För att använda sig av drönare krävs det vetskap om var de ska flygas i för område för att hela avrinningsområdet ska representeras. I projektet tas en modell fram i ArcGIS för att hitta mindre områden inom avrinningsområden som ska vara representativa inom utvalda parametrar. I projektet berörs parametrarna vegetation, höjd, lutningsgrad samt dess riktning. Arbetet för att ta fram en modell som ska underlätta framtida arbete inom och utanför forskningsprojektet DRONES är uppdelat i två delar. Den första delen är att ta fram och granska vilka parametrar som påverkar snödjupet i avrinningsområdet. Den andra delen innefattar arbetet med att skapa en modell i ArcGIS som ska analysera ett avrinningsområde med framtagna parametrar för att hitta mindre områden som representerar det hela. Resultatet från de framtagna modellerna kan tillämpas för att underlätta kartläggningen och snödjupsmätningar i avrinningsområden, vilket kan utnyttjas vid effektivisering av vattenreglering. / Hydropower is a major source of energy in Sweden, mainly in the northern parts of the country. 
To get the maximum potential from the hydropower plants, information is required on how much water or snow there is upstream from the power plants. By obtaining reliable values of the amount of snow, it is possible to reduce the uncertainty in forecasts on spring flood. Due to difficulties in mapping larger catchment areas via ground-level observations, drone observations have been developed. In order to use drone observations, knowledge of where they are to be flown to represent the entire catchment area is required. In this project, a model was developed in ArcGIS to find smaller areas within catchments that are to be representative within selected parameters. The project touches upon the parameters vegetation, height, slope and aspect. The work to develop a model that will facilitate future work within and outside the DRONES research project is divided into two parts. The first part is to analyze which parameters affect the snow depth in the catchment area. The second part consists of creating a model in ArcGIS that will find a smaller area inside a catchment that represents the snow depth for the whole catchment. The results from the developed model can be applied to facilitate the mapping and snow depth measurements in catchment areas, which can be used to streamline water regulation. Snow accumulation snow depth water regulation snow hydrology DRONES Överuman ModelBuilder Similarity Search Snöackumulation snödjup vattenreglering snöhydrologi DRONES Överuman ModelBuilder Similarity Search Physical Geography Naturgeografi
314	Efficient Graph Summarization of Large Networks Hajiabadi, Mahdi 24 June 2022 (has links) In this thesis, we study the notion of graph summarization, which is a fundamental task of finding a compact representation of the original graph called the summary. Graph summarization can be used for reducing the footprint of the input graph, better visualization, anonymizing the identity of users, and query answering. There are two different frameworks of graph summarization we consider in this thesis, the utility-based framework and the correction set-based framework. In the utility-based framework, the input graph is summarized as long as a utility threshold is not violated. In the correction set-based framework, a set of correction edges is produced along with the summary graph. In this thesis we propose two algorithms for the utility-based framework and one for the correction set-based framework. All three algorithms are for static graphs (i.e. graphs that do not change over time). Then, we propose two more utility-based algorithms for fully dynamic graphs (i.e. graphs with edge insertions and deletions). Algorithms for graph summarization can be lossless (summarizing the input graph without losing any information) or lossy (losing some information about the input graph in order to summarize it more). Some of our algorithms are lossless and some lossy, but with controlled utility loss. Our first utility-driven graph summarization algorithm, G-SCIS, is based on a clique and independent set decomposition, which produces optimal compression with zero loss of utility. The compression provided is significantly better than the state of the art in lossless graph summarization, while the runtime is two orders of magnitude lower. Our second algorithm is T-BUDS, a highly scalable, utility-driven algorithm for fully controlled lossy summarization. It achieves high scalability by combining memory reduction using Maximum Spanning Tree with a novel binary search procedure. 
T-BUDS drastically outperforms the state of the art in terms of the quality of summarization and is about two orders of magnitude better in terms of speed. In contrast to the competition, we are able to handle web-scale graphs on a single machine without performance impediment as the utility threshold (and size of summary) decreases. Also, we show that our graph summaries can be used as-is to answer several important classes of queries, such as triangle enumeration, PageRank and shortest paths. We then propose LDME, a correction set-based graph summarization algorithm that produces compact output representations in a fast and scalable manner. To achieve this, we introduce (1) weighted locality sensitive hashing to drastically reduce the number of comparisons required to find good node merges, (2) an efficient way to compute the best quality merges that produces more compact outputs, and (3) a new sort-based encoding algorithm that is faster and more robust. More interestingly, our algorithm provides performance tuning settings to allow the option of trading compression for running time. On high compression settings, LDME achieves compression equal to or better than the state of the art with up to 53x speedup in running time. On high speed settings, LDME achieves up to two orders of magnitude speedup with only slightly lower compression. We also present two lossless summarization algorithms, Optimal and Scalable, for summarizing fully dynamic graphs. More concretely, we follow the framework of G-SCIS, which produces summaries that can be used as-is in several graph analytics tasks. Different from G-SCIS, which is a batch algorithm, Optimal and Scalable are fully dynamic and can respond rapidly to each change in the graph. Not only are Optimal and Scalable able to outperform G-SCIS and other batch algorithms by several orders of magnitude, but they also significantly outperform MoSSo, the state of the art in lossless dynamic graph summarization. 
While Optimal always produces the optimal summary, Scalable is able to trade the amount of node reduction for extra scalability. For reasonable values of the parameter $K$, Scalable is able to outperform Optimal by an order of magnitude in speed, while keeping the rate of node reduction close to that of Optimal. An interesting fact that we observed experimentally is that even if we were to run a batch algorithm, such as G-SCIS, once for every big batch of changes, it would still be much slower than Scalable. For instance, if 1 million changes occur in a graph, Scalable is two orders of magnitude faster than running G-SCIS just once at the end of the 1 million-edge sequence. / Graduate Graph Summarization Query Answering Lossless summary Lossy summary Locality Sensitive Hashing Jaccard Similarity Weighted Jaccard Similarity Hashing Incremental Algorithms Randomized Algorithms
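The keywords of record 314 list Jaccard and weighted Jaccard similarity, the measures that locality sensitive hashing schemes typically approximate. As a rough illustration only, the sketch below computes both exactly for small inputs. It assumes, without confirmation from the abstract, that candidate node merges are judged by the overlap of their neighbourhoods; the function names and sample values are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


def weighted_jaccard(x, y):
    """Weighted Jaccard: sum of element-wise minima over sum of element-wise maxima.

    x and y map items to non-negative weights; missing items have weight 0.
    """
    keys = set(x) | set(y)
    num = sum(min(x.get(k, 0.0), y.get(k, 0.0)) for k in keys)
    den = sum(max(x.get(k, 0.0), y.get(k, 0.0)) for k in keys)
    return num / den if den else 1.0


# Hypothetical neighbour sets of two nodes u and v.
neighbours_u = {"a", "b", "c"}
neighbours_v = {"b", "c", "d"}
sim_unweighted = jaccard(neighbours_u, neighbours_v)

# Hypothetical weighted neighbourhoods (e.g. edge multiplicities).
weights_u = {"a": 2.0, "b": 1.0}
weights_v = {"a": 1.0, "b": 1.0, "c": 1.0}
sim_weighted = weighted_jaccard(weights_u, weights_v)
```

Computing these values for every node pair is quadratic in the number of nodes, which is why hashing is used in practice to shortlist likely merge partners before any exact comparison.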
315	Similarity measures for scientific workflows Starlinger, Johannes 08 January 2016 (has links) Im Laufe der letzten zehn Jahre haben Scientific Workflows als Werkzeug zur Erstellung von reproduzierbaren, datenverarbeitenden in-silico Experimenten an Aufmerksamkeit gewonnen, in die sowohl lokale Skripte und Anwendungen, als auch Web-Services eingebunden werden können. Über spezialisierte Online-Bibliotheken, sogenannte Repositories, können solche Workflows veröffentlicht und wiederverwendet werden. Mit zunehmender Größe dieser Repositories werden Ähnlichkeitsmaße für Scientific Workflows notwendig, etwa für Duplikaterkennung, Ähnlichkeitssuche oder Clustering von funktional ähnlichen Workflows. Die vorliegende Arbeit untersucht solche Ähnlichkeitsmaße für Scientific Workflows. Als erstes untersuchen wir ähnlichkeitsrelevante Eigenschaften von Scientific Workflows und identifizieren Charakteristika der Wiederverwendung ihrer Komponenten. Als zweites analysieren und reimplementieren wir existierende Lösungen für den Vergleich von Scientific Workflows entlang definierter Teilschritte des Vergleichsprozesses. Wir erstellen einen großen Gold-Standard Corpus von Workflowähnlichkeiten, der über 2400 Bewertungen für 485 Workflowpaare enthält, die von 15 Experten aus 6 Institutionen beigetragen wurden. Zum ersten Mal erlauben diese Vorarbeiten eine umfassende, vergleichende Evaluation verschiedener Ähnlichkeitsmaße für Scientific Workflows, in der wir einige vorige Ergebnisse bestätigen, andere aber revidieren. Als drittes stellen wir eine neue Methode für das Vergleichen von Scientific Workflows vor. Unsere Evaluation zeigt, dass diese neue Methode bessere und konsistentere Ergebnisse liefert und leicht mit anderen Ansätzen kombiniert werden kann, um eine weitere Qualitätssteigerung zu erreichen. 
Als viertes zeigen wir, wie die Resultate aus den vorangegangenen Schritten genutzt werden können, um aus Standardkomponenten eine Suchmaschine für schnelle, qualitativ hochwertige Ähnlichkeitssuche im Repositorymaßstab zu implementieren. / Over the last decade, scientific workflows have gained attention as a valuable tool to create reproducible in-silico experiments. Specialized online repositories have emerged which allow such workflows to be shared and reused by the scientific community. With increasing size of these repositories, methods to compare scientific workflows regarding their functional similarity become a necessity. To allow duplicate detection, similarity search, or clustering, similarity measures for scientific workflows are an essential prerequisite. This thesis investigates similarity measures for scientific workflows. We carry out four consecutive research tasks: First, we closely investigate the relevant properties of scientific workflows regarding their similarity and identify characteristics of re-use of their components. Second, we review and dissect existing approaches to scientific workflow comparison into a defined set of subtasks necessary in the process of workflow comparison, and re-implement previous approaches to each subtask. We create a large gold-standard corpus of expert-ratings on workflow similarity, with more than 2400 ratings provided for 485 pairs of workflows by 15 workflow experts from 6 institutions. For the first time, this allows comprehensive, comparative evaluation of different scientific workflow similarity measures, confirming some previous findings, but rejecting others. Third, we propose and evaluate a novel method for scientific workflow comparison. We show that this novel method provides results of both higher quality and higher consistency than previous approaches, and can easily be stacked and ensembled with other approaches for still better performance and higher speed. 
Fourth, we show how our findings can be leveraged to implement a search engine using off-the-shelf tools that performs fast, high quality similarity search for scientific workflows at repository-scale, a premier area of application for similarity measures for scientific workflows. Wissensmanagement Information Retrieval Ähnlichkeitssuche Ähnlichkeitsmaße Scientific Workflows Information Retrieval Knowledge Management Similarity Measures Scientific Workflows Similarity Search 004 Informatik 28 Informatik, Datenverarbeitung ST 530 ddc:004
316	Clustering of Web Services Based on Semantic Similarity Konduri, Aparna 12 May 2008 (has links) No description available. Computer Science SIMILARITY OF WEB SERVICES Stemming Word sense disambiguation WORDNET BASED SEMANTIC SIMILARITY CLUSTERING OF WEB SERVICES Prediction of similar web services LERS-M algorithm
317	Deriving pilots’ knowledge structures for weather information: an evaluation of elicitation techniques Raddatz, Kimberly R. January 1900 (has links) Doctor of Philosophy / Department of Psychology / Richard J. Harris / Systems that support or require human interaction are generally easier to learn, use, and remember when their organization is consistent with the user’s knowledge and experiences (Norman, 1983; Roske-Hofstrand & Paap, 1986). Thus, in order for interface designers to truly design for the user, they must first have a way of deriving a representation of what the user knows about the domain of interest. The current study evaluated three techniques for eliciting knowledge structures for how General Aviation pilots think about weather information. Weather was chosen because of its varying implications for pilots of different levels of experience. Two elicitation techniques (Relationship Judgment and Card Sort) asked pilots to explicitly consider the relationship between 15 weather-related information concepts. The third technique, Prime Recognition Task, used response times and priming to implicitly reflect the strength of relationship between concepts in semantic memory. Techniques were evaluated in terms of pilot performance, conceptual structure validity, and required resources for employment. Validity was assessed in terms of the extent to which each technique identified differences in organization of weather information among pilots of different experience levels. Multidimensional scaling was used to transform proximity data collected by each technique into conceptual structures representing the relationship between concepts. Results indicated that Card Sort was the technique that most consistently tapped into knowledge structure affected by experience. 
Only conceptual structures based on Card Sort data could be used both to discriminate between pilots of different experience levels and to accurately classify experienced pilots as “experienced”. Additionally, Card Sort was the most efficient and effective technique to employ in terms of preparation time, time on task, flexibility, and face validity. The Card Sort provided opportunities for deliberation, revision, and visual feedback that allowed the pilots to engage in a deeper level of processing at which experience may play a stronger role. Relationship Judgment and Prime Recognition Task characteristics (e.g., time pressure, independent judgments) may have motivated pilots to rely on a more shallow or text-based level of processing (i.e., general semantic meaning) that is less affected by experience. Implications for menu structure design and assessment are discussed. Knowledge Elicitation Techniques Card Sort Similarity Ratings General Aviation Weather Knowledge Structures Psychology (0621)
318	Evaluation and development of conceptual document similarity metrics with content-based recommender applications Gouws, Stephan 12 1900 (has links) Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2010. / ENGLISH ABSTRACT: The World Wide Web brought with it an unprecedented level of information overload. Computers are very effective at processing and clustering numerical and binary data; however, the conceptual clustering of natural-language data is considerably harder to automate. Most past techniques rely on simple keyword-matching techniques or probabilistic methods to measure semantic relatedness. However, these approaches do not always accurately capture conceptual relatedness as measured by humans. In this thesis we propose and evaluate the use of novel Spreading Activation (SA) techniques for computing semantic relatedness, by modelling the article hyperlink structure of Wikipedia as an associative network structure for knowledge representation. The SA technique is adapted and several problems are addressed for it to function over the Wikipedia hyperlink structure. Inter-concept and inter-document similarity metrics are developed which make use of SA to compute the conceptual similarity between two concepts and between two natural-language documents. We evaluate these approaches over two document similarity datasets and achieve results which compare favourably with the state of the art. Furthermore, document preprocessing techniques are evaluated in terms of the performance gain these techniques can have on the well-known cosine document similarity metric and the Normalised Compression Distance (NCD) metric. Results indicate that a near two-fold increase in accuracy can be achieved for NCD by applying simple preprocessing techniques. Nonetheless, the cosine similarity metric still significantly outperforms NCD. 
Finally, we show that using our Wikipedia-based method to augment the cosine vector space model provides superior results to either in isolation. Combining the two methods leads to an increased correlation of Pearson ρ = 0.72 over the Lee (2005) document similarity dataset, which matches the reported result for the state-of-the-art Explicit Semantic Analysis (ESA) technique, while requiring less than 10% of the Wikipedia database as required by ESA. As a use case for document similarity techniques, a purely content-based news-article recommender system is designed and implemented for a large online media company. This system is used to gather additional human-generated relevance ratings which we use to evaluate the performance of three state-of-the-art document similarity metrics for providing content-based document recommendations. / AFRIKAANSE OPSOMMING: Die Wêreldwye-Web het ’n vlak van inligting-oorbelading tot gevolg gehad soos nog nooit tevore. Rekenaars is baie effektief met die verwerking en groepering van numeriese en binêre data, maar die konsepsuele groepering van natuurlike-taal data is aansienlik moeiliker om te outomatiseer. Tradisioneel berus sulke algoritmes op eenvoudige sleutelwoordherkenningstegnieke of waarskynlikheidsmetodes om semantiese verwantskappe te bereken, maar hierdie benaderings modelleer nie konsepsuele verwantskappe, soos gemeet deur die mens, baie akkuraat nie. In hierdie tesis stel ons die gebruik van ’n nuwe aktiverings-verspreidingstrategie (AV) voor waarmee inter-konsep verwantskappe bereken kan word, deur die artikel skakelstruktuur van Wikipedia te modelleer as ’n assosiatiewe netwerk. Die AV tegniek word aangepas om te funksioneer oor die Wikipedia skakelstruktuur, en verskeie probleme wat hiermee gepaard gaan word aangespreek. Inter-konsep en inter-dokument verwantskapsmaatstawwe word ontwikkel wat gebruik maak van AV om die konsepsuele verwantskap tussen twee konsepte en twee natuurlike-taal dokumente te bereken. 
Ons evalueer hierdie benadering oor twee dokument-verwantskap datastelle en die resultate vergelyk goed met die van ander toonaangewende metodes. Verder word teks-voorverwerkingstegnieke ondersoek in terme van die moontlike verbetering wat dit tot gevolg kan hê op die werksverrigting van die bekende kosinus vektorruimtemaatstaf en die genormaliseerde kompressie-afstandmaatstaf (GKA). Resultate dui daarop dat GKA se akkuraatheid byna verdubbel kan word deur gebruik te maak van eenvoudige voorverwerkingstegnieke, maar dat die kosinus vektorruimtemaatstaf steeds aansienlike beter resultate lewer. Laastens wys ons dat die Wikipedia-gebasseerde metode gebruik kan word om die vektorruimtemaatstaf aan te vul tot ’n gekombineerde maatstaf wat beter resultate lewer as enige van die twee metodes afsonderlik. Deur die twee metodes te kombineer lei tot ’n verhoogde korrelasie van Pearson ρ = 0.72 oor die Lee dokument-verwantskap datastel. Dit is gelyk aan die gerapporteerde resultaat vir Explicit Semantic Analysis (ESA), die huidige beste Wikipedia-gebasseerde tegniek. Ons benadering benodig egter minder as 10% van die Wikipedia databasis wat benodig word vir ESA. As ’n toetstoepassing vir dokument-verwantskaptegnieke ontwerp en implementeer ons ’n stelsel vir ’n aanlyn media-maatskappy wat nuusartikels aanbeveel vir gebruikers, slegs op grond van die artikels se inhoud. Joernaliste wat die stelsel gebruik ken ’n punt toe aan elke aanbeveling en ons gebruik hierdie data om die akkuraatheid van drie toonaangewende maatstawwe vir dokument-verwantskap te evalueer in die konteks van inhoud-gebasseerde nuus-artikel aanbevelings. Document similarity Wikipedia Spreading activation Information retrieval Dissertations -- Electronic engineering Theses -- Electronic engineering
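Record 318 compares two baseline document similarity metrics, cosine similarity and the Normalised Compression Distance (NCD). The sketch below shows one common way to compute each. It assumes whitespace tokenisation with raw term counts for cosine similarity and zlib as the compressor for NCD; the thesis may well have used different preprocessing and a different compressor, and all names and sample texts are illustrative.

```python
import math
import zlib
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between raw term-count vectors (whitespace tokens)."""
    va, vb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ncd(doc_a, doc_b):
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), with C = zlib size."""
    a, b = doc_a.encode("utf-8"), doc_b.encode("utf-8")
    ca, cb = len(zlib.compress(a)), len(zlib.compress(b))
    cab = len(zlib.compress(a + b))
    return (cab - min(ca, cb)) / max(ca, cb)


# Made-up texts, repeated so that the compressor has something to work with.
text_x = "the quick brown fox jumps over the lazy dog " * 20
text_y = "pack my box with five dozen liquor jugs " * 20

cos_same = cosine_similarity(text_x, text_x)
cos_diff = cosine_similarity("cat", "dog")
ncd_same = ncd(text_x, text_x)
ncd_diff = ncd(text_x, text_y)
```

Cosine similarity is higher for more similar documents, while NCD is a distance, so lower values mean more similar. Both are sensitive to preprocessing such as case folding and punctuation removal, which is consistent with the preprocessing gains for NCD that the abstract reports.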
319	Computational approaches for time series analysis and prediction : data-driven methods for pseudo-periodical sequences Lan, Yang January 2009 (has links) Time series data mining is one branch of data mining. Time series analysis and prediction have always played an important role in human activities and natural sciences. A pseudo-periodical time series has a complex structure, with the fluctuations and frequencies of the time series changing over time. This pseudo-periodicity brings new properties and challenges to time series analysis and prediction. This thesis proposes two original computational approaches for time series analysis and prediction: Moving Average of nth-order Difference (MANoD) and Series Features Extraction (SFE). Based on data-driven methods, the two approaches offer new insights into time series analysis and prediction and contribute new feature detection techniques. The proposed algorithms can reveal hidden patterns based on the characteristics of time series, and they can be applied for predicting forthcoming events. This thesis also presents the evaluation results of the proposed algorithms on various pseudo-periodical time series, and compares the prediction results with classical time series prediction methods. The results of the original approaches applied to real-world and synthetic time series are very good and show that the contributions open promising research directions. 005.3
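The abstract of record 319 names the Moving Average of nth-order Difference (MANoD) but does not define it. The sketch below is only a literal reading of that name: difference the series n times, then smooth the result with a simple moving average. It should not be taken as the thesis's actual algorithm, and the function names, the `window` parameter, and the sample series are all assumptions.

```python
def nth_order_difference(series, n):
    """Apply first-order differencing n times (each pass shortens the series by 1)."""
    diff = list(series)
    for _ in range(n):
        diff = [b - a for a, b in zip(diff, diff[1:])]
    return diff


def moving_average(values, window):
    """Simple (unweighted) moving average over a sliding window."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def moving_average_of_difference(series, n, window):
    """Literal reading of the name: moving average of the nth-order difference."""
    return moving_average(nth_order_difference(series, n), window)


# Made-up sample: squares 1, 4, 9, 16, 25 have a constant second difference.
squares = [1, 4, 9, 16, 25]
first = moving_average_of_difference(squares, n=1, window=2)
second = moving_average_of_difference(squares, n=2, window=2)
```

Differencing removes slowly varying trends and averaging damps noise, so a combination of the two is a plausible way to expose repeating structure in a pseudo-periodical series. Whether MANoD works exactly this way is not stated in the abstract.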
320	Large scale optimization methods for metric and kernel learning Jain, Prateek 06 November 2014 (has links) A large number of machine learning algorithms are critically dependent on the underlying distance/metric/similarity function. Learning an appropriate distance function is therefore crucial to the success of many methods. The class of distance functions that can be learned accurately is characterized by the amount and type of supervision available to the particular application. In this thesis, we explore a variety of such distance learning problems using different amounts/types of supervision and provide efficient and scalable algorithms to learn appropriate distance functions for each of these problems. First, we propose a generic regularized framework for Mahalanobis metric learning and prove that for a wide variety of regularization functions, metric learning can be used for efficiently learning a kernel function incorporating the available side-information. Furthermore, we provide a method for fast nearest neighbor search using the learned distance/kernel function. We show that a variety of existing metric learning methods are special cases of our general framework. Hence, our framework also provides a kernelization scheme and fast similarity search scheme for such methods. Second, we consider a variation of our standard metric learning framework where the side-information is incremental, streaming and cannot be stored. For this problem, we provide an efficient online metric learning algorithm that compares favorably to existing methods both theoretically and empirically. Next, we consider a contrasting scenario where the amount of supervision being provided is extremely small compared to the number of training points. For this problem, we consider two different modeling assumptions: 1) data lies on a low-dimensional linear subspace, 2) data lies on a low-dimensional non-linear manifold. 
The first assumption, in particular, leads to the problem of matrix rank minimization over polyhedral sets, which is a problem of immense interest in numerous fields including optimization, machine learning, computer vision, and control theory. We propose a novel online learning based optimization method for the rank minimization problem and provide provable approximation guarantees for it. The second assumption leads to our geometry-aware metric/kernel learning formulation, where we jointly model the metric/kernel over the data along with the underlying manifold. We provide an efficient alternating minimization algorithm for this problem and demonstrate its wide applicability and effectiveness by applying it to various machine learning tasks such as semi-supervised classification, colored dimensionality reduction, manifold alignment etc. Finally, we consider the task of learning distance functions under no supervision, which we cast as a problem of learning disparate clusterings of the data. To this end, we propose a discriminative approach and a generative model based approach and we provide efficient algorithms with convergence guarantees for both the approaches. / text Rank minimization Metric learning Kernel learning Fast similarity search Locality sensitive hashing
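Record 320 centres on Mahalanobis metric learning, where the learned object is a positive semidefinite matrix A that defines the squared distance d_A(x, y) = (x - y)^T A (x - y). The sketch below only evaluates this distance for a matrix chosen by hand; it does not learn A and is not the thesis's algorithm. The function name and the sample values are illustrative.

```python
def mahalanobis_sq(x, y, A):
    """Squared Mahalanobis distance (x - y)^T A (x - y) for a square matrix A.

    A is given as a list of rows and is assumed to be positive semidefinite.
    """
    d = [xi - yi for xi, yi in zip(x, y)]
    Ad = [sum(a_ij * d_j for a_ij, d_j in zip(row, d)) for row in A]
    return sum(d_i * ad_i for d_i, ad_i in zip(d, Ad))


x, y = [1.0, 2.0], [4.0, 6.0]

# With the identity matrix, the distance reduces to squared Euclidean distance.
identity = [[1.0, 0.0], [0.0, 1.0]]
dist_euclid = mahalanobis_sq(x, y, identity)

# A hand-picked diagonal A that stretches the first axis and shrinks the second.
scaled = [[2.0, 0.0], [0.0, 0.5]]
dist_scaled = mahalanobis_sq(x, y, scaled)
```

Metric learning then amounts to choosing A so that pairs labelled similar end up close and pairs labelled dissimilar end up far apart under this distance.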
