1

Offline Approximate String Matching for Information Retrieval: An experiment on technical documentation

Dubois, Simon January 2013 (has links)
Approximate string matching consists in identifying strings as similar even if there is a number of mismatches between them. This technique is one of the solutions for relaxing the strictness of exact matching in data comparison. In many cases it is useful for identifying stream variation (e.g. audio) or word declension (e.g. prefix, suffix, plural).

Approximate string matching can be used to score terms in Information Retrieval (IR) systems. The benefit is that results are returned even if query terms do not exactly match indexed terms. However, as approximate string matching algorithms only consider characters (neither context nor meaning), there is no guarantee that additional matches are relevant matches.

This paper presents the effects of several approximate string matching algorithms on search results in IR systems. An experimental research design was conducted to evaluate such effects from two perspectives. First, result relevance is analysed with precision and recall. Second, performance is measured by the execution time required to compute matches.

Six approximate string matching algorithms are studied. Levenshtein and Damerau-Levenshtein compute the edit distance between two terms. Soundex and Metaphone index terms based on their pronunciation. Jaccard similarity calculates the overlap coefficient between two strings.

Tests are performed through IR scenarios with different contexts, information needs and search queries, designed to query technical documentation related to software development (man pages from Ubuntu). A purposive sample is selected to assess document relevance to the IR scenarios and to compute IR metrics (precision, recall, F-measure).

Experiments reveal that all tested approximate matching methods increase recall on average but, except for Metaphone, they also decrease precision. Soundex and Jaccard similarity are not advised because they fail on too many IR scenarios. The highest recall is obtained by the edit distance algorithms, which are also the most time-consuming. Because Damerau-Levenshtein shows no significant improvement over Levenshtein while costing much more time, the latter is recommended for use with specialised documentation. Finally, some other related recommendations are given to practitioners implementing IR systems on technical documentation.
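As a rough illustration of two of the matching techniques named in this abstract and of the evaluation metrics used, here is a minimal Python sketch. These are textbook formulations, not the implementations used in the thesis; the function names and the choice of character bigrams for Jaccard are illustrative assumptions.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and
    substitutions needed to turn string a into string b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def jaccard_bigrams(a, b):
    """Jaccard similarity over character bigrams: |A & B| / |A | B|."""
    A = {a[i:i + 2] for i in range(len(a) - 1)}
    B = {b[i:i + 2] for i in range(len(b) - 1)}
    return len(A & B) / len(A | B) if A | B else 1.0


def precision_recall_f1(retrieved, relevant):
    """Standard IR metrics over sets of document identifiers."""
    tp = len(retrieved & relevant)
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For example, levenshtein("kitten", "sitting") is 3, which shows why edit distance tolerates declension-style mismatches at the cost of extra computation per term pair.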
2

Vermeidung von Interferenzen bei der Konsolidierung von Diensten auf zeitlich geteilten Ressourcen (Avoiding Interference When Consolidating Services on Time-Shared Resources)

Hähnel, Markus 09 July 2019 (has links)
An increasing portion of IP traffic is processed and stored in data centers. However, data center providers tend to over-provision their resources in order to serve demand peaks, which leaves resources underutilized and unnecessarily wastes energy. Consolidating the active services at times of low load allows them to be executed on a subset of the physical servers, so that the idle machines can be turned off; the remaining machines are also better utilized and hence more energy-efficient. Nevertheless, after consolidation, services must share the physical resources with other services, and interactions on those shared resources, so-called interferences, degrade service performance.

This thesis focuses on interferences that arise from the fluctuating resource consumption of services over time. These are treated in the framework of the Cutting Stock Problem with non-deterministic item lengths (ND-CSP). For the example of the CPU time of individual processor cores, minimizing the number of required active resources reduces power consumption by up to 64.1%. Thanks to the awareness of workload fluctuations, the performance of services improves by up to 59.6% compared to other consolidation schemes.

Additionally, the concept of the 'overlap coefficient' is introduced. It describes the probabilistic relation between two services running in parallel: the more often the services are active at the same time, the higher the coefficient, and vice versa. Services that are not active at the same time can be consolidated without any expected interference effects, whereas services with common activity periods should not be consolidated. The analysis of one of Google's data centers shows that both scenarios account for a significant share of services. The ND-CSP is extended by the overlap coefficient and solved approximately. In contrast to the former ND-CSP, neither an improvement nor a deterioration of service performance is observed at the same energy consumption. In the future, given an exact solution and further optimization, services might be allocated such that their interferences are reduced or, ideally, largely eliminated.
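The overlap-coefficient idea can be sketched in Python. The thesis's exact definition is not reproduced in this record, so the formalization below — the fraction of time slots in which both services are active, normalized by the activity of the less active service — is an assumption, as is the simple first-fit placement heuristic (the thesis solves an extended ND-CSP, not first-fit).

```python
def overlap_coefficient(a, b):
    """Hypothetical formalization: P(both active) / min(P(a active), P(b active)),
    estimated from boolean activity traces sampled on a common time grid."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    denom = min(sum(a), sum(b))
    return both / denom if denom else 0.0


def consolidate(traces, capacity, max_overlap=0.5):
    """Illustrative first-fit heuristic (NOT the thesis's ND-CSP solver):
    place each service on the first server where the combined peak load
    fits the capacity and no co-located pair overlaps too strongly."""
    servers = []  # each server is a dict: service name -> activity trace
    for name, trace in traces.items():
        for srv in servers:
            peak = max(sum(vals) for vals in zip(trace, *srv.values()))
            if peak <= capacity and all(
                overlap_coefficient(trace, other) <= max_overlap
                for other in srv.values()
            ):
                srv[name] = trace
                break
        else:
            servers.append({name: trace})
    return servers
```

With this sketch, two services that are never active simultaneously (overlap 0) fit on one unit-capacity server, while two always-simultaneous services are split across two — the two patterns the abstract distinguishes.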
