Global ETD Search

11	Natural Language Programming for Controlled Object-Oriented English Zhan, Yue 11 July 2022 Natural language (NL) is a common medium humans use to express ideas and communicate with others, while programming languages (PL) are the "language" humans use to communicate with machines. Because NL and PL were designed for different purposes, they differ considerably in structure and capabilities. Programming in a PL can take novices months to learn, whereas users are already familiar with NL. Natural language programming (NLPr) therefore holds excellent potential: it gives non-experts the ability to "program" in the language they already know and offers a Low-Code/No-Code development experience. However, many challenges in developing NLPr systems are yet to be addressed, namely how to disambiguate NL semantics, how to validate inputs and provide helpful feedback, and how to effectively generate executable programs from semantic meanings. This dissertation addresses these issues by proposing a Controlled Object-Oriented Language (COOL) model to disambiguate and analyze the semantic meaning of English inputs, and by implementing a LEGO robot NLPr platform. Two main approaches connect current research in general-purpose NLP to NLPr: (1) A domain-specific lexicon and function library serve as the syntax and semantic space. Even though NL can be complex and expressive, the functions needed for the specific robot domain can be fulfilled with libraries built from a finite set of objects and functions. (2) An error-reporting and feedback mechanism detects erroneous sentences, explains possible reasons, and provides debugging and rewriting suggestions. The error-reporting and feedback systems are developed with a hybrid approach that combines rule-based methods, such as finite-state machines (FSM) and dependency-based structural analysis, with the data-based multi-label classification (MLC) method.
Experimental results and user studies show that, with the proposed model and approaches reducing ambiguity within the target domain, the NLPr system can process a relatively expressive controlled NL for robot motion control and generate executable code from the English input. When the system is confronted with erroneous sentences, it produces error messages, suggestions, and example sentences for users. With the proposed language model and system, NL's structural and semantic information can be transformed into the intermediate representations used for program synthesis, which addresses the situation where the considerable amount of data needed for a data-based model is unavailable. / Doctor of Philosophy / Natural language (NL) is one of the most common mediums humans use daily to express and explain ideas and communicate with each other. In contrast, programming languages (PL) are the "language" humans use to communicate with machines. Because they differ in purpose, medium, and audience, they also differ considerably in structure and capabilities. NL is more expressive and natural and can sometimes be rather complex, while PL is mostly short, straightforward, and less expressive than NL. The need for programming has increased in recent years. However, programming languages can easily take novice users months or more to learn. At the same time, all potential users are familiar with at least one NL. As such, natural language programming (NLPr), a technology that enables people to program with NL, holds excellent potential, since it gives non-experts the ability to "program" in the language they already know and offers a Low-Code or even No-Code development experience.
However, despite recent research into NLPr, many challenges in developing NLPr systems are yet to be addressed, namely how to disambiguate natural language semantics, how to validate inputs and provide helpful feedback with a limited amount of data, and how to effectively generate executable programs from the semantic meanings. This dissertation addresses these issues by proposing a Controlled Object-Oriented Language (COOL) model to disambiguate and analyze the semantic meaning of English inputs, and by implementing a LEGO robot NLPr platform. Two main approaches connect current research in general-purpose NLP techniques to NLPr: (1) A domain-specific lexicon and function library are developed with the designed COOL model to serve as the syntax and semantic space. Even though natural language can be extremely complex and expressive, the functions needed for the specific robot domain can be fulfilled with libraries built from a finite set of objects and functions. (2) An error-reporting and feedback mechanism detects erroneous sentences, explains possible reasons, and provides debugging and rewriting suggestions. The error-reporting and feedback systems are developed with a hybrid approach that combines rule-based methods, such as finite-state machines (FSM) and dependency-based structural analysis, with the data-based multi-label classification (MLC) method. Experimental results and user studies show that, with the proposed language model and approaches reducing ambiguity within the target domain, the designed NLPr system can process a relatively expressive controlled natural language designed for robot motion control and generate executable code from the extracted semantic information. When the NLPr system is confronted with erroneous sentences, it produces detailed error messages and offers users suggestions and sample sentences for possible fixes.
With the simple language model and system proposed, NL's structural and semantic information can be transformed into the intermediate representations used for program synthesis, which addresses the situation where the considerable amount of data needed for a data-based model is unavailable. Natural language programming; Natural language processing; Semantic extraction; Multi-label classification; LEGO Mindstorm EV3
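The abstract above describes a finite lexicon combined with rule-based checks such as an FSM that report why a sentence is rejected. The following is only a toy sketch of that general idea, not the COOL model or the dissertation's system: the lexicon entries, the state names, the single sentence pattern (action, direction, integer, unit), and the function `parse_command` are all hypothetical and chosen for illustration.

```python
# Hypothetical controlled vocabulary: word -> token type (an assumption).
LEXICON = {
    "move": "ACTION", "turn": "ACTION",
    "forward": "DIRECTION", "backward": "DIRECTION",
    "left": "DIRECTION", "right": "DIRECTION",
    "cm": "UNIT", "degrees": "UNIT",
}

# Finite-state machine: state -> {accepted token type -> next state}.
TRANSITIONS = {
    "START": {"ACTION": "HAS_ACTION"},
    "HAS_ACTION": {"DIRECTION": "HAS_DIRECTION"},
    "HAS_DIRECTION": {"NUMBER": "HAS_NUMBER"},
    "HAS_NUMBER": {"UNIT": "DONE"},
}


def token_type(token):
    """Classify a token as NUMBER, a lexicon type, or None if unknown."""
    if token.isdigit():
        return "NUMBER"
    return LEXICON.get(token)


def parse_command(sentence):
    """Return (command, None) on success or (None, error message) on failure."""
    state = "START"
    slots = {}
    for token in sentence.lower().strip().rstrip(".").split():
        kind = token_type(token)
        if kind is None:
            known = ", ".join(sorted(LEXICON))
            return None, f"unknown word '{token}'; known words: {known}"
        next_state = TRANSITIONS.get(state, {}).get(kind)
        if next_state is None:
            expected = ", ".join(TRANSITIONS.get(state, {})) or "end of sentence"
            return None, f"unexpected {kind} '{token}'; expected {expected}"
        slots[kind] = token
        state = next_state
    if state != "DONE":
        expected = ", ".join(TRANSITIONS[state])
        return None, f"sentence is incomplete; expected {expected}"
    command = (slots["ACTION"], slots["DIRECTION"],
               int(slots["NUMBER"]), slots["UNIT"])
    return command, None
```

In this sketch, `parse_command("Move forward 10 cm.")` yields the tuple `("move", "forward", 10, "cm")`, while `parse_command("Move 10 cm")` yields an error message naming the expected token type. A real system would also need semantic checks (for example, rejecting "turn forward 10 cm"), which this toy FSM does not perform.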
12	Konzeption eines dreistufigen Transfers für die maschinelle Übersetzung natürlicher Sprachen / Design of a three-stage transfer for the machine translation of natural languages Laube, Annett, Karl, Hans-Ulrich 14 December 2012 0 PREFACE The analysis and synthesis algorithms needed to translate programming languages have for quite some time been formulated in a largely language-independent way. This is reflected, among other things, in the large number of generators that automate the translation process in whole or in part. The syntax of the language to be processed is usually available as data (graphs, lists) based on formal description tools (e.g. BNF). In the translation of natural languages, the separation of language and processing algorithms has, if at all, only begun to take place. The reasons are obvious: natural languages are more powerful, and their formal representation is difficult. If translation is also to cover spoken communication, i.e. to replace the human interpreter at an international conference or on a telephone call with a partner who speaks another language, real-time requirements are added that will force highly parallel approaches. Even without real-time requirements, the translation process is extraordinarily complex. Solutions are sought by means of the interlingua approach and the transfer approach. Increasingly, formal description tools from relatively well-researched areas of computer science are being used (operations over decorated trees, tree-to-tree translation strategies), in the hope that the results will lead further than the spectacular prototypes that are already on the market and are often derived from heuristic approaches.
[...]: 0 Preface; 1 Introduction; 2 The components of the three-stage transfer; 3 Formalization of the composition; 4 Pre-transfer phase; 5 Formalization of the pre-transfer phase; 6 Transfer phase; 7 Formalization of the transfer phase; 8 Post-transfer phase; 9 Transfer example; 10 Summary info:eu-repo/classification/ddc/004 ddc:004
13	Obtenção dos níveis de significância para os testes de Kruskal-Wallis, Friedman e comparações múltiplas não-paramétricas. / Obtaining significance levels for Kruskal-Wallis, Friedman and nonparametric multiple comparisons tests. Pontes, Antonio Carlos Fonseca 29 June 2000 (has links) Uma das principais dificuldades encontradas pelos pesquisadores na utilização da Estatística Experimental Não-Paramétrica é a obtenção de resultados confiáveis. Os testes mais utilizados para os delineamentos com um fator de classificação simples inteiramente casualizados e blocos casualizados são o de Kruskal-Wallis e o de Friedman, respectivamente. As tabelas disponíveis para estes testes são pouco abrangentes, fazendo com que o pesquisador seja obrigado a recorrer a aproximações. Estas aproximações diferem dependendo do autor a ser consultado, podendo levar a resultados contraditórios. Além disso, tais tabelas não consideram empates, mesmo no caso de pequenas amostras. No caso de comparações múltiplas isto é mais evidente ainda, em especial quando ocorrem empates ou ainda, nos delineamentos inteiramente casualizados onde se tem número diferente de repetições entre tratamentos. Nota-se ainda que os softwares mais utilizados em geral recorrem a aproximações para fornecer os níveis de significância, além de não apresentarem resultados para as comparações múltiplas. Assim, o objetivo deste trabalho é apresentar um programa, em linguagem C, que realiza os testes de Kruskal-Wallis, de Friedman e de comparações múltiplas entre todos os tratamentos (bilateral) e entre os tratamentos e o controle (uni e bilateral) considerando todas as configurações sistemáticas de postos ou com 1.000.000 de configurações aleatórias, dependendo do número total de permutações possíveis. Dois níveis de significância são apresentados: o DW ou MaxDif , baseado na comparação com a diferença máxima dentro de cada configuração e o Geral, baseado na comparação com todas as diferenças em cada configuração. 
Os valores do nível de significância Geral assemelham-se aos fornecidos pela aproximação normal. Os resultados obtidos através da utilização do programa mostram, ainda, que os testes utilizando as permutações aleatórias podem ser bons substitutos nos casos em que o número de permutações sistemáticas é muito grande, já que os níveis de probabilidade são bastante próximos. / One of the main difficulties researchers face in using nonparametric methods is obtaining reliable results. The Kruskal-Wallis and Friedman tests are the most widely used for the one-way layout and for randomized blocks, respectively. The tables available for these tests are not very comprehensive, so the researcher must resort to approximations. These approximations differ from author to author and can lead to contradictory results. Furthermore, the tables do not take tied observations into account, even for small samples. For multiple comparisons this is even more evident, especially when tied observations occur or the number of replications differs between treatments. Many software packages, such as SAS, STATISTICA, S-Plus, and MINITAB, use approximations to obtain the significance levels and do not present results for multiple comparisons. Thus, the aim of this work is to present a routine in the C language that runs the Kruskal-Wallis and Friedman tests and multiple comparisons among all treatments (two-tailed) and between treatments and a control (one- and two-tailed), considering either all the systematic configurations of the ranks or more than 1,000,000 random ones, depending on the total number of possible permutations. Two significance levels are presented: DW or MaxDif, based on comparison with the maximum difference within each configuration, and Geral ("general"), based on comparison with all differences in each configuration. The Geral significance levels are very similar to those given by the normal approximation.
The results obtained with this routine also show that tests using random permutations can be good substitutes when the number of systematic permutations is too large, since the probability levels are very close. análise de variância; analysis of variance; C language; estatística não paramétrica; inferência estatística; language programming; linguagem c; linguagem de programação; método estatístico; nonparametric statistics; statistical inference; statistical method
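The abstract above obtains significance levels by evaluating the test statistic over rank configurations, either all of them or a large random sample. The thesis program is written in C and is not reproduced in this listing; what follows is a minimal Python sketch of the random-permutation variant for the Kruskal-Wallis statistic only. The function names, the default of 10,000 random configurations, and the fixed seed are assumptions made for illustration.

```python
import random
from itertools import chain


def midranks(values):
    """Return 1-based ranks; tied observations share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def kruskal_h(ranks, sizes):
    """Kruskal-Wallis H from pooled ranks cut into consecutive groups."""
    n = len(ranks)
    total, start = 0.0, 0
    for size in sizes:
        group_sum = sum(ranks[start:start + size])
        total += group_sum ** 2 / size
        start += size
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


def permutation_p_value(groups, n_random=10000, seed=0):
    """Estimate P(H >= observed H) by randomly reassigning ranks to groups."""
    sizes = [len(g) for g in groups]
    ranks = midranks(list(chain.from_iterable(groups)))
    observed = kruskal_h(ranks, sizes)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_random):
        rng.shuffle(ranks)  # one random configuration of the ranks
        if kruskal_h(ranks, sizes) >= observed - 1e-12:
            hits += 1
    return observed, hits / n_random
```

The usual tie-correction factor is omitted on purpose: it is the same constant for every configuration of a given data set, so it does not change which configurations reach the observed value. Enumerating all configurations instead of sampling, as the abstract describes for small designs, would replace the random loop with a systematic one.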
14	Visualization of microprocessor execution in computer architecture courses: a case study at Kabul University Hedayati, Mohammad Hadi January 2010 Computer architecture and assembly language programming, which cover microprocessor execution, are basic courses taught in every computer science department. Generally, however, students have difficulty mastering many of the concepts in these courses, particularly students whose first language is not English. In addition to their difficulties in understanding the purpose of given instructions, students struggle to mentally visualize the data movement, control and processing operations. To address this problem, this research proposed a graphical visualization approach and investigated visual illustrations of such concepts and of instruction execution by implementing a graphical visualization simulator as a teaching aid. The graphical simulator developed during the course of this research was applied in a computer architecture course at Kabul University, Afghanistan. Results obtained from student evaluation of the simulator show significant levels of success using the visual simulation teaching aid. The results showed that improved learning was achieved, suggesting that this approach could be useful in other computer science departments in Afghanistan and elsewhere where similar challenges are experienced.
15	Visualization of microprocessor execution in computer architecture courses: a case study at Kabul University Hedayati, Mohammad Hadi January 2010
16	Obtenção dos níveis de significância para os testes de Kruskal-Wallis, Friedman e comparações múltiplas não-paramétricas. / Obtaining significance levels for Kruskal-Wallis, Friedman and nonparametric multiple comparisons tests. Antonio Carlos Fonseca Pontes 29 June 2000
17	Zur Beziehung von Raum und Inhalt nutzergenerierter geographischer Informationen / On the relationship of space and content of volunteered geographic information Hahmann, Stefan 21 July 2014 Over the last ten years the World Wide Web has changed significantly, evolving into the so-called “Web 2.0”. The most important feature of this new quality of the WWW is the participation of users in generating content. This trend facilitates the formation of user communities that collaborate on diverse projects in which they collect and publish information. Prominent examples of such projects are the online encyclopedia “Wikipedia”, the microblogging platform “Twitter”, the photo platform “Flickr” and the database of topographic information “OpenStreetMap”. User-generated content that is directly or indirectly geospatially referenced is often termed more specifically “volunteered geographic information”. The geospatial reference of this information is established either directly, by coordinates given as meta-information, or indirectly, through georeferencing of toponyms or addresses contained in the information. Volunteered geographic information is particularly suited for research, as it can be accessed at low or even no cost. Furthermore, it reflects a variety of human decisions that are linked to geographic space. In this thesis, the relationship of space and content of volunteered geographic information is investigated from two different perspectives. The first part of this thesis addresses the question of what share of information exhibits a relationship between space and content such that the information is locatable in geospace. In this context, the assumption that about 80% of all information has a reference to space is well known within the community of geographic information system users. Since the 1980s it has served as a marketing tool throughout the geoinformation sector, although there has been no empirical evidence for it.
This thesis contributes to filling this research gap. For the validation of the ‘80%-hypothesis’ two approaches are presented. The first approach is based on a corpus of information that is as representative as possible of world knowledge. For this purpose the German-language edition of Wikipedia has been selected. This corpus is modeled as a network of information in which the articles are the nodes and the cross references are the edges of a directed graph. With the help of this network a graduated definition of geospatial reference is possible. It is implemented by computing the distance from each article to the closest article within the network that has spatial coordinates assigned. In parallel, a survey-based approach is developed in which participants assign pieces of information to one of the categories “direct geospatial reference”, “indirect geospatial reference” and “no geospatial reference”. A synthesis of both approaches leads to an empirically justified figure for the “80%-assertion”. The result of the investigation is that, for the Wikipedia corpus, 27% of the information may be categorized as directly geospatially referenced and 30% as indirectly geospatially referenced. The second part of the thesis investigates to what extent volunteered geographic information produced on mobile devices is related to the locations where it is published. For this purpose, a collection of microblogging texts produced on mobile devices serves as the research corpus. Microblogging texts are short texts that are published via the World Wide Web. For this type of information the relationship between the content of the information and its position is less obvious than, e.g., for topographic information or photo descriptions.
The analysis of microblogging texts offers new possibilities for market and opinion research, the monitoring of natural events and human activities, and decision support in disaster management. Spatial analysis of the texts may add extra value; for some applications it is in fact a necessary condition. For this reason, the relationship between the published content and the locations where it is generated is of interest. This thesis describes methods that support the investigation of this relationship. In the presented approach, classified Points of Interest serve as a model of the environment. To investigate the correlation between these points and the microblogging texts, manual classification and natural language processing are used to classify the texts according to their relevance to the respective feature classes. Subsequently, it is tested whether the share of relevant texts in the proximity of objects of the tested classes is above average. The results show that the strength of the location-content correlation depends on the tested feature class. While for the feature classes ‘train station’, ‘airport’ and ‘restaurant’ a significant dependency of the share of relevant texts on the distance to the respective objects can be observed, this is not confirmed for objects of other feature classes, such as ‘cinema’ and ‘supermarket’. However, as prior research at small cartographic scales has detected correlations between space and content of microblogging texts, it can be concluded that the strength of this correlation depends on scale and topic. / Während der vergangenen zehn Jahre vollzog sich eine signifikante Veränderung des World Wide Webs, das sich zum sogenannten „Web 2.0“ entwickelte.
Das wesentlichste Merkmal dieser neuen Qualität des WWW ist die Beteiligung der Nutzer bei der Erstellung der Inhalte. Diese Entwicklung fördert das Entstehen von Nutzergemeinschaften, die kollaborativ in unterschiedlichsten Projekten Informationen sammeln und veröffentlichen. Prominente Beispiele für solche Projekte sind die Online-Enzyklopädie „Wikipedia“, die Microblogging-Plattform „Twitter“, die Foto-Plattform „Flickr“ und die Sammlung topographischer Informationen „OpenStreetMap“. Nutzergenerierte Inhalte, die direkt oder indirekt raumbezogen sind, können spezifischer als „nutzergenerierte geographische Informationen“ bezeichnet werden. Der Raumbezug dieser Informationen entsteht entweder direkt durch die Angabe räumlicher Koordinaten als Metainformationen oder er kann indirekt durch die Georeferenzierung von in den Informationen enthaltenen Toponymen oder Adressen hergestellt werden. Nutzergenerierte geographische Informationen haben für die Forschung den besonderen Vorteil, dass sie einerseits häufig gänzlich ohne oder nur mit geringen Kosten verfügbar gemacht werden können und andererseits eine Vielzahl von menschlichen Entscheidungen widerspiegeln, die mit dem Raum verknüpft sind. In der vorliegenden Dissertation wird die Beziehung von Raum und Inhalt nutzergenerierter geographischer Informationen aus zwei Perspektiven untersucht. Im ersten Teil der Arbeit steht die Frage im Vordergrund, für welchen Anteil an Informationen eine Beziehung zwischen Raum und Informationsinhalt in der Art besteht, dass die Informationen im Georaum lokalisierbar sind. In diesem Zusammenhang existiert seit den 1980er Jahren die unter Nutzern von geographischen Informationssystemen weit verbreitete These, dass 80% aller Informationen einen Raumbezug haben. Diese These dient im gesamten Spektrum der Branche als Marketinginstrument, ist jedoch nicht empirisch belegt. Diese Arbeit trägt dazu bei, die bestehende Forschungslücke zu schließen. 
Für die Prüfung dieser These, die in der Arbeit als „Raumbezugshypothese“ bezeichnet wird, werden zwei Ansätze vorgestellt. Der erste Ansatz basiert auf der Analyse eines möglichst repräsentativen Informationskorpus, wofür die deutsche Sprachversion der Wikipedia ausgewählt wird. Diese wird als Informationsnetzwerk modelliert, indem deren Artikel als Knoten und deren interne Querverweise als Kanten eines gerichteten Graphen betrachtet werden. Mit Hilfe dieses Netzwerkes ist es möglich eine abgestufte Definition des Raumbezuges von Informationen einzuführen, indem die Entfernung jedes Artikels innerhalb des Netzwerkes zum jeweils nächstgelegenen Artikel, der mit räumlichen Koordinaten gekennzeichnet ist, berechnet wird. Parallel dazu wird ein Befragungsansatz entwickelt, bei dem Probanden die Aufgabe haben, Informationen in die Kategorien „Direkter Raumbezug“, „Indirekter Raumbezug“ und „Kein Raumbezug“ einzuordnen. Die Synthese beider Ansätze führt zu einer empirisch begründeten Zahl für die „Raumbezugsthese“. Das Ergebnis ist, dass für das Untersuchungskorpus Wikipedia 27% der Informationen als direkt raumbezogenen und 30% der Informationen als indirekt raumbezogen kategorisiert werden können. Im zweiten Teil der Arbeit wird die Forschungsfrage untersucht, inwiefern nutzergenerierte Informationen, die über mobile Geräte erzeugt werden, in Beziehung zu den Orten stehen, an denen sie veröffentlicht werden. Als Forschungskorpus dienen mobil verfasste Microblogging-Texte. Dies sind kurze Texte, die über das WWW veröffentlicht werden. Bei dieser Informationsart liegt im Gegensatz zu beispielsweise topographischen Information oder Fotobeschreibungen die Vermutung eines starken Zusammenhanges zwischen dem Inhalt der Informationen und deren Positionen nicht nahe. 
Die Analyse von Microblogging-Texten bietet unter anderem Potential für die Markt- und Meinungsforschung, die Beobachtung von Naturereignissen und menschlichen Aktivitäten sowie die Entscheidungsunterstützung in Katastrophenfällen. Aus der räumlichen Auswertung kann sich dabei ein Mehrwert ergeben, für einen Teil der Anwendungen ist die räumliche Auswertung sogar die notwendige Voraussetzung. Aus diesem Grund ist die Erforschung des Zusammenhanges der veröffentlichten Inhalte mit den Orten, an denen diese entstehen, von Interesse. In der Arbeit werden eine Methoden vorgestellt, mit deren Hilfe die Untersuchung dieser Korrelation am Beispiel von klassifizierten Points of Interest durchgeführt wird. Zu diesem Zweck werden die Texte mit Hilfe von manueller Klassifikation und maschineller Sprachverarbeitung entsprechend ihrer Relevanz für die getesteten Objektklassen klassifiziert. Anschließend wird geprüft, ob der Anteil der relevanten Texte in der Nähe von Objekten der getesteten Klassen überdurchschnittlich hoch ist. Die Ergebnisse der Untersuchungen zeigen, dass die Stärke der Raum-Inhalt-Korrelation von den getesteten Objektklassen abhängig ist. Während sich beispielsweise bei Bahnhöfen, Flughäfen und Restaurants eine deutliche Abhängigkeit des Anteils der relevanten Texte von der Entfernung zu den betreffenden Objekten zeigt, kann dies für andere Objektklassen, wie z.B. Kino oder Supermarkt nicht bestätigt werden. Da frühere Forschungsarbeiten bei der Analyse im kleinmaßstäbigen Bereich eine Korrelation der Informationsinhalte mit deren Entstehungsorten feststellten, kann geschlussfolgert werden, dass der Zusammenhang zwischen Raum und Inhalt bei Microblogging-Texten sowohl vom Maßstab als auch vom Thema abhängig ist. 
Nutzergenerierte Inhalte; Wikipedia; OpenStreetMap; Twitter; Netzwerke; Raumbezug; Geographische Informationssuche; Maschinelles Lernen; Computerlinguistik; Volunteered Geographic Information; VGI; User Generated Content; UGC; Geographical information science; Wikipedia; Twitter; OpenStreetMap; Networks; Geospatial reference; Geographic information retrieval; machine learning; natural language programming; ddc:550; rvk:RB 10104
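The abstract of entry 17 grades geospatial reference by the network distance from each article to the closest article that carries coordinates. One way to compute such distances is a multi-source breadth-first search, sketched below in Python. This is an illustration under stated assumptions, not the thesis implementation: the function name is hypothetical, and it is assumed that distance is measured along outgoing cross references, from an article toward a georeferenced one.

```python
from collections import deque


def distance_to_nearest_georeferenced(links, georeferenced):
    """Hop count from each article to its nearest georeferenced article.

    links: dict mapping an article to the articles it links to.
    georeferenced: iterable of articles that carry coordinates.
    Articles that cannot reach any georeferenced article are omitted.
    """
    # Reverse the edges so one search can start from all georeferenced
    # articles at once and walk backwards along the cross references.
    reverse = {}
    for source, targets in links.items():
        reverse.setdefault(source, [])
        for target in targets:
            reverse.setdefault(target, []).append(source)

    distance = {}
    for article in georeferenced:
        reverse.setdefault(article, [])
        distance[article] = 0

    queue = deque(distance)
    while queue:
        article = queue.popleft()
        for predecessor in reverse[article]:
            if predecessor not in distance:
                distance[predecessor] = distance[article] + 1
                queue.append(predecessor)
    return distance
```

With a made-up toy graph in which "Music" links to "Baroque", "Baroque" links to "Dresden", and only "Dresden" is georeferenced, the function returns distance 0 for "Dresden", 1 for "Baroque" and 2 for "Music". A graded definition could then call distance 0 a direct reference and small positive distances an indirect one; where to place that cut-off is not stated in the abstract.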
18	Zur Beziehung von Raum und Inhalt nutzergenerierter geographischer Informationen Hahmann, Stefan 12 June 2014 (has links) In the last ten years there has been a significant progress of the World Wide Web, which evolved to become the so-called “Web 2.0”. The most important feature of this new quality of the WWW is the participation of the users in generating contents. This trend facilitates the formation of user communities which collaborate on diverse projects, where they collect and publish information. Prominent examples of such projects are the online-encyclopedia “Wikipedia”, the microblogging-platform “Twitter”, the photo-platform “Flickr” and the database of topographic information “OpenStreetMap”. User-generated content, which is directly or indirectly geospatially referenced, is of-ten termed more specifically as “volunteered geographic information”. The geospatial reference of this information is constituted either directly by coordinates that are given as meta-information or indirectly through georeferencing of toponyms or addresses that are contained in this information. Volunteered geographic information is particularly suited for research, as it can be accessed with low or even at no costs at all. Furthermore it reflects a variety of human decisions which are linked to geographic space. In this thesis, the relationship of space and content of volunteered geographic information is investigated from two different perspectives. The first part of this thesis addresses the question for which share of information there exists a relationship between space and content of the information, such that the information is locatable in geospace. In this context, the assumption that about 80% of all information has a reference to space has been well known within the community of geographic information system users. Since the 1980s it has served as a marketing tool within the whole geoinformation sector, although there has not been any empirical evidence. 
This thesis contributes to filling this research gap. For the validation of the ‘80%-hypothesis’ two approaches are presented. The first approach is based on a corpus of information that is as representative as possible of world knowledge. For this purpose the German-language edition of Wikipedia has been selected. This corpus is modeled as a network of information in which the articles are the nodes and the cross-references are the edges of a directed graph. With the help of this network a graduated definition of geospatial reference is possible. It is implemented by computing the distance of each article to the closest article within the network that has spatial coordinates assigned. In parallel, a survey-based approach is developed in which participants assign pieces of information to one of the categories “direct geospatial reference”, “indirect geospatial reference” and “no geospatial reference”. A synthesis of both approaches leads to an empirically justified figure for the “80%-assertion”. The result of the investigation is that, for the Wikipedia corpus, 27% of the information may be categorized as directly geospatially referenced and 30% as indirectly geospatially referenced. The second part of the thesis investigates to what extent volunteered geographic information that is produced on mobile devices is related to the locations where it is published. For this purpose, a collection of microblogging-texts produced on mobile devices serves as the research corpus. Microblogging-texts are short texts that are published via the World Wide Web. For this type of information the relationship between the content of the information and its position is less obvious than, e.g., for topographic information or photo descriptions. 
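The network-based definition from the first part (the distance of each article to the closest article that carries coordinates) can be illustrated with a breadth-first search. This is only a minimal sketch and not the thesis's implementation: the function name, the toy link data, and the choice to measure distance along outgoing links are assumptions made for illustration.

```python
from collections import deque


def hops_to_nearest_georeferenced(links, georeferenced):
    """Return {article: hops to the closest article with coordinates}.

    links: dict mapping an article to the articles it links to (directed).
    georeferenced: set of articles that have spatial coordinates.
    Assumption: distance is counted along outgoing links, so the search
    runs on the reversed graph, starting from all georeferenced articles.
    Articles that cannot reach any georeferenced article are omitted.
    """
    # Build the reversed adjacency: target -> set of articles linking to it.
    reverse = {}
    for source, targets in links.items():
        reverse.setdefault(source, set())
        for target in targets:
            reverse.setdefault(target, set()).add(source)

    # Multi-source BFS: every georeferenced article starts at distance 0.
    distance = {article: 0 for article in georeferenced}
    queue = deque(distance)
    while queue:
        current = queue.popleft()
        for predecessor in reverse.get(current, ()):
            if predecessor not in distance:
                distance[predecessor] = distance[current] + 1
                queue.append(predecessor)
    return distance


# Hypothetical toy network, not data from the thesis.
links = {
    "Dresden": ["Elbe"],
    "Elbe": [],
    "Semperoper": ["Dresden"],
    "Opera": ["Semperoper"],
    "Algebra": ["Mathematics"],
    "Mathematics": [],
}
hops = hops_to_nearest_georeferenced(links, {"Dresden", "Elbe"})
```

In this toy network "Semperoper" is one hop and "Opera" two hops away from a georeferenced article, while "Algebra" and "Mathematics" do not appear in the result because no link path leads from them to an article with coordinates.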
The analysis of microblogging-texts offers new possibilities for market and opinion research, the monitoring of natural events and human activities as well as for decision support in disaster management. The spatial analysis of the texts may add extra value. In fact for some of the applications the spatial analysis is a necessary condition. For this reason, the investigation of the relationship of the published contents with the locations where they are generated is of interest. Within this thesis, methods are described that support the investigation of this relationship. In the presented approach, classified Points of Interest serve as a model for the environment. For the purpose of the investigation of the correlation between these points and the microblogging-texts, manual classification and natural language processing are used in order to classify these texts according to their relevance in regard to the respective feature classes. Subsequently, it is tested whether the share of relevant texts in the proximity of objects of the tested classes is above average. The results of the investigation show that the strength of the location-content-correlation depends on the tested feature class. While for the feature classes ‘train station’, ‘airport’ and ‘restaurant’ a significant dependency of the share of relevant texts on the distance to the respective objects may be observed, this is not confirmed for objects of other feature classes, such as ‘cinema’ and ‘supermarket’. 
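The proximity test described above (is the share of relevant texts near objects of a class above the overall average?) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure used in the thesis: the haversine distance, the distance bins, the function names, and the toy coordinates are all hypothetical.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres on a sphere of radius 6,371 km."""
    radius_m = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))


def relevant_share_by_distance(texts, pois, bin_edges_m):
    """Share of relevant texts per distance bin to the nearest POI.

    texts: non-empty list of (lat, lon, is_relevant) tuples.
    pois: non-empty list of (lat, lon) tuples of one feature class.
    bin_edges_m: ascending upper bin bounds in metres.
    Returns (overall_share, [(upper_bound, share or None), ...]).
    """
    counts = [[0, 0] for _ in bin_edges_m]  # [relevant, total] per bin
    for lat, lon, is_relevant in texts:
        nearest = min(haversine_m(lat, lon, p_lat, p_lon) for p_lat, p_lon in pois)
        for i, upper in enumerate(bin_edges_m):
            if nearest <= upper:
                counts[i][0] += int(is_relevant)
                counts[i][1] += 1
                break  # texts beyond the last bound are not binned
    overall = sum(1 for _, _, rel in texts if rel) / len(texts)
    shares = [(upper, rel / total if total else None)
              for upper, (rel, total) in zip(bin_edges_m, counts)]
    return overall, shares


# Hypothetical toy data: one POI and six classified texts.
pois = [(51.0, 13.7)]
texts = [
    (51.0, 13.7, True), (51.0005, 13.7, True), (51.0, 13.7005, False),
    (51.01, 13.7, False), (51.02, 13.7, True), (51.015, 13.7, False),
]
overall, shares = relevant_share_by_distance(texts, pois, [100, 5000])
```

A share in the nearest bin that lies clearly above `overall` would point to a location-content correlation for that feature class; whether the difference is significant would still require a statistical test, which this sketch does not perform.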
However, as prior research on small cartographic scales has detected correlations between space and content of microblogging-texts, it can be concluded that the strength of the correlation between space and content of microblogging-texts depends on scale and topic.

Table of contents:
1 Introduction
1.1 Motivation
1.1.1 Significance of georeferenced user-generated content for geographic information science and cartography
1.1.2 The geospatial reference hypothesis (Raumbezugshypothese)
1.1.3 The correlation of location and content in user-generated content
1.2 Research objectives and research questions
1.2.1 Testing the geospatial reference hypothesis
1.2.2 Investigating the correlation of location and content of user-generated content
1.3 Structure of the thesis
1.3.1 The relationship between space and content of user-generated geographic information
1.3.2 Outline of the thesis
1.3.3 Publications used
2 State of research
2.1 Relevant terms
2.1.1 Web 2.0
2.1.2 User Generated Content / Nutzergenerierte Inhalte
2.1.2.1 Meaning and origin of the term
2.1.2.2 Definition of the term
2.1.2.3 Types of UGC
2.1.2.4 Criticism
2.1.2.5 Research potential
2.1.3 Geospatial reference (Raumbezug)
2.1.3.1 The term ‘Raumbezug’ in the literature
2.1.3.2 Categories of geospatial reference
2.1.4 Geospatial (Georäumlich)
2.1.5 Geographic information and geodata
2.1.5.1 Definition of terms
2.1.5.2 Points of Interest as a special case
2.1.6 Volunteered Geographic Information / Nutzergenerierte geographische Informationen
2.1.6.1 Origin of the term and characteristics of VGI
2.1.6.2 The concept of human sensors
2.1.6.3 Communication of geographic information in VGI
2.1.6.4 The added value of VGI
2.1.6.5 Motives of contributors
2.1.6.6 VGI in a global context
2.1.6.7 Collection of information: participatory vs. opportunistic
2.1.6.8 Formal definition
2.1.6.9 German equivalent of the term
2.1.7 Semantics of user-generated geographic information
2.1.7.1 Structured form
2.1.7.2 Unstructured form
2.2 Types of user-generated geographic information
2.2.1 Topographic information – OpenStreetMap
2.2.1.1 Corpus description
2.2.1.2 Research overview
2.2.1.3 Geospatial reference
2.2.2 Encyclopedic information – Wikipedia
2.2.2.1 Corpus description
2.2.2.2 Research overview
2.2.2.3 Geospatial reference
2.2.2.4 Meta-properties of articles in the German Wikipedia
2.2.3 Microblogging-texts – Twitter
2.2.3.1 Corpus description
2.2.3.2 Research overview
2.2.3.3 Geospatial reference
2.2.4 Images and image meta-information – Flickr, Instagram, Picasa, Panoramio, Geograph
2.2.4.1 Corpus description
2.2.4.2 Research overview
2.3 Information and networks
2.3.1 Examples of network structures
2.3.2 Implications of networked information for the geospatial reference hypothesis
2.3.3 Network properties of Wikipedia
2.4 Geographic information and cognition
2.5 Classifying information by natural language processing
2.5.1 Naive Bayes
2.5.2 Maximum Entropy
2.5.3 Support Vector Machines
3 Methods and results
3.1 Corpus-analytic approach to testing the geospatial reference hypothesis
3.1.1 Network degree of geospatial reference (Netzwerkgrad des Georaumbezuges, NGGR)
3.1.2 Data processing
3.1.3 Results of the NGGR computation
3.1.4 Correlation between NGGR and the properties of Wikipedia articles
3.2 Survey approach to testing the geospatial reference hypothesis
3.2.1 Categorization task for investigating geospatial reference
3.2.1.1 Material
3.2.1.2 Procedure
3.2.1.3 Participants
3.2.2 Hypotheses
3.2.3 Data on participation in the survey
3.2.4 Results
3.3 Synthesis of the corpus-analytic approach and the survey approach for testing the geospatial reference hypothesis
3.3.1 Methodology
3.3.2 Results
3.3.3 Influence of the factor knowledge on the survey results
3.3.4 Influence of professional background on the survey results
3.3.5 Prediction of the share of georeferenced information for the entire corpus of the German Wikipedia
3.4 Classification of user-generated geographic information with respect to the location-content correlation, using mobile-authored microblogging-texts as an example
3.4.1 Manual text classification
3.4.2 Supervised machine text classification with manually classified training data
3.4.2.1 Preprocessing of the microblogging-texts
3.4.2.2 Evaluation of the results of the machine text classification
3.4.2.3 Tuning of the machine classification
3.4.3 Supervised machine text classification with lexical training data
3.4.4 Data used
3.4.4.1 Recording mobile microblogging-texts with the Twitter Streaming API
3.4.4.2 Filtering usable microblogging-texts
3.4.4.3 Temporal and spatial patterns of the microblogging-texts
3.4.4.4 Points of Interest used
3.4.5 Results
3.4.5.1 Manual annotation of texts
3.4.5.2 Supervised machine classification of texts with manually classified training data
3.4.5.3 Supervised machine classification of texts with lexical training data
3.5 Determining the distance dependency of the share of information relevant to specific places, using mobile-authored microblogging-texts as an example
3.5.1 Methodology
3.5.2 Results
4 Discussion
4.1 Methods for testing the geospatial reference hypothesis using the Wikipedia corpus as an example
4.1.1 Choice of corpus
4.1.2 Abstract concept and instance
4.1.3 Corpus-analytic approach
4.1.4 Survey approach
4.2 Methods for determining the location-content correlation of user-generated information, using mobile-generated microblogging-texts as an example
4.2.1 Manual classification
4.2.2 Supervised machine classification with manually classified training data
4.2.3 Unsupervised machine classification with lexical training data
4.2.4 Computing the distance dependency of the share of location-related texts
4.2.5 Points of Interest as a model of the spatial context
4.3 The term ‘Raumbezug’ in the context of user-generated geographic information
5 Conclusions and research outlook
5.1 Answers to the research questions
5.1.1 On testing the geospatial reference hypothesis
5.1.2 On the correlation of location and content of user-generated geographic information
5.2 Implications of the research results
5.3 Research outlook for user-generated geographic information
5.3.1 Quality of VGI
5.3.2 Synthesis of VGI with official data
5.3.3 Further current developments in VGI research
6 Bibliography
7 Appendix
Appendix A Documentation of the “Experiment Georaumbezug”
Appendix B Results of the categorization task of the “Experiment Georaumbezug”
Appendix C Feedback from the participants of the “Experiment Georaumbezug”
Appendix D Influence of the factors professional background and knowledge on the categorization of terms with respect to their geospatiality
Appendix E Results of the manual classification of the microblogging-texts
Appendix F Classification models resulting from manual and lexical training data
Appendix G Research data appendix

/ Over the past ten years the World Wide Web has changed significantly, evolving into the so-called “Web 2.0”. The most essential feature of this new quality of the WWW is the participation of users in creating content. This development fosters the emergence of user communities that collaboratively collect and publish information in a wide variety of projects. 
Prominent examples of such projects are the online encyclopedia “Wikipedia”, the microblogging platform “Twitter”, the photo platform “Flickr” and the collection of topographic information “OpenStreetMap”. User-generated content that is directly or indirectly georeferenced can be referred to more specifically as “user-generated geographic information”. The spatial reference of this information arises either directly from spatial coordinates given as meta-information, or it can be established indirectly by georeferencing toponyms or addresses contained in the information. For research, user-generated geographic information has the particular advantage that, on the one hand, it can often be made available at no cost or only at low cost and, on the other hand, it reflects a multitude of human decisions that are linked to space. This dissertation investigates the relationship between space and content of user-generated geographic information from two perspectives. The first part of the work focuses on the question of what share of information exhibits a relationship between space and information content such that the information can be located in geospace. In this context, the claim that 80% of all information has a spatial reference has been widespread among users of geographic information systems since the 1980s. This claim serves as a marketing instrument across the entire industry, but it has not been empirically substantiated. This work contributes to closing the existing research gap. Two approaches are presented for testing this claim, which is referred to in the work as the “geospatial reference hypothesis” (Raumbezugshypothese). The first approach is based on the analysis of an information corpus that is as representative as possible, for which the German-language version of Wikipedia is selected. 
It is modeled as an information network by treating its articles as nodes and its internal cross-references as edges of a directed graph. With the help of this network it is possible to introduce a graduated definition of the spatial reference of information by computing, for each article, the distance within the network to the nearest article that is tagged with spatial coordinates. In parallel, a survey approach is developed in which participants have the task of assigning information to the categories “direct spatial reference”, “indirect spatial reference” and “no spatial reference”. The synthesis of both approaches leads to an empirically grounded figure for the “geospatial reference claim”. The result is that, for the Wikipedia corpus under study, 27% of the information can be categorized as directly georeferenced and 30% as indirectly georeferenced. The second part of the work investigates the research question of to what extent user-generated information produced on mobile devices is related to the places where it is published. Mobile-authored microblogging-texts serve as the research corpus. These are short texts published via the WWW. For this type of information, in contrast to, for example, topographic information or photo descriptions, a strong connection between the content of the information and its position is not an obvious assumption. The analysis of microblogging-texts offers potential for, among other things, market and opinion research, the observation of natural events and human activities, and decision support in disasters. Spatial evaluation can add value here; for some of the applications, spatial evaluation is even a necessary prerequisite. 
For this reason, research into the connection between the published content and the places where it originates is of interest. The work presents methods with which this correlation is investigated using classified Points of Interest as an example. For this purpose, the texts are classified according to their relevance for the tested object classes by means of manual classification and natural language processing. It is then tested whether the share of relevant texts in the vicinity of objects of the tested classes is above average. The results of the investigations show that the strength of the space-content correlation depends on the tested object classes. While, for example, train stations, airports and restaurants show a clear dependency of the share of relevant texts on the distance to the objects concerned, this cannot be confirmed for other object classes such as cinema or supermarket. 
Since earlier research found a correlation between information content and places of origin when analyzing at small cartographic scales, it can be concluded that the relationship between space and content in microblogging-texts depends on both scale and topic. info:eu-repo/classification/ddc/550 ddc:550

Search results