Global ETD Search

1	Studies on Annotated Diverse Corpus Construction and Zero Reference Resolution in Japanese / 日本語の多様な文書からなるタグ付きコーパスの構築及びゼロ照応解析に関する研究 Hangyo, Masatsugu 24 March 2014 (has links) 京都大学 / 0048 / 新制・課程博士 / 博士(情報学) / 甲第18407号 / 情博第522号 / 新制\|\|情\|\|92(附属図書館) / 31265 / 京都大学大学院情報学研究科知能情報学専攻 / (主査)教授黒橋禎夫, 教授西田豊明, 教授河原達也 / 学位規則第4条第1項該当 / Doctor of Informatics / Kyoto University / DFAM Zero reference resoution Annotated corpus Author and reader of documents 007
2	Robust Dependency Parsing of Spontaneous Japanese Spoken Language Ohno, Tomohiro, Matsubara, Shigeki, Kawaguchi, Nobuo, Inagaki, Yasuyoshi 03 1900 (has links) No description available. dependency parsing stochastic parsing Japanese speech linguistic phenomena syntactically annotated corpus
3	eDictor: da plataforma para a nuvem / eDictor: from platform to the cloud Veronesi, Luiz Henrique Lima 04 February 2015 (has links) Neste trabalho, apresentamos uma nova proposta para edição de textos que fazem parte de um corpus eletrônico. Partindo do histórico de desenvolvimento do corpus Tycho Brahe e da ferramenta eDictor, propõe-se a análise de todo o processo de trabalho de criação de um corpus para obter uma forma de organização da informação mais concisa e sem redundâncias, através do uso de um único repositório de informações contendo os dados textuais e morfossintáticos do texto. Esta forma foi atingida através da criação de uma estrutura de dados baseada em unidades mínimas chamadas tokens e blocos de unidades chamados chunks. A relação entre os tokens e os chunks, da forma como considerada neste trabalho, é capaz de guardar a informação de como o texto é estruturado em sua visualização (página, parágrafos, sentenças) e na sua estrutura sintática em árvores. A base de análise é composta por todos os arquivos pertencentes ao catálogo de textos do corpus Tycho Brahe. Através desta análise, foi possível chegar a elementos genéricos que se relacionam, desconstruindo o texto e criando uma relação de pontos de início e fim relativos às palavras (tokens) e não seguindo sua forma linear. A introdução do conceito de orientação a objetos possibilitou a criação de uma relação entre unidades ainda menores que o token, os split tokens que também são tokens, pois herdam as características do elemento mais significativo, o token. O intuito neste trabalho foi buscar uma forma com o menor número possível de atributos buscando diminuir a necessidade de se criar atributos específicos demais ou genéricos de menos. Na busca deste equilíbrio, foi verificada a necessidade de se criar um atributo específico para o chunk sintático, um atributo de nível que indica a distância de um nó da árvore para o nó raiz. Organizada a informação, o acesso a ela se torna mais simples e parte-se para definição da interface do usuário. A tecnologia web disponível permite que elementos sejam posicionados na tela reproduzindo a visualização que ocorre no livro e também permite que haja uma independência entre um e outro elemento. Esta independência é o que permite que a informação trafegue entre o computador do usuário e a central de processamento na nuvem sem que o usuário perceba. O processamento ocorre em background, utilizando tecnologias assíncronas. A semelhança entre as tecnologias html e xml introduziu uma necessidade de adaptação da informação para apresentação ao usuário. A solução apresentada neste trabalho é pensada de forma a atribuir aos tokens informações que indiquem que eles fazem parte de um chunk. Assim, não seriam as palavras que pertencem a uma sentença, mas cada palavra que possuiria um pedaço de informação que a faz pertencente à sentença. Esta forma de se pensar muda a maneira como a informação é exibida. / In this work, we present a new proposal for text edition organized under an electronic corpus. Starting from Tycho Brahe corpus development history and the eDictor tool, we propose to analyze the whole work process of corpus creation in order to obtain a more concise and less redudant way of organizing information by using a single source repository for textual and morphosyntactic data. This single source repository was achieved by the creation of a data structure based on minimal significative units called tokens and grouping units named chunks. The relationship between tokens and chunks, in the way considered on this work, allows storage of information about how the text is organized visually (pages, paragraphs, sentences) and on how they are organized syntactically as represented by syntactic trees. All files referred to the Tycho Brahe corpus catalog were used as base for analysis. That way, it was possible to achieve generic elements that relate to each other in a manner that the text is deconstructed by using relative pointers to each token in the text instead of following the usual linear form. The introduction of oriented-object conception made the creation of relationship among even smaller units possible, they are the split tokens, but split tokens are also tokens, as they inherit characteristics from the most significative element (the token). The aim here was being attributeless avoiding the necessity of too specific or too vague attributes. Looking for that balance, it was verified the necessity of creating a level attribute for syntactic data that indicates the distance of a tree node to its root node. After information is organized, access to it become simpler and then focus is turned to user-interface definition. Available web technology allows the use of elements that may be positioned on the screen reproducing the way the text is viewed within a book and it also allows each element to be indepedent of each other. This independence is what allows information to travel between user computer and central processing unit at the cloud without user perception. Processing occurs in background using asynchronous technology. Resemblance between html and xml introduced a necessity of adaption to present the information to the user. The adopted solution in this work realizes that tokens must contain the information about the chunk to which they belong. So this is not a point of view where words belong to sentences, but that each word have a piece of information that make them belong to the sentence. This subtile change of behavioring changes the way information is displayed. Annotated corpus Arquitetura web Computational linguistics Corpus anotado Corpus eletrônico Corpus linguistics Edição filológica digital Electronic corpus Linguística computacional Linguística de corpus Philological digital edition Web architecture
4	eDictor: da plataforma para a nuvem / eDictor: from platform to the cloud Luiz Henrique Lima Veronesi 04 February 2015 (has links) Neste trabalho, apresentamos uma nova proposta para edição de textos que fazem parte de um corpus eletrônico. Partindo do histórico de desenvolvimento do corpus Tycho Brahe e da ferramenta eDictor, propõe-se a análise de todo o processo de trabalho de criação de um corpus para obter uma forma de organização da informação mais concisa e sem redundâncias, através do uso de um único repositório de informações contendo os dados textuais e morfossintáticos do texto. Esta forma foi atingida através da criação de uma estrutura de dados baseada em unidades mínimas chamadas tokens e blocos de unidades chamados chunks. A relação entre os tokens e os chunks, da forma como considerada neste trabalho, é capaz de guardar a informação de como o texto é estruturado em sua visualização (página, parágrafos, sentenças) e na sua estrutura sintática em árvores. A base de análise é composta por todos os arquivos pertencentes ao catálogo de textos do corpus Tycho Brahe. Através desta análise, foi possível chegar a elementos genéricos que se relacionam, desconstruindo o texto e criando uma relação de pontos de início e fim relativos às palavras (tokens) e não seguindo sua forma linear. A introdução do conceito de orientação a objetos possibilitou a criação de uma relação entre unidades ainda menores que o token, os split tokens que também são tokens, pois herdam as características do elemento mais significativo, o token. O intuito neste trabalho foi buscar uma forma com o menor número possível de atributos buscando diminuir a necessidade de se criar atributos específicos demais ou genéricos de menos. Na busca deste equilíbrio, foi verificada a necessidade de se criar um atributo específico para o chunk sintático, um atributo de nível que indica a distância de um nó da árvore para o nó raiz. Organizada a informação, o acesso a ela se torna mais simples e parte-se para definição da interface do usuário. A tecnologia web disponível permite que elementos sejam posicionados na tela reproduzindo a visualização que ocorre no livro e também permite que haja uma independência entre um e outro elemento. Esta independência é o que permite que a informação trafegue entre o computador do usuário e a central de processamento na nuvem sem que o usuário perceba. O processamento ocorre em background, utilizando tecnologias assíncronas. A semelhança entre as tecnologias html e xml introduziu uma necessidade de adaptação da informação para apresentação ao usuário. A solução apresentada neste trabalho é pensada de forma a atribuir aos tokens informações que indiquem que eles fazem parte de um chunk. Assim, não seriam as palavras que pertencem a uma sentença, mas cada palavra que possuiria um pedaço de informação que a faz pertencente à sentença. Esta forma de se pensar muda a maneira como a informação é exibida. / In this work, we present a new proposal for text edition organized under an electronic corpus. Starting from Tycho Brahe corpus development history and the eDictor tool, we propose to analyze the whole work process of corpus creation in order to obtain a more concise and less redudant way of organizing information by using a single source repository for textual and morphosyntactic data. This single source repository was achieved by the creation of a data structure based on minimal significative units called tokens and grouping units named chunks. The relationship between tokens and chunks, in the way considered on this work, allows storage of information about how the text is organized visually (pages, paragraphs, sentences) and on how they are organized syntactically as represented by syntactic trees. All files referred to the Tycho Brahe corpus catalog were used as base for analysis. That way, it was possible to achieve generic elements that relate to each other in a manner that the text is deconstructed by using relative pointers to each token in the text instead of following the usual linear form. The introduction of oriented-object conception made the creation of relationship among even smaller units possible, they are the split tokens, but split tokens are also tokens, as they inherit characteristics from the most significative element (the token). The aim here was being attributeless avoiding the necessity of too specific or too vague attributes. Looking for that balance, it was verified the necessity of creating a level attribute for syntactic data that indicates the distance of a tree node to its root node. After information is organized, access to it become simpler and then focus is turned to user-interface definition. Available web technology allows the use of elements that may be positioned on the screen reproducing the way the text is viewed within a book and it also allows each element to be indepedent of each other. This independence is what allows information to travel between user computer and central processing unit at the cloud without user perception. Processing occurs in background using asynchronous technology. Resemblance between html and xml introduced a necessity of adaption to present the information to the user. The adopted solution in this work realizes that tokens must contain the information about the chunk to which they belong. So this is not a point of view where words belong to sentences, but that each word have a piece of information that make them belong to the sentence. This subtile change of behavioring changes the way information is displayed. Arquitetura web Corpus anotado Corpus eletrônico Edição filológica digital Linguística computacional Linguística de corpus Annotated corpus Computational linguistics Corpus linguistics Electronic corpus Philological digital edition Web architecture
5	Rozpoznávání emocí v česky psaných textech / Recognition of emotions in Czech texts Červenec, Radek January 2011 (has links) With advances in information and communication technologies over the past few years, the amount of information stored in the form of electronic text documents has been rapidly growing. Since the human abilities to effectively process and analyze large amounts of information are limited, there is an increasing demand for tools enabling to automatically analyze these documents and benefit from their emotional content. These kinds of systems have extensive applications. The purpose of this work is to design and implement a system for identifying expression of emotions in Czech texts. The proposed system is based mainly on machine learning methods and therefore design and creation of a training set is described as well. The training set is eventually utilized to create a model of classifier using the SVM. For the purpose of improving classification results, additional components were integrated into the system, such as lexical database, lemmatizer or derived keyword dictionary. The thesis also presents results of text documents classification into defined emotion classes and evaluates various approaches to categorization.

1

Page generated in 0.057 seconds