771 |
Concept Based Knowledge Discovery from Biomedical Literature. Radovanovic, Aleksandar. January 2009 (has links)
This thesis introduces novel methods for knowledge discovery and presents a software system that extracts information from biomedical literature, reviews interesting connections between various biomedical concepts and, in so doing, generates new hypotheses. The experimental results obtained using the methods described in this thesis are compared to currently published results obtained by other methods, and a number of case studies are described. This thesis shows how the technology presented can be integrated with the researchers' own knowledge, experimentation and observations for optimal progression of scientific research.
|
772 |
Konzeption eines dreistufigen Transfers für die maschinelle Übersetzung natürlicher Sprachen [Design of a three-stage transfer for the machine translation of natural languages]. Laube, Annett; Karl, Hans-Ulrich 14 December 2012 (has links) (PDF)
0 FOREWORD
The analysis and synthesis algorithms required for translating programming languages have for some time been expressible in a relatively language-independent form. This is reflected, among other things, in the multitude of generators that automate the translation process wholly or in part. The syntax of the language to be processed is usually available in data form (graphs, lists) on the basis of formal description tools (e.g. BNF). In the field of natural language translation, the separation of language and processing algorithms has, if at all, only just begun. The reasons are obvious: natural languages are more powerful, and their formal representation is difficult. If translation is also to cover spoken communication, that is, to replace the human interpreter at an international conference or on the telephone with a partner who speaks a different language, real-time requirements arise that will force the pursuit of highly parallel approaches.
Even when no real-time requirements apply, the translation process is extraordinarily complex. Solutions are sought by means of the interlingua and transfer approaches. Increasingly, formal description tools from relatively well-researched subfields of computer science are being employed (operations on decorated trees, tree-to-tree translation strategies), in the hope that their results will carry further than the spectacular prototypes already on the market, which are often derived from heuristic approaches.
[...]
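As a toy illustration of the tree-to-tree operations on decorated trees mentioned above (not the thesis's actual three-stage transfer), the sketch below rewrites a verb-final German toy parse into English word order. The tree encoding, the single structural rule, and the lexical rules are all invented for the example:

```python
# Each node: (category, decorations, children); leaves are plain words.
def transfer(node, rules):
    """Recursively rewrite a decorated source tree into a target tree."""
    if isinstance(node, str):
        return rules.get(node, node)           # lexical transfer
    cat, deco, children = node
    children = [transfer(c, rules) for c in children]
    if cat == "S" and deco.get("verb_final"):  # structural transfer rule
        subj, obj, verb = children
        children = [subj, verb, obj]           # verb-final -> SVO
        deco = {**deco, "verb_final": False}
    return (cat, deco, children)

rules = {"Hund": "dog", "sieht": "sees", "Katze": "cat",
         "der": "the", "die": "the"}
src = ("S", {"verb_final": True}, [
    ("NP", {}, ["der", "Hund"]),
    ("NP", {}, ["die", "Katze"]),
    ("V", {}, ["sieht"]),
])
print(transfer(src, rules))  # subject, verb, object order in the output
```

A real transfer component would, of course, operate over full morphosyntactic decorations and many competing rules; the point here is only the recursive tree-to-tree rewriting shape.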
|
773 |
A constraint-based hypergraph partitioning approach to coreference resolution. Sapena Masip, Emili 16 May 2012 (has links)
The objectives of this thesis are focused on research in machine learning for coreference resolution. Coreference resolution is a natural language processing task that consists of determining the expressions in a discourse that mention or refer to the same entity.
The main contributions of this thesis are (i) a new approach to coreference resolution based on constraint satisfaction, using a hypergraph to represent the problem and solving it by relaxation labeling; and (ii) research towards improving coreference resolution performance using world knowledge extracted from Wikipedia.
The developed approach is able to use an entity-mention classification model with more expressiveness than pair-based ones, and to overcome the weaknesses of previous state-of-the-art approaches, such as linking contradictions, classification without context, and lack of information when evaluating pairs. Furthermore, the approach allows the incorporation of new information by adding constraints, and research has been done on using world knowledge to improve performance.
RelaxCor, the implementation of the approach, achieved state-of-the-art results and participated in the international competitions SemEval-2010 and CoNLL-2011, achieving second position in CoNLL-2011. / Coreference resolution is a natural language processing task that consists of determining the expressions in a discourse that refer to the same real-world entity. The task has a direct effect on text mining as well as on many natural language tasks that require discourse interpretation, such as summarization, question answering or machine translation. Resolving coreference is essential in order to "understand" a text or a discourse.
The objectives of this thesis centre on research into coreference resolution with machine learning. Concretely, the research objectives focus on the following areas:
+ Classification models: The most common classification models in the state of the art are based on the independent classification of pairs of mentions. More recently, models that classify groups of mentions have appeared. One objective of the thesis is to incorporate the entity-mention model into the developed approach.
+ Problem representation: There is as yet no definitive representation of the problem. This thesis presents a hypergraph representation.
+ Resolution algorithms: Depending on the problem representation and the classification model, resolution algorithms can be very diverse. One objective of this thesis is to find a resolution algorithm capable of using the classification models within the hypergraph representation.
+ Knowledge representation: In order to manage knowledge from diverse sources, a symbolic and expressive representation of that knowledge is needed. This thesis proposes the use of constraints.
+ Incorporation of world knowledge: Some coreferences cannot be resolved with linguistic information alone. Common sense and world knowledge are often required to resolve them. This thesis proposes a method for extracting world knowledge from Wikipedia and incorporating it into the resolution system.
The main contributions of this thesis are (i) a new approach to the coreference resolution problem based on constraint satisfaction, using a hypergraph to represent the problem and solving it with the relaxation labeling algorithm; and (ii) research into improving the results by adding world knowledge extracted from Wikipedia.
The presented approach can use the mention-pair and entity-mention models in combination, thereby avoiding problems encountered by many other state-of-the-art approaches, such as contradictions between independent classifications, lack of context and lack of information. Moreover, the presented approach allows new information to be incorporated by adding constraints, and research has been carried out on adding world knowledge that improves the results.
RelaxCor, the system implemented during the thesis to experiment with the proposed approach, has achieved results comparable to the best in the state of the art. It participated in the international competitions SemEval-2010 and CoNLL-2011, where it obtained second position at CoNLL-2011.
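A minimal sketch of the relaxation labeling idea: every mention holds a probability distribution over candidate entity labels, and repeated updates shift probability toward the labels supported by compatible mentions. The full approach operates over a hypergraph with constraint-derived weights; here it is reduced to pairwise edges, and the mentions and weights are invented:

```python
def relaxation_labeling(n_mentions, n_entities, compat, iters=100):
    """Relaxation labeling over pairwise compatibilities.

    compat[(i, j)] in [-1, 1]: positive if mentions i and j likely corefer,
    negative if they likely do not.
    """
    # Near-uniform start with a tiny deterministic bias to break symmetry.
    p = []
    for i in range(n_mentions):
        dist = [1.0] * n_entities
        dist[i % n_entities] += 0.1
        total = sum(dist)
        p.append([d / total for d in dist])

    for _ in range(iters):
        for i in range(n_mentions):
            scores = []
            for lab in range(n_entities):
                # Support for label `lab` from all mentions linked to i.
                s = sum(w * p[j][lab] for (a, j), w in compat.items() if a == i)
                s += sum(w * p[a][lab] for (a, j), w in compat.items() if j == i)
                scores.append(max(p[i][lab] * (1.0 + s), 1e-9))
            total = sum(scores)
            p[i] = [sc / total for sc in scores]
    return [max(range(n_entities), key=lambda lab: dist[lab]) for dist in p]

# Mentions: 0 = "Obama", 1 = "he", 2 = "the car"; weights are invented.
compat = {(0, 1): 0.9, (0, 2): -0.9, (1, 2): -0.9}
labels = relaxation_labeling(3, 2, compat)
print(labels)  # mentions 0 and 1 share a label; mention 2 gets the other one
```

The attraction between mentions 0 and 1 and the repulsion from mention 2 drive the distributions to a consistent partition, which is the intuition behind solving coreference as constraint satisfaction.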
|
774 |
Identifying Architectural Concerns From Non-functional Requirements Using Support Vector Machine. Gokyer, Gokhan 01 August 2008 (has links) (PDF)
There is no common consensus on how to identify problem-domain concerns in the architectural modeling of software systems, nor is there a commonly accepted method for effectively modeling the Non-Functional Requirements (NFRs) associated with architectural aspects in the solution domain. This thesis introduces the use of a Machine Learning (ML) method based on Support Vector Machines to relate NFRs to classified "architectural concerns" in an automated way. The method uses Natural Language Processing techniques to fragment plain NFR texts under the supervision of domain experts. The contribution of this approach lies in continuously applying ML techniques to previously discovered "NFR - architectural concern" associations to improve the intelligence of repositories for requirements engineering. The study illustrates a charted roadmap and demonstrates an automated requirements engineering toolset for this roadmap. It also validates the approach and the effectiveness of the toolset on a snapshot of a real-life project.
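As a rough, self-contained sketch of the classification step: the thesis's actual features and toolset are not described here, so the mini NFR corpus, the two concern classes, and the bag-of-words features below are invented, and Pegasos-style sub-gradient training stands in for whatever SVM implementation was used:

```python
import re
from collections import defaultdict

def bow(text):
    """Bag-of-words feature vector keyed by token."""
    vec = defaultdict(float)
    for tok in re.findall(r"[a-z]+", text.lower()):
        vec[tok] += 1.0
    return vec

def train_svm(texts, labels, epochs=200, lam=0.01):
    """Linear SVM trained with Pegasos-style sub-gradient descent.
    Labels must be +1 / -1."""
    samples = [bow(t) for t in texts]
    w, t = defaultdict(float), 0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            t += 1
            eta = 1.0 / (lam * t)
            margin = y * sum(w[f] * v for f, v in x.items())
            for f in list(w):               # L2 regularisation shrink
                w[f] *= 1.0 - eta * lam
            if margin < 1:                  # hinge-loss sub-gradient step
                for f, v in x.items():
                    w[f] += eta * y * v
    return w

def predict(w, text):
    return 1 if sum(w[f] * v for f, v in bow(text).items()) >= 0 else -1

# Hypothetical mini NFR corpus: +1 = performance concern, -1 = security concern.
texts = [
    "the system shall respond within two seconds",
    "response time must not exceed one second",
    "all passwords shall be encrypted",
    "access requires user authentication",
]
labels = [1, 1, -1, -1]
w = train_svm(texts, labels)
print(predict(w, "system response time in seconds"))  # 1 (performance)
```

In the thesis's setting the fragments would come from NLP preprocessing of real NFR documents and the classes from the expert-curated architectural-concern taxonomy; the sketch only shows the supervised mapping itself.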
|
775 |
Processing Turkish Radiology Reports. Hadimli, Kerem 01 May 2011 (has links) (PDF)
Radiology departments utilize various techniques for visualizing patients' bodies, and narrative free-text reports describing the findings in these visualizations are written by medical doctors. The information within these narrative reports needs to be extracted for medical information systems. Turkish is a highly agglutinative language, and this poses problems for information retrieval and extraction from Turkish free texts.
In this thesis, two alternative methods for information retrieval and structured information extraction from Turkish radiology reports are presented: one rule-based and one data-driven. Contrary to previous studies on medical NLP systems, neither of these methods utilizes a medical lexicon or ontology.
Information extraction is performed at the level of extracting medically related phrases from the sentence. The aim is to measure the baseline performance the Turkish language can provide for medical information extraction and retrieval, in isolation from other factors.
|
776 |
Ensembles of Semantic Spaces : On Combining Models of Distributional Semantics with Applications in Healthcare. Henriksson, Aron January 2015 (has links)
Distributional semantics allows models of linguistic meaning to be derived from observations of language use in large amounts of text. By modeling the meaning of words in semantic (vector) space on the basis of co-occurrence information, distributional semantics permits a quantitative interpretation of (relative) word meaning in an unsupervised setting, i.e., human annotations are not required. The ability to obtain inexpensive word representations in this manner helps to alleviate the bottleneck of fully supervised approaches to natural language processing, especially since models of distributional semantics are data-driven and hence agnostic to both language and domain. All that is required to obtain distributed word representations is a sizeable corpus; however, the composition of the semantic space is not only affected by the underlying data but also by certain model hyperparameters. While these can be optimized for a specific downstream task, there are currently limitations to the extent to which the many aspects of semantics can be captured in a single model.
This dissertation investigates the possibility of capturing multiple aspects of lexical semantics by adopting the ensemble methodology within a distributional semantic framework to create ensembles of semantic spaces. To that end, various strategies for creating the constituent semantic spaces, as well as for combining them, are explored in a number of studies. The notion of semantic space ensembles is generalizable across languages and domains; however, the use of unsupervised methods is particularly valuable in low-resource settings, in particular when annotated corpora are scarce, as in the domain of Swedish healthcare. The semantic space ensembles are here empirically evaluated for tasks that have promising applications in healthcare.
It is shown that semantic space ensembles – created by exploiting various corpora and data types, as well as by adjusting model hyperparameters such as the size of the context window and the strategy for handling word order within the context window – are able to outperform the use of any single constituent model on a range of tasks. The semantic space ensembles are used both directly for k-nearest neighbors retrieval and for semi-supervised machine learning. Applying semantic space ensembles to important medical problems facilitates the secondary use of healthcare data, which, despite its abundance and transformative potential, is grossly underutilized. / At the time of the doctoral defense, the following papers were unpublished and had a status as follows: Papers 4 and 5: unpublished conference papers. / High-Performance Data Mining for Drug Effect Detection
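A minimal sketch of the ensemble idea, assuming simple count-based co-occurrence spaces built with two different context-window sizes and combined by averaging their cosine similarities. The corpus is a toy; the dissertation's actual models and combination strategies vary:

```python
from collections import defaultdict
from math import sqrt

def semantic_space(tokens, window):
    """Co-occurrence vectors: one vector per word, keyed by context word."""
    space = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                space[w][tokens[j]] += 1.0
    return space

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def ensemble_similarity(spaces, a, b):
    """Average cosine similarity over an ensemble of semantic spaces."""
    return sum(cosine(s[a], s[b]) for s in spaces) / len(spaces)

corpus = ("the patient received aspirin for pain and "
          "the patient received ibuprofen for pain").split()
spaces = [semantic_space(corpus, w) for w in (1, 3)]  # two window sizes
sim_drugs = ensemble_similarity(spaces, "aspirin", "ibuprofen")
sim_other = ensemble_similarity(spaces, "aspirin", "and")
print(sim_drugs > sim_other)  # True
```

The two drugs occur in near-identical contexts, so every constituent space rates them similar, while the function word is only weakly supported; averaging over spaces with different hyperparameters is the simplest version of the combination strategies studied.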
|
777 |
Automatic lemmatisation for Afrikaans / by Hendrik J. Groenewald. Groenewald, Hendrik Johannes January 2006 (has links)
A lemmatiser is an important component of various human language technology applications for any language. At present, a rule-based lemmatiser for Afrikaans already exists, but this lemmatiser produces disappointingly low accuracy figures. The performance of the current lemmatiser serves as motivation for developing another lemmatiser based on an approach other than language-specific rules. The alternative method of lemmatiser construction investigated in this study is memory-based learning.
Thus, in this research project we develop an automatic lemmatiser for Afrikaans called Lia, "Lemma-identifiseerder vir Afrikaans" ('Lemmatiser for Afrikaans'). In order to construct Lia, the following research objectives are set: i) to define the classes for Afrikaans lemmatisation, ii) to determine the influence of data size and various feature options on the performance of Lia, iii) to automatically determine the algorithm and parameter settings that deliver the best performance in terms of linguistic accuracy, execution time and memory usage.
In order to achieve the first objective, we investigate the processes of inflection and derivation in Afrikaans, since automatic lemmatisation requires a clear distinction between inflection and derivation. We proceed to define the inflectional categories for Afrikaans, which represent a number of affixes that should be removed from word-forms during lemmatisation. The classes for automatic lemmatisation in Afrikaans are derived from these affixes. It is subsequently shown that accuracy, as well as memory usage and execution time, increase as the amount of training data is increased, and that the various feature options have a significant effect on the performance of Lia. The algorithmic parameters and data representation that deliver the best results are determined by the use of PSearch, a programme that implements Wrapped Progressive Sampling in order to determine a set of possibly optimal algorithmic parameters for each of the TiMBL classification algorithms.
Evaluation indicates that an accuracy figure of 92,8% is obtained when training Lia with the best-performing parameters for the IB1 algorithm on feature-aligned data with 20 features. This result indicates that memory-based learning is indeed more suitable than rule-based methods for Afrikaans lemmatiser construction. / Thesis (M.Ing. (Computer and Electronical Engineering))--North-West University, Potchefstroom Campus, 2007.
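A minimal illustration of memory-based lemmatisation in the spirit of TiMBL's IB1 (1-nearest neighbour with an overlap metric): word-final characters are the features, and the class is a suffix-transformation rule learned from stored examples. The feature scheme, training instances, and rules below are invented for the example:

```python
def features(word, n=6):
    """Right-aligned final characters, padded: a common feature scheme
    for memory-based morphology."""
    return list(word.rjust(n, "_")[-n:])

def apply_rule(word, rule):
    """A class is a (strip, add) suffix-rewrite rule."""
    strip, add = rule
    if strip and word.endswith(strip):
        word = word[: len(word) - len(strip)]
    return word + add

def nn_lemmatise(train, word):
    """IB1-style 1-nearest neighbour: pick the stored instance with the
    most matching feature positions and reuse its transformation rule."""
    f = features(word)
    best = max(train, key=lambda inst: sum(a == b for a, b in zip(f, inst[0])))
    return apply_rule(word, best[1])

# Toy training instances: (features of word-form, (suffix to strip, suffix to add)).
train = [
    (features("stoele"), ("e", "")),  # stoele -> stoel (plural -e)
    (features("appels"), ("s", "")),  # appels -> appel (plural -s)
    (features("loop"),   ("", "")),   # loop -> loop (base form unchanged)
]
print(nn_lemmatise(train, "tafels"))  # tafel
```

The unseen form "tafels" is closest to the stored "-s" plural, so its rule is reused; this nearest-neighbour reuse of stored instances, rather than hand-written rules, is the core of the memory-based approach.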
|
778 |
Outomatiese Afrikaanse tekseenheididentifisering [Automatic Afrikaans text-unit identification] / deur Martin J. Puttkammer. Puttkammer, Martin Johannes January 2006 (has links)
An important core technology in the development of human language technology applications is an automatic morphological analyser. Such a morphological analyser consists of various modules, one of which is a tokeniser. At present no tokeniser exists for Afrikaans, and it has therefore been impossible to develop a morphological analyser for Afrikaans. Thus, in this research project such a tokeniser is developed, and the project therefore has two objectives: i) to postulate a tag set for integrated tokenisation, and ii) to develop an algorithm for integrated tokenisation.
In order to achieve the first objective, a tag set for the tagging of sentences, named entities, words, abbreviations and punctuation is proposed specifically for the annotation of Afrikaans texts. It consists of 51 tags, which can be expanded in future in order to establish a larger, more specific tag set. The postulated tag set can also be simplified according to the level of specificity required by the user.
It is subsequently shown that an effective tokeniser cannot be developed using only linguistic, or only statistical, methods. This is due to the complexity of the task: rule-based modules should be used for certain processes (for example sentence recognition), while other processes (for example named-entity recognition) can only be executed successfully by means of a machine-learning module. It is argued that a hybrid system (a system in which rule-based and statistical components are integrated) would achieve the best results on Afrikaans tokenisation.
Various rule-based and statistical techniques, including a TiMBL-based classifier, are then employed to develop such a hybrid tokeniser for Afrikaans. The final tokeniser achieves an f-score of 97.25% when the complete set of tags is used. For sentence recognition an f-score of 100% is achieved. The tokeniser also recognises 81.39% of named entities. When a simplified tag set (consisting of only 12 tags) is used to annotate named entities, the f-score rises to 94.74%.
The conclusion of the study is that a hybrid approach is indeed suitable for Afrikaans sentencisation, named-entity recognition and tokenisation. The tokeniser will improve if it is trained with more data, while the expansion of gazetteers as well as the tag set will also lead to a more accurate system. / Thesis (M.A. (Applied Language and Literary Studies))--North-West University, Potchefstroom Campus, 2006.
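A toy sketch of the hybrid idea: rules handle token splitting, abbreviation periods, and sentence ends, while a gazetteer lookup stands in for the machine-learned named-entity module. The abbreviation list, gazetteer, and tags below are invented (the thesis's actual tag set has 51 tags):

```python
import re

ABBREVIATIONS = {"mnr.", "dr.", "prof."}  # assumed Afrikaans abbreviations

def tokenise(text):
    """Hybrid sketch: rule-based splitting plus lookup-based entity tagging."""
    gazetteer = {"Pretoria": "PLACE", "Martin": "PERSON"}
    raw = re.findall(r"\w+\.?|[.,!?]", text)
    tokens = []
    for tok in raw:
        # Rule: detach a final period unless the token is a known abbreviation.
        if tok.endswith(".") and len(tok) > 1 and tok.lower() not in ABBREVIATIONS:
            tokens.extend([tok[:-1], "."])
        else:
            tokens.append(tok)
    tagged = []
    for tok in tokens:
        if tok.lower() in ABBREVIATIONS:
            tagged.append((tok, "ABBR"))
        elif tok in gazetteer:                 # stand-in for the ML NE module
            tagged.append((tok, gazetteer[tok]))
        elif tok == ".":
            tagged.append((tok, "SENT_END"))
        elif tok in ",!?":
            tagged.append((tok, "PUNC"))
        else:
            tagged.append((tok, "WORD"))
    return tagged

print(tokenise("Mnr. Martin woon in Pretoria."))
# "Mnr." tagged ABBR, "Pretoria" tagged PLACE, final period SENT_END
```

The abbreviation rule is what keeps "Mnr." from triggering a spurious sentence boundary; in the thesis this rule-based layer is combined with a trained TiMBL classifier rather than a fixed gazetteer.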
|
779 |
Answering complex questions : supervised approaches. Sadid-Al-Hasan, Sheikh, University of Lethbridge. Faculty of Arts and Science January 2009 (has links)
The term "Google" has become a verb for most of us. Search engines, however, have certain limitations. Ask one, for example, about the impact of the current global financial crisis in different parts of the world, and you can expect to sift through thousands of results for the answer. This motivates research in complex question answering, where the purpose is to create summaries of large volumes of information as answers to complex questions, rather than simply offering a listing of sources. Unlike simple questions, complex questions cannot be answered easily, as they often require inferencing and synthesizing information from multiple documents. Hence, this task is accomplished by query-focused multi-document summarization systems. In this thesis we apply different supervised learning techniques to the complex question answering problem. To run our experiments, we consider the DUC-2007 main task.
A huge amount of labeled data is a prerequisite for supervised training. Labeling is expensive and time-consuming when performed manually by humans, and automatic labeling is a good remedy. We employ five different automatic annotation techniques to build extracts from human abstracts, using ROUGE, Basic Element (BE) overlap, a syntactic similarity measure, a semantic similarity measure and the Extended String Subsequence Kernel (ESSK). The representative supervised methods we use are Support Vector Machines (SVM), Conditional Random Fields (CRF), Hidden Markov Models (HMM) and Maximum Entropy (MaxEnt). We annotate DUC-2006 data and use it to train our systems, whereas 25 topics of the DUC-2007 data set are used as test data. The evaluation results reveal the impact of the automatic labeling methods on the performance of the supervised approaches to complex question answering. We also experiment with two ensemble-based approaches that show promising results for this problem domain. / x, 108 leaves : ill. ; 29 cm
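A minimal sketch of one of the annotation ideas: score each document sentence against the human abstract with a simplified ROUGE-1 recall (no clipping or stemming, unlike the real ROUGE package) and label high-scoring sentences as extract members for supervised training. The abstract, sentences, and threshold are invented:

```python
import re

def unigrams(text):
    return re.findall(r"\w+", text.lower())

def rouge1_recall(sentence, abstract):
    """Fraction of abstract unigrams covered by the sentence (simplified)."""
    ref = unigrams(abstract)
    if not ref:
        return 0.0
    cand = set(unigrams(sentence))
    return sum(1 for t in ref if t in cand) / len(ref)

def auto_label(sentences, abstract, threshold=0.2):
    """Label each sentence 1 (include in training extract) or 0."""
    return [1 if rouge1_recall(s, abstract) >= threshold else 0
            for s in sentences]

abstract = "the financial crisis reduced global trade"
sentences = [
    "global trade fell sharply during the financial crisis",
    "the committee met on tuesday",
]
print(auto_label(sentences, abstract))  # [1, 0]
```

The resulting binary labels are exactly the kind of training signal the supervised learners (SVM, CRF, HMM, MaxEnt) consume; the other four annotation techniques replace the similarity measure while keeping this thresholding scheme.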
|
780 |
Leveraging supplementary transcriptions and transliterations via re-ranking. Bhargava, Aditya Unknown Date
No description available.
|