1 |
Contributions to the estimation of probabilistic discriminative models: semi-supervised learning and feature selection
Sokolovska, Nataliya (25 February 2010)
This thesis studies the estimation of probabilistic discriminative models, focusing on semi-supervised learning and feature selection. The goal of semi-supervised learning is to improve the efficiency of supervised learning by using unlabelled data, a goal that is difficult to achieve for discriminative models. Probabilistic discriminative models can handle rich linguistic representations in the form of very high-dimensional feature vectors. Working in high dimension raises problems, computational ones in particular, that are exacerbated in sequence models such as conditional random fields (CRFs). Our contribution is twofold. First, we introduce an original and simple method for integrating unlabelled data into a semi-supervised objective function, and we show that the corresponding semi-supervised estimator is asymptotically optimal. The case of logistic regression is illustrated with experimental results. Second, we propose an estimation algorithm for CRFs that performs model selection by means of an $L_1$ penalty. We also present experimental results on natural language processing tasks (chunking and named entity recognition), analysing generalisation performance and the selected features. Finally, we suggest several directions for improving the computational efficiency of this technique.
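To illustrate why an $L_1$ penalty performs feature selection, the sketch below shows the soft-thresholding operator that appears in proximal-gradient treatments of $L_1$-penalised objectives. It is a generic illustration under that assumption, not the estimation algorithm of the thesis; the function name `soft_threshold` is hypothetical.

```python
def soft_threshold(weights, threshold):
    """Proximal operator of threshold * ||w||_1, applied coordinate-wise.

    Assumption: a generic proximal-gradient view of the L1 penalty,
    not the specific algorithm proposed in the thesis.
    """
    shrunk = []
    for w in weights:
        if w > threshold:
            shrunk.append(w - threshold)   # shrink positive weights towards zero
        elif w < -threshold:
            shrunk.append(w + threshold)   # shrink negative weights towards zero
        else:
            shrunk.append(0.0)             # small weights become exactly zero
    return shrunk


# Weights whose magnitude is below the threshold are removed from the model.
result = soft_threshold([0.5, -0.2, 1.5, -2.0], 0.3)
```

In a proximal-gradient loop, each update would take the form `w = soft_threshold(w - step * gradient, step * penalty)`, so features with weak gradients are driven to exactly zero and drop out of the model.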
|
2 |
Continuous states conditional random fields training using adaptive integration
Leitao, Joao (January 2010)
Extending Conditional Random Fields (CRFs) from discrete to continuous states removes the limit on the number of states and opens up new applications for CRFs. This work presents our attempts to obtain a correct procedure for training continuous-state CRFs by maximum likelihood. Deriving the equations that govern the extension of CRFs to continuous states made it possible to combine them with the Particle Filter (PF) concept, giving a formulation for training continuous-state CRFs with particle filters. The results indicated that this process is unsuitable, because PF integration converges too slowly for the integrals that replace the summations of discrete CRFs. We therefore changed approach to an adaptive integration scheme. Based on an extension of the Binary Space Partition (BSP) algorithm, we devised an adaptive integration process intended to integrate more precisely than PF while keeping function evaluation less costly. This allowed us to train continuous-state CRFs with some success. To check whether the state could be extended to a vector of continuous values, a scalable version was also used to briefly assess the approach in two dimensions with quadtrees, an asymmetric two-dimensional space partition scheme. To learn more about the problem, it would be useful to know which features are relevant. An embedded feature selection method based on the lasso regulariser was therefore used to pinpoint the most relevant feature functions, and through them the relevant features.
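The idea of adaptive integration, refining the partition only where the integrand is hard to approximate, can be sketched in one dimension with recursive interval bisection. The code below uses adaptive Simpson quadrature as a stand-in; it is not the BSP-based scheme of the thesis, and all function names are hypothetical.

```python
import math


def _simpson(f, a, b):
    """Simpson's rule estimate of the integral of f over [a, b]."""
    mid = 0.5 * (a + b)
    return (b - a) / 6.0 * (f(a) + 4.0 * f(mid) + f(b))


def _refine(f, a, b, whole, tol, depth):
    """Bisect [a, b] and recurse only where the two halves disagree with the whole."""
    mid = 0.5 * (a + b)
    left = _simpson(f, a, mid)
    right = _simpson(f, mid, b)
    error = left + right - whole
    if depth <= 0 or abs(error) <= 15.0 * tol:
        # Accept this cell, with the usual Richardson correction term.
        return left + right + error / 15.0
    return (_refine(f, a, mid, left, tol / 2.0, depth - 1)
            + _refine(f, mid, b, right, tol / 2.0, depth - 1))


def adaptive_integrate(f, a, b, tol=1e-8, max_depth=40):
    """Integrate f over [a, b], spending evaluations where f varies most."""
    return _refine(f, a, b, _simpson(f, a, b), tol, max_depth)


# A peaked integrand: most cells far from the peak are accepted early.
estimate = adaptive_integrate(lambda x: math.exp(-x * x), -6.0, 6.0)
```

A two-dimensional analogue would split rectangular cells into four children, which is the role quadtrees play in the abstract above.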
|
3 |
Algoritmos eficientes para análise de campos aleatórios condicionais semi-markovianos e sua aplicação em sequências genômicas / Efficient algorithms for semi-Markov conditional random fields and their application for the analysis of genomic sequences
Bonadio, Ígor (06 August 2018)
Conditional Random Fields are discriminative probabilistic models that have been used successfully in several areas, such as natural language processing, speech recognition and bioinformatics. However, implementing efficient algorithms for this kind of model is not an easy task. In this thesis we present a framework that supports the development of, and experimentation with, Semi-Markov Conditional Random Fields (semi-CRFs). We developed efficient algorithms, implemented in C++, behind a flexible and intuitive programming interface that lets users define, train and evaluate models. Our implementation was built as an extension of the ToPS framework and can use any model already defined in ToPS as a specialized feature function. Finally, we used our semi-CRF implementation to build a promoter predictor that outperformed existing predictors.
|
4 |
An Enhanced Conditional Random Field Model for Chinese Word Segmentation
Huang, Jhao-ming (03 February 2010)
In Chinese, the smallest meaningful unit is the word, which is composed of a sequence of characters, and a sentence is a sequence of words with no separators between them. In information retrieval and data mining, a sequence of Chinese characters must therefore be segmented before the resulting segments can be used. This process is called Chinese word segmentation. Research on Chinese word segmentation has been under way for many years. Although some recent work achieves very high performance, the recall on words that are not in the dictionary reaches only sixty to seventy percent. The approach described in this paper uses linear-chain conditional random fields (CRFs) to segment Chinese words more accurately. Our study uses a discriminatively trained model with two proposed feature templates for deciding the boundaries between characters. We also propose three further methods to improve the accuracy of the processed segments: duplicate word repartition, date representation repartition, and segment refinement. In the experiments, we test several approaches on three Chinese word corpora and compare the results with those of Li et al. and of Lau and King. The results show that the improved feature template, which uses prefix and postfix information, increases both recall and precision; for example, the F-measure reaches 0.964 on the MSR dataset. Detecting repeated characters allows duplicated characters to be repartitioned better without extra resources. Wrongly segmented dates can be repartitioned better with the proposed method, which handles numbers, dates and measure words. When a word is segmented differently from the corresponding standard segmentation corpus, a proper segment can be produced by repartitioning the assembled segment formed from the current segment and the adjacent segment.
In summary, for Chinese word segmentation with conditional random fields, we have proposed a feature template that gives better results and three methods that address other specific segmentation problems.
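The precision, recall and F-measure quoted above are usually computed over word spans: a predicted word counts as correct only if both of its boundaries match the reference segmentation. The sketch below assumes that common definition; the abstract does not spell out its scoring procedure, and the function names are hypothetical.

```python
def words_to_spans(words):
    """Map a segmented sentence to the set of (start, end) character spans of its words."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_scores(gold_words, predicted_words):
    """Return (precision, recall, F-measure) of a predicted segmentation.

    Assumption: a word is correct only when both boundaries match the gold standard.
    """
    gold = words_to_spans(gold_words)
    predicted = words_to_spans(predicted_words)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f_measure


# Placeholder characters stand in for a five-character sentence.
scores = segmentation_scores(["ab", "cd", "e"], ["ab", "c", "de"])
```

Here only the first word is segmented correctly, so all three scores equal one third.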
|
5 |
Scaling conditional random fields for natural language processing
Cohn, Trevor A (Unknown Date)
This thesis deals with the use of Conditional Random Fields (CRFs; Lafferty et al. (2001)) for Natural Language Processing (NLP). CRFs are probabilistic models for sequence labelling which are particularly well suited to NLP. They have many compelling advantages over other popular models such as Hidden Markov Models and Maximum Entropy Markov Models (Rabiner, 1990; McCallum et al., 2001), and have been applied to a number of NLP tasks with considerable success (e.g., Sha and Pereira (2003) and Smith et al. (2005)). Despite their apparent success, CRFs suffer from two main failings. Firstly, they often over-fit the training sample. This is a consequence of their considerable expressive power, and can be limited by a prior over the model parameters (Sha and Pereira, 2003; Peng and McCallum, 2004). Their second failing is that the standard methods for CRF training are often very slow, sometimes requiring weeks of processing time. This efficiency problem is largely ignored in current literature, although in practice the cost of training prevents the application of CRFs to many new, more complex tasks, and also prevents the use of densely connected graphs, which would allow for much richer feature sets. (For complete abstract open document)
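One source of the training cost mentioned above is that likelihood-based training of a linear-chain CRF needs the log-partition function of every training sequence at every iteration. A minimal sketch of the standard forward recursion follows, with a brute-force check; the score layout (`emissions[t][y]`, `transitions[y_prev][y]`) and the function names are assumptions made for illustration, not taken from the thesis.

```python
import itertools
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def log_partition(emissions, transitions):
    """Forward algorithm: log of the sum over all label sequences of exp(score).

    emissions[t][y] is the score of label y at position t;
    transitions[p][y] is the score of moving from label p to label y.
    Cost is O(T * K^2) for T positions and K labels.
    """
    num_labels = len(transitions)
    alpha = list(emissions[0])
    for scores in emissions[1:]:
        alpha = [
            scores[y] + log_sum_exp([alpha[p] + transitions[p][y] for p in range(num_labels)])
            for y in range(num_labels)
        ]
    return log_sum_exp(alpha)


def brute_force_log_partition(emissions, transitions):
    """Same quantity by enumerating all K^T label sequences (tiny inputs only)."""
    num_labels = len(transitions)
    totals = []
    for labels in itertools.product(range(num_labels), repeat=len(emissions)):
        score = sum(emissions[t][y] for t, y in enumerate(labels))
        score += sum(transitions[a][b] for a, b in zip(labels, labels[1:]))
        totals.append(score)
    return log_sum_exp(totals)


emissions = [[0.1, 1.2], [0.5, -0.3], [2.0, 0.0]]
transitions = [[0.2, -0.1], [0.4, 0.3]]
forward_value = log_partition(emissions, transitions)
exhaustive_value = brute_force_log_partition(emissions, transitions)
```

The quadratic dependence on the number of labels is what makes large label sets, and more densely connected graphs, expensive to train.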
|
7 |
Extração de informações de conferências em páginas web / Extracting conference information from web pages
Garcia, Cássio Alan (January 2017)
Choosing the most suitable conference for submitting a paper depends on several factors: (i) the topic of the paper must be among the conference's topics of interest; (ii) the submission deadline must leave enough time to write the paper; (iii) the conference location and registration fees; and (iv) the quality of the conference, as rated by CAPES in its Qualis ranking. These factors, combined with the existence of thousands of conferences, make the search for the right event very time-consuming, especially when researching in a new area. To help researchers find conferences, this work presents a method for retrieving and extracting data from conference web sites. This is a challenging task, mainly because each conference has its own site with its own layout. We propose CONFTRACKER, which combines the identification of the URLs of conferences listed in the Qualis Table with the extraction of deadlines from their sites. Information extraction is carried out independently of the conference, of the page's layout, and of how the dates are presented (formatting and labels). To evaluate the proposed method, we carried out experiments with real data from Computer Science conferences. The results show that CONFTRACKER performed significantly better than a baseline based on the relative position of labels and dates. Finally, the extraction process is run for all conferences in the Qualis Table, and the collected data populates a database that can be queried through an online interface.
|
8 |
Robust and efficient intrusion detection systems
Gupta, Kapil Kumar (January 2009)
Intrusion detection systems are now an essential component of the overall network and data security arsenal. With rapid advances in network technologies, including higher bandwidths and easy connectivity for wireless and mobile devices, the focus of intrusion detection has shifted from simple signature matching to detecting attacks by analyzing contextual information that may be specific to individual networks and applications. As a result, anomaly and hybrid intrusion detection approaches have gained significance. However, present anomaly and hybrid detection approaches suffer from three major drawbacks: limited attack detection coverage, a large number of false alarms, and inefficient operation.
In this thesis, we address these three issues by introducing efficient intrusion detection frameworks and models that detect a wide variety of attacks and produce very few false alarms. Additionally, with our approach attacks can be not only accurately detected but also identified, which helps to initiate effective intrusion response mechanisms in real time. Experimental results on the benchmark KDD 1999 data set and on two additional data sets collected locally confirm that layered conditional random fields are particularly well suited to detecting attacks at the network level, and that user session modeling with conditional random fields can effectively detect attacks at the application level.
We first introduce the layered framework with conditional random fields as the core intrusion detector. Layered conditional random fields can be used to build scalable and efficient network intrusion detection systems that are highly accurate in attack detection. We show that our systems can operate either at the network level or at the application level and perform better than other well-known approaches to intrusion detection. Experimental results further demonstrate that our system is robust to noise in the training data and handles noise better than other systems such as decision trees and naive Bayes. We then introduce our unified logging framework for audit data collection and perform user session modeling using conditional random fields to build real-time application intrusion detection systems. We demonstrate that our system can effectively detect attacks even when they are disguised within normal events in a single user session. Our user session modeling approach based on conditional random fields also results in early attack detection. This is desirable because intrusion response mechanisms can then be initiated in real time, minimizing the impact of an attack.
|
9 |
Conditional random fields for noisy text normalisation
Coetsee, Dirko (12 1900)
Thesis (MScEng) -- Stellenbosch University, 2014.
The increasing popularity of microblogging services such as Twitter means that more and more unstructured data is available for analysis. The informal language used in these media presents a problem for traditional text mining and natural language processing tools. We develop a pre-processor to normalise this noisy text so that useful information can be extracted with standard tools. A system consisting of a tokeniser, an out-of-vocabulary token identifier, a correct candidate generator, and an N-gram language model is proposed. We compare the performance of generative and discriminative probabilistic models for these different modules. The effect of normalising the training and testing data on the performance of a tweet sentiment classifier is investigated.
A linear-chain conditional random field, which is a discriminative model, is found to work better than its generative counterpart for the tokenisation module, achieving a 0.76% character error rate compared to 1.41% for the finite state automaton. For the candidate generation module, however, the generative weighted finite state transducer works better, producing the correct clean version of a word on the first guess 36% of the time, while the discriminatively trained hidden alignment conditional random field achieves only 6%. Using a normaliser as a pre-processing step does not significantly affect the performance of the sentiment classifier.
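A character error rate such as the 0.76% reported above is commonly defined as the edit distance between the system output and the reference, divided by the reference length. The sketch below assumes that definition, since the abstract does not state the exact formula used; the function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalised by the reference length (assumed definition)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


rate = character_error_rate("kitten", "sitting")
```

For the tokenisation module, the reference would be the hand-tokenised text and the hypothesis the tokeniser's output, both viewed as character strings including the token separators.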
|