• About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
421

兩階段特徵選取法在蛋白質質譜儀資料之應用 / A Two-Stage Approach of Feature Selection on Proteomic Spectra Data

王健源, Wang, Chien-yuan Unknown Date
Early detection and diagnosis can effectively reduce cancer mortality. The discovery of biomarkers for the early detection and diagnosis of cancer is thus an important task. In this study, a real proteomic spectra data set of prostate cancer patients and normal patients was analyzed. The data were collected from a Surface-Enhanced Laser Desorption/Ionization Time-Of-Flight Mass Spectrometry (SELDI-TOF MS) protein chip experiment. SELDI-TOF MS technology captures the protein features in a biological sample. Without suitable pre-processing steps to remove experimental noise, a mass spectrum could consist of hundreds or even thousands of peaks. To narrow down the search for possible protein biomarkers, only those features that can distinguish between cancer and normal patients are selected. A genetic algorithm (GA) is a global optimization procedure that uses an analogy of the genetic evolution of biological organisms. GAs have been shown to be effective in searching complex high-dimensional spaces. In this study, we consider a GA-like algorithm (GAL) for feature selection on proteomic spectra data to distinguish prostate cancer patients from normal patients. In addition, we propose two types of two-stage GAL algorithm (TSGAL) to address the shortcomings of the GAL.
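To make the idea of GA-driven feature selection concrete, here is a minimal sketch of a generic genetic algorithm over binary feature masks. It is not the thesis's GAL or TSGAL: the function name `ga_select`, the per-feature `scores`, the size `penalty`, and all hyperparameters are assumptions chosen purely for illustration, and the toy fitness stands in for a real classifier-based criterion.

```python
import random


def ga_select(scores, penalty=0.5, pop_size=20, generations=30, p_mut=0.05, seed=0):
    """Generic GA over 0/1 feature masks (illustrative; not the thesis's GAL)."""
    rng = random.Random(seed)
    n = len(scores)

    def fitness(mask):
        # Toy fitness: reward each selected feature's score, charge a fixed cost.
        return sum(s - penalty for s, bit in zip(scores, mask) if bit)

    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        parents = pop[: pop_size // 2]       # truncation selection
        children = [pop[0][:]]               # elitism: keep the best mask unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)        # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (rng.random() < p_mut) for bit in child]  # bit-flip mutation
            children.append(child)
        pop = children
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history


# Hypothetical per-feature separation scores for eight spectral peaks.
scores = [0.1, 2.0, 0.3, 1.5, 0.2, 0.05, 1.1, 0.4]
best, history = ga_select(scores)
print(best, history[-1])
```

Because the best mask is copied unchanged into each new generation, the best fitness recorded in `history` can never decrease; in practice the fitness would be a cross-validated classification score rather than this additive toy.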
422

Topological and domain Knowledge-based subgraph mining : application on protein 3D-structures

Dhifli, Wajdi 11 December 2013
This thesis lies at the intersection of two proliferating research fields, namely data mining and bioinformatics. With the emergence of graph data in the last few years, many efforts have been devoted to mining frequent subgraphs from graph databases. Yet the number of discovered frequent subgraphs is usually exponential, mainly because of the combinatorial nature of graphs. Many frequent subgraphs are irrelevant because they are redundant or simply useless to the user. Moreover, their sheer number may hinder further exploration or even make it infeasible. Redundancy in frequent subgraphs is mainly caused by structural and/or semantic similarities, since most discovered subgraphs differ only slightly in structure and may carry similar or even identical meanings. In this thesis, we propose two approaches for selecting representative subgraphs among the frequent ones in order to remove redundancy. Each of the proposed approaches addresses a specific type of redundancy. The first approach focuses on semantic redundancy, where similarity between subgraphs is measured based on the similarity between their nodes' labels, using prior domain knowledge. The second approach focuses on structural redundancy, where subgraphs are represented by a set of user-defined topological descriptors, and similarity between subgraphs is measured based on the distance between their corresponding topological descriptions. The main application data of this thesis are protein 3D-structures. This choice is based on biological and computational reasons. From a biological perspective, proteins play crucial roles in almost every biological process. They are responsible for a variety of physiological functions. From a computational perspective, we are interested in mining complex data. Proteins are a perfect example of such data, as they are made of complex structures composed of interconnected amino acids, which are themselves composed of interconnected atoms.
Large numbers of protein structures are currently available in online databases, in computer-analyzable formats. Protein 3D-structures can be transformed into graphs where amino acids are the graph nodes and their connections are the graph edges. This enables the use of graph mining techniques to study them. The biological importance of proteins, their complexity, and their availability in computer-analyzable formats make them well-suited application data for this thesis.
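The transformation of a protein structure into a graph can be sketched as follows. This is a minimal illustration under stated assumptions: the abstract does not say how "connections" between amino acids are defined, so the distance cutoff, the function name `contact_graph`, and the toy coordinates are all hypothetical.

```python
from math import dist


def contact_graph(residues, cutoff=7.0):
    """Build (nodes, edges) from a list of (residue_name, (x, y, z)) tuples.

    Assumption: two residues are connected when their representative atoms
    lie within `cutoff` angstroms of each other.
    """
    nodes = {i: name for i, (name, _) in enumerate(residues)}
    edges = {
        (i, j)
        for i in range(len(residues))
        for j in range(i + 1, len(residues))
        if dist(residues[i][1], residues[j][1]) <= cutoff
    }
    return nodes, edges


# Toy chain of four residues; the last one is far from the others.
residues = [
    ("ALA", (0.0, 0.0, 0.0)),
    ("GLY", (3.8, 0.0, 0.0)),
    ("SER", (7.6, 0.0, 0.0)),
    ("LYS", (30.0, 0.0, 0.0)),
]
nodes, edges = contact_graph(residues)
print(nodes, sorted(edges))
```

The node labels (residue names) are what a semantic similarity measure would compare, while the edge set is what topological descriptors would be computed from.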
423

Uncovering Structure in High-Dimensions: Networks and Multi-task Learning Problems

Kolar, Mladen 01 July 2013
Extracting knowledge and providing insights into the complex mechanisms underlying noisy high-dimensional data sets is of utmost importance in many scientific domains. Statistical modeling has become ubiquitous in the analysis of high-dimensional functional data in search of a better understanding of cognition mechanisms, in the exploration of large-scale gene regulatory networks in the hope of developing drugs for lethal diseases, and in the prediction of stock market volatility in the hope of beating the market. Statistical analysis of these high-dimensional data sets is possible only if the estimation procedure exploits hidden structures underlying the data. This thesis develops flexible estimation procedures with provable theoretical guarantees for uncovering unknown hidden structures underlying the data-generating process. Of particular interest are procedures that can be used on high-dimensional data sets where the number of samples n is much smaller than the ambient dimension p. Learning in high dimensions is difficult due to the curse of dimensionality; however, the special problem structure makes inference possible. Due to its importance for scientific discovery, we put emphasis on consistent structure recovery throughout the thesis. Particular focus is given to two important problems: semi-parametric estimation of networks and feature selection in multi-task learning.
424

Mapping of eelgrass (Zostera marina) at Sidney Spit, Gulf Islands National Park Reserve of Canada, using high spatial resolution remote imagery

O'Neill, Jennifer D. 01 February 2011
The main goal of this thesis was to evaluate the use of high spatial resolution remote imagery to map the location and biophysical parameters of eelgrass in marine areas around Sidney Spit, a part of the Gulf Islands National Park Reserve of Canada (GINPRC). To meet this goal, three objectives were addressed: (1) define key spectral variables that provide optimum separation between eelgrass and its associated benthic substrates, using in situ hyperspectral measurements, and simulated IKONOS and Landsat 7ETM+ spectral response; (2) evaluate the efficacy of these key variables in classification of the high spatial resolution imagery, AISA and IKONOS, at various levels of processing, to determine the processing methodology that offers the highest eelgrass mapping accuracy; and (3) evaluate the potential of "value-added" classification of two eelgrass biophysical indicators, leaf area index (LAI) and epiphyte type. In situ hyperspectral measurements acquired at Sidney Spit in August 2008 provided four different data sets: above water spectra, below water spectral profiles, water-corrected spectra, and pure endmember spectra. In Chapter 3, these data sets were examined with first derivative analysis to determine the unique spectral variables of eelgrass and associated benthic substrates. The most effective variables in discriminating eelgrass from all other substrates were selected using data reduction statistics: M-statistic analysis and multiple discriminant analyses (MDA). These selected spectral variables enabled eelgrass classification accuracy of 98% when separating six classes on above water data: shallow (< 3 m deep) eelgrass, deep (> 3 m) eelgrass, shallow sand, deep sand, shallow green algae, and spectrally deep water. The variables were located mainly in the green wavelengths, where light penetrates to the greatest depth: the slope from 500–530 nm, and the first derivatives at 566 nm, 580 nm, and 602 nm. 
The same data were classified with 96% accuracy after correcting for the water column, using the ratios 566:600 and 566:710. The only source of confusion for all data sets was between green algae and eelgrass, presumably due to their similar pigment composition. IKONOS and Landsat 7ETM+ simulated datasets performed similarly well, with 92% and 94% eelgrass classification accuracy respectively. In Chapter 4, the efficacy of the selected features was tested in the classification of airborne hyperspectral AISA imagery and satellite platform multispectral IKONOS imagery, and compared with various other classifiers, both supervised and unsupervised: K-means, minimum distance (MD), linear spectral unmixing (LSU), and spectral angle mapper (SAM). The selected features achieved the highest eelgrass classification accuracies in the study, when combined with atmospheric correction, glint correction, and optically deep water masking. AISA achieved eelgrass producer and user accuracies of 85% in water shallower than 3 m, and 93% in deeper areas. IKONOS achieved 79% for shallow waters and 82% for deep waters. Endmember classification also showed accuracies over 84% and 71% in AISA and IKONOS imagery respectively. Again, the largest source of confusion was between eelgrass and green algae, as well as between exposed vegetation (sea asparagus and green algae) and exposed eelgrass. Incompatibilities of the automatable processing steps (Tafkaa, LSU and SAM) made automated mapping less accurate than supervised mapping, but suggestions are made toward improvement. The value-added classification of eelgrass LAI and epiphyte type produced poor results in all cases except one; epiphyte presence / absence could be delineated with 87% accuracy. Before applying the findings of this study, one must consider the spatial scale of the intended management goal and select imagery with suitable spatial resolution. 
Tidal variations, as well as seasonal variability in water conditions and eelgrass phenology must also be considered as they may affect classification accuracies.
425

Learning algorithms for sparse classification

Sanchez Merchante, Luis Francisco 07 June 2013
This thesis deals with the development of estimation algorithms with embedded feature selection in the context of high-dimensional data, in the supervised and unsupervised frameworks. The contributions of this work are two algorithms: GLOSS for the supervised domain and Mix-GLOSS for its unsupervised counterpart. Both algorithms are based on solving an optimal scoring regression regularized with a quadratic formulation of the group-Lasso penalty, which encourages the removal of uninformative features. The theoretical foundations proving that a group-Lasso penalized optimal scoring regression can be used to solve a linear discriminant analysis were first developed in this work. The theory that adapts this technique to the unsupervised domain by means of the EM algorithm is not new, but it had never been clearly presented for a sparsity-inducing penalty. This thesis demonstrates that using group-Lasso penalized optimal scoring regression inside an EM algorithm is possible. Our algorithms have been tested on real and artificial high-dimensional databases, with impressive results in terms of parsimony and without compromising prediction performance.
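To illustrate how a group-Lasso penalty removes whole features, the sketch below shows the standard block soft-thresholding step associated with that penalty. This is a generic textbook operation, not the GLOSS or Mix-GLOSS algorithm (which, per the abstract, relies on a quadratic formulation of the penalty); the function name and the toy values are assumptions.

```python
from math import sqrt


def group_soft_threshold(group, lam):
    """Shrink a group of coefficients toward zero by `lam` in Euclidean norm.

    If the group's norm is at most `lam`, the whole group is set to zero,
    which is how an entire feature drops out of the model.
    """
    norm = sqrt(sum(v * v for v in group))
    if norm <= lam:
        return [0.0] * len(group)
    scale = 1.0 - lam / norm
    return [scale * v for v in group]


strong = group_soft_threshold([3.0, 4.0], lam=1.0)   # norm 5 -> shrunk, kept
weak = group_soft_threshold([0.3, 0.4], lam=1.0)     # norm 0.5 -> removed
print(strong, weak)
```

Here each group would hold the coefficients that one feature contributes across all discriminant directions, so zeroing a group discards that feature entirely.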
426

Ταξινόμηση καρκινικών όγκων εγκεφάλου με χρήση μεθόδων μηχανικής μάθησης / Classification of cancerous brain tumors using machine learning methods

Κανάς, Βασίλειος 29 August 2011
The aim of this diploma thesis is to investigate machine learning methods for classifying different types of cancerous brain tumors using magnetic resonance imaging (MRI) data. Diagnosing the type of cancer is important for appropriate treatment planning. In general, tumor classification consists of several steps: definition of the regions of interest (ROIs), feature extraction, feature selection, and classification. This work focuses on the last two steps in order to give a general overview of how each method affects the classification of the different tumors. The extracted features include intensity and contour features from conventional MRI techniques (T2, contrast-enhanced T1, FLAIR, T1) as well as non-conventional techniques (perfusion MRI). For feature selection, various filtering methods were used, such as CFSsubset, wrapper, and consistency, in combination with search methods such as scatter, best first, and greedy stepwise, with the help of the Waikato Environment for Knowledge Analysis (WEKA) package. The methods were applied to 101 patients with cancerous brain tumors diagnosed as metastasis (24), meningioma (4), grade 2 glioma (22), grade 3 glioma (17), or grade 4 glioma (34), and were validated with the leave-one-out (LOO) strategy. / The objective of this study is to investigate the use of pattern classification methods for distinguishing different types of brain tumors, such as primary gliomas from metastases, and also for grading gliomas. A computer-assisted classification method combining conventional magnetic resonance imaging (MRI) and perfusion MRI is developed and used for differential diagnosis.
The characterization and accurate determination of brain tumor grade and type are very important because they determine the patient's treatment planning. The proposed scheme consists of several steps, including ROI definition, feature extraction, feature selection, and classification. The extracted features include tumor shape and intensity characteristics. Feature subset selection is performed using two filtering methods, a correlation-based feature selection method and a consistency method, and a wrapper approach, in combination with three different search algorithms (best first, greedy stepwise, and scatter). These methods are implemented with the WEKA software [20]. The highest binary classification accuracy assessed by leave-one-out (LOO) cross-validation on 102 brain tumors is 94.1% for discrimination of metastases from gliomas, and 91.3% for discrimination of high-grade from low-grade neoplasms. Multi-class classification is also performed and achieves 76.29% accuracy.
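The leave-one-out (LOO) validation mentioned in this record can be sketched as follows. This is only an illustration of the validation loop: the nearest-centroid classifier, the function names, and the toy two-feature data are assumptions, not the WEKA classifiers or the MRI features used in the thesis.

```python
from statistics import mean


def nearest_centroid_predict(train_x, train_y, x):
    """Predict the label whose class centroid is closest to x (stand-in classifier)."""
    centroids = {}
    for label in set(train_y):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroids[label] = [mean(col) for col in zip(*rows)]

    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return min(centroids, key=lambda lab: sqdist(centroids[lab], x))


def leave_one_out_accuracy(xs, ys):
    """Hold out each sample once, train on the rest, and average the hits."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        hits += nearest_centroid_predict(train_x, train_y, xs[i]) == ys[i]
    return hits / len(xs)


# Toy, well-separated two-class data (hypothetical feature values).
xs = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [2.0, 2.1], [2.2, 1.9], [1.9, 2.3]]
ys = ["a", "a", "a", "b", "b", "b"]
acc = leave_one_out_accuracy(xs, ys)
print(acc)
```

With N samples the classifier is trained N times, which is affordable for a cohort of roughly one hundred cases and uses almost all of the data for training in every fold.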
427

Αναγνώριση συναισθημάτων από ομιλία με χρήση τεχνικών ψηφιακής επεξεργασίας σήματος και μηχανικής μάθησης / Emotion recognition from speech using digital signal processing and machine learning techniques

Κωστούλας, Θεόδωρος 28 February 2013
This doctoral dissertation addresses problems in the field of speech technology, with the aim of recognizing emotions from speech using digital signal processing and machine learning techniques. More specifically, novel methods were proposed and studied in a series of applications that make use of a system for recognizing emotional states from speech. The main goal of these methods was to address the challenges that arise when an emotion recognition system must operate in real-world conditions, with spontaneous reactions, independently of the speaker. In particular, the behavior of an emotion recognition system was evaluated on acted speech under different noise conditions, and its performance was compared with the subjective evaluation of human listeners. In addition, the design and implementation of an emotional speech database is described, as it results from the interaction of non-expert users with a dialogue system, and a system is proposed that detects negative emotional states in the speaker-independent setting using a model based on Gaussian distributions. The proposed architecture combines low-level and high-level speech parameters and is applied to the real-world data. Furthermore, the practical application of an emotion recognition system based on a universal model of Gaussian distributions was implemented and evaluated on different types of real-life data. Finally, a novel classification architecture is presented for recognizing co-occurring emotions in speech originating from interaction in real environments. In contrast to known approaches, the proposed architecture models the co-occurring emotional states by constructing a multistage classification architecture.
The experimental results indicate that the proposed architecture is advantageous for the emotional states that are more separable, which leads to an improvement in the overall performance of the system. / In this doctoral dissertation, a number of novel approaches were proposed and evaluated in different applications that utilize emotion awareness. The main aim of the proposed methods was to address the difficulties that arise when an emotion recognition system must operate in real-life conditions, where human speech is characterized by spontaneous and genuine formulations. In detail, the performance of an emotion recognition system was first evaluated on acted speech under different noise conditions, and this performance was compared to that of human listeners. Further, the design and implementation of a real-world emotional speech corpus is described, collected from the interaction of naive users with a smart home dialogue system. Moreover, a system that utilizes low- and high-level descriptors was suggested. The suggested architecture leads to significantly better performance at some operating points of the system integrated into the dialogue system. Furthermore, we propose a novel multistage classification scheme for affect recognition from real-life speech. In contrast with conventional approaches to affect/emotion recognition from speech, the proposed scheme models co-occurring affective states by constructing a multistage classification scheme. The experiments performed indicate that the proposed classification scheme offers an advantage for those classes that are more separable, which contributes to improving the overall performance of the affect recognition system.
428

Multi-label Classification with Multiple Label Correlation Orders And Structures

Posinasetty, Anusha January 2016
Multilabel classification has attracted much interest in recent times due to the wide applicability of the problem and the challenges involved in learning a classifier for multilabeled data. A crucial aspect of multilabel classification is to discover the structure and order of correlations among labels and their effect on the quality of the classifier. In this work, we propose a structural Support Vector Machine (structural SVM) based framework which enables us to systematically investigate the importance of label correlations in multilabel classification. The proposed framework is very flexible: it provides a unified approach to handling multiple correlation orders and structures in an adaptive manner and helps to effectively assess the importance of label correlations in improving generalization performance. We perform extensive empirical evaluation on several datasets from different domains and present results on various performance metrics. Our experiments provide, for the first time, interesting insights into the following questions: a) Are label correlations always beneficial in multilabel classification? b) What effect do label correlations have on multiple performance metrics typically used in multilabel classification? c) Is label correlation order significant and, if so, what would be the favorable correlation order for a given dataset and a given performance metric? and d) Can we make useful suggestions on the label correlation structure?
429

Feature Selection under Multicollinearity & Causal Inference on Time Series

Bhattacharya, Indranil January 2017
In this work, we study and extend algorithms for Sparse Regression and Causal Inference problems. Both problems are fundamental in the area of Data Science. The goal of the regression problem is to find the "best" relationship between an output variable and input variables, given samples of the input and output values. We consider sparse regression under a high-dimensional linear model with strongly correlated variables, a situation which cannot be handled well by many existing model selection algorithms. We study the performance of popular feature selection algorithms such as LASSO, Elastic Net, BoLasso, and Clustered Lasso, as well as Projected Gradient Descent algorithms, under this setting in terms of their running time, stability, and consistency in recovering the true support. We also propose a new feature selection algorithm, BoPGD, which first clusters the features based on their sample correlation and then performs sparse estimation using a bootstrapped variant of the projected gradient descent method with projection onto the non-convex L0 ball. We attempt to characterize the efficiency and consistency of our algorithm by performing a host of experiments on both synthetic and real-world datasets. Discovering causal relationships, beyond mere correlation, is widely recognized as a fundamental problem. Causal Inference problems use observations to infer the underlying causal structure of the data-generating process. The input to these problems is either a multivariate time series or i.i.d. sequences, and the output is a Feature Causal Graph where the nodes correspond to the variables and the edges capture the direction of causality. For high-dimensional datasets, determining the causal relationships becomes a challenging task because of the curse of dimensionality. Graphical modeling of temporal data based on the concept of "Granger Causality" has gained much attention in this context.
The blend of Granger methods with model selection techniques, such as LASSO, enables efficient discovery of a "sparse" subset of causal variables in high-dimensional settings. However, these temporal causal methods use an input parameter, L, the maximum time lag. This parameter is the maximum gap in time between the occurrence of the output phenomenon and the causal input stimulus. In many situations of interest, the maximum time lag is not known, and indeed, finding the range of causal effects is an important problem. In this work, we propose and evaluate a data-driven and computationally efficient method for Granger causality inference in the Vector Auto Regressive (VAR) model without foreknowledge of the maximum time lag. We present two algorithms, Lasso Granger++ and Group Lasso Granger++, which not only construct the hypothesis feature causal graph but also simultaneously estimate a value of maxlag (L) for each variable by balancing the trade-off between "goodness of fit" and "model complexity".
430

[en] SEMANTIC ROLE-LABELING FOR PORTUGUESE / [pt] ANOTADOR DE PAPEIS SEMÂNTICOS PARA PORTUGUÊS

ARTHUR BELTRAO CASTILHO NETO 23 June 2017
[en] Semantic role labeling (SRL) is an important natural language processing (NLP) task that establishes meaningful relationships between the events described in a sentence and their participants. It can therefore improve the performance of many NLP systems, such as machine translation, spelling correction, information extraction and retrieval, and question answering, because it reduces ambiguity in the input text. The vast majority of SRL systems reported so far employ supervised learning techniques and, for better results, use large, manually reviewed corpora. The Brazilian lexical resource with semantic role annotations (PropBank.Br) is much smaller. Hence, in recent years, attempts have been made to improve performance using semi-supervised and unsupervised learning. Although those studies made several direct and indirect contributions to NLP, they were not able to outperform purely supervised systems. This work presents an approach to the SRL task for the Portuguese language. We use supervised learning over a set of 114 categorical features and apply a combination of two domain regularization methods to reduce the number of binary features by 96 percent. The resulting model uses a support vector machine with the L2-loss dual support vector classification solver and is tested on the PropBank.Br dataset, achieving results slightly better than the state of the art. We evaluate the system empirically with the official CoNLL 2005 Shared Task script, obtaining 82.17 percent precision, 82.88 percent recall, and 82.52 percent F1, whereas the previous state-of-the-art Portuguese SRL system scores 83.0 percent precision, 81.7 percent recall, and 82.3 percent F1.
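As a quick check on the figures quoted in this record, the F1 value in each triple is the harmonic mean of the other two numbers (precision and recall). The short sketch below recomputes both F1 values from the reported percentages; the function name is an assumption made for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, in the same units as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


proposed = f1_score(82.17, 82.88)   # system described in this record
previous = f1_score(83.0, 81.7)     # previous state of the art
print(round(proposed, 2), round(previous, 1))  # 82.52 82.3
```

The new system trades a little precision for higher recall, and because the harmonic mean rewards balance between the two, its F1 ends up marginally higher.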
