51 |
Water Contamination Detection With Binary Classification Using Artificial Neural Networks. Lundholm, Christoffer; von Butovitsch, Nicholas. January 2022 (has links)
Water contamination is a major source of disease around the world, so reliable monitoring to detect harmful contamination in water distribution networks is a vital necessity that requires considerable effort and attention. To measure potential contamination, a new sensor called an 'electric tongue' was developed at Linköping University for the purpose of measuring various features of the water reliably. This project developed a supervised machine learning algorithm that uses an artificial neural network to detect anomalies in the system. The algorithm can detect anomalies with an accuracy of around 99.98% on the available data. This was achieved through a binary classifier, which reconstructs a vector and compares it to the expected outcome. Despite the limitations of the problem and of the system's capabilities, binary classification is a potential solution to this problem. / Bachelor's thesis in electrical engineering 2022, KTH, Stockholm
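A minimal sketch of the kind of binary anomaly classifier described above, assuming scikit-learn; the synthetic data merely stands in for the 'electric tongue' feature vectors, and the network size, features, and labels are illustrative assumptions, not the thesis's implementation:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical stand-in for 'electric tongue' readings: each row is one
# measurement vector, label 1 = contaminated (anomalous), 0 = clean.
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(5000, 16))
contaminated = rng.normal(0.8, 1.5, size=(50, 16))
X = np.vstack([clean, contaminated])
y = np.array([0] * len(clean) + [1] * len(contaminated))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

scaler = StandardScaler().fit(X_train)
clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
clf.fit(scaler.transform(X_train), y_train)

pred = clf.predict(scaler.transform(X_test))
print(f"accuracy: {accuracy_score(y_test, pred):.4f}")
```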
|
52 |
Classifying Portable Electronic Devices using Device Specifications : A Comparison of Machine Learning Techniques. Westerholm, Ludvig. January 2024 (has links)
In this project, we explored the use of machine learning to classify portable electronic devices. The primary objective was to identify devices such as laptops, smartphones, and tablets based on their physical and technical specifications. These specifications, sourced from the Pricerunner price comparison website, include height, Wi-Fi standard, and screen resolution. We aggregated this information into a dataset and split it into a training set and a testing set. To classify the devices, we trained four popular machine learning models: Random Forest (RF), Logistic Regression (LR), k-Nearest Neighbor (kNN), and a Fully Connected Network (FCN), and compared their performance using precision, recall, F1-score, accuracy, and training time. The RF model achieved the highest overall accuracy, 95.4%, on the original dataset. The FCN, applied to a dataset processed with standardization followed by Principal Component Analysis (PCA), reached an accuracy of 92.7%, the best on that preprocessed data. LR excelled in a few class-specific metrics, while kNN performed notably well relative to its training time. In conclusion, RF was the best-performing model on the original dataset; the FCN showed the strongest results on the standardized, PCA-processed dataset; and kNN, with its high macro precision and significantly faster training than the FCN, was a strong contender on the PCA-processed dataset.
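The comparison described above can be sketched as follows, assuming scikit-learn; the synthetic dataset stands in for the Pricerunner specifications, and the model settings are illustrative assumptions rather than the thesis's configuration. The kNN and FCN pipelines mirror the standardization-followed-by-PCA preprocessing mentioned in the abstract:

```python
import time
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the device dataset: 3 classes (laptop/smartphone/tablet).
X, y = make_classification(n_samples=3000, n_features=20, n_informative=12,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "RF":  RandomForestClassifier(random_state=0),
    "LR":  Pipeline([("sc", StandardScaler()),
                     ("m", LogisticRegression(max_iter=1000))]),
    "kNN": Pipeline([("sc", StandardScaler()), ("pca", PCA(n_components=0.95)),
                     ("m", KNeighborsClassifier())]),
    "FCN": Pipeline([("sc", StandardScaler()), ("pca", PCA(n_components=0.95)),
                     ("m", MLPClassifier(hidden_layer_sizes=(64, 32),
                                         max_iter=500, random_state=0))]),
}
for name, model in models.items():
    t0 = time.perf_counter()
    model.fit(X_tr, y_tr)                 # measure training time per model
    elapsed = time.perf_counter() - t0
    print(f"{name}: accuracy={model.score(X_te, y_te):.3f}, "
          f"training time={elapsed:.2f}s")
```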
|
53 |
[en] COREFERENCE RESOLUTION FOR THE ENGLISH LANGUAGE / [pt] RESOLUÇÃO DE CO-REFERÊNCIA PARA A LÍNGUA INGLESA. ADRIEL GARCIA HERNANDEZ, 28 July 2017 (has links)
[en] One of the problems found in natural language processing systems is the difficulty of identifying textual elements that refer to the same entity; this task is called coreference. Solving this problem is an integral part of discourse comprehension, since it allows language users to connect the pieces of speech information concerning the same entity. Consequently, coreference resolution is a key task in natural language processing. Despite the large efforts of existing research, the performance of current coreference resolution systems has not yet reached a satisfactory level. In this work, we describe a structured learning system for unrestricted coreference resolution that explores two techniques: latent coreference trees and automatic entropy-guided feature induction. The latent tree modeling makes the learning problem computationally feasible, since it incorporates a relevant hidden structure. Additionally, using an automatic feature induction method, we can efficiently build enhanced non-linear models using linear model learning algorithms, namely, the structured and sparse perceptron algorithm. We evaluate the system on the English portion of the CoNLL-2012 Shared Task closed track data set. The proposed system obtains 62.24 per cent on the competition's official score, below the state-of-the-art performance for this task, 65.73 per cent. Nevertheless, our solution significantly reduces the time to obtain the clusters of a document: our system takes 0.35 seconds per document on the test set, while the state-of-the-art system takes 5 seconds per document.
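The latent coreference tree can be pictured as a 'best-left-link' structure: each mention links to its highest-scoring left antecedent (or a dummy root). The toy sketch below shows that decoder with a plain structured perceptron update; the hashed features and directly supplied gold links are simplifying assumptions, and the thesis's latent training and entropy-guided feature induction are not reproduced here:

```python
import numpy as np

def features(doc, antecedent, mention, dim=64):
    # Hashed pairwise features for linking `mention` to `antecedent`;
    # a hypothetical stand-in for rich lexical/syntactic features.
    vec = np.zeros(dim)
    head = doc[antecedent] if antecedent >= 0 else "ROOT"
    for f in (f"dist={mention - antecedent}", f"pair={head}_{doc[mention]}"):
        vec[hash(f) % dim] += 1.0
    return vec

def decode(doc, w):
    # For each mention, pick the highest-scoring left antecedent (or ROOT=-1).
    links = []
    for m in range(len(doc)):
        cands = [-1] + list(range(m))
        links.append(max(cands, key=lambda a: w @ features(doc, a, m)))
    return links

def perceptron_epoch(docs, gold_links, w, lr=1.0):
    for doc, gold in zip(docs, gold_links):
        pred = decode(doc, w)
        for m, (g, p) in enumerate(zip(gold, pred)):
            if g != p:  # structured perceptron update on wrong links
                w += lr * (features(doc, g, m) - features(doc, p, m))
    return w

# Toy usage: two "documents" of mention head words with gold antecedent links.
docs = [["Mary", "she", "John", "he"], ["IBM", "it"]]
gold = [[-1, 0, -1, 2], [-1, 0]]
w = np.zeros(64)
for _ in range(5):
    w = perceptron_epoch(docs, gold, w)
print(decode(docs[0], w))
```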
|
54 |
APLICAÇÃO DE TÉCNICAS DE APRENDIZADO DE MÁQUINA PARA CLASSIFICAÇÃO DE DEPÓSITOS MINERAIS BASEADA EM MODELO TEOR-TONELAGEM / APPLICATION OF MACHINE LEARNING TECHNIQUES FOR CLASSIFICATION OF MINERAL DEPOSITS BASED ON GRADE-TONNAGE MODELS. Rocha, Jocielma Jerusa Leal, 01 July 2010
Classification of mineral deposits into types is traditionally done by experts. Since there are reasons to believe that computational techniques can aid this classification process and make it less subjective, investigating different clustering and classification methods for this domain is worthwhile. Research in this domain has moved toward using information available in large public databases and applying supervised machine learning techniques. This work uses information on mineral deposits available in grade-tonnage models published in the literature to investigate the suitability of three techniques: Decision Tree, Multilayer Perceptron Network, and Probabilistic Neural Network. Altogether, 1,861 mineral deposits of 18 types, as identified by the grade-tonnage models, are used. Initially, each of the three techniques is used to classify the deposits into the 18 types. Analysis of these results suggested that some deposit types could be treated as groups and that the classification could be divided into two levels: a first level that classifies deposits into groups, and a second level that assigns deposits previously placed in a group to one of the specific types belonging to that group. A series of experiments was carried out to build a two-level model from the combination of the techniques used, which resulted in an average accuracy of 85%. Errors were concentrated, within groups, in deposit types that are less represented in the database. This represents a promising way to improve the mineral deposit classification process without increasing the number of deposits used or the number of deposit characteristics.
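A two-level classification scheme of the kind described can be sketched as follows, assuming scikit-learn; the data, the 6-type/2-group folding, and the choice of Random Forest for every level are illustrative assumptions standing in for the thesis's combination of techniques:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical grouping: 6 fine-grained deposit types folded into 2 groups.
GROUP_OF_TYPE = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

X, y_type = make_classification(n_samples=2000, n_features=12, n_informative=8,
                                n_classes=6, random_state=0)
y_group = np.array([GROUP_OF_TYPE[t] for t in y_type])
X_tr, X_te, yt_tr, yt_te, yg_tr, yg_te = train_test_split(
    X, y_type, y_group, stratify=y_type, random_state=0)

# Level 1: classify deposits into groups.
level1 = RandomForestClassifier(random_state=0).fit(X_tr, yg_tr)

# Level 2: one classifier per group, trained only on that group's deposits.
level2 = {}
for g in set(GROUP_OF_TYPE.values()):
    mask = yg_tr == g
    level2[g] = RandomForestClassifier(random_state=0).fit(X_tr[mask], yt_tr[mask])

# Prediction: route each sample through its predicted group's classifier.
pred_group = level1.predict(X_te)
pred_type = np.array([level2[g].predict(x.reshape(1, -1))[0]
                      for g, x in zip(pred_group, X_te)])
print("two-level accuracy:", (pred_type == yt_te).mean())
```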
|
55 |
Detekce Útoků v Síťovém Provozu / Intrusion Detection in Network Traffic. Homoliak, Ivan. Unknown Date (has links)
This thesis addresses anomaly-based detection of network attacks using machine learning techniques. First, state-of-the-art data collections intended for evaluating intrusion detection systems are presented, together with prior work that applies statistical analysis and machine learning techniques to the detection of network attacks. The next part of the thesis presents the design of our own collection of metrics, called Advanced Security Network Metrics (ASNM), which is part of a conceptual automated system for intrusion detection (AIPS). Further, two different approaches to obfuscation are proposed and discussed, tunneling and modification of network characteristics, both used to alter how attacks are executed. Experiments show that the obfuscations employed are able to evade attack detection by a classifier based on the ASNM features. On the other hand, including these obfuscations in the classifier's training process can improve its detection capabilities. The thesis also presents an alternative view of the obfuscation techniques that modify network characteristics and demonstrates their use as an approximation of a network normalizer based on suitable training data.
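The effect of including obfuscated attacks in training can be illustrated with the following hedged sketch; the synthetic features, the crude obfuscate() perturbation, and the Random Forest classifier are stand-ins for the ASNM features and the classifiers used in the thesis:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# Hypothetical stand-in for ASNM-style connection features. Label 1 = attack.
rng = np.random.default_rng(1)
benign = rng.normal(0.0, 1.0, size=(4000, 20))
attacks = rng.normal(2.0, 1.0, size=(400, 20))

def obfuscate(x, rng):
    # Crude stand-in for traffic obfuscation: perturb timing/size-like
    # features so attacks drift toward the benign distribution.
    return x - rng.uniform(0.5, 1.8, size=x.shape)

obf_attacks = obfuscate(attacks, rng)

X = np.vstack([benign, attacks])
y = np.array([0] * 4000 + [1] * 400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Baseline: trained only on non-obfuscated traffic.
base = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("recall on obfuscated attacks (baseline):",
      recall_score(np.ones(len(obf_attacks)), base.predict(obf_attacks)))

# Augmented: obfuscated attacks included in training, as the thesis suggests.
X_aug = np.vstack([X_tr, obfuscate(attacks[:200], rng)])
y_aug = np.concatenate([y_tr, np.ones(200, dtype=int)])
aug = RandomForestClassifier(random_state=0).fit(X_aug, y_aug)
print("recall on obfuscated attacks (augmented):",
      recall_score(np.ones(len(obf_attacks)), aug.predict(obf_attacks)))
```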
|
56 |
Dynamic prediction of repair costs in heavy-duty trucks. Saigiridharan, Lakshidaa. January 2020 (has links)
Pricing of repair and maintenance (R&M) contracts is one of the most important processes carried out at Scania. Repair cost predictions at Scania are currently made with experience-based methods, which compute average repair costs for contracts terminated in the recent past without any statistical modelling. This method is difficult to apply to a reference population of rigid Scania trucks. Hence, the purpose of this study is to perform suitable statistical modelling to predict repair costs for four variants of rigid Scania trucks. The study gathers repair data from multiple sources and performs feature selection using the Akaike Information Criterion (AIC) to extract the most significant features influencing repair costs for each truck variant. The study showed that including operational features as a factor could further influence the pricing of contracts. The hurdle Gamma model, which is widely used to handle zero inflation in Generalized Linear Models (GLMs), is used to model the data, which consists of numerous zero and non-zero values. Due to the inherent hierarchical structure within the data, expressed by individual chassis, a hierarchical hurdle Gamma model is also implemented. Both statistical models are found to perform much better than the experience-based prediction method, as evaluated with the mean absolute error (MAE) and root mean square error (RMSE) statistics. A final model comparison is conducted using the AIC to draw conclusions based on the goodness of fit and predictive performance of the two statistical models; on this assessment, the hierarchical hurdle Gamma model was found to produce the best predictions.
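A plain (non-hierarchical) hurdle Gamma model can be assembled from two parts: a classifier for whether any cost occurs at all, and a Gamma regression with log link fitted on the positive costs only. The sketch below assumes scikit-learn's LogisticRegression and GammaRegressor and synthetic data; the thesis's hierarchical variant and actual covariates are not reproduced:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, GammaRegressor

# Toy repair-cost data: many contracts incur zero cost; positive costs are
# right-skewed. Features could be truck age, mileage, etc. (hypothetical).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(5000, 3))
p_repair = 1 / (1 + np.exp(-(0.4 * X[:, 0] - 2.5)))   # zero vs non-zero part
has_cost = rng.random(5000) < p_repair
mean_cost = np.exp(0.2 * X[:, 1] + 5.0)               # positive part
y = np.where(has_cost, rng.gamma(2.0, mean_cost / 2.0), 0.0)

# Hurdle part 1: probability that the cost is non-zero.
zero_model = LogisticRegression().fit(X, (y > 0).astype(int))
# Hurdle part 2: Gamma GLM (log link) fitted on positive costs only.
pos = y > 0
gamma_model = GammaRegressor().fit(X[pos], y[pos])

# Expected cost = P(cost > 0) * E[cost | cost > 0].
expected = zero_model.predict_proba(X)[:, 1] * gamma_model.predict(X)
print("mean predicted vs actual:", expected.mean().round(1), y.mean().round(1))
```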
|
57 |
Developing Automated Cell Segmentation Models Intended for MERFISH Analysis of the Cardiac Tissue by Deploying Supervised Machine Learning Algorithms / Utveckling av automatiserade cellsegmenteringsmodeller avsedda för MERFISH-analys av hjärtvävnad genom användning av övervakade maskininlärningsalgoritmer. Rune, Julia. January 2023 (has links)
The following study delves into the development of automated cell segmentation models intended to identify boundaries between cells in cardiac tissue for analysing spatial transcriptomics data. Addressing the limitations of alternative techniques like single-cell RNA sequencing (ScRNA-seq) and single molecule fluorescence in situ hybridization (smFISH), the study underscores the use of multiplexed error-robust fluorescence in situ hybridization (MERFISH) deployed by the Kosuri Lab at the Salk Institute for Biological Studies. This advanced imaging-based technique allows single-cell transcriptome profiling of hundreds of different transcripts while retaining the spatial context of the tissue, and can accordingly reveal how the organization of cells within a healthy heart is altered during disease. However, the extraction of meaningful data from MERFISH poses a significant challenge: accurate cell segmentation. This thesis therefore presents the development of robust models for cell boundary identification within cardiac tissue, leveraging advanced supervised machine learning algorithms in the field named Cellpose and Omnipose. Because the tissue is dense and highly heterogeneous, stemming from a wide distribution of cell types and shapes, two separate models had to be developed: one covering the smaller cells and the cross-sectioned cardiomyocytes, and one covering the longitudinally sectioned cardiomyocytes. The cross-section model was successfully developed and achieved an accuracy of 91.2%, whereas the longitudinal model still needs further improvement before being deployed. The thesis acknowledges potential areas for future work, emphasizing the need to further improve segmentation of longitudinal cardiomyocytes, to tackle the challenges of segmenting cells within fibrotic regions of the diseased heart, and to achieve precise 3D cell segmentation. Nonetheless, the generated models have paved the way toward efficient downstream MERFISH analysis, ultimately aimed at understanding the structural and functional dynamics of heart failure at a cellular level and aiding the development of more effective therapeutic strategies.
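For reference, applying a pretrained Cellpose model looks roughly like the sketch below. This follows the Cellpose v2-era Python API, which changes between releases, so the exact names and return values should be checked against the installed version; the random image is a stand-in for a MERFISH membrane stain:

```python
# pip install cellpose  (API names follow the v2-era interface and may differ
# in other releases; verify against the installed version)
import numpy as np
from cellpose import models

model = models.Cellpose(model_type="cyto")  # pretrained whole-cell model

# Stand-in image; in practice, a MERFISH membrane or DAPI stain of the tissue.
img = np.random.rand(256, 256).astype(np.float32)

# channels=[0, 0]: grayscale image with no separate nuclear channel.
masks, flows, styles, diams = model.eval(img, diameter=None, channels=[0, 0])
print("cells found:", int(masks.max()))  # masks labels each cell 1..N
```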
|
58 |
[pt] APRENDIZADO ESTRUTURADO COM INDUÇÃO E SELEÇÃO INCREMENTAIS DE ATRIBUTOS PARA ANÁLISE DE DEPENDÊNCIA EM PORTUGUÊS / [en] STRUCTURED LEARNING WITH INCREMENTAL FEATURE INDUCTION AND SELECTION FOR PORTUGUESE DEPENDENCY PARSING. YANELY MILANES BARROSO, 09 November 2016 (has links)
[en] Natural language processing requires solving several tasks of increasing complexity, which involve learning to associate structures like graphs and sequences to a given text. For instance, dependency parsing involves learning a tree that describes the dependency-based syntactic structure of a given sentence. A widely used method to improve domain knowledge representation in this task is to consider combinations of features, called templates, which are used to encode useful information with a nonlinear pattern. The total number of all possible feature combinations for a given template grows exponentially in the number of features and can result in computational intractability. Also, from a statistical point of view, it can lead to overfitting. In this scenario, a technique is required that avoids overfitting and reduces the feature set. A very common approach to this task is based on scoring a parse tree using a linear function of a defined set of features. It is well known that sparse linear models simultaneously address the feature selection problem and the estimation of a linear model by combining a small subset of the available features. In this case, sparseness helps control overfitting and performs the selection of the most informative features, which reduces the feature set. Due to its flexibility, robustness and simplicity, the perceptron algorithm is one of the most popular linear discriminant methods used to learn such complex representations. This algorithm can be modified to produce sparse models and to handle nonlinear features. We propose the incremental learning of the combination of a sparse linear model with an induction procedure of non-linear variables in a structured prediction scenario. The sparse linear model is obtained through a modification of the perceptron algorithm. The induction method is Entropy-Guided Feature Generation. The empirical evaluation is performed using the Portuguese Dependency Parsing data set from the CoNLL 2006 Shared Task. The resulting parser attains 92.98 per cent accuracy, which is a competitive performance when compared against the state-of-the-art systems. In its regularized version, it accomplishes an accuracy of 92.83 per cent, shows a striking reduction of 96.17 per cent in the number of binary features, and reduces the learning time by almost 90 per cent when compared to its non-regularized version.
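Entropy-guided feature generation is commonly realized by training an entropy-based decision tree on the base binary features and reading feature conjunctions off its root-to-node paths, which are then fed to the (sparse) linear model. The sketch below shows that idea under simplifying assumptions; it is not the thesis's implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tree_conjunctions(tree, feature_names, max_depth=3):
    # Walk the fitted tree; every root-to-node path yields one conjunction
    # of (possibly negated) base features.
    t = tree.tree_
    conjunctions, stack = [], [(0, [])]
    while stack:
        node, path = stack.pop()
        if path:
            conjunctions.append(tuple(path))
        is_internal = t.children_left[node] != t.children_right[node]
        if is_internal and len(path) < max_depth:
            name = feature_names[t.feature[node]]
            # For 0/1 features, the left branch means the feature is off.
            stack.append((t.children_left[node], path + [f"not {name}"]))
            stack.append((t.children_right[node], path + [name]))
    return conjunctions

# Toy binary base features; the target depends on a conjunction of them.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 6))
y = (X[:, 0] & X[:, 1]) | X[:, 4]
names = [f"f{i}" for i in range(6)]

dt = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
dt.fit(X, y)
for conj in tree_conjunctions(dt, names):
    print(" AND ".join(conj))
```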
|
59 |
Analysis of Retroreflection and other Properties of Road Signs. Saleh, Roxan. January 2021 (has links)
Road traffic signs provide regulatory, warning, guidance, and other important information to road users to prevent hazards and road accidents. Therefore, traffic signs must be detectable, legible, and visible both in daytime and at night to fulfill their purpose; nighttime visibility is critical to safe driving. The state of the art gives clear evidence that retroreflectivity improves the nighttime visibility (detectability and legibility) of road traffic signs and that nighttime visibility can be improved by using an adequate level of retroreflectivity. Furthermore, nighttime visibility can be affected by human, sign, vehicle, environmental, and design factors. The retroreflectivity and colors of road signs deteriorate over time, and visibility worsens accordingly, so maintaining road signs is an important part of improving road safety. It is therefore important to judge whether the retroreflectivity and colors of a road sign are within the accepted levels for visibility, that is, whether the sign's status is acceptable or the sign needs to be replaced. This thesis aims to use machine learning algorithms to predict the status of road signs in Sweden. To achieve this aim, three classifiers were employed: Artificial Neural Network (ANN), Support Vector Machines (SVM), and Random Forest (RF). Data collected in Sweden by the Road and Transport Research Institute (VTI) were used to build the prediction models. High accuracies of 0.843, 0.93, and 0.98 were achieved with the three algorithms (ANN, SVM, and RF), respectively. Scaling the data was found to improve prediction accuracy for all three models, with standardization giving better accuracy than normalization. Additionally, principal component analysis (PCA) affected prediction accuracy differently for each algorithm. Another aim was to build models that predict the retroreflectivity performance of in-use road signs without instruments to measure retroreflectivity or color. Experiments using linear and logarithmic regression models were conducted for this purpose. Two datasets were used: the VTI data and a dataset collected in Denmark by a voluntary Nordic research cooperation (the NMF group). The age of the road traffic sign, the chromaticity coordinate X for colors, and the retroreflectivity class were found to be significant predictors of retroreflectivity in both datasets. The logarithmic regression models predicted retroreflectivity with higher accuracy than the linear models. Two suggested logarithmic regression models provided high accuracy for predicting retroreflectivity (R2 of 0.50 on the VTI data and 0.95 on the NMF data) using color, age, class, GPS position, and direction as predictors. Nearly the same accuracy (R2 of 0.57 on the VTI data and 0.95 on the NMF data) was achieved using all parameters in the data as predictors (including the chromaticity coordinates X and Y for colors). In conclusion, omitting the chromaticity coordinates X and Y from the logarithmic regression models does not affect the accuracy of the prediction.
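A 'logarithmic regression' of the kind described fits a linear model to the logarithm of retroreflectivity. A small sketch, assuming scikit-learn and synthetic sign records in place of the VTI/NMF data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic sign records: age (years), chromaticity X, and retroreflectivity
# class; retroreflectivity decays roughly exponentially with age.
rng = np.random.default_rng(0)
n = 400
age = rng.uniform(0, 20, n)
chrom_x = rng.normal(0.45, 0.03, n)
retro_class = rng.integers(1, 4, n)  # 1..3, higher class = brighter sheeting
retro = 80 * retro_class * np.exp(-0.06 * age) * rng.lognormal(0, 0.15, n)

X = np.column_stack([age, chrom_x, retro_class])

# "Logarithmic regression": a linear model for log(retroreflectivity).
model = LinearRegression().fit(X, np.log(retro))
print("R2 on the log scale:", round(r2_score(np.log(retro), model.predict(X)), 2))
```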
|
60 |
Interactive Machine Assistance: A Case Study in Linking Corpora and Dictionaries. Black, Kevin P. 01 November 2015 (has links) (PDF)
Machine learning can provide assistance to humans in making decisions, including linguistic decisions such as determining the part of speech of a word. Supervised machine learning methods derive patterns indicative of possible labels (decisions) from annotated example data. For many problems, including most language analysis problems, acquiring annotated data requires human annotators who are trained to understand the problem and to disambiguate among multiple possible labels. Hence, the availability of experts can limit the scope and quantity of annotated data. Machine-learned pre-annotation assistance, which suggests probable labels for unannotated items, can enable expert annotators to work more quickly and thus to produce broader and larger annotated resources more cost-efficiently. Yet, because annotated data is required to build the pre-annotation model, bootstrapping is an obstacle to utilizing pre-annotation assistance, especially for low-resource problems where little or no annotated data exists. Interactive pre-annotation assistance can mitigate bootstrapping costs, even for low-resource problems, by continually refining the pre-annotation model with new annotated examples as the annotators work. In practice, continually refining models has seldom been done except for the simplest of models which can be trained quickly. As a case study in developing sophisticated, interactive, machine-assisted annotation, this work employs the task of corpus-dictionary linkage (CDL), which is to link each word token in a corpus to its correct dictionary entry. CDL resources, such as machine-readable dictionaries and concordances, are essential aids in many tasks including language learning and corpus studies. We employ a pipeline model to provide CDL pre-annotations, with one model per CDL sub-task. We evaluate different models for lemmatization, the most significant CDL sub-task since many dictionary entry headwords are usually lemmas. The best performing lemmatization model is a hybrid which uses a maximum entropy Markov model (MEMM) to handle unknown (novel) word tokens and other component models to handle known word tokens. We extend the hybrid model design to the other CDL sub-tasks in the pipeline. We develop an incremental training algorithm for the MEMM which avoids wasting previous computation as would be done by simply retraining from scratch. The incremental training algorithm facilitates the addition of new dictionary entries over time (i.e., new labels) and also facilitates learning from partially annotated sentences which allows annotators to annotate words in any order. We validate that the hybrid model attains high accuracy and can be trained sufficiently quickly to provide interactive pre-annotation assistance by simulating CDL annotation on Quranic Arabic and classical Syriac data.
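Incremental refinement of a pre-annotation model can be sketched with an online learner: an MEMM is locally a maximum-entropy classifier over (previous label, context) features, so scikit-learn's SGDClassifier with logistic loss and partial_fit can stand in for the thesis's incrementally trained MEMM. The labels, features, and toy stream below are illustrative assumptions:

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

LABELS = np.array([0, 1, 2])  # hypothetical dictionary-entry ids
hasher = FeatureHasher(n_features=2**12, input_type="string")
model = SGDClassifier(loss="log_loss", random_state=0)  # older sklearn: "log"
bootstrapped = False

def featurize(prev_label, token):
    # MEMM-style local features: previous label plus token context.
    return hasher.transform([[f"prev={prev_label}", f"tok={token}",
                              f"suffix={token[-2:]}"]])

def suggest(prev_label, token):
    # Pre-annotation: propose a probable label once a model exists.
    return model.predict(featurize(prev_label, token))[0] if bootstrapped else None

def annotate(prev_label, token, gold_label):
    # The annotator confirms or corrects; the model is refined immediately
    # instead of being retrained from scratch.
    global bootstrapped
    if bootstrapped:
        model.partial_fit(featurize(prev_label, token), [gold_label])
    else:
        model.partial_fit(featurize(prev_label, token), [gold_label],
                          classes=LABELS)
        bootstrapped = True

# Simulated session on a toy stream of (previous label, token, gold label).
stream = [("<s>", "kitab", 0), (0, "al", 1), (1, "kitab", 0), (0, "qara", 2)]
for prev, tok, gold in stream:
    print("suggestion:", suggest(prev, tok), "gold:", gold)
    annotate(prev, tok, gold)
```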
|