1

Efficient techniques for streaming cross document coreference resolution

Shrimpton, Luke William January 2017 (has links)
Large text streams are commonplace; news organisations are constantly producing stories and people are constantly writing social media posts. These streams should be analysed in real-time so useful information can be extracted and acted upon instantly. When natural disasters occur people want to be informed, when companies announce new products financial institutions want to know, and when celebrities do things their legions of fans want to feel involved. In all these examples people care about getting information in real-time (low latency). These streams are massively varied; people’s interests are typically characterised by the entities they are interested in. Organising a stream by the entity being referred to would help people extract the information useful to them. This is a difficult task: fans of ‘Captain America’ films will not want to be incorrectly told that ‘Chris Evans’ (the main actor) was appointed to host ‘Top Gear’ when it was a different ‘Chris Evans’. People who use local idiosyncrasies, such as referring to their home county (‘Cornwall’) as ‘Kernow’ (the Cornish for ‘Cornwall’, which has entered the local lexicon), should not be forced to change their language when finding out information about their home. This thesis addresses a core problem for real-time entity-specific NLP: streaming cross document coreference resolution (CDC), how to automatically identify all the entities mentioned in a stream in real-time. It addresses two significant problems for streaming CDC: there is no representative dataset, and existing systems consume more resources over time. A new technique for creating datasets is introduced and applied to social media (Twitter) to produce a large (6M mentions) and challenging new CDC dataset that contains a much more varied range of entities than typical newswire streams. Existing systems are not able to keep up with large data streams. This problem is addressed with a streaming CDC system that stores a constant-sized set of mentions. New techniques to maintain this sample are introduced, significantly outperforming existing ones: the system retains 95% of the performance of a non-streaming system while using only 20% of the memory.
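The abstract gives no implementation detail, but the constant-memory idea it describes can be sketched as a bounded mention store with reservoir-style eviction. Everything here (class name, tuple layout, the choice of reservoir sampling as the maintenance policy) is illustrative, not the thesis's actual technique:

```python
import random

class BoundedMentionStore:
    """Sketch of a constant-sized mention store for streaming CDC.

    Keeps at most `capacity` mentions; once full, each new mention may
    replace a uniformly random stored one (reservoir sampling), so memory
    stays constant no matter how long the stream runs.
    """

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.mentions = []          # (mention_text, cluster_id) pairs
        self.seen = 0               # total mentions observed so far
        self.rng = random.Random(seed)

    def add(self, mention, cluster_id):
        self.seen += 1
        if len(self.mentions) < self.capacity:
            self.mentions.append((mention, cluster_id))
        else:
            # Keep the new mention with probability capacity / seen,
            # evicting a random stored one -- a uniform sample overall.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.mentions[j] = (mention, cluster_id)

store = BoundedMentionStore(capacity=100)
for i in range(10_000):
    store.add(f"mention_{i}", cluster_id=i % 50)
assert len(store.mentions) == 100  # memory bounded regardless of stream length
```

A real system would evict based on mention usefulness rather than uniformly; this sketch only shows why the memory footprint stays flat as the stream grows.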
2

Machine Learning Improvements for Data Partitioning and Classification Applied to Cardiac Arrhythmia Signals

Cayce, Garrett Irwin 12 1900 (has links)
This thesis develops a new method for the ethical splitting of data, as well as improvements to neural network architectures that increase performance. Ethical dataset splitting should be based on statistics computed from the data itself; this prevents artificial manipulation of the splits that helps or hurts a network's measured performance. The popular practice of randomly splitting data into subsets can also introduce such bias, so to remove it the splits must be driven by the data's statistics. Improving neural network architectures to increase performance is important for a wide range of applications, especially the classification of heartbeats, where any error could endanger a patient's life. Applied to heartbeat classification, these advancements have exciting implications for saving thousands of lives and billions of dollars. The presented methods can also be adapted to other applications and types of data, since sound dataset splitting and performance gains matter in all fields of machine learning.
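The abstract does not specify which statistics drive the split, but one common statistics-based alternative to a purely random shuffle is stratification on class proportions. The function below is a hedged sketch of that idea; the names and the choice of class frequency as the statistic are assumptions, not the thesis's method:

```python
import random
from collections import defaultdict

def stratified_split(records, label_of, test_frac=0.2, seed=0):
    """Split `records` so each subset preserves the full dataset's class
    proportions, instead of trusting a purely random shuffle to do so.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for r in records:
        by_label[label_of(r)].append(r)

    train, test = [], []
    for label, items in by_label.items():
        rng.shuffle(items)                     # randomness only *within* a class
        n_test = round(len(items) * test_frac)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test

# Toy heartbeat data: 80 "normal" beats and 20 "arrhythmic" beats.
data = [("beat", "normal")] * 80 + [("beat", "arrhythmic")] * 20
train, test = stratified_split(data, label_of=lambda r: r[1])
# The test split keeps the 80/20 ratio: 16 normal and 4 arrhythmic beats.
```

With a plain random split of such imbalanced data, the rare arrhythmic class can end up over- or under-represented in the test set, which is exactly the bias the abstract argues against.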
3

Rozpoznávání textu pomocí konvolučních sítí / Optical Character Recognition Using Convolutional Networks

Csóka, Pavel January 2016 (has links)
This thesis aims at the creation of new datasets for text-recognition machine learning tasks and at experiments with convolutional neural networks on these datasets. It describes the architecture of convolutional nets, the difficulties of recognizing text in photographs, and contemporary work using these networks. It then describes the creation, using Tesseract OCR, of annotation for a dataset of document-page photos taken with mobile phones, named Mobile Page Photos. From this dataset two further datasets are created by cropping characters out of its photos, formatted like the Street View House Numbers dataset: Mobile Nice Page Photos Characters contains readable characters, and Mobile Page Photos Characters adds hardly readable and unreadable ones. Three convolutional-net models are created and used for text-recognition experiments on these datasets, which are also used to estimate the annotation error.
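The derived character datasets come from cropping fixed-size character patches out of page photos, in the style of Street View House Numbers. The snippet below is a minimal sketch of that step on a plain 2-D pixel grid; the function name, box format, and zero-padding in place of proper resampling are all illustrative assumptions:

```python
def crop_characters(page, boxes, size=32):
    """Cut fixed-size character patches out of a page image.

    `page` is a 2-D grid of pixel values; `boxes` are (row, col, height,
    width) character bounding boxes, e.g. as reported by an OCR engine
    such as Tesseract.
    """
    crops = []
    for r, c, h, w in boxes:
        crop = [row[c:c + w] for row in page[r:r + h]]
        # SVHN-style datasets use fixed-size patches; real code would
        # resample the crop, here we simply zero-pad to `size` x `size`.
        padded = [list(row) + [0] * (size - len(row)) for row in crop]
        padded += [[0] * size for _ in range(size - len(padded))]
        crops.append(padded)
    return crops

# A synthetic 60x100 "page" and two annotated character boxes.
page = [[(r * 100 + c) % 256 for c in range(100)] for r in range(60)]
crops = crop_characters(page, [(5, 10, 20, 12), (5, 30, 20, 12)])
# Yields two 32x32 patches, one per character box.
```

Splitting the crops into a "nice" (readable) set and a full set including hard cases would then be a filtering pass over the annotations, as the abstract describes.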
