61

Efficient network based approaches for pattern recognition and knowledge discovery from large and heterogeneous datasets

Zhu, Cheng 25 October 2013 (has links)
No description available.
62

Identification of Uniform Class Regions using Perceptron Training

Samuel, Nikhil J. 15 October 2015 (has links)
No description available.
63

A Comparison of Rule Extraction Techniques with Emphasis on Heuristics for Imbalanced Datasets

Singh, Manjeet 22 September 2010 (has links)
No description available.
64

Classification in High Dimensional Feature Spaces through Random Subspace Ensembles

Pathical, Santhosh P. January 2010 (has links)
No description available.
65

Performance Evaluation of a Low Impact Development Retrofit for Urban Stormwater Treatment

Le Bel, Paul David 18 April 2013 (has links)
The goal of Low Impact Development (LID) is to mimic the pre-development hydrologic regime of a catchment through infiltration, filtration, storage, evaporation, and detention of post-development runoff using small-scale hydrologic controls close to the source. A LID facility located in Northern Virginia was examined for pollutant removal and hydrologic performance. The treatment train included four in-line grass swales followed by a bioretention cell with a gravel base. The facility retained 85% of the rainfall. Influent and effluent pollutant loads were calculated using three common substitution methods for datasets censored by values below the analytical detection limit. The Summation of Loads (SOL) method was used to facilitate understanding of how data censoring affected performance results when substitution methods were used. The SOL analysis showed positive removal performance for most nutrient species, sediment, oxygen demanding substances, selected trace metals and total petroleum hydrocarbons. Negative performance was observed for oxidized nitrogen, total dissolved solids and oil & grease. LID facility influent and effluent loads were also compared using the Effluent Probability Method (EPM). The EPM analysis showed statistically significant (p ≤ 0.05) pollutant load removal performance over the entire range of sampled events for total suspended solids, total phosphorus, total nitrogen, total Kjeldahl nitrogen, ammonia nitrogen, chemical oxygen demand, copper, zinc and alkalinity. EPM analysis did not show significant removals of oxidized nitrogen, total dissolved solids, orthophosphate phosphorus and hardness. / Master of Science
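The substitution step this abstract refers to is commonly implemented by replacing censored observations with a fixed fraction of the detection limit before loads are summed. A minimal sketch of that idea in Python, assuming hypothetical concentration and runoff-volume values and a hypothetical detection limit (none of these numbers come from the thesis):

```python
import numpy as np

def substitute_censored(concs, censored, detection_limit, fraction=0.5):
    """One common substitution method: replace values reported below the
    analytical detection limit with a fixed fraction of that limit."""
    out = np.asarray(concs, dtype=float).copy()
    out[np.asarray(censored)] = fraction * detection_limit
    return out

def summation_of_loads(concs, volumes):
    """Pollutant load as the sum of concentration x runoff volume per sample."""
    return float(np.sum(np.asarray(concs) * np.asarray(volumes)))

# Hypothetical influent/effluent samples; detection limit 0.05 mg/L.
infl = substitute_censored([0.30, 0.05, 0.21], [False, True, False], 0.05)
effl = substitute_censored([0.05, 0.05, 0.09], [True, True, False], 0.05)
vols = np.array([10.0, 12.0, 8.0])  # runoff volume per sample interval, m^3

removal = 1 - summation_of_loads(effl, vols) / summation_of_loads(infl, vols)
print(f"Illustrative load removal: {removal:.0%}")
```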
66

Toward Practical, In-The-Wild, and Reusable Wearable Activity Classification

Younes, Rabih Halim 08 June 2018 (has links)
Wearable activity classifiers have so far performed well only with simple, strictly scripted, or application-specific activities. In addition, current classification systems either rely on impractical tight-fitting sensor networks or use a single loose-fitting sensor node that cannot capture much movement information (e.g., smartphone sensors and wrist-worn sensors). These classifiers either do not address the bigger picture of making activity recognition more practical and able to recognize more complex, naturalistic activities, or try to address it but still perform poorly on many fronts. This dissertation works toward practical, in-the-wild, and reusable wearable activity classifiers through four main contributions. It first quantifies users' needs and expectations of wearable activity classifiers: data from user studies and interviews is gathered and analyzed, and the conclusions establish a framework of essential characteristics that ideal wearable activity classification systems should have. The dissertation then introduces a group of datasets that can benchmark different types of activity classifiers and accommodate a variety of goals; these datasets allow activity classification algorithms to be compared under various circumstances and with different types of activities. The third main contribution is a technique for classifying complex activities with wide variations. In testing, it classified eight complex daily-life activities with wide variations at 93.33% accuracy, significantly outperforming the state of the art, a step toward classifying natural, real-life activities performed in environments that allow wide within-activity variation. Finally, the dissertation introduces a method that can be used on top of any activity classifier exposing its matching scores in order to improve its classification accuracy. In testing, this method improved classification results by 11.86% and outperformed the state of the art, a step toward reusable activity classification techniques that work across users, sensor domains, garments, and applications. / Ph. D.
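The abstract does not detail how the matching scores are used, so the following is purely a generic illustration of the kind of score-level post-processing that access to matching scores enables, not the dissertation's method (the scores and window size are invented):

```python
import numpy as np

def smooth_with_scores(score_matrix, window=3):
    """Average per-class matching scores over a sliding temporal window,
    then re-decide each frame's label. Generic score-level post-processing,
    not the specific method proposed in the dissertation."""
    scores = np.asarray(score_matrix, dtype=float)  # shape: (frames, classes)
    smoothed = np.empty_like(scores)
    for t in range(len(scores)):
        lo, hi = max(0, t - window), min(len(scores), t + window + 1)
        smoothed[t] = scores[lo:hi].mean(axis=0)
    return smoothed.argmax(axis=1)

# Hypothetical matching scores for 5 frames over 3 activity classes.
raw = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.8, 0.1, 0.1],
       [0.3, 0.6, 0.1], [0.7, 0.2, 0.1]]
print(smooth_with_scores(raw))  # isolated misclassifications get voted down
```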
67

Relation entre tableaux de données : exploration et prédiction / Relating datasets : exploration and prediction

El Ghaziri, Angélina 20 October 2016 (has links)
The research developed in this thesis addresses several aspects of the statistical analysis of datasets. First, the properties of three association indices between two datasets, commonly used by practitioners, are investigated. Second, strategies related to the standardization of datasets are developed, with applications to principal component analysis (PCA) and regression, especially PLS regression. The first strategy is a continuum standardization of the variables, whose interest for PCA and PLS regression is emphasized. A more general standardization is also discussed, which gradually reduces not only the variances of the variables but also the correlations among them. From this, a continuum regression approach is developed that encompasses Redundancy Analysis and PLS regression. This generalized standardization also inspired a biased regression procedure for multiple linear regression; its properties are studied and its results compared, on case studies, with those of Ridge regression. For the exploratory analysis of several datasets, the ComDim method has raised notable interest among practitioners; an extension of this method to the K+1 datasets situation is developed. The properties of this method, called P-ComDim, are studied and compared with those of Multiblock PLS. Finally, for datasets depending on several factors, a new analysis strategy based on PLS regression is proposed.
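The continuum standardization mentioned above can be pictured as scaling each centered variable by its standard deviation raised to a power between 0 and 1, interpolating between unscaled (covariance-based) and fully standardized (correlation-based) PCA. A minimal sketch of that reading, assuming a plain NumPy data matrix; the parameter name gamma and the helper are illustrative, not the thesis's notation:

```python
import numpy as np

def continuum_standardize(X, gamma):
    """Scale each centered column by sd**gamma.
    gamma = 0 leaves variances untouched (covariance PCA);
    gamma = 1 reduces every variance to 1 (correlation PCA);
    intermediate values interpolate between the two."""
    Xc = X - X.mean(axis=0)
    sd = Xc.std(axis=0, ddof=1)
    return Xc / sd**gamma

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) * [1.0, 5.0, 0.2, 2.0]  # unequal variances
for gamma in (0.0, 0.5, 1.0):
    Z = continuum_standardize(X, gamma)
    print(gamma, np.round(Z.std(axis=0, ddof=1), 2))
```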
68

[en] NAMED ENTITY RECOGNITION FOR PORTUGUESE / [pt] RECONHECIMENTO DE ENTIDADES MENCIONADAS PARA O PORTUGUÊS

DANIEL SPECHT SILVA MENEZES 13 December 2018 (has links)
The production of and access to immense quantities of data is a pervasive feature of the Information Age. The volume of available information is unprecedented in human history and is constantly expanding. One opportunity that emerges in this environment is the development of applications capable of structuring the knowledge contained in these data. This is the context of Natural Language Processing (NLP): extracting structured information efficiently from textual sources. A fundamental step toward this goal is the task of Named Entity Recognition (NER), which consists of delimiting and categorizing mentions of entities in a text. Building NLP systems requires datasets that express human understanding of the grammatical structures of interest, so that system output can be compared against genuine human judgment. Such datasets are scarce resources whose production demands human effort. Currently, the NER task is being tackled successfully with artificial neural networks, which require annotated datasets for both training and evaluation. This work proposes the automated construction of a large-scale dataset for Portuguese NER, minimizing the need for human intervention, using public data sources structured according to Semantic Web principles, namely DBpedia and Wikipédia. A methodology for building the corpus was developed, and experiments were run on it using neural network architectures with the best currently reported performance. Several network models and a range of hyperparameter values were explored, and architectures were proposed with the specific aim of incorporating different data sources for training.
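Building NER annotations automatically from sources like Wikipédia and DBpedia is typically done by distant supervision: text whose surface form matches a known entity is tagged with that entity's type. A toy sketch of that general idea; the gazetteer entries and sentence are invented, and the thesis's actual pipeline is more elaborate:

```python
# Toy distant supervision: project a DBpedia-style type gazetteer onto text
# to produce BIO-tagged NER training data. Entries here are invented examples.
GAZETTEER = {
    "Machado de Assis": "PER",
    "Rio de Janeiro": "LOC",
}

def bio_tag(tokens, gazetteer):
    """Greedily match the longest gazetteer entry at each position and emit
    B-/I-/O tags in CoNLL style."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for length in range(len(tokens) - i, 0, -1):
            span = " ".join(tokens[i:i + length])
            if span in gazetteer:
                label = gazetteer[span]
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + length):
                    tags[j] = f"I-{label}"
                i += length
                break
        else:
            i += 1
    return tags

tokens = "Machado de Assis nasceu no Rio de Janeiro".split()
for tok, tag in zip(tokens, bio_tag(tokens, GAZETTEER)):
    print(f"{tok}\t{tag}")
```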
69

Electronic multi-agency collaboration : a model for sharing children's personal information among organisations

Louws, Margie January 2010 (has links)
The sharing of personal information among health and social service organisations is a complex and problematic process in present-day England. Organisations that provide services to children face enormous challenges on many fronts: internal ways of working, evolving best practice, data protection requirements, government mandates and new government agencies, rapid changes in technology, and increasing costs are but a few of the pressures with which organisations must contend in order to provide services to children while keeping in step with change. This thesis explores the process of sharing personal information in the context of public sector reforms. Because of the increasing emphasis on multi-agency collaboration, the thesis examines information sharing processes both within and among organisations, particularly those providing services to children. From the broad principles that comprise a socio-technical approach to information sharing, distinct critical factors for successful information sharing and best practices are identified. These critical success factors are then used to evaluate the emerging national database, ContactPoint, highlighting particular areas of concern. In addition, data protection and related issues in the information sharing process are addressed. It is argued that one of the main factors supporting effective information sharing is to attach a timeline to the life of a dataset containing personal information, after which the shared information would dissolve. The thesis therefore introduces Dynamic Multi-Agency Collaboration (DMAC), a theoretical model of effective information sharing using a limited-life dataset. The limited life of the DMAC dataset gives more control to information providers, encouraging effective information sharing within the parameters of the Data Protection Act 1998.
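The limited-life dataset at the heart of DMAC can be pictured as a record wrapper that refuses access once its agreed lifetime has elapsed. A toy sketch of the concept only, not the thesis's specification; the field names and lifetime are invented:

```python
from datetime import datetime, timedelta, timezone

class LimitedLifeRecord:
    """A shared record that 'dissolves' after an agreed lifetime, keeping
    control with the information provider (illustration of the DMAC idea)."""

    def __init__(self, payload, lifetime_days):
        self._payload = payload
        self.expires_at = datetime.now(timezone.utc) + timedelta(days=lifetime_days)

    def read(self):
        if datetime.now(timezone.utc) >= self.expires_at:
            self._payload = None  # dissolve the shared copy
            raise PermissionError("record expired; request it again from the provider")
        return self._payload

record = LimitedLifeRecord({"child_id": "hypothetical-001", "notes": "..."}, lifetime_days=30)
print(record.read())  # accessible only until the agreed lifetime elapses
```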
70

Geometric Approach to Support Vector Machines Learning for Large Datasets

Strack, Robert 03 May 2013 (has links)
The dissertation introduces Sphere Support Vector Machines (SphereSVM) and Minimal Norm Support Vector Machines (MNSVM) as new fast classification algorithms that use geometric properties of the underlying classification problems to efficiently obtain models describing training data. SphereSVM combines a minimal enclosing ball approach, state-of-the-art nearest point problem solvers, and probabilistic techniques. The blending of the three speeds up the training phase of SVMs significantly while reaching similar (i.e., practically the same) accuracy as other classification models over several large real datasets within the strict validation frame of a double (nested) cross-validation (CV). MNSVM is a further simplification of the SphereSVM algorithm: the relatively complex classification task is converted into one of the simplest geometric problems, the minimal norm problem, which yields an additional speedup over SphereSVM. These results promote both SphereSVM and MNSVM as outstanding alternatives for handling large and ultra-large datasets in reasonable time without switching to the various parallelization schemes for SVM algorithms proposed recently. Variants of both algorithms that work without an explicit bias term are also presented, along with other techniques aiming to improve time efficiency, such as over-relaxation and an improved support vector selection scheme. Finally, the accuracy and performance of all these modifications are carefully analyzed, and results based on the nested cross-validation procedure are shown.
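The minimal enclosing ball problem that SphereSVM builds on can be approximated with the simple Badoiu-Clarkson iteration: repeatedly pull the center toward the current farthest point with a decaying step. A compact sketch on raw points, illustrative only; SphereSVM operates in the SVM's feature space with considerable additional machinery:

```python
import numpy as np

def minimal_enclosing_ball(points, iters=200):
    """Badoiu-Clarkson approximation of the minimal enclosing ball:
    step the center toward the current farthest point with weight 1/(t+1)."""
    pts = np.asarray(points, dtype=float)
    center = pts[0].copy()
    for t in range(1, iters + 1):
        far = pts[np.argmax(np.linalg.norm(pts - center, axis=1))]
        center += (far - center) / (t + 1)
    radius = np.linalg.norm(pts - center, axis=1).max()
    return center, radius

rng = np.random.default_rng(1)
c, r = minimal_enclosing_ball(rng.normal(size=(500, 2)))
print(np.round(c, 3), round(r, 3))
```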
