11

Random forest och glesa datarepresentationer / Random forest using sparse data structures

Linusson, Henrik, Rudenwall, Robin, Olausson, Andreas January 2012 (has links)
In silico experimentation is the process of using computational and statistical models to predict medicinal properties in chemicals; as a means of reducing lab work and increasing success rates, this process has become an important part of modern drug development. There are various ways of representing molecules; the problem that motivated this paper derives from collecting substructures of a chemical into what are known as fractional representations. Assembling large sets of molecules represented in this way results in sparse data, where a large portion of the set consists of null values. This consumes an excessive amount of computer memory, which limits the size of the data sets that can be used when constructing predictive models. In this study, we suggest a set of criteria for evaluating random forest implementations to be used for in silico predictive modeling on sparse data sets, with regard to computer memory usage, model construction time and predictive accuracy. A novel random forest system was implemented to meet the suggested criteria, and experiments were conducted comparing our implementation to existing machine learning algorithms to establish its correctness. Experimental results show that our random forest implementation can create accurate prediction models on sparse data sets, with lower memory overhead than implementations using a common matrix representation, and in less time than the existing random forest implementations it was evaluated against. We highlight design choices made to accommodate sparse data structures and data sets in the random forest ensemble technique, and therein present potential improvements to feature selection in sparse data sets. / Program: Systemarkitekturutbildningen
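The memory argument above can be made concrete with a small sketch. The snippet below is illustrative only: it uses SciPy's CSR format and scikit-learn's stock random forest rather than the novel implementation described in the thesis, and the molecule count, feature count and labels are made-up values.

```python
# A minimal sketch (not the thesis's own implementation) of training a random
# forest directly on a sparse matrix, so mostly-zero fractional representations
# never have to be expanded into a dense array.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# 1,000 hypothetical molecules, 5,000 substructure features, ~1% non-zero values.
X = sparse_random(1000, 5000, density=0.01, format="csr", random_state=0)
y = rng.integers(0, 2, size=1000)           # hypothetical active/inactive labels

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)                            # scikit-learn accepts CSR input directly

print(forest.score(X, y))
print("sparse X bytes:", X.data.nbytes + X.indices.nbytes + X.indptr.nbytes)
print("dense X bytes: ", X.shape[0] * X.shape[1] * 8)
```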
12

Constructing a Clinical Research Data Management System

Quintero, Michael C. 04 November 2017 (has links)
Clinical study data is usually collected without knowing in advance what kind of data will be collected. In addition, the set of all possible data points that can apply to a patient in any given clinical study is almost always a superset of the data points actually recorded for that patient. As a result, clinical data resembles sparse data with an evolving schema. To help researchers at the Moffitt Cancer Center better manage clinical data, a tool called GURU was developed that uses the Entity-Attribute-Value model to handle sparse data and allows users to manage a database entity's attributes without any changes to the database table definition. The Entity-Attribute-Value model's read performance improves as the data becomes sparser, but it was observed to perform many times worse than a wide table when the attribute count is not sufficiently large. Ultimately, the design trades read performance for flexibility in the data schema.
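For readers unfamiliar with the Entity-Attribute-Value pattern mentioned above, here is a minimal, hypothetical sketch in Python/SQLite; the table and column names are invented for illustration and are not taken from GURU's actual schema.

```python
# A hedged sketch of the Entity-Attribute-Value idea: each recorded patient
# attribute becomes a row, so new attributes never require an ALTER TABLE.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, enrolled DATE);
    -- One narrow table holds every attribute/value pair instead of one wide column per attribute.
    CREATE TABLE patient_eav (
        patient_id INTEGER REFERENCES patient(patient_id),
        attribute  TEXT NOT NULL,
        value      TEXT
    );
""")

conn.execute("INSERT INTO patient VALUES (1, '2017-01-15')")
# Only the data points actually recorded for this patient are stored.
conn.executemany(
    "INSERT INTO patient_eav VALUES (?, ?, ?)",
    [(1, "tumor_stage", "II"), (1, "smoker", "no")],
)

# Reading attributes back requires filtering rows rather than a simple column lookup,
# which is why reads can be slower than a wide table when attributes are not very sparse.
rows = conn.execute(
    "SELECT attribute, value FROM patient_eav WHERE patient_id = ?", (1,)
).fetchall()
print(rows)
```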
13

Order in the random forest

Karlsson, Isak January 2017 (has links)
In many domains, repeated measurements are systematically collected to obtain the characteristics of objects or situations that evolve over time or other logical orderings. Although the classification of such data series shares many similarities with traditional multidimensional classification, inducing accurate machine learning models using traditional algorithms is typically infeasible since the order of the values must be considered. In this thesis, the challenges related to inducing predictive models from data series using a class of algorithms known as random forests are studied for the purpose of efficiently and effectively classifying (i) univariate, (ii) multivariate and (iii) heterogeneous data series either directly in their sequential form or indirectly as transformed to sparse and high-dimensional representations. In the thesis, methods are developed to address the challenges of (a) handling sparse and high-dimensional data, (b) data series classification and (c) early time series classification using random forests. The proposed algorithms are empirically evaluated in large-scale experiments and practically evaluated in the context of detecting adverse drug events. In the first part of the thesis, it is demonstrated that minor modifications to the random forest algorithm and the use of a random projection technique can improve the effectiveness of random forests when faced with discrete data series projected to sparse and high-dimensional representations. In the second part of the thesis, an algorithm for inducing random forests directly from univariate, multivariate and heterogeneous data series using phase-independent patterns is introduced and shown to be highly effective in terms of both computational and predictive performance. Then, leveraging the notion of phase-independent patterns, the random forest is extended to allow for early classification of time series and is shown to perform favorably when compared to alternatives. The conclusions of the thesis not only reaffirm the empirical effectiveness of random forests for traditional multidimensional data but also indicate that the random forest framework can, with success, be extended to sequential data representations.
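As a hedged illustration of the first part described above (not the thesis's actual algorithms), the sketch below projects a sparse, high-dimensional representation of discrete data series to a lower dimension with a random projection and trains a standard random forest on the result; the data, dimensions and labels are synthetic.

```python
# Illustrative sketch: random projection of a sparse, high-dimensional
# representation, followed by a stock random forest classifier.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.random_projection import SparseRandomProjection
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X_sparse = sparse_random(500, 20000, density=0.002, format="csr", random_state=1)
y = rng.integers(0, 2, size=500)            # hypothetical class labels

# Johnson-Lindenstrauss-style random projection down to a few hundred dimensions.
projector = SparseRandomProjection(n_components=256, random_state=1)
X_proj = projector.fit_transform(X_sparse)

forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_proj, y)
print(X_proj.shape, forest.score(X_proj, y))
```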
14

Biagrupamento heurístico e coagrupamento baseado em fatoração de matrizes: um estudo em dados textuais / Heuristic biclustering and coclustering based on matrix factorization: a study on textual data

Ramos Diaz, Alexandra Katiuska 16 October 2018 (has links)
Biagrupamento e coagrupamento são tarefas de mineração de dados que permitem a extração de informação relevante sobre dados e têm sido aplicadas com sucesso em uma ampla variedade de domínios, incluindo aqueles que envolvem dados textuais -- foco de interesse desta pesquisa. Nas tarefas de biagrupamento e coagrupamento, os critérios de similaridade são aplicados simultaneamente às linhas e às colunas das matrizes de dados, agrupando simultaneamente os objetos e os atributos e possibilitando a criação de bigrupos/cogrupos. Contudo suas definições variam segundo suas naturezas e objetivos, sendo que a tarefa de coagrupamento pode ser vista como uma generalização da tarefa de biagrupamento. Estas tarefas, quando aplicadas nos dados textuais, demandam uma representação em um modelo de espaço vetorial que, comumente, leva à geração de espaços caracterizados pela alta dimensionalidade e esparsidade, afetando o desempenho de muitos dos algoritmos. Este trabalho apresenta uma análise do comportamento do algoritmo para biagrupamento Cheng e Church e do algoritmo para coagrupamento de decomposição de valores em blocos não negativos (Non-Negative Block Value Decomposition - NBVD), aplicado ao contexto de dados textuais. Resultados experimentais quantitativos e qualitativos são apresentados a partir das experimentações destes algoritmos em conjuntos de dados sintéticos criados com diferentes níveis de esparsidade e em um conjunto de dados real. Os resultados são avaliados em termos de medidas próprias de biagrupamento, medidas internas de agrupamento a partir das projeções nas linhas dos bigrupos/cogrupos e em termos de geração de informação. As análises dos resultados esclarecem questões referentes às dificuldades encontradas por estes algoritmos nos ambiente de experimentação, assim como se são capazes de fornecer informações diferenciadas e úteis na área de mineração de texto. De forma geral, as análises realizadas mostraram que o algoritmo NBVD é mais adequado para trabalhar com conjuntos de dados em altas dimensões e com alta esparsidade. O algoritmo de Cheng e Church, embora tenha obtidos resultados bons de acordo com os objetivos do algoritmo, no contexto de dados textuais, propiciou resultados com baixa relevância / Biclustering and coclustering are data mining tasks that allow the extraction of relevant information from data and have been applied successfully in a wide variety of domains, including those involving textual data - the focus of interest of this research. In biclustering and coclustering tasks, similarity criteria are applied simultaneously to the rows and columns of the data matrices, simultaneously grouping the objects and attributes and enabling the discovery of biclusters/coclusters. However, their definitions vary according to their natures and objectives, and the coclustering task can be seen as a generalization of biclustering. When applied to textual data, these tasks require a representation in a vector space model, which commonly leads to spaces characterized by high dimensionality and sparsity and affects the performance of many algorithms. This work provides an analysis of the behavior of the Cheng and Church biclustering algorithm and the Non-Negative Block Value Decomposition (NBVD) coclustering algorithm applied to the context of textual data. Quantitative and qualitative experimental results are presented, from experiments with these algorithms on synthetic datasets created with different sparsity levels and on a real data set. The results are evaluated in terms of biclustering-oriented measures, internal clustering measures applied to the row projections of the biclusters/coclusters, and the information generated. The analysis of the results clarifies questions related to the difficulties faced by these algorithms in the experimental environment, as well as whether they are able to provide distinctive information useful to the field of text mining. In general, the analyses carried out showed that the NBVD algorithm is better suited to datasets with high dimensionality and high sparsity. The Cheng and Church algorithm, although it obtained good results according to its own objectives, provided results of low relevance in the context of textual data.
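As a rough, hedged analogue of factorization-based coclustering (plain NMF rather than NBVD itself, which factors the matrix into three block factors), the sketch below factorizes a tiny document-term matrix and reads document and term cluster assignments off the factors; the documents and cluster count are invented for illustration.

```python
# Hedged analogue of matrix-factorization coclustering: factor a document-term
# matrix with NMF and assign rows/columns to the factor with the largest weight.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "gene expression tumor sample",
    "tumor gene mutation sample",
    "match goal player season",
    "player season match coach",
]
X = CountVectorizer().fit_transform(docs)   # sparse document-term matrix

model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(X)                  # documents x factors
H = model.components_                       # factors x terms

doc_clusters = W.argmax(axis=1)             # cocluster assignment of each document
term_clusters = H.argmax(axis=0)            # cocluster assignment of each term
print(doc_clusters, term_clusters)
```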
16

Hybrid Ensemble Methods: Interpretable Machine Learning for High Risk Areas / Hybrida ensemblemetoder: Tolkningsbar maskininlärning för högriskområden

Ulvklo, Maria January 2021 (has links)
Despite access to enormous amounts of data, the use of machine learning in the Cyber Security field is held back by the lack of interpretability of "black-box" models and by heterogeneous data. This project presents a method that provides insight into the decision-making process in Cyber Security classification. Hybrid Ensemble Methods (HEMs) use several weak learners trained on single data features and combine their outputs in a neural network. In this thesis, HEM performs phishing-website classification with high accuracy, along with interpretability. The ensemble of predictions boosts the accuracy by 8%, giving a final prediction accuracy of 93%, which indicates that HEM is able to reconstruct correlations between the features after the interpretability stage. HEM provides information about which weak learners, trained on specific information, are valuable for the classification. No samples were disregarded despite missing features. Cross-validation was performed across 3 random seeds and the results proved stable, with a variance of 0.22%. An important finding was that the method's performance did not change significantly when the worst of the weak learners were disregarded, meaning that adding models trained on bad data will not sabotage the prediction. These findings indicate that Hybrid Ensemble Methods are robust and flexible. This thesis represents an attempt to construct a smarter way of making predictions, in which several forms of information can be combined in an artificially intelligent way. / Trots tillgången till enorma mängder data finns det ett bakslag i användningen av maskininlärning inom cybersäkerhetsområdet på grund av bristen på tolkning av ”Blackbox”-modeller och på grund av heterogen data. Detta projekt presenterar en metod som ger insikt i beslutsprocessen i klassificering inom cyber säkerhet. Hybrid Ensemble Methods (HEMs), använder flera svaga maskininlärningsmodeller som är tränade på enstaka datafunktioner och kombinerar resultatet av dessa i ett neuralt nätverk. I denna rapport utför HEM klassificering av nätfiskewebbplatser med hög noggrannhet, men med vinsten av tolkningsbarhet. Sammansättandet av förutsägelser ökar noggrannheten med 8 %, vilket ger en slutgiltig prediktionsnoggrannhet på 93 %, vilket indikerar att HEM kan rekonstruera korrelationer mellan funktionerna efter tolkbarhetsstadiet. HEM ger information om vilka svaga maskininlärningsmodeller, som tränats på specifik information, som är värdefulla för klassificeringen. Inga datapunkter ignorerades trots saknade datapunkter. Korsvalidering gjordes över 3 slumpmässiga dragningar och resultaten visade sig vara stabila med en varians på 0.22 %. Ett viktigt resultat var att metodernas prestanda inte förändrades nämnvärt när man bortsåg från de sämsta av de svaga modellerna, vilket innebär att modeller tränade på dålig data inte kommer att sabotera förutsägelsen. Resultaten av dessa undersökningar indikerar att Hybrid Ensamble-metoder är robusta och flexibla. Detta projekt representerar ett försök att konstruera ett smartare sätt att göra klassifieringar, där användningen av flera former av information kan kombineras, på ett artificiellt intelligent sätt.
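A hedged sketch of the hybrid-ensemble idea described above (not the thesis's actual HEM code): one shallow learner is trained per single feature and a small neural network combines their probability outputs; the dataset, learner types and network size are arbitrary choices for illustration.

```python
# Illustrative hybrid ensemble: per-feature weak learners whose probability
# outputs are combined by a small neural network.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Train one shallow tree per feature (the "weak learners").
weak_learners = []
for j in range(X.shape[1]):
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    tree.fit(X_tr[:, [j]], y_tr)
    weak_learners.append(tree)

def stack_outputs(models, X):
    # Each column is one weak learner's predicted probability of the positive class.
    return np.column_stack(
        [m.predict_proba(X[:, [j]])[:, 1] for j, m in enumerate(models)]
    )

# The neural network recombines the per-feature outputs into a final prediction.
combiner = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)
combiner.fit(stack_outputs(weak_learners, X_tr), y_tr)
print(combiner.score(stack_outputs(weak_learners, X_te), y_te))
```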
17

Aplicacions de tècniques de fusió de dades per a l'anàlisi d'imatges de satèl·lit en Oceanografia

Reig Bolaño, Ramon 25 June 2008 (has links)
Durant dècades s'ha observat i monitoritzat sistemàticament la Terra i el seu entorn des de l'espai o a partir de plataformes aerotransportades. Paral·lelament, s'ha tractat d'extreure el màxim d'informació qualitativa i quantitativa de les observacions realitzades. Les tècniques de fusió de dades donen un "ventall de procediments que ens permeten aprofitar les dades heterogènies obtingudes per diferents mitjans i instruments i integrar-les de manera que el resultat final sigui qualitativament superior". En aquesta tesi s'han desenvolupat noves tècniques que es poden aplicar a l'anàlisi de dades multiespectrals que provenen de sensors remots, adreçades a aplicacions oceanogràfiques. Bàsicament s'han treballat dos aspectes: les tècniques d'enregistrament o alineament d'imatges; i la interpolació de dades esparses i multiescalars, focalitzant els resultats als camps vectorials bidimensionals.En moltes aplicacions que utilitzen imatges derivades de satèl·lits és necessari mesclar o comparar imatges adquirides per diferents sensors, o bé comparar les dades d'un sòl sensor en diferents instants de temps, per exemple en: reconeixement, seguiment i classificació de patrons o en la monitorització mediambiental. Aquestes aplicacions necessiten una etapa prèvia d'enregistrament geomètric, que alinea els píxels d'una imatge, la imatge de treball, amb els píxels corresponents d'una altra imatge, la imatge de referència, de manera que estiguin referides a uns mateixos punts. En aquest treball es proposa una aproximació automàtica a l'enregistrament geomètric d'imatges amb els contorns de les imatges; a partir d'un mètode robust, vàlid per a imatges mutimodals, que a més poden estar afectades de distorsions, rotacions i de, fins i tot, oclusions severes. En síntesi, s'obté una correspondència punt a punt de la imatge de treball amb el mapa de referència, fent servir tècniques de processament multiresolució. El mètode fa servir les mesures de correlació creuada de les transformades wavelet de les seqüències que codifiquen els contorns de la línia de costa. Un cop s'estableix la correspondència punt a punt, es calculen els coeficients de la transformació global i finalment es poden aplicar a la imatge de treball per a enregistrar-la respecte la referència.A la tesi també es prova de resoldre la interpolació d'un camp vectorial espars mostrejat irregularment. Es proposa un algorisme que permet aproximar els valors intermitjos entre les mostres irregulars si es disposa de valors esparsos a escales de menys resolució. El procediment és òptim si tenim un model que caracteritzi l'esquema multiresolució de descomposició i reconstrucció del conjunt de dades. Es basa en la transformada wavelet discreta diàdica i en la seva inversa, realitzades a partir d'uns bancs de filtres d'anàlisi i síntesi. Encara que el problema està mal condicionat i té infinites solucions, la nostra aproximació, que primer treballarem amb senyals d'una dimensió, dóna una estratègia senzilla per a interpolar els valors d'un camp vectorial bidimensional, utilitzant tota la informació disponible a diferents resolucions. Aquest mètode de reconstrucció es pot utilitzar com a extensió de qualsevol interpolació inicial. També pot ser un mètode adequat si es disposa d'un conjunt de mesures esparses de diferents instruments que prenen dades d'una mateixa escena a diferents resolucions, sense cap restricció en les característiques de la distribució de mesures. 
Inicialment cal un model dels filtres d'anàlisi que generen les dades multiresolució i els filtres de síntesi corresponents, però aquest requeriment es pot relaxar parcialment, i és suficient tenir una aproximació raonable a la part passa baixes dels filtres. Els resultats de la tesi es podrien implementar fàcilment en el flux de processament d'una estació receptora de satèl·lits, i així es contribuiria a la millora d'aplicacions que utilitzessin tècniques de fusió de dades per a monitoritzar paràmetres mediambientals. / During the last decades a systematic survey of the Earth environment has been set up from many spatial and airborne platforms. At present, there is a continuous effort to extract and combine the maximum of quantitative information from these different data sets, which are often rather heterogeneous. Data fusion can be defined as "a set of means and tools for the alliance of data originating from different sources with the aim of a greater quality result". In this thesis we have developed new techniques and schemes that can be applied to multispectral data obtained from remote sensors, with particular interest in oceanographic applications. They are based on image and signal processing. We have worked mainly on two topics: image registration techniques, or image alignment; and the interpolation of multiscale and sparse data sets, with a focus on two-dimensional vector fields. In many applications using satellite images, and specifically in those related to oceanographic studies, it is necessary to merge or compare multiple images of the same scene acquired from different captors or from one captor at different times. Typical applications include pattern classification, recognition and tracking, multisensor data fusion and environmental monitoring. Image registration is the process of aligning the remotely sensed images to the same ground truth and transforming them into a known geographic projection (map coordinates). This step is crucial to correctly merge complementary information from multisensor data. The proposed approach to automatic image registration is a robust method, valid for multimodal images affected by distortions, rotations and, to a reasonable extent, severe data occlusion. We derived a point-to-point matching of one image to a georeferenced map applying multiresolution signal processing techniques. The method is based on the contours of the images: it uses a maximum cross-correlation measure on the biorthogonal undecimated discrete wavelet transforms of the codified coastline contour sequences. Once this point-to-point correspondence is established, the coefficients of a global transform can be calculated and finally applied to the working image to register it to the georeferenced map. The second topic of this thesis focuses on the interpolation of sparse, irregularly sampled vector fields when these sparse data belong to different resolutions. A new algorithm is proposed to iteratively approximate the intermediate values between irregularly sampled data when a set of sparse values at coarser scales is known. The procedure is optimal if there is a characterized model for the multiresolution decomposition/reconstruction scheme of the dataset. The scheme is based on a fast dyadic wavelet transform and on its inversion using a filter bank analysis/synthesis implementation for the wavelet transform model. Although the problem is ill-posed and there are infinitely many solutions, our approach, first developed for one-dimensional signals, gives a simple strategy to interpolate the values of a vector field using all the information available at different scales. This reconstruction method can be used as an extension of any initial interpolation. It is also suitable in cases where there are sparse measures from different instruments sensing the same scene simultaneously at several resolutions, without any restriction on the characteristics of the data distribution. Initially, a filter model for the generation of the multiresolution data and its synthesis counterpart is the main requisite, but this assumption can be partially relaxed, requiring only a reasonable approximation to the low-pass part of the filters. The thesis results can easily be implemented in the processing stream of any satellite receiving station and therefore constitute a first contribution to potential data fusion applications for environmental monitoring.
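A simplified, hedged illustration of the matching idea described above: recover the shift between two 1-D contour signatures by maximizing their cross-correlation. The signatures here are synthetic curves rather than real coastline codes, and the thesis correlates wavelet transforms of the coded sequences rather than the raw curves.

```python
# Estimate the lag that best aligns two 1-D contour signatures via cross-correlation.
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 512)
reference = np.sin(t) + 0.3 * np.sin(3 * t)                      # synthetic reference signature
working = np.roll(reference, 37) + 0.05 * rng.standard_normal(t.size)  # shifted, noisy copy

# Remove the mean, then correlate at every lag ("full" mode covers all shifts).
corr = np.correlate(working - working.mean(), reference - reference.mean(), mode="full")
estimated_shift = corr.argmax() - (reference.size - 1)
print(estimated_shift)   # should recover a shift close to 37
```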
18

Development of an assured systems management model for environmental decision–making / Jacobus Johannes Petrus Vivier

Vivier, Jacobus Johannes Petrus January 2011 (has links)
The purpose of this study was to make a contribution towards decision-making in complex environmental problems, especially where data is limited and associated with a high degree of uncertainty. As a young scientist, I understood the value of science as a measuring and quantification tool and used to intuitively believe that science was exact and could provide undisputable answers. It was in 1997, during the Safety Assessments done at the Vaalputs National Radioactive Waste Repository, that my belief system was challenged. This occurred after numerous scientific studies had been done on the site since the early 1980s, yet with no conclusion as to how safe the site is in terms of radioactive waste disposal. The Safety Assessment process was developed by the International Atomic Energy Agency (IAEA) to transform the scientific investigations and data into decision-making information for the purposes of radioactive waste management. It was also during the Vaalputs investigations that I learned the value of lateral thinking. There were numerous scientists with doctorate and master's degrees who worked on the site, of whom I was one. One of the important requirements was to measure evaporation at the local weather station close to the repository. It was specifically important to measure evaporation as a controlling parameter in the unsaturated zone models. Evaporation was measured with an A-pan that is filled with water so that the losses can be measured. Vaalputs is a very dry place and water is scarce. The local weather station site was fenced off, but there was a problem in that the aardvark dug below the fence and drank the water in the A-pan, so that no measurements were possible. The solution from the scientists was to put the fence deeper into the ground. The aardvark did not find it hard to dig even deeper. The next solution was to put a second fence around the weather station, and again the aardvark dug below it to drink the water. It was then that Mr Robbie Schoeman, a technician, became aware of the problem and put a drinking water container outside the weather station fence for the aardvark, and the problem was solved at a fraction of the cost of the previous complex solutions. I still come into contact with the same thinking patterns that intuitively expect that the act of scientific investigation will provide decision-making information or even solve the problem. If the investigation provides more questions than answers, the quest is for more and more data on more detailed scales. There is a difference between problem characterization and solution identification. Problem characterization requires scientific and critical thinking, which is an important component, but it has to be incorporated with the solution identification process of creative thinking towards decision-making. I am a scientist at heart, but it was necessary to realise that apart from research, practical science must feed into a higher process, such as decision-making, to be able to make a practical difference. The process of compiling this thesis meant a lot to me: I initially thought of it as simply doing a PhD, and then it changed me, especially in the way I think. This was a life-changing process, which is good. As Jesus said in Matthew 3:2: And saying, Repent (think differently; change your mind, regretting your sins and changing your conduct), for the kingdom of heaven is at hand. / Thesis (Ph.D. (Geography and Environmental Studies))--North-West University, Potchefstroom Campus, 2011.
20

Sparsity-sensitive diagonal co-clustering algorithms for the effective handling of text data

Ailem, Melissa 18 November 2016 (has links)
Dans le contexte actuel, il y a un besoin évident de techniques de fouille de textes pour analyser l'énorme quantité de documents textuelles non structurées disponibles sur Internet. Ces données textuelles sont souvent représentées par des matrices creuses (sparses) de grande dimension où les lignes et les colonnes représentent respectivement des documents et des termes. Ainsi, il serait intéressant de regrouper de façon simultanée ces termes et documents en classes homogènes, rendant ainsi cette quantité importante de données plus faciles à manipuler et à interpréter. Les techniques de classification croisée servent justement cet objectif. Bien que plusieurs techniques existantes de co-clustering ont révélé avec succès des blocs homogènes dans plusieurs domaines, ces techniques sont toujours contraintes par la grande dimensionalité et la sparsité caractérisant les matrices documents-termes. En raison de cette sparsité, plusieurs co-clusters sont principalement composés de zéros. Bien que ces derniers soient homogènes, ils ne sont pas pertinents et doivent donc être filtrés en aval pour ne garder que les plus importants. L'objectif de cette thèse est de proposer de nouveaux algorithmes de co-clustering conçus pour tenir compte des problèmes liés à la sparsité mentionnés ci-dessus. Ces algorithmes cherchent une structure diagonale par blocs et permettent directement d'identifier les co-clusters les plus pertinents, ce qui les rend particulièrement efficaces pour le co-clustering de données textuelles. Dans ce contexte, nos contributions peuvent être résumées comme suit: Tout d'abord, nous introduisons et démontrons l'efficacité d'un nouvel algorithme de co-clustering basé sur la maximisation directe de la modularité de graphes. Alors que les algorithmes de co-clustering existants qui se basent sur des critères de graphes utilisent des approximations spectrales, l'algorithme proposé utilise une procédure d'optimisation itérative pour révéler les co-clusters les plus pertinents dans une matrice documents-termes. Par ailleurs, l'optimisation proposée présente l'avantage d'éviter le calcul de vecteurs propres, qui est une tâche rédhibitoire lorsque l'on considère des données de grande dimension. Ceci est une amélioration par rapport aux approches spectrales, où le calcul des vecteurs propres est nécessaire pour effectuer le co-clustering. Dans un second temps, nous utilisons une approche probabiliste pour découvrir des structures en blocs homogènes diagonaux dans des matrices documents-termes. Nous nous appuyons sur des approches de type modèles de mélanges, qui offrent de solides bases théoriques et une grande flexibilité qui permet de découvrir diverses structures de co-clusters. Plus précisément, nous proposons un modèle de blocs latents parcimonieux avec des distributions de Poisson sous contraintes. De façon intéressante, ce modèle comprend la sparsité dans sa formulation, ce qui le rend particulièrement adapté aux données textuelles. En plaçant l'estimation des paramètres de ce modèle dans le cadre du maximum de vraisemblance et du maximum de vraisemblance classifiante, quatre algorithmes de co-clustering ont été proposées, incluant une variante dure, floue, stochastique et une quatrième variante qui tire profit des avantages des variantes floue et stochastique simultanément. Pour finir, nous proposons un nouveau cadre de fouille de textes biomédicaux qui comprend certains algorithmes de co-clustering mentionnés ci-dessus. 
Ce travail montre la contribution du co-clustering dans une problématique réelle de fouille de textes biomédicaux. Le cadre proposé permet de générer de nouveaux indices sur les résultats retournés par les études d'association pan-génomique (GWAS) en exploitant les abstracts de la base de données PUBMED. (...) / In the current context, there is a clear need for Text Mining techniques to analyse the huge quantity of unstructured text documents available on the Internet. These textual data are often represented by sparse high-dimensional matrices where rows and columns represent documents and terms respectively. Thus, it would be worthwhile to simultaneously group these terms and documents into meaningful clusters, making this substantial amount of data easier to handle and interpret. Co-clustering techniques serve just this purpose. Although many existing co-clustering approaches have been successful in revealing homogeneous blocks in several domains, these techniques are still challenged by the high dimensionality and sparsity characteristics exhibited by document-term matrices. Due to this sparsity, several co-clusters are primarily composed of zeros. While homogeneous, these co-clusters are irrelevant and must be filtered out in a post-processing step to keep only the most significant ones. The objective of this thesis is to propose new co-clustering algorithms tailored to take these sparsity-related issues into account. The proposed algorithms seek a block-diagonal structure and allow the most useful co-clusters to be identified straightaway, which makes them especially effective for the text co-clustering task. Our contributions can be summarized as follows: First, we introduce and demonstrate the effectiveness of a novel co-clustering algorithm based on a direct maximization of graph modularity. While existing graph-based co-clustering algorithms rely on spectral relaxation, the proposed algorithm uses an iterative alternating optimization procedure to reveal the most meaningful co-clusters in a document-term matrix. Moreover, the proposed optimization has the advantage of avoiding the computation of eigenvectors, a task which is prohibitive when considering high-dimensional data. This is an improvement over spectral approaches, where the eigenvector computation is necessary to perform the co-clustering. Second, we use an even more powerful approach to discover block-diagonal structures in document-term matrices. We rely on mixture models, which offer strong theoretical foundations and considerable flexibility that makes it possible to uncover various specific cluster structures. More precisely, we propose a rigorous probabilistic model based on the Poisson distribution and the well-known Latent Block Model. Interestingly, this model includes the sparsity in its formulation, which makes it particularly effective for text data. Estimating this model's parameters under the Maximum Likelihood (ML) and the Classification Maximum Likelihood (CML) approaches, four co-clustering algorithms have been proposed: a hard, a soft, a stochastic and a fourth algorithm which leverages the benefits of both the soft and stochastic variants simultaneously. As a last contribution of this thesis, we propose a new biomedical text mining framework that includes some of the above-mentioned co-clustering algorithms. This work shows the contribution of co-clustering to a real biomedical text mining problem. The proposed framework is able to provide new clues about the results of genome-wide association studies (GWAS) by mining PUBMED abstracts. It has been tested on asthma and made it possible to assess the strength of associations between asthma genes reported in previous GWAS, as well as to discover new candidate genes likely associated with asthma. In a nutshell, while several text co-clustering algorithms already exist, their performance can be substantially increased if more appropriate models and algorithms are available. According to the extensive experiments performed on several challenging real-world text data sets, we believe that this thesis has served this objective well.
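As a hedged illustration of the block-diagonal co-clustering goal described above (using scikit-learn's spectral co-clustering rather than the modularity-based or Poisson latent block algorithms proposed in the thesis), the sketch below co-clusters a tiny sparse document-term matrix and reorders it to expose the diagonal blocks; the documents are invented.

```python
# Illustrative diagonal co-clustering of a sparse document-term matrix.
import numpy as np
from sklearn.cluster import SpectralCoclustering
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "asthma gene association study genome",
    "genome wide association asthma gene",
    "football match score league player",
    "league player transfer match score",
]
X = TfidfVectorizer().fit_transform(docs)    # sparse, mostly zero

model = SpectralCoclustering(n_clusters=2, random_state=0)
model.fit(X)

print("document clusters:", model.row_labels_)
print("term clusters:    ", model.column_labels_)

# Reordering rows and columns by cluster label exposes the diagonal blocks.
reordered = X[np.argsort(model.row_labels_)][:, np.argsort(model.column_labels_)]
print(reordered.toarray().round(2))
```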
