111 |
A methodology for improving computed individual regressions predictions. / Uma metodologia para melhorar predições individuais de regressões.
Matsumoto, Élia Yathie 23 October 2015
This research proposes a methodology for improving the individual prediction values computed by an existing regression model without changing either its parameters or its architecture. In other words, the aim is to achieve more accurate results by adjusting the calculated prediction values, without modifying or rebuilding the original regression model. The proposal is to adjust the prediction values using individual reliability estimates that indicate whether a single regression prediction is likely to produce an error that the user of the regression considers critical. The proposed method was tested in three sets of experiments using three different types of data. The first set worked with synthetically produced data, the second with cross-sectional data from the public UCI Machine Learning Repository, and the third with time series data from ISO-NE (Independent System Operator in New England). The experiments with synthetic data were performed to verify how the method behaves in controlled situations. Here the method improved predictions most on the cleaner artificially produced datasets, and the improvement progressively worsened as more random elements were added. The experiments with real data extracted from UCI and ISO-NE were done to investigate the applicability of the methodology in the real world. The proposed method was able to improve the regression prediction values in about 95% of the experiments with real data.
|
112 |
Indexation bio-inspirée pour la recherche d'images par similarité / Bio-inspired Indexing for Content-Based Image Retrieval
Michaud, Dorian 16 October 2018
Image retrieval is still a very active field of image processing, as the number of available image datasets continuously increases. One of the principal objectives of Content-Based Image Retrieval (CBIR) is to return the images most similar to a given query with respect to their visual content. Our work fits a very specific application context: indexing small expert image datasets with no prior knowledge of the images. Because of the complexity of the images, one of our contributions is to choose a set of effective descriptors from the literature and place them in direct competition. Two strategies are used to combine features: a psycho-visual one and a statistical one. In this context, we propose an unsupervised and adaptive framework based on the well-known bags of visual words and phrases models that selects relevant visual descriptors for each keypoint to construct a more discriminative image representation. Experiments show the value of this type of method at a time when convolutional neural networks dominate the literature. We also present a study, along with the results of our first tests, on semi-interactive retrieval aimed at improving the accuracy of CBIR systems by using the knowledge of expert users.
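As an illustration of the bag-of-visual-words representation that the abstract builds on, the following is a minimal sketch of the plain baseline only: each local descriptor is assigned to its nearest codebook entry and the image is summarized by the resulting word histogram. The function name `bovw_histogram`, the 2-D descriptors, and the three-word codebook are hypothetical; the adaptive per-keypoint descriptor selection proposed in the thesis is not reproduced here.

```python
from collections import Counter
from math import dist


def bovw_histogram(descriptors, codebook):
    """Assign each local descriptor to its nearest visual word and
    return the normalized word-frequency histogram of the image."""
    counts = Counter(
        min(range(len(codebook)), key=lambda k: dist(d, codebook[k]))
        for d in descriptors
    )
    total = len(descriptors) or 1  # avoid dividing by zero for an empty image
    return [counts[k] / total for k in range(len(codebook))]


# Hypothetical toy data: real descriptors have many more dimensions.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (4.0, 6.0)]
histogram = bovw_histogram(descriptors, codebook)
```

Two images can then be compared through any distance between their histograms; the choice of distance is left open in this sketch.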
|
114 |
Learning Algorithms Using Chance-Constrained Programs
Jagarlapudi, Saketha Nath 07 1900
This thesis explores Chance-Constrained Programming (CCP) in the context of learning. It is shown that chance-constraint approaches lead to improved algorithms for three important learning problems: classification with specified error rates, large dataset classification, and Ordinal Regression (OR). Using moments of training data, the CCPs are posed as Second Order Cone Programs (SOCPs). Novel iterative algorithms for solving the resulting SOCPs are also derived. Borrowing ideas from robust optimization theory, the proposed formulations are made robust to moment estimation errors.
A maximum margin classifier with specified false positive and false negative rates is derived. The key idea is to employ chance-constraints for each class which imply that the actual misclassification rates do not exceed the specified rates. The formulation is applied to the case of biased classification.
The problems of large dataset classification and ordinal regression are addressed by deriving formulations which employ chance-constraints for clusters in training data rather than constraints for each data point. Since the number of clusters can be substantially smaller than the number of data points, the resulting formulation size and number of inequalities are very small. Hence the formulations scale well to large datasets.
The scalable classification and OR formulations are extended to feature spaces, and the kernelized duals turn out to be instances of SOCPs with a single cone constraint. Exploiting this special structure, fast iterative solvers that outperform generic SOCP solvers are proposed. Compared to state-of-the-art learners, the proposed algorithms achieve a speedup as high as 10000 times when the specialized SOCP solvers are employed.
The proposed formulations involve second order moments of data and hence are susceptible to moment estimation errors. A generic way of making the formulations robust to such estimation errors is illustrated. Two novel confidence sets for moments are derived, and it is shown that when either of the confidence sets is employed, the robust formulations also yield SOCPs.
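To make the step from chance constraints to cone constraints concrete, here is a minimal sketch based on a standard distribution-free bound from the literature, which is assumed (not confirmed by the abstract) to be the kind of relaxation the thesis relies on. For any distribution with mean μ and covariance Σ, the constraint Prob(w·x + b ≥ 0) ≥ η is guaranteed when w·μ + b ≥ κ·sqrt(wᵀΣw) with κ = sqrt(η / (1 − η)), which is a second order cone constraint in (w, b). The function names and the toy numbers are hypothetical.

```python
from math import sqrt


def kappa(eta):
    """Multiplier from the distribution-free bound: sqrt(eta / (1 - eta))."""
    return sqrt(eta / (1.0 - eta))


def soc_constraint_holds(w, b, mean, cov, eta):
    """Check w.mean + b >= kappa(eta) * sqrt(w' cov w), using only the
    first and second moments (mean, cov) of one class or cluster."""
    n = len(w)
    margin = sum(w[i] * mean[i] for i in range(n)) + b
    spread = sqrt(sum(w[i] * cov[i][j] * w[j] for i in range(n) for j in range(n)))
    return margin >= kappa(eta) * spread


# Hypothetical toy moments: unit covariance, hyperplane x1 >= 0.
identity = [[1.0, 0.0], [0.0, 1.0]]
far_class_ok = soc_constraint_holds((1.0, 0.0), 0.0, (3.0, 0.0), identity, 0.8)
near_class_ok = soc_constraint_holds((1.0, 0.0), 0.0, (1.0, 0.0), identity, 0.8)
```

Because the check uses only moments, one such constraint can stand in for every point of a cluster, which is consistent with the scalability argument in the abstract.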
|
115 |
Comparando a saúde no Brasil com os países da OCDE: explorando dados de saúde pública / Comparing health in Brazil with the OECD countries: exploring public health data
Lima, Cecília Pessanha 30 March 2016
Healthcare authorities in Brazil produce a large amount of data on health services and their use. Appropriate treatment of this data with massive-data techniques enables the extraction of important information that can contribute to a better understanding of the Brazilian healthcare sector. Evaluating the performance of health systems through the analysis of routinely produced healthcare data has become a worldwide trend, and several countries already maintain monitoring programs based on indicators constructed from this type of data. In this context, the OECD (Organization for Economic Co-operation and Development), an international organization that evaluates the economic policies of its 34 member countries, has a biennial publication called Health at a Glance, which aims to compare the health systems of OECD member countries. Although Brazil is not a member, the OECD seeks to include it in the calculation of some indicators when the data is available, since it considers Brazil one of the largest economies that is not a member country. This study aims to construct and implement, based on the methodology of Health at a Glance 2015, the calculation in the Brazilian context of 22 indicators in the domain “Use of Health Services.” To develop the set of indicators, a wide search of the major national health databases was first carried out to assess data availability. The available data was then extracted using massive-data techniques, which were required because of the large volume of health data in Brazil. The datasets were extracted from three main sources of health billing data: SUS, private health insurance, and other sources of payment, such as public health insurance plans, DPVAT, and private payment.
This work has shown that health data publicly available in Brazil can be used to evaluate the performance of the Brazilian health system and to include Brazil in the international benchmark of OECD countries for the 22 indicators calculated. It also enabled a comparison, on the same set of indicators, between the public health sector in Brazil (SUS) and the private health insurance sector. In addition, the SUS indicators could be compared across states, underlining the differences in public-sector healthcare services among Brazilian states. The analysis of the indicators showed that, in general, Brazil performs below the average of the OECD countries, which indicates a need for efforts to achieve a higher level in the provision of the healthcare services assessed by these indicators. When SUS and private health insurance are separated, the private health sector performs close to the OECD average, whereas SUS is systematically and significantly below it. This highlights the inequalities in healthcare provision in Brazil between SUS and private health insurance. Finally, to improve the quality of the results, the study suggests using the TISS/ANS database as the source of information for the private health insurance sector, since TISS includes all the information exchanged between healthcare service providers and private health insurance operators for the payment of the services provided.
|
116 |
Dataset selection for aggregate model implementation in predictive data mining
Lutu, P.E.N. (Patricia Elizabeth Nalwoga) 15 November 2010
Data mining has become a commonly used method for the analysis of organisational data, for purposes of summarizing data in useful ways and identifying non-trivial patterns and relationships in the data. Given the large volumes of data that are collected by business, government, non-government and scientific research organizations, a major challenge for data mining researchers and practitioners is how to select relevant data for analysis in sufficient quantities, in order to meet the objectives of a data mining task. This thesis addresses the problem of dataset selection for predictive data mining. Dataset selection was studied in the context of aggregate modeling for classification. The central argument of this thesis is that, for predictive data mining, it is possible to systematically select many dataset samples and employ different approaches (different from current practice) to feature selection, training dataset selection, and model construction. When a large amount of information in a large dataset is utilised in the modeling process, the resulting models will have a high level of predictive performance and should be more reliable. Aggregate classification models, also known as ensemble classifiers, have been shown to provide a high level of predictive accuracy on small datasets. Such models are known to achieve a reduction in the bias and variance components of the prediction error of a model. The research for this thesis was aimed at the design of aggregate models and the selection of training datasets from large amounts of available data. The objectives for the model design and dataset selection were to reduce the bias and variance components of the prediction error for the aggregate models. Design science research was adopted as the paradigm for the research. Large datasets obtained from the UCI KDD Archive were used in the experiments. 
Two classification algorithms, See5 for classification tree modeling and K-Nearest Neighbour, were used in the experiments. The two methods of aggregate modeling that were studied are One-Vs-All (OVA) and positive-Vs-negative (pVn) modeling. While OVA is an existing method that has been used for small datasets, pVn is a new method of aggregate modeling, proposed in this thesis. Methods for feature selection from large datasets, and methods for training dataset selection from large datasets, for OVA and pVn aggregate modeling, were studied. The experiments on feature selection revealed that the use of many samples, robust measures of correlation, and validation procedures results in the reliable selection of relevant features for classification. A new algorithm for feature subset search, based on the decision rule-based approach to heuristic search, was designed, and the performance of this algorithm was compared to two existing algorithms for feature subset search. The experimental results revealed that the new algorithm makes better decisions for feature subset search. The information provided by a confusion matrix was used as a basis for the design of OVA and pVn base models, which are combined into one aggregate model. A new construct called a confusion graph was used in conjunction with new algorithms for the design of pVn base models. A new algorithm for combining base model predictions and resolving conflicting predictions was designed and implemented. Experiments to study the performance of the OVA and pVn aggregate models revealed that the aggregate models provide a high level of predictive accuracy compared to single models. Finally, theoretical models to depict the relationships between the factors that influence feature selection and training dataset selection for aggregate models are proposed, based on the experimental results. / Thesis (PhD)--University of Pretoria, 2010. / Computer Science / unrestricted
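The One-Vs-All scheme named in the abstract can be sketched as follows, as a generic illustration rather than the thesis's implementation. One base model per class scores how strongly an instance belongs to that class, and the aggregate model returns the class whose base model is most confident. Taking the highest score is a simple stand-in for the thesis's own conflict-resolution algorithm, which the abstract does not describe; the class centres and scoring functions are hypothetical.

```python
def ova_predict(base_models, x):
    """Combine one-vs-all base models: each maps an instance to a score for
    'x belongs to my class'; the class with the highest score wins."""
    scores = {label: model(x) for label, model in base_models.items()}
    return max(scores, key=scores.get)


# Hypothetical base models: score is higher the closer x is to a class centre.
centres = {"low": 0.0, "mid": 5.0, "high": 10.0}
base_models = {
    label: (lambda x, c=centre: -abs(x - c)) for label, centre in centres.items()
}

prediction_a = ova_predict(base_models, 4.2)
prediction_b = ova_predict(base_models, 9.0)
```

In practice each base model would be a trained binary classifier (for example a classification tree) rather than a distance to a fixed centre.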
|
117 |
Trasování objektu v reálném čase / Visual Object Tracking in Realtime
Kratochvíla, Lukáš January 2019
Tracking a generic object in real time on a resource-constrained device is difficult. Many algorithms addressing this problem already exist, and this thesis surveys them. Various approaches to the problem are discussed, including deep learning. Object representations, datasets, and evaluation metrics are introduced. Many tracking algorithms are presented; eight of them are implemented and evaluated on the VOT dataset.
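The abstract does not name the evaluation metrics it covers. As one common example, tracking benchmarks such as VOT measure accuracy through the overlap between the predicted and ground-truth regions. The sketch below computes intersection over union for axis-aligned boxes; the function name and the `(x_min, y_min, x_max, y_max)` box format are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as
    (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical predicted and ground-truth boxes for a single frame.
overlap = iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```

Averaging this value over the frames of a sequence gives a per-sequence accuracy figure.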
|
118 |
From data collection to electric grid performance : How can data analytics support asset management decisions for an efficient transition toward smart grids?
Koziel, Sylvie Evelyne January 2021
Physical asset management in the electric power sector encompasses the scheduling of the maintenance and replacement of grid components, as well as decisions about investments in new components. Data plays a crucial role in these decisions, and its importance is increasing with the transformation of the power system and its evolution toward smart grids. This thesis deals with questions related to data management as a way to improve the performance of asset management decisions. Data management is defined as the collection, processing, and storage of data; here, the focus is on collection and processing. First, the influence of data on the decisions related to assets is explored. In particular, the impacts of data quality on the replacement time of a generic component (a line, for example) are quantified using a scenario approach and failure modeling. Decisions based on data of poor quality are most likely not optimal: in this case, faulty data about the age of the component leads to non-optimal scheduling of its replacement. The corresponding costs are calculated for different levels of data quality. A framework has been developed to evaluate the amount of investment needed in data quality improvement, and its profitability. Then, ways to use available data efficiently are investigated. In particular, the possibility of using machine learning algorithms on real-world datasets is examined. New approaches are developed that use only available data for component ranking and failure prediction, two important concepts often used to prioritize components and schedule maintenance and replacement. A large part of the scientific literature assumes that the future of smart grids lies in big data collection and in developing algorithms to process huge amounts of data.
In contrast, this work helps show how automation and machine learning techniques can actually be used to reduce the need to collect huge amounts of data, by using the available data more efficiently. One major challenge is the trade-off needed between the precision of modeling results and the costs of data management.
|
119 |
Estimation de pose 2D par réseau convolutif / 2D pose estimation with a convolutional network
Huppé, Samuel 04 1900
Magic: The Gathering is an imperfect-information, stochastic, collectible card game invented by Richard Garfield in 1993. The goal of this project is to propose a machine learning pipeline capable of detecting and localising Magic cards within a typical image from tournaments of the game. This is a 2D pose problem with four degrees of freedom, namely translation in x and y, rotation, and scale, in a context where cards can be superimposed on one another. We tackle this problem with deep learning, using a combination of two separate neural networks that together identify the cards and regress these parameters. Our final pipeline can handle real-world images and gives the poses of cards within an image with a very good degree of precision. In the course of this project, we developed a method of realistic synthetic data generation to train both models, and we show that the synthetic dataset is realistic enough for the networks to generalize to real images. The results show that our pose subnetwork is able to predict position within half a pixel, rotation within one degree, and scale within 2 percent.
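The four pose parameters named in the abstract (translation in x and y, rotation, and scale) define a 2D similarity transform. The sketch below shows how such a pose maps the corners of a card template into image coordinates; the function name, the template corner values, and the pose numbers are hypothetical, and the sketch covers only the geometry, not the networks that regress the parameters.

```python
from math import cos, radians, sin


def apply_pose(points, tx, ty, angle_deg, scale):
    """Map template points into the image with a 2D similarity transform:
    rotate by angle_deg, scale uniformly, then translate by (tx, ty)."""
    c, s = cos(radians(angle_deg)), sin(radians(angle_deg))
    return [
        (scale * (c * x - s * y) + tx, scale * (s * x + c * y) + ty)
        for x, y in points
    ]


# Hypothetical card template centred at the origin (arbitrary units).
template_corners = [(-0.5, -0.7), (0.5, -0.7), (0.5, 0.7), (-0.5, 0.7)]
image_corners = apply_pose(template_corners, tx=120.0, ty=80.0, angle_deg=90.0, scale=100.0)
```

Comparing predicted and ground-truth corners obtained this way is one simple way to express pose error in pixels.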
|
120 |
Efficient Query Processing for Dynamically Changing Datasets
Idris, Muhammad, Ugarte, Martín, Vansummeren, Stijn, Voigt, Hannes, Lehner, Wolfgang 11 August 2022
The ability to efficiently analyze changing data is a key requirement of many real-time analytics applications. Traditional approaches to this problem were developed around the notion of Incremental View Maintenance (IVM), and are based either on the materialization of subresults (to avoid their recomputation) or on the recomputation of subresults (to avoid the space overhead of materialization). Both techniques are suboptimal: instead of materializing results and subresults, one may also maintain a data structure that supports efficient maintenance under updates and from which the full query result can quickly be enumerated. In two previous articles, we have presented algorithms for dynamically evaluating queries that are easy to implement, efficient, and can be naturally extended to evaluate queries from a wide range of application domains. In this paper, we discuss our algorithm and its complexity, explaining the main components behind its efficiency. Finally, we show experiments that compare our algorithm to a state-of-the-art (Higher-order) IVM engine, as well as to a prominent complex event recognition engine. Our approach outperforms the competitor systems by up to two orders of magnitude in processing time, and one order in memory consumption.
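The idea of maintaining a compact data structure instead of a materialized result can be illustrated with a toy two-relation join, R(a, b) joined with S(b, c). This is a simplified sketch under stated assumptions, not the authors' algorithm: it keeps only per-key indexes, applies each insert or delete in constant expected time, and enumerates the join result from the indexes on demand. The class and method names are hypothetical.

```python
from collections import defaultdict


class DynamicJoin:
    """Maintain R(a, b) JOIN S(b, c) under updates without storing the result."""

    def __init__(self):
        self.r = defaultdict(set)  # join key b -> set of a values
        self.s = defaultdict(set)  # join key b -> set of c values

    def update(self, relation, key, value, insert=True):
        """Apply one insertion or deletion to relation 'R' or 'S'."""
        index = self.r if relation == "R" else self.s
        if insert:
            index[key].add(value)
        else:
            index[key].discard(value)
            if not index[key]:
                del index[key]  # drop empty keys so enumeration skips them

    def results(self):
        """Enumerate the current join result directly from the indexes."""
        for b in self.r.keys() & self.s.keys():
            for a in self.r[b]:
                for c in self.s[b]:
                    yield (a, b, c)


join = DynamicJoin()
join.update("R", 1, "a1")
join.update("S", 1, "c1")
join.update("S", 2, "c2")
after_inserts = sorted(join.results())
join.update("R", 1, "a1", insert=False)
after_delete = sorted(join.results())
```

The memory used is proportional to the input relations rather than to the join output, which is the trade-off the abstract contrasts with materialization-based view maintenance.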
|