Global ETD Search

41	Efficient Algorithms for Mining Data Streams Boedihardjo, Arnold Priguna 06 September 2010 Data streams are ordered sets of values that are fast, continuous, mutable, and potentially unbounded. Examples of data streams include the pervasive time series that span domains such as finance, medicine, and transportation. Mining data streams requires approaches that are efficient, adaptive, and scalable. For several stream mining tasks, knowledge of the data's probability density function (PDF) is essential to deriving usable results. Providing an accurate model for the PDF benefits a variety of stream mining applications, and its successful development can have a far-reaching impact on the general discipline of stream analysis. Therefore, this research focuses on the construction of efficient and effective approaches for estimating the PDF of data streams. In this work, kernel density estimators (KDEs) are developed that satisfy the stringent computational stipulations of data streams, model unknown and dynamic distributions, and enhance the estimation quality of complex structures. Contributions of this work include: (1) theoretical development of the local region based KDE; (2) construction of a local region based estimation algorithm; (3) design of a generalized local region approach that can be applied to any global bandwidth KDE to enhance estimation accuracy; and (4) application extension of the local region based KDE to multi-scale outlier detection. Theoretical development includes the formulation of the local region concept to effectively approximate the computationally intensive adaptive KDE. This work also analyzes key theoretical properties of the local region based approach, which include (amongst others) its expected performance, an alternative local region construction criterion, and its robustness under evolving distributions. 
Algorithmic design includes the development of a specific estimation technique that reduces the time/space complexities of the adaptive KDE. In order to accelerate mining tasks such as outlier detection, an integrated set of optimizations is proposed for estimating multiple density queries. Additionally, the local region concept is extended to an efficient algorithmic framework which can be applied to any global bandwidth KDE. The combined solution can significantly improve estimation accuracy while retaining overall linear time/space costs. As an application extension, an outlier detection framework is designed which can effectively detect outliers within multiple data scale representations. / Ph. D. Data Mining Machine learning Kernel Density Estimation Outlier Detection Data Stream
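As a rough illustration of the adaptive KDE that the abstract above describes as computationally intensive, the following sketch compares a fixed (global) bandwidth Gaussian KDE with the classical two-stage adaptive variant, in which each sample gets its own bandwidth from a pilot estimate. It is not the local region algorithm of the thesis. The function names, the sensitivity parameter alpha, the rule-of-thumb bandwidth, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kde_fixed(samples, queries, h):
    """Fixed-bandwidth Gaussian KDE evaluated at each query point."""
    z = (queries[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

def gaussian_kde_adaptive(samples, queries, h, alpha=0.5):
    """Two-stage adaptive KDE: bandwidths shrink where the pilot density is high."""
    pilot = gaussian_kde_fixed(samples, samples, h)   # stage 1: pilot estimate
    g = np.exp(np.mean(np.log(pilot)))                # geometric mean of pilot values
    local_h = h * (pilot / g) ** (-alpha)             # stage 2: one bandwidth per sample
    z = (queries[:, None] - samples[None, :]) / local_h[None, :]
    k = np.exp(-0.5 * z**2) / (local_h[None, :] * np.sqrt(2 * np.pi))
    return k.mean(axis=1)

# Hypothetical data: a narrow mode next to a broad one.
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(0.0, 0.2, 300), rng.normal(4.0, 1.5, 300)])
grid = np.linspace(-6.0, 14.0, 2001)
h = 1.06 * samples.std() * len(samples) ** (-0.2)     # rule-of-thumb global bandwidth
fixed = gaussian_kde_fixed(samples, grid, h)
adaptive = gaussian_kde_adaptive(samples, grid, h)
```

Both estimators cost O(n) kernel evaluations per query, and the adaptive one additionally needs the O(n^2) pilot pass, which hints at why an approximation with linear overall cost is attractive for streams.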
42	Driver Behaviour Modelling: Travel Prediction Using Probability Density Function Uglanov, Alexey, Kartashev, K., Campean, Felician, Doikin, Aleksandr, Abdullatif, Amr R.A., Angiolini, E., Lin, C., Zhang, Q. 10 September 2021 This paper outlines the current challenges of driver behaviour modelling for real-world applications and presents a novel method that identifies usage patterns in order to predict upcoming journeys in a probabilistic sense. The primary aim is to establish similarity between the observed behaviour of drivers, making it possible to cluster them and deploy control strategies based on contextual intelligence and a data-driven approach. The proposed approach uses the probability density function (PDF) obtained by kernel density estimation (KDE) to predict the type of the upcoming journey, expressed as duration and distance. The mathematical formulation and algorithmic procedure of the proposed method are described in detail, and case study examples with data visualisation are given to validate the algorithm in simulation. / aiR-FORCE project, funded as Proof of Concept by the Institute of Digital Engineering Driver behaviour modelling Probability density function Kernel density estimation Probabilistic predictions
43	Anthropogenic effects on site use and temporal patterns of terrestrial mammals in Harenna Forest, Ethiopia Gichuru, Phillys Njambi 22 March 2022 There has been little research comprehensively documenting wildlife species in Harenna Forest within the Bale Mountains National Park of Ethiopia. This area is one of the few remaining afro-alpine biodiversity hotspots and is home to numerous endemic plants and animals and offers socio-economic benefits to the neighboring communities. Human population pressure, weak land protection policies, and uncertain land tenure rights have led to increases in farmland for subsistence and coffee farming, livestock grazing, and reduction of afro-alpine, shrubland and grassland habitats. Given these challenges, I used 48 camera trap stations to produce an inventory of wildlife species and to determine factors influencing occupancy (i.e., habitat use), detection, and temporal activity and overlap. I recorded 26 terrestrial and arboreal mammalian species and I had sufficient data to model occupancy for 13 species and temporal activity for 14 species. Occupancy and detection were generally higher for herbivores and omnivores (occupancy: 0.28-0.97; detection: 0.1-0.54) than carnivores (occupancy: 0.31-0.80; detection: 0.04-0.18). I found more evidence of positive anthropogenic impacts on herbivore and omnivore occupancy than negative, while detection was influenced by habitat or landscape features, rather than by humans. Carnivore occupancy was largely unaffected by anthropogenic or habitat variables, but detection was strongly, and mostly positively, influenced by anthropogenic impacts. Temporal activity analyses revealed that, for herbivores and omnivores, only tree hyraxes (Dendrohyrax arboreus) and crested porcupines (Hystrix cristata) were nocturnal, Menelik bushbucks (Tragelaphus scriptus meneliki) were crepuscular, and the remaining species ranged from diurnal to cathemeral. 
Neither similar body size nor similar diet affected overlap between species pairs. However, overlap with human temporal activity was low for Menelik bushbucks (Δ=0.45) and common duikers (Sylvicapra grimmia) appeared to become less active at stations with high human use. For carnivores, leopards (Panthera pardus) and honey badgers (Mellivora capensis) were crepuscular, and the remaining species were nocturnal. I found evidence that carnivores overlapped less when they were more similar in body size to other carnivores (average Δ=0.67-0.71) compared to species more dissimilar in body size (average Δ=0.75), although there was variation across species. In general, carnivores overlapped much less with humans (average Δ=0.20) than did herbivores (average Δ=0.52) and omnivores (average Δ=0.43). Spotted hyenas (Crocuta crocuta), in particular, appeared to alter activity to reduce overlap with humans. This study provides baseline information on presence, distribution, and activity of large- and medium-sized terrestrial and arboreal mammals in an understudied biodiversity hotspot. My findings are concerning for biodiversity conservation as rare and endangered species (e.g., mountain nyalas (Tragelaphus buxtoni), Ethiopian wolves (Canis simensis)) were rarely or never photographed, and larger carnivores (e.g., lions (Panthera leo), leopards, jackals), generally had low capture rates. The species with higher capture rates, occupancy, and activity tended to be those that can tolerate or take advantage of human activity and disturbance. Species sensitive to human disturbance eventually may be lost unless measures can be put in place to reduce human impacts. This baseline knowledge is important for future studies examining trends in mammalian wildlife populations, such as site extinction and colonization, or changes in overlap with humans, in a landscape that is continuing to experience human-caused, landscape change. 
/ Master of Science / Harenna forest, which is located in Bale Mountains National Park, Ethiopia is an important habitat to both wildlife and people. However, it faces a number of challenges as a result of population growth leading to increased coffee farming and livestock grazing resulting in reduced habitat for wildlife species. I used 48 cameras located across the forest to record presence of terrestrial mammals and document their distribution and daily activity across the landscape. I also used data such as vegetation indices, elevation, and distances to human-disturbed areas to determine what influenced wildlife species. Cameras recorded 26 species of mammals. I had enough data to determine distribution for 13 species and daily activity for 14 species. I found that presence across the landscape and activity of herbivores and omnivores was generally higher than that of carnivores. Additionally, I found that human activity or disturbance often had a positive influence on herbivore and omnivore distribution, but my ability to detect species in camera traps was primarily influenced by habitat or landscape features. Carnivore distribution on the landscape was not influenced much by humans or habitat, but their detectability was often positively influenced by presence of humans. In addition to daily activity, I also analyzed overlap in activity between species pairs and between species and humans, to determine whether wildlife changed their temporal activity to overlap less with similar sized competitors or in response to high human use. For herbivores and omnivores, I found that tree hyraxes and crested porcupines were active at night, Menelik's bushbucks were active at sunrise and sunset, and cape bushbucks, common duiker, olive baboon, bushpig, and giant forest hogs were active either during the day or throughout the day and night. 
I found little evidence that the herbivores or omnivores avoided each other temporally and only the Menelik bushbuck and duiker appeared to avoid humans. For carnivores, I found that leopards and honey badgers were active early morning and evening, and the common genet, African civet, white-tailed mongoose, and spotted hyenas were all active at night only. Carnivores generally overlapped less with humans than herbivores and omnivores. I found some evidence that carnivores more similar in body size had lower temporal overlap with each other and that spotted hyaenas appeared to avoid activity during times of day when humans were active. My study not only provides baseline information on terrestrial and arboreal mammals present in Harenna forest, Ethiopia, but is also necessary for understanding how wildlife species use the landscape and particularly how presence of humans influences wild animal behavior. My findings are concerning for biodiversity conservation because I had few to no photographs, respectively, of the endangered mountain nyala and Ethiopian wolf. In fact, most of the species with a wide distribution on the landscape, or with high activity, were common or smaller species that are tolerant of, or could take advantage of, human disturbance. Without concerted effort to curtail the current landscape change caused by humans, the area is likely to lose species less tolerant of humans, and biodiversity will ultimately decline. anthropogenic detection camera-traps Ethiopia Harenna Forest occupancy carnivores Kernel Density Estimation (KDE) overlap ungulates
44	Relating forced climate change to natural variability and emergent dynamics of the climate-economy system Kellie-Smith, Owen January 2010 This thesis is in two parts. The first part considers a theoretical relationship between the natural variability of a stochastic model and its response to a small change in forcing. Over a large enough scale, both the real climate and a climate model are characterised as stochastic dynamical systems. The dynamics of the systems are encoded in the probabilities that the systems move from one state into another. When the systems’ states are discretised and listed, then transition matrices of all these transition probabilities may be formed. The responses of the systems to a small change in forcing are expanded in terms of the eigenfunctions and eigenvalues of the Fokker-Planck equations governing the systems’ transition densities, which may be estimated from the eigenvalues and eigenvectors of the transition matrices. Smoothing the data with a Gaussian kernel improves the estimate of the eigenfunctions, but not the eigenvalues. The significance of differences in two systems’ eigenvalues and eigenfunctions is considered. Three time series from HadCM3 are compared with corresponding series from ERA-40 and the eigenvalues derived from the three pairs of series differ significantly. The second part analyses a model of the coupled climate-economic system, which suggests that the pace of economic growth needs to be reduced and the resilience to climate change needs to be increased in order to avoid a collapse of the human economy. The model condenses the climate-economic system into just three variables: a measure of human wealth, the associated accumulation of greenhouse gases, and the consequent level of global warming. Global warming is assumed to dictate the pace of economic growth. 
Depending on the sensitivity of economic growth to global warming, the model climate-economy system either reaches an equilibrium or oscillates in century-scale booms and busts. 330.015195
45	Efficacité, généricité et praticabilité de l'attaque par information mutuelle utilisant la méthode d'estimation de densité par noyau / Efficiency, genericity and practicability of Kernel-based mutual information analysis Carbone, Mathieu 16 March 2015 Nowadays, Side-Channel Analysis (SCA) attacks are easy to implement yet powerful against cryptographic implementations, and they pose a serious threat to the security of cryptosystems. Indeed, the execution of a cryptographic algorithm unavoidably leaks information about the data manipulated internally by the cryptosystem through side channels (time, temperature, power consumption, electromagnetic emanations, etc.). Some of these data are sensitive (they depend on the secret key), so an attacker can exploit the leakage to recover the key. One of the most important SCA steps for an adversary is to quantify the dependency between the measured side-channel leakage and an assumed leakage model using a statistical tool, also called a distinguisher, in order to find an estimate of the secret key. In the SCA literature, a plethora of distinguishers has been proposed. 
This thesis focuses on Mutual Information (MI) based attacks, the so-called Mutual Information Analysis (MIA), and proposes to close the gap left by a major practical issue: estimating the MI index, which itself requires estimating the underlying distributions. Investigations are conducted using a popular statistical technique for estimating the underlying density with minimal assumptions: Kernel Density Estimation (KDE). First, a bandwidth selection scheme based on an adaptivity criterion is proposed. This criterion is specific to SCA. As a result, an in-depth analysis is conducted in order to provide a guideline for making MIA efficient and generic with respect to this tuning hyperparameter, and also to establish which attack context (connected to the statistical moment of leakage) is most favorable to MIA. Then, we address another issue of kernel-based MIA, its computational burden, through a so-called Dual-Tree algorithm allowing fast evaluations of 'pair-wise' kernel functions. We also showed experimentally that MIA run in the frequency domain is effective and fast when combined with an accurate frequency leakage model. Additionally, we suggested an extension of an existing method to detect leakage embedded in higher-order statistical moments. Side Channel Attacks Mutual Information Kernel Density Estimation
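To make the role of density estimation in MIA concrete, here is a minimal sketch of a KDE-based mutual information estimate between a continuous leakage value and a discrete leakage-model class. It uses a single fixed bandwidth and a plain resubstitution average. It does not implement the bandwidth selection criterion or the Dual-Tree acceleration of the thesis. The function names, the bandwidth value, and the synthetic leakage are assumptions for illustration only.

```python
import numpy as np

def kde(samples, queries, h):
    """Fixed-bandwidth Gaussian KDE of `samples`, evaluated at `queries`."""
    z = (queries[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z**2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

def kde_mutual_information(x, y, h):
    """Resubstitution estimate of I(X;Y) in nats for continuous x and discrete y."""
    marginal = kde(x, x, h)                      # estimate of f(x_i)
    conditional = np.empty_like(x)
    for label in np.unique(y):
        mask = y == label
        conditional[mask] = kde(x[mask], x[mask], h)   # estimate of f(x_i | y_i)
    return float(np.mean(np.log(conditional / marginal)))

# Hypothetical leakage: one trace value per measurement, two model classes.
rng = np.random.default_rng(1)
n = 2000
labels = rng.integers(0, 2, size=n)
leak_dependent = 2.0 * labels + rng.normal(0.0, 0.5, size=n)   # depends on the class
leak_independent = rng.normal(0.0, 0.5, size=n)                # unrelated to the class
mi_dependent = kde_mutual_information(leak_dependent, labels, h=0.2)
mi_independent = kde_mutual_information(leak_independent, labels, h=0.2)
```

In an attack setting the class labels would come from a leakage model evaluated under each key hypothesis, and the hypothesis with the largest estimate would be retained. Every evaluation touches all pairs of samples, which is the quadratic cost that fast kernel summation methods aim to reduce.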
46	Resampling Evaluation of Signal Detection and Classification : With Special Reference to Breast Cancer, Computer-Aided Detection and the Free-Response Approach Bornefalk Hermansson, Anna January 2007 The first part of this thesis is concerned with trend modelling of breast cancer mortality rates. By using an age-period-cohort model, the relative contributions of period and cohort effects are evaluated once the unquestionable existence of the age effect is controlled for. The results of such modelling give indications in the search for explanatory factors. While this type of modelling is usually performed with 5-year period intervals, the use of 1-year period data, as in Paper I, may be more appropriate. The main theme of the thesis is the evaluation of the ability to detect signals in x-ray images of breasts. Early detection is the most important tool to achieve a reduction in breast cancer mortality rates, and computer-aided detection systems can be an aid for the radiologist in the diagnosing process. The evaluation of computer-aided detection systems includes the estimation of distributions. One way of obtaining estimates of distributions when no assumptions are at hand is kernel density estimation, or the adaptive version thereof that smooths to a greater extent in the tails of the distribution, thereby reducing spurious effects caused by outliers. The technique is described in the context of econometrics in Paper II and then applied together with the bootstrap in the breast cancer research area in Papers III-V. Here, estimates of the sampling distributions of different parameters are used in a new model for free-response receiver operating characteristic (FROC) curve analysis. 
Compared to earlier work in the field, this model benefits from the advantage of not assuming independence of detections in the images, and in particular, from the incorporation of the sampling distribution of the system's operating point. Confidence intervals obtained from the proposed model with different approaches with respect to the estimation of the distributions and the confidence interval extraction methods are compared in terms of coverage and length of the intervals by simulations of lifelike data. Statistics breast cancer trend modelling FROC confidence intervals threshold independence bootstrap kernel density estimation mammography computer-aided detection
47	A Framework for Determining the Reliability of Nanoscale Metallic Oxide Semiconductor (MOS) Devices Otieno, Wilkistar 31 December 2010 An increase in worldwide investments during the past several decades has propelled scientific breakthroughs in nanoscience and technology research to new and exciting levels. To ensure that these discoveries lead to commercially viable products, it is important to address some of the fundamental engineering and scientific challenges related to nanodevices. Due to the centrality of reliability to product integrity, nanoreliability requires critical analysis and understanding to ensure long-term sustainability of nanodevices and systems. In this study, we construct a reliability framework for nanoscale dielectric films used in Metallic Oxide Semiconductor (MOS) devices. The successful fabrication and incorporation of metallic oxides in MOS devices was a major milestone in the electronics industry. However, with the progressive scaling of transistors, the dielectric dimension has progressively decreased to about 2nm. This reduction has had severe reliability implications and challenges including: short channeling effects and leakage currents due to quantum-mechanical tunneling which leads to increased power dissipation and eventually temperature-related gate degradation. We develop a framework to characterize and model reliability of recently developed gate dielectrics of Si-MOS devices. We accomplish this through the following research steps: (i) the identification of the failure mechanisms of Si-based high-k gates (stress, material, environmental), (ii) developing a 3-D failure simulation as a way to acquire simulated failure data, (iii) the identification of the dielectric failure probability structure using both kernel estimation and nonparametric Bayesian schemes so as to establish the life profile of high-k gate dielectric. 
The goal is to eventually develop the appropriate failure extrapolation model to relate the reliability at the test conditions to the reliability at normal use conditions. This study provides modeling and analytical clarity regarding the inherent failure characteristics and hence the reliability of metal/high-k gate stacks of Si-based substrates. In addition, this research will assist manufacturers to optimally characterize, predict and manage the reliability of metal high-k gate substrates. The proposed reliability framework could be extended to other thin film devices and eventually to other nanomaterials and devices. nanoreliability dielectric accelerated degradation kernel density estimates bayesian density estimates American Studies Arts and Humanities Industrial Engineering Other Environmental Sciences Sustainability
48	STATISTICS IN THE BILLERA-HOLMES-VOGTMANN TREESPACE Weyenberg, Grady S. 01 January 2015 This dissertation is an effort to adapt two classical non-parametric statistical techniques, kernel density estimation (KDE) and principal components analysis (PCA), to the Billera-Holmes-Vogtmann (BHV) metric space for phylogenetic trees. This adaption gives a more general framework for developing and testing various hypotheses about apparent differences or similarities between sets of phylogenetic trees than currently exists. For example, while the majority of gene histories found in a clade of organisms are expected to be generated by a common evolutionary process, numerous other coexisting processes (e.g. horizontal gene transfers, gene duplication and subsequent neofunctionalization) will cause some genes to exhibit a history quite distinct from the histories of the majority of genes. Such “outlying” gene trees are considered to be biologically interesting and identifying these genes has become an important problem in phylogenetics. The R software package kdetrees, developed in Chapter 2, contains an implementation of the kernel density estimation method. The primary theoretical difficulty involved in this adaptation concerns the normalization of the kernel functions in the BHV metric space. This problem is addressed in Chapter 3. In both chapters, the software package is applied to both simulated and empirical datasets to demonstrate the properties of the method. A few first theoretical steps in adaption of principal components analysis to the BHV space are presented in Chapter 4. It becomes necessary to generalize the notion of a set of perpendicular vectors in Euclidean space to the BHV metric space, but there is some ambiguity about how best to proceed. We show that convex hulls are one reasonable approach to the problem. 
The Nye-PCA algorithm provides a method of projecting onto arbitrary convex hulls in BHV space, providing the core of a modified PCA-type method. Phylogenetic trees Non-parametric statistics Outlier Detection Kernel Density Estimation Principal Components Analysis Applied Statistics Computational Biology Statistical Methodology
49	Stochastic modelling using large data sets : applications in ecology and genetics Coudret, Raphaël 16 September 2013 There are two main parts in this thesis. The first one concerns valvometry, which is here the study of the distance between both parts of the shell of an oyster, over time. The health status of oysters can be characterized using valvometry in order to obtain insights about the quality of their environment. We consider that a renewal process with four states underlies the behaviour of the studied oysters. Such a hidden process can be retrieved from a valvometric signal by assuming that some probability density function linked with this signal is bimodal. We then compare several estimators which take this assumption into account, including kernel density estimators. In another chapter, we compare several regression approaches aimed at analysing transcriptomic data. To understand which explanatory variables have an effect on gene expressions, we apply a multiple testing procedure to these data, through the linear model FAMT. The SIR method may find nonlinear relations in such a context. It is, however, more commonly used when the response variable is univariate. A multivariate version of SIR was therefore developed. Procedures to measure gene expressions can be expensive, so the sample size n of the corresponding datasets is often small. That is why we also studied SIR when n is less than the number of explanatory variables p. Kernel density estimator Multiple testing Renewal process Sliced inverse regression Transcriptomics Valvometry
50	An Analysis Tool for Flight Dynamics Monte Carlo Simulations Restrepo, Carolina 1982- 16 December 2013 Spacecraft design is inherently difficult due to the nonlinearity of the systems involved as well as the expense of testing hardware in a realistic environment. The number and cost of flight tests can be reduced by performing extensive simulation and analysis work to understand vehicle operating limits and identify circumstances that lead to mission failure. A Monte Carlo simulation approach that varies a wide range of physical parameters is typically used to generate thousands of test cases. Currently, the data analysis process for a fully integrated spacecraft is mostly performed manually on a case-by-case basis, often requiring several analysts to write additional scripts in order to sort through the large data sets. There is no single method that can be used to identify these complex variable interactions in a reliable and timely manner as well as be applied to a wide range of flight dynamics problems. This dissertation investigates the feasibility of a unified, general approach to the process of analyzing flight dynamics Monte Carlo data. The main contribution of this work is the development of a systematic approach to finding and ranking the most influential variables and combinations of variables for a given system failure. Specifically, a practical and interactive analysis tool that uses tractable pattern recognition methods to automate the analysis process has been developed. The analysis tool has two main parts: the analysis of individual influential variables and the analysis of influential combinations of variables. This dissertation describes in detail the two main algorithms used: kernel density estimation and nearest neighbors. 
Both are non-parametric density estimation methods that are used to analyze hundreds of variables and combinations thereof to provide an analyst with insightful information about the potential cause for a specific system failure. Examples of dynamical systems analysis tasks using the tool are provided. spacecraft design nearest neighbors kernel density estimation guidance, navigation, and control pattern recognition data analysis Monte Carlo simulation
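The two non-parametric density estimators named in the abstract above can be contrasted with a short one-dimensional sketch. This is only an illustration of the general techniques, not the analysis tool of the dissertation. The function names, the bandwidth h, the neighbor count k, and the synthetic data are assumptions.

```python
import numpy as np

def kde_density(samples, queries, h):
    """Gaussian KDE: average of kernels of fixed width h centered on the samples."""
    z = (queries[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z**2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

def knn_density(samples, queries, k):
    """k-nearest-neighbor density in 1-D: k points fall in an interval of length 2 * r_k."""
    dists = np.abs(queries[:, None] - samples[None, :])
    r_k = np.sort(dists, axis=1)[:, k - 1]     # distance to the k-th nearest sample
    return k / (len(samples) * 2.0 * r_k)

# Hypothetical data: one simulated variable drawn from a standard normal.
rng = np.random.default_rng(2)
samples = rng.normal(0.0, 1.0, 2000)
queries = np.array([0.0, 3.0])                 # one point in the bulk, one in the tail
kde_values = kde_density(samples, queries, h=0.3)
knn_values = knn_density(samples, queries, k=200)
```

The KDE fixes the window width and counts (weighted) points, while the nearest-neighbor estimate fixes the count and lets the window grow, so the latter adapts automatically in sparse regions at the price of heavier tails.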

Search results