351
Bi-filtration and stability of TDA mapper for point cloud data. Bungula, Wako Tasisa. 01 August 2019.
TDA mapper is an algorithm used to visualize and analyze large datasets. It is applied to a dataset X equipped with a filter function f from X to R, and its output is an abstract graph (or simplicial complex) that captures topological and geometric information about the underlying space of X.
One of the interests in TDA mapper is to study whether or not a mapper graph is stable. That is, if a dataset X is perturbed by a small amount, with the perturbed dataset denoted X∂, we would like to compare the TDA mapper graph of X to the TDA mapper graph of X∂. Given a topological space X, Tamal Dey, Facundo Memoli, and Yusu Wang proved that if the cover of the image of f satisfies certain conditions, then TDA mapper is stable: the mapper graph of X differs from the mapper graph of X∂ by a small value measured via homology.
The goal of this thesis is three-fold. The first is to introduce a modified TDA mapper algorithm. The fundamental difference between TDA mapper and the modified version is that the modified version avoids the use of a filter function. In comparing the mapper graph outputs, the proposed modified mapper is shown to capture more geometric and topological features. We discuss the advantages and disadvantages of the modified mapper.
Tamal Dey, Facundo Memoli, and Yusu Wang showed that a filtration of covers induces a filtration of simplicial complexes, which in turn induces a filtration of homology groups. While they focused on TDA mapper's application to topological spaces, the second goal of this thesis is to show that DBSCAN clustering gives a filtration of covers when TDA mapper is applied to a point cloud. Hence, DBSCAN gives a filtration of mapper graphs (simplicial complexes) and homology groups. More importantly, DBSCAN gives a filtration of covers, mapper graphs, and homology groups in three parameter directions: bin size, epsilon, and MinPts. Hence, there is a multi-dimensional filtration of covers, mapper graphs, and homology groups. We also note that single-linkage clustering is a special case of DBSCAN clustering, so the results proved for DBSCAN also hold for single-linkage. However, complete-linkage does not give a filtration of covers in the bin-size direction, so no filtration of simplicial complexes and homology groups exists when complete-linkage is used to cluster a dataset. In general, the results hold for any clustering algorithm that gives a filtration of covers.
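For orientation, the following is a minimal sketch in Python of the standard mapper construction this entry builds on: cover the filter range with overlapping bins, cluster each preimage with DBSCAN, and connect clusters that share points. It is an illustration only, not the thesis's modified mapper, and the parameter names (n_bins, overlap, eps, min_samples) are assumptions for the example.

```python
import numpy as np
import networkx as nx
from sklearn.cluster import DBSCAN

def mapper_graph(X, f, n_bins=10, overlap=0.3, eps=0.5, min_samples=5):
    """Minimal mapper: overlapping bins over the filter range, DBSCAN on each
    preimage, and an edge whenever two clusters share points (the nerve)."""
    lo, hi = f.min(), f.max()
    width = (hi - lo) / n_bins
    clusters, G = [], nx.Graph()
    for i in range(n_bins):
        a = lo + i * width - overlap * width
        b = lo + (i + 1) * width + overlap * width
        idx = np.where((f >= a) & (f <= b))[0]
        if len(idx) == 0:
            continue
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[idx])
        for lab in set(labels) - {-1}:          # -1 marks DBSCAN noise points
            members = set(idx[labels == lab])
            node = len(clusters)
            clusters.append(members)
            G.add_node(node)
            for other, other_members in enumerate(clusters[:-1]):
                if members & other_members:      # shared points -> edge in the graph
                    G.add_edge(node, other)
    return G
```

Sweeping eps, MinPts, or the bin size in this construction is the kind of parameter direction along which the thesis obtains filtrations of covers, mapper graphs, and homology groups.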
The third (and last) goal of this thesis is to prove that the two multi-dimensional persistence modules (one with respect to the original dataset X, the other with respect to the ∂-perturbation X∂) are 2∂-interleaved. In other words, the mapper graphs of X and X∂ differ by a small value as measured by homology.
352
Investigating Post-Earnings-Announcement Drift Using Principal Component Analysis and Association Rule Mining. Schweickart, Ian R. W. 01 January 2017.
Post-Earnings-Announcement Drift (PEAD) is commonly accepted in the fields of accounting and finance as evidence for stock market inefficiency. Less accepted are the numerous explanations for this anomaly. This project aims to investigate the cause of PEAD by harnessing machine learning algorithms, namely Principal Component Analysis (PCA) and a rule-based learning technique, applied to large stock market data sets. Based on the notion that the market is consumer driven, the analysis uncovers repeated occurrences of irrational behavior exhibited by traders in response to news events such as earnings reports. The project produces findings in support of the PEAD anomaly using methods drawn from neither accounting nor finance. In particular, this project finds evidence for a delayed price response exhibited in trader behavior, a common manifestation of the PEAD phenomenon.
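For illustration of the dimensionality-reduction step involved, here is a minimal PCA sketch on an invented matrix of daily returns; the data, shapes, and component count are assumptions and this is not the project's actual pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical return matrix: rows = trading days around earnings announcements,
# columns = stocks (values invented purely for illustration).
rng = np.random.default_rng(0)
returns = rng.normal(0.0, 0.01, size=(250, 40))

pca = PCA(n_components=5)
scores = pca.fit_transform(returns)      # each day's exposure to the leading components
print(pca.explained_variance_ratio_)     # share of return variance each component captures
```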
353
Improved Standard Error Estimation for Maintaining the Validities of Inference in Small-Sample Cluster Randomized Trials and Longitudinal Studies. Tanner, Whitney Ford. 01 January 2018.
Data arising from Cluster Randomized Trials (CRTs) and longitudinal studies are correlated, and generalized estimating equations (GEE) are a popular analysis method for correlated data. Previous research has shown that analyses using GEE could result in liberal inference due to the use of the empirical sandwich covariance matrix estimator, which can yield negatively biased standard error estimates when the number of clusters or subjects is not large. Many techniques have been presented to correct this negative bias; however, use of these corrections can still result in biased standard error estimates and thus test sizes that are not consistently at their nominal level. Therefore, there is a need for an improved correction such that nominal type I error rates will consistently result.
First, GEEs are becoming a popular choice for the analysis of data arising from CRTs. We study the use of recently developed corrections for empirical standard error estimation and the use of a combination of two popular corrections. In an extensive simulation study, we find that nominal type I error rates can be consistently attained when using an average (AVG MD KC) of the two popular corrections developed by Mancl and DeRouen (2001, Biometrics 57, 126-134) and Kauermann and Carroll (2001, Journal of the American Statistical Association 96, 1387-1396). Use of this new correction was found to notably outperform the use of previously recommended corrections.
Second, data arising from longitudinal studies are also commonly analyzed with GEE. We conduct a simulation study, finding two methods to attain nominal type I error rates more consistently than other methods in a variety of settings: First, a recently proposed method by Westgate and Burchett (2016, Statistics in Medicine 35, 3733-3744) that specifies both a covariance estimator and degrees of freedom, and second, AVG MD KC with degrees of freedom equaling the number of subjects minus the number of parameters in the marginal model.
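As a hedged illustration of how the AVG MD KC approach is used for inference, the sketch below assumes the Mancl-DeRouen and Kauermann-Carroll bias-corrected variance estimates have already been obtained from one's GEE software; it averages them and tests with degrees of freedom equal to the number of clusters (or subjects) minus the number of parameters. The function name and inputs are hypothetical.

```python
import numpy as np
from scipy import stats

def avg_md_kc_test(beta_hat, var_md, var_kc, n_clusters, n_params):
    """Wald-type tests using the average (AVG MD KC) of the Mancl-DeRouen and
    Kauermann-Carroll corrected variance estimates, with df = clusters - parameters."""
    var_avg = 0.5 * (np.asarray(var_md) + np.asarray(var_kc))  # element-wise average
    se = np.sqrt(var_avg)
    t_stat = np.asarray(beta_hat) / se
    df = n_clusters - n_params
    p_value = 2 * stats.t.sf(np.abs(t_stat), df)
    return se, t_stat, p_value
```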
Finally, stepped wedge trials are an increasingly popular alternative to traditional parallel cluster randomized trials. Such trials often utilize a small number of clusters and numerous time intervals, and these components must be considered when choosing an analysis method. A generalized linear mixed model containing a random intercept and fixed time and intervention covariates is the most common analysis approach. However, the sole use of a random intercept applies assumptions that will be violated in practice. We show, using an extensive simulation study based on a motivating example and a more general design, that alternative analysis methods are preferable for maintaining the validity of inference in small-sample stepped wedge trials with binary outcomes. First, we show that the use of generalized estimating equations, with an appropriate bias correction and a degrees-of-freedom adjustment dependent on the study setting type, will result in nominal type I error rates. Second, we show that the use of a cluster-level summary linear mixed model can also achieve nominal type I error rates for equal cluster size settings.
354
Prediction of DNA-Binding Proteins and their Binding Sites. Pokhrel, Pujan. 01 May 2018.
DNA-binding proteins play an important role in various essential biological processes such as DNA replication, recombination, repair, gene transcription, and expression. The identification of DNA-binding proteins and the residues involved in the contacts is important for understanding the DNA-binding mechanism in proteins. Moreover, it has been reported in the literature that mutations of some DNA-binding residues on proteins are associated with certain diseases. The identification of these proteins and their binding mechanism generally requires experimental techniques, which makes large-scale study extremely difficult. Thus, the prediction of DNA-binding proteins and their binding sites from sequences alone is one of the most challenging problems in the field of genome annotation. Since the start of the human genome project, many attempts have been made to solve the problem with different approaches, but the accuracy of these methods is still not suitable for large-scale annotation of proteins. Rather than relying solely on existing machine learning techniques, I sought to combine them using a novel "stacking" technique and problem-specific architectures to solve the problem with better accuracy than the existing methods. This thesis presents a possible solution to the DNA-binding protein prediction problem which performs better than the state-of-the-art approaches.
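For readers unfamiliar with stacking, the sketch below shows generic model stacking with scikit-learn on invented data; it is not the thesis's stacking technique or problem-specific architecture, only the general idea of combining base learners through a meta-learner.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Toy feature matrix standing in for protein sequence descriptors (invented data).
X, y = make_classification(n_samples=500, n_features=40, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),   # meta-learner combines base predictions
    cv=5,
)
stack.fit(X, y)
print(stack.score(X, y))   # training accuracy of the stacked ensemble
```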
355
An Automated Analysis of Single Particle Tracking Data for Proteins That Exhibit Multi-Component Motion. Ali, Rehan. 01 January 2018.
Neurons are polarized cells with dendrites and an axon projecting from their cell body. Due to this polarized structure, a major challenge for neurons is the transport of material to and from the cell body. The transport that occurs between the cell body and the axon is called axonal transport. Axonal transport has three major components: molecular motors, which act as vehicles; microtubules, which serve as tracks on which these motors move; and microtubule-associated proteins, which regulate the transport of material. Axonal transport maintains the integrity of a neuron, and its dysfunction is linked to neurodegenerative diseases such as Alzheimer's disease, frontotemporal dementia linked to chromosome 17, and Pick's disease. Therefore, understanding the process of axonal transport is extremely important.
Single particle tracking is one method by which axonal transport is studied. It involves fluorescently labelling molecular motors and microtubule-associated proteins and tracking their positions over time. Single particle tracking has shown that both molecular motors and microtubule-associated proteins exhibit motion with multiple components. These components are directed, where motion is in one direction; diffusive, where motion is random; and static, where there is no motion. Moreover, molecular motors and microtubule-associated proteins also switch between these different components within a single instance of motion.
We have developed a MATLAB program, called MixMAs, which specializes in analyzing the data provided by single particle tracking. MixMAs uses a sliding-window approach to analyze trajectories of motion. It is capable of distinguishing between the different components of motion exhibited by molecular motors and microtubule-associated proteins, and it identifies the transitions that take place between those components. Most importantly, it is not limited by the number of transitions or the number of components present in a single trajectory. The analysis results provided by MixMAs include all the parameters required for a complete characterization of a particle's motion: the number of transitions between the different components of motion, the dwell times of the components, the velocity of the directed component, and the diffusion coefficient of the diffusive component.
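The sketch below is not the MixMAs implementation (which is in MATLAB); it illustrates one common sliding-window idea, classifying each window of a 2D trajectory by its local mean-squared-displacement scaling exponent, with the window size and thresholds chosen arbitrarily for the example.

```python
import numpy as np

def classify_windows(x, y, dt, win=15, alpha_hi=1.5, alpha_lo=0.5):
    """Label sliding windows of a 2D trajectory as directed, diffusive, or static
    using the local MSD scaling exponent alpha (MSD ~ t^alpha: ~2 directed,
    ~1 diffusive, ~0 static). Thresholds are illustrative only."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    labels = []
    for start in range(len(x) - win):
        xs, ys = x[start:start + win], y[start:start + win]
        lags = np.arange(1, win // 2)
        msd = [np.mean((xs[l:] - xs[:-l])**2 + (ys[l:] - ys[:-l])**2) for l in lags]
        # slope of the log-log fit estimates the scaling exponent for this window
        alpha = np.polyfit(np.log(lags * dt), np.log(np.maximum(msd, 1e-12)), 1)[0]
        labels.append("directed" if alpha > alpha_hi
                      else "diffusive" if alpha > alpha_lo
                      else "static")
    return labels
```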
We have validated MixMAs by simulating the motion of particles that show all three components of motion, with all possible transitions between them. The simulations are performed for different values of error in localizing the position of a particle and confirm that MixMAs accurately calculates the parameters of motion over a range of localization errors. Finally, we show an application of MixMAs to experimentally obtained single particle data of the Kinesin-3 motor.
356
Provision of Hospital-based Palliative Care and the Impact on Organizational and Patient Outcomes. Roczen, Marisa L. 01 January 2016.
Hospital-based palliative care services aim to streamline medical care for patients with chronic and potentially life-limiting illnesses by focusing on individual patient needs, using hospital resources efficiently, and providing guidance for patients, patients' families, and clinical providers toward making optimal decisions concerning a patient's care. This study examined the nature of palliative care provision in U.S. hospitals and its impact on selected organizational and patient outcomes, including hospital costs, length of stay, in-hospital mortality, and transfer to hospice. Hospital costs and length of stay are viewed as important economic indicators. Specifically, lower hospital costs may increase a hospital's profit margin, and shorter lengths of stay can enable patient turnover and efficiency of care. Higher rates of hospice transfer and lower in-hospital mortality may be considered positive outcomes from a patient perspective, as the majority of patients prefer to die at home or outside of the hospital setting.
Several data sources were utilized to obtain information about patient, hospital, and county characteristics; patterns of hospitals' palliative care provision; and patients' hospital costs, length of stay, in-hospital mortality, and transfer to hospice (if a patient survived hospitalization). The study sample consisted of 3,763,339 patients; 348 urban, general, short-term, acute care, non-federal hospitals; and 111 counties located in six states over a 5-year study period (2007-2011). Hospital-based palliative care provision was measured by the presence of three palliative care services: inpatient palliative care consultation services (PAL), inpatient palliative care units (IPAL), and hospice programs (HOSPC). Derived from Institutional Theory, Resource Dependence Theory, and Donabedian's Structure-Process-Outcome framework, 13 hypotheses were tested using a hierarchical (generalized) linear modeling approach.
The study findings suggested that hospital size was associated with a higher probability of hospital-based palliative care provision. Conversely, the presence of palliative care services through a hospital’s health system, network, or joint venture was associated with a lower probability of hospital-based palliative care provision. The study findings also indicated that hospitals with an IPAL or HOSPC incurred lower hospital costs, whereas hospitals with PAL incurred higher hospital costs. The presence of PAL, IPAL, and HOSPC was generally associated with a lower probability of in-hospital mortality and transfer to hospice. Finally, the effects of hospital-based palliative care services on length of stay were mixed, and further research is needed to understand this relationship.
357
A Comparison of Flare Forecasting Methods. III. Systematic Behaviors of Operational Solar Flare Forecasting Systems. Leka, K.D., Park, S-H., Kusano, K., Andries, J., Barnes, G., Bingham, S., Bloomfield, D.S., McCloskey, A.E., Delouille, V., Falconer, D., Gallagher, P.T., Georgoulis, M.K., Kubo, Y., Lee, K., Lee, S., Lobzin, V., Mun, J., Murray, S.A., Nageem, T.A.M.H., Qahwaji, Rami S.R., Sharpe, M., Steenburgh, R., Steward, G., Terkilsden, M. 08 October 2019.
A workshop was recently held at Nagoya University (31 October – 02 November 2017), sponsored by the Center for International Collaborative Research at the Institute for Space-Earth Environmental Research, Nagoya University, Japan, to quantitatively compare the performance of today's operational solar flare forecasting facilities. Building upon Paper I of this series (Barnes et al. 2016), in Paper II (Leka et al. 2019) we described the participating methods for this latest comparison effort and the evaluation methodology, and presented quantitative comparisons. In this paper we focus on the behavior and performance of the methods when evaluated in the context of broad implementation differences. Acknowledging the short testing interval available and the small number of methods available, we do find that forecast performance: 1) appears to improve by including persistence or prior flare activity, region evolution, and a human "forecaster in the loop"; 2) is hurt by restricting data to disk-center observations; 3) may benefit from long-term statistics, but mostly when combined with modern data sources and statistical approaches. These trends are arguably weak and must be viewed with numerous caveats, as discussed both here and in Paper II. Following this present work, we present in Paper IV a novel analysis method to evaluate temporal patterns of forecasting errors of both types (i.e., misses and false alarms; Park et al. 2019). Hence, most importantly, with this series of papers we demonstrate techniques for facilitating comparisons in the interest of establishing performance-positive methodologies.
We wish to acknowledge funding from the Institute for Space-Earth Environmental Research, Nagoya University for supporting the workshop and its participants. We would also like to acknowledge the "big picture" perspective brought by Dr. M. Leila Mays during her participation in the workshop. K.D.L. and G.B. acknowledge that the DAFFS and DAFFS-G tools were developed under NOAA SBIR contracts WC-133R-13-CN-0079 (Phase-I) and WC-133R-14-CN-0103 (Phase-II) with additional support from Lockheed-Martin Space Systems contract #4103056734 for Solar-B FPP Phase E support. A.E.McC. was supported by an Irish Research Council Government of Ireland Postgraduate Scholarship. D.S.B. and M.K.G. were supported by the European Union Horizon 2020 research and innovation programme under grant agreement No. 640216 (FLARECAST project; http://flarecast.eu). M.K.G. also acknowledges research performed under the A-EFFort project and subsequent service implementation, supported under ESA Contract number 4000111994/14/D/MPR. S.A.M. is supported by the Irish Research Council Postdoctoral Fellowship Programme and the US Air Force Office of Scientific Research award FA9550-17-1-039. The operational Space Weather services of ROB/SIDC are partially funded through the STCE, a collaborative framework funded by the Belgian Science Policy Office.
358
Data Analysis Using Experimental Design Model Factorial Analysis of Variance/Covariance (DMAOVC.BAS). Newton, Wesley E. 01 May 1985.
DMAOVC.BAS is a computer program, written in the compiler version of Microsoft BASIC, which performs factorial analysis of variance/covariance with expected mean squares. The program accommodates factorial and other hierarchical experimental designs with balanced sets of data. It is written for use on most modest-sized microprocessors on which the compiler is available. The program is parameter-file driven: the parameter file consists of the response variable structure, the experimental design model expressed in a structure similar to that seen in most textbooks, information concerning the factors (i.e., fixed or random, and the number of levels), and the information necessary to perform covariance analysis. The results of the analysis are written to separate files in a format that can be used for reporting purposes and further computations if needed.
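The BASIC source is not reproduced here; as a rough modern analogue of the factorial analysis the program performs, the following Python sketch fits a 2x2 factorial model on invented data and prints the ANOVA table.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Tiny hypothetical balanced 2x2 factorial data set (two replicates per cell).
df = pd.DataFrame({
    "A": ["a1", "a1", "a1", "a1", "a2", "a2", "a2", "a2"],
    "B": ["b1", "b1", "b2", "b2", "b1", "b1", "b2", "b2"],
    "y": [4.1, 3.9, 5.2, 5.5, 6.0, 6.3, 7.9, 8.1],
})

model = ols("y ~ C(A) * C(B)", data=df).fit()   # main effects plus interaction
print(sm.stats.anova_lm(model, typ=2))          # factorial ANOVA table
```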
359
Anomaly detection and root cause diagnosis in cellular networks. Mdini, Maha. 20 September 2019.
With the evolution of automation and artificial intelligence tools, mobile networks have become more and more machine reliant. Today, a large part of their management tasks runs in an autonomous way, without human intervention. In this thesis, we have focused on taking advantage of data analysis tools to automate the troubleshooting task and carry it to a deeper level. To do so, we have defined two main objectives: anomaly detection and root cause diagnosis. The first objective is about detecting issues in the network automatically without including expert knowledge. To meet this objective, we have proposed an algorithm, Watchmen Anomaly Detection (WAD), based on pattern recognition. It learns patterns from periodic time series and detects distortions in the flow of new data. The second objective aims at identifying the root cause of issues without any prior knowledge about the network topology and services. To address this question, we have designed an algorithm, Automatic Root Cause Diagnosis (ARCD), that identifies the roots of network issues. ARCD is composed of two independent threads: major contributor identification and incompatibility detection. WAD and ARCD have been proven to be effective. However, many improvements of these algorithms are possible.
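The WAD and ARCD algorithms themselves are not reproduced here; the sketch below only illustrates the general pattern-based idea behind detectors of this kind, learning a periodic baseline per time slot and flagging large deviations, with all parameters invented for the example.

```python
import numpy as np

def detect_anomalies(series, period=24, k=3.0):
    """Toy periodic-baseline detector (not the WAD algorithm): learn the per-slot
    mean and standard deviation across past periods, then flag points that deviate
    from their slot's baseline by more than k standard deviations."""
    series = np.asarray(series, dtype=float)
    slots = np.arange(len(series)) % period
    mean = np.array([series[slots == s].mean() for s in range(period)])
    std = np.array([series[slots == s].std() + 1e-9 for s in range(period)])
    z = np.abs(series - mean[slots]) / std[slots]
    return np.where(z > k)[0]   # indices of anomalous observations
```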
360
Trend Analysis and Modeling of Health and Environmental Data: Joinpoint and Functional Approach. Kafle, Ram C. 04 June 2014.
The present study is divided into two parts: the first develops the statistical analysis and modeling of mortality (or incidence) trends using Bayesian joinpoint regression, and the second fits differential equations to time series data to derive the rate of change of carbon dioxide in the atmosphere.
The joinpoint regression model identifies significant changes in the trends of the incidence, mortality, and survival of a specific disease in a given population. The Bayesian approach to joinpoint regression is widely used in modeling statistical data to identify the points in a trend where significant changes occur. The purpose of the present study is to develop an age-stratified Bayesian joinpoint regression model to describe mortality trends, assuming that the observed counts are probabilistically characterized by the Poisson distribution. The proposed model is based on Bayesian model selection criteria, with the smallest number of joinpoints that is sufficient to explain the Annual Percentage Change (APC). The prior probability distributions are chosen in such a way that they are automatically derived from the model index contained in the model space. The proposed model and methodology estimate age-adjusted mortality rates in different epidemiological studies to compare trends while accounting for the confounding effects of age. Future mortality rates are predicted using the Bayesian Model Averaging (BMA) approach.
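As a simplified, non-Bayesian illustration of the piecewise log-linear structure underlying joinpoint regression, the sketch below fits a Poisson model with one fixed joinpoint and converts the segment slopes to Annual Percentage Changes; the function and its inputs are hypothetical, and the Bayesian model-selection and prior machinery described above is not shown.

```python
import numpy as np
import statsmodels.api as sm

def segment_apc(years, counts, population, joinpoint):
    """Fit a piecewise log-linear Poisson trend with one fixed joinpoint and return
    the APC of each segment, where APC = 100 * (exp(slope) - 1). Inputs are numpy
    arrays of calendar years, event counts, and population at risk."""
    t = years - years.min()
    knot = joinpoint - years.min()
    X = sm.add_constant(np.column_stack([t, np.maximum(t - knot, 0.0)]))
    model = sm.GLM(counts, X, family=sm.families.Poisson(),
                   offset=np.log(population)).fit()
    b1, b2 = model.params[1], model.params[2]       # slope and change in slope at the joinpoint
    return 100 * (np.exp(b1) - 1), 100 * (np.exp(b1 + b2) - 1)
```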
As an application of the Bayesian joinpoint regression, we first study childhood brain cancer mortality rates (non-age-adjusted rates) and their Annual Percentage Change (APC) per year using the existing Bayesian joinpoint regression models in the literature. We use annual observed mortality counts of children ages 0-19 from 1969-2009 obtained from the Surveillance, Epidemiology, and End Results (SEER) database of the National Cancer Institute (NCI). The predictive distributions are used to predict future mortality rates, and we compare this result with the mortality trend obtained using the NCI's Joinpoint software. To fit the age-stratified model, we use the cancer mortality counts of adult lung and bronchus cancer patients (25-85+ years), and brain and other Central Nervous System (CNS) cancer patients (25-85+ years), obtained from the SEER database of the NCI.
The second part of this study is the statistical analysis and modeling of noisy data using a functional data analysis approach. Carbon dioxide is one of the major contributors to global warming. In this study, we develop a system of differential equations using time series data of the major sources of the significant contributing variables of carbon dioxide in the atmosphere. We define the differential operator as a data smoother and use a penalized least-squares fitting criterion to smooth the data. Finally, we optimize the profiled error sum of squares to estimate the necessary differential operator. The proposed models give an estimate of the rate of change of carbon dioxide in the atmosphere at a particular time. We apply the model to carbon dioxide emission data for the continental United States. The data set is obtained from the Carbon Dioxide Information Analysis Center (CDIAC), the primary climate-change data and information analysis center of the United States Department of Energy.
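The following is a much-simplified illustration of estimating a rate of change from smoothed data, not the profiled differential-operator estimation described above; the series, smoothing parameter, and units are all invented for the example.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Hypothetical annual emission series (values invented for illustration only).
years = np.arange(1960, 2011, dtype=float)
co2 = 2500 + 35 * (years - 1960) + 40 * np.sin((years - 1960) / 4.0)

spline = UnivariateSpline(years, co2, s=len(years) * 100.0)  # s controls the smoothing penalty
rate_of_change = spline.derivative()(years)                  # d(CO2)/dt evaluated at each year
print(rate_of_change[:5])
```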
The first four chapters of this dissertation contribute to the development and application of joinpoint models, and the last chapter discusses the statistical modeling and application of differential equations to data using a functional data analysis approach.