11 |
Automated Machine Learning: Intelligent Binning Data Preparation and Regularized Regression Classifier
Zhu, Jianbin 01 January 2023 (has links) (PDF)
Automated machine learning (AutoML) has become a new trend: it automates the complete pipeline from the raw dataset to a fitted machine learning model. It not only relieves data scientists of routine work but also allows non-experts to build models without deep knowledge of statistical inference and machine learning. One limitation of AutoML frameworks is that data quality can differ significantly from batch to batch; consequently, the fitted model can be very poor for some batches because of distribution shift in some numerical predictors. In this dissertation, we develop an intelligent binning procedure to resolve this problem. In addition, various regularized regression classifiers (RRCs), including Ridge, Lasso, and Elastic Net regression, are tested to further enhance model performance after binning. We focus on the binary classification problem and have developed an AutoML framework in Python that handles the entire data preparation process, including data partitioning and intelligent binning. The system has been tested extensively through simulations and analyses of real datasets, and the results show that (1) all models perform better with intelligent binning for both balanced and imbalanced binary classification problems; (2) regression-based methods are more sensitive to intelligent binning than tree-based methods, and with intelligent binning RRCs can outperform tree-based methods; (3) the weighted RRC obtains the best results among the methods compared; and (4) our framework is an effective and reliable tool for conducting AutoML.
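The abstract does not give the dissertation's actual binning rule or model code; as a rough illustration of the general idea, the sketch below pairs a simple quantile binning of numeric predictors (a stand-in for intelligent binning) with a class-weighted elastic-net logistic classifier in scikit-learn. All data and parameter choices are hypothetical.

```python
# Minimal sketch: quantile binning of numeric predictors followed by a
# regularized (elastic-net) logistic classifier with class weights.
# Generic stand-in only, not the dissertation's intelligent-binning rule.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(size=2000) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

pipe = Pipeline([
    # Binning makes the downstream linear model more robust to shifts in the
    # numeric predictors' distributions between data batches.
    ("bin", KBinsDiscretizer(n_bins=10, encode="onehot-dense", strategy="quantile")),
    # Elastic-net penalty (L1/L2 mix); class_weight="balanced" handles imbalance.
    ("clf", LogisticRegression(penalty="elasticnet", solver="saga",
                               l1_ratio=0.5, C=1.0,
                               class_weight="balanced", max_iter=5000)),
])
pipe.fit(X_train, y_train)
print("test AUC:", roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1]))
```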
|
12 |
An Evaluation of the Performance of Proc ARIMA's Identify Statement: A Data-Driven Approach using COVID-19 Cases and Deaths in Florida
Shahela, Fahmida Akter 01 January 2021 (has links) (PDF)
Understanding data on the novel coronavirus (COVID-19) pandemic and modeling such data over time are crucial for decision making in managing, fighting, and controlling the spread of this emerging disease. This thesis examines aspects of exploratory analysis and modeling of COVID-19 data obtained from the Florida Department of Health (FDOH). In particular, the present work is devoted to the collection, preparation, description, and modeling of COVID-19 cases and deaths reported by FDOH between March 12, 2020, and April 30, 2021. For modeling both cases and deaths, this thesis uses an autoregressive integrated moving average (ARIMA) time series model. The "IDENTIFY" statement of SAS PROC ARIMA suggests a few competing models with candidate values of the parameters p (the order of the autoregressive component), d (the order of differencing), and q (the order of the moving average component). All suggested models are then compared using AIC (Akaike Information Criterion), SBC (Schwarz Bayesian Criterion), and MAE (Mean Absolute Error), and the best-fitting models are chosen as those with the smallest values of these criteria. To evaluate the performance of the selected model, the procedure is repeated using the first six months of data to forecast the next 7 days, the first nine months of data to forecast the next 7 days, and finally all FDOH data reported from March 12, 2020, to April 30, 2021, to forecast future values. The exploratory data analysis, which suggests more COVID-19 cases among females than males and more deaths among males than females, is taken into account by also evaluating the performance of gender-specific final models for both the cases and deaths data. The gender-specific models perform better under the comparison criteria Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) than models based on gender-aggregated data. The fitted models predicted the future numbers of confirmed cases and deaths reasonably well. Given similarities in reported COVID-19 data, the proposed modeling approach can be applied to data from other US states and from countries around the world.
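The thesis itself uses SAS PROC ARIMA; purely as an illustration of the same order-selection idea, the sketch below compares candidate (p, d, q) orders by AIC and forecasts 7 days ahead using Python's statsmodels. The series is synthetic, standing in for the FDOH daily counts.

```python
# Sketch of the order-selection idea behind PROC ARIMA's IDENTIFY step:
# fit several candidate (p, d, q) orders, keep the smallest-AIC fit, and
# forecast 7 days ahead. Uses statsmodels in place of SAS; data are synthetic.
import itertools
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
t = np.arange(180)
cases = pd.Series(100 + 2.0 * t + rng.normal(0, 15, size=180))  # synthetic daily counts

best = None
for p, d, q in itertools.product(range(3), range(2), range(3)):
    try:
        fit = ARIMA(cases, order=(p, d, q)).fit()
    except Exception:
        continue  # skip orders that fail to converge
    if best is None or fit.aic < best[0]:
        best = (fit.aic, (p, d, q), fit)

aic, order, fit = best
print("selected order:", order, "AIC:", round(aic, 1))
print(fit.forecast(steps=7))  # 7-day-ahead forecast
```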
|
13 |
A Simulation-Based Task Analysis using Agent-Based, Discrete Event and System Dynamics Simulation
Angelopoulou, Anastasia 01 January 2015 (has links)
Recent advances in technology have increased the need for simulation models to analyze tasks and obtain human performance data. A variety of task analysis approaches and tools have been proposed and developed over the years, with more than 100 task analysis methods reported in the literature. However, most of the developed methods and tools only represent the static aspects of the tasks performed by expert system-driven human operators, neglecting aspects of the work environment, such as the physical layout, and the dynamic aspects of the task. Simulation can help address the new challenges in the field of task analysis, as it allows the dynamic aspects of the tasks, the humans performing them, and their locations in the environment to be simulated. Modeling and simulation task analysis tools and techniques have proven effective for task analysis, workload assessment, and human reliability assessment. However, most existing task analysis simulation models and tools lack features that allow for consideration of errors, workload, and the operator's level of expertise and skill, among others. In addition, current task analysis simulation tools require basic training on the tool to model the flow of the task analysis process and/or to assess errors and workload; the modeling process is usually carried out using drag-and-drop functionality and, in some cases, programming. This research focuses on automating the modeling process and simulating individuals (or groups of individuals) performing tasks in a dynamic work environment in any domain. The main objective of this research is to develop a universal tool that allows task analysis models to be built and simulated in a short amount of time with limited need for training or knowledge of modeling and simulation theory. The Universal Task Analysis Simulation Modeling (UTASiMo) tool automatically generates simulation models that analyze the tasks performed by human operators. UTASiMo is a multi-method modeling and simulation tool developed as a combination of agent-based, discrete event, and system dynamics simulation models. A generic multi-method modeling and simulation framework, named the 3M&S Framework, together with the Unified Modeling Language, was used for the design of the conceptual model and the implementation of the simulation tool. UTASiMo-generated models are created dynamically at run time based on user inputs. The simulation results include estimates of operator workload, task completion time, and probability of human error based on human operator variability and task structure.
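UTASiMo's internals are not described beyond this abstract; as a rough illustration of the discrete-event portion of such a tool, the sketch below uses SimPy to simulate a single operator processing arriving tasks and collects completion times. Task durations and arrival rates are hypothetical, not UTASiMo's model-generation logic.

```python
# Rough discrete-event illustration: an operator performs arriving tasks and
# completion-time statistics are collected over one shift. All parameters are
# made up for illustration.
import random
import simpy

random.seed(42)
completion_times = []

def task(env, operator, duration):
    arrival = env.now
    with operator.request() as req:          # wait for the (single) operator
        yield req
        yield env.timeout(duration)          # time spent performing the task
    completion_times.append(env.now - arrival)

def generate_tasks(env, operator):
    while True:
        yield env.timeout(random.expovariate(1 / 5.0))           # ~5 min between arrivals
        env.process(task(env, operator, random.uniform(2, 8)))   # each task takes 2-8 min

env = simpy.Environment()
operator = simpy.Resource(env, capacity=1)
env.process(generate_tasks(env, operator))
env.run(until=8 * 60)  # one 8-hour shift, in minutes

print("tasks completed:", len(completion_times))
print("mean completion time (min):", sum(completion_times) / len(completion_times))
```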
|
14 |
Improvements in and Relating to Processing Apparatus & Method
Noras, James M., Jones, Steven M.R., Rajamani, Haile S., Shepherd, Simon J., Van Eetvelt, Peter 25 May 2004 (has links)
No
|
15 |
Models for Univariate and Multivariate Analysis of Longitudinal and Clustered Data
Luo, Dandan Unknown Date
No description available.
|
16 |
Industrial Batch Data Analysis Using Latent Variable Methods
Rodrigues, Cecilia 09 1900 (has links)
Currently most batch processes run in an open-loop manner with respect to final product quality, regardless of the performance obtained. This fact, together with the increased industrial importance of batch processes, indicates a pressing need for the development and dissemination of automated batch quality control techniques suited to present industrial needs. Within this context, the main objective of the current work is to exemplify the use of empirical latent variable methods to reduce product quality variability in batch processes. These methods, also known as multiway principal component analysis (MPCA) and multiway partial least squares (MPLS), were originally introduced by Nomikos and MacGregor (1992, 1994, 1995a and 1995b). Their use is tied to the concepts of statistical process control (SPC) and leads to incremental process improvements. Throughout this thesis three sets of industrial data, originating from different batch processes, are analyzed. The first section of this thesis (Chapter 3) demonstrates how MPCA and multi-block, multiway partial least squares (MB-MPLS) methods can be used to troubleshoot an industrial batch unit in order to identify optimal process conditions with respect to quality; approaches to batch data laundering are also proposed. The second section (Chapter 4) elaborates on the use of an MPCA model to build a single, all-encompassing, on-line monitoring scheme for the heating phase of a multi-grade batch annealing process. The same data set is then used to present a simple alignment technique for batch data when on-line monitoring is intended (Chapter 5). This technique, referred to as pre-alignment, relies on a PLS model to predict the duration of new batches; various methods for dealing with matrices containing observations of different sizes are also proposed and evaluated. Finally, the last section (Chapter 6) deals with end-point prediction of a condensation polymerization process. / Thesis / Master of Applied Science (MASc)
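The core mechanical step behind MPCA is batch-wise unfolding of the three-way data array followed by ordinary PCA. The sketch below shows that step on synthetic data with numpy/scikit-learn; the thesis's industrial data sets, control limits, and MB-MPLS extensions are not reproduced here.

```python
# Minimal sketch of multiway PCA (MPCA) on batch data: unfold the
# (batch x time x variable) array batch-wise into a 2-D matrix, autoscale,
# and run ordinary PCA. Synthetic data stand in for the industrial batches.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
n_batches, n_time, n_vars = 40, 100, 6
X = rng.normal(size=(n_batches, n_time, n_vars))       # batch trajectories
X[5] += 1.5                                            # one deliberately abnormal batch

X_unfolded = X.reshape(n_batches, n_time * n_vars)     # batch-wise unfolding
X_scaled = StandardScaler().fit_transform(X_unfolded)  # autoscaling (mean 0, sd 1)

pca = PCA(n_components=3)
scores = pca.fit_transform(X_scaled)

# Batches far from the origin in score space are candidates for investigation.
distances = np.linalg.norm(scores, axis=1)
print("most unusual batch:", int(np.argmax(distances)))
```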
|
17 |
Application of extreme value theory
Hakimi Sibooni, J. January 1988 (has links)
No description available.
|
18 |
MAPS OF THE MAGELLANIC CLOUDS FROM COMBINED SOUTH POLE TELESCOPE AND PLANCK DATA
Crawford, T. M., Chown, R., Holder, G. P., Aird, K. A., Benson, B. A., Bleem, L. E., Carlstrom, J. E., Chang, C. L., Cho, H-M., Crites, A. T., Haan, T. de, Dobbs, M. A., George, E. M., Halverson, N. W., Harrington, N. L., Holzapfel, W. L., Hou, Z., Hrubes, J. D., Keisler, R., Knox, L., Lee, A. T., Leitch, E. M., Luong-Van, D., Marrone, D. P., McMahon, J. J., Meyer, S. S., Mocanu, L. M., Mohr, J. J., Natoli, T., Padin, S., Pryke, C., Reichardt, C. L., Ruhl, J. E., Sayre, J. T., Schaffer, K. K., Shirokoff, E., Staniszewski, Z., Stark, A. A., Story, K. T., Vanderlinde, K., Vieira, J. D., Williamson, R. 09 December 2016 (has links)
We present maps of the Large and Small Magellanic Clouds from combined South Pole Telescope (SPT) and Planck data. The Planck satellite observes in nine bands, while the SPT data used in this work were taken with the three-band SPT-SZ camera. The SPT-SZ bands correspond closely to three of the nine Planck bands, namely those centered at 1.4, 2.1, and 3.0 mm. The angular resolution of the Planck data ranges from 5 to 10 arcmin, while the SPT resolution ranges from 1.0 to 1.7 arcmin. The combined maps take advantage of the high resolution of the SPT data and the long-timescale stability of the space-based Planck observations to deliver robust brightness measurements on scales from the size of the maps down to ~1 arcmin. In each band, we first calibrate and color-correct the SPT data to match the Planck data; then we use noise estimates from each instrument and knowledge of each instrument's beam to form the inverse-variance-weighted combination of the two instruments' data as a function of angular scale. We create maps assuming a range of underlying emission spectra and at a range of final resolutions. We perform several consistency tests on the combined maps and estimate the expected noise in measurements of features in them. We compare maps from this work to those from the Herschel HERITAGE survey, finding general consistency between the data sets. All data products described in this paper are available for download from the NASA Legacy Archive for Microwave Background Data Analysis server.
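The combination step described above amounts to a scale-dependent inverse-variance weighting of the two maps. The toy numpy sketch below combines two maps in Fourier space with weights derived from assumed noise power spectra; it deliberately ignores the beam deconvolution, calibration, and color-correction details of the actual SPT+Planck processing.

```python
# Toy sketch of inverse-variance-weighted map combination in Fourier space.
# Noise power spectra are simple assumed functions, not SPT or Planck values.
import numpy as np

n = 256
rng = np.random.default_rng(3)
sky = rng.normal(size=(n, n))                 # stand-in for the true sky
map_a = sky + 0.5 * rng.normal(size=(n, n))   # "Planck-like" map
map_b = sky + 0.2 * rng.normal(size=(n, n))   # "SPT-like" map

ky, kx = np.meshgrid(np.fft.fftfreq(n), np.fft.fftfreq(n), indexing="ij")
k = np.hypot(kx, ky) + 1e-6

# Assumed noise spectra: instrument A degrades at small scales (high k),
# instrument B at large scales (low k).
noise_a = 0.25 * (1 + (k / 0.1) ** 2)
noise_b = 0.04 * (1 + (0.02 / k) ** 2)
w_a, w_b = 1.0 / noise_a, 1.0 / noise_b

fa, fb = np.fft.fft2(map_a), np.fft.fft2(map_b)
combined = np.real(np.fft.ifft2((w_a * fa + w_b * fb) / (w_a + w_b)))

print("residual rms, combined vs. truth:", np.std(combined - sky).round(3))
```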
|
19 |
Topics in functional data analysis with biological applications
Li, Yehua 02 June 2009
Functional data analysis (FDA) is an active field of statistics in which the primary subjects of study are curves. My dissertation consists of two innovative applications of functional data analysis in biology. The data that motivated the research broadened the scope of FDA and demanded new methodology. I develop new nonparametric methods to make various estimations, and I focus on developing large sample theories for the proposed estimators.
The first project is motivated by a colon carcinogenesis study, the goal of which is to study the function of a protein (p27) in colon cancer development. In this study, a number of colonic crypts (units) were sampled from each rat (subject) at random locations along the colon, and then repeated measurements of the protein expression level were made on each cell (subunit) within the selected crypts. In this problem, the measurements within each crypt can be viewed as a function, since the measurements can be indexed by the cell locations. The functions from the same subject are spatially correlated along the colon, and my goal is to estimate this correlation function using nonparametric methods. Using this data set as motivation, we propose a kernel estimator of the correlation function in a more general framework. We develop a pointwise asymptotic normal distribution for the proposed estimator when the number of subjects is fixed and the number of units within each subject goes to infinity. Based on the asymptotic theory, we propose a weighted block bootstrapping method for making inferences about the correlation function, where the weights account for the inhomogeneity of the distribution of the unit locations. Simulation studies are also provided to illustrate the numerical performance of the proposed method.
My second project concerns lipoprotein profile data, where the goal is to use lipoprotein profile curves to predict the cholesterol level in human blood. Again motivated by the data, we consider a more general problem: the functional linear model (Ramsay and Silverman, 1997) with a functional predictor and a scalar response. There is literature developing different methods for this model; however, there is little theory to support the methods, so we focus on the theoretical properties of the model. Other contemporary theoretical work is based on principal component regression; our work differs in that we base our method on a roughness penalty approach and consider the more realistic scenario in which the functional predictor is observed only at discrete points. To reduce the difficulty of the theoretical derivations, we restrict the functions with a periodic boundary condition and develop an asymptotic convergence rate for this problem in Chapter III. A more general result based on splines is a future research topic that I discuss in Chapter IV.
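The scalar-on-function model with a roughness penalty can be written as Y = ∫ X(t)β(t)dt + ε, with β(t) estimated by penalizing its roughness. The sketch below is a generic discretized version of that idea (a second-difference penalty on a grid), with synthetic curves standing in for the lipoprotein profiles; it does not reproduce the dissertation's periodic-boundary estimator or its asymptotic analysis.

```python
# Sketch of scalar-on-function regression with a roughness penalty: beta(t) is
# estimated on the observation grid by penalizing its second differences.
import numpy as np

rng = np.random.default_rng(21)
n, m = 150, 100                                    # subjects, grid points
t = np.linspace(0, 1, m)

# Functional predictors built from a small Fourier basis, observed with noise.
basis = np.vstack([np.sin(k * np.pi * t) for k in range(1, 6)] +
                  [np.cos(k * np.pi * t) for k in range(1, 6)])    # 10 x m
coefs = rng.normal(size=(n, 10))
X = coefs @ basis + 0.05 * rng.normal(size=(n, m))

beta_true = np.sin(3 * np.pi * t)
y = X @ beta_true / m + 0.05 * rng.normal(size=n)  # discretized integral + noise

# Second-difference matrix D; lam * beta' D'D beta is the roughness penalty.
D = np.diff(np.eye(m), n=2, axis=0)
lam = 1e-3
A = X.T @ X / m ** 2 + lam * D.T @ D
b = X.T @ y / m
beta_hat = np.linalg.solve(A, b)

print("correlation(beta_hat, beta_true):", round(np.corrcoef(beta_hat, beta_true)[0, 1], 3))
```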
|
20 |
The Implications and Flow Behavior of the Hydraulically Fractured Wells in Shale Gas Formation
Almarzooq, Anas Mohammadali S. 2010 December 1900
Shale gas formations are known to have low permeability, which can be as low as 100 nanodarcies. Without stimulation, wells drilled in shale gas formations are hard to produce at an economic rate. One stimulation approach is to drill horizontal wells and hydraulically fracture the formation. Once the formation is fractured, different flow patterns occur. The dominant flow regime observed in shale gas formations is linear flow, i.e., transient drainage from the formation matrix toward the hydraulic fracture. This flow can extend over years of production, and it can be identified by a half slope on the log-log plot of gas rate against time. It can be used to evaluate the hydraulic fracture surface area and, ultimately, the effectiveness of the completion job. Different models from the literature can be used for this evaluation. One of the models used in this work assumes a rectangular reservoir with a slab-shaped matrix between each pair of hydraulic fractures. In this model, there are at least five flow regions; the two discussed here are Region 2, in which bilinear flow occurs as a result of simultaneous drainage from the matrix and the hydraulic fracture, and Region 4, which results from transient matrix drainage and can extend over many years. Barnett Shale production data are used throughout this work to illustrate the calculations.
The first part of this work evaluates the field data used in this study following a systematic procedure explained in Chapter III. This part reviews the historical production, reservoir and fluid data, and well completion records available for the wells being analyzed. It also checks for correlations within the available data and explains abnormal flow behaviors that may appear in the field production data, including why some wells might not fit each model. This is followed by a preliminary diagnosis, in which flow regimes are identified, unclear data are filtered, and interference and liquid-loading data are flagged. After completing the data evaluation, this work evaluates and compares the different methods available in the literature in order to decide which method best fits the analysis of production data from the Barnett Shale. Formation properties and the original gas in place are evaluated and compared across the different methods.
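The half-slope diagnostic mentioned above can be checked numerically: in linear (transient) flow the gas rate declines roughly as one over the square root of time, so the rate-versus-time data plot with a slope near -1/2 on log-log axes. The sketch below fits that slope on synthetic rate data; it is only an illustration of the diagnostic, not an analysis of the Barnett wells.

```python
# Sketch of the half-slope diagnostic: fit the log-log slope of gas rate vs.
# time; a slope near -1/2 is consistent with linear (transient) matrix flow.
# Rate data here are synthetic.
import numpy as np

rng = np.random.default_rng(5)
t = np.arange(1, 1000, dtype=float)                            # days on production
q = 5000.0 * t ** -0.5 * np.exp(rng.normal(0, 0.05, t.size))   # synthetic rate, Mscf/d

slope, intercept = np.polyfit(np.log10(t), np.log10(q), 1)
print(f"log-log slope = {slope:.2f}")
if abs(slope + 0.5) < 0.1:
    print("approximately -1/2 slope: consistent with linear (transient) flow")
```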
|