Global ETD Search

11	APPLICATION OF BIG DATA ANALYTICS FRAMEWORK FOR ENHANCING CUSTOMER EXPERIENCE ON E-COMMERCE SHOPPING PORTALS Nimita Shyamsunder Atal (8785316) 01 May 2020 E-commerce organizations need to keep striving for constant innovation. Customers have a massive impact on the performance of an organization, so industries need solid customer retention strategies. Organizations use various big data analytics methodologies to improve the overall online customer experience. While multiple techniques are available, this research study applied and tested a framework proposed by Laux et al. (2017), which combines Big Data and Six Sigma methodologies, in the e-commerce domain to identify issues faced by customers; this was done by analyzing customers' online product reviews and ratings to provide improvement strategies for enhancing customer experience.
Analysis of the data showed that approximately 90% of the customer reviews had positive polarity. Among the factors identified as affecting customers' opinions, the Rating field had the greatest impact on user sentiment, and the effect was statistically significant. Further analysis of reviews with lower ratings showed that the major issues faced by customers related to the product itself: most concerned the size/fit of the product, followed by product quality, the material used, how the product looked on the online portal versus in reality, and its price relative to its quality.
Natural Language Processing Pattern Recognition and Data Mining Big Data Framework Six Sigma DMAIC E-commerce platforms Online Shopping Portal Customer rev Customer Experience customer satisfaction
12	Probabilistic Diagnostic Model for Handling Classifier Degradation in Machine Learning Gustavo A. Valencia-Zapata (8082655) 04 December 2019 Several studies point out different causes of performance degradation in supervised machine learning. Problems such as class imbalance, overlapping, small-disjuncts, noisy labels, and sparseness limit accuracy in classification algorithms. Even though a number of approaches, in the form of either a methodology or an algorithm, try to minimize performance degradation, they have been isolated efforts with limited scope. This research consists of three main parts. In the first part, a novel probabilistic diagnostic model based on identifying signs and symptoms of each problem is presented. In the second part, the behavior and performance of several supervised algorithms are studied when training sets have such problems. Therefore, prediction of success for treatments can be estimated across classifiers. Finally, a probabilistic sampling technique based on training set diagnosis for avoiding classifier degradation is proposed. Statistics Pattern Recognition and Data Mining Class imbalance Overlapping Small-disjuncts Noisy labels Sparseness Gaussian Mixture Models Separation index Classifier degradation Bayesian Information Criterion (BIC)
13	Be More with Less: Scaling Deep-learning with Minimal Supervision Yaqing Wang (12470301) 28 April 2022 Large-scale deep learning models have reached previously unattainable performance for various tasks. However, the ever-growing resource consumption of neural networks generates a large carbon footprint, makes it difficult for academics to engage in research, and stops emerging economies from enjoying the growing benefits of Artificial Intelligence (AI). To further scale AI to bring more benefits, two major challenges need to be solved. Firstly, even though large-scale deep learning models have achieved remarkable success, their performance is still not satisfactory when fine-tuning with only a handful of examples, which hinders widespread adoption in real-world applications where large amounts of labeled data are difficult to obtain. Secondly, current machine learning models are still mainly designed for tasks in closed environments where testing datasets are highly similar to training datasets. When the deployed datasets have a distribution shift relative to the collected training data, we generally observe degraded performance of the developed models. How to build adaptable models becomes another critical challenge. To address those challenges, in this dissertation we focus on two topics: few-shot learning and domain adaptation, where few-shot learning aims to learn tasks with limited labeled data and domain adaptation addresses the discrepancy between training data and testing data. In Part 1, we present our few-shot learning studies. The proposed few-shot solutions are built upon large-scale language models, with evolutionary explorations ranging from improving supervision signals and incorporating unlabeled data to improving few-shot learning abilities with a lightweight fine-tuning design that reduces deployment costs. In Part 2, domain adaptation studies are introduced. We develop a progressive series of domain adaptation approaches to transfer knowledge across domains efficiently and handle distribution shifts, including capturing common patterns across domains, adaptation with weak supervision, and adaptation to thousands of domains with limited labeled data and unlabeled data. Computer Engineering Applied Computer Science Pattern Recognition and Data Mining Minimally-supervised Learning Semi-supervised Learning Data Mining Deep Learning Fake News Detection Natural Language Processing Domain Adaptation
14	You Only Gesture Once (YouGo): American Sign Language Translation using YOLOv3 Mehul Nanda (8786558) 01 May 2020 The study focused on creating and proposing a model that could accurately and precisely predict the occurrence of an American Sign Language (ASL) gesture for a letter of the English alphabet using the You Only Look Once (YOLOv3) algorithm. The training dataset used for this study was custom-created and was further divided into clusters based on the uniqueness of the ASL sign. Three diverse clusters were created. Each cluster was trained with the network known as darknet. Testing was conducted using images and videos on the fully trained models of each cluster, and the Average Precision for each letter in each cluster and the Mean Average Precision for each cluster were recorded. In addition, a Word Builder script was created. This script combined the trained models of all three clusters to create a comprehensive system that builds words when the trained models are supplied with images of English letters as signed in ASL. Computer Vision Image Processing Pattern Recognition and Data Mining Object Detection Neural Networks YOLO YOLOv3 Sign Language Sign Language Translation ASL Image Processing Convolutional Neural Network
15	Text mining for social harm and criminal justice application Ritika Pandey (9147281) 30 July 2020 Increasing rates of social harm events and a plethora of text data call for text mining techniques, not only to better understand the causes of these events but also to develop optimal prevention strategies. In this work, we study three social harm issues: crime topic models, transitions into drug addiction, and homicide investigation chronologies. Topic modeling for the categorization and analysis of crime report text allows for more nuanced categories of crime compared to official UCR categorizations. This study has important implications for hotspot policing. We investigate the extent to which topic models that improve coherence lead to higher levels of crime concentration. We further explore the transitions into drug addiction using Reddit data. We propose a prediction model to classify users’ transitions from a casual drug discussion forum to a recovery drug discussion forum and to estimate the likelihood of such transitions. Through this study we offer insights into modern drug culture and provide tools with potential applications in combating opioid crises. Lastly, we present a knowledge-graph-based framework for homicide investigation chronologies that may aid investigators in analyzing homicide case data and also allow for post hoc analysis of key features that determine whether a homicide is ultimately solved. For this purpose, we perform named entity recognition to identify witnesses, detectives, and suspects in each chronology, use keyword expansion to identify various evidence types, and finally link these entities and evidence to construct a homicide investigation knowledge graph. We compare the performance of several choices of methodology for these sub-tasks and analyze the association between network statistics of the knowledge graph and homicide solvability. Natural Language Processing Pattern Recognition and Data Mining machine learning data mining techniques text mining Social harm Criminal Justice Natural language processing
16	Skin lesion detection using deep learning Rajit Chandra (12495442) 03 May 2022 Skin lesions can be deadly if not detected early, and early detection can save many lives. Artificial intelligence and machine learning are helping healthcare in many ways, including the diagnosis of skin lesions. Computer-aided diagnosis helps clinicians detect cancer. The study was conducted to classify seven classes of skin lesion using powerful convolutional neural networks. Two pre-trained models, DenseNet and Inception-v3, were employed to train the classifier, and accuracy, precision, recall, F1 score, and ROC-AUC were calculated for every class prediction. Moreover, gradient class activation maps were used to help clinicians determine which regions of an image influence the model to make a certain decision. These visualizations are used for explainability of the model. Experiments showed that DenseNet performed better than Inception-v3. It was also noted that gradient class activation maps highlighted different regions when predicting the same class. The main contribution was to introduce medical-aided visualizations into the lesion classification model that help clinicians understand the decisions of the model and thereby enhance its reliability. Different optimizers were also employed with both models to compare the accuracies. Computer vision Image processing Pattern recognition Data mining and knowledge discovery Skin Cancer Diagnosis Convolutional Neural Networks Imaging DenseNet InceptionNet-V 3 pretrained model focal loss Image Processing Pattern Recognition and Data Mining Computer Vision
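As an illustration of the per-class metrics named in the skin lesion abstract (precision, recall, F1 score), here is a minimal Python sketch. It is not the thesis code: the function name `per_class_metrics` is hypothetical, the labels are made-up placeholders rather than thesis data, and the one-vs-rest counting is an assumed but common way to compute these metrics.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed one-vs-rest."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = (precision, recall, f1)
    return metrics

# Placeholder labels for illustration only (not thesis data).
y_true = ["A", "A", "B", "B", "C", "A"]
y_pred = ["A", "B", "B", "B", "A", "A"]
for label, (p, r, f1) in per_class_metrics(y_true, y_pred).items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Classes that are never predicted correctly (class "C" here) get zeros instead of raising a division error; that convention is a choice made for this sketch.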
17	Efficient and Scalable Subgraph Statistics using Regenerative Markov Chain Monte Carlo Mayank Kakodkar (12463929) 26 April 2022 In recent years there has been growing interest, in data mining and graph machine learning, in techniques that can obtain frequencies of k-node Connected Induced Subgraphs (k-CIS) contained in large real-world graphs. While recent work has shown that 5-CISs can be counted exactly, no exact polynomial-time algorithms are known that solve this task for k > 5. In the past, sampling-based algorithms that work well in moderately sized graphs for k ≤ 8 have been proposed. In this thesis I push this boundary up to k ≤ 16 for graphs containing up to 120M edges, and to k ≤ 25 for smaller graphs containing between one million and 20M edges. I do so by re-imagining two older but elegant and memory-efficient algorithms, FANMOD and PSRW, which have large estimation errors by modern standards. This is because FANMOD produces highly correlated k-CIS samples, and sampling the PSRW Markov chain becomes prohibitively expensive for k-CISs with k > 8.
In this thesis, I introduce:
(a) RTS: a novel regenerative Markov chain Monte Carlo (MCMC) sampling procedure on the tree generated on the fly by the FANMOD algorithm. RTS is able to run on multiple cores and multiple machines (embarrassingly parallel) and to compute confidence intervals of estimates, all while preserving the memory-efficient nature of FANMOD. RTS is thus able to estimate subgraph statistics for k ≤ 16 for larger graphs containing up to 120M edges, and for k ≤ 25 for smaller graphs containing between one million and 20M edges.
(b) R-PSRW: which scales the PSRW algorithm to larger CIS sizes using a rejection sampling procedure to efficiently sample transitions from the PSRW Markov chain. R-PSRW matches RTS in terms of scaling to larger CIS sizes.
(c) Ripple: which achieves unprecedented scalability by stratifying the R-PSRW Markov chain state space into ordered strata via a new technique that I call sequential stratified regeneration. I show that the Ripple estimator is consistent, highly parallelizable, and scales well. Ripple is able to count CISs of size k ≤ 12 in real-world graphs containing up to 120M edges.
My empirical results show that the proposed methods offer a considerable improvement over the state of the art. Moreover, my methods are able to run at a scale that has been considered unreachable until now, not only by prior MCMC-based methods but also by other sampling approaches.
Optimization of Restricted Boltzmann Machines. In addition, I propose a regenerative transformation of MCMC samplers of Restricted Boltzmann Machines (RBMs). My approach, Markov Chain Las Vegas (MCLV), gives statistical guarantees in exchange for random running times. MCLV uses a stopping set built from the training data and has a maximum Markov chain step count K (referred to as MCLV-K). I present an MCLV-K gradient estimator (LVS-K) for RBMs and explore the correspondence and differences between LVS-K and Contrastive Divergence (CD-K). LVS-K significantly outperforms CD-K in the task of training RBMs over the MNIST dataset, indicating MCLV to be a promising direction in learning generative models.
Pattern Recognition and Data Mining Markov Chain Monte Carlo Random Walk Regenerative Sampling Motif Analysis Subgraph Counting Graph Mining Energy Based Models Generative Models Markov Random Fields Restricted Boltzmann Machine Random Walk Tours
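To make the notion of a k-node connected induced subgraph (k-CIS) concrete, the following Python sketch counts them by exhaustive enumeration on a toy graph. This is deliberately the brute-force baseline, not RTS, R-PSRW, or Ripple from the thesis; it only scales to very small graphs, and it counts all k-CISs together rather than grouping them by subgraph pattern. The function names and the toy graph are hypothetical.

```python
from itertools import combinations

def is_connected(nodes, adjacency):
    """Check whether the subgraph induced by `nodes` is connected (DFS)."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        # Only follow edges whose endpoints both lie inside the node subset.
        for neighbor in adjacency[current] & nodes:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen == nodes

def count_k_cis(edges, k):
    """Count k-node connected induced subgraphs by exhaustive enumeration."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return sum(
        1 for subset in combinations(sorted(adjacency), k)
        if is_connected(subset, adjacency)
    )

# Toy graph: a triangle (0, 1, 2) with a pendant node 3 attached to node 2.
toy_edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
print(count_k_cis(toy_edges, 3))  # 3: {0,1,2}, {0,2,3}, {1,2,3}
```

The number of candidate node subsets grows combinatorially with k and graph size, which is why sampling-based estimators such as those described in the abstract are used at scale.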
18	Ameliorating Environmental Effects on Hyperspectral Images for Improved Phenotyping in Greenhouse and Field Conditions Dongdong Ma (9224231) 14 August 2020 Hyperspectral imaging has become one of the most popular technologies in plant phenotyping because it can efficiently and accurately predict numerous plant physiological features such as plant biomass, leaf moisture content, and chlorophyll content. Various hyperspectral imaging systems have been deployed in both greenhouse and field phenotyping activities. However, hyperspectral imaging quality is severely affected by continuously changing environmental conditions, such as cloud cover, temperature, and wind speed, that induce noise in plant spectral data. Eliminating these environmental effects to improve imaging quality is critically important. In this thesis, two approaches were taken to address the imaging noise issue in the greenhouse and in the field separately. First, a computational simulation model was built to simulate greenhouse microclimate changes (such as the temperature and radiation distributions) through a 24-hour cycle in a research greenhouse. The simulated results were used to optimize the movement of an automated conveyor in the greenhouse: the plants were shuffled by the conveyor system with optimized frequency and distance to provide uniform growing conditions, such as temperature and lighting intensity, for each individual plant. The results showed that the variance of the plants’ phenotyping feature measurements decreased significantly (i.e., by up to 83% in plant canopy size) in this conveyor greenhouse. Secondly, the environmental effects (i.e., sun radiation) on aerial hyperspectral images in field plant phenotyping were investigated and modeled. An artificial neural network (ANN) method was proposed to model the relationship between the image variation and environmental changes. Before the 2019 field test, a gantry system was designed and constructed to repeatedly collect time-series hyperspectral images of the corn plants at 2.5-minute intervals under varying environmental conditions, which included sun radiation, solar zenith angle, diurnal time, humidity, temperature, and wind speed. Over 8,000 hyperspectral images of corn (Zea mays L.) were collected with synchronized environmental data throughout the 2019 growing season. The models trained with the proposed ANN method were able to accurately predict the variations in imaging results (i.e., 82.3% for NDVI) caused by the changing environments. Thus, the ANN method can be used by remote sensing professionals to adjust or correct raw imaging data for changing environments to improve plant characterization. Agricultural Engineering Pattern Recognition and Data Mining Agronomy Crops, Agricultural Remote sensing imagery Hyperspectral Reflectance Imaging Plant phenotyping features Environment variation Diurnal variability Time series decomposition Artificial neural network High throughput indoor phenotyping Greenhouse microclimate control
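The abstract above reports results for NDVI without defining it. For reference, NDVI (Normalized Difference Vegetation Index) is conventionally computed as (NIR - Red) / (NIR + Red) from near-infrared and red reflectance. The short Python sketch below shows that standard formula; the reflectance values are hypothetical, and which wavelength bands the thesis used for NIR and red is not stated in the abstract.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    denominator = nir + red
    if denominator == 0:
        return 0.0  # convention chosen in this sketch to avoid division by zero
    return (nir - red) / denominator

# Hypothetical reflectance values (fractions between 0 and 1), not thesis data.
print(round(ndvi(nir=0.50, red=0.08), 3))  # 0.724
```

Because NDVI is a ratio of reflectances, changes in illumination that affect the two bands unequally shift its value, which is the kind of environment-induced variation the abstract describes modeling.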
19	Diffusion Tensor Imaging Analysis for Subconcussive Trauma in Football and Convolutional Neural Network-Based Image Quality Control That Does Not Require a Big Dataset Ikbeom Jang (5929832) 14 May 2019 Diffusion Tensor Imaging (DTI) is a magnetic resonance imaging (MRI)-based technique that has frequently been used for the identification of brain biomarkers of neurodevelopmental and neurodegenerative disorders because of its ability to assess the structural organization of brain tissue. In this work, I present (1) preclinical findings of a longitudinal DTI study that investigated asymptomatic high school football athletes who experienced repetitive head impact and (2) an automated pipeline for assessing the quality of DTI images that uses a convolutional neural network (CNN) and transfer learning. The first section addresses the effects of repetitive subconcussive head trauma on the white matter of adolescent brains. Significant concerns exist regarding sub-concussive injury in football since many studies have reported that repetitive blows to the head may change the microstructure of white matter. This is more problematic in youth-aged athletes whose white matter is still developing. Using DTI and head impact monitoring sensors, regions of significantly altered white matter were identified and within-season effects of impact exposure were characterized by identifying the volume of regions showing significant changes for each individual. The second section presents a novel pipeline for DTI quality control (QC). The complex nature and long acquisition time associated with DTI make it susceptible to artifacts that often result in inferior diagnostic image quality. We propose an automated QC algorithm based on a deep convolutional neural network (DCNN). Adaptation of transfer learning makes it possible to train a DCNN with a relatively small dataset in a short time. 
The QA algorithm detects not only motion- or gradient-related artifacts, but also various erroneous acquisitions, including images with regional signal loss or those that have been incorrectly imaged or reconstructed. Neuroscience Biomarkers Health Care Diseases Central Nervous System Health Informatics Image Processing Pattern Recognition and Data Mining Diffusion Tensor Imaging Traumatic Brain Injury Subconcussive Injury Diffusion-Weighted Imaging Magnetic Resonance Imaging Image Quality Assessment Quality Control Convolutional Neural Network Transfer Learning Football Sport Concussion Quality Assurance
20	n-TARP: A Random Projection based Method for Supervised and Unsupervised Machine Learning in High-dimensions with Application to Educational Data Analysis Yellamraju Tarun (6630578) 11 June 2019 Analyzing the structure of a dataset is a challenging problem in high dimensions, as the volume of the space increases at an exponential rate and data typically becomes sparse in this high-dimensional space. This poses a significant challenge to machine learning methods, which rely on exploiting structures underlying data to make meaningful inferences. This dissertation proposes the n-TARP method as a building block for high-dimensional data analysis, in both supervised and unsupervised scenarios.
The basic element, n-TARP, consists of a random projection framework that transforms high-dimensional data to one-dimensional data in a manner that yields point separations in the projected space. The point separation can be tuned to reflect classes in supervised scenarios and clusters in unsupervised scenarios. The n-TARP method finds linear separations in high-dimensional data. This basic unit can be used repeatedly to find a variety of structures. It can be arranged in a hierarchical structure like a tree, which increases the model complexity, flexibility, and discriminating power. Feature space extensions combined with n-TARP can also be used to investigate non-linear separations in high-dimensional data.
The application of n-TARP to both supervised and unsupervised problems is investigated in this dissertation. In the supervised scenario, a sequence of n-TARP based classifiers with increasing complexity is considered. The point separations are measured by classification metrics like accuracy, Gini impurity, or entropy. The performance of these classifiers on image classification tasks is studied. This study provides insight into the working of classification methods. The sequence of n-TARP classifiers yields benchmark curves that put in context the accuracy and complexity of other classification methods for a given dataset. The benchmark curves are parameterized by classification error and computational cost to define a benchmarking plane. This framework splits the plane into regions of "positive-gain" and "negative-gain", which provide context for the performance and effectiveness of other classification methods. The asymptotes of benchmark curves are shown to be optimal (i.e., at Bayes Error) in some cases (Theorem 2.5.2).
In the unsupervised scenario, the n-TARP method highlights the existence of many different clustering structures in a dataset. However, not all structures present are statistically meaningful. This issue is amplified when the dataset is small, as random events may yield sample sets that exhibit separations that are not present in the distribution of the data. Thus, statistical validation is an important step in data analysis, especially in high dimensions. However, statistically validating results often requires a number of data samples that increases exponentially with the number of dimensions. The proposed n-TARP method circumvents this challenge by evaluating statistical significance in the one-dimensional space of data projections. The n-TARP framework also results in several different statistically valid instances of point separation into clusters, as opposed to a unique "best" separation, which leads to a distribution of clusters induced by the random projection process.
The distributions of clusters resulting from n-TARP are studied. This dissertation focuses on small-sample, high-dimensional problems. A large number of distinct clusters are found, which are statistically validated. The distribution of clusters is studied as the dimensionality of the problem evolves through the extension of the feature space using monomial terms of increasing degree in the original features, which corresponds to investigating non-linear point separations in the projection space.
A statistical framework is introduced to detect patterns of dependence between the clusters formed with the features (predictors) and a chosen outcome (response) in the data that is not used by the clustering method. This framework is designed to detect the existence of a relationship between the predictors and response. This framework can also serve as an alternative cluster validation tool.
The concepts and methods developed in this dissertation are applied to a real-world data analysis problem in Engineering Education. Specifically, engineering students' Habits of Mind are analyzed. The data at hand is qualitative, in the form of text, equations, and figures. To use the n-TARP based analysis method, the source data must be transformed into quantitative data (vectors). This is done by modeling it as a random process based on the theoretical framework defined by a rubric. Since the number of students is small, this problem falls into the small-sample, high-dimensional scenario. The n-TARP clustering method is used to find groups within this data in a statistically valid manner. The resulting clusters are analyzed in the context of education to determine what is represented by the identified clusters. The dependence of student performance indicators, like the course grade, on the clusters formed with n-TARP is studied in the pattern dependence framework, and the observed effect is statistically validated. The data obtained suggests the presence of a large variety of different patterns of Habits of Mind among students, many of which are associated with significant grade differences. In particular, the course grade is found to be dependent on at least two Habits of Mind: "computation and estimation" and "values and attitudes."
Statistics Education Applied Statistics Probability Theory Stochastic Analysis and Modelling Pattern Recognition and Data Mining Education Assessment and Evaluation High-dimensions n-TARP Clustering Methods Machine Learning data analysis study Educational Data Statistical test pattern analysis techniques Pattern Dependence Benchmarks
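The idea described in the n-TARP abstract (project high-dimensional points onto random one-dimensional directions and look for a separation of the projected points) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the dissertation's implementation: the abstract does not specify the separation criterion, so the sketch assumes a two-group split scored by the ratio of within-group to total sum of squares, and the function names, the number of projections, and the synthetic data are all hypothetical.

```python
import random

def best_two_way_split(values):
    """Return (score, threshold) for the best 1-D split into two groups.

    score = within-group sum of squares / total sum of squares, so values
    near 0 indicate two well-separated groups (assumed criterion).
    """
    ordered = sorted(values)
    n = len(ordered)
    total_sum = sum(ordered)
    total_sq = sum(v * v for v in ordered)
    total_ss = total_sq - total_sum ** 2 / n
    if total_ss == 0:
        return 1.0, ordered[0]
    best_score, best_threshold = 1.0, ordered[0]
    left_sum = left_sq = 0.0
    for i in range(n - 1):
        left_sum += ordered[i]
        left_sq += ordered[i] ** 2
        n_left, n_right = i + 1, n - i - 1
        right_sum, right_sq = total_sum - left_sum, total_sq - left_sq
        within = ((left_sq - left_sum ** 2 / n_left)
                  + (right_sq - right_sum ** 2 / n_right))
        score = within / total_ss
        if score < best_score:
            best_score = score
            best_threshold = (ordered[i] + ordered[i + 1]) / 2
    return best_score, best_threshold

def random_projection_clustering(points, n_projections=30, seed=0):
    """Try several random 1-D projections and keep the best-separated one."""
    rng = random.Random(seed)
    dim = len(points[0])
    best = None
    for _ in range(n_projections):
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        projected = [sum(w * x for w, x in zip(direction, p)) for p in points]
        score, threshold = best_two_way_split(projected)
        if best is None or score < best[0]:
            labels = [int(value > threshold) for value in projected]
            best = (score, labels)
    return best

# Synthetic data for illustration: two groups of 15 points in 20 dimensions.
data_rng = random.Random(1)
group_a = [[data_rng.uniform(-1, 1) for _ in range(20)] for _ in range(15)]
group_b = [[10 + data_rng.uniform(-1, 1) for _ in range(20)] for _ in range(15)]
score, labels = random_projection_clustering(group_a + group_b)
print(f"best separation score: {score:.4f}")
print("cluster labels:", labels)
```

Because the split is evaluated on one-dimensional projections, the scoring step only ever works with a sorted list of numbers, which reflects the abstract's point that validation happens in the projected space rather than in the original high-dimensional space.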

Search results