91. Classifying textual fast food restaurant reviews quantitatively using text mining and supervised machine learning algorithms
Wright, Lindsey, 01 May 2018
Companies continually seek to improve their business model through feedback and customer satisfaction surveys. Social media provides additional opportunities for this advanced exploration into the mind of the customer. By extracting customer feedback from social media platforms, companies may increase the sample size of their feedback and remove bias often found in questionnaires, resulting in better informed decision making. However, simply using personnel to analyze the thousands of relevant social media posts is financially expensive and time consuming. Thus, our study aims to establish a method to extract business intelligence from social media content by structuring opinionated textual data using text mining and classifying these reviews by the degree of customer satisfaction. By quantifying textual reviews, companies may perform statistical analysis to extract insight from the data as well as effectively address concerns. Specifically, we analyzed a subset of 56,000 Yelp reviews of fast food restaurants and attempted to predict a quantitative value reflecting the overall opinion of each review. We compared two predictive modeling techniques, bagged decision trees and random forest classifiers. To simplify the problem, we trained our models to classify strongly negative and strongly positive (1- and 5-star) reviews. In addition, we identified the drivers behind strongly positive or negative reviews, allowing businesses to understand their strengths and weaknesses. This approach provides companies with an efficient and cost-effective way to process and understand customer satisfaction as it is discussed on social media.
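A minimal sketch of the modeling comparison described in this abstract, assuming scikit-learn; the toy reviews, the TF-IDF featurization, and the model settings are illustrative assumptions, not the author's actual pipeline.

```python
# Sketch: vectorize review text with TF-IDF, then compare bagged decision
# trees and a random forest on the 1-star vs. 5-star extremes. The reviews
# below are invented stand-ins for the Yelp data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

reviews = [
    "Cold fries and rude staff, never coming back",
    "Best burger I have had in years, amazing service",
    "Waited forty minutes and the order was still wrong",
    "Fresh, fast, and friendly, highly recommend",
    "Dirty tables and stale buns, one star",
    "Great value and the shakes are fantastic",
]
stars = [1, 5, 1, 5, 1, 5]

# Structure the opinionated text as a numeric matrix.
X = TfidfVectorizer(stop_words="english").fit_transform(reviews)
y = [1 if s == 5 else 0 for s in stars]  # 1 = strongly positive

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=2, stratify=y, random_state=0
)

models = {
    # BaggingClassifier's default base estimator is a decision tree.
    "bagged decision trees": BaggingClassifier(n_estimators=100, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, model.predict(X_te)))
```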
92. Power vs. Threshold: Near-Channel Morphology Controls Sediment Rating Curve Shape in Coastal Redwood Watersheds
Fisher, Adam Caspian Nebraska, 01 December 2019
River sediment is one of the most pervasive pollutants in the world. Excess fine sediment can reduce water quality, damage stream ecosystems, and harm aquatic life. Both natural and human-caused processes, such as tectonic uplift, landslides, and timber harvesting, can add sediment to a river. Therefore, it is important to understand how fine sediment enters and moves through a river system to inform policymakers and land managers on effective ecosystem management.
In this study, we determined how the relationship between river flow and suspended sediment changed among watersheds along the North Coast of California. We found a rise in suspended sediment concentration at median flows following extreme timber harvesting. Additionally, our results indicate that river flow and suspended sediment relationships are influenced by timber harvest activity, tectonic uplift, rainfall patterns, and near-channel environments.
These results support previous findings that extreme land disturbance in a watershed, be it natural or human-caused, can change river flow and suspended sediment relationships. Our results suggest that policymakers and land managers should take tectonic uplift into account when making regulations and should prioritize protecting near-channel environments.
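The flow and suspended sediment relationship studied here is conventionally summarized as a power-law sediment rating curve, C = aQ^b, fit in log-log space. A minimal sketch with invented data; the power-law form is a standard convention, not a detail taken from this thesis.

```python
# Minimal sketch of a power-law sediment rating curve, C = a * Q**b, fit by
# linear regression in log-log space. Data values are invented.
import numpy as np

Q = np.array([1.2, 3.5, 8.0, 15.0, 40.0, 90.0])       # discharge, m^3/s
C = np.array([5.0, 20.0, 60.0, 110.0, 400.0, 950.0])  # suspended sediment, mg/L

# log(C) = log(a) + b * log(Q); np.polyfit returns [slope, intercept].
b, log_a = np.polyfit(np.log(Q), np.log(C), 1)
a = np.exp(log_a)
print(f"rating curve: C = {a:.2f} * Q^{b:.2f}")
```

The fitted exponent b controls the curve's shape; the question the thesis asks is how near-channel morphology, timber harvest, uplift, and rainfall shift such curves among watersheds.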
93. Machine learning improves automated cortical surface reconstruction in human MRI studies
Ellis, David G., 01 May 2017
Analysis of surface models reconstructed from human MR images gives researchers the ability to quantify the shape and size of the cerebral cortex. Increasing the reliability of automatic reconstructions would increase the precision, and therefore the power, of studies utilizing cortical surface models. We looked at four different workflows for reconstructing cortical surfaces:
1) BAW + LOGISMOS-B;
2) FreeSurfer + LOGISMOS-B;
3) BAW + FreeSurfer + Machine Learning + LOGISMOS-B;
4) Standard FreeSurfer (Dale et al. 1999).
Workflows 1-3 were developed in this project. Workflow 1 utilized both BRAINSAutoWorkup (BAW) (Kim et al. 2015) and a surface reconstruction tool called LOGISMOS-B (Oguz et al. 2014). Workflow 2 added LOGISMOS-B to a custom-built FreeSurfer workflow that was highly optimized for parallel processing. Workflow 3 combined workflows 1 and 2 and added random forest classifiers for predicting the edges of the cerebral cortex. These predictions were then fed into LOGISMOS-B as the cost function for graph segmentation. To compare these workflows, a dataset of 578 simulated cortical volume changes was created from 20 different sets of MR scans. The workflow utilizing machine learning (workflow 3) produced cortical volume changes with the least error when compared to the known volume changes from the simulations. Machine learning can be effectively used to help reconstruct cortical surfaces that more precisely track changes in the cerebral cortex. This research could be used to increase the power of future projects studying correlations between cortical morphometrics and neurological health.
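A minimal sketch of the idea behind workflow 3, assuming scikit-learn: train a random forest on voxel-wise features to predict cortical-edge probability, then turn the probabilities into a cost map for the graph segmentation. The features, data, and cost transform are illustrative assumptions, not the project's actual code.

```python
# Sketch of workflow 3's machine-learning step: a random forest predicts
# per-voxel cortical-edge probability, and the probabilities become the
# cost map fed to the graph segmentation. All inputs are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical per-voxel features (e.g., intensity, gradient magnitude,
# distance to an initial surface) and edge labels (1 = cortical edge).
X_train = rng.normal(size=(5000, 3))
y_train = (X_train[:, 1] > 1.0).astype(int)  # stand-in labeling rule

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Features for every voxel of a new image, flattened for simplicity.
X_image = rng.normal(size=(64 * 64, 3))
edge_prob = clf.predict_proba(X_image)[:, 1]

# High edge probability -> low cost, so the surface is drawn to the edge.
cost_map = 1.0 - edge_prob.reshape(64, 64)
print(cost_map.shape, float(cost_map.min()), float(cost_map.max()))
```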
94. Machine-learning based automated segmentation tool development for large-scale multicenter MRI data analysis
Kim, Eun Young, 01 December 2013
Background: Volumetric analysis of brain structures from structural Magnetic Resonance (MR) images advances the understanding of the brain by providing a means to study brain morphometric changes quantitatively across aging, development, and disease status. Due to the recent increased emphasis on large-scale multicenter brain MR study designs, the demand for an automated brain MRI processing tool has increased as well. This dissertation describes an automatic segmentation framework for subcortical structures of brain MRI that is robust for a wide variety of MR data.
Method: The proposed segmentation framework, BRAINSCut, is an integration of robust data standardization techniques and machine-learning approaches. First, a robust multi-modal pre-processing tool for automated registration, bias correction, and tissue classification was implemented for large-scale heterogeneous multi-site longitudinal MR data analysis. The segmentation framework was then constructed to achieve robustness for large-scale data via the following comparative experiments: 1) find the best machine-learning algorithm among several available approaches in the field; 2) find an efficient intensity normalization technique for the proposed region-specific localized normalization with a choice of robust statistics; and 3) find high-quality features that best characterize the MR brain subcortical structures. Our tool is built upon 32 handpicked multi-modal multicenter MR images with manual traces of six subcortical structures (nucleus accumbens, caudate nucleus, globus pallidus, putamen, thalamus, and hippocampus) from three experts.
A fundamental task associated with brain MR image segmentation for research and clinical trials is the validation of segmentation accuracy. This dissertation evaluated the proposed segmentation framework in terms of validity and reliability. Three groups of data were employed for the various evaluation aspects: 1) traveling human phantom data for multicenter reliability, 2) a set of repeated scans for measurement stability across various disease statuses, and 3) large-scale data from a Huntington's disease (HD) study for software robustness as well as segmentation accuracy.
Result: Segmentation accuracy for the six subcortical structures was improved with 1) the bias-corrected inputs, 2) the two region-specific intensity normalization strategies, and 3) the random forest machine-learning algorithm with the selected feature-enhanced images. The analysis of traveling human phantom data showed no center-specific bias in volume measurements from BRAINSCut. The repeated-measure reliability of most structures also displayed no specific association with disease progression, except for the caudate nucleus in the group at high risk for HD. The constructed segmentation framework was successfully applied to multicenter MR data from the PREDICT-HD [133] study (< 10% failure rate over 3,000 scan sessions processed).
Conclusion: The random forest based segmentation method is effective and robust to large-scale multicenter data variation, especially with a proper choice of intensity normalization techniques. Proper normalization approaches contributed more to the accuracy and robustness of the segmentation tool than the custom set of feature-enhanced images. BRAINSCut effectively produced subcortical volumetric measurements that are robust to center and disease status, with validity confirmed by human experts and a low failure rate on large-scale multicenter MR data. Sample size estimation, which is crucial for designing efficient clinical and research trials, is provided based on our experiments for the six subcortical structures.
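A minimal sketch of the region-specific intensity normalization idea with robust statistics (median and MAD); the exact statistics and regions BRAINSCut uses are described in the dissertation, so treat this as illustrative only.

```python
# Sketch of region-specific intensity normalization with robust statistics:
# rescale intensities by the region's median and median absolute deviation
# (MAD) so a few extreme voxels cannot skew the normalization.
import numpy as np

def normalize_region(image, roi_mask):
    """Robust z-score of an image using statistics from one region of interest."""
    roi = image[roi_mask]
    med = np.median(roi)
    mad = np.median(np.abs(roi - med))
    scale = 1.4826 * mad  # makes MAD consistent with std under normality
    return (image - med) / max(scale, 1e-6)

# Hypothetical T1 slice and a subcortical region-of-interest mask.
rng = np.random.default_rng(0)
t1 = rng.normal(100.0, 15.0, size=(128, 128))
mask = np.zeros_like(t1, dtype=bool)
mask[40:70, 50:80] = True

normalized = normalize_region(t1, mask)
print(float(normalized[mask].mean()))  # roughly centered near zero
```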
95. Maintaining Population Persistence in the Face of an Extremely Altered Hydrograph: Implications for Three Sensitive Fishes in a Tributary of the Green River, Utah
Bottcher, Jared L., 01 May 2009
The ability of an organism to disperse to suitable habitats, especially in modified and fragmented systems, determines individual fitness and overall population viability. The bluehead sucker (Catostomus discobolus), flannelmouth sucker (Catostomus latipinnis), and roundtail chub (Gila robusta) are three species native to the upper Colorado River Basin that now occupy only 50% of their historic range. Despite these distributional declines, populations of all three species are present in the San Rafael River, a highly regulated tributary of the Green River, Utah, providing an opportunity for research. Our goal was to determine the timing and extent of movement, habitat preferences, and limiting factors, ultimately to guide effective management and recovery of these three species. In 2007-2008, we sampled fish from 25 systematically selected, 300-m reaches in the lower 64 km of the San Rafael River, spaced to capture the range of species, life stages, and habitat conditions present. We implanted all target species with passive integrated transponder (PIT) tags, installed a passive PIT tag antenna, and measured key habitat parameters throughout each reach and at the site of each native fish capture. We used random forest modeling to identify and rank the most important abiotic and biotic predictor variables and reveal potential limiting factors in the San Rafael River. While flannelmouth sucker were relatively evenly distributed within our study area, the highest densities of roundtail chub and bluehead sucker occurred in isolated, upstream reaches characterized by complex habitat. In addition, our movement and length-frequency data indicate downstream drift of age-0 roundtail chub and active upstream movement of adult flannelmouth sucker, both from source populations, providing the lower San Rafael River with colonists. Our random forest analysis highlights the importance of pools, riffles, and distance to source populations, suggesting that bluehead sucker and roundtail chub are habitat limited in the lower San Rafael River. These results suggest management efforts should focus on diversifying habitat, maintaining in-stream flows, and removing barriers to movement.
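A minimal sketch of the random forest ranking step, assuming scikit-learn; the predictor names and data below are hypothetical stand-ins for the measured habitat variables.

```python
# Sketch: rank abiotic and biotic predictors of fish density with random
# forest variable importance. Predictor names and data are hypothetical.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 25  # one row per sampled 300-m reach
X = pd.DataFrame({
    "pool_frequency": rng.uniform(0, 1, n),
    "riffle_frequency": rng.uniform(0, 1, n),
    "distance_to_source_km": rng.uniform(0, 64, n),
    "mean_depth_m": rng.uniform(0.1, 2.0, n),
})
y = rng.poisson(5, n)  # stand-in for per-reach roundtail chub counts

rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
ranked = sorted(zip(X.columns, rf.feature_importances_), key=lambda t: -t[1])
for name, importance in ranked:
    print(f"{name:25s} {importance:.3f}")
```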
96. Modeling USA stream temperatures for stream biodiversity and climate change assessments
Hill, Ryan A., 01 May 2013
Stream temperature (ST) is a primary determinant of individual stream species distributions and community composition. Moreover, thermal modifications associated with urbanization, agriculture, reservoirs, and climate change can significantly alter stream ecosystem structure and function. Despite its importance, we lack ST measurements for the vast majority of USA streams. To effectively manage these important systems, we need to understand how STs vary geographically, what the natural (reference) thermal condition of altered streams was, and how STs will respond to climate change. Empirical ST models, if calibrated with physically meaningful predictors, could provide this information. My dissertation objectives were to: (1) develop empirical models that predict reference- and nonreference-condition STs for the conterminous USA, (2) assess how well modeled STs represent measured STs for predicting stream biotic communities, and (3) predict potential climate-related alterations to STs. For objective 1, I used random forest modeling with environmental data from several thousand US Geological Survey sites to model geographic variation in nonreference mean summer, mean winter, and mean annual STs. I used these models to identify thresholds of watershed alteration below which there were negligible effects on ST. With these reference-condition sites, I then built ST models to predict summer, winter, and annual STs that should occur in the absence of human-related alteration (r2 = 0.87, 0.89, and 0.95, respectively). To meet objective 2, I compared how well modeled and measured STs predicted stream benthic invertebrate composition across 92 streams. I also compared predicted and measured STs for estimating taxon-specific thermal optima. Modeled and measured STs performed equally well both in predicting invertebrate composition and in estimating taxon-specific thermal optima (r2 between observation-derived and model-derived optima = 0.97). For objective 3, I first showed that predicted and measured STs responded similarly to historical variation in air temperatures. I then used downscaled climate projections to predict that summer, winter, and annual STs will warm by 1.6 to 1.7 °C on average by 2099. Finally, I used additional modeling to identify initial stream and watershed conditions (i.e., low heat loss rates and small base-flow index) most strongly associated with ST vulnerability to climate change.
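A minimal sketch of the objective 1 modeling step, assuming scikit-learn; the predictors and the toy generating process are illustrative assumptions, not the dissertation's actual covariates.

```python
# Sketch of objective 1: random forest regression of mean summer stream
# temperature (ST) on watershed predictors. Predictor names and the toy
# generating process are hypothetical, not the dissertation's covariates.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000  # stand-in for "several thousand" USGS sites
air_temp = rng.normal(22, 4, n)      # mean summer air temperature (deg C)
elevation = rng.uniform(0, 3000, n)  # site elevation (m)
base_flow = rng.uniform(0, 1, n)     # base-flow index
X = np.column_stack([air_temp, elevation, base_flow])

# Toy generating process: ST tracks air temperature, damped by elevation
# and groundwater contribution (base flow).
y = 2 + 0.8 * air_temp - 0.002 * elevation - 3 * base_flow + rng.normal(0, 1, n)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
print("cross-validated r2:", cross_val_score(rf, X, y, cv=5, scoring="r2").mean())
```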
97. A Decision Support Model for Personalized Cancer Treatment
Rico-Fontalvo, Florentino Antonio, 30 October 2014
This work is motivated by the need to provide patients with a decision support system that facilitates the selection of the most appropriate treatment strategy in cancer care. Treatment options are currently subject to predetermined clinical pathways and medical expertise but generally do not consider individual patient characteristics or preferences. Although genomic patient data are available, this information is rarely used in the clinical setting for real-life patient care. In the area of personalized medicine, advances in the fundamental understanding of cancer biology and clinical oncology can promote the prevention, detection, and treatment of cancer.
The objectives of this research are twofold: 1) to develop a patient-centered decision support model that can determine the most appropriate cancer treatment strategy based on subjective medical decision criteria and the patient's characteristics concerning the available treatment options and desired clinical outcomes; and 2) to develop a methodology to organize and analyze gene expression data and validate its accuracy as a predictive model for a patient's response to radiation therapy (tumor radiosensitivity).
The complexity and dimensionality of the data generated by gene expression microarrays require advanced computational approaches. The microarray gene expression data processing and prediction model is built in four steps: transformation of the response variable to emphasize the lower and upper extremes (corresponding to radiosensitive and radioresistant cell lines); dimensionality reduction to select candidate gene expression probesets; model development using a random forest algorithm; and validation of the model in two clinical cohorts of colorectal and esophagus cancer patients.
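A condensed, illustrative sketch of the four-step pipeline, assuming scikit-learn; the synthetic data, the correlation-based probeset filter, and the logit response transform are stand-ins for the methods detailed in the dissertation.

```python
# Condensed sketch of the four-step pipeline: transform the response,
# reduce dimensionality to candidate probesets, fit a random forest,
# and validate on held-out samples. Data and selection rule are invented.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_cells, n_probesets = 48, 5000
expr = rng.normal(size=(n_cells, n_probesets))  # expression matrix (synthetic)
sf2 = rng.uniform(0.1, 0.9, n_cells)            # radiosensitivity proxy (synthetic)

# Step 1: transform the response to emphasize the lower and upper extremes
# (radiosensitive vs. radioresistant); a logit stretch as a stand-in.
y = np.log(sf2 / (1 - sf2))

# Step 2: dimensionality reduction, keeping probesets most correlated with y
# (a hypothetical filter; the dissertation details the actual selection).
corr = np.abs([np.corrcoef(expr[:, j], y)[0, 1] for j in range(n_probesets)])
keep = np.argsort(corr)[-200:]

# Step 3: random forest model; Step 4: validation on held-out samples
# (the study validated in colorectal and esophagus cancer cohorts).
X_tr, X_te, y_tr, y_te = train_test_split(expr[:, keep], y, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out r2:", rf.score(X_te, y_te))
```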
Subjective human decision-making plays a significant role in defining the treatment strategy. Thus, the decision model developed in this research uses language and mechanisms suitable for human interpretation and understanding through fuzzy sets and degrees of membership. The treatment selection strategy is modeled using a fuzzy logic framework to account for the subjectivity associated with the medical strategy and the patient's characteristics and preferences. The decision model considers criteria associated with survival rate, adverse events, and efficacy (measured by radiosensitivity) for treatment recommendation. Finally, a sensitivity analysis evaluates the impact of introducing radiosensitivity into the decision-making process.
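A minimal sketch of the fuzzy-set idea, with ramp-shaped membership functions and hypothetical criterion weights, both invented for illustration rather than taken from the model itself.

```python
# Sketch of the fuzzy-set layer: map each criterion (survival, adverse
# events, radiosensitivity) to a degree of membership in "favorable" and
# aggregate with weights. Membership shapes and weights are invented.
def mu_favorable(x, lo, hi):
    """Membership in the fuzzy set 'favorable': 0 below lo, 1 above hi."""
    return max(0.0, min(1.0, (x - lo) / (hi - lo)))

def favorability(survival_rate, adverse_event_rate, radiosensitivity):
    """Weighted aggregate degree of membership for one treatment strategy."""
    memberships = (
        mu_favorable(survival_rate, 0.4, 0.9),
        mu_favorable(1.0 - adverse_event_rate, 0.6, 0.95),  # fewer adverse events is better
        mu_favorable(radiosensitivity, 0.3, 0.8),
    )
    weights = (0.5, 0.3, 0.2)  # hypothetical criterion weights
    return sum(w * m for w, m in zip(weights, memberships))

# Compare two hypothetical treatment strategies for one patient.
print("strategy A:", favorability(0.78, 0.10, 0.65))
print("strategy B:", favorability(0.70, 0.05, 0.80))
```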
The intellectual merit of this research stems from the fact that it advances the science of decision-making by integrating concepts from the fields of artificial intelligence, medicine, biology, and biostatistics to develop a decision aid approach that considers conflicting objectives and has high practical value. The model focuses on criteria relevant to cancer treatment selection, but it can be modified and extended to other scenarios beyond the healthcare environment.
98. Classification of Genotype and Age of Eyes Using RPE Cell Size and Shape
Yu, Jie, 18 December 2012
Retinal pigment epithelium (RPE) is a principal site of pathogenesis in age-related macular degeneration (AMD). AMD is a leading cause of vision loss, and even blindness, in the elderly, and no effective treatment currently exists. Our aim is to describe the relationship between the morphology of RPE cells and the age and genotype of the eye. We use principal component analysis (PCA) or functional principal component analysis (FPCA), support vector machines (SVM), and random forest (RF) methods to analyze the morphological data of RPE cells in mouse eyes and classify their age and genotype. Our analyses show that, among all morphometric measures of RPE cells, cell shape measurements (eccentricity and solidity) are good for classification, but a combination of cell shape and size (perimeter) provides the best classification.
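A minimal sketch of the classification comparison, assuming scikit-learn; the synthetic morphometrics below stand in for the measured RPE cell data.

```python
# Sketch: classify genotype from per-cell morphometrics (eccentricity,
# solidity, perimeter) with SVM and random forest. Data are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300  # cells
genotype = rng.integers(0, 2, n)  # 0 = wild type, 1 = mutant (hypothetical)
eccentricity = rng.normal(0.60 + 0.10 * genotype, 0.05, n)
solidity = rng.normal(0.95 - 0.03 * genotype, 0.02, n)
perimeter = rng.normal(60 + 8 * genotype, 5, n)  # cell size measure
X = np.column_stack([eccentricity, solidity, perimeter])

for name, model in [
    ("SVM", make_pipeline(StandardScaler(), SVC())),
    ("random forest", RandomForestClassifier(n_estimators=300, random_state=0)),
]:
    print(name, cross_val_score(model, X, genotype, cv=5).mean())
```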
99. A Universal Islanding Detection Technique for Distributed Generation Using Pattern Recognition
Faqhruldin, Omar, 22 August 2013
In the past, distribution systems were characterized by unidirectional power flow from the main power generation units to consumers. However, with changes in power system regulation and increasing incentives for integrating renewable energy sources, Distributed Generation (DG) has become an important component of modern distribution systems. When a portion of the system is energized by one or more DG units while disconnected from the grid, that portion becomes islanded and may cause several operational and safety issues. Therefore, an accurate and fast islanding detection technique is needed to avoid these issues, as per IEEE Standard 1547-2003 [1]. Islanding detection techniques depend on the type of DG connected to the system and can achieve accurate results when only one type of DG is used. Thus, a major challenge is to design a universal islanding technique that detects islanding accurately and in a timely manner for different DG types and multiple DG units in the system.
This thesis introduces an efficient universal islanding detection method that can be applied to both inverter-based and synchronous-based DG. The proposed method relies on extracting a group of features from measurements of the voltage and frequency at the Point of Common Coupling (PCC) of the targeted island. The Random Forest (RF) classification technique is used to distinguish between islanding and non-islanding situations, with the goals of achieving a zero Non-Detection Zone (NDZ), the region in which islanding detection techniques fail to detect islanding, and avoiding nuisance DG tripping during non-islanding conditions. The accuracy of the proposed technique is evaluated using cross-validation. The proposed islanding detection technique is shown to have a zero NDZ, 98% accuracy, and a fast response when applied to both types of DG. Finally, four other classifiers are compared with the Random Forest classifier, and the RF technique proved to be the most efficient approach for islanding detection.
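A minimal sketch of the classification stage, assuming scikit-learn; the PCC-derived feature names and the simulated events are hypothetical stand-ins for the thesis's feature set.

```python
# Sketch: random forest islanding classifier on features extracted from
# voltage and frequency at the PCC, evaluated by cross-validation.
# Feature names and simulated events are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 400  # simulated islanding and non-islanding events
islanded = rng.integers(0, 2, n)

# Stand-ins for PCC-derived features: frequency deviation, rate of change
# of frequency (ROCOF), and voltage magnitude change.
freq_dev = rng.normal(0.05 + 0.40 * islanded, 0.10, n)
rocof = rng.normal(0.10 + 0.80 * islanded, 0.20, n)
delta_v = rng.normal(0.02 + 0.10 * islanded, 0.05, n)
X = np.column_stack([freq_dev, rocof, delta_v])

rf = RandomForestClassifier(n_estimators=200, random_state=0)
print("cross-validated accuracy:", cross_val_score(rf, X, islanded, cv=10).mean())
```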
100. Ensemble Learning With Imbalanced Data
Shoemaker, Larry, 20 September 2010
We describe an ensemble approach to learning salient spatial regions from arbitrarily partitioned simulation data. Ensemble approaches for anomaly detection are also explored. The partitioning comes from the distributed processing requirements of large-scale simulations. The volume of the data is such that classifiers can train only on data local to a given partition. Since the data partition reflects the needs of the simulation, the class statistics can vary from partition to partition. Some classes will likely be missing from some or even most partitions. We combine a fast ensemble learning algorithm with scaled probabilistic majority voting in order to learn an accurate classifier from such data. Since some simulations are difficult to model without a considerable number of false positive errors, and since we are essentially building a search engine for simulation data, we order predicted regions to increase the likelihood that most of the top-ranked predictions are correct (salient). Results from simulation runs of a canister being torn and of a casing being dropped show that regions of interest are successfully identified in spite of the class imbalance in the individual training sets. Lift curve analysis shows that the use of data-driven ordering methods provides a statistically significant improvement over the default, natural time-step ordering. Significant time is saved for the end user by allowing an improved focus on areas of interest without the need to conventionally search all of the data. We have also found that using random forests, weighted and distance-based outlier ensemble methods for supervised learning of anomaly detection provides significant accuracy improvements over existing methods on the same dataset. Further, distance-based outlier and local outlier factor ensemble methods for unsupervised learning of anomaly detection also compare favorably to existing methods.
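A minimal sketch of combining partition-local classifiers by probability voting; the specific scaling rule here (zero out classes a partition never saw, then sum) is an illustrative assumption, not necessarily the scaled voting scheme used in this work.

```python
# Sketch: each classifier trains only on its own partition (some classes
# missing), then contributes class probabilities that are scaled so classes
# a partition never saw get zero weight before the votes are summed.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_classes = 3

def make_partition(missing_class):
    """Build partition data with one class absent (-1 keeps all classes)."""
    y = rng.integers(0, n_classes, 500)
    y = y[y != missing_class]
    X = rng.normal(loc=y[:, None], scale=1.0, size=(len(y), 4))  # class-shifted features
    return X, y

partitions = [make_partition(m) for m in (2, 2, 0, -1)]
models = [RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
          for X, y in partitions]

X_test = rng.normal(loc=1.0, scale=1.0, size=(10, 4))
votes = np.zeros((len(X_test), n_classes))
for model in models:
    proba = model.predict_proba(X_test)  # columns follow model.classes_
    scaled = np.zeros_like(votes)
    scaled[:, model.classes_] = proba    # unseen classes contribute zero
    votes += scaled

print(votes.argmax(axis=1))  # ensemble prediction per test point
```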