1 |
Application of Machine Learning and AI for Prediction in Ungauged Basins / Pin-Ching Li (16734693) 03 August 2023
<p>Streamflow prediction in ungauged basins (PUB) is a process generating streamflow time series at ungauged reaches in a river network. PUB is essential for facilitating various engineering tasks such as managing stormwater, water resources, and water-related environmental impacts. Machine Learning (ML) has emerged as a powerful tool for PUB using its generalization process to capture the streamflow generation processes from hydrological datasets (observations). ML’s generalization process is impacted by two major components: data splitting process of observations and the architecture design. To unveil the potential limitations of ML’s generalization process, this dissertation explores its robustness and associated uncertainty. More precisely, this dissertation has three objectives: (1) analyzing the potential uncertainty caused by the data splitting process for ML modeling, (2) investigating the improvement of ML models’ performance by incorporating hydrological processes within their architectures, and (3) identifying the potential biases in ML’s generalization process regarding the trend and periodicity of streamflow simulations.</p><p>The first objective of this dissertation is to assess the sensitivity and uncertainty caused by the regular data splitting process for ML modeling. The regular data splitting process in ML was initially designed for homogeneous and stationary datasets, but it may not be suitable for hydrological datasets in the context of PUB studies. Hydrological datasets usually consist of data collected from diverse watersheds with distinct streamflow generation regimes influenced by varying meteorological forcing and watershed characteristics. To address the potential inconsistency in the data splitting process, multiple data splitting scenarios are generated using the Monte Carlo method. 
The random data splitting scenario produces frequent covariate shift, which tends to add uncertainty and bias to ML’s generalization process. The findings for this objective suggest that avoiding covariate shift during data splitting is important when developing ML models for PUB, because it enhances the robustness and reliability of ML’s performance.</p><p>The second objective of this dissertation is to investigate the improvement in ML models’ performance brought by Physics-Guided Architecture (PGA), which couples ML with the rainfall abstraction process. PGA is a theory-guided machine learning framework integrating conceptual tutors (CTs) with ML models. In this study, CTs correspond to rainfall abstractions estimated by the Green-Ampt (GA) and SCS-CN models. Integrating the GA model’s CTs, which carry information on dynamic soil properties, into PGA models leads to better performance than a regular ML model. In contrast, PGA models integrating the SCS-CN model’s CTs yield no significant improvement in the ML model’s performance. The results of this objective demonstrate that ML’s generalization process can be improved by incorporating CTs involving dynamic soil properties.</p><p>The third objective of this dissertation is to explore the limitations of ML’s generalization process in capturing trend and periodicity for streamflow simulations. Trend and periodicity are essential components of streamflow time series, representing the long-term correlations and periodic patterns, respectively. When ML models generate streamflow simulations, they tend to have relatively strong long-term periodic components, such as yearly and multiyear periodic patterns. In addition, compared to the observed streamflow data, the ML models display relatively weak short-term periodic components, such as daily and weekly periodic patterns. 
As a result, ML’s generalization process may struggle to capture the short-term periodic patterns in the streamflow simulations. These biases in ML’s generalization process underline the need for external knowledge to improve the representation of the short-term periodic components in simulated streamflow.</p>
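The comparison of short-term and long-term periodic components described in the third objective can be illustrated with a simple periodogram. The following is only a minimal sketch under stated assumptions: the function name `band_power_share`, the synthetic series, and the 7-day cutoff are hypothetical and are not taken from the dissertation, whose actual spectral method is not described in the abstract.

```python
import cmath
import math


def band_power_share(series, max_period):
    """Share of non-constant spectral power at periods <= max_period (in time steps)."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    short, total = 0.0, 0.0
    # Plain DFT over positive frequencies k = 1 .. n//2; the period is n / k steps.
    for k in range(1, n // 2 + 1):
        coeff = sum(c * cmath.exp(-2j * math.pi * k * t / n)
                    for t, c in enumerate(centered))
        power = abs(coeff) ** 2
        total += power
        if n / k <= max_period:
            short += power
    return short / total if total > 0 else 0.0


# Hypothetical synthetic daily series: a yearly cycle plus a weekly cycle.
days = range(364)
observed = [math.sin(2 * math.pi * t / 364) + 0.5 * math.sin(2 * math.pi * t / 7)
            for t in days]
# A stand-in "simulation" that keeps the yearly cycle but damps the weekly one.
simulated = [math.sin(2 * math.pi * t / 364) + 0.1 * math.sin(2 * math.pi * t / 7)
             for t in days]

obs_share = band_power_share(observed, max_period=7)
sim_share = band_power_share(simulated, max_period=7)
print(f"short-period power share: observed={obs_share:.3f}, simulated={sim_share:.3f}")
```

A lower short-period share in the simulated series than in the observed one would correspond to the kind of weak short-term periodicity the abstract reports.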
|
2 |
Three Essays in Economics / Daniel G Kebede (16652025) 03 August 2023
<p> The overall theme of my dissertation is applying frontier econometric models to interesting economic problems. The first chapter starts from the observation that the analysis of how individual consumption responds to permanent and transitory income shocks is limited by model misspecification and the availability of data. The misspecification arises from ignoring unemployment risk while estimating income shocks. I employ the Heckman two-step regression model to consistently estimate income shocks. Moreover, to deal with data sparsity, I propose identifying the partial consumption insurance and the income and consumption volatility heterogeneities at the household level using the Least Absolute Shrinkage and Selection Operator (LASSO). Using PSID data, I estimate partial consumption insurance against permanent shocks of 63% and 49% for white and black household heads, respectively; the white and black household heads self-insure against 100% and 90% of the transitory income shocks, respectively. Moreover, I find that income and consumption volatilities and partial consumption insurance parameters vary across time. In the second chapter, I recast the smooth structural break test proposed by Chen and Hong (2012) in a predictive regression setting. The regressors are characterized using the local to non-stationarity framework. I conduct a Monte Carlo experiment to evaluate the finite-sample performance of the test statistic and examine an empirical example to demonstrate its practical application. The Monte Carlo simulations show that the test statistic has better power and size than the popular SupF and LM tests. Empirically, compared to SupF and LM, the test statistic rejects the null hypothesis of no structural break more frequently when a structural break is actually present in the data. The third chapter is a collaboration with James Reeder III. We study the effects of using promotions to drive public policy diffusion in regions with polarized political beliefs. 
We estimate a model that allows for heterogeneous effects at the county level based upon state-level promotional offerings to drive vaccine adoption during COVID-19. Central to our empirical application is accounting for the endogenous action of state-level agents in generating promotional schemes. To address this challenge, we synthesize various sources of data at the county level and leverage advances in both the Bass Diffusion model and machine learning. Studying vaccination rates at the county level within the United States, we find evidence that the use of promotions actually reduced the overall rates of vaccine adoption, a stark difference from other studies examining more localized vaccine rates. The negative average effect is driven primarily by the large number of counties that are described as Republican-leaning based upon their voting record in the 2020 election. Even directly accounting for the population’s vaccine hesitancy, this result still stands. Thus, our analysis suggests that in the polarized setting of the United States electorate, more localized policies on contentious topics may yield better outcomes than broad, state-level dictates. </p>
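For readers unfamiliar with the Bass Diffusion model mentioned in the third chapter, its standard closed-form adoption curve can be sketched as follows. This is only the textbook model, not the dissertation's specification (which adds county-level heterogeneity and handles endogenous promotions); the function name and the parameter values are hypothetical and chosen for illustration.

```python
import math


def bass_cumulative_adoption(t, p, q):
    """Closed-form cumulative adoption fraction F(t) of the standard Bass model.

    p: coefficient of innovation (external influence)
    q: coefficient of imitation (internal, word-of-mouth influence)
    """
    decay = math.exp(-(p + q) * t)
    return (1.0 - decay) / (1.0 + (q / p) * decay)


# Hypothetical parameter values, not estimates from the dissertation.
p, q = 0.03, 0.38
for period in (0, 5, 10, 20):
    print(period, round(bass_cumulative_adoption(period, p, q), 3))
```

In this textbook form, a promotion would typically be modeled as shifting `p` or `q`; how the dissertation actually introduces promotions is not stated in the abstract.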
|
3 |
Malicious Activity Detection in Encrypted Network Traffic using A Fully Homomorphic Encryption Method / Adiyodi Madhavan, Resmi, Sajan, Ann Zenna January 2022
Privacy and data protection matter to everyone, all the more as encrypted transmission becomes common. Fully Homomorphic Encryption (FHE) has received increased attention because of its capability to execute calculations over the encrypted domain. With the FHE approach, model training can be safely outsourced. The goal of FHE is to enable computations on encrypted data without decrypting anything other than the final result. The CKKS scheme is used for FHE in this study. Network threats are a serious danger to credential information, since they can enable an unauthorised user to extract important and sensitive data by examining computations done on raw data. The study therefore provides an efficient solution to the problem of privacy protection in data-driven applications using Machine Learning. The study used an encrypted NSL-KDD dataset. Machine learning-based techniques have emerged as a significant trend for detecting malicious attacks. Thus, Random Forest (RF) is proposed for the detection of malicious attacks on homomorphically encrypted data in the cloud server. A Logistic Regression (LR) machine learning model is used for prediction on encrypted data on the cloud server. Regardless of the distributed setting, the technique may retain the accuracy and integrity of the previous methods in obtaining the final results.
|
4 |
Modelización integrada con aprendizaje automático para evaluar la contaminación por nutrientes en las masas de agua actual y bajo el efecto del cambio climático. Aplicación a la Demarcación Hidrográfica del Júcar / Dorado Guerra, Diana Yaritza 26 February 2024
Thesis by compendium / [EN] Water pollution poses a critical environmental challenge globally and in the European Union (EU), particularly in the Mediterranean region of Spain. Population growth, increasing demand for food and fuels, coupled with climate change, intensify nutrient pollution in water bodies. This pollution threatens water quality, aquatic ecosystems, and human health. The complexity of nutrient transport pathways makes monitoring and mitigation challenging. Comprehensive models that link processes and cause-and-effect relationships are required to effectively control pollution. In the Mediterranean region, such as the Júcar River Basin District (RBD), the interaction between surface and groundwater is crucial, but traditional models have limitations. This thesis addresses these challenges by characterising the contribution of nutrients to surface waters in the Júcar RBD, evaluating pollution reduction measures considering long-term climate change, and applying supervised learning techniques to predict nitrate concentrations. 
The coupling of hydrological and water quality models, along with machine learning, provides a deep and valuable understanding of the factors behind nutrient pollution and establishes a solid foundation for decision-making and sustainable water management in the Júcar RBD and similar regions. This thesis is structured as a compendium of three articles that encompass these challenges. The first article delves into the complex interaction between surface and groundwater in the Júcar RBD basins, focusing on nitrate pollution dynamics. The results reveal a direct linear correlation between nitrate concentrations in rivers and aquifers along the main axes of the Júcar and Turia rivers, highlighting the fundamental role of groundwater contributions to river nitrate levels. Additionally, the study identifies downstream regions with intensified agricultural and urban activities as nitrate pollution hotspots. This research not only identifies pollution sources but also offers a means to predict nitrate concentrations and assess the effectiveness of pollution prevention measures.
The second article addresses the vulnerability of surface water quality to climate change and long-term diffuse and point source pollution reduction scenarios in the Júcar RBD basins. In a region where nutrient concentrations are of particular concern, the study investigates how changing climatic conditions, including rising temperatures and altered precipitation patterns, affect nitrate, ammonium, phosphorus, and biochemical oxygen demand (BOD5) levels. The results indicate that under climate change scenarios, significantly more water bodies are expected to be in poor condition for ammonium, phosphorus, and BOD5, and to a lesser extent, nitrate. Specifically, average concentrations of ammonium and phosphorus could double during low-flow months. To maintain current water quality, substantial reductions of at least 25% in diffuse nitrate pollution and 50% in point source loads of ammonium, phosphorus, and BOD5 are required. This research underscores the importance of water quality management strategies.
The third article introduces an innovative approach to simulate nitrate concentrations in surface water bodies using machine learning models. Leveraging feature selection methods and artificial intelligence algorithms, including random forest (RF) and eXtreme Gradient Boosting (XGBoost), the study achieved high precision in predicting nitrate concentrations. These models analysed 19 input variables spanning ecological, hydrological, and environmental factors, along with nitrate concentration data from surface water quality gauging stations. In particular, the research highlighted the dominant role of location, explaining 87% of nitrate variability in relation to nitrogen and phosphorus concentration. This research showcased the potential of machine learning in water quality prediction and risk assessment. / We appreciate the help provided by the Júcar River Basin District Authority (CHJ), who gathered field data. The first author’s research was partially funded by a PhD scholarship from the food research stream of the programme “Colombia Científica—Pasaporte a la Ciencia”, granted by the Colombian Institute for Educational Technical Studies Abroad (Instituto Colombiano de Crédito Educativo y Estudios Técnicos en el Exterior, ICETEX). The authors thank the Spanish Research Agency (AEI) for the financial support to RESPHIRA project (PID2019-106322RB-100)/AEI/10.13039/501100011033. The contributors gratefully acknowledge funding for open access charge: CRUE-Universitat Politècnica de València / Dorado Guerra, DY. (2024). Modelización integrada con aprendizaje automático para evaluar la contaminación por nutrientes en las masas de agua actual y bajo el efecto del cambio climático. Aplicación a la Demarcación Hidrográfica del Júcar [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/202898 / Compendio
|
5 |
Applied Machine Learning Predicts the Postmortem Interval from the Metabolomic Fingerprint / Arpe, Jenny January 2024
In forensic autopsies, accurately estimating the postmortem interval (PMI) is crucial. Traditional methods, relying on physical parameters and police data, often lack precision, particularly after approximately two days have passed since the person's death. New methods are increasingly focusing on analyzing postmortem metabolomics in biological systems, acting as a 'fingerprint' of ongoing processes influenced by internal and external molecules. By carefully analyzing these metabolomic profiles, which span a diverse range of information from events preceding death to postmortem changes, there is potential to provide more accurate estimates of the PMI. The limitation of available real human data has hindered comprehensive investigation until recently. Large-scale metabolomic data collected by the National Board of Forensic Medicine (RMV, Rättsmedicinalverket) presents a unique opportunity for predictive analysis in forensic science, enabling innovative approaches for improving PMI estimation. However, the metabolomic data appears to be large, complex, and potentially nonlinear, making it difficult to interpret. This underscores the importance of effectively employing machine learning algorithms to manage metabolomic data for the purpose of PMI predictions, the primary focus of this project. In this study, a dataset consisting of 4,866 human samples and 2,304 metabolites from the RMV was utilized to train a model capable of predicting the PMI. Random Forest (RF) and Artificial Neural Network (ANN) models were then employed for PMI prediction. Furthermore, feature selection and incorporating sex and age into the model were explored to improve the neural network's performance. This master's thesis shows that ANN consistently outperforms RF in PMI estimation, achieving an R² of 0.68 and an MAE of 1.51 days compared to RF's R² of 0.43 and MAE of 2.0 days across the entire PMI range. 
Additionally, feature selection indicates that only 35% of total metabolites are necessary for comparable results with maintained predictive accuracy. Furthermore, Principal Component Analysis (PCA) reveals that these informative metabolites are primarily located within a specific cluster on the first and second principal components (PC), suggesting a need for further research into the biological context of these metabolites. In conclusion, the dataset has proven valuable for predicting PMI. This indicates significant potential for employing machine learning models in PMI estimation, thereby assisting forensic pathologists in determining the time of death. Notably, the model shows promise in surpassing current methods and filling crucial gaps in the field, representing an important step towards achieving accurate PMI estimations in forensic practice. This project suggests that machine learning will play a central role in assisting with determining time since death in the future.
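The two metrics used above to compare the ANN and RF models, the coefficient of determination (R²) and the mean absolute error (MAE), follow their standard definitions. The sketch below is illustrative only: the function names and the toy PMI values are hypothetical and do not come from the thesis or the RMV dataset.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical PMI values in days (illustrative only).
pmi_observed = [1.0, 2.0, 3.0, 4.0]
pmi_predicted = [1.5, 2.0, 2.5, 4.0]

print("MAE (days):", mean_absolute_error(pmi_observed, pmi_predicted))
print("R2:", round(r_squared(pmi_observed, pmi_predicted), 3))
```

MAE is expressed in the same unit as the target (days here), which is why the abstract can report it directly as an error in days, while R² is unitless.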
|
6 |
Data mining and predictive analytics application on cellular networks to monitor and optimize quality of service and customer experience / Muwawa, Jean Nestor Dahj 11 1900
This research study focuses on applying Data Mining and Machine Learning models to cellular network traffic, with the objective of giving Mobile Network Operators a full view of the performance branches (Services, Device, Subscribers). The purpose is to optimize and minimize the time needed to detect service and subscriber behaviour patterns. Different data mining techniques and predictive algorithms have been applied to real cellular network datasets to uncover different data usage patterns using specific Key Performance Indicators (KPIs) and Key Quality Indicators (KQIs). The following tools are used to develop the concept: RStudio for Machine Learning and process visualization; Apache Spark and SparkSQL for data and big data processing; and clicData for service visualization. Two use cases have been studied during this research. In the first, data and predictive analytics are applied in the field of Telecommunications to address users’ experience efficiently, with the goal of increasing customer loyalty and decreasing churn, or customer attrition. Using real cellular network transactions, predictive analytics is used to identify customers who are likely to churn, which can result in revenue loss. Prediction algorithms and models including Classification Tree, Random Forest, Neural Networks and Gradient Boosting have been used together with an exploratory data analysis that determines the relationships between predictor variables. The data is split into two sets: a training set to train the model and a testing set to test the model. The best performing model is selected based on prediction accuracy, sensitivity, specificity and the confusion matrix on the test set. The second use case analyses Service Quality Management (SQM) using modern data mining techniques and the advantages of in-memory big data processing with Apache Spark and SparkSQL to save cost on tool investment; thus, a low-cost Service Quality Management model is proposed and analyzed. With the increase in smartphone adoption and access to mobile internet services, applications such as streaming and interactive chat require a certain service level to ensure customer satisfaction. As a result, an SQM framework is developed with a Service Quality Index (SQI) and Key Performance Index (KPI). The research concludes with recommendations and future studies around modern technology applications in Telecommunications, including Internet of Things (IoT), Cloud and recommender systems. / Cellular networks have evolved and are still evolving, from traditional circuit-switched GSM (Global System for Mobile Communication), which only supported voice services and extremely low data rates, to all-packet LTE networks accommodating high-speed data used for various service applications such as video streaming, video conferencing and heavy torrent downloads; in the near future, the roll-out of Fifth Generation (5G) cellular networks is intended to support complex technologies such as IoT (Internet of Things) and High Definition video streaming, and is projected to carry massive amounts of data. With high demand for network services and easy access to mobile phones, billions of transactions are performed by subscribers. The transactions appear in the form of SMSs, handovers, voice calls, web browsing activities, video and audio streaming, and heavy downloads and uploads. Nevertheless, the rapid growth in data traffic and the high requirements of new services introduce bigger challenges to Mobile Network Operators (MNOs) in analysing the big data traffic flowing in the network. Therefore, Quality of Service (QoS) and Quality of Experience (QoE) turn into a challenge. Inefficiency in mining and analysing data and in applying predictive intelligence to network traffic can produce a high rate of unhappy customers or subscribers, loss of revenue and a negative perception of services. Researchers and Service Providers are investing in Data Mining, Machine Learning and AI (Artificial Intelligence) methods to manage services and experience. This research study focuses on applying Data Mining and Machine Learning models to network traffic, with the objective of giving Mobile Network Operators a full view of the performance branches (Services, Device, Subscribers). The purpose is to optimize and minimize the time needed to detect service and subscriber behaviour patterns. Different data mining techniques and predictive algorithms will be applied to cellular network datasets to uncover different data usage patterns using specific Key Performance Indicators (KPIs) and Key Quality Indicators (KQIs). The following tools will be used to develop the concept: RStudio for Machine Learning; Apache Spark and SparkSQL for data processing; and clicData for visualization. / Electrical and Mining Engineering / M. Tech (Electrical Engineering)
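The evaluation criteria named for the churn use case (accuracy, sensitivity, specificity and the confusion matrix on the test set) can be sketched as below. The study itself worked in RStudio; this Python version is a swapped-in illustration, and the function names, the label coding (1 = churned) and the toy labels are all hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def churn_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (true positive rate) and specificity (true negative rate)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical test-set labels: 1 = customer churned, 0 = customer stayed.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 1, 0]

print(confusion_counts(actual, predicted))
print(churn_metrics(actual, predicted))
```

Because churners are usually a minority class, sensitivity and specificity are reported alongside accuracy: a model that predicts "no churn" for everyone can score high accuracy while having zero sensitivity.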
|