<p dir="ltr">The volume of big data has surged, notably in unstructured textual data such as emails and social media posts. Unstructured data currently represents over 80% of global data, and its growth is propelled by digitalization. Analyzing unstructured text is crucial for applications such as social media sentiment analysis, customer feedback interpretation, and medical records classification. The task is complex because of the variability in language use, context sensitivity, and the nuanced meanings expressed in natural language. Traditional machine learning approaches, while effective on structured data, frequently fall short on unstructured text because of these complexities, so extracting value from this data requires advanced analytics and machine learning. In response to these challenges, we developed two novel ensemble models that combine the strengths of multiple conventional machine learning classifiers through a probabilistic approach: the Consensus-Based Integration Model (CBIM) and the Unified Predictive Averaging Model (UPAM). The CBIM and UPAM ensemble models were applied to a Twitter dataset (40,000 data samples) and the National Electronic Injury Surveillance System (NEISS) dataset (323,344 data samples), addressing various challenges in unstructured text analysis. On the NEISS dataset, the models achieved an unprecedented accuracy of 99.50%, demonstrating the effectiveness of ensemble models in extracting relevant features and making accurate predictions. On the Twitter dataset, used for sentiment analysis, they delivered a significant boost in accuracy over conventional approaches, reaching a maximum of 65.83%. The results highlight the limitations of conventional machine learning approaches on complex, unstructured text data and the potential of ensemble models.
The models exhibited high accuracy across datasets and tasks, showcasing their versatility and effectiveness in obtaining valuable insights from unstructured text data. The results extend the boundaries of text analysis and advance the field of natural language processing.</p>
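<p dir="ltr">The abstract does not define how CBIM and UPAM combine classifier outputs, so the following is only a minimal sketch of two generic ensemble rules that the model names suggest: majority voting over each classifier's top class, and averaging of predicted class probabilities. It should not be read as the thesis's actual method. The function names and the example probabilities are hypothetical and chosen purely for illustration.</p>

```python
from collections import Counter


def average_probabilities(prob_rows):
    """Average the class-probability vectors produced by several classifiers."""
    n_models = len(prob_rows)
    n_classes = len(prob_rows[0])
    return [sum(row[c] for row in prob_rows) / n_models for c in range(n_classes)]


def soft_vote(prob_rows):
    """Probability averaging: pick the class with the highest mean probability."""
    avg = average_probabilities(prob_rows)
    return max(range(len(avg)), key=avg.__getitem__)


def majority_vote(prob_rows):
    """Majority voting: each classifier votes for its own most probable class."""
    votes = [max(range(len(row)), key=row.__getitem__) for row in prob_rows]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical outputs of three classifiers for one text (class 0, class 1).
# These numbers are illustrative assumptions, not results from the thesis.
example = [
    [0.55, 0.45],  # classifier A leans slightly toward class 0
    [0.55, 0.45],  # classifier B leans slightly toward class 0
    [0.05, 0.95],  # classifier C is very confident in class 1
]

print("majority vote:", majority_vote(example))
print("probability averaging:", soft_vote(example))
```

<p dir="ltr">In this made-up example the two rules disagree: two weakly confident classifiers outvote the third under majority voting, while averaging lets the one highly confident classifier dominate. Differences of this kind are one reason a probabilistic combination can behave differently from a simple vote.</p>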
Identifier | oai:union.ndltd.org:purdue.edu/oai:figshare.com:article/25669425 |
Date | 26 April 2024 |
Creators | Srushti Sandeep Vichare (17277901) |
Source Sets | Purdue University |
Detected Language | English |
Type | Text, Thesis |
Rights | CC BY 4.0 |
Relation | https://figshare.com/articles/thesis/_b_PROBABILISTIC_ENSEMBLE_MACHINE_LEARNING_APPROACHES_FOR_UNSTRUCTURED_TEXTUAL_DATA_CLASSIFICATION_b_/25669425 |