151

Social Data Mining for Crime Intelligence: Contributions to Social Data Quality Assessment and Prediction Methods

Isah, Haruna January 2017 (has links)
With the advancement of the Internet and related technologies, many traditional crimes have made the leap to digital environments. The successes of data mining in a wide variety of disciplines have given birth to crime analysis. Traditional crime analysis is mainly focused on understanding crime patterns; however, it is unsuitable for identifying and monitoring emerging crimes. The true nature of crime remains buried in unstructured content that represents the hidden story behind the data. User feedback leaves valuable traces that can be utilised to measure the quality of various aspects of products or services, and can also be used to detect, infer, or predict crimes. As in any application of data mining, the data must meet a high quality standard in order to avoid erroneous conclusions. This thesis presents a methodology and practical experiments towards discovering whether (i) user feedback can be harnessed and processed for crime intelligence, (ii) criminal associations, structures, and roles can be inferred among entities involved in a crime, and (iii) methods and standards can be developed for measuring, predicting, and comparing the quality level of social data instances and samples. It contributes to the theory, design, and development of a novel framework for crime intelligence and an algorithm for the estimation of social data quality, innovatively adapted from methods for monitoring water contaminants. Several experiments were conducted, and the results revealed the significance of this study in mining social data for crime intelligence and in developing social data quality filters and decision support systems. / Commonwealth Scholarship Commission.
152

Evaluation of Machine Learning techniques for Master Data Management

Toçi, Fatime January 2023 (has links)
In organisations, duplicate customer master data are a recurring problem. Duplicate records can result in errors, complications, and inefficiency, since they frequently arise from disparate systems or inadequate data integration. Because changing client information over time further complicates the problem, prompt detection and correction are essential. Beyond improving data quality, eliminating duplicate records also improves business processes, boosts customer confidence, and supports well-informed decision making. This master's thesis explores the application of machine learning to the field of Master Data Management. The main objective of the project is to assess how machine learning may improve the accuracy and consistency of master data records. The project aims to support the improvement of data quality within enterprises by managing issues such as duplicate customer data. One research question is whether machine learning can be used to improve the accuracy of customer data; another is whether it can be used to investigate scientific models for customer analysis when cleaning data with machine learning. The study's process comprises four steps: dimension identification, appropriate algorithm selection, appropriate parameter value selection, and output analysis. As ground truth for the project, we concluded that 22,000, the number of unique customers, is the correct number of clusters for our clustering algorithms. On this basis, the best-performing algorithm, judged by the number of clusters and the silhouette score metric, turned out to be KMeans with 22,000 clusters and a silhouette score of 0.596, followed by BIRCH with 22,000 clusters and a silhouette score of 0.591.
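For readers wanting to reproduce the comparison in spirit, a minimal sketch of scoring KMeans against BIRCH with silhouette values follows; the synthetic feature matrix and small cluster count are stand-ins, not the thesis's actual customer data or pipeline.

```python
# A minimal sketch (not the thesis's actual pipeline) of comparing
# KMeans and BIRCH by silhouette score, as the abstract describes.
from sklearn.cluster import KMeans, Birch
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Stand-in for the customer master data feature matrix; the thesis
# used ~22,000 clusters for ~22,000 unique customers.
X, _ = make_blobs(n_samples=2000, centers=50, random_state=0)
n_clusters = 50  # would be 22_000 in the thesis's setting

for model in (KMeans(n_clusters=n_clusters, n_init=10, random_state=0),
              Birch(n_clusters=n_clusters)):
    labels = model.fit_predict(X)
    # sample_size keeps the pairwise silhouette computation tractable
    score = silhouette_score(X, labels, sample_size=1000, random_state=0)
    print(type(model).__name__, round(score, 3))
```

A higher silhouette score indicates tighter, better-separated clusters, which is the basis on which the abstract ranks KMeans (0.596) above BIRCH (0.591).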
153

Data Quality Assessment Methodology for Improved Prognostics Modeling

Chen, Yan 19 April 2012 (has links)
No description available.
154

Detecting Satisficing in Online Surveys

Salifu, Shani 18 April 2012 (has links)
No description available.
155

ANALYSIS OF BEDROCK EROSIONAL FEATURES IN ONTARIO AND OHIO: IMPROVING UNDERSTANDING OF SUBGLACIAL EROSIONAL PROCESSES

Puckering, Stacey L. 10 1900 (has links)
Extensive assemblages of glacial erosional features are commonly observed on bedrock outcrops in deglaciated landscapes. There is considerable debate regarding the origins of many subglacial erosional landforms, owing to a relative paucity of detailed data on these features and a need for improved understanding of the subglacial processes that may form them. This study presents detailed documentation and maps of assemblages of glacial erosional features from select field sites throughout the Great Lakes basins. The characteristics and spatial distribution of p-forms exposed on variable substrates at the Whitefish Falls, Vineland, Pelee Island and Kelleys Island field sites were investigated in order to determine the mode of p-form origin and to identify significant spatial and temporal variability in the subglacial processes operating at these locations. Observations from this work suggest that p-forms evolve through multiple phases of erosion, whereby glacial ice initially abrades the bedrock surface, leaving behind streamlined bedrock highs, striations and glacial grooves. Subsequent erosion by vortices in turbulent subglacial meltwater sculpts the flanks of bedrock highs and grooves into p-forms. These forms are subjected to a second phase of subglacial abrasion that ornaments the sinuous, sharp-rimmed features with linear striae. The presence of multi-directional ('chaotic') striae at some sites suggests that erosion by saturated till may contribute to, but is not essential for, p-form development.

Investigation in the Halton Hills region of Ontario focused on modeling bedrock topography in order to delineate the extent and geometry of buried bedrock valleys thought to host potentially significant municipal aquifer units. Various approaches to subsurface modeling were investigated in the Halton Hills region using a combination of primary data (collected from boreholes and outcrop), intermediate data collected through aerial photography and consultant reports, and extensively screened low-quality data from the Ontario Waterwell Database. A new, 'quality-weighted' approach to modeling variable-quality data was explored but proved ineffective for the purposes of this study, as differential weighting of high- and low-quality data either over-smoothed the model or significantly altered data values. A series of models were interpolated and compared using RMSE values derived from model cross-validation. The preferred bedrock topography model of the Halton Hills region had the lowest RMSE score and allowed identification of three major buried bedrock valley systems (the Georgetown, Acton and 16 Mile Creek buried valleys), which contain up to 40–50 m of Quaternary infill. These valleys were likely carved through a combination of fluvial and glacial erosion during the late Quaternary period, and their orientation may be influenced by pre-existing structural weaknesses in the bedrock. Future work on subglacial erosional landforms should focus on the temporal scale over which subglacial processes operate, through association with other subglacial landforms and dating methods. / Master of Science (MSc)
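The model-comparison step the abstract describes can be illustrated with a leave-one-out cross-validation loop; the sketch below is an assumed workflow using generic scipy interpolators and synthetic borehole picks, not the thesis's actual code or data.

```python
# A minimal sketch (assumed workflow) of ranking bedrock-surface
# interpolation methods by cross-validated RMSE.
import numpy as np
from scipy.interpolate import griddata

rng = np.random.default_rng(0)
# Stand-ins for borehole/outcrop picks: (x, y) locations and bedrock elevation z
xy = rng.uniform(0, 1000, size=(200, 2))
z = 50 + 0.02 * xy[:, 0] + 5 * np.sin(xy[:, 1] / 100) + rng.normal(0, 1, 200)

def loo_rmse(method):
    """Leave-one-out RMSE for a given griddata interpolation method."""
    errs = []
    for i in range(len(z)):
        mask = np.arange(len(z)) != i
        pred = griddata(xy[mask], z[mask], xy[i:i + 1], method=method)[0]
        if not np.isnan(pred):  # points outside the convex hull are skipped
            errs.append(pred - z[i])
    return np.sqrt(np.mean(np.square(errs)))

for method in ("nearest", "linear", "cubic"):
    print(method, round(loo_rmse(method), 2))
```

The model with the lowest cross-validated RMSE would be preferred, which is the selection criterion the abstract reports for the Halton Hills bedrock topography model.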
156

Data quality and governance in a UK social housing initiative: Implications for smart sustainable cities

Duvier, Caroline, Anand, Prathivadi B., Oltean-Dumbrava, Crina 03 March 2018 (has links)
Smart Sustainable Cities (SSC) consist of multiple stakeholders, who must cooperate in order for SSCs to be successful. Housing is an important challenge in many cities; key stakeholders therefore include social housing organisations. This paper introduces a qualitative case study of a social housing provider in the UK that implemented a business intelligence project (a method to assess data networks within an organisation) to increase data quality and data interoperability. Our analysis suggests that creating pathways for the different information systems within an organisation to 'talk to' each other is the first step. Issues during the project implementation included a lack of training and development, organisational reluctance to change, and the lack of a project plan. The challenges faced by the organisation during this project can be instructive for those implementing SSCs. Many SSC frameworks and models currently exist, yet most seem to neglect the localised challenges faced by the different stakeholders.
157

A Comprehensive Approach to Evaluating Usability and Hyperparameter Selection for Synthetic Data Generation

Adriana Louise Watson (19180771) 20 July 2024 (has links)
<p dir="ltr">Data is the key component of every machine-learning algorithm. Without sufficient quantities of quality data, the vast majority of machine learning algorithms fail to perform. Acquiring the data necessary to feed algorithms, however, is a universal challenge. Recently, synthetic data production methods have become increasingly relevant as a method of ad-dressing a variety of data issues. Synthetic data allows researchers to produce supplemental data from an existing dataset. Furthermore, synthetic data anonymizes data without losing functionality. To advance the field of synthetic data production, however, measuring the quality of produced synthetic data is an essential step. Although there are existing methods for evaluating synthetic data quality, the methods tend to address finite aspects of the data quality. Furthermore, synthetic data evaluation from one study to another varies immensely adding further challenge to the quality comparison process. Finally, al-though tools exist to automatically tune hyperparameters, the tools fixate on traditional machine learning applications. Thus, identifying ideal hyperparameters for individual syn-thetic data generation use cases is also an ongoing challenge.</p>
158

Aggregated production and its effect on planned power flows

Olsson, Viveka January 2024 (has links)
The increase in renewable and intermittent energy resources places growing demands on the stability and flexibility of power grids. An IGM (Individual Grid Model) uses a power grid's characteristics and scenarios to calculate predicted power flows. The accuracy of these flows is heavily dependent on the quality of the input data. One of these inputs is the set of production plans, which describe where, and in what magnitude, a power generation module will provide power to the network. Today, not every production plan sent in to the TSO (Transmission System Operator) can be used in the calculation of the IGM, which is thought to make the predicted power flows less accurate. The aim of the thesis is to analyse the structural information in production plans in relation to different data dimensions and to evaluate the impact of including additional plans in the IGM. Specific cases relating to the granularity dimension of the grid model managed by the TSO were evaluated further and used for the power flow analysis. By comparing an original IGM with one of added granularity, an updated power flow analysis can be calculated and studied. Only a day-ahead and a two-day-ahead IGM for one specific time and day were used in the analysis. The results from the power flow analysis show that the degree of improvement is influenced by the magnitude of the input data as well as by where the added power is topologically placed. An improvement in predicted power flows was observed for transmission lines placed topologically closer to where most of the power generation was added. Because the flow from the northern parts was already predicted to be much lower than in reality, the added generation in the south led to less power flowing from the north, and consequently a decrease in accuracy there. Some stations showed improved accuracy with the added production when studying the power lines connecting them to the remaining grid.
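At the evaluation stage, the IGM comparison reduces to scoring each model's predicted line flows against observed flows; the sketch below uses hypothetical flow values, not the thesis's grid data or tooling.

```python
# A minimal sketch (assumed evaluation step) of comparing two IGM
# power-flow predictions against observed transmission-line flows.
import numpy as np

# Hypothetical observed and predicted flows (MW) for four lines; the first
# two sit in the north, the last two in the south near the added generation.
observed     = np.array([310.0, 280.0, 150.0, 140.0])
igm_original = np.array([390.0, 350.0,  90.0,  80.0])
igm_granular = np.array([340.0, 300.0, 135.0, 150.0])  # with added production plans

for name, pred in [("original IGM", igm_original), ("granular IGM", igm_granular)]:
    err = pred - observed
    print(f"{name}: MAE = {np.mean(np.abs(err)):.1f} MW, "
          f"RMSE = {np.sqrt(np.mean(err ** 2)):.1f} MW")
```

Per-line errors, rather than aggregate scores alone, would reveal the topological effect the abstract reports: lines near the added generation improve while distant northern lines may worsen.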
159

Leveraging Linguistic Insights for Uncertainty Calibration of ChatGPT and Evaluating Crowdsourced Annotations

Venkata Divya Sree Pulipati (18469230) 09 July 2024 (has links)
<p dir="ltr">The quality of crowdsource annotations has always been a challenge due to the variability in annotators backgrounds, task complexity, the subjective nature of many labeling tasks, and various other reasons. Hence, it is crucial to evaluate these annotations to ensure their reliability. Traditionally, human experts evaluate the quality of crowdsourced annotations, but this approach has its own challenges. Hence, this paper proposes to leverage large language models like ChatGPT-4 to evaluate one of the existing crowdsourced MAVEN dataset and explore its potential as an alternative solution. However, due to stochastic nature of LLMs, it is important to discern when to trust and question LLM responses. To address this, we introduce a novel approach that applies Rubin's framework for identifying and using linguistic cues within LLM responses as indicators of LLMs certainty levels. Our findings reveal that ChatGPT-4 successfully identified 63% of the incorrect labels, highlighting the potential for improving data label quality through human-AI collaboration on these identified inaccuracies. This study underscores the promising role of LLMs in evaluating crowdsourced data annotations offering a way to enhance accuracy and fairness of crowdsource annotations while saving time and costs.</p><p dir="ltr"><br></p>
160

Data Exchange for Artificial Intelligence Incubation in Manufacturing Industrial Internet

Zeng, Yingyan 21 August 2024 (has links)
Industrial Cyber-physical Systems (ICPSs) connect industrial equipment and manufacturing processes via ubiquitous sensors, actuators, and computer units, forming the Manufacturing Industrial Internet (MII). With the data generated from MII, Artificial Intelligence (AI) greatly advances data-driven decision making for manufacturing efficiency, quality improvement, and cost reduction. However, data of poor quality have posed significant challenges to the incubation (i.e., training, validation, and deployment) of AI models. In the offline training phase, training data of poor quality will result in inaccurate AI models. In the online training and deployment phases, high-volume but information-poor data lead to discrepancies in AI modeling performance across phases, as well as to high communication and computation workloads and high costs in data acquisition and storage. In the incubation of AI models for multiple manufacturing stages or systems, exchanging and sharing datasets can significantly improve the efficiency of data collection for a single manufacturing enterprise and improve the quality of training datasets. However, inaccurate estimation of the value of datasets can cause ineffective dataset exchange and hamper the scaling up of AI systems. High-quality and high-value data not only enhance the modeling performance during AI incubation but also contribute to effective data exchange for potential synergistic intelligence in MII. Therefore, it is important to assess and ensure data quality in terms of its value for AI models. In this dissertation, our ultimate goal is to establish a data exchange paradigm to provide high-quality and high-value data for AI incubation in MII. To achieve this goal, three research tasks are proposed for different phases of AI incubation: (1) a prediction-oriented data generation method to actively generate highly informative data in the offline training phase for high prediction performance (Chapter 2); (2) an ensemble active learning by contextual bandits framework for acquisition and evaluation of passively collected online data for continuous improvement and resilient modeling performance during the online training and deployment phases (Chapter 3); and (3) a context-aware, performance-oriented, and privacy-preserving dataset-sharing framework to efficiently share and exchange small-but-high-quality datasets between trusted stakeholders to allow their on-demand usage (Chapter 4). All the proposed methodologies have been evaluated and validated through simulation studies and applications to real manufacturing case studies. Chapter 5 summarises the contribution of the work and proposes future research directions. / Doctor of Philosophy / With the data collected in manufacturing processes, Artificial Intelligence (AI) methods greatly improve data-driven decision making for manufacturing efficiency, product quality, and cost. However, the advancement of AI methods heavily relies on the quality and amount of available datasets.
In this dissertation, we focus on the impact of data in three stages of the development of AI models: (1) in the offline training phase (i.e., during design prototyping), limited data of poor quality will result in AI models with poor performance; (2) in the online training and deployment phases (i.e., during mass production), large-volume but poor-quality data will cause a discrepancy in AI modeling performance between the training and deployment phases, and also result in high labelling and storage costs; (3) in the scaling-up phase of AI models across multiple manufacturing stages or systems, it takes a long time and intensive effort for a single manufacturing enterprise to collect sufficient data to train advanced AI models. By exchanging datasets between manufacturers, time and cost can be saved while the quality of training datasets is improved. However, without accurate estimation of the value of datasets, the exchange will be ineffective. To address these challenges, this dissertation improves the quality and enables the exchange of data in the aforementioned three stages through: (1) a prediction-oriented data generation method to actively generate highly informative data in the offline training phase for high prediction performance (Chapter 2); (2) an ensemble active learning by contextual bandits framework for data acquisition and evaluation for continuous improvement and resilient modeling performance during the online training and deployment phases (Chapter 3); and (3) a context-aware, performance-oriented, and privacy-preserving dataset-sharing framework to efficiently share and exchange small-but-high-quality datasets to allow their on-demand usage (Chapter 4). Finally, Chapter 5 summarises the contribution of the work and proposes future research directions.
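As a rough intuition for the bandit-style acquisition in Chapter 3, the following epsilon-greedy sketch chooses among candidate acquisition criteria and updates running value estimates; it is a deliberately simplified stand-in, not the dissertation's ensemble contextual-bandit framework.

```python
# A minimal epsilon-greedy bandit sketch of active data acquisition
# (a simplified stand-in for the ensemble contextual-bandit framework).
import numpy as np

rng = np.random.default_rng(0)
n_arms, eps = 3, 0.1          # arms = candidate acquisition criteria
counts = np.zeros(n_arms)
values = np.zeros(n_arms)     # running mean reward per arm

def observe_reward(arm):
    # Hypothetical payoff of each criterion; in practice this would be the
    # model-performance improvement after labeling the chosen sample.
    return rng.normal([0.2, 0.5, 0.3][arm], 0.1)

for t in range(500):
    # Explore with probability eps, otherwise exploit the best-looking arm
    arm = int(rng.integers(n_arms)) if rng.random() < eps else int(np.argmax(values))
    r = observe_reward(arm)
    counts[arm] += 1
    values[arm] += (r - values[arm]) / counts[arm]  # incremental mean update

print("estimated arm values:", values.round(2))  # arm 1 should win (~0.5)
```

The exploration term keeps the acquirer from locking onto one criterion too early, which mirrors the resilience goal the abstract states for the online phases.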
