Statistical decisions in optimising grain yield

Norng, Sorn January 2004 (has links)
This thesis concerns Precision Agriculture (PA) technology, which involves methods developed to optimise grain yield by examining data quality and modelling the protein/yield relationship of wheat and sorghum fields in central and southern Queensland. An important part of developing strategies to optimise grain yield is an understanding of PA technology. This covers the major aspects of PA, including all the components of a Site-Specific Crop Management System (SSCM): 1. spatial referencing, 2. crop, soil and climate monitoring, 3. attribute mapping, 4. decision support systems and 5. differential action. Understanding how all five components fit into PA significantly aids the development of data analysis methods. The development of PA depends on the collection, analysis and interpretation of information. A preliminary data analysis step is described which covers both non-spatial and spatial methods. The non-spatial analysis involves plotting methods (maps, histograms) and standard distribution and statistical summaries (mean, standard deviation). The spatial analysis covers both undirected and directional variogram analyses. In addition to the data analysis, a theoretical investigation into GPS error is given. GPS plays a major role in the development of PA. A number of sources of error affect the GPS and therefore affect the positioning measurements. An understanding of the distribution of these errors, and of how they are related to each other over time, is therefore needed to complement the understanding of the nature of the data, and gives useful insights for model assumptions regarding position measurement errors. A review of filtering methods is given and new methods are developed, namely strip analysis and a double harvesting algorithm. These methods are designed specifically for controlled traffic and normal traffic respectively, but can be applied to all kinds of yield monitoring data. The data resulting from the strip analysis and the double harvesting algorithm are used to investigate the relationship between on-the-go yield and protein. The strategy is to use protein and yield to inform decisions about nitrogen management. The agronomic assumption, based on plot trials, is that protein and yield have a significant relationship. We investigate whether there is any significant relationship between protein and yield at the local level to warrant this assumption. Understanding PA technology, and being aware of the sources of error in data collection and data analysis, are both very important steps in developing management decision strategies.
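A minimal sketch of the undirected empirical variogram used in the preliminary spatial analysis above, assuming yield-monitor points with coordinates and a yield value; the lag width and bin count are illustrative, not the thesis's settings:

```python
import numpy as np

def empirical_variogram(x, y, z, lag_width=10.0, n_lags=15):
    """Undirected empirical variogram: half the mean squared difference
    of yield values z, binned by pairwise distance between points."""
    z = np.asarray(z, dtype=float)
    coords = np.column_stack([x, y])
    d = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1))
    g = 0.5 * (z[:, None] - z[None, :]) ** 2
    iu = np.triu_indices(len(z), k=1)            # count each pair once
    d, g = d[iu], g[iu]
    lags, semivariances = [], []
    for i in range(n_lags):
        in_bin = (d >= i * lag_width) & (d < (i + 1) * lag_width)
        if in_bin.any():
            lags.append(d[in_bin].mean())
            semivariances.append(g[in_bin].mean())
    return np.array(lags), np.array(semivariances)
```

A directional variogram would additionally bin pairs by the bearing between points, which is how the directional analysis differs from the undirected one.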

Amélioration de la qualité des données : correction sémantique des anomalies inter-colonnes / Improved data quality : correction of semantic inter-column anomalies

Zaidi, Houda 01 February 2017 (has links)
Data quality represents a major challenge because the cost of anomalies can be very high, especially for large databases in enterprises that need to exchange information between systems and integrate large amounts of data. Decision making based on erroneous data has a harmful influence on the activities of organisations, and the quantity of data continues to increase, as do the risks of anomalies. The automatic correction of these anomalies is a topic of growing importance both in business and in the academic world. In this thesis, we propose an approach to better understand the semantics and the structure of the data, and to help the user define the actions to be carried out on it. Our approach automatically corrects both intra-column anomalies and inter-column anomalies, the latter relating to functional dependencies between columns. We aim to improve the quality of data by processing null values and the semantic dependencies between columns, while paying attention to the performance of the resulting processing.
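One concrete instance of the inter-column anomalies mentioned above is a violated functional dependency. A minimal sketch, with hypothetical column names, of flagging the rows that break a candidate dependency X → Y:

```python
import pandas as pd

def fd_violations(df: pd.DataFrame, lhs: str, rhs: str) -> pd.DataFrame:
    """Rows violating the functional dependency lhs -> rhs, i.e. groups
    with the same lhs value mapped to more than one rhs value."""
    counts = df.groupby(lhs)[rhs].nunique()
    bad_keys = counts[counts > 1].index
    return df[df[lhs].isin(bad_keys)]

# Hypothetical example: zip_code should determine city.
df = pd.DataFrame({"zip_code": ["75001", "75001", "69002"],
                   "city": ["Paris", "Lyon", "Lyon"]})
print(fd_violations(df, "zip_code", "city"))  # the two 75001 rows
```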

Ensemble Stream Model for Data-Cleaning in Sensor Networks

Iyer, Vasanth 16 October 2013 (has links)
Ensemble stream modeling and data-cleaning are sensor information processing systems with different training and testing methods, whose goals are cross-validated against each other. This research examines a mechanism that seeks to extract novel patterns by generating ensembles from data. The main goal of label-less stream processing is to process the sensed events so as to eliminate uncorrelated noise and choose the most likely model without overfitting, thus obtaining higher model confidence. Higher-quality streams can be realized by combining many short streams into an ensemble that has the desired quality. The framework for the investigation is an existing data mining tool. First, to accommodate feature extraction for an event such as a bush or natural forest fire, we take the burnt area (BA*), the sensed ground truth obtained from logs, as our target variable. Even though this is an obvious model choice, the results are disappointing, for two reasons: first, the histogram of fire activity is highly skewed; second, the measured sensor parameters are highly correlated. Since using non-descriptive features does not yield good results, we resort to temporal features. By doing so we carefully eliminate averaging effects; the resulting histogram is more satisfactory, and conceptual knowledge is learned from the sensor streams. Second is the process of feature induction, cross-validating attributes against single or multiple target variables to minimize training error. We use the F-measure score, which combines precision and recall, to determine the false alarm rate of fire events. The multi-target data-cleaning trees use the information purity of the target leaf nodes to learn higher-order features. A sensitive variance measure such as the F-test is performed during each node's split to select the best attribute. The ensemble stream model approach proved to improve when complicated features were combined with a simpler tree classifier. The ensemble framework for data-cleaning, and the enhancements to quantify the quality of fitness (30% spatial, 10% temporal, and 90% mobility reduction) of the sensors, led to the formation of streams for sensor-enabled applications. This further motivates the novelty of stream quality labeling and its importance in handling the vast amounts of real-time mobile streams generated today.
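For reference, the F-measure scoring mentioned above combines precision and recall; a minimal sketch with hypothetical confusion counts for fire-event detection:

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-score: weighted harmonic mean of precision and recall
    (the standard F1 when beta == 1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts: 40 true detections, 10 false alarms, 5 misses.
print(f_measure(tp=40, fp=10, fn=5))  # ~0.84
```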

Travel time estimation in congested urban networks using point detectors data

Mahmoud, Anas Mohammad 02 May 2009 (has links)
A model for estimating travel time on short arterial links of congested urban networks, using currently available technology, is introduced in this thesis. The objective is to estimate travel time with an acceptable level of accuracy for real-life traffic problems such as congestion management and emergency evacuation. To achieve this objective, various travel time estimation methods, including highway trajectories, multiple linear regression (MLR), artificial neural networks (ANN) and K-nearest neighbor (K-NN), were applied and tested on the same dataset. The results demonstrate that the ANN and K-NN methods outperform the linear methods by a significant margin and show particularly good performance in detecting congested intervals. To ensure the quality of the analysis, a set of procedures and algorithms based on traffic flow theory and test-field information was introduced to validate and clean the data used to build, train and test the different models.
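A minimal sketch of the K-NN estimator in this setting, assuming point-detector features such as volume, occupancy and speed; the feature set and k are illustrative assumptions:

```python
import numpy as np

def knn_travel_time(X_train, t_train, x_query, k=5):
    """Estimate link travel time as the mean observed travel time of
    the k historical detector records nearest to the query (Euclidean)."""
    X_train = np.asarray(X_train, dtype=float)
    t_train = np.asarray(t_train, dtype=float)
    dist = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    nearest = np.argsort(dist)[:k]
    return t_train[nearest].mean()

# Hypothetical records: [volume (veh/h), occupancy, speed (km/h)] -> time (s)
X = [[320, 0.12, 45.0], [510, 0.35, 22.0], [480, 0.30, 25.0]]
t = [60.0, 140.0, 130.0]
print(knn_travel_time(X, t, [500, 0.33, 23.0], k=2))  # 135.0
```

In practice the features would be standardized before computing distances, so that no single detector measurement dominates.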

Data-based Explanations of Random Forest using Machine Unlearning

Tanmay Laxman Surve (17537112) 03 December 2023 (has links)
<p dir="ltr">Tree-based machine learning models, such as decision trees and random forests, are one of the most widely used machine learning models primarily because of their predictive power in supervised learning tasks and ease of interpretation. Despite their popularity and power, these models have been found to produce unexpected or discriminatory behavior. Given their overwhelming success for most tasks, it is of interest to identify root causes of the unexpected and discriminatory behavior of tree-based models. However, there has not been much work on understanding and debugging tree-based classifiers in the context of fairness. We introduce FairDebugger, a system that utilizes recent advances in machine unlearning research to determine training data subsets responsible for model unfairness. Given a tree-based model learned on a training dataset, FairDebugger identifies the top-k training data subsets responsible for model unfairness, or bias, by measuring the change in model parameters when parts of the underlying training data are removed. We describe the architecture of FairDebugger and walk through real-world use cases to demonstrate how FairDebugger detects these patterns and their explanations.</p>

工商及服務業普查資料品質之研究 / Data quality research of industry and commerce census

邱詠翔 Unknown Date (has links)
Data quality affects the quality of decisions and the outcomes of the actions based on them, so it has received growing attention in recent years. This study draws on two databases: an industrial innovation survey database and the Industry and Commerce Census database for ROC year 95 (2006). Data quality is a critical issue for any database: databases often contain erroneous records, and erroneous records bias analysis results, so data cleaning and consolidation are necessary preprocessing steps before any analysis. From the population and sample data distributions we find that, before cleaning and consolidation, the average number of employees is 92.08 in the innovation survey and 135.54 in the census. After cleaning and consolidation, we compare the correlation, similarity and distance of the employee counts in the two databases. The results show that the two databases are highly consistent: the average numbers of employees are 39.01 (innovation survey) and 42.12 (census), much closer to the population average of 7.05, which also demonstrates the importance of data cleaning. The method used in this study is post-stratified sampling, and the main objective is to use the industrial innovation survey sample to assess the accuracy of the ROC year 95 census population data. The innovation survey sample overestimates both the number of employees and the operating revenue of the census population. We conjecture that the overestimation arises because the sampling frame of the innovation survey is the directory of the top five thousand enterprises published by the China Credit Information Service, whereas the census frame covers general enterprises. We therefore validate against the census records corresponding to the innovation survey sample and find that the two samples are highly consistent.
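A minimal sketch of the post-stratified estimator used in the study, with hypothetical strata and counts:

```python
import pandas as pd

def post_stratified_mean(sample, pop_counts, stratum_col, value_col):
    """Weight each stratum's sample mean by its known population share."""
    N = sum(pop_counts.values())
    stratum_means = sample.groupby(stratum_col)[value_col].mean()
    return sum(pop_counts[s] / N * stratum_means[s] for s in pop_counts)

# Hypothetical strata: industry A (9,000 firms), industry B (1,000 firms).
sample = pd.DataFrame({"industry": ["A", "A", "B"],
                       "employees": [12.0, 18.0, 240.0]})
print(post_stratified_mean(sample, {"A": 9000, "B": 1000},
                           "industry", "employees"))  # 37.5
```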

Predikce hodnot v čase / Prediction of Values on a Time Line

Maršová, Eliška January 2016 (has links)
This work deals with the prediction of numerical series, which is applicable to the prediction of stock prices. It explains procedures for analysing and working with price charts, and also explains machine learning methods. This knowledge is used to build a program that finds patterns in numerical series for prediction.
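The abstract does not give the method's details; as a hedged illustration of pattern-finding in a numerical series, a nearest-window predictor that returns the value that followed the most similar historical window:

```python
import numpy as np

def window_predict(series, window=5):
    """Find the historical window most similar to the latest one and
    return the value that followed it (a simple pattern-matching baseline)."""
    s = np.asarray(series, dtype=float)
    latest = s[-window:]
    best_dist, best_next = float("inf"), s[-1]
    for i in range(len(s) - 2 * window):
        dist = np.linalg.norm(s[i:i + window] - latest)
        if dist < best_dist:
            best_dist, best_next = dist, s[i + window]
    return best_next

prices = [10, 11, 12, 11, 10, 11, 12, 13, 12, 11, 10, 11, 12, 11]
print(window_predict(prices))  # 12.0
```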

Analýza dat síťové komunikace mobilních zařízení / Analysis of Mobile Devices Network Communication Data

Abraham, Lukáš January 2020 (has links)
The thesis first describes the DNS and SSL/TLS protocols, focusing on communication between devices using these protocols. It then covers data preprocessing and data cleaning, followed by basic data mining techniques such as classification, association rules, information retrieval, regression analysis and cluster analysis. The next chapter describes how to identify mobile devices on a network. We evaluate datasets containing data collected from communication over the above-mentioned protocols, which are used in the practical part. Finally, we present the design of a system for analyzing network communication data, describe the libraries used and the implementation of the entire system, and perform a large number of experiments, which we evaluate at the end.
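As a hedged, much-simplified illustration of device identification from network communication (the domain lists and labels here are invented for the example, not taken from the thesis):

```python
from collections import Counter

# Hypothetical indicator domains per device class (illustration only).
INDICATORS = {
    "android": {"connectivitycheck.gstatic.com", "play.googleapis.com"},
    "ios": {"captive.apple.com", "gsp-ssl.ls.apple.com"},
}

def classify_device(dns_queries):
    """Vote for the device class whose indicator domains occur most
    often in the observed DNS query log."""
    votes = Counter()
    for domain in dns_queries:
        for label, indicators in INDICATORS.items():
            if domain in indicators:
                votes[label] += 1
    return votes.most_common(1)[0][0] if votes else "unknown"

queries = ["captive.apple.com", "example.org", "gsp-ssl.ls.apple.com"]
print(classify_device(queries))  # ios
```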

Dolování z dat v prostředí informačního systému K2 / Data Mining in K2 Information System

Figura, Petr Unknown Date (has links)
This project originated with the company K2 atmitec Brno s.r.o. The result is a data mining module for the K2 information system environment. The engineered module implements association analysis over data from the K2 information system data warehouse. The analyzed data contain information about sales recorded in the K2 information system. The module implements market basket analysis.
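A minimal sketch of the market basket analysis the module performs, computing support and confidence for pairwise association rules over hypothetical transactions:

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support=0.3):
    """Association rules {a} -> {b} over item pairs, with support
    (pair frequency) and confidence (pair freq. / antecedent freq.)."""
    n = len(transactions)
    item_count = Counter(i for t in transactions for i in set(t))
    pair_count = Counter(p for t in transactions
                         for p in combinations(sorted(set(t)), 2))
    rules = []
    for (a, b), c in pair_count.items():
        if c / n >= min_support:
            rules.append((a, b, c / n, c / item_count[a]))
            rules.append((b, a, c / n, c / item_count[b]))
    return rules

baskets = [["bread", "milk"], ["bread", "milk", "beer"], ["milk"], ["bread"]]
for lhs, rhs, sup, conf in pair_rules(baskets):
    print(f"{lhs} -> {rhs}: support={sup:.2f}, confidence={conf:.2f}")
```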

GIS-based Episode Reconstruction Using GPS Data for Activity Analysis and Route Choice Modeling / GIS-based Episode Reconstruction Using GPS Data

Dalumpines, Ron 26 September 2014 (has links)
Most transportation problems arise from individual travel decisions. In response, transportation researchers have been studying individual travel behavior – a growing trend that requires activity data at the individual level. Global positioning systems (GPS) and geographical information systems (GIS) have been used to capture and process individual activity data, from determining activity locations to mapping routes to these locations. The potential applications of GPS data seem limitless, but our tools and methods for making these data usable lag behind. In response to this need, this dissertation presents a GIS-based toolkit to automatically extract activity episodes from GPS data and derive information related to these episodes from additional data (e.g., road network, land use). The major emphasis of this dissertation is the development of a toolkit for extracting information associated with the movements of individuals from GPS data. To be effective, the toolkit has been developed around three design principles: transferability, modularity, and scalability. Two substantive chapters focus on selected components of the toolkit (map-matching, mode detection); another covers the entire toolkit. The final substantive chapter demonstrates the toolkit's potential by comparing route choice models of work and shop trips using inputs generated by the toolkit. There are several tools and methods that capitalize on GPS data, developed within different problem domains. This dissertation contributes to that repository of tools and methods by presenting a suite of tools that can extract all the information that can be derived from GPS data. Unlike existing tools cited in the transportation literature, the toolkit has been designed to be complete (covering preprocessing up to extracting route attributes) and can work with GPS data alone or in combination with additional data. Moreover, this dissertation contributes to our understanding of route choice decisions for work and shop trips by looking into the combined effects of route attributes and individual characteristics. / Dissertation / Doctor of Philosophy (PhD)
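A hedged sketch of one toolkit component described above, mode detection for a GPS episode from its speed profile; the thresholds are illustrative assumptions, not the dissertation's calibrated values:

```python
def detect_mode(speeds_kmh, walk_max=7.0, bike_max=20.0):
    """Classify an episode's travel mode from an upper-percentile speed,
    which is more robust to stops than the mean (thresholds illustrative)."""
    s = sorted(speeds_kmh)
    p85 = s[int(0.85 * (len(s) - 1))]   # 85th-percentile speed
    if p85 <= walk_max:
        return "walk"
    if p85 <= bike_max:
        return "bike"
    return "car"

episode = [3.1, 4.0, 55.2, 48.7, 60.1, 5.5]  # hypothetical km/h samples
print(detect_mode(episode))  # car
```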
