1. Data mining for relationships in large datasets. Tao, F., January 2003.
No description available.

2. Dynamic scaling for three-dimensional information visualisation. Taylor, Ian, January 2000.
No description available.

3. Clustering for Classification. Evans, Reuben James Emmanuel, January 2007.
Advances in technology have provided industry with an array of devices for collecting data, and the frequency and scale of collection mean that many large datasets are now being generated. To find patterns in these datasets it would be useful to apply modern classification methods such as support vector machines; unfortunately, these methods are computationally expensive (quadratic in the number of data points), so they cannot be applied directly. This thesis proposes a framework whereby a variety of clustering methods can be used to summarise datasets, that is, to reduce them to a smaller but still representative dataset so that these advanced methods can be applied. It compares the results of using this framework against random selection on a large number of classification and regression problems. Results show that the clustered datasets are on average fifty percent smaller than the originals without loss of classification accuracy, which is significantly better than random selection. They also show that there is no free lunch: for each dataset it is important to choose the clustering method carefully.
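The framework's core loop is easy to sketch. Below is a minimal, illustrative approximation (not the thesis code): k-means summarises each class, an SVM is trained on the cluster centres, and a random subsample of the same size serves as the baseline. The synthetic data, summary sizes, and scikit-learn tooling are all assumptions for demonstration.

```python
# Hedged sketch: cluster-based dataset summarisation before SVM training,
# compared against a random subsample of the same size.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Cluster each class separately so every summary point keeps a clean label.
n_per_class = 250  # summary size per class; a tunable assumption
centers, labels = [], []
for cls in np.unique(y_train):
    km = KMeans(n_clusters=n_per_class, n_init=10, random_state=0)
    km.fit(X_train[y_train == cls])
    centers.append(km.cluster_centers_)
    labels.append(np.full(n_per_class, cls))
X_sum, y_sum = np.vstack(centers), np.concatenate(labels)

# Baseline: a random subsample of the same total size.
rng = np.random.default_rng(0)
idx = rng.choice(len(X_train), size=len(X_sum), replace=False)

for name, (Xs, ys) in {"clustered": (X_sum, y_sum),
                       "random": (X_train[idx], y_train[idx])}.items():
    acc = SVC().fit(Xs, ys).score(X_test, y_test)
    print(f"{name} summary ({len(Xs)} points): accuracy {acc:.3f}")
```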

4. Mining Truth Tables and Straddling Biclusters in Binary Datasets. Owens, Clifford Conley, 07 January 2010.
As the world swims deeper into a deluge of data, binary datasets relating objects to properties can be found in many different fields. Such datasets abound in practically any area of interest, including biology, politics, entertainment, and education. This explosion calls for the definition of new types of patterns in binary data, as well as algorithms to efficiently find these patterns.
In this work, we introduce truth tables as a new class of patterns to be mined in binary datasets. Truth tables represent a subset of properties which exhibit maximal variability (and hence, suggest independence) in occurrence patterns over the underlying objects. Unlike other measures of independence, truth tables possess anti-monotone features that can be exploited in order to mine them effectively. We present a level-wise algorithm that takes advantage of these features, showing results on real and synthetic data. These results demonstrate the scalability of our algorithm.
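The anti-monotone property can be made concrete with a hedged sketch: if a set of columns realizes all 2^k value combinations, every subset of it does too, so a level-wise search in the Apriori style only needs to build level-(k+1) candidates from level-k survivors. The predicate and the candidate generation below are simplifications assumed for illustration, not the thesis algorithm.

```python
# Hedged sketch of a level-wise (Apriori-style) truth table search.
from itertools import combinations
import numpy as np

def is_truth_table(data, cols):
    """True iff the rows, projected onto `cols`, realize all 2^k combinations."""
    patterns = {tuple(row) for row in data[:, list(cols)]}
    return len(patterns) == 2 ** len(cols)

def mine_truth_tables(data, max_k=3):
    n_cols = data.shape[1]
    # Level 1: single columns that take both values 0 and 1.
    level = [(c,) for c in range(n_cols) if is_truth_table(data, (c,))]
    found = list(level)
    for k in range(2, max_k + 1):
        # Anti-monotonicity: only survivors of level k-1 spawn candidates.
        candidates = {tuple(sorted(set(a) | set(b)))
                      for a, b in combinations(level, 2)
                      if len(set(a) | set(b)) == k}
        level = [c for c in candidates if is_truth_table(data, c)]
        found.extend(level)
    return found

rng = np.random.default_rng(1)
data = rng.integers(0, 2, size=(200, 8))  # synthetic binary dataset
print(mine_truth_tables(data))
```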
We also introduce new methods of mining straddling biclusters. Biclusters relate subsets of objects to subsets of properties they share within a single dataset; straddling biclusters extend this by relating a subset of objects to the subsets of properties they share across two datasets. We present two level-wise algorithms, named UnionMiner and TwoMiner, which discover straddling biclusters efficiently by treating multiple datasets as a single dataset. We show results on real and synthetic data, explore the advantages and limitations of each algorithm, and develop guidelines suggesting which algorithm is likely to perform better based on features of the datasets.
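A rough sketch of the "treat multiple datasets as a single dataset" idea follows; the brute-force enumeration, the column-suffix convention, and the support threshold are illustrative assumptions, not the UnionMiner or TwoMiner procedures.

```python
# Hedged sketch: join two object-by-property binary tables on their shared
# objects, then keep property sets that draw from both tables ("straddle")
# and are shared by enough objects.
import pandas as pd
from itertools import combinations

def straddling_biclusters(df_a, df_b, min_objects=2, max_k=3):
    a, b = df_a.add_suffix("@A"), df_b.add_suffix("@B")
    joined = a.join(b, how="inner")  # align the two tables on shared objects
    found = []
    for k in range(2, max_k + 1):
        for props in combinations(joined.columns, k):
            # A straddling bicluster must use properties from both tables.
            if not (any(p.endswith("@A") for p in props)
                    and any(p.endswith("@B") for p in props)):
                continue
            support = joined.index[joined[list(props)].all(axis=1)]
            if len(support) >= min_objects:
                found.append((sorted(support), list(props)))
    return found

objs = ["o1", "o2", "o3"]
df_a = pd.DataFrame({"p1": [1, 1, 0], "p2": [1, 1, 1]}, index=objs)
df_b = pd.DataFrame({"q1": [1, 1, 0], "q2": [0, 1, 1]}, index=objs)
print(straddling_biclusters(df_a, df_b))
```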

5. Scalable and Efficient Outlier Detection in Large Distributed Data Sets with Mixed-Type Attributes. Koufakou, Anna, 01 January 2009.
An important problem that often appears when analyzing data involves identifying irregular or abnormal data points called outliers. This problem broadly arises under two scenarios: when outliers are to be removed from the data before analysis, and when useful information or knowledge can be extracted from the outliers themselves. Outlier detection in the second scenario is a research field that has attracted significant attention across a broad range of applications. For example, in credit card transaction data, outliers might indicate potential fraud; in network traffic data, outliers might represent potential intrusion attempts. The decision of whether a data point is an outlier is often based on some measure or notion of dissimilarity between the point under consideration and the rest. Traditional outlier detection methods assume numerical or ordinal data and compute pair-wise distances between data points. However, the notion of distance or similarity for categorical data is more difficult to define. Moreover, the size of currently available data sets dictates the need for fast and scalable outlier detection methods, precluding expensive pair-wise computations. Additionally, these methods must be applicable to data which might be distributed among different locations. In this work, we propose novel strategies to efficiently deal with large distributed data containing mixed-type attributes. Specifically, we first propose a fast and scalable algorithm for categorical data (AVF), and its parallel version based on MapReduce (MR-AVF). We extend AVF and introduce a fast outlier detection algorithm for large distributed data with mixed-type attributes (ODMAD). Finally, we modify ODMAD in order to deal with very high-dimensional categorical data. Experiments with large real-world and synthetic data show that the proposed methods exhibit large performance gains and high scalability compared to the state-of-the-art, while achieving similar detection accuracy.
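The AVF idea in particular admits a very small sketch: score each point by the average frequency of its attribute values and flag the lowest scorers, which needs only counting passes rather than pair-wise distances. The code below is a simplified reading of that scheme, not the published implementation.

```python
# Hedged sketch of Attribute Value Frequency (AVF) scoring for categorical
# data: one pass to count value frequencies, one pass to score each point
# by the average frequency of its values; the rarest combinations score
# lowest and are flagged as outliers.
import pandas as pd

def avf_scores(df: pd.DataFrame) -> pd.Series:
    freqs = {col: df[col].value_counts() for col in df.columns}
    score = sum(df[col].map(freqs[col]) for col in df.columns)
    return score / len(df.columns)

data = pd.DataFrame({
    "color": ["red", "red", "red", "blue"],
    "shape": ["box", "box", "box", "tube"],
})
scores = avf_scores(data)
print(scores.nsmallest(1))  # the (blue, tube) row scores lowest
```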

6. Incremental Learning with Large Datasets. Giritharan, Balathasan, 05 1900.
This dissertation presents a novel learning strategy based on geometric support vector machines to address the difficulties of processing immense data sets. Support vector machines find the hyper-plane that maximizes the margin between two classes; because the decision boundary is represented by only a few training samples, they are a favorable choice for incremental learning. The dissertation presents a novel method, Geometric Incremental Support Vector Machines (GISVM), to address both efficiency and accuracy in handling massive data sets. In GISVM, the skin of a convex hull is defined, and an efficient method is designed to find the best skin approximation given the available examples: the set of extreme points is found by recursively searching along the direction defined by a pair of known extreme points. By identifying the skin of the convex hulls, incremental learning employs a much smaller number of samples with comparable or even better accuracy. When additional samples are provided, they are used together with the skin of the convex hull constructed from the previous dataset, so that only a small number of instances enter each incremental step of the training process. Experimental results with synthetic data sets, public benchmark data sets from UCI, and endoscopy videos show that GISVM achieves satisfactory classifiers that closely model the underlying data distribution. GISVM improves sensitivity in the incremental steps, significantly reduces the demand for memory space, and demonstrates the ability to recover from temporary performance degradation.
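A loose sketch of the skin idea is given below. It approximates each class's extreme points by projecting onto a bag of random directions and keeping the farthest points, rather than the recursive pair-defined search described above, so it should be read as an assumption-laden illustration, not GISVM itself.

```python
# Hedged sketch: approximate the "skin" (extreme points) of each class's
# convex hull via random projections, then retrain the SVM on that small
# skin plus the newly arrived batch instead of on all past data.
import numpy as np
from sklearn.svm import SVC

def skin_indices(X, n_directions=50, seed=0):
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_directions, X.shape[1]))
    proj = X @ dirs.T  # shape: (n_points, n_directions)
    # Farthest point along each direction (both ways) is an extreme point.
    return np.union1d(proj.argmax(axis=0), proj.argmin(axis=0))

rng = np.random.default_rng(0)
X_old = rng.normal(size=(10000, 5)); y_old = (X_old[:, 0] > 0).astype(int)
X_new = rng.normal(size=(500, 5));   y_new = (X_new[:, 0] > 0).astype(int)

# Keep only the skin of each class from the old data.
keep = np.concatenate([np.flatnonzero(y_old == c)[skin_indices(X_old[y_old == c])]
                       for c in (0, 1)])
X_inc = np.vstack([X_old[keep], X_new])
y_inc = np.concatenate([y_old[keep], y_new])
clf = SVC(kernel="linear").fit(X_inc, y_inc)
print(f"trained on {len(X_inc)} of {len(X_old) + len(X_new)} samples")
```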

7. Generic work behaviors: the components of non-job-specific performance. Hunt, Steven Thomas, January 1994.
No description available.

8. A Combined Approach to Handle Multi-class Imbalanced Data and to Adapt Concept Drifts using Machine Learning. Tumati, Saini, 05 October 2021.
No description available.

9. HopsWorks: A project-based access control model for Hadoop. Moré, Andre; Gebremeskel, Ermias, January 2015.
The growth in global data-gathering capacity is producing a vast amount of data at an ever-increasing rate. Properly analyzed, this data can represent a great opportunity for businesses, but processing it is a resource-intensive task. Sharing data can increase efficiency through reuse, but legal and ethical questions arise when data is shared. The purpose of this thesis is to gain an in-depth understanding of the different access control methods that can be used to facilitate sharing, and to choose one to implement on a platform that lets users analyze, share, and collaborate on datasets. The resulting platform uses project-based access control at the API level and fine-grained role-based access control on the file system, giving the data owner full control over the shared data.
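The model can be illustrated with a minimal sketch; every name and the role-to-action policy below are hypothetical stand-ins, not the HopsWorks API.

```python
# Hedged sketch of project-based access control: every dataset belongs to
# a project, and an API call is allowed only if the caller holds a role in
# that project whose fine-grained permissions cover the requested action.
from dataclasses import dataclass, field

ROLE_ACTIONS = {  # illustrative policy, not the HopsWorks role model
    "owner":  {"read", "write", "share"},
    "member": {"read", "write"},
    "guest":  {"read"},
}

@dataclass
class Project:
    name: str
    members: dict = field(default_factory=dict)  # user -> role
    datasets: set = field(default_factory=set)

def authorize(project: Project, user: str, dataset: str, action: str) -> bool:
    if dataset not in project.datasets:
        return False  # dataset is not owned by this project
    role = project.members.get(user)
    return role is not None and action in ROLE_ACTIONS[role]

proj = Project("genomics", {"alice": "owner", "bob": "guest"}, {"reads.csv"})
print(authorize(proj, "bob", "reads.csv", "read"))   # True
print(authorize(proj, "bob", "reads.csv", "write"))  # False
```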

10. Automatic Generation of Benchmarks for Evaluating Keyword and Natural Language Interfaces to RDF Datasets. Neves Junior, Angelo Batista, 04 November 2022.
Text search systems provide users with a friendly alternative to access Resource Description Framework (RDF) datasets. The performance evaluation of such systems requires adequate benchmarks, consisting of RDF datasets, text queries, and respective expected answers. However, available benchmarks often have small sets of queries and incomplete sets of answers, mainly because they are manually constructed with the help of experts. The central contribution of this thesis is a method for building benchmarks automatically, with larger sets of queries and more complete answers. The proposed method works for both keyword and natural language queries and has two steps: query generation and answer generation. The query generation step selects a set of relevant entities, called inducers, and, for each one, heuristics guide the process of extracting related queries. The answer generation step takes the queries and computes solution generators (SGs), subgraphs of the original dataset containing different answers to the queries. Heuristics also guide the construction of the SGs, avoiding the waste of computational resources in generating irrelevant answers.
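A heavily simplified sketch of the query-generation step might look as follows, using rdflib: entity labels stand in for the inducer-selection and query-extraction heuristics, and a one-hop neighborhood stands in for a solution generator. The file path and all heuristics here are assumptions, not the thesis implementation.

```python
# Hedged sketch: derive keyword queries from entity labels in an RDF graph
# and pair each query with the entity's one-hop subgraph as a seed answer.
from rdflib import Graph, RDFS

g = Graph()
g.parse("dataset.ttl")  # hypothetical path to any RDF dataset

benchmark = []
for entity, _, label in g.triples((None, RDFS.label, None)):
    keywords = str(label).lower().split()                 # naive query extraction
    neighborhood = list(g.triples((entity, None, None)))  # seed "solution generator"
    if keywords and neighborhood:
        benchmark.append({"query": " ".join(keywords),
                          "solution_generator": neighborhood})

print(f"generated {len(benchmark)} query/answer pairs")
```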