41

Mining frequent highly-correlated item-pairs at very low support levels

Sandler, Ian 20 December 2011 (has links)
The ability to extract frequent pairs from a set of transactions is one of the fundamental building blocks of data mining. When the number of items in a given transaction is relatively small, the problem is trivial. Even when dealing with millions of transactions, it is still trivial if the number of unique items in the transaction set is small. The problem becomes much more challenging when we deal with millions of transactions, each containing hundreds of items drawn from a set of millions of potential items, especially when we are looking for highly correlated results at extremely low support levels. For 25 years, the Direct Hashing and Pruning Park-Chen-Yu (PCY) algorithm has been the principal technique used when there are billions of potential pairs that need to be counted. In this paper we propose a new approach that takes full advantage of both multi-core and multi-CPU availability, works in cases where PCY fails, and offers excellent performance scaling even when the number of processors, unique items and items per transaction are at their highest. We believe that our approach has much broader applicability in the field of co-occurrence counting and can be used to generate much more interesting results when mining very large data sets. / Graduate
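To illustrate the kind of two-pass, hash-based pruning the PCY algorithm performs, the following sketch counts frequent pairs with a bucket filter. It is a minimal illustration of the classic Park-Chen-Yu idea, not the multi-core approach proposed in the thesis; the transaction data are invented.

```python
from collections import Counter
from itertools import combinations

def pcy_frequent_pairs(transactions, min_support, num_buckets=1_000_003):
    """Minimal PCY-style sketch: prune candidate pairs with hashed bucket counts."""
    # Pass 1: count single items and hash every pair into a bucket counter.
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for t in transactions:
        items = sorted(set(t))
        item_counts.update(items)
        for a, b in combinations(items, 2):
            bucket_counts[hash((a, b)) % num_buckets] += 1

    frequent_items = {i for i, c in item_counts.items() if c >= min_support}

    # Pass 2: count only pairs whose items are frequent and whose bucket is frequent.
    pair_counts = Counter()
    for t in transactions:
        items = sorted(set(t) & frequent_items)
        for a, b in combinations(items, 2):
            if bucket_counts[hash((a, b)) % num_buckets] >= min_support:
                pair_counts[(a, b)] += 1

    return {p: c for p, c in pair_counts.items() if c >= min_support}

# Toy example
txns = [["milk", "bread", "eggs"], ["milk", "bread"], ["bread", "eggs"], ["milk", "eggs"]]
print(pcy_frequent_pairs(txns, min_support=2))
```

The bucket counts from the first pass let the second pass skip most infrequent pairs without keeping a counter per candidate pair, which is what makes this style of counting viable when billions of pairs are possible.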
42

GEOGRAPHIC DATA MINING AND GEOVISUALIZATION FOR UNDERSTANDING ENVIRONMENTAL AND PUBLIC HEALTH DATA

Adu-Prah, Samuel 01 May 2013 (has links)
Within the theoretical framework of this study, it is recognized that very large amounts of real-world facts and geospatial data are collected and stored. Decision makers cannot consider all the available disparate raw facts and data. Problem-specific variables, including complex geographic identifiers, have to be selected from these data and validated. The problems associated with environmental and public-health data are that (1) geospatial components of the data are not considered in the analysis and decision-making process, (2) meaningful geospatial patterns and clusters are often overlooked, and (3) public health practitioners find it difficult to comprehend geospatial data. Inspired by the advent of geographic data mining and geovisualization in public and environmental health, the goal of this study is to unveil the spatiotemporal dynamics of the prevalence of overweight and obesity in United States youths at regional and local levels over a twelve-year study period. The specific objectives of this dissertation are to (1) apply regionalization algorithms effective for the identification of meaningful clusters that are spatially uniform with respect to youth overweight and obesity, (2) use Geographic Information Systems (GIS), spatial analysis techniques, and statistical methods to explore the data sets for health outcomes, and (3) explore geovisualization techniques to transform discovered patterns in the data sets for recognition, flexible interaction, and improved interpretation. To achieve the goal and the specific objectives of this dissertation, we used data sets from the National Longitudinal Survey of Youth 1997 (NLSY'97) early release (1997-2004), the NLSY'97 current release (2005-2008), Census 2000 data and yearly population estimates from 2001 to 2008, and synthetic data sets. The NLSY'97 cohort ranged from 6,923 to 8,565 individuals over the period. At the beginning of the cohort study, participants were between 12 and 17 years old, and in 2008 they were between 24 and 28 years old. As the data mining tool, we applied the Regionalization with Dynamically Constrained Agglomerative clustering and Partitioning (REDCAP) algorithms to identify hierarchical regions based on weight metrics of U.S. youths. The applied algorithms are single linkage clustering (SLK), average linkage clustering (ALK), complete linkage clustering (CLK), and Ward's method. Moreover, we used GIS, spatial analysis techniques, and statistical methods to analyze the spatially varying association of overweight and obesity prevalence in the youth cohort and to geographically visualize the results. The methods used included the ordinary least squares (OLS) model, the spatial generalized linear mixed model (GLMM), Kulldorff's space-time scan statistic, and spatial interpolation techniques (inverse distance weighting). The three main findings of this study are as follows. First, among the four algorithms, ALK, Ward and CLK identified regions more effectively than SLK, which performed very poorly. ALK provided more promising regions than the rest of the algorithms by producing spatial uniformity effectively related to the weight variable (body mass index). The ALK regionalization algorithm provided new insights about overweight and obesity by detecting new spatial clusters with over 30% prevalence.
New meaningful clusters were detected in 15 counties, including Yazoo, Holmes, Lincoln, and Attala in Mississippi; Wise, Delta, Hunt, Liberty, and Hardin in Texas; St. Charles, St. James, and Calcasieu in Louisiana; and Choctaw, Sumter, and Tuscaloosa in Alabama. Demographically, these counties have a racial/ethnic composition of about 75% White, 11.6% Black, and 13.4% other. Second, results from this study indicated an upward trend in the prevalence of overweight and obesity in United States youths, both male and female. Male youth obesity increased from 10.3% (95% CI = 9.0, 11.0) in 1999 to 27.0% (95% CI = 26.0, 28.0) in 2008. Likewise, female obesity increased from 9.6% (95% CI = 8.0, 11.0) in 1999 to 28.9% (95% CI = 27.0, 30.0) over the same period. Youth obesity prevalence was higher among females than among males. Aging is a substantial factor with a highly statistically significant association (p < 0.001) with the prevalence of overweight and obesity. Third, significant cluster years for high rates were detected in 2003-2008 (relative risk 1.92, 3.4 annual prevalence cases per 100,000, p < 0.0001) and for low rates in 1997-2002 (relative risk 0.39, annual prevalence cases per 100,000, p < 0.0001). Three meaningful spatiotemporal clusters of obesity (p < 0.0001) were detected in counties located within the South, Lower Northeastern, and North Central regions. Counties identified as consistently experiencing high prevalence of obesity, and with the potential of becoming obesogenic environments in the future, are Copiah, Holmes, and Hinds in Mississippi; Harris and Chambers in Texas; Oklahoma and McClain in Oklahoma; Jefferson in Louisiana; and Chicot and Jefferson in Arkansas. Surprisingly, there were mixed trends in youth obesity prevalence patterns in rural and urban areas. Finally, from a public health perspective, this research has shown that in-depth knowledge of whether, and in what respects, certain areas have worse health outcomes can be helpful in designing effective community interventions to promote healthy living. Furthermore, specific information obtained from this dissertation can help guide geographically targeted programs, policies, and preventive initiatives for overweight and obesity in the United States.
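One of the spatial analysis techniques named in the abstract, inverse distance weighting, can be sketched in a few lines. The coordinates and prevalence values below are made up for illustration only; the dissertation's actual analysis uses county-level NLSY'97 data.

```python
import numpy as np

def idw_interpolate(known_xy, known_vals, query_xy, power=2.0):
    """Minimal inverse distance weighting (IDW) sketch.

    known_xy: (n, 2) coordinates of observed values
    known_vals: (n,) observed values
    query_xy: (m, 2) locations to estimate
    """
    known_xy = np.asarray(known_xy, dtype=float)
    known_vals = np.asarray(known_vals, dtype=float)
    query_xy = np.asarray(query_xy, dtype=float)

    # Pairwise distances between query points and known points.
    d = np.linalg.norm(query_xy[:, None, :] - known_xy[None, :, :], axis=2)
    d = np.where(d == 0, 1e-12, d)      # avoid division by zero at exact matches
    w = 1.0 / d ** power                # closer observations get larger weights
    return (w @ known_vals) / w.sum(axis=1)

# Toy example: interpolate prevalence at two unobserved locations.
obs_xy = [(0, 0), (1, 0), (0, 1), (1, 1)]
obs_prev = [22.0, 27.0, 25.0, 31.0]
print(idw_interpolate(obs_xy, obs_prev, [(0.5, 0.5), (0.9, 0.1)]))
```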
43

Efficient Processing of Skyline Queries on Static Data Sources, Data Streams and Incomplete Datasets

January 2014 (has links)
abstract: Skyline queries extract interesting points that are non-dominated and help paint the bigger picture of the data in question. They are valuable in many multi-criteria decision applications and are becoming a staple of decision support systems. An assumption commonly made by many skyline algorithms is that a skyline query is applied to a single static data source or data stream. Unfortunately, this assumption does not hold in many applications in which a skyline query may involve attributes belonging to multiple data sources and requires a join operation to be performed before the skyline can be produced. Recently, various skyline-join algorithms have been proposed to address this problem in the context of static data sources. However, these algorithms suffer from several drawbacks: they often need to scan the data sources exhaustively to obtain the skyline-join results; moreover, the pruning techniques employed to eliminate tuples are largely based on expensive tuple-to-tuple comparisons. On the other hand, most data stream techniques focus on single stream skyline queries, thus rendering them unsuitable for skyline-join queries. Another assumption typically made by most of the earlier skyline algorithms is that the data is complete and all skyline attribute values are available. Due to this constraint, these algorithms cannot be applied to incomplete data sources in which some of the attribute values are missing and are represented by NULL values. There exists a definition of dominance for incomplete data, but this leads to undesirable consequences such as non-transitive and cyclic dominance relations both of which are detrimental to skyline processing. Based on the aforementioned observations, the main goal of the research described in this dissertation is the design and development of a framework of skyline operators that effectively handles three distinct types of skyline queries: 1) skyline-join queries on static data sources, 2) skyline-window-join queries over data streams, and 3) strata-skyline queries on incomplete datasets. This dissertation presents the unique challenges posed by these skyline queries and addresses the shortcomings of current skyline techniques by proposing efficient methods to tackle the added overhead in processing skyline queries on static data sources, data streams, and incomplete datasets. / Dissertation/Thesis / Doctoral Dissertation Computer Science 2014
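The dominance relation at the heart of skyline queries can be made concrete with a short sketch. This is the naive block-nested-loop formulation on a single complete dataset; the skyline-join, streaming, and incomplete-data operators developed in the dissertation go well beyond it.

```python
def dominates(p, q):
    """p dominates q if p is at least as good in every attribute and strictly
    better in at least one (here, smaller values are better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Naive block-nested-loop skyline: keep only non-dominated points."""
    result = []
    for p in points:
        if any(dominates(q, p) for q in points if q is not p):
            continue
        result.append(p)
    return result

# Example: (price, distance) pairs for hotels; lower is better in both.
hotels = [(120, 3.0), (90, 5.5), (200, 0.8), (95, 6.0), (150, 1.0), (160, 1.5)]
print(skyline(hotels))   # (95, 6.0) and (160, 1.5) are dominated and dropped
```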
44

An Approach for Evaluating the Quality of Linked Datasets for Domain-Specific Applications

SARINHO, Walter Travassos 07 August 2014 (has links)
The growth of the Web of Data enables a range of new applications that can make use of multiple Linked Datasets (data sources published according to the Linked Data principles). The large number of data sources available on the Web of Data, together with the lack of information about the provenance and quality of these data, raises a major challenge: how can we identify the best Linked Datasets for a given application? One possible solution is to use Information Quality (IQ) criteria when evaluating Linked Datasets, taking into account the application's specific requirements. In this scenario, this dissertation proposes an approach, named QualityStamp, for evaluating the quality of Linked Datasets for domain-specific applications. The proposed approach uses five quality criteria (availability, response time, queue delay, completeness and interlinking) whose objective is to evaluate three characteristics of Linked Datasets: (i) performance, (ii) the capacity to answer a set of queries, and (iii) the degree of interlinking from one dataset to another. The quality evaluation is guided by the application requirements, which are represented by a set of queries and by non-functional requirements corresponding to the quality criteria most suitable for the application. Thus, at evaluation time, a domain expert chooses the quality criteria that best fit the application. As a result of the evaluation, a global quality measure is produced, whose objective is to provide a ranking of the candidate Linked Datasets. Throughout the work, the approach is presented and the experiments carried out to evaluate it are described. / CAPES
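The abstract does not give the aggregation formula, so the following is only a hypothetical illustration of how the five named criteria could be combined into a global measure that ranks candidate Linked Datasets; the scores and weights are invented.

```python
def global_quality(scores, weights):
    """Hypothetical weighted mean of normalized quality scores in [0, 1].
    Illustrative only; QualityStamp defines its own evaluation procedure."""
    assert set(scores) == set(weights)
    total_w = sum(weights.values())
    return sum(scores[c] * weights[c] for c in scores) / total_w

# Criteria from the abstract; the scores and weights below are made up.
candidate_a = {"availability": 0.98, "response_time": 0.75, "queue_delay": 0.80,
               "completeness": 0.60, "interlinking": 0.40}
candidate_b = {"availability": 0.90, "response_time": 0.85, "queue_delay": 0.70,
               "completeness": 0.80, "interlinking": 0.55}
weights = {"availability": 3, "response_time": 2, "queue_delay": 1,
           "completeness": 3, "interlinking": 2}

ranking = sorted({"A": candidate_a, "B": candidate_b}.items(),
                 key=lambda kv: global_quality(kv[1], weights), reverse=True)
print([(name, round(global_quality(s, weights), 3)) for name, s in ranking])
```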
45

Image Representation using Attribute-Graphs

Prabhu, Nikita January 2016 (has links) (PDF)
In a digital world of Flickr, Picasa and Google Images, developing a semantic image representation has become a vital problem. Image processing and computer vision researchers have, to date, used several different representations for images. These vary from low-level features such as SIFT, HOG, GIST etc. to high-level concepts such as objects and people. When asked to describe an object or a scene, people usually resort to mid-level features such as size, appearance, feel, use, behaviour etc. Such descriptions are commonly referred to as the attributes of the object or scene. These human-understandable, machine-detectable attributes have recently become a popular feature category for image representation for various vision tasks. In addition to image and object characteristics, object interactions, background/context information, and the actions taking place in the scene form an important part of an image description. It is therefore essential to develop an image representation which can effectively describe various image components and their interactions. Towards this end, we propose a novel image representation, termed Attribute-Graph. An Attribute-Graph is an undirected graph, incorporating both local and global image characteristics. The graph nodes characterise objects as well as the overall scene context using mid-level semantic attributes, while the edges capture the object topology and the actions being performed. We demonstrate the effectiveness of Attribute-Graphs by applying them to the problem of image ranking. Since an image retrieval system should rank images in a way which is compatible with visual similarity as perceived by humans, it is intuitive that we work in a human-understandable feature space. Most content-based image retrieval algorithms treat images as a set of low-level features or try to define them in terms of the associated text. Such a representation fails to capture the semantics of the image. This, more often than not, results in retrieved images which are semantically dissimilar to the query. Ranking using the proposed attribute-graph representation alleviates this problem. We benchmark the performance of our ranking algorithm on the rPascal and rImageNet datasets, which we have created in order to evaluate the ranking performance on complex queries containing multiple objects. Our experimental evaluation shows that modelling images as Attribute-Graphs results in improved ranking performance over existing techniques.
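As a rough illustration of the structure described above, the sketch below models an Attribute-Graph as nodes carrying mid-level attribute scores (including a global scene node) and edges carrying relation labels. The objects, attributes and relation names are invented; the thesis derives them from object detectors and attribute classifiers.

```python
from dataclasses import dataclass, field

@dataclass
class AttributeGraph:
    """Toy sketch of an Attribute-Graph: an undirected graph whose nodes hold
    mid-level attribute scores and whose edges hold relation labels."""
    nodes: dict = field(default_factory=dict)   # name -> {attribute: score}
    edges: dict = field(default_factory=dict)   # frozenset({a, b}) -> relation

    def add_node(self, name, attributes):
        self.nodes[name] = dict(attributes)

    def add_edge(self, a, b, relation):
        self.edges[frozenset((a, b))] = relation

g = AttributeGraph()
g.add_node("scene", {"outdoor": 0.9, "daytime": 0.8})        # global context node
g.add_node("person", {"standing": 0.7, "wearing_hat": 0.4})
g.add_node("horse", {"brown": 0.8, "large": 0.6})
g.add_edge("person", "horse", "riding")
print(len(g.nodes), "nodes,", len(g.edges), "edges")
```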
46

Adversarial Machine Learning: A Comparative Study on Contemporary Intrusion Detection Datasets

Pacheco Monasterios, Yulexis D. January 2020 (has links)
No description available.
47

Microexpression Spotting in Video Using Optical Strain

Godavarthy, Sridhar 01 July 2010 (has links)
Microexpression detection plays a vital role in applications such as lie detection and psychological consultations. Current research is progressing in the direction of automating microexpression recognition by aiming at classifying the microexpressions in terms of FACS Action Units. Although high detection rates are being achieved, the datasets used for evaluation of these systems are highly restricted. They are limited in size (usually still pictures or extremely short videos), motion constrained, contain only a single microexpression, and do not contain negative cases where microexpressions are absent. Only a few of these systems run in real time and even fewer have been tested on real-life videos. This work proposes a novel method for automated spotting of facial microexpressions as a preprocessing step to existing microexpression recognition systems. By identifying and rejecting sequences that do not contain microexpressions, longer sequences can be converted into shorter, constrained, relevant sequences comprising only single microexpressions, which can then be passed as input to existing systems, improving their performance and efficiency. This method utilizes the small temporal extent of microexpressions for their identification. The extent is determined by the period during which strain, due to the non-rigid motion caused by facial movement, is exerted on the facial skin. The subject's face is divided into sub-regions, and facial strain is calculated for each of these regions. The strain patterns in individual regions are used to identify subtle changes which facilitate the detection of microexpressions. The strain magnitude is calculated using the central difference method over the robust and dense optical flow field of each subject's face. The computed strain is then thresholded using a variable threshold. If the duration for which the strain is above the threshold corresponds to the duration of a microexpression, a detection is reported. The datasets used for algorithm evaluation comprise a mix of natural and enacted microexpressions. The results were promising, with up to an 80% true detection rate. The increased number of false positive spots in the Canal 9 dataset can be attributed to the subjects talking, which causes fine movements in the mouth region. Performing speech detection to identify sequences where the subject is talking, and excluding the mouth region during those periods, could help reduce the number of false positives.
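The core computation described above, strain derived from the optical flow field using central differences and then thresholded, can be sketched as follows. The flow field here is synthetic and the threshold rule is only an assumption standing in for the thesis's variable threshold.

```python
import numpy as np

def strain_magnitude(u, v):
    """Optical strain magnitude from a dense flow field (u, v), one value per pixel.
    Spatial derivatives use central differences (np.gradient)."""
    du_dy, du_dx = np.gradient(u)      # derivatives of horizontal flow
    dv_dy, dv_dx = np.gradient(v)      # derivatives of vertical flow
    e_xx = du_dx
    e_yy = dv_dy
    e_xy = 0.5 * (du_dy + dv_dx)       # shear component of the strain tensor
    return np.sqrt(e_xx**2 + e_yy**2 + 2.0 * e_xy**2)

# Synthetic flow field standing in for the face region of one frame pair.
h, w = 64, 64
y, x = np.mgrid[0:h, 0:w]
u = 0.01 * (x - w / 2)                 # slight horizontal expansion
v = np.zeros_like(u)

s = strain_magnitude(u, v)
threshold = s.mean() + 2 * s.std()     # assumed stand-in for the variable threshold
print("pixels above threshold:", int((s > threshold).sum()))
```

In the thesis pipeline this per-pixel magnitude would be aggregated per facial sub-region over time, and a spot is reported only when the above-threshold duration matches the expected duration of a microexpression.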
48

Parallel Algorithm for Reduction of Data Processing Time in Big Data

Silva, Jesús, Hernández Palma, Hugo, Niebles Núñez, William, Ovallos-Gazabon, David, Varela, Noel 07 January 2020 (has links)
Technological advances have made it possible to collect and store large volumes of data over the years. It is also important that today's applications perform well and can analyze these large datasets effectively. It remains a challenge for data mining to keep its algorithms and applications efficient as data size and dimensionality increase [1]. To achieve this goal, many applications rely on parallelism, which reduces the cost associated with the execution time of the algorithms by taking advantage of the characteristics of current computer architectures to run several processes concurrently [2]. This paper proposes a parallel version of the FuzzyPred algorithm based on the amount of data that can be processed within each of the processing threads, synchronously and independently.
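A minimal sketch of the data-partitioning idea, splitting the dataset into chunks that worker processes handle independently before the partial results are merged, is shown below. The per-chunk work is a stand-in; the actual FuzzyPred step evaluates fuzzy predicates over each record.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import islice

def process_chunk(chunk):
    """Stand-in for per-worker work; here we just count records matching a rule."""
    return sum(1 for record in chunk if record % 7 == 0)

def chunks(iterable, size):
    """Yield successive fixed-size lists from any iterable."""
    it = iter(iterable)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block

if __name__ == "__main__":
    data = range(1_000_000)                      # made-up dataset
    with ProcessPoolExecutor(max_workers=4) as pool:
        total = sum(pool.map(process_chunk, chunks(data, 100_000)))  # merge partials
    print("matches:", total)
```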
49

Analysis of Data from a Smart Home Research Environment

Guthenberg, Patrik January 2022 (has links)
This thesis project presents a system for gathering and using data in the context of a smart home research environment. The system was developed at the Human Health and Activity Laboratory, H2Al, at Luleå University of Technology and consists of two distinct parts. First, a data export application that runs in the H2Al environment. This application synchronizes data from various sensor systems and forwards the data for further analysis. This analysis was performed in the iMotions platform in order to visualize, record and export data. As a delimitation, the only sensor used was the WideFind positional system installed at the H2Al. Secondly, an activity recognition application that uses data generated from the iMotions platform and the data export application. This includes several scripts which transform raw data into labeled datasets and translate them into activity recognition models with the help of machine learning algorithms. As a delimitation, activity recognition was limited to fall detection. These fall detection models were then hosted on a basic server to test accuracy and to act as an example use case for the rest of the project. The project resulted in an effective data gathering system and was generally successful as a tool to create datasets. The iMotions platform was especially successful in both visualizing and recording data together with the data export application. The example fall detection models trained showed theoretical promise, but failed to deliver good results in practice, partly due to the limitations of the positional sensor system used. Some of the conclusions drawn at the end of the project were that the data collection process needed more structure, planning and input from professionals, that a better positional sensor system may be required for better fall detection results, but also that this kind of system shows promise in the context of smart homes, especially within areas like elderly healthcare.
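As an illustration of the final modelling step, the sketch below turns labeled windows of positional data into simple features and fits an off-the-shelf classifier. The features, synthetic data and scikit-learn model are assumptions for demonstration, not the thesis pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def window_features(window):
    """Simple summary features over a window of (x, y, z) positions from a beacon."""
    return np.concatenate([window.mean(axis=0), window.std(axis=0),
                           [window[:, 2].min()]])          # minimum height

# Synthetic windows: "falls" end at low height with high variance.
normal = [rng.normal([2, 2, 1.2], 0.05, size=(50, 3)) for _ in range(200)]
falls = [np.vstack([rng.normal([2, 2, 1.2], 0.05, size=(25, 3)),
                    rng.normal([2, 2, 0.2], 0.30, size=(25, 3))]) for _ in range(200)]

X = np.array([window_features(w) for w in normal + falls])
y = np.array([0] * len(normal) + [1] * len(falls))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```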
50

Monocular Depth Estimation: Datasets, Methods, and Applications

Bauer, Zuria 15 September 2021 (has links)
The World Health Organization (WHO) stated in February 2021 at the Seventy-Third World Health Assembly that, globally, at least 2.2 billion people have a near or distance vision impairment. It also noted the severe impact vision impairment has on the quality of life of the individual suffering from this condition, how it affects their social well-being and economic independence in society, in some cases becoming an additional burden for people in their immediate surroundings as well. In order to minimize the cost and intrusiveness of assistive applications and maximize the autonomy of the individual, the natural solution is to use systems that rely on computer vision algorithms. Systems improving the quality of life of the visually impaired need to solve different problems such as localization, path recognition, obstacle detection, environment description, navigation, etc. Each of these topics involves an additional set of problems that have to be solved to address it. For example, the task of object detection requires depth prediction to know the distance to the object, path recognition to know if the user is on the road or on a pedestrian path, an alarm system to provide notifications of danger to the user, and trajectory prediction for approaching obstacles, and those are only the main key points. Taking a closer look at all of these topics, they have one key component in common: depth estimation/prediction. All of these topics need a correct estimation of the depth in the scenario. In this thesis, our main focus is on addressing depth estimation in indoor and outdoor environments. Traditional depth estimation methods, like structure from motion and stereo matching, are built on feature correspondences from multiple viewpoints. Despite the effectiveness of these approaches, they need a specific type of data to perform properly. Since our main goal is to provide systems with minimal cost and intrusiveness that are also easy to handle, we decided to infer the depth from single images: monocular depth estimation. Estimating the depth of a scene from a single image is a simple task for humans, but it is notoriously more difficult for computational models to achieve high accuracy with low resource requirements. Monocular Depth Estimation is this very task of estimating depth from a single RGB image. Since only one image is needed, this approach is used in applications such as autonomous driving, scene understanding or 3D modeling, where other types of information are not available. This thesis presents contributions towards solving this task using deep learning as the main tool. The four main contributions of this thesis are: first, we carry out an extensive review of the state of the art in monocular depth estimation; secondly, we introduce a novel large-scale, high-resolution outdoor stereo dataset able to provide enough image information to solve various common computer vision problems; thirdly, we show a set of architectures able to predict monocular depth effectively; and lastly, we propose two real-life applications of those architectures, addressing the topic of enhancing the perception of the visually impaired using low-cost wearable sensors.
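To make the task concrete, the sketch below shows the typical image-in, dense-depth-out structure of a monocular depth network. It is a toy encoder-decoder for illustration only and bears no relation to the specific architectures proposed in the thesis.

```python
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    """Minimal encoder-decoder for monocular depth: one RGB image in,
    one depth value per pixel out. Illustrative only."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),  # dense depth map
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = TinyDepthNet()
rgb = torch.randn(1, 3, 128, 256)         # a single RGB image (batch of one)
depth = model(rgb)
print(depth.shape)                        # torch.Size([1, 1, 128, 256])
```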
