21 |
Concentric Layout, A New Scientific Data Layout For Matrix Data Set In Hadoop File System
Cheng, Lu 01 January 2010
The data generated by scientific simulations, sensors, monitors, and optical telescopes is growing at a dramatic rate. To analyze the raw data efficiently in both time and space, a data preprocessing step is needed to achieve better performance in the data analysis phase. Current research shows an increasing trend of adopting the MapReduce framework for large-scale data processing. However, the data access patterns generally applied to scientific data sets are not directly supported by the current MapReduce framework. This gap between the requirements of analytics applications and the properties of the MapReduce framework motivates us to provide support for these data access patterns in MapReduce. In our work, we studied the data access patterns in matrix files and proposed a new concentric data layout to facilitate matrix data access and analysis in the MapReduce framework. The concentric data layout maintains the dimensional property at the chunk level. In contrast to the continuous data layout adopted by default in the current Hadoop framework, the concentric data layout stores the data from the same sub-matrix in one chunk. This matches well with matrix operations such as computation. The concentric data layout preprocesses the data beforehand and optimizes the subsequent run of the MapReduce application. The experiments indicate that the concentric data layout improves overall performance: it reduces execution time by 38% when the file size is 16 GB, relieves the data overhead phenomenon, and increases the effective data retrieval rate by 32% on average.
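The benefit of keeping a sub-matrix in one chunk can be illustrated with a small sketch. This is not the thesis's concentric layout; it only compares a row-by-row (continuous) layout with a simple square-block layout, and the matrix size, block size, and function names are all assumptions chosen for illustration.

```python
def row_major_chunk(r, c, n_cols, chunk_elems):
    """Chunk index of element (r, c) when the matrix is stored row by row."""
    return (r * n_cols + c) // chunk_elems


def block_chunk(r, c, n_cols, block):
    """Chunk index of element (r, c) when each block x block sub-matrix is one chunk."""
    blocks_per_row = (n_cols + block - 1) // block
    return (r // block) * blocks_per_row + (c // block)


def chunks_touched(chunk_of, rows, cols):
    """Set of chunk indices that must be read to fetch the given rows x cols region."""
    return {chunk_of(r, c) for r in rows for c in cols}


# Hypothetical 8x8 matrix with 16 elements per chunk in both layouts.
n_cols = 8
block = 4
sub_rows, sub_cols = range(0, 4), range(0, 4)  # the top-left 4x4 sub-matrix

row_major = chunks_touched(
    lambda r, c: row_major_chunk(r, c, n_cols, block * block), sub_rows, sub_cols
)
blocked = chunks_touched(
    lambda r, c: block_chunk(r, c, n_cols, block), sub_rows, sub_cols
)

print(len(row_major), len(blocked))
```

In this toy case the row-by-row layout reads two chunks (32 elements) to retrieve 16 useful elements, while the block layout reads exactly one chunk, which is the kind of effective retrieval rate the abstract refers to.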
|
22 |
An Improved Utility Driven Approach Towards K-Anonymity Using Data Constraint Rules
Morton, Stuart Michael 14 August 2013
Indiana University-Purdue University Indianapolis (IUPUI) / As medical data continues to transition to electronic formats, opportunities arise for researchers to use this microdata to discover patterns and increase knowledge that can improve patient care. Now more than ever, it is critical to protect the identities of the patients contained in these databases. Even after removing obvious “identifier” attributes, such as social security numbers or first and last names, that clearly identify a specific person, it is possible to join “quasi-identifier” attributes from two or more publicly available databases to identify individuals.
K-anonymity is an approach that has been used to ensure that no one individual can be distinguished within a group of at least k individuals. However, the majority of the proposed approaches implementing k-anonymity have focused on improving the efficiency of the algorithms; less emphasis has been put on ensuring the “utility” of anonymized data from a researcher's perspective. We propose a new data utility measurement, called the research value (RV), which extends existing utility measurements by employing data constraint rules that are designed to improve the effectiveness of queries against the anonymized data.
To anonymize a given raw dataset, two algorithms are proposed that use predefined generalizations provided by the data content expert and their corresponding research values to assess an attribute’s data utility as it is generalizing the data to ensure k-anonymity. In addition, an automated algorithm is presented that uses clustering and the RV to anonymize the dataset. All of the proposed algorithms scale efficiently when the number of attributes in a dataset is large.
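The k-anonymity condition itself can be sketched in a few lines. The example below is only an illustration: it is not one of the proposed algorithms and does not compute the research value (RV). The records, the choice of quasi-identifiers, and the decade-wide age generalization are all hypothetical.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(rec[q] for q in quasi_identifiers) for rec in records)
    return all(count >= k for count in groups.values())


def generalize_age(rec, width=10):
    """Replace an exact age with a range of the given width (a hypothetical generalization)."""
    low = (rec["age"] // width) * width
    return {**rec, "age": f"{low}-{low + width - 1}"}


# Made-up records; "age" and "zip" are treated as quasi-identifiers.
records = [
    {"age": 34, "zip": "46202", "diagnosis": "A"},
    {"age": 37, "zip": "46202", "diagnosis": "B"},
    {"age": 41, "zip": "46202", "diagnosis": "A"},
    {"age": 45, "zip": "46202", "diagnosis": "C"},
]
qi = ["age", "zip"]

raw_ok = is_k_anonymous(records, qi, k=2)
generalized = [generalize_age(r) for r in records]
gen_ok = is_k_anonymous(generalized, qi, k=2)

print(raw_ok, gen_ok)
```

With exact ages every record is unique, so the raw data is not 2-anonymous; after generalizing ages to decades, each (age range, zip) group holds two records. The trade-off the abstract addresses is that each such generalization also makes the data less precise for research queries.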
|
23 |
Performance analysis of EM-MPM and K-means clustering in 3D ultrasound breast image segmentation
Yang, Huanyi 05 1900
Indiana University-Purdue University Indianapolis (IUPUI) / Mammographic density is an important risk factor for breast cancer; detecting and screening at an early stage could help save lives. To analyze breast density distribution, a good segmentation algorithm is needed. In this thesis, we compared two widely used segmentation algorithms, EM-MPM and K-means clustering. We applied them to twenty cases of synthetic phantom ultrasound tomography (UST) and nine cases of clinical mammogram and UST images. From the synthetic phantom segmentation comparison, we found that EM-MPM performs better than K-means clustering in segmentation accuracy, because its segmentation result fits the ground truth data very well (with superior Tanimoto Coefficient and Parenchyma Percentage). EM-MPM is able to use a Bayesian prior assumption, which takes advantage of the 3D structure and finds a better localized segmentation. EM-MPM performs significantly better for highly dense tissue scattered within low-density tissue and for volumes with low contrast between high- and low-density tissues. For the clinical mammogram, the image segmentation comparison shows again that EM-MPM outperforms K-means clustering, since it identifies the dense tissue more clearly and accurately. The superior EM-MPM results shown in this study present a promising future application to density proportion and potential cancer risk evaluation.
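As a rough illustration of how a segmentation can be scored against ground truth, the sketch below computes the Tanimoto coefficient in its common set form, |A ∩ B| / |A ∪ B|, on flattened binary masks. The abstract does not state which exact form the thesis uses, so this definition is an assumption, and the masks are made up.

```python
def tanimoto(seg, truth):
    """Tanimoto coefficient of two equal-length binary masks: |A and B| / |A or B|."""
    inter = sum(1 for s, t in zip(seg, truth) if s and t)
    union = sum(1 for s, t in zip(seg, truth) if s or t)
    # Two empty masks agree completely; treat that case as a perfect score.
    return inter / union if union else 1.0


# Hypothetical flattened masks: 1 marks dense tissue, 0 marks everything else.
truth = [0, 1, 1, 1, 0, 0]
seg_a = [0, 1, 1, 1, 0, 0]  # matches the ground truth exactly
seg_b = [0, 1, 1, 0, 1, 0]  # misses one dense voxel and adds one false positive

score_a = tanimoto(seg_a, truth)
score_b = tanimoto(seg_b, truth)

print(score_a, score_b)
```

A score of 1.0 means the segmentation and the ground truth coincide; lower values indicate missed or spurious dense regions.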
|
24 |
Spectroscopic and chemometric analysis of automotive clear coat paints by micro Fourier transform infrared spectroscopy
Osborne Jr., James D. January 2014
Indiana University-Purdue University Indianapolis (IUPUI) / Clear coats have been part of automotive field paint finishes for several decades. Originally a layer of paint with no pigment, they have evolved into a protective layer important to the appearance and longevity of the vehicle's finish. These clear coats have been studied previously using infrared spectroscopy and other spectroscopic techniques. Previous studies focused on either all the layers of an automobile finish or on chemometric analysis of clear coats using other analytical techniques. For this study, chemometric analysis was performed on preprocessed spectra averaged from five separate samples. Samples were analyzed on a Thermo-Nicolet Nexus 670 connected to a Continuμm™ FT-IR microscope. Two unsupervised chemometric techniques, Agglomerative Hierarchical Clustering (AHC) and Principal Component Analysis (PCA), were used to evaluate the data set. Discriminant analysis, a supervised technique, was evaluated using several known qualifiers; these included cluster group from AHC, make, model, and year. Although discriminant analysis confirmed the AHC and PCA results, no correlation to make, model, or year was indicated.
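To give a sense of what agglomerative hierarchical clustering does with spectra, here is a minimal sketch. It assumes single linkage and Euclidean distance, neither of which is stated in the abstract, and the two-value "spectra" are invented numbers rather than real FT-IR data.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative(points, n_clusters):
    """Single-linkage AHC: repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per sample
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: cluster distance is the closest pair of members.
                d = min(
                    euclidean(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical preprocessed spectra reduced to two intensities each.
spectra = [[0.10, 0.90], [0.12, 0.88], [0.80, 0.20], [0.82, 0.18], [0.11, 0.91]]
groups = agglomerative(spectra, 2)

print(groups)
```

The resulting cluster membership is the kind of grouping that could then be used as a known qualifier in a supervised step such as discriminant analysis.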
|