1 |
Massive Data K-means Clustering and Bootstrapping via A-optimal Subsampling. Dali Zhou (6569396), 16 August 2019 (has links)
For massive data analysis, computational bottlenecks arise in two ways. First, the data may be too large to store and read easily. Second, the computation time may be too long. To tackle these problems, parallel computing algorithms such as Divide-and-Conquer have been proposed, but one of their drawbacks is that some correlations may be lost when the data is divided into chunks. Subsampling is another way to address the problems of massive data analysis while taking correlation into consideration. Uniform sampling is simple and fast but inefficient; see the detailed discussions in Mahoney (2011) and Peng and Tan (2018). The bootstrap approach uses uniform sampling and is computationally intensive, which becomes enormously challenging when the data size is massive. K-means clustering is a standard method in data analysis; it iterates to find centroids, which becomes difficult when the data size is massive. In this thesis, we propose an optimal subsampling approach for massive data bootstrapping and massive data k-means clustering. We seek the sampling distribution that minimizes the trace of the variance-covariance matrix of the resulting subsampling estimators, referred to as A-optimality in the literature; equivalently, we define the optimal sampling distribution by minimizing the sum of the component variances of the subsampling estimators. We show that the subsampling k-means centroids consistently approximate the full-data centroids, and we prove asymptotic normality using empirical process theory. We perform extensive simulations to evaluate the numerical performance of the proposed optimal subsampling approach through empirical MSE and running times, and we also apply the subsampling approach to real data. For massive data bootstrap, we conducted a large simulation study in the framework of linear regression based on the A-optimal theory proposed by Peng and Tan (2018), focusing on the performance of confidence intervals computed from A-optimal subsampling, including coverage probabilities, interval lengths, and running times. In both bootstrap and clustering we compare A-optimal subsampling with uniform subsampling.
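A minimal sketch of how such a pilot-then-subsample scheme can look in practice, assuming (for illustration only) sampling probabilities proportional to each point's distance to its nearest pilot centroid, as a stand-in for the exact A-optimal rule derived in the thesis:

```python
import numpy as np
from sklearn.cluster import KMeans

def subsample_kmeans(X, k, m, seed=0):
    """Pilot-then-subsample k-means. The distance-based probabilities below
    are an illustrative proxy for the A-optimal rule, not the thesis's exact
    derivation."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Step 1: pilot centroids from a small uniform subsample.
    pilot = rng.choice(n, size=min(m, n), replace=False)
    centers = KMeans(n_clusters=k, n_init=10).fit(X[pilot]).cluster_centers_
    # Step 2: informative sampling probabilities, mixed with a uniform
    # component to keep the inverse-probability weights bounded.
    d = ((X[:, None, :] - centers[None]) ** 2).sum(-1).min(axis=1)
    p = 0.5 * d / d.sum() + 0.5 / n
    idx = rng.choice(n, size=m, replace=True, p=p)
    # Step 3: weighted k-means on the subsample corrects the sampling bias.
    w = 1.0 / (n * p[idx])
    return KMeans(n_clusters=k, n_init=10).fit(X[idx], sample_weight=w).cluster_centers_
```

Mixing the informative probabilities with a uniform component is a common stabilization: it bounds the inverse-probability weights that correct the subsampling bias.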
|
2 |
Visualization of Regional Liver Function with Hepatobiliary Contrast Agent Gd-EOB-DTPA. Samuelsson, Johanna, January 2011 (has links)
Liver biopsy is a very common but invasive procedure for diagnosing liver disease. Such a biopsy may result in severe complications and, in some cases, even death. It would therefore be highly desirable to develop a non-invasive method that provides the same amount of information on the staging of the disease as well as the location of pathologies. This thesis describes the implementation of such a non-invasive method for visualizing and quantifying liver function through the combination of MRI (Magnetic Resonance Imaging), image reconstruction, image analysis, and pharmacokinetic modeling. The first attempt involved automatic segmentation, functional clustering (k-means), and classification (kNN) of the input data (liver, spleen, and blood-vessel segments) for the pharmacokinetic model. However, after implementing and analyzing this method, some important issues were identified and the image segmentation method was therefore revised. The segmentation method that was subsequently developed involved a semi-automatic procedure based on a modified image foresting transform (IFT). The data were then simulated and optimized using a pharmacokinetic model describing the pharmacokinetics of the liver-specific contrast agent Gd-EOB-DTPA in the human body. The output of the modeling procedure was then further analyzed, using a least-squares method, in order to assess liver function by estimating the fractions of hepatocytes, extracellular extravascular space (EES), and blood plasma in each voxel of the image. The results were in fair agreement with literature values, although further analysis and development will be required to validate and confirm the accuracy of the method.
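The per-voxel least-squares step can be sketched as follows, assuming the pharmacokinetic model supplies one modeled concentration curve per compartment (the names and shapes here are hypothetical):

```python
import numpy as np
from scipy.optimize import nnls

def tissue_fractions(curve, basis):
    """Estimate compartment fractions in one voxel by non-negative least
    squares. `curve` is the measured concentration time series (T,);
    `basis` holds modeled curves for hepatocytes, EES and plasma (T, 3)
    from the pharmacokinetic model -- hypothetical names and shapes."""
    frac, _ = nnls(basis, curve)
    return frac / frac.sum()  # normalize so the fractions sum to one
```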
|
3 |
運用文字探勘及財務資料探討中國市場營運概況文字敘述及財務表現之一致性 / Using Text Mining and Financial Data to Explore for Consistency between Narrative Disclosure and Financial Performance in China Market. 鄭凱文, Unknown Date (has links)
This study analyzes the MD&A disclosures of companies listed in mainland China in 2011 via text mining and cross-compares them with financial information to determine whether the disclosed MD&A is overstated, then uses empirical analysis to identify the factors behind overstatement. The sample comprises the MD&A and related financial information of all China-listed companies in 2011. The qualitative MD&A text is processed with the Stanford Word Segmenter, the NTUSD positive/negative word lists, TFIDF, and K-means cluster analysis; combined with a K-means cluster analysis of the financial information, this determines whether each company's MD&A disclosure is exaggerated. Variables such as company size, management's risk preference, profitability, and solvency are then used to analyze the factors that influence overstatement. The results show that company size and management's risk preference are significantly and negatively related to non-exaggerated MD&A disclosure, whereas profitability and solvency are significantly and positively related to it. The study offers investors an additional way to analyze MD&A and suggests that, when using disclosed MD&A, investors should consider whether it is overstated and adjust accordingly, so as to reduce investment risk and make sound investment decisions.
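A condensed sketch of the text-side pipeline described above (TFIDF features followed by K-means), with a placeholder corpus standing in for the segmented MD&A texts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Placeholder corpus; for Chinese text, tokens would first be produced by a
# segmenter such as the Stanford Word Segmenter and joined with spaces.
docs = ["revenue grew strongly this year", "liquidity risk increased sharply"]

X = TfidfVectorizer(max_features=5000).fit_transform(docs)
# Cluster the narratives; a parallel K-means on the financial ratios yields a
# second labeling, and the two are compared to flag inconsistent disclosures.
text_labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
```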
|
4 |
Internetové souřadnicové systémy / Internet coordinating systems. Krajčír, Martin, January 2009 (has links)
A network coordinates (NC) system is an efficient mechanism for predicting Internet distances with a limited number of measurements. This work focuses on distributed coordinate systems, which are evaluated by relative error. Based on experimental results from a simulated application, a custom algorithm for computing network coordinates was created. The algorithm was tested using a simulated network as well as RTT values from the PlanetLab network. Experiments show that clustered nodes achieve good synthetic coordinates even with limited connectivity between nodes. This work proposes the implementation of a custom NC system in a network with hierarchical aggregation. The resulting application was published on the research project web page of the Department of Telecommunications.
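The thesis's own algorithm is not spelled out in the abstract; as a point of reference, a single update step of a Vivaldi-style spring-relaxation NC system, one standard distributed approach, might look like this:

```python
import numpy as np

def vivaldi_update(x_i, x_j, rtt, step=0.05):
    """Move node i's coordinate along the error between the measured RTT and
    the current coordinate distance to node j (spring relaxation)."""
    d = np.linalg.norm(x_i - x_j)
    if d == 0:  # coincident coordinates: pick a random push direction
        u = np.random.randn(len(x_i))
        u /= np.linalg.norm(u)
    else:
        u = (x_i - x_j) / d
    return x_i + step * (rtt - d) * u
```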
|
5 |
Automatic K-Expectation-Maximization (K-EM) Clustering Algorithm for Data Mining Applications. Harsh, Archit, 12 August 2016 (has links)
A non-parametric data clustering technique for achieving efficient data clustering and improving the choice of the number of clusters is presented in this thesis. The K-Means and Expectation-Maximization algorithms have been widely deployed in data clustering applications. Findings in related work reveal that both algorithms suffer from shortcomings: K-Means does not guarantee convergence to an optimal solution, and the choice of the number of clusters heavily influences its results; Expectation-Maximization can converge prematurely, which does not assure optimal results, and, as with K-Means, the choice of the number of clusters influences the results. To overcome these shortcomings, a fast automatic K-EM algorithm is developed that finds an optimal number of clusters by employing various internal cluster validity metrics, providing efficient and unbiased results. The algorithm is applied to a wide array of data sets to verify the accuracy of the results and the efficiency of the algorithm.
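One way such automatic selection can work is to fit EM over a range of cluster counts and keep the model preferred by an internal validity criterion; the sketch below uses BIC as one such criterion (the thesis combines several metrics):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_k_by_bic(X, k_max=10):
    """Fit Gaussian-mixture EM for k = 1..k_max and return the fit with the
    lowest BIC, one example of an internal validity criterion."""
    best_model, best_bic = None, np.inf
    for k in range(1, k_max + 1):
        gm = GaussianMixture(n_components=k, n_init=5).fit(X)
        bic = gm.bic(X)
        if bic < best_bic:
            best_model, best_bic = gm, bic
    return best_model.n_components, best_model
```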
|
6 |
K-groups: A Generalization of K-means by Energy Distance. Li, Songzi, 29 April 2015 (has links)
No description available.
|
7 |
Statistische Eigenschaften von Clusterverfahren / Statistical properties of cluster procedures. Schorsch, Andrea, January 2008 (has links)
This thesis examines two aspects of the statistical properties of clustering procedures. First, it addresses the existence of different cluster analysis methods for uncovering structure and the ways in which their strategies differ: starting from the same data, the method of distances between manifolds and the K-means method produce different final clusterings.
The second part examines the asymptotic properties of the K-means procedure. The set of optimal cluster centers is consistent: as the sample size grows to infinity, it converges in probability to the set of cluster centers that minimizes the within-cluster variance criterion, and, suitably normalized, it converges to a normal distribution. A main finding is that the individual cluster centers are mutually dependent.
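The consistency statement can be checked empirically with a small simulation: as the sample size grows, the fitted centers stabilize around the population minimizers of the within-cluster variance criterion (which, for a Gaussian mixture, are not exactly the component means). A sketch:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
for n in (100, 1_000, 10_000, 100_000):
    # Two-component Gaussian mixture with means -2 and +2.
    X = np.concatenate([rng.normal(-2.0, 1.0, n // 2),
                        rng.normal(2.0, 1.0, n // 2)]).reshape(-1, 1)
    centers = np.sort(KMeans(n_clusters=2, n_init=10).fit(X).cluster_centers_.ravel())
    print(n, centers)  # the fitted centers settle as n grows
```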
|
8 |
探討美國上市公司MD&A揭露與財務表現一致性之決定因素 / Explore the Determinants of the Consistency between US Listed Companies’ MD&A Disclosure and Financial Performance. 李宸昕, Lee, Chen Hsin, Unknown Date (has links)
This study analyzes the MD&A disclosures of US listed companies from 2004 to 2014 via text mining techniques such as the Loughran and McDonald word lists and TFIDF, and cross-compares each company's MD&A with its financial information using K-means, establishing an index that captures the consistency between the MD&A tone and financial performance. An empirical model with explanatory variables such as earnings volatility, company size, and company age is then developed for the consistency index.
The results show that company size, operating risk, analyst coverage, and company age are significantly related to MD&A tone consistency. Three robustness checks yield similar results. The findings suggest that, rather than merely reading the MD&A, investors should consider whether it is overly optimistic or overly pessimistic, adjust for this, and thereby make better economic decisions.
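For illustration, the tone measure underlying such an analysis is typically a normalized word count against the Loughran and McDonald finance word lists; the sketch below uses toy excerpts, not the actual lists:

```python
# Toy excerpts -- the real Loughran and McDonald lists are much larger.
POSITIVE = {"achieve", "gain", "improve", "strong"}
NEGATIVE = {"decline", "impairment", "loss", "weak"}

def net_tone(text: str) -> float:
    """Net tone = (positive count - negative count) / total words."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / max(len(words), 1)
```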
|
9 |
Inner Ensembles: Using Ensemble Methods in Learning Step. Abbasian, Houman, 16 May 2014 (has links)
A pivotal moment in machine learning research was the creation of an important new research area known as Ensemble Learning. In this work, we argue that ensembles are a very general concept and, though they have been widely used, they can be applied in more situations than they have been to date. Rather than using them only to combine the output of an algorithm, we can apply them to decisions made inside the algorithm itself, during the learning step. We call this approach Inner Ensembles. The motivation for developing Inner Ensembles was the opportunity to produce models with similar advantages to regular ensembles, such as accuracy and stability, plus additional advantages such as comprehensibility, simplicity, rapid classification, and a small memory footprint. The main contribution of this work is to demonstrate how broadly this idea can be applied and to highlight its potential impact on all types of algorithms. To support our claim, we first provide a general guideline for applying Inner Ensembles to different algorithms. Then, using this framework, we apply them to two categories of learning methods: supervised and unsupervised. For the former we chose Bayesian networks, and for the latter K-Means clustering. Our results show that 1) the overall performance of Inner Ensembles is significantly better than that of the original methods, and 2) Inner Ensembles provide similar performance improvements as regular ensembles.
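The abstract does not detail the K-Means construction; one plausible reading of applying an ensemble to a decision inside the learning step is to replace each centroid update with a bagged estimate, sketched here purely as an illustration:

```python
import numpy as np

def inner_ensemble_kmeans(X, k, n_estimators=10, n_iter=50, seed=0):
    """K-means in which each centroid update -- a decision made inside the
    learning step -- is itself an ensemble: the new centroid is the average
    of means over bootstrap resamples of the assigned points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts) == 0:
                continue  # keep an empty cluster's previous centroid
            boots = [pts[rng.integers(0, len(pts), len(pts))].mean(axis=0)
                     for _ in range(n_estimators)]
            centers[j] = np.mean(boots, axis=0)
    return centers
```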
|
10 |
[en] IMAGE SEGMENTATION BASED ON SUPERPIXEL GRAPHS / [pt] SEGMENTAÇÃO DE IMAGENS BASEADA EM GRAFOS DE SUPERPIXEL. CAROLINE ROSA REDLICH, 01 August 2018 (has links)
[en] Image segmentation for object modeling is a complex task that is still not well solved. The separation of the regions corresponding to each object in an image is generally based on proximity, similarity, and the discontinuity of its boundaries. The image to be segmented can be of various natures, including photographs, medical images, and seismic images. Many segmentation methods have been proposed in the literature as solutions to different problems. Recently, the superpixel technique has been used as an initial step that reduces the size of the problem input. This work proposes a methodology for segmenting photographs and ultrasound images based on variants of superpixels. The proposed methodology adapts to the nature of the image and the complexity of the problem using different similarity and distance measures. The work also presents results that clarify the proposed procedure and the choice of its parameters.
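A common instantiation of superpixel-graph segmentation, sketched with recent scikit-image (in older versions the graph module lives under skimage.future.graph); the thesis varies the superpixel method and the similarity measure with the image type:

```python
from skimage import color, data, graph, segmentation

img = data.astronaut()                                   # sample RGB image
superpixels = segmentation.slic(img, n_segments=400, compactness=10)
rag = graph.rag_mean_color(img, superpixels)             # similarity = mean color
merged = graph.cut_threshold(superpixels, rag, thresh=29)
out = color.label2rgb(merged, img, kind='avg')           # visualize merged regions
```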
|