Return to search

New results on estimating sortedness

Estimating the sortedness of a sequence has found applications in, e.g., sorting algorithms, database management and webpage ranking. As the data volume in many of these applications is massive, recent research has been focusing on estimating sortedness in the data stream model. In this thesis, we extend the study of this problem to a number of directions.



One common measurement of sortedness is the edit distance to monotonicity. Given a stream of items drawn from a totally ordered set, its edit distance to monotonicity is the minimum number of items to remove so that the remaining items are non-decreasing. The space complexity of estimating the edit distance to monotonicity of a data stream is becoming well-understood over the past few years. Motivated by applications on network quality monitoring, we extend the study to estimating the edit distance to monotonicity of a sliding window covering the w most recent items in the stream for any w _ 1. We give a deterministic algorithm which can return an estimate within a factor of (4 + _) using O( 1

_2 log2(_w)) space.



We further extend the study in two directions. First, we consider a stream where each item is associated with a value from a partially ordered set. We give a randomized (4+_)-approximate algorithm using O( 1_2 log _2w log w) space. Second, we consider an out-of-order stream where each item is associated with a creation time and a numerical value, and items may be out of order with respect to their creation times. The goal is to estimate the edit distance to monotonicity with respect to the numerical value of items arranged in the order of creation times. We show that any randomized constant-approximate algorithm requires linear space.



Finally, we revisit the classical problem of estimating the length of the longest increasing subsequence (LIS) of a data stream. Previous work shows that any deterministic algorithm requires ?(pN) space through a communication problem Hidden-IS, where N is the number of items in the stream. But the randomized space complexity of LIS is open [2]. [23] has given an efficient randomized protocol for Hidden-IS, showing that Hidden-IS may be significantly easier than LIS. We give an even simpler and more efficient randomized protocol for the Hidden-IS problem, indicating that it is unlikely that this communication problem can lead to a polynomial randomized space lower bound for the LIS problem. On the positive side, we propose a new communication problem which we conjecture to be hard enough to lead to a super polylogarithmic randomized space lower bound for the LIS problem. / published_or_final_version / Computer Science / Master / Master of Philosophy

  1. 10.5353/th_b4716850
  2. b4716850
Identiferoai:union.ndltd.org:HKU/oai:hub.hku.hk:10722/174329
Date January 2011
CreatorsPan, Jiangwei., 潘江伟.
ContributorsChan, HL, Lam, TW
PublisherThe University of Hong Kong (Pokfulam, Hong Kong)
Source SetsHong Kong University Theses
LanguageEnglish
Detected LanguageEnglish
TypePG_Thesis
Sourcehttp://hub.hku.hk/bib/B4716850X
RightsThe author retains all proprietary rights, (such as patent rights) and the right to use in future works., Creative Commons: Attribution 3.0 Hong Kong License
RelationHKU Theses Online (HKUTO)

Page generated in 0.0018 seconds