Streaming Algorithms and data structures
Source: https://gist.github.com/debasishg/8172796
 General Background and Overview
a) Probabilistic Data Structures for Web Analytics and Data Mining : On Highly Scalable Blog (http://highlyscalable.wordpress.com/2012/05/01/probabilisticstructureswebanalyticsdatamining/) : A great overview of the space of probabilistic data structures and how they are used in approximation algorithm implementation. b) Models and Issues in Data Stream Systems : (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.106.9846) c) Approximate Frequency Counts over Data Streams by Gurmeet Singh Manku & Rajeev Motwani (http://www.vldb.org/conf/2002/S10P03.pdf) : One of the early papers on the subject. d) Methods for Finding Frequent Items in Data Streams by Graham Cormode & Marios Hadjieleftheriou (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.187.9800&rep=rep1&type=pdf) e) The space complexity of approximating the frequency moments by Noga Alon, Yossi Matias, Mario Szegedy : one of the most influential papers introducing succinctness in computing frequency moments (http://www.tau.ac.il/~nogaa/PDFS/amsz4.pdf)
 Hyperloglog and MinHash : Implementation of a form of hyperloglog and adding capabilities of MinHash algorithm on to it which would enable to perform set intersections.”While it does require extra processing power to deal with collecting all the minima, it’s possible to get satisfactory performance out of the structure for a relatively low storage or memory footprint” (http://tech.adroll.com/blog/data/2013/07/10/hllminhash.html)

Streaming/Sketching Conference from AK Tech : Contains links to videos and slides from the speakers like Muthukrishnan who spoke about Count Min Sketch (http://blog.aggregateknowledge.com/2013/05/23/foundationcapitalandaggregateknowledgesponsorstreamingsketchingconference/)

 Qdigest
 a) Medians and Beyond: New Aggregation Techniques for Sensor Networks : The paper that introduced qdigest for range queries and quantile approximation (http://www.cs.virginia.edu/~son/cs851/papers/ucsb.sensys04.pdf) b) Blog post (http://papercruncher.com/2011/07/31/qdigest/) c) Blog post (http://www.prelert.com/blog/qdigestanalgorithmforcomputingapproximatequantilesonacollectionofintegers/) d) The Art of Approximating Distributions: Histograms and Quantiles at Scale  an alternative approach to qdigest (http://metamarkets.com/2013/histograms/#)

tdigest : A new data structure for accurate online accumulation of rankbased statistics such as quantiles and trimmed means. Ted Dunning’s variant of Qdigest that does some improvements (https://github.com/tdunning/tdigest)

 Implementations
 a) streamlib : A collection of Stream summarization and cardinality estimation algorithms like CM Sketch, Hyperloglog, Bloom Filters (https://github.com/addthis/streamlib) b) Algebird from Twitter : (https://github.com/twitter/algebird)

CountMin Sketch
a) An Improved Data Stream Summary: The CountMin Sketch and its Applications  Cormode & Muthukrishnan : The paper that introduced count min sketch (http://dimacs.rutgers.edu/~graham/pubs/papers/cmfull.pdf) b) collection of information on Count Min Sketch (https://sites.google.com/site/countminsketch/) c) Count Min Sketch by Cormode : Introductory paper (http://dimacs.rutgers.edu/~graham/pubs/papers/cmencyc.pdf) d) Streaming Algorithms and Sketches  Count Min Sketch on AK Tech Blog (http://blog.aggregateknowledge.com/2011/09/13/streamingalgorithmsandsketches/) e) Muthukrishnan talking on Count Min Sketch at AK Tech conference (http://www.youtube.com/watch?v=OOZC4KCErN0) f) Sketch Techniques for Approximate Query Processing by Cormode (http://people.cs.umass.edu/~mcgregor/711S12/sketches1.pdf) g) Sketching data structures  a good overview of Bloom Filters and Count Min Sketch : (http://lkozma.net/blog/sketchingdatastructures/)
 Surveys
a) References for Data Stream Algorithms by Graham Cormode : an exhaustive set of references with explanations (http://dimacs.rutgers.edu/~graham/pubs/papers/bristol.pdf) b) Data Streams  Algorithms and Applications by s. Muthukrishnan (http://www.amazon.com/DataStreamsApplicationsFoundationsTheoretical/dp/193301914X) : This is an excellent monograph with surveys of all algorithms related to data streams. Also a free copy of the book is available from Muthu’s web site at http://www.cs.rutgers.edu/~muthu/ c) Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches by Graham Cormode1, Minos Garofalakis, Peter J. Haas and Chris Jermaine (http://www.nowpublishers.com/articles/foundationsandtrendsindatabases/DBS004). Describes basic principles and recent developments in approximate query processing. It focuses on four key synopses: random samples, histograms, wavelets, and sketches. It considers issues such as accuracy, space and time efficiency, optimality, practicality, range of applicability, error bounds on query answers, and incremental maintenance. It also discusses the tradeoffs between the different synopsis types.
 Distributed Streams Algorithms for Sliding Windows by Phillip B. Gibbons and Srikanta Tirthapura (http://home.engineering.iastate.edu/~snt/pubs/tocs04.pdf)
Related Links
 Piotr Indyk
 Madalgo (centre for massive algorithms).