1. Introduction > Motivations and applications: what does Big mean? > A few examples > General tasks > Tools References - [1], Chapter 1, in particular 1.3 - Very popular article: http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1 Suggested readings - [2] - IEEE Spectrum's interview with Michael Jordan: http://spectrum.ieee.org/robotics/artificial-intelligence/machinelearning-maestro-michael-jordan-on-the-delusions-of-big-data-and-other-huge-engineering-efforts 2. Recap on MapReduce and introduction to Spark > Hadoop > Spark References - [1], Chapter 2 - [3] - Spark documentation: http://spark.apache.org/documentation.html 3. Locality sensitive hashing A case study: min-hashing and estimation of Jaccard similarity of shingled documents. Min-wise permutations and Jaccard similarity. Locality sensitive hashing for Jaccard similarity. Signatures and the banding technique. Detecting candidate pairs. References - [1], Sections 3.1, 3.2, 3.3 and 3.4, 3.5 (no 3.5.5), 3.6, 3.7 until 3.7.2 (included) - [4] until section 3 (included) - Suggested reading: [1], section 3.8 LIST OF REFERENCES [1] Leskovec, Jure, Anand Rajaraman, and Jeffrey David Ullman. Mining of massive datasets. Cambridge University Press, 2014. [2] Boyd, Danah, and Kate Crawford. "Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon." Information, communication & society 15.5 (2012): 662-679. URL: http://www.tandfonline.com/doi/abs/10.1080/1369118X.2012.678878 [3] Zaharia, Matei, et al. "Spark: Cluster Computing with Working Sets." HotCloud 10.10-10 (2010): 95. URL: http://static.usenix.org/legacy/events/hotcloud10/tech/full_papers/Zaharia.pdf [4] Andoni, Alexandr, and Piotr Indyk. "Near-optimal hashing algorithms for near neighbor problem in high dimension." Communications of the ACM 51.1 (2008).