### Sampling Techniques for Massive Data

Google Tech TalksMarch 27, 2007ABSTRACTConsider a giant data matrix A of N rows and D columns. At Web scale, both N and D can be in the order of billions. In applications including duplicate (doc) detections, word associations, databases, nearest neighbors, kernels (e.g., for SVM), it is often desirable to store a very small fraction (sample) of the data to fit in physical memory for quickly computing summary statistics (e.g. L1 or L2 distances). Because the data are often highly sparse, conventional sampling methods (i.e., randomly selecting a few columns from the data matrix) would not work well. Two sampling methods, conditional random sampling (CRS) and stable random projections (SRP),...

Length:
49:50