Ability to Uniformly Sample 1000 Rows (or More, if the Engine Permits) in Data Refinery

Data often arrives ordered by some index or column. When it does, data exploration and visualization are biased towards one part of the data. I propose letting the user pick which 1000 rows to display: the first 1000, the last 1000, or a uniform sample of 1000 rows.
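To make the three options concrete, here is a minimal sketch in plain Python. It is not how Data Refinery works internally; the function name `pick_rows`, the `mode` values, and the `seed` parameter are all hypothetical, and it assumes the rows fit in a list in memory.

```python
import random

def pick_rows(rows, n=1000, mode="uniform", seed=None):
    """Return up to n rows using one of three (hypothetical) display modes."""
    if n <= 0:
        return []
    if mode == "first":
        return list(rows[:n])
    if mode == "last":
        return list(rows[-n:])
    if mode == "uniform":
        if len(rows) <= n:
            # Fewer rows than requested: show everything.
            return list(rows)
        rng = random.Random(seed)
        # Draw n distinct positions uniformly at random, then sort them
        # so the sample keeps the original row order.
        positions = sorted(rng.sample(range(len(rows)), n))
        return [rows[i] for i in positions]
    raise ValueError(f"unknown mode: {mode!r}")

# Example: 10,000 rows that arrive sorted by some column.
rows = list(range(10_000))
sample = pick_rows(rows, n=1000, mode="uniform", seed=0)
```

With sorted input, the "first" and "last" modes each cover only one end of the data, while the uniform mode draws from the whole range.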

A second proposal is to let the user work with more than 1,000 rows. Sometimes the data set is too big and using all the rows is not possible, but sometimes the data set has only 10,000 rows and the user wants to see all the observations in the visualizations. The 1,000-row limit seems a little low, so maybe the limit could instead be num_rows * num_columns < M, where M is a number small enough that the tool can still provide fast results.
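As a rough illustration of the num_rows * num_columns < M rule, the sketch below computes how many rows could be shown for a given data set. The function name `max_rows` and the example value of M are assumptions made for illustration only; a real value of M would have to come from what the engine can handle quickly.

```python
def max_rows(num_rows, num_columns, cell_budget):
    """Largest row count r <= num_rows such that r * num_columns < cell_budget."""
    if num_columns <= 0 or cell_budget <= 0:
        return 0
    # r * num_columns < cell_budget  is the same as  r <= (cell_budget - 1) // num_columns
    limit = (cell_budget - 1) // num_columns
    return min(num_rows, limit)

M = 1_000_000  # assumed budget, purely illustrative

# A 10,000-row, 20-column data set fits entirely under the budget.
small = max_rows(10_000, 20, M)      # 10000

# A 5,000,000-row, 20-column data set is capped by the budget.
large = max_rows(5_000_000, 20, M)   # 49999
```

Under this rule, narrow data sets get more rows and wide ones get fewer, so the total amount of data the tool has to process stays bounded.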

Please find attached an example of a map with and without uniform sampling. Without sampling, I only see zip codes on the East Coast; with sampling, I can see zip codes all over the country!

  • Jorge Castanon
  • Mar 1 2019
  Needs review
