On 5/21/2013 2:05 PM, David Eddy wrote:
> John -
>
> On May 21, 2013, at 10:20 AM, John Bottoms wrote:
>> With the most complex data sets I've worked on, which are on the
>> order of 150 million points of dirty data,
> So what does the enterprising Big Data Scientist do with so much
> suspect data? Clean it up? Smooth out the statistical anomalies?
> Cross their fingers?
DEddy,
The cleanup for that project was based on an informed understanding
of the types of errors. Some came from the Scantron machine that
read the data sheets, and some came from incorrect input by the
users. When developing metrics, outliers offer little value in some
cases. Sometimes the data is graphed, and the choice of graph
matters. Sometimes "rule-of-thumb" metrics are used, but you have
to know when they are valid. At times an estimation calculation is
done first and then fed into the statistical analysis. Statisticians
and psychometricians have an almost intuitive feel for how to deal
with dirty data. It is a part of Big Data that has not yet been
addressed sufficiently.
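
For what it's worth, here is a minimal sketch in Python of one such
rule of thumb: flagging outliers with the median-absolute-deviation
(modified z-score) test before estimating summary statistics. The 3.5
cutoff and the sample scores are illustrative assumptions on my part,
not values from the project described above.

    # Minimal sketch: flag outliers with a median-absolute-deviation
    # (MAD) rule of thumb before computing statistics on dirty data.
    # The 3.5 threshold is a common convention, assumed here.
    import statistics

    def mad_filter(values, threshold=3.5):
        """Return (clean, outliers) using the modified z-score rule."""
        med = statistics.median(values)
        mad = statistics.median(abs(v - med) for v in values)
        if mad == 0:                      # degenerate case: no spread
            return list(values), []
        clean, outliers = [], []
        for v in values:
            z = 0.6745 * (v - med) / mad  # modified z-score
            (outliers if abs(z) > threshold else clean).append(v)
        return clean, outliers

    scores = [72, 75, 71, 74, 73, 9, 250, 76]  # hypothetical readings
    clean, outliers = mad_filter(scores)
    print("kept:", clean)       # run statistics on the cleaned data
    print("flagged:", outliers) # review these against the source sheets

Knowing whether the flagged values are Scantron misreads or genuine
extremes still takes the kind of informed judgment described above.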
-John Bottoms
FirstStar Systems
Concord, MA USA