
Re: [ontolog-forum] Unit testing and usability validation of schemas and

To: ontolog-forum@xxxxxxxxxxxxxxxx
From: John Bottoms <john@xxxxxxxxxxxxxxxxxxxx>
Date: Tue, 21 May 2013 14:52:50 -0400
Message-id: <519BC282.2010109@xxxxxxxxxxxxxxxxxxxx>
On 5/21/2013 2:05 PM, David Eddy wrote:
John -

On May 21, 2013, at 10:20 AM, John Bottoms wrote:

With the most complex data sets I've worked on, which is on the order of 150 million points of dirty data,

So what does the enterprising Big Data Scientist do with so much suspect data?

Clean it up?  Smooth out the statistical anomalies?  Cross their fingers?

The cleanup for that project was based on an informed understanding of the types of errors. Some came from the Scantron machine that read the data sheets, and some came from incorrect input by the users. When developing metrics, the outliers offer little value in some cases. Sometimes the data is graphed, and the type of graph used is important. Sometimes "rule-of-thumb" metrics are used, but you have to know when they are valid. At times, an estimate is calculated first and then used in the statistical analysis.
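
As a rough illustration of one such rule-of-thumb metric, the sketch below flags outliers with a modified z-score built on the median and the median absolute deviation (MAD). This is not the method used on the project described above. The function name `flag_outliers`, the sample `scores`, and the cutoff of 3.5 are all assumptions chosen for the example, and the cutoff is exactly the kind of rule of thumb whose validity depends on the data.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Mark values whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * |x - median| / MAD. The 3.5 cutoff
    is a common rule of thumb, assumed here for illustration only.
    """
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad == 0:
        # At least half the values equal the median, so the score is
        # undefined; conservatively flag nothing.
        return [False] * len(values)
    return [0.6745 * d / mad > threshold for d in abs_dev]


# Hypothetical sheet scores with one misread entry (999).
scores = [52, 55, 49, 51, 53, 50, 999, 54]
flags = flag_outliers(scores)
clean = [s for s, bad in zip(scores, flags) if not bad]
```

The median and MAD are used instead of the mean and standard deviation because a single gross error, such as a misread sheet, shifts them far less. Whether a flagged point is dropped, corrected, or kept still depends on knowing where the error came from.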

The statisticians and psychometricians have an almost intuitive feel for how to deal with dirty data. It is a part of Big Data that has not yet been addressed sufficiently.

-John Bottoms
 FirstStar Systems
 Concord, MA USA
David Eddy

Message Archives: http://ontolog.cim3.net/forum/ontolog-forum/  
Config Subscr: http://ontolog.cim3.net/mailman/listinfo/ontolog-forum/  
Unsubscribe: mailto:ontolog-forum-leave@xxxxxxxxxxxxxxxx
Shared Files: http://ontolog.cim3.net/file/
Community Wiki: http://ontolog.cim3.net/wiki/ 
To join: http://ontolog.cim3.net/cgi-bin/wiki.pl?WikiHomePage#nid1J
