
Re: [ontolog-forum] FW: FW: Looking to the Future of Data Science - NYTimes.com - 2014.08.27

To: "[ontolog-forum] " <ontolog-forum@xxxxxxxxxxxxxxxx>
From: "Barkmeyer, Edward J" <edward.barkmeyer@xxxxxxxx>
Date: Thu, 4 Sep 2014 21:23:44 +0000
Message-id: <cdaa8938ee5f4f33aba7dcb98692afda@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
 
 
_____________________________________________
From: ontolog-forum-bounces@xxxxxxxxxxxxxxxx [mailto:ontolog-forum-bounces@xxxxxxxxxxxxxxxx] On Behalf Of Rich Cooper
Sent: Wednesday, September 03, 2014 7:35 PM
To: '[ontolog-forum] '
Subject: Re: [ontolog-forum] FW: FW: Looking to the Future of Data Science - NYTimes.com - 2014.08.27
 
 
Dear Ed,
 
You wrote:
My point about data discovery is that the data it operates on was originally captured and organized for useful purposes that just happen to be different from the purpose of the discovery exercise.  That data was not captured just because we thought that at some time in the future it might be valuable to have it.  (That is why I called the concern a ‘paradigm for the acquisition mindset.’) 
 
Yes, in most cases the captured data was collected purely for operational and accounting purposes, and the RDB layout was usually designed for the sole purpose of handling the throughput of the available processing systems.  That is what I mean by the “complexity of modern systems”. 
 
But a typical use case is that the business wants to understand customer purchases, purchase rates, and purchases at other businesses (through sharing of data about customers [forget about privacy], using driver’s license or SSN numbers as identifying keys).  Those items were not usually intended to be mined later, but they are later found useful in principle and therefore mined. 
 
Thus most really big data is based on more than one database source, though most of the data may be from one source. 
 
[EJB] Why is this “really big data”?  If the database is the one captured for other purposes, it does not get bigger because you changed the usage.  If you have SSNs in your database, it is not particularly difficult to extract them from whatever your customer record form is. 
 
[EJB] What you are talking about is probably examination of several years’ worth of transaction logs, rather than the database, which was designed only to carry ‘current’ operational information.  And yes, transaction logs are a poor organization for capturing historical information.  But the problem is recovering information from past practices, not from ‘the complexity of current technology’.  And for the transaction search problem, the first step is to apply a filter, and make a database of the things that get through it.  Then you can apply the available technology to a useful information structure.  Yes, it will take a while to run the filter, but it is a linear process and you do it once.  And yes, you probably have to hard-code the filter.  But the problem here is that the ‘big data’ in question is in a poor form for any purpose but the original – reconstructing the database after a failure.  Some partly hard-coded process will be needed to convert it to a form useful for any search technology.  And that process is well within the capabilities of existing equipment and software.
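
[A minimal sketch of the linear filter pass Ed describes, in Python; the log layout, field names, and filter predicate are hypothetical, chosen only to illustrate the one-pass, hard-coded nature of the step:]

    import csv
    import sqlite3

    # One linear pass over the transaction log: apply a (hard-coded)
    # filter and load the survivors into an ordinary database that
    # later search technology can use.
    conn = sqlite3.connect("filtered.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS purchases
                    (customer_id TEXT, amount REAL, timestamp TEXT)""")

    with open("transactions.log", newline="") as log:
        for rec in csv.DictReader(log):
            # Hypothetical filter: keep only purchase records.
            if rec["type"] == "PURCHASE":
                conn.execute("INSERT INTO purchases VALUES (?, ?, ?)",
                             (rec["customer_id"], rec["amount"],
                              rec["timestamp"]))

    conn.commit()
    conn.close()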
 
But Hans’ point was that there are all kinds of unsuspected, even unimagined correlations among data entities – not the entities in the data model, but those mentioned in the columns as data. 
 
[EJB] Absolutely!  In Excel, this is the “pivot table” idea, and it is in many ways the foundation of the LOD idea:  Where all does this name appear?  And more interestingly, how often does it appear in some relationship to this other name?  But that only becomes a big data problem when you couple it with federation of many data sources.  And now we ask:  Is the value of the answer worth the cost of the massive federation?  And there is another factor.  Back in the 1960s, the 12-hour accounting run was common, and the 48-hour resource allocation run was not unheard-of.  The fact that those runs now take seconds to minutes has caused us to think of a 48-hour run as “not in business time”, and yet, for many of these discovery processes, two weeks would be adequate for “business time” (if only the management had the patience for that, but they have a much shorter attention span).
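
[A minimal sketch of the “pivot table” question Ed poses -- how often does this name appear in some relationship to this other name? -- assuming Python/pandas and a hypothetical single-source extract with “customer” and “merchant” columns:]

    import pandas as pd

    # Hypothetical extract: one row per purchase event.
    df = pd.read_csv("purchases.csv")

    # Count of purchases per (customer, merchant) pair -- the
    # co-occurrence question, answered before any federation.
    pivot = df.pivot_table(index="customer", columns="merchant",
                           values="amount", aggfunc="count",
                           fill_value=0)
    print(pivot.head())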
 
I understand your concern, though; I just wanted to set the archive record straight about the mining that can be, and very often now is, applied to the larger business picture.  And with great profit, by the way, according to reports from users. 
 
[EJB] Yes, the success stories.  I guess they didn’t have serious big data problems.
 
-Ed
 
-Rich
 
Sincerely,
Rich Cooper
EnglishLogicKernel.com
Rich AT EnglishLogicKernel DOT com
9 4 9 \ 5 2 5 - 5 7 1 2
From: ontolog-forum-bounces@xxxxxxxxxxxxxxxx [mailto:ontolog-forum-bounces@xxxxxxxxxxxxxxxx] On Behalf Of Barkmeyer, Edward J
Sent: Wednesday, September 03, 2014 3:17 PM
To: [ontolog-forum]
Subject: Re: [ontolog-forum] FW: FW: Looking to the Future of Data Science - NYTimes.com - 2014.08.27
 
Rich,
 
Lest there be any further confusion, I was talking about XML as the data store form, not as the transmission form.  The purpose of XQuery is not to be a query language for messages in the transmission form.  And yes, I should have said “(XML, XQuery) databases”, or perhaps hyphenated the term, so that it would be clear that there were only two items in the list.
 
My point about data discovery is that the data it operates on was originally captured and organized for useful purposes that just happen to be different from the purpose of the discovery exercise.  That data was not captured just because we thought that at some time in the future it might be valuable to have it.  (That is why I called the concern a ‘paradigm for the acquisition mindset.’)  The information that is there was not “obscured due to the complexity of typical systems”; it was obscured by not being a focus of interest at the time the data was captured.
 
And OBTW, you won’t discover anything if you don’t inject the integrating ontology/schema for the new knowledge you want to extract, and in most such papers that I have seen, you also have to inject the schema mapping, one way or another.  The good ones allow interesting functions in the mapping.  If anything, the process for ‘discovering’ the information is technically more complex than the process of storing it was.
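
[A minimal sketch of what “injecting the integrating ontology/schema and the schema mapping” can look like in practice, assuming Python/rdflib; the vocabulary and the source layouts are hypothetical:]

    from rdflib import Graph, Literal, Namespace, URIRef

    EX = Namespace("http://example.org/integrating#")
    g = Graph()

    # Mapping for source A, hypothetically (ssn, name) tuples.
    for ssn, name in [("123-45-6789", "Jones")]:
        person = URIRef("http://example.org/person/" + ssn)
        g.add((person, EX.name, Literal(name)))

    # Mapping for source B, a differently-structured source,
    # lifted into the same integrating vocabulary.
    for ssn, merchant in [("123-45-6789", "Acme")]:
        person = URIRef("http://example.org/person/" + ssn)
        g.add((person, EX.purchasedFrom, Literal(merchant)))

    # Only after both mappings are injected can the 'discovery'
    # query see across the sources.
    q = """SELECT ?name ?merchant WHERE {
             ?p <http://example.org/integrating#name> ?name ;
                <http://example.org/integrating#purchasedFrom> ?merchant }"""
    for name, merchant in g.query(q):
        print(name, merchant)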
 
-Ed
 
 
 
 
From: ontolog-forum-bounces@xxxxxxxxxxxxxxxx [mailto:ontolog-forum-bounces@xxxxxxxxxxxxxxxx] On Behalf Of Rich Cooper
Sent: Tuesday, September 02, 2014 10:50 PM
To: '[ontolog-forum] '
Subject: [ontolog-forum] FW: FW: Looking to the Future of Data Science - NYTimes.com - 2014.08.27
 
Hans Polzer describes more cogently than I did why the data model (schema, or whatever nomenclature you prefer) does NOT represent all the information to be discovered.  His post is below.
 
-Rich
 
Sincerely,
Rich Cooper
EnglishLogicKernel.com
Rich AT EnglishLogicKernel DOT com
9 4 9 \ 5 2 5 - 5 7 1 2
From: ontolog-forum-bounces@xxxxxxxxxxxxxxxx [mailto:ontolog-forum-bounces@xxxxxxxxxxxxxxxx] On Behalf Of Rich Cooper
Sent: Tuesday, September 02, 2014 1:03 PM
To: '[ontolog-forum] '
Subject: Re: [ontolog-forum] FW: Looking to the Future of Data Science - NYTimes.com - 2014.08.27
 
EJB:>  What is wanted is not a paradigm shift in processing technology – the last two paradigm shifts got us XML databases and XQuery and RDF triple stores, both of which are clumsy repositories that just make the Big Data problem more expensive. 
 You state three items, “both of which” are clumsy.  Actually, the first item, XML, has been a very useful method for communicating within N-tier systems.  It has great value there, but it is usually converted into the tables, columns, and domains of RDBs, where the info gets stored.  So XML is not a problem for most systems.  There are even free XML parsers, packaged as components for programmers to call so they don’t have to do the parsing themselves.  It has been very, very useful for data interchange among multiple systems. 
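[A minimal sketch of that conversion path, assuming Python’s standard-library XML parser and SQLite; the message layout and element names are hypothetical:]

    import sqlite3
    import xml.etree.ElementTree as ET

    conn = sqlite3.connect("orders.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS items
                    (order_id TEXT, sku TEXT, qty INTEGER)""")

    # Parse the interchange message with a stock parser, then store
    # it in the tables, columns, and domains where the info lives.
    root = ET.parse("order_message.xml").getroot()
    for order in root.iter("order"):
        for item in order.iter("item"):
            conn.execute("INSERT INTO items VALUES (?, ?, ?)",
                         (order.get("id"), item.get("sku"),
                          int(item.get("qty"))))
    conn.commit()
    conn.close()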
EJB:> What is wanted (as Michael Brunnbauer hinted) is a paradigm shift in data acquisition mindset.  I will paraphrase some other contribution to this exploder, which I have since lost:  “If you don’t know what you have when you get it, you will never know it later.”  
Wrong!!!!  The whole point of discovery systems is recognizing new information that was already in the database but is obscured from obvious observers by the complexity of typical systems today.  You don’t know what it is in advance; you can only discover it through analysis. 
 
The stuff that is already known to be in the database can simply be queried.  But the full range of relationships, which are NOT KNOWN explicitly in the data model, can be brought out through discovery processes. 
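
[One minimal sketch of such a discovery process, assuming Python/pandas: score value pairs from two columns by how much more often they co-occur than independence would predict.  The column names are hypothetical, and lift is only one of many possible discovery measures:]

    import pandas as pd

    df = pd.read_csv("records.csv")
    n = len(df)

    # Joint frequency of each (city, product) value pair -- these are
    # values in the columns, not entities in the data model.
    joint = df.groupby(["city", "product"]).size() / n
    p_city = df["city"].value_counts(normalize=True)
    p_prod = df["product"].value_counts(normalize=True)

    # Lift: observed joint frequency over the independence baseline.
    baseline = [p_city[c] * p_prod[p] for c, p in joint.index]
    lift = joint / baseline
    print(lift.sort_values(ascending=False).head(10))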
 
See http://www.EnglishLogicKernel.com/ElkForPatents.html for an example of the kinds of things that can be discovered from relational databases containing both structured and unstructured columns, as in the USPTO database of patents. 
 
-Rich
 
Sincerely,
Rich Cooper
EnglishLogicKernel.com
Rich AT EnglishLogicKernel DOT com
9 4 9 \ 5 2 5 - 5 7 1 2
From: ontolog-forum-bounces@xxxxxxxxxxxxxxxx [mailto:ontolog-forum-bounces@xxxxxxxxxxxxxxxx] On Behalf Of Kingsley Idehen
Sent: Tuesday, September 02, 2014 10:34 AM
To: ontolog-forum@xxxxxxxxxxxxxxxx
Subject: Re: [ontolog-forum] FW: Looking to the Future of Data Science - NYTimes.com - 2014.08.27
 
On 9/2/14 11:25 AM, Barkmeyer, Edward J wrote:
I regret to say, I think this definition is about buzzword maintenance.  The idea is clearly:  Big Data is about inventing a new information processing technology that will work better for datasets that RDB technology just can’t handle – “a paradigm shift” in technology. 
 
What is wanted is not a paradigm shift in processing technology – the last two paradigm shifts got us XML databases and XQuery and RDF triple stores, both of which are clumsy repositories that just make the Big Data problem more expensive. 
 
What is wanted (as Michael Brunnbauer hinted) is a paradigm shift in data acquisition mindset.  I will paraphrase some other contribution to this exploder, which I have since lost:  “If you don’t know what you have when you get it, you will never know it later.”  
 
There is a big difference between large volumes of data that must be maintained in order to perform a particular set of business or governmental functions and responsibilities, and large volumes of data that are available and might enable some analytical process that is at best desirable.  Amazingly enough, we have muddled through the support of the former for 50 years with established technologies and state of the art computational resources, and newer technologies have become established as the quality of the implementations and the resources for supporting them became able to carry the increasing load.  We have been able to do this by working around the limitations to deliver satisfactory, if less than ideal, services somehow.  As John Sowa and others have said, this is a recurring problem; it is not a new problem.
 
The problem we have is with our appetite.  There is so much information food out there that we could surely find the taste treats for the most discriminating palates if we could just search it all fast enough.  That is all very exciting, but it is irrelevant to solving the problem of delivering to everyone his daily information bread.  The problem is in focusing on what we need to process, not what we would like to process.  The people who are concerned about data they need to process in order to deliver adequate services and products are experiencing the 2014 version of the 1960 problem.  The rest are just blowing Big Data horns.
 
The would-be ISO definition fails to say: 
Big Data: a data set (or sets) with characteristics that, for *a required function* at a given point in time, cannot be efficiently processed using current/existing/established/traditional technologies and techniques in order to *provide adequate support for that function*.
 
It is not about an arbitrary “particular problem domain” or being able to “extract [some perceived] value”.   That is an academic view, and why we have research institutions.
 
-Ed

Ed,

Great addition to this evolving conversation. Naturally, I've incorporated your comments into the "Big Data" description that I am maintaining:

[1] http://linkeddata.uriburner.com/describe/?url= -- without the effect of owl:sameAs relation reasoning and inference

[2] http://linkeddata.uriburner.com/describe/?url= -- with the effect of owl:sameAs relation semantics, reasoning, and inference

[3] https://plus.google.com/112399767740508618350/posts/79nHeum5DQR -- how I am using G+ post-based nanotations to fit the pieces of this puzzle together as I encounter new and interesting insights

[4] https://plus.google.com/112399767740508618350/posts/MRsyNtqgTXz -- ditto regarding comments by John Sowa.


Related:

[1] http://kidehen.blogspot.com/2014/07/nanotation.html -- about Nanotation
[2] https://twitter.com/kidehen/status/506813897043881984 -- Tweet related to the paradigm shift re. data acquisition (i.e., RDF sentence-based Nanotations that fit into place where text exists).
--
Regards,
 
Kingsley Idehen      
Founder & CEO
OpenLink Software    
Personal Weblog 1: http://kidehen.blogspot.com
Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this
 

