ontolog-forum

Re: [ontolog-forum] Semantic Enterprise Architecture

To: ontolog-forum@xxxxxxxxxxxxxxxx
From: Kingsley Idehen <kidehen@xxxxxxxxxxxxxx>
Date: Fri, 03 Sep 2010 20:07:05 -0400
Message-id: <4C818DA9.9080103@xxxxxxxxxxxxxx>
  On 9/3/10 7:42 PM, John F. Sowa wrote:
> Kingsley,
>
> That is a statement of W3C strategy:
>
>> Linked Data is an application of Data Access by Reference pattern +
>> Structured Entity Descriptions applied to an aspect of the Semantic Web
>> project (basically Data Web or Web of Data foundation) . Fundamentally,
>> it provides a "Webby" dimension to the time-tested EAV model via
>> de-referenceable HTTP URI based Names. Thus, it's primarily about the "Web"
>> aspect of the "Semantic Web" misnomer.
> But nobody has a clue about what methods of processing LOD will prove
> to be the most useful and successful over the next 5 to 10 years.    (01)

The thing is, Linked Open Data (LOD) isn't the same thing as Linked Data.    (02)

Let's assume you mean publicly available open data published using the 
principles in TimBL's famous meme. In that case, handling this data at 
Web scale is the major challenge at hand. By this I mean the ability to 
do the following:    (03)

1. Faceted Browsing (using HTML pages, for instance) over masses of data 
(DBpedia is small relative to the scale I have in mind, yet even that 
challenges most engines).
2. Precision Find using SPARQL where patterns include "?p" (any 
predicate), thereby generating extremely wide columns in RDBMS engines 
(which typically perform self joins).
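To make item 2 concrete, here is a minimal sketch, assuming a single three-column triples table. SQLite stands in for a generic RDBMS, and the table name `triples`, its columns, and the sample data are all hypothetical; this is not how Virtuoso or any other product stores RDF. Each triple pattern becomes another join of the table with itself, and an unbound "?p" means no predicate filter narrows the first pattern.

```python
import sqlite3

# Hypothetical triple table; SQLite is only a stand-in for an RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("ex:alice", "rdf:type", "ex:Person"),
        ("ex:alice", "ex:knows", "ex:bob"),
        ("ex:alice", "ex:worksFor", "ex:acme"),
        ("ex:bob", "rdf:type", "ex:Person"),
        ("ex:acme", "rdf:type", "ex:Company"),
    ],
)

# SPARQL: SELECT ?p ?o ?type WHERE { ex:alice ?p ?o . ?o rdf:type ?type }
# Two triple patterns -> the triples table joined with itself.
# Because ?p is unbound, the first pattern has no predicate filter.
rows = conn.execute(
    """
    SELECT t1.p, t1.o, t2.o
    FROM triples AS t1
    JOIN triples AS t2 ON t2.s = t1.o
    WHERE t1.s = 'ex:alice' AND t2.p = 'rdf:type'
    ORDER BY t1.p
    """
).fetchall()

for row in rows:
    print(row)
```

Every additional pattern in the SPARQL query adds one more alias of the same table, which is where the cost comes from at scale.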
>> MySQL doesn't cut it at all. Neither does Oracle or any other
>> traditional RDBMS. You need a hybrid DBMS, e.g., OpenLink Virtuoso.
> First of all, Oracle *is* a very efficient hybrid system.  They
> accept both SQL and SPARQL as query languages, and they store
> the data in tables or networks, as appropriate.    (04)

Yes, and they can't deal with the two fundamental problems I outline 
above. This is something we addressed from the get-go with Virtuoso, and 
it isn't even part of what you will see in the imminent paper I 
mentioned in my earlier post.    (05)

To deal with #1 and #2 we had to do the following:    (06)

1. Implement a breakup mechanism for wide columns.
2. Implement partial results for aggregate queries, working within 
the context of horizontal data partitioning (MySQL and Oracle can do the 
horizontal partitioning, but as far as I know they don't have partial 
results as part of the implementation).    (07)
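One possible reading of item 2 can be sketched as follows, assuming "partial results" means that each horizontal partition contributes a partial aggregate and the coordinator returns whatever it has merged once a budget is spent. The function names, the row-count budget, and the hash-by-subject partitioning are all hypothetical choices for illustration, not a description of Virtuoso's implementation.

```python
from collections import Counter
from typing import Iterable

Triple = tuple[str, str, str]


def partition_by_subject(triples: Iterable[Triple], n: int) -> list[list[Triple]]:
    """Horizontally partition triples by a stable hash of the subject."""
    parts: list[list[Triple]] = [[] for _ in range(n)]
    for t in triples:
        parts[sum(t[0].encode()) % n].append(t)
    return parts


def partial_count(partition: list[Triple]) -> Counter:
    """Per-partition partial aggregate: number of triples per predicate."""
    return Counter(p for _, p, _ in partition)


def count_by_predicate(parts: list[list[Triple]], row_budget: int) -> tuple[Counter, bool]:
    """Merge partial aggregates until the row budget is spent.

    Returns (counts, complete). complete is False when at least one
    partition was skipped, i.e. the counts are a partial result.
    """
    total: Counter = Counter()
    scanned = 0
    for part in parts:
        if scanned + len(part) > row_budget:
            return total, False
        total += partial_count(part)
        scanned += len(part)
    return total, True


triples = [
    ("ex:a", "rdf:type", "ex:Person"),
    ("ex:a", "ex:knows", "ex:b"),
    ("ex:b", "rdf:type", "ex:Person"),
    ("ex:c", "rdf:type", "ex:Company"),
]
parts = partition_by_subject(triples, 2)

full, complete = count_by_predicate(parts, row_budget=100)
some, some_complete = count_by_predicate(parts, row_budget=3)
print(full, complete)
print(some, some_complete)
```

A real engine would more likely budget on elapsed time than on rows scanned; a row budget is used here only to keep the sketch deterministic.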

> Furthermore, I want to emphasize that SQL is an *upward* compatible
> extension of SPARQL:  every SPARQL query can be converted to an
> equivalent SQL query, but most SQL queries cannot be converted
> to SPARQL queries.

Yes, we have SPASQL, i.e., SPARQL within SQL, which can even be applied 
to Procedure Views of Table-Valued Functions (SQL Server parlance) such 
that you can use SPARQL patterns in SQL joins.    (08)
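The general idea of using a SPARQL pattern inside a SQL join can be sketched without any product-specific syntax. The snippet below is an illustration only, again assuming a hypothetical `triples` table in SQLite: the helper `pattern_to_sql` (my own name, not a Virtuoso API) compiles one triple pattern into a subquery, which is then joined to an ordinary relational table called `orders`.

```python
import sqlite3


def pattern_to_sql(s: str, p: str, o: str) -> tuple[str, list[str]]:
    """Compile one triple pattern into a SELECT over triples(s, p, o).

    Terms starting with '?' are variables and become output columns;
    all other terms become bound equality filters.
    Sketch only: variable names are trusted and used as column aliases.
    """
    cols, filters, params = [], [], []
    for column, term in (("s", s), ("p", p), ("o", o)):
        if term.startswith("?"):
            cols.append(f"{column} AS {term[1:]}")
        else:
            filters.append(f"{column} = ?")
            params.append(term)
    sql = f"SELECT {', '.join(cols) or '1'} FROM triples"
    if filters:
        sql += " WHERE " + " AND ".join(filters)
    return sql, params


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE triples (s TEXT, p TEXT, o TEXT);
    CREATE TABLE orders (customer_uri TEXT, total REAL);
    INSERT INTO triples VALUES ('ex:alice', 'foaf:name', 'Alice');
    INSERT INTO triples VALUES ('ex:bob', 'foaf:name', 'Bob');
    INSERT INTO orders VALUES ('ex:alice', 40.0);
    INSERT INTO orders VALUES ('ex:alice', 2.5);
    INSERT INTO orders VALUES ('ex:bob', 10.0);
    """
)

# Triple pattern { ?customer foaf:name ?name } used as a derived table.
sub, params = pattern_to_sql("?customer", "foaf:name", "?name")
query = f"""
SELECT g.name, SUM(orders.total)
FROM orders
JOIN ({sub}) AS g ON g.customer = orders.customer_uri
GROUP BY g.name
ORDER BY g.name
"""
result = conn.execute(query, params).fetchall()
print(result)
```

The graph pattern supplies the names and the relational table supplies the totals, so the aggregate spans both models in a single statement.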

We also implemented SPARQL-BI (extensions for bringing SPARQL on par 
with SQL) so that you can actually run the TPC-H benchmarks atop 
Virtuoso with the data in RDF. It's this last item that is the central 
thrust of the Column Store innovations covered in the imminent white paper.    (09)


> Oracle (and many other hybrid systems) support
> *both* kinds of queries against either or both of the data structures.
> And they use highly optimized methods for mapping each type of query
> to each way of organizing the data.    (010)

Yes, we know that; we compete against these folks at the DBMS engine 
level. Of course, we also complement them at the virtual/federated 
database level. These optimizations are best tested when you attempt to 
use SPARQL against large RDF data sets stored in these databases. As for 
SPARQL-BI, they offer nothing comparable (i.e., you can't venture into 
TPC-H land against RDF stored in these engines).    (011)

> I am not going to get into debates about which vendor has a
> better or more efficient solution than the others.
>    (012)

I am only getting vendor-specific because we are a hybrid play in this 
space. We also know that it's inaccurate to assume all RDBMS-based 
hybrids are capable of hosting data spaces like DBpedia (under a billion 
triples), the LOD Cloud (17 billion+ triples), or the Protein Database 
(13 billion triples) exposed via public SPARQL endpoints that any human 
or agent can pound on (which always lays the foundation for deliberate 
or inadvertent denial of service via Cartesian products and the like).    (013)

> Arguments about whether one kind of data structure is better
> than another are implementational details that may be critical
> to efficiency, but they should never be confused with semantics.    (014)

I agree.    (015)

> As for MySQL, I used that as an example of a tool that has a lot
> of potential for many LOD applications.    (016)

That's a typical LAMP crowd gut reaction, or should I say "wishful 
thinking". MySQL doesn't cut it, really.    (017)

>    It should not be ignored
> because it may be counterstrategic.  (When I was at IBM, by the
> way, I submitted two definitions to the Dictionary of IBM Jargon):
>
>    1. counterstrategic -- what is embarrassingly superior to what
>       is strategic.
>
>    2. strategic -- supported by managers who have reached their
>       level of incompetence.
>
> Unfortunately, one of the strategic managers requested that
> they be deleted.
>
> Fundamental principle:  Always question strategy.  (But if you
> want to get a promotion, you might also need some diplomacy.)    (018)

Yes, of course, which is why we have the ability to put Virtuoso in 
front of MySQL, PostgreSQL, Oracle, and other typical RDBMS or 
Object-Relational engines via the virtual database option. Using this 
option, we can use triggers to connect transient Views with materialized 
RDF views in the RDF store aspect of Virtuoso. Thus, the user (human 
or agent) gets high change sensitivity plus all of the virtues that 
Virtuoso brings to the table.    (019)
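The trigger idea can be sketched roughly as follows. This is an illustration under loud assumptions: a single SQLite database stands in for both the external RDBMS and the RDF store (in a real deployment these would be separate systems linked through the virtual database layer), and the `employees` table, the `triples` table, and the `emp:` / `ex:` names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept TEXT);
    CREATE TABLE triples (s TEXT, p TEXT, o TEXT);

    -- On insert, materialize the new row as triples.
    CREATE TRIGGER employees_ai AFTER INSERT ON employees
    BEGIN
        INSERT INTO triples VALUES ('emp:' || NEW.id, 'ex:name', NEW.name);
        INSERT INTO triples VALUES ('emp:' || NEW.id, 'ex:dept', NEW.dept);
    END;

    -- On update, rewrite the affected triples.
    CREATE TRIGGER employees_au AFTER UPDATE ON employees
    BEGIN
        DELETE FROM triples WHERE s = 'emp:' || OLD.id;
        INSERT INTO triples VALUES ('emp:' || NEW.id, 'ex:name', NEW.name);
        INSERT INTO triples VALUES ('emp:' || NEW.id, 'ex:dept', NEW.dept);
    END;

    -- On delete, drop the row's triples.
    CREATE TRIGGER employees_ad AFTER DELETE ON employees
    BEGIN
        DELETE FROM triples WHERE s = 'emp:' || OLD.id;
    END;
    """
)

# Relational changes propagate to the triple view without a batch export.
conn.execute("INSERT INTO employees VALUES (1, 'Ada', 'R&D')")
conn.execute("UPDATE employees SET dept = 'Ops' WHERE id = 1")
snapshot = conn.execute("SELECT s, p, o FROM triples ORDER BY p").fetchall()
print(snapshot)

conn.execute("DELETE FROM employees WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM triples").fetchone()[0]
print(remaining)
```

The point of the design is that the triple view tracks the relational source row by row, rather than lagging behind until the next bulk extraction.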

Links:    (020)

1. http://www.openlinksw.com/weblog/oerling/ -- Virtuoso Program 
Manager's Blog (technically focused; many of the things referenced 
above are covered in various posts).    (021)



Kingsley
> John
>
> -------- Original Message --------
> Subject: Re: [ontolog-forum] Semantic Enterprise Architecture
> Date: Thu, 02 Sep 2010 08:55:25 -0400
> From: John F. Sowa<sowa@xxxxxxxxxxx>
> To: ontolog-forum@xxxxxxxxxxxxxxxx
>
> Rick,
>
>   >  Note that step three prescribes the use of standards, then
>   >  identifies two standards (RDF*, SPARQL).
>
> I agree that those are de facto standards for many applications,
> but you have to ask a question about how Linked Open Data will be
> used in the future.  Will RDF and SPARQL survive in their present
> forms?  What is the growth path for the future?  What technology
> has proved to be successful in the past?  What made it successful?
>
> Since the 1960s, it has been obvious that you cannot process large
> volumes of data without indexes that provide logarithmic-time access
> to the data.  SPARQL is designed for polynomial-time searches that
> do not scale to even small-scale enterprises.  Its "sweet spot"
> is for short searches that can be performed in a browser.
>
> Note what Franz, Oracle, and other vendors do:  they develop
> tools such as AllegroGraph or extensions to Oracle that suck the
> triples out of the web pages and index them.  Then they translate
> SPARQL to an optimized internal form for high-speed processing
> over indexed data.
>
> Another kind of data that is not yet in LOD form are tables from RDBs.
> Some people have proposed that those highly optimized search and
> retrieval engines be hamstrung by extracting the data from tables
> and mapping them to triples so that they can be queried by SPARQL.
> Anybody who proposed that is a prime candidate for the loony bin.
>
> Oracle has a better idea.  They have high-speed RDBs and high-speed
> triple stores.  They let anyone access data from either source by
> any query form.  They can execute SQL queries against triple stores
> or SPARQL queries against RDBs.  In either case, they optimize the
> queries for the data structures.  The users never need to know
> which kind of data is being accessed or how the data is organized.
>
> That is an idea that Franz, Oracle, and Google understand very well.
> Google became the biggest web company on earth because they developed
> better indexing, searching, and retrieval methods -- and they do *not*
> use RDF or OWL for their processing.  (They accept it when they find
> it in web pages, but that's also true of every notation under the sun.)
>
> Unfortunately, the Google indexes and metadata are not open.
> For LOD, the O means that some open method is necessary to
> integrate high-speed servers with open, standard message formats
> that can be accessed by both browsers and servers.
>
> As an example of where LOD is heading, I'd like to mention a couple
> of open-source projects, which are harbingers of things to come.
> Following is an excerpt from a note I sent to another forum.
> Some people complained that those systems are *not* typical LOD.
> I agree that they are not what LOD advocates are talking about
> today, but they are early versions of the many kinds of systems
> that will be designed to process LOD.
>
> The fact that those systems do *not* use SemWeb software is a
> wake-up call that the current SemWeb software and the current
> strategy for LOD are out of touch with future requirements.
>
> John
> ____________________________________________________________________
>
> The first is The Wikipedia Miner, which illustrates some interesting
> and powerful methods for processing Linked Open Data, but it doesn't
> use any of the tools designed for the Semantic Web:
>
>      http://wikipedia-miner.sourceforge.net/
>
>   >  Wikipedia Miner is a toolkit for navigating and making use of the
>   >  structure and content of Wikipedia. It aims to make it easy for you
>   >  to integrate Wikipedia's knowledge into your own applications, by:
>   >
>   >  * providing simplified, object-oriented access to Wikipedia's
>   >    structure and content.
>   >  * measuring how terms and concepts in Wikipedia are connected
>   >    to each other.
>   >  * detecting and disambiguating Wikipedia topics when they are
>   >    mentioned in documents.
>
> The Wikipedia was designed to be navigated by a reader who uses
> an ordinary browser to follow a few links to related pages.
> For that kind of application, links buried in the text are useful.
> But for more complex applications, very few readers would want to
> wait for the browser on their laptop to wade through SPARQL queries
> that navigate through many megabytes or gigabytes of markup.
>
> For more detail about how Wikipedia Miner works, see
>
> http://www.cs.waikato.ac.nz/~dnk2/publications/AnOpenSourceToolkitForMiningWikipedia.pdf
>
> The author notes that the semantic features and URLs in Wikipedia
> "are buried under 20 GB of cryptic markup."  Just "the link-graph
> summary alone is still almost 1 GB. Instead the toolkit communicates
> with a MySQL database, so that the data can be indexed persistently
> and accessed immediately, without waiting for anything to load."
>
> Wikipedia Miner uses Perl (a language developed before the WWW)
> to gather the URLs and markup and put them in the more efficiently
> indexed tables of a relational DB.  But there are other systems
> that do even more sophisticated language processing. For example,
>
>      http://www.ukp.tu-darmstadt.de/software/jwpl
>      JWPL (Java Wikipedia Library)
>
> This is "a free, Java-based application programming interface that
> allows access to all information contained in Wikipedia."  Following
> is an article that provides more detail:
>
> http://elara.tk.informatik.tu-darmstadt.de/Publications/2008/lrec08_camera_ready.pdf
>
> Note that JWPL also uses MySQL to index and store all the semantic
> information and URLs extracted from Wikipedia.
>
> In summary, these systems suggest future directions for LOD:
>
>    1. Tools like RDF and SPARQL, which can be processed by a browser,
>       are useful for lightweight, local navigation in a few web pages.
>
>    2. But for complex processing, it is essential to extract the
>       information from the source documents and organize it in
>       a high-speed indexed database (which could be an RDB or
>       something like AllegroGraph).
>
>    3. Future developments in LOD will follow the examples of JWPL
>       and Wikipedia Miner:  extract the markup and links from the
>       web pages and index them in databases.
>
>    4. There is no reason why semantic annotations must be stored
>       with the documents.  In fact, storing annotations in separate
>       files or databases is more flexible, because it can support
>       multiple ways of viewing and interpreting the same document
>       for different purposes.
>
> As an example, note the many ways of interpreting the Bible
> or the US Constitution.  For important documents, the semantic
> annotations can grow much faster and become orders of magnitude
> more voluminous than the source texts.  For many documents, the
> different interpretations may be proprietary or conflicting.
> Just imagine the different interpretations of the Bible by
> different religions or the Constitution by different judges
> and political parties.
>
> Summary:  For complex queries and high-performance processing,
> indexed databases are essential.  At present, conventional SQL
> databases run circles around SPARQL.  For the future, many novel
> kinds of databases could be designed.  But indexing is required,
> and annotations stored inside the documents should be optional.
>
> _________________________________________________________________
> Message Archives: http://ontolog.cim3.net/forum/ontolog-forum/
> Config Subscr: http://ontolog.cim3.net/mailman/listinfo/ontolog-forum/
> Unsubscribe: mailto:ontolog-forum-leave@xxxxxxxxxxxxxxxx
> Shared Files: http://ontolog.cim3.net/file/
> Community Wiki: http://ontolog.cim3.net/wiki/
> To join: http://ontolog.cim3.net/cgi-bin/wiki.pl?WikiHomePage#nid1J
> To Post: mailto:ontolog-forum@xxxxxxxxxxxxxxxx
>
>    (022)


--     (023)

Regards,    (024)

Kingsley Idehen 
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen    (025)






