ontolog-forum

Re: [ontolog-forum] Semantic Enterprise Architecture

To: ontolog-forum@xxxxxxxxxxxxxxxx
From: "John F. Sowa" <sowa@xxxxxxxxxxx>
Date: Fri, 03 Sep 2010 19:42:38 -0400
Message-id: <4C8187EE.8010309@xxxxxxxxxxx>
Kingsley,    (01)

That is a statement of W3C strategy:    (02)

> Linked Data is an application of Data Access by Reference pattern +
> Structured Entity Descriptions applied to an aspect of the Semantic Web
> project (basically Data Web or Web of Data foundation) . Fundamentally,
> it provides a "Webby" dimension to the time-tested EAV model via
> de-reference HTTP URI based Names. Thus, its primarily about the "Web"
> aspect of the "Semantic Web" misnomer.    (03)

But nobody has a clue about what methods of processing LOD will prove
to be the most useful and successful over the next 5 to 10 years.    (04)

> MySQL doesn't cut it at all. Neither does Oracle or any other
> traditional RDBMS. You need a hybrid DBMS e.g OpenLink Virtuoso.    (05)

First of all, Oracle *is* a very efficient hybrid system.  They
accept both SQL and SPARQL as query languages, and they store
the data in tables or networks, as appropriate.    (06)

Furthermore, I want to emphasize that SQL is an *upward* compatible
extension of SPARQL:  every SPARQL query can be converted to an
equivalent SQL query, but most SQL queries cannot be converted
to SPARQL queries.  Oracle (and many other hybrid systems) support
*both* kinds of queries against either or both of the data structures.
And they use highly optimized methods for mapping each type of query
to each way of organizing the data.    (07)
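As a rough illustration of the first half of that claim, the sketch below hand-translates one SPARQL basic graph pattern into an SQL self-join over a three-column table. The table name `triples`, the sample data, and the use of SQLite through Python's `sqlite3` module are all assumptions made for the example; it says nothing about how Oracle or any other vendor performs the mapping.

```python
import sqlite3

# Hypothetical toy data: one row per RDF-style triple.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("alice", "worksFor", "acme"),
        ("bob", "worksFor", "acme"),
        ("carol", "worksFor", "initech"),
        ("acme", "locatedIn", "boston"),
        ("initech", "locatedIn", "austin"),
    ],
)

# SPARQL pattern being translated (shown only as a comment):
#   SELECT ?person WHERE {
#     ?person  worksFor   ?org .
#     ?org     locatedIn  boston .
#   }
# Each triple pattern becomes one alias of the table; a shared
# variable (?org) becomes an equality join between the aliases.
SQL = """
SELECT t1.s AS person
FROM triples AS t1
JOIN triples AS t2 ON t2.s = t1.o
WHERE t1.p = 'worksFor'
  AND t2.p = 'locatedIn'
  AND t2.o = 'boston'
ORDER BY person
"""

people = [row[0] for row in conn.execute(SQL)]
print(people)
```

The reverse direction is where the asymmetry shows: an SQL statement that uses, say, window functions or updates across several tables has no such mechanical counterpart in this kind of pattern matching.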

I am not going to get into debates about which vendor has a
better or more efficient solution than the others.    (08)

Arguments about whether one kind of data structure is better
than another are implementational details that may be critical
to efficiency, but they should never be confused with semantics.    (09)

As for MySQL, I used that as an example of a tool that has a lot
of potential for many LOD applications.  It should not be ignored
because it may be counterstrategic.  When I was at IBM, by the
way, I submitted two definitions to the Dictionary of IBM Jargon:    (010)

  1. counterstrategic -- what is embarrassingly superior to what
     is strategic.    (011)

  2. strategic -- supported by managers who have reached their
     level of incompetence.    (012)

Unfortunately, one of the strategic managers requested that
they be deleted.    (013)

Fundamental principle:  Always question strategy.  (But if you
want to get a promotion, you might also need some diplomacy.)    (014)

John    (015)

-------- Original Message --------
Subject: Re: [ontolog-forum] Semantic Enterprise Architecture
Date: Thu, 02 Sep 2010 08:55:25 -0400
From: John F. Sowa <sowa@xxxxxxxxxxx>
To: ontolog-forum@xxxxxxxxxxxxxxxx    (016)

Rick,    (017)

 > Note that step three prescribes the use of standards, then
 > identifies two standards (RDF*, SPARQL).    (018)

I agree that those are de facto standards for many applications,
but you have to ask a question about how Linked Open Data will be
used in the future.  Will RDF and SPARQL survive in their present
forms?  What is the growth path for the future?  What technology
has proved to be successful in the past?  What made it successful?    (019)

Since the 1960s, it has been obvious that you cannot process large
volumes of data without indexes that provide logarithmic-time access
to the data.  SPARQL is designed for polynomial-time searches that
do not scale to even small-scale enterprises.  Its "sweet spot"
is for short searches that can be performed in a browser.    (020)
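To make the point about logarithmic-time access concrete, here is a minimal sketch that counts how many keys are examined by an unindexed scan versus a binary search over a sorted index. The function names and the one-million-key data set are hypothetical, and the sketch makes no claim about how any particular SPARQL engine evaluates a query.

```python
def linear_probes(keys, target):
    """Count the keys examined by an unindexed scan."""
    for probes, key in enumerate(keys, start=1):
        if key == target:
            return probes
    return len(keys)


def binary_probes(sorted_keys, target):
    """Count the keys examined by binary search over a sorted index."""
    lo, hi, probes = 0, len(sorted_keys), 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_keys[mid] == target:
            return probes
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return probes


keys = list(range(1_000_000))   # hypothetical record keys, already sorted
target = 999_999                # worst case for the scan

scan_cost = linear_probes(keys, target)
index_cost = binary_probes(keys, target)
print(scan_cost, index_cost)
```

The scan examines every one of the million keys, while the sorted index needs no more than about twenty probes; that gap widens with every order of magnitude of data.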

Note what Franz, Oracle, and other vendors do:  they develop
tools such as AllegroGraph or extensions to Oracle that suck the
triples out of the web pages and index them.  Then they translate
SPARQL to an optimized internal form for high-speed processing
over indexed data.    (021)
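The general idea of indexing extracted triples can be sketched in a few lines. The class below is a toy with two hash indexes, written only to show why a pattern lookup need not scan every triple; the class name, the index layout, and the sample data are assumptions for illustration, not a description of AllegroGraph or of Oracle's extensions.

```python
from collections import defaultdict


class TripleIndex:
    """Toy triple store with two hash indexes."""

    def __init__(self):
        self.all = set()
        self.by_subject = defaultdict(set)    # s -> {(s, p, o)}
        self.by_pred_obj = defaultdict(set)   # (p, o) -> {(s, p, o)}

    def add(self, s, p, o):
        triple = (s, p, o)
        self.all.add(triple)
        self.by_subject[s].add(triple)
        self.by_pred_obj[(p, o)].add(triple)

    def match(self, s=None, p=None, o=None):
        """Return triples matching the pattern; None is a wildcard."""
        if s is not None:
            candidates = self.by_subject.get(s, set())
        elif p is not None and o is not None:
            candidates = self.by_pred_obj.get((p, o), set())
        else:
            candidates = self.all   # no usable index: fall back to a scan
        return sorted(
            t for t in candidates
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )


store = TripleIndex()
store.add("alice", "type", "Person")
store.add("bob", "type", "Person")
store.add("acme", "type", "Company")
store.add("alice", "worksFor", "acme")

persons = store.match(p="type", o="Person")
about_alice = store.match(s="alice")
print(persons)
print(about_alice)
```

A production system would keep more index permutations and store them on disk, but the division of labor is the same: translate the query pattern, pick an index, and filter only the candidates that index returns.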

Tables from RDBs are another kind of data that is not yet in LOD form.
Some people have proposed that those highly optimized search and
retrieval engines be hamstrung by extracting the data from tables
and mapping it to triples so that it can be queried by SPARQL.
Anybody who proposes that is a prime candidate for the loony bin.    (022)

Oracle has a better idea.  They have high-speed RDBs and high-speed
triple stores.  They let anyone access data from either source by
any query form.  They can execute SQL queries against triple stores
or SPARQL queries against RDBs.  In either case, they optimize the
queries for the data structures.  The users never need to know
which kind of data is being accessed or how the data is organized.    (023)

That is an idea that Franz, Oracle, and Google understand very well.
Google became the biggest web company on earth because they developed
better indexing, searching, and retrieval methods -- and they do *not*
use RDF or OWL for their processing.  (They accept it when they find
it in web pages, but that's also true of every notation under the sun.)    (024)

Unfortunately, the Google indexes and metadata are not open.
For LOD, the O means that some open method is necessary to
integrate high-speed servers with open, standard message formats
that can be accessed by both browsers and servers.    (025)

As an example of where LOD is heading, I'd like to mention a couple
of open-source projects, which are harbingers of things to come.
Following is an excerpt from a note I sent to another forum.
Some people complained that those systems are *not* typical LOD.
I agree that they are not what LOD advocates are talking about
today, but they are early versions of the many kinds of systems
that will be designed to process LOD.    (026)

The fact that those systems do *not* use SemWeb software is a
wake-up call that the current SemWeb software and the current
strategy for LOD are out of touch with future requirements.    (027)

John
____________________________________________________________________    (028)

The first is The Wikipedia Miner, which illustrates some interesting
and powerful methods for processing Linked Open Data, but it doesn't
use any of the tools designed for the Semantic Web:    (029)

    http://wikipedia-miner.sourceforge.net/    (030)

 > Wikipedia Miner is a toolkit for navigating and making use of the
 > structure and content of Wikipedia. It aims to make it easy for you
 > to integrate Wikipedia's knowledge into your own applications, by:
 >
 > * providing simplified, object-oriented access to Wikipedia's
 >   structure and content.
 > * measuring how terms and concepts in Wikipedia are connected
 >   to each other.
 > * detecting and disambiguating Wikipedia topics when they are
 >   mentioned in documents.    (031)

The Wikipedia was designed to be navigated by a reader who uses
an ordinary browser to follow a few links to related pages.
For that kind of application, links buried in the text are useful.
But for more complex applications, very few readers would want to
wait for the browser on their laptop to wade through SPARQL queries
that navigate through many megabytes or gigabytes of markup.    (032)

For more detail about how Wikipedia Miner works, see    (033)

http://www.cs.waikato.ac.nz/~dnk2/publications/AnOpenSourceToolkitForMiningWikipedia.pdf    (034)

The author notes that the semantic features and URLs in Wikipedia
"are buried under 20 GB of cryptic markup."  Just "the link-graph
summary alone is still almost 1 GB. Instead the toolkit communicates
with a MySQL database, so that the data can be indexed persistently
and accessed immediately, without waiting for anything to load."    (035)

Wikipedia Miner uses Perl (a language developed before the WWW)
to gather the URLs and markup and put them in the more efficiently
indexed tables of a relational DB.  But there are other systems
that do even more sophisticated language processing. For example,    (036)

    http://www.ukp.tu-darmstadt.de/software/jwpl
    JWPL (Java Wikipedia Library)    (037)

This is "a free, Java-based application programming interface that
allows access to all information contained in Wikipedia."  Following
is an article that provides more detail:    (038)

http://elara.tk.informatik.tu-darmstadt.de/Publications/2008/lrec08_camera_ready.pdf    (039)

Note that JWPL also uses MySQL to index and store all the semantic
information and URLs extracted from Wikipedia.    (040)
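The extract-and-index approach that both toolkits take can be sketched as follows. This is not code from Wikipedia Miner or JWPL: the regular expression, the `links` table, and the two sample pages are hypothetical, and SQLite (via Python's `sqlite3` module) stands in for MySQL so that the example is self-contained.

```python
import re
import sqlite3

# Matches [[Target]], [[Target|label]] and [[Target#Section]];
# only the Target part is captured.
WIKILINK = re.compile(r"\[\[([^\]\|#]+)(?:#[^\]\|]*)?(?:\|[^\]]*)?\]\]")

pages = {   # hypothetical page titles and markup
    "Ontology": "An [[Ontology (information science)|ontology]] uses [[Logic]].",
    "SPARQL": "[[SPARQL]] queries [[Resource Description Framework|RDF]] "
              "and [[Logic#History]].",
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (source TEXT, target TEXT)")
# The index is what makes "which pages link here?" a lookup, not a scan.
conn.execute("CREATE INDEX idx_links_target ON links (target)")

for title, markup in pages.items():
    for match in WIKILINK.finditer(markup):
        conn.execute(
            "INSERT INTO links VALUES (?, ?)", (title, match.group(1).strip())
        )

inbound = [
    row[0]
    for row in conn.execute(
        "SELECT source FROM links WHERE target = ? ORDER BY source", ("Logic",)
    )
]
print(inbound)
```

Once the links are in a table, the markup never has to be parsed again; every later question about the link graph is an indexed query.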

In summary, these systems suggest future directions for LOD:    (041)

  1. Tools like RDF and SPARQL, which can be processed by a browser,
     are useful for lightweight, local navigation in a few web pages.    (042)

  2. But for complex processing, it is essential to extract the
     information from the source documents and organize it in
     a high-speed indexed database (which could be an RDB or
     something like AllegroGraph).    (043)

  3. Future developments in LOD will follow the examples of JWPL
     and Wikipedia Miner:  extract the markup and links from the
     web pages and index them in databases.    (044)

  4. There is no reason why semantic annotations must be stored
     with the documents.  In fact, storing annotations in separate
     files or databases is more flexible, because it can support
     multiple ways of viewing and interpreting the same document
     for different purposes.    (045)

As an example, note the many ways of interpreting the Bible
or the US Constitution.  For important documents, the semantic
annotations can grow much faster and become orders of magnitude
more voluminous than the source texts.  For many documents, the
different interpretations may be proprietary or conflicting.
Just imagine the different interpretations of the Bible by
different religions or the Constitution by different judges
and political parties.    (046)
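Point 4 can be sketched with annotations that refer to the text by character offsets and live in a separate structure. The `Annotation` record, the layer names, and the labels below are all hypothetical, and the sample sentence is an abridged paraphrase rather than an exact quotation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    layer: str    # who or what produced the interpretation
    start: int    # character offset where the span begins
    end: int      # character offset just past the span
    label: str


# The source text is stored once and never modified.
document = "Congress shall make no law abridging the freedom of speech."

phrase = "freedom of speech"
span_start = document.index(phrase)
span_end = span_start + len(phrase)

# Annotations are kept apart from the text, so conflicting
# interpretations of the same span can coexist.
annotations = [
    Annotation("reader_a", span_start, span_end, "broad protection"),
    Annotation("reader_b", span_start, span_end, "limited protection"),
    Annotation("reader_a", 0, len("Congress"), "federal legislature"),
]


def view(text, notes, layer):
    """Return (span text, label) pairs for one annotation layer."""
    return [(text[n.start:n.end], n.label) for n in notes if n.layer == layer]


view_a = view(document, annotations, "reader_a")
view_b = view(document, annotations, "reader_b")
print(view_a)
print(view_b)
```

Adding a third interpretation means adding rows, not editing the document, and a proprietary layer can simply be kept in a different file or database from the public ones.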

Summary:  For complex queries and high-performance processing,
indexed databases are essential.  At present, conventional SQL
databases run circles around SPARQL.  For the future, many novel
kinds of databases could be designed.  But indexing is required,
and annotations stored inside the documents should be optional.    (047)

_________________________________________________________________
Message Archives: http://ontolog.cim3.net/forum/ontolog-forum/  
Config Subscr: http://ontolog.cim3.net/mailman/listinfo/ontolog-forum/  
Unsubscribe: mailto:ontolog-forum-leave@xxxxxxxxxxxxxxxx
Shared Files: http://ontolog.cim3.net/file/
Community Wiki: http://ontolog.cim3.net/wiki/ 
To join: http://ontolog.cim3.net/cgi-bin/wiki.pl?WikiHomePage#nid1J
To Post: mailto:ontolog-forum@xxxxxxxxxxxxxxxx    (048)
