
Re: [ontolog-forum] Semantic Enterprise Architecture

To: ontolog-forum@xxxxxxxxxxxxxxxx
From: "John F. Sowa" <sowa@xxxxxxxxxxx>
Date: Thu, 02 Sep 2010 08:55:25 -0400
Message-id: <4C7F9EBD.4000403@xxxxxxxxxxx>
Rick,

> Note that step three prescribes the use of standards, then
> identifies two standards (RDF*, SPARQL).

I agree that those are de facto standards for many applications,
but you have to ask how Linked Open Data will be used in the
future.  Will RDF and SPARQL survive in their present forms?
What is the growth path for the future?  What technology has
proved successful in the past?  What made it successful?

Since the 1960s, it has been obvious that you cannot process large
volumes of data without indexes that provide logarithmic-time access
to the data.  SPARQL is designed for polynomial-time searches that
do not scale even to the data volumes of a small enterprise.  Its
"sweet spot" is short searches that can be performed in a browser.
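
To make the arithmetic concrete, here is a toy sketch in Python
(my own illustration, not any vendor's code).  With a sorted index,
finding one key among a million takes about 20 probes; an unindexed
scan may touch all million entries for the same answer:

    import bisect

    # One million sorted keys standing in for an indexed data set.
    keys = sorted(f"subject{i:07d}" for i in range(1_000_000))

    def indexed_lookup(key):
        """Binary search over the sorted index: O(log N), ~20 probes."""
        i = bisect.bisect_left(keys, key)
        return i < len(keys) and keys[i] == key

    def sequential_scan(key):
        """Unindexed search: O(N), up to a million comparisons."""
        return any(k == key for k in keys)

    assert indexed_lookup("subject0999999")
    assert sequential_scan("subject0999999")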

Note what Franz, Oracle, and other vendors do:  they develop
tools such as AllegroGraph or extensions to Oracle that suck the
triples out of the web pages and index them.  Then they translate
SPARQL to an optimized internal form for high-speed processing
over indexed data.
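
The principle is simple enough to sketch.  The following Python toy
is hypothetical and vendor-neutral -- it shows only the idea of
indexing triples on every access path, not how AllegroGraph or the
Oracle extensions actually store them:

    from collections import defaultdict

    class TripleStore:
        """Toy store with one index per access path."""
        def __init__(self):
            self.spo = defaultdict(lambda: defaultdict(set))  # s -> p -> {o}
            self.pos = defaultdict(lambda: defaultdict(set))  # p -> o -> {s}
            self.osp = defaultdict(lambda: defaultdict(set))  # o -> s -> {p}

        def add(self, s, p, o):
            self.spo[s][p].add(o)
            self.pos[p][o].add(s)
            self.osp[o][s].add(p)

        def objects(self, s, p):
            """Answer (s, p, ?o) by two hash lookups -- no scan."""
            return self.spo[s][p]

        def subjects(self, p, o):
            """Answer (?s, p, o) the same way."""
            return self.pos[p][o]

    store = TripleStore()
    store.add("wiki:Berlin", "rdf:type", "dbo:City")
    print(store.subjects("rdf:type", "dbo:City"))   # {'wiki:Berlin'}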

Another kind of data that is not yet in LOD form is the tables of
relational databases.  Some people have proposed that those highly
optimized search and retrieval engines be hamstrung by extracting
the data from the tables and mapping it to triples so that it can
be queried by SPARQL.  Anyone who proposes that is a prime candidate
for the loony bin.

Oracle has a better idea.  They have high-speed RDBs and high-speed
triple stores.  They let anyone access data from either source by
any query form.  They can execute SQL queries against triple stores
or SPARQL queries against RDBs.  In either case, they optimize the
queries for the data structures.  The users never need to know
which kind of data is being accessed or how the data is organized.
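
Oracle's machinery is proprietary, but the translation idea can be
sketched.  The Python below is schematic only: it assumes a
hypothetical triples(s, p, o) table and rewrites one SPARQL basic
graph pattern into a SQL self-join, so that the RDB's ordinary
indexes do the work:

    def bgp_to_sql(patterns):
        """patterns: (s, p, o) triples; strings starting with '?' are variables."""
        select, where, seen = [], [], {}
        for i, triple in enumerate(patterns):
            alias = f"t{i}"
            for col, term in zip(("s", "p", "o"), triple):
                ref = f"{alias}.{col}"
                if term.startswith("?"):
                    if term in seen:              # shared variable -> join
                        where.append(f"{ref} = {seen[term]}")
                    else:
                        seen[term] = ref
                        select.append(f"{ref} AS {term[1:]}")
                else:                             # constant -> filter
                    where.append(f"{ref} = '{term}'")
        froms = ", ".join(f"triples t{i}" for i in range(len(patterns)))
        return (f"SELECT {', '.join(select)} FROM {froms} "
                f"WHERE {' AND '.join(where)}")

    print(bgp_to_sql([("?x", "rdf:type", "dbo:City"),
                      ("?x", "dbo:country", "?c")]))
    # SELECT t0.s AS x, t1.o AS c FROM triples t0, triples t1
    # WHERE t0.p = 'rdf:type' AND t0.o = 'dbo:City'
    #   AND t1.s = t0.s AND t1.p = 'dbo:country'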

That is an idea that Franz, Oracle, and Google understand very well.
Google became the biggest web company on earth because they developed
better indexing, searching, and retrieval methods -- and they do *not*
use RDF or OWL for their processing.  (They accept it when they find
it in web pages, but that's also true of every notation under the sun.)

Unfortunately, the Google indexes and metadata are not open.
For LOD, the O means that some open method is necessary to
integrate high-speed servers with open, standard message formats
that can be accessed by both browsers and servers.

As an example of where LOD is heading, I'd like to mention a couple
of open-source projects, which are harbingers of things to come.
Following is an excerpt from a note I sent to another forum.
Some people complained that those systems are *not* typical LOD.
I agree that they are not what LOD advocates are talking about
today, but they are early versions of the many kinds of systems
that will be designed to process LOD.

The fact that those systems do *not* use SemWeb software is a
wake-up call that the current SemWeb software and the current
strategy for LOD are out of touch with future requirements.

John
____________________________________________________________________

The first is The Wikipedia Miner, which illustrates some interesting
and powerful methods for processing Linked Open Data, but it doesn't
use any of the tools designed for the Semantic Web:

    http://wikipedia-miner.sourceforge.net/

> Wikipedia Miner is a toolkit for navigating and making use of the
> structure and content of Wikipedia. It aims to make it easy for you
> to integrate Wikipedia's knowledge into your own applications, by:
>
> * providing simplified, object-oriented access to Wikipedia's
>   structure and content.
> * measuring how terms and concepts in Wikipedia are connected
>   to each other.
> * detecting and disambiguating Wikipedia topics when they are
>   mentioned in documents.
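
The second bullet -- measuring how terms and concepts are connected --
rests on Wikipedia's link graph.  Milne and Witten's published
relatedness measure compares the sets of articles that link to two
topics; the Python sketch below follows that formula as I understand
it from the paper cited further down, so treat the details as
illustrative rather than as the toolkit's actual code:

    from math import log

    def relatedness(a_links, b_links, n):
        """a_links, b_links: sets of articles linking to the two topics;
        n: total number of articles.  Returns a score in [0, 1]."""
        common = a_links & b_links
        if not common:
            return 0.0
        distance = ((log(max(len(a_links), len(b_links))) - log(len(common)))
                    / (log(n) - log(min(len(a_links), len(b_links)))))
        return max(0.0, 1.0 - distance)

    # Two topics that share most of their inlinks score close to 1.0:
    print(relatedness({"a", "b", "c"}, {"b", "c", "d"}, n=3_000_000))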

Wikipedia was designed to be navigated by a reader who uses
an ordinary browser to follow a few links to related pages.
For that kind of application, links buried in the text are useful.
But for more complex applications, very few readers would want to
wait for the browser on their laptop to wade through SPARQL queries
that navigate through many megabytes or gigabytes of markup.

For more detail about how Wikipedia Miner works, see

http://www.cs.waikato.ac.nz/~dnk2/publications/AnOpenSourceToolkitForMiningWikipedia.pdf

The author notes that the semantic features and URLs in Wikipedia
"are buried under 20 GB of cryptic markup" and that "the link-graph
summary alone is still almost 1 GB. Instead the toolkit communicates
with a MySQL database, so that the data can be indexed persistently
and accessed immediately, without waiting for anything to load."

Wikipedia Miner uses Perl (a language developed before the WWW)
to gather the URLs and markup and put them into the more efficiently
indexed tables of a relational DB.  But there are other systems
that do even more sophisticated language processing.  For example,

    http://www.ukp.tu-darmstadt.de/software/jwpl
    JWPL (Java Wikipedia Library)

This is "a free, Java-based application programming interface that
allows access to all information contained in Wikipedia."  Following
is an article that provides more detail:

http://elara.tk.informatik.tu-darmstadt.de/Publications/2008/lrec08_camera_ready.pdf

Note that JWPL also uses MySQL to index and store all the semantic
information and URLs extracted from Wikipedia.
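
The extract-and-index pattern common to both toolkits is easy to
sketch.  The Python below uses SQLite merely to keep the example
self-contained (the real systems use MySQL), and the regex for
[[wiki links]] is deliberately simplified:

    import re
    import sqlite3

    LINK = re.compile(r"\[\[([^\]|#]+)")   # naive [[link]] matcher

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE links (source TEXT, target TEXT)")
    conn.execute("CREATE INDEX idx_target ON links (target)")

    def index_page(title, wikitext):
        """Pull one page's outgoing links into the indexed table."""
        rows = [(title, m.group(1).strip()) for m in LINK.finditer(wikitext)]
        conn.executemany("INSERT INTO links VALUES (?, ?)", rows)

    index_page("Berlin", "Berlin is the capital of [[Germany]].")

    # With the index, "what links here?" is a logarithmic lookup,
    # not a scan of 20 GB of raw markup:
    print(conn.execute("SELECT source FROM links WHERE target = ?",
                       ("Germany",)).fetchall())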

In summary, these systems suggest future directions for LOD:

  1. Notations like RDF and query languages like SPARQL, which can
     be processed by a browser, are useful for lightweight, local
     navigation in a few web pages.

  2. But for complex processing, it is essential to extract the
     information from the source documents and organize it in
     a high-speed indexed database (which could be an RDB or
     something like AllegroGraph).

  3. Future developments in LOD will follow the examples of JWPL
     and Wikipedia Miner:  extract the markup and links from the
     web pages and index them in databases.

  4. There is no reason why semantic annotations must be stored
     with the documents.  In fact, storing annotations in separate
     files or databases is more flexible, because it can support
     multiple ways of viewing and interpreting the same document
     for different purposes.

As an example, note the many ways of interpreting the Bible
or the US Constitution.  For important documents, the semantic
annotations can grow much faster and become orders of magnitude
more voluminous than the source texts.  For many documents, the
different interpretations may be proprietary or conflicting.
Just imagine the different interpretations of the Bible by
different religions or the Constitution by different judges
and political parties.
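
What such a separate annotation store might look like can be
sketched in a few lines.  The schema and names below are my own
illustration of "standoff" annotation, not any existing system:
the document is stored once, and each interpretation lives in its
own layer, keyed by character offsets:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE documents   (doc_id TEXT PRIMARY KEY, text TEXT);
        CREATE TABLE annotations (doc_id TEXT, layer TEXT,
                                  start INTEGER, end INTEGER, note TEXT);
        CREATE INDEX idx_ann ON annotations (doc_id, layer);
    """)

    conn.execute("INSERT INTO documents VALUES (?, ?)",
                 ("us-const", "We the People of the United States ..."))

    # Two conflicting readings of the same span; neither touches
    # the source text, and each could live in a separate database.
    conn.execute("INSERT INTO annotations VALUES (?, ?, ?, ?, ?)",
                 ("us-const", "judge-A", 0, 13, "the states collectively"))
    conn.execute("INSERT INTO annotations VALUES (?, ?, ?, ?, ?)",
                 ("us-const", "judge-B", 0, 13, "individual citizens"))

    for layer, note in conn.execute(
            "SELECT layer, note FROM annotations WHERE doc_id = ?",
            ("us-const",)):
        print(layer, "->", note)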

Summary:  For complex queries and high-performance processing,
indexed databases are essential.  At present, conventional SQL
databases run circles around SPARQL.  For the future, many novel
kinds of databases could be designed.  But indexing is required,
and annotations stored inside the documents should be optional.

