Re: [ontolog-forum] IBM Watson on Jeopardy

From: "John F. Sowa" <sowa@xxxxxxxxxxx>
Date: Thu, 10 Feb 2011 11:27:49 -0500
Peter,    (01)

Thanks for the reminder:    (02)

> Dave Ferrucci gave a talk on UIMA (the Unstructured Information
> Management Architecture) back in May-2006, entitled: "Putting the
> Semantics in the Semantic Web: An overview of UIMA and its role in
> Accelerating the Semantic Revolution"    (03)

I recommend that readers compare Ferrucci's talk about UIMA in
2006 with his talk about the Watson system and Jeopardy in 2011.
In less than 5 years, they built Watson on the UIMA foundation,
which contained a reasonable number of NLP tools, a modest ontology,
and some useful tools for knowledge acquisition.  During that time,
they added quite a bit of machine learning, reasoning, statistics,
and heuristics.  But most of all, they added terabytes of documents.    (04)

For the record, following are Ferrucci's slides from 2006:    (05)

http://ontolog.cim3.net/file/resource/presentation/DavidFerrucci_20060511/UIMA-SemanticWeb--DavidFerrucci_20060511.pdf    (06)

Following is the talk that explains the slides:    (07)

http://ontolog.cim3.net/file/resource/presentation/DavidFerrucci_20060511/UIMA-SemanticWeb--DavidFerrucci_20060511_Recording-2914992-460237.mp3    (08)

And following is his recent talk about the DeepQA project, which
builds on and extends that foundation for Jeopardy:    (09)

http://www-943.ibm.com/innovation/us/watson/watson-for-a-smarter-planet/building-a-jeopardy-champion/how-watson-works.html    (010)

Compared to Ferrucci's talks, the PBS Nova program was a disappointment.
It didn't get into any technical detail, but it did have a few cameo
appearances from AI researchers.  Terry Winograd and Pat Winston,
for example, said that the problem of language understanding is hard.    (011)

But I thought that Marvin Minsky and Doug Lenat said more with their
tone of voice than with their words.  My interpretation (which could,
of course, be wrong) is that both of them were seething with jealousy
that IBM built a system that was competing with Jeopardy champions
on national TV -- and without their help.    (012)

In any case, the Watson project shows that terabytes of documents are
far more important for commonsense reasoning than the millions of
formal axioms in Cyc.  That does not mean that the Cyc ontology is
useless, but it undermines the original assumption of the Cyc
project:  that commonsense reasoning requires a huge knowledge base
of hand-coded axioms together with a powerful inference engine.    (013)

An important observation by Ferrucci:  The URIs of the Semantic Web
are *not* useful for processing natural languages -- not for ordinary
documents, not for scientific documents, and especially not for
Jeopardy questions:    (014)

  1. For scientific documents, words like 'H2O' are excellent URIs.
     Adding an http address in front of them is pointless.    (015)

  2. A word like 'water', which is sometimes a synonym for 'H2O',
     has an open-ended number of senses and microsenses.    (016)

  3. Even if every microsense could be precisely defined and
     cataloged on the WWW, that wouldn't help determine which
     one is appropriate for any particular context.    (017)

  4. Any attempt to force human beings to specify or select
     a precise sense cannot succeed unless *every* human
     understands and consistently selects the correct sense
     on *every* possible occasion.

  5. Given that point #4 is impossible to enforce and dangerous
     to assume, any software that uses URIs will have to verify
     that the selected sense is appropriate to the context.    (019)

  6. Therefore, URIs found "in the wild" on the WWW can never
     be assumed to be correct unless they have been guaranteed
     to be correct by a trusted source.    (020)

These points taken together imply that annotations on documents
can't be trusted unless (a) they have been generated by your
own system or (b) they were generated by a system which is at
least as trustworthy as your own and which has been verified
to be 100% compatible with yours.    (021)
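
As a rough illustration of points #5 and #6, here is a minimal Python
sketch of software that accepts a URI annotation only if it comes from
a trusted source or if the selected sense passes a check against the
surrounding context.  Everything in it is a hypothetical assumption
made for illustration: the example.org URIs, the source names, the toy
cue-word table, and the function names.  The word-overlap test is only
a stand-in for real sense disambiguation, not a description of what
Watson or any Semantic Web tool does.

```python
from dataclasses import dataclass

# Hypothetical: annotation producers verified as compatible with our system.
TRUSTED_SOURCES = {"local-annotator"}

# Hypothetical toy sense inventory: URI -> cue words expected near that sense.
SENSE_CUES = {
    "http://example.org/sense/water#H2O": {"molecule", "hydrogen", "oxygen", "compound"},
    "http://example.org/sense/water#body": {"river", "lake", "sea", "swim"},
}


@dataclass
class Annotation:
    term: str    # the word as it appears in the text
    uri: str     # the sense the annotator selected
    source: str  # who produced the annotation


def sense_fits_context(uri: str, context: str) -> bool:
    """Crude check: does any cue word for this sense occur in the context?"""
    cues = SENSE_CUES.get(uri, set())
    words = {w.strip(".,;:!?").lower() for w in context.split()}
    return bool(cues & words)


def accept_annotation(ann: Annotation, context: str) -> bool:
    """Trust annotations from trusted sources; verify all others."""
    if ann.source in TRUSTED_SOURCES:
        return True
    return sense_fits_context(ann.uri, context)


wild = Annotation("water", "http://example.org/sense/water#H2O", "unknown-crawler")
print(accept_annotation(wild, "Each water molecule has two hydrogen atoms."))
print(accept_annotation(wild, "They went to swim in the water of the lake."))
```

The first call accepts the annotation because the context supports the
chemical sense; the second rejects it because the same URI was attached
to a context about a lake.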

In summary, the underlying assumptions for both Cyc and
the Semantic Web need to be reconsidered.    (022)

John    (023)

