
Re: [ontolog-forum] Solving the information federation problem

To: "[ontolog-forum]" <ontolog-forum@xxxxxxxxxxxxxxxx>
From: "John F. Sowa" <sowa@xxxxxxxxxxx>
Date: Thu, 03 Nov 2011 13:52:44 -0400
Message-id: <4EB2D4EC.4060309@xxxxxxxxxxx>
Ed and Leo,    (01)

This morning, I decided that a revision to slide 105 could explain
more clearly how the VivoMind Language Processor (VLP) works.    (02)

But before going into more detail about VLP, I'd like to clarify
some points that Ed raised:    (03)

> I agree with Leo that really doing this is a much harder problem than
> just analyzing the code and data structures.  It makes the assumption
> that the data structure and code nomenclature, for example, is
> consistent across multiple programs (which may be the case if there were
> stringent coding rules enforced for initial development and all
> subsequent modification), and that the nomenclature is accurate to    (04)

There was no assumption that the nomenclature was consistent.  The
programs had been developed over a period of 40 years by and for
an ever changing and evolving collection of developers and users.    (05)

The client explicitly asked for a glossary to show the changes in
terminology over the years and to mark each definition with a link
back to the document in which it was stated.  That was one of the
requirements for the CD-ROM generated by this project.    (06)

The problem with statistical parsers is that they mush together all
the data in a large corpus of independently developed documents.
But VLP uses a high-speed associative memory to access conceptual
graphs that have pointers to the documents from which they were
derived.  That enables VLP to distinguish the sources as required.    (07)
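The source-tracking idea above can be sketched in a few lines of Python. This is purely illustrative under invented names (`CGStore`, `sources_of`); the actual Cognitive Memory representation is not described in this message.

```python
# Hypothetical sketch: a store that keeps a pointer to the source document
# for every conceptual graph, so any match can be traced back to where it
# came from. Graphs are modeled naively as sets of relation triples.

class CGStore:
    def __init__(self):
        self._entries = []  # list of (graph, source_document) pairs

    def add(self, graph, source):
        """Store a conceptual graph together with its source document."""
        self._entries.append((graph, source))

    def sources_of(self, predicate):
        """Return source documents of all graphs satisfying a predicate."""
        return [src for g, src in self._entries if predicate(g)]

store = CGStore()
store.add({("Account", "has", "Balance")}, "spec_1987.doc")
store.add({("Account", "has", "Owner")}, "spec_1995.doc")

# Which documents say anything about "Owner"?
hits = store.sources_of(lambda g: any("Owner" in triple for triple in g))
```

Because each graph carries its own source pointer, independently developed documents never get mushed together: a query can always be restricted to, or reported against, a particular source.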

> Further, code analysis software is well-advanced, so much so as to have
> many commercial products and several standards.    (08)

Yes, indeed.  Arun and André used off-the-shelf code analysis software
to satisfy some of the requirements.  I said that in my earlier
response to Leo (copy below).    (09)

> The usual problem for the vendor of software analysis tools of the
> 1990s was:  convert programs in language X to programs in language Y.    (010)

The client understood that problem very well, and they did *not* ask
for code conversion.  They had no intention to replace all their legacy
software overnight with a new set of computer generated software.  That
would be a recipe for disaster.    (011)

After they got that estimate of 80 person years just for the analysis,
they called Ed Yourdon as a consultant, since he had been consulting
on many, many Y2K projects.  Ed Y. worked with the Cutter Consortium,
and he knew that Arun and André were two superprogrammers who might be
able to give a second opinion about how to proceed.  So he suggested
them for a short *study project*.    (012)

But instead of just studying the task, Arun and André finished it.    (013)

> OMG, for example, has had a working group in this area for 10 years.
> OSF had such a group in 1990, and I assume there are others...    (014)

If they really want to understand the state of the art, I suggest that
they study the slides and the references in slides 122 and 123.  I also
recommend my article on "Future directions for semantic systems":    (015)

    http://www.jfsowa.com/pubs/futures.pdf    (016)

Note to Leo:    (017)

Following is the revised version of slide 105:    (018)

> An extremely difficult and still unsolved problem:
> ● Translate English specifications to executable programs.
> Much easier task:
> ● Translate the COBOL programs to conceptual graphs.
> ● Those CGs provide the ontology and background knowledge.
> ● The CGs derived from English may have ambiguous options.
> ● VAE matches the CGs from English to CGs from COBOL.
> ● The COBOL CGs show the most likely options.
> ● They can also insert missing information or detect errors.
> The CGs derived from COBOL provide a formal semantics for
> the informal English texts.    (019)
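The matching and error-detection steps in the slide can be sketched as follows. The set-of-triples representation and the overlap-count score are simplifying assumptions for illustration; VAE's actual approximate matching is far more sophisticated.

```python
# Toy sketch: score each COBOL-derived graph against an (ambiguous)
# English-derived graph by triple overlap, pick the best match, and report
# the triples that disagree. The names and scoring are illustrative only.

def match_and_diff(english_cg, cobol_cgs):
    best = max(cobol_cgs, key=lambda g: len(g & english_cg))
    return {
        "agreements": english_cg & best,
        "only_in_english": english_cg - best,  # possible errors/commentary
        "only_in_cobol": best - english_cg,    # implicit info to insert
    }

english = {("Account", "has", "Balance"), ("Balance", "unit", "dollars")}
cobol = [
    {("Account", "has", "Balance"), ("Balance", "hasType", "9(7)V99")},
    {("Customer", "has", "Name")},
]
result = match_and_diff(english, cobol)
```

The two difference sets correspond to the last bullet of the slide: triples only in the English graph flag possible errors, and triples only in the COBOL graph supply missing information.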

The primary innovation is to use the VivoMind Analogy Engine (VAE)
for high-speed access to background knowledge.  VAE can find exact
or approximate matches in O(log N) time, where N is the number of
graphs stored in Cognitive Memory.  (See slides 81 to 102.)    (020)
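One conventional way to get logarithmic-time retrieval is to reduce each graph to a sortable signature and binary-search a sorted index. The signature function below is purely illustrative; how VAE actually encodes conceptual graphs is not described in this message.

```python
import bisect

def signature(graph):
    """Canonical signature: sorted tuple of the graph's relation triples."""
    return tuple(sorted(graph))

class GraphIndex:
    def __init__(self, graphs):
        # Sort once up front; each exact-match lookup is then O(log N).
        self._index = sorted(signature(g) for g in graphs)

    def contains(self, graph):
        sig = signature(graph)
        i = bisect.bisect_left(self._index, sig)
        return i < len(self._index) and self._index[i] == sig

idx = GraphIndex([
    {("Cat", "on", "Mat")},
    {("Account", "has", "Balance")},
])
found = idx.contains({("Account", "has", "Balance")})
missing = idx.contains({("Dog", "on", "Mat")})
```

This only handles exact matches; approximate matching in logarithmic time, as claimed for VAE, is the hard part and is not attempted here.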

That speed enables arbitrarily large volumes of data to be applied
to the task of language analysis and semantic interpretation.  The
other steps in the above slide are conventional techniques that
computer science and NLP have been using for years.    (021)

Statistical methods compress large volumes of background knowledge
into numbers.  But when you have logarithmic-time access to all that
background knowledge, you get the benefits of statistical parsers
*plus* the benefits of using the actual background knowledge to
resolve ambiguities, add missing or implicit information, and
link fragmentary partial parses to form a complete interpretation.    (022)

John    (023)

-------- Original Message --------
Subject: Re: [ontolog-forum] Solving the information federation problem
Date: Thu, 03 Nov 2011 02:07:16 -0400
From: John F. Sowa <sowa@xxxxxxxxxxx>    (024)

Leo,    (025)

Arun would be happy to go through the details with you at any time.    (026)

> I simply find this hard to believe, unless you already had very
> elaborate software.    (027)

It most definitely is elaborate.  Section 6 (slides 81-102) covers the
VivoMind Analogy Engine (VAE):  http://www.jfsowa.com/talks/goal.pdf
You'd be hard pressed to find anybody else with such software.    (028)

> Just understanding (analyzing) the domain as a human would take you
> 2 weeks, in my estimate.  Not spending the time to understand it,
> but just applying your tools, means you will do it wrong.    (029)

If done by humans, it would take much, much longer.  Note slide 104,
which says that there were 1.5 million lines of COBOL and 100 megabytes
says that there were 1.5 million lines of COBOL and 100 megabytes
of English documentation.  The consulting firm estimated 40 people
for 2 years (80 person years) to analyze all that and to generate
a cross-reference of the programs to the documentation.    (030)

What Arun did was to take an off-the-shelf grammar for COBOL and
modify the back end to generate conceptual graphs that represent
the COBOL data definitions, file definitions, and code.    (031)
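The kind of translation described above can be illustrated for a single COBOL data-definition line. This toy pattern covers only a "level name PIC type" item and is an assumption for demonstration; Arun's actual back end handled full COBOL.

```python
import re

# Illustrative only: turn one COBOL data-definition line into a tiny
# conceptual-graph fragment (a set of relation triples).
COBOL_ITEM = re.compile(r"^\s*(\d+)\s+(\S+)\s+PIC\s+(\S+)\.", re.IGNORECASE)

def data_item_to_cg(line, parent=None):
    m = COBOL_ITEM.match(line)
    if not m:
        return set()
    level, name, pic = m.groups()
    graph = {(name, "hasType", pic), (name, "hasLevel", level)}
    if parent:
        # Record the containment relation to the enclosing record.
        graph.add((parent, "hasPart", name))
    return graph

cg = data_item_to_cg("05 CUST-BALANCE PIC 9(7)V99.", parent="CUSTOMER-REC")
```

Graphs built this way from data definitions, file definitions, and code become the formal background knowledge against which the English is interpreted.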

Note slide 105:    (032)

> An extremely difficult and still unsolved problem:
> ● Translate English specifications to executable programs.
> Much easier task:
> ● Translate the COBOL programs to conceptual graphs.
> ● Use the conceptual graphs from COBOL to interpret the English.
> ● Use VAE to compare the graphs derived from COBOL to the
>   graphs derived from English.
> ● Record the similarities and discrepancies.
> The graphs derived from COBOL provide a formal semantics
> for the informal English.    (033)

As VLP (VivoMind Language Processor) parses the English, it uses the
VivoMind Analogy Engine (VAE) to find conceptual graphs from COBOL
that match something in the sentences.  Any sentence that doesn't
match anything from COBOL is discarded as irrelevant.  But if there
is a match, VAE uses the graphs to resolve ambiguities and to fill
in any missing but implicit details.    (034)
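The filtering step just described can be sketched as follows. The term extraction and matching below are deliberately naive stand-ins for VLP and VAE, purely to show the control flow: keep only sentences that match something in the COBOL graphs, along with their matches.

```python
# Hedged sketch: drop sentences whose terms match nothing in the
# COBOL-derived graphs; keep the rest paired with their matching graphs
# for later disambiguation.

def terms(sentence):
    """Naive term extraction: uppercase words, punctuation stripped."""
    return {w.strip(".,").upper() for w in sentence.split()}

def relevant_sentences(sentences, cobol_graphs):
    kept = []
    for s in sentences:
        s_terms = terms(s)
        matches = [g for g in cobol_graphs
                   if any(node in s_terms for triple in g for node in triple)]
        if matches:          # no match -> discard as irrelevant
            kept.append((s, matches))
    return kept

graphs = [{("CUST-BALANCE", "hasType", "9(7)V99")}]
kept = relevant_sentences(
    ["The CUST-BALANCE field holds the amount.",
     "The weather was fine that day."],
    graphs)
```

Here the second sentence matches nothing in the COBOL graphs and is discarded, just as irrelevant commentary is in the process described above.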

The COBOL graphs are assumed to be accurate, and the English is
assumed to be an informal approximation to the formal COBOL.
But the English may contain additional commentary, which is
irrelevant for the purpose of generating cross references and
looking for discrepancies.  For some examples of the kind of
discrepancies that VAE found, see slides 109 to 111.    (035)

> I'd like to see the metrics resulting from this project, if possible.    (036)

I don't know what you mean by metrics.  Arun can show you the CD-ROM
that contained the results of the analysis.    (037)

Slide 108 shows what the client wanted the consulting firm to produce:    (038)

> Glossary, data dictionary, data flow diagrams, process architecture,
> system context diagrams.    (039)

Some of that output was generated by off-the-shelf tools that could
analyze COBOL code to generate such information.    (040)

But there were no tools that could do a cross-reference of the
documentation and the programs and check for discrepancies.
That was what VAE did.    (041)

VLP with some heuristics was used to produce a glossary for human
readers.  For example, if a certain phrase X was found in a pattern
such as "An X is a ...", that was considered a candidate for a
definition of X.  It was added to the glossary with a pointer
to the source.  Arun and André proofread the glossary to toss
out any sentences that weren't useful definitions.    (042)
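The glossary heuristic can be sketched with a single pattern. The regex and the entry format below are illustrative assumptions, not VLP's actual heuristics, but they show the "An X is a ..." idea and the source pointer that the client required.

```python
import re

# Candidate-definition pattern: "A/An/The X is (a/an) ...".
DEF_PATTERN = re.compile(r"^(?:An?|The)\s+(.+?)\s+is\s+(?:an?\s+)?(.+)\.$",
                         re.IGNORECASE)

def glossary_candidates(sentences, source):
    """Collect candidate definitions, each with a pointer to its source."""
    entries = []
    for s in sentences:
        m = DEF_PATTERN.match(s.strip())
        if m:
            entries.append({"term": m.group(1),
                            "definition": m.group(2),
                            "source": source})
    return entries

cands = glossary_candidates(
    ["A ledger account is a record of debits and credits.",
     "Processing continues on the next business day."],
    source="ops_manual_1992.doc")
```

A heuristic like this inevitably produces false positives, which is why the candidates were proofread by hand before going into the glossary.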

The client said that the CD-ROM contained exactly what they wanted
the consulting firm to do.    (043)

That's a pretty good metric.  And it's a metric that satisfied
a living, breathing customer that was delighted to pay for
15 person weeks of work instead of 80 person years.  That's
a lot more convincing than the kinds of numbers typically
generated for toy problems.    (044)

John    (045)

Message Archives: http://ontolog.cim3.net/forum/ontolog-forum/  
Config Subscr: http://ontolog.cim3.net/mailman/listinfo/ontolog-forum/  
Unsubscribe: mailto:ontolog-forum-leave@xxxxxxxxxxxxxxxx
Shared Files: http://ontolog.cim3.net/file/
Community Wiki: http://ontolog.cim3.net/wiki/ 
To join: http://ontolog.cim3.net/cgi-bin/wiki.pl?WikiHomePage#nid1J    (046)
