Those statistical engines probably use "knowledge" underneath, embedded
in non-reusable program code. Statistical NLP (corpus-based) has been
the predominant paradigm for the last 10-15 years because symbolic NLP
hit the wall: you need domain/world knowledge, and there ain't none --
or at least, there was very little back then in
machine-usable/interpretable form. Hence the shift to weaker,
syntactic methods that "get at" or "simulate" the missing semantics:
use keywords in search engines, and return ten thousand
documents where those strings co-occur. (01)
But statistical NLP needs huge amounts of data (populations) to train on,
to build the statistical models, and then to predict on new, similar
data from those models. There is nothing wrong with statistical
models. But if you had solid semantic models, you would get better
Metadata is just the database world's term for anything that's not data.
In AI we call it "knowledge" or semantics. But unlike the flat metadata
crap, we structure it, relate it to other concepts, specify rigorous
meaning, and make it machine-interpretable. (03)
Those Bayesian nets assume some semantic structure, largely
hand-generated without logical principles. What do you think the
patterns of "pattern-matching" are about? Most are regular
expression-based, a formally weaker notion equivalent to finite-state
machines. The stronger notions are, on the language side, context-free
and context-sensitive languages and, on the formal machine side,
pushdown automata and Turing machines. The latter correspond to our
logics and our programming languages. Grammars are patterns. If you
want to use simple patterns, rather than complex patterns, and somehow
automagically build up human-level inferences, that's fine. I'd rather
start with complex patterns that correspond to how humans conceptualize
the world. (04)
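To make the formal-language point concrete, here is a minimal sketch of my
own (a toy, not taken from any particular system; the function names are
hypothetical). A regular expression can check which characters occur, but
checking that brackets nest properly to arbitrary depth takes a stack,
i.e., a pushdown-style recognizer.

```python
import re

# Finite-state pattern: accepts any string made only of bracket characters.
# It has no memory of nesting, so it cannot tell balanced from unbalanced.
BRACKET_CHARS = re.compile(r"[()\[\]]*")

# Each closing bracket and the opening bracket it must match.
PAIRS = {")": "(", "]": "["}


def regex_accepts(s):
    """Finite-state check: every character is one of ( ) [ ]."""
    return BRACKET_CHARS.fullmatch(s) is not None


def stack_accepts(s):
    """Pushdown-style check: brackets must nest and close in order."""
    stack = []
    for ch in s:
        if ch in "([":
            stack.append(ch)          # remember the open bracket
        elif ch in PAIRS:
            # a closer must match the most recent unclosed opener
            if not stack or stack.pop() != PAIRS[ch]:
                return False
        else:
            return False              # not a bracket at all
    return not stack                  # nothing may be left open


if __name__ == "__main__":
    for s in ["([])", "([)]", "((("]:
        print(s, regex_accepts(s), stack_accepts(s))
```

The simple pattern accepts all three strings; only the stack-based
recognizer rejects the two that are not well nested.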
The search engines of tomorrow will also use semantic search, i.e.,
instead of looking for string co-occurrences in documents to "simulate"
the semantics of your query, they'll use string/word-to-concept
mappings, substitute concepts (from some set of ontologies) and
their relations to create a conceptual representation of your query, and
then compare that representation against the conceptual
representations of documents to find those that satisfy the query. Will
these use statistical methods? Absolutely, but they will be combined
with semantic representation in the form of ontologies, knowledge bases,
"metacrap". Why? Because the latter, if represented well, represents our
best "theories" of the world. It's why we formulate scientific theories
in the real world, after all. They model the world succinctly. (05)
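A rough sketch of the contrast, under loud assumptions: the tiny LEXICON
below is a made-up stand-in for real ontologies, the function names are
hypothetical, and relations between concepts and any statistical ranking
are left out entirely. It only shows string co-occurrence versus
word-to-concept matching.

```python
# Hypothetical word-to-concept mapping; a real system would draw these
# from one or more ontologies rather than a hand-written dictionary.
LEXICON = {
    "car": "Automobile", "automobile": "Automobile", "auto": "Automobile",
    "doctor": "Physician", "physician": "Physician",
    "buy": "Purchase", "purchase": "Purchase",
}


def concepts(text):
    """Map each known word to its concept; unknown words are ignored."""
    return {LEXICON[w] for w in text.lower().split() if w in LEXICON}


def keyword_match(query, doc):
    """String co-occurrence: every query word must appear literally."""
    doc_words = set(doc.lower().split())
    return all(w in doc_words for w in query.lower().split())


def concept_match(query, doc):
    """Conceptual match: every query concept must occur in the document."""
    q = concepts(query)
    return bool(q) and q <= concepts(doc)


if __name__ == "__main__":
    query = "buy car"
    doc = "where to purchase an automobile"
    print(keyword_match(query, doc))   # no shared strings
    print(concept_match(query, doc))   # same concepts
```

The query and the document share no strings, so the keyword test fails,
but both map to the same two concepts, so the conceptual test succeeds.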
"Peter P. Yim" wrote:
> Thank you, Monica, for sharing this.
> Mind provoking ... and definitely worth the bits and bytes this piece
> is consuming -- although I will have to take a position that it is not
> even a matter of "either-or" in this case.
> I'm passing this on ...
> Monica J. Martin wrote Wed, 23 Apr 2003 22:03:43 -0600:
> > -------- Original Message --------
> > Subject: [xml-dev] Statistical vs "semantic web" approaches to making
> > sense of the Net
> > Date: Wed, 23 Apr 2003 21:09:48 -0400
> > From: Mike Champion <mc@xxxxxxxxxxx>
> > To: "xml-dev@xxxxxxxxxxxxx" <xml-dev@xxxxxxxxxxxxx>
> > There was an interesting conjunction of articles on the ACM "technews"
> > page [http://www.acm.org/technews/current/homepage.html] -- one on "AI"
> > approaches to spam filtering
> > http://www.nwfusion.com/news/tech/2003/0414techupdate.html and the other
> > on the Semantic Web
> > http://www.computerworld.com/news/2003/story/0,11280,80479,00.html.
> > What struck me is that the "AI" approach (I'll guess it makes heavy use
> > of pattern matching and statistical techniques such as Bayesian
> > inference) is working with raw text that the authors are deliberately
> > trying to obfuscate the meaning of to get past "keyword" spam filters,
> > and the Semantic Web approach seems to require explicit, honest markup.
> > Given the "metacrap" argument about semantic metadata
> > (http://www.well.com/~doctorow/metacrap.htm) I suspect that in general
> > the only way we're going to see a "Semantic Web" is for
> > statistical/pattern matching software to create the semantic markup and
> > metadata. That is, if such tools can make useful inferences today about
> > spam that pretends to be something else, they should be very useful in
> > making inferences tomorrow about text written by people who try to say
> > what they mean.
> > This raises a question, for me anyway: If it will take a "better Google
> > than Google" (or perhaps an "Autonomy meets RDF") that uses Bayesian or
> > similar statistical techniques to create the markup that the Semantic
> > Web will exploit, what's the point of the semantic markup? Why won't
> > people just use the "intelligent" software directly? Wearing my "XML
> > database guy" hat, I hope that the answer is that it will be much more
> > efficient and programmer-friendly to query databases generated by the
> > 'bots containing markup and metadata to find the information one needs.
> > But I must admit that 5-6 years ago I thought the world would need
> > standardized, widely deployed XML markup before we could get the quality
> > of searches that Google allows today using only raw HTML and the
> > PageRank heuristic algorithm.
> > So, anyone care to pick holes in my assumptions, or reasoning? If one
> > does accept the hypothesis that it will take smart software to produce
> > the markup that the Semantic Web will exploit, what *is* the case for
> > believing that it will be ontology-based logical inference engines
> > rather than statistically-based heuristic search engines that people
> > will be using in 5-10 years? Or is this a false dichotomy? Or is the
> > "metacrap" argument wrong, and people really can be persuaded to create
> > honest, accurate, self-aware, etc. metadata and semantic markup?
> > [please note that my employer, and many colleagues at W3C, may have a
> > very different take on this and please don't blame anyone but me for
> > this blather!]
> Message Archives: http://ontolog.cim3.net/forum/ontolog-forum/
> Shared Files: http://ontolog.cim3.net/file/
> Community Wiki: http://ontolog.cim3.net/wiki/
> To Post: mailto:ontolog-forum@xxxxxxxxxxxxxxxx
Dr. Leo Obrst        The MITRE Corporation, Information Semantics
lobrst@xxxxxxxxx     Center for Innovative Computing & Informatics
Voice: 703-883-6770  7515 Colshire Drive, M/S H305
Fax: 703-883-1379    McLean, VA 22102-7508, USA (07)