Hi Ed,

I agree, the search engine association is a good one.

Of course, as you point out, search engines don't themselves find
meaningful classes. Human users specify a "class" using a collection
of terms. But you can find classes automatically. In search engine
terms you would just collect sets of keywords which occur together
often.

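As a rough illustration of "collecting sets of keywords which occur together often", here is a minimal sketch in Python. The toy corpus, the function names, and the `min_count` threshold are all hypothetical, chosen only for illustration; a real search engine index would be far larger and would need a proper clustering step on top of the raw pair counts.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus: each document is reduced to its set of keywords.
documents = [
    {"medieval weaponry", "knight", "sword"},
    {"medieval weaponry", "knight", "castle"},
    {"medieval weaponry", "World of Warcraft"},
    {"flowers", "garden"},
    {"knight", "sword", "castle"},
]


def cooccurrence_counts(docs):
    """Count, for every unordered keyword pair, how many documents contain both."""
    counts = Counter()
    for keywords in docs:
        # Sorting gives each pair one canonical ordering.
        for pair in combinations(sorted(keywords), 2):
            counts[pair] += 1
    return counts


def frequent_pairs(docs, min_count=2):
    """Keep only the pairs that co-occur in at least min_count documents."""
    return {pair: n for pair, n in cooccurrence_counts(docs).items() if n >= min_count}


pairs = frequent_pairs(documents)
for pair, n in sorted(pairs.items()):
    print(pair, n)
```

In this toy data "knight" pairs up with "medieval weaponry", "sword" and "castle", while "World of Warcraft" appears with "medieval weaponry" only once and so falls below the threshold.
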
Now, it is true that if "medieval weaponry" occurs together in a lot
of documents with "World of Warcraft" then the system might posit a
class containing both of them.

We can test. Google just gave me a count of 1,290/(44,900 +
105,000,000) documents in this "class" compared to 6,510/(44,900 +
68,200,000) for "medieval weaponry" "knight". Compare this with
2,950/(44,900 + 445,000,000) for "medieval weaponry" "flowers". (Each
figure is the number of documents containing both phrases, over the
sum of the counts for each phrase alone.) So by this crude model
"knight" is already 8 times or so closer to the concept of "medieval
weaponry" than "World of Warcraft", and some 14 times closer than
"flowers".

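The arithmetic can be checked with a few lines of Python. This is only a sketch: it reads each fraction as (documents containing both phrases) / (hits for "medieval weaponry" + hits for the other phrase), which is an interpretation of the notation above, and the function and variable names are made up for the example. With the counts quoted, the ratios work out to roughly 7.8 ("knight" vs "World of Warcraft") and 14.4 ("knight" vs "flowers").

```python
def cooccurrence_score(both, count_a, count_b):
    """Crude closeness score: joint hits over the sum of the individual hit counts."""
    return both / (count_a + count_b)


# Hit counts as quoted in the post; 44,900 is taken to be "medieval weaponry" alone.
MEDIEVAL_WEAPONRY = 44_900
hits = {
    # term: (hits together with "medieval weaponry", hits for the term alone)
    "World of Warcraft": (1_290, 105_000_000),
    "knight": (6_510, 68_200_000),
    "flowers": (2_950, 445_000_000),
}

scores = {
    term: cooccurrence_score(both, MEDIEVAL_WEAPONRY, alone)
    for term, (both, alone) in hits.items()
}

knight_vs_wow = scores["knight"] / scores["World of Warcraft"]
knight_vs_flowers = scores["knight"] / scores["flowers"]

print(f"knight vs World of Warcraft: {knight_vs_wow:.1f}x")
print(f"knight vs flowers: {knight_vs_flowers:.1f}x")
```

The score is not normalised in any principled way (it is not, for instance, a Jaccard coefficient, which would subtract the joint count from the denominator), so only the relative ordering should be read into it.
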
Of course there's lots to argue with in this. But about the basic
signal we don't need to argue. We can test our "common-sense"
intuitions and see what the correlation is like for ourselves.

The interesting question to my mind is why search engines don't
cluster the terms they index more in this way already. I found a paper
on a Yahoo research site suggesting something of the kind:

Variable latent semantic indexing
Proceedings of the eleventh ACM SIGKDD international conference on
Knowledge discovery in data mining
Chicago, Illinois, USA, Pages: 13 - 21, 2005
ISBN:1-59593-135-X
http://portal.acm.org/citation.cfm?id=1081876

The salient point here being the word "variable": "With this tool, it
is possible to tailor the LSI technique to particular settings".

So they are clustering, but they feel the need to do so in a way which
varies with the domain.

-Rob

On Feb 1, 2008 4:08 AM, Ed Barkmeyer <edbark@xxxxxxxx> wrote:
>
> Rob Freeman wrote:
> > General meaningful classes are accessible by clustering words on their
> > context. Classes found in that way don't have names until you give
> > them names, and we still have no way of reasoning with them, but
> > basic meaningful classes can be found.
>
> "Can be found" by what kind of agent? This is, after all, what Google
> does, and it works quite well for assisting humans to find "classes" of
> interest. But it works because experienced human users try a group of
> terms they think will yield the "class" they want and are willing to
> modify the search group several times if the first results are
> disappointing. Google also uses several contextual clues that the user
> does not expressly provide. (I found that after several hours of
> researching a military history topic, asking it to search for a "rock
> star" name produced a little-known Confederate general as the first hit!)
>
> I believe that what Rob says is true if you restrict the search space to
> a set of publications that are reliable and focused on particular
> domains and topics. And it can probably work well over the Web for a
> topic that is uniquely characterized by a particular group of terms.
> But over the entire Web, you will find links between cheese and chalk
> (literally). You need a mechanism for filtering that, and Google
> succeeds because its algorithms work well with the experimental filters
> that human agents invent, and because human agents reject a set of
> results that is off the intended topic. But human agents are providing
> the massive context knowledge and familiarity with natural language
> usage that software agents simply don't have. (And codifying that
> knowledge is just building a different ontology.)
>
> And using the Web as a resource has the "argumentum ad populum" problem:
> the result is what the most publications, or the most visited
> publications, provide, not necessarily what the most reputable
> publications provide. (I wouldn't want an ontology for medieval
> weaponry to be based on World of Warcraft, but I might well get just that.)
>
> -Ed