
Re: [ontolog-forum] Context and Inter-annotator agreement

To: "'[ontolog-forum] '" <ontolog-forum@xxxxxxxxxxxxxxxx>
From: "Patrick Cassidy" <pat@xxxxxxxxx>
Date: Thu, 1 Aug 2013 14:21:45 -0400
Message-id: <14a101ce8ee3$fad163b0$f0742b10$@micra.com>
John,
   Your comments point out a few issues on which it seems we differ.  To focus 
on those:    (01)

[PC]
>> Those meanings that can be reliably distinguished (>98%) by motivated 
>> (rewarded for accuracy)  human annotators.
>
> There are no such meanings -- except in very special cases.      (02)

  * I think that human performance with real informative text is typically 
above that level, when one is trying to be accurate rather than sloppy or 
hurried.  If it weren't, communication would be virtually impossible, and I 
would suggest that people usually communicate quite well when they are trying 
to be clear.  The difficulty of **testing** that number is that one has to 
start with some inventory of senses, and the most detailed inventory yet used 
for such tests by NL researchers is WordNet, which is not a good standard for 
such testing.  This is pretty much the point I was making: meaningful sense 
disambiguation needs a much better inventory of senses than people are now 
using.  Until we develop a logic-based word-sense inventory intended for broad 
use, I don't see how the maximum achievable agreement could be tested.    (03)
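
As an illustration of how fine-grained that WordNet inventory is, here is a 
minimal sketch (assuming Python with NLTK and its WordNet data installed; the 
noun 'line' is only an arbitrary example) listing the senses an annotator 
would have to choose among:

    # requires: pip install nltk, then nltk.download('wordnet') once
    from nltk.corpus import wordnet as wn

    # Each synset is one WordNet "sense"; an annotator must pick exactly one.
    for synset in wn.synsets('line', pos=wn.NOUN):
        print(synset.name(), '-', synset.definition())

    print(len(wn.synsets('line', pos=wn.NOUN)), 'noun senses of "line"')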

[JFS]
>  Unfortunately, there is no finite "set of senses" that can be used to 
>achieve "human-level interpretation of a broad range of texts."    (04)

 * That is a bold claim, and even acknowledging that it is difficult to prove 
such a negative, my observations suggest that no remotely applicable test has 
yet been conducted to see whether such a claim is even plausible.  One of the 
points I made is that it would be a very expensive process to develop such a 
set of senses, and that process has never been funded.  I suspect that WordNet 
(or its derivative used in OntoNotes) is used because the statistical NLP 
programs that rely on it probably wouldn't perform much better even with a 
perfect set of senses.  So most of the NLP effort is focused on other 
tasks.    (05)

[JFS] > MT researchers have been working for over 60 years (since 1950) on the 
task of designing an Interlingua that could be used for automated translation 
from any NL to any other NL.  All such attempts have failed.
  * And they will continue to fail until a serious and adequately funded 
effort is made to develop such an interlingua, in the form of a logic-based 
ontology that is related via an NLU program to a meaningful text corpus.  
Efforts prior to 1990 did not have an adequate basis in logic, and efforts 
since then have been too restricted to have any hope of achieving the goal.  
Even in the well-funded CALO project, the integration of ontology and NLU did 
not seem to be a major component.    (06)

[JFS] > For the past half century, the most successful MT system has been 
Systran, which is based on the Georgetown Automatic Translator (GAT), for which 
research was terminated in 1963.  It is based on hand-coded word and phrase 
pairs for each of the language pairs it handles.    (07)

  * It is clear (from Google Translate, among other programs) that one can get 
a somewhat useful translation solely from statistical analysis of parallel 
corpora, but that tactic, however useful, is very far from actual 
understanding of text at a human level - that is, a level sufficient for one 
person to send a message in unstructured NL to a machine and to expect that 
the machine will make important or mission-critical decisions (as well as a 
person would) based on its interpretation, without further human input.  No 
current statistical NL program comes remotely close.  Statistics might be 
pushed to that level, provided that the system can be trained on properly 
semantically annotated text - but that, too, would depend on an accurate 
inventory of word senses.    (08)

[JFS] > Fundamental principle:  People think in *words*, not in *word senses*.
   * Really?  I sure don't.  Without the textual context to disambiguate 
words, communication would be extremely error-prone.  Where does that notion 
come from?    (09)

[JFS]
> Furthermore, the "senses" of similar words in different languages don't line 
>up.
*  In many cases no, and in other cases, yes.  But where it is true, that 
merely indicates that different cultures may emphasize different aspects of 
entities in a largely continuous world.  All such meanings (specifying the 
differences between closely related words) can still be specified with a 
reasonably small inventory of semantic primitives.    (010)

[JFS] > Please note the term 'microsense' coined by Alan Cruse.
 * I am aware of that notion, but I still find that virtually everything that 
I have an interest in communicating or learning can be described by a discrete 
set of necessary properties in an ontology (though rarely by both necessary 
and sufficient conditions).  The lack of "sufficient" properties alongside the 
necessary ones provides a lot of wiggle room, so that various entities (those 
"microsenses") may be shoehorned into the same category.  **BUT** although the 
speaker (or writer) may have something in mind more detailed than the generic 
entity described by the necessary properties, all the listener can or will 
understand is the necessary properties associated with a word sense, unless 
the context makes clear that a more specific entity is intended.  When people 
use a word with many "microsenses" without disambiguating elaboration, the 
listener will typically understand only the generic meaning and leave the 
details unspecified, because, unless the speaker is intending to confuse, the 
details are not needed to understand the intended meaning, i.e. it doesn't 
matter what subvariety of entity is involved.  If someone said that a dog bit 
her ear, I wouldn't make any assumption about what kind of dog it was, and I 
would assume from the lack of specificity that the "microsense" was irrelevant 
to the idea she intended to convey - if it were relevant, she would specify 
it.  The notion of "microsenses" is reasonable as a theoretical concept, but 
it is rarely important in cooperative communication.    (011)

Pat    (012)

Patrick Cassidy
MICRA Inc.
cassidy@xxxxxxxxx
1-908-561-3416    (013)


-----Original Message-----
From: ontolog-forum-bounces@xxxxxxxxxxxxxxxx 
[mailto:ontolog-forum-bounces@xxxxxxxxxxxxxxxx] On Behalf Of John F Sowa
Sent: Thursday, August 01, 2013 9:33 AM
To: ontolog-forum@xxxxxxxxxxxxxxxx
Subject: Re: [ontolog-forum] Context and Inter-annotator agreement    (014)

Pat,    (015)

JFS
>> But it is possible, for texts where clarity and precision are 
>> critical, for the author to use tools that can help detect ambiguity, 
>> avoid words that could be problematical, and suggest simpler syntax 
>> for phrases that are overly complex.    (016)

PC
> Yes, that was the intended implication.  But what I had in mind goes 
> beyond use of controlled natural language, useful as that is.    (017)

The Boeing Language Checker was designed for producing controlled NLs.
But I cited it because the same techniques can be adapted to generate any 
formal logic or KR language.    (018)

All methods of knowledge acquisition depend on NL input.  A controlled NL is 
just a stage on a continuum between language and logic.    (019)

JFS
>> what do you mean by "distinguishable" meanings    (020)

PC
> Those meanings that can be reliably distinguished (>98%) by motivated 
> (rewarded for accuracy)  human annotators.    (021)

There are no such meanings -- except in very special cases.  The following note 
by Adam K. is an example where the human annotators reached a level of 99.4% -- 
but the choice is unusually well defined.    (022)

PC
> this [requires] a progressive iterative effort to develop (at least 
> one) NLU program and a set of senses that it can understand so as to 
> achieve human-level interpretation of a broad range of texts.    (023)

Unfortunately, there is no finite "set of senses" that can be used to achieve 
"human-level interpretation of a broad range of texts."    (024)

MT researchers have been working for over 60 years (since 1950) on the task of 
designing an Interlingua that could be used for automated translation from any 
NL to any other NL.  All such attempts have failed.    (025)

For the past half century, the most successful MT system has been Systran, 
which is based on the Georgetown Automatic Translator (GAT), for which research 
was terminated in 1963.  It is based on hand-coded word and phrase pairs for 
each of the language pairs it handles.    (026)

Over the years, the developers built up millions of such pairs, but no clearly 
defined set of "senses".  It achieved its success purely by brute force, and it 
is still in use as Babelfish.    (027)

Google uses a similar brute-force method with the word and phrase pairs chosen 
by statistics.  They don't have any set of "senses".    (028)

Fundamental principle:  People think in *words*, not in *word senses*.    (029)

The senses found in dictionaries are based on the examples in whatever 
citations were used by the lexicographers who wrote the definitions.    (030)

There is a very "long tail" to that distribution:  the more citations you 
collect, the more senses you get.  It doesn't converge because people are 
constantly using words in new "senses".    (031)

Furthermore, the "senses" of similar words in different languages don't line 
up.  That's why Systran and Google match phrases, not individual words.  Even 
then, there is a high error rate, because the patterns are often split across 
different parts of a sentence or across neighboring sentences.    (032)
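
A toy sketch (purely illustrative, not Systran's or Google's actual method, 
and all names are hypothetical) of translating by greedy longest-match lookup 
in a phrase table shows the sense in which such systems match phrases rather 
than individual words or "senses":

    def translate(tokens, phrase_table, max_len=4):
        """tokens: source-language words; phrase_table maps tuples of source
        words to target phrases.  Unknown words are passed through unchanged.
        E.g. translate("guten Tag".split(), {("guten", "Tag"): "good day"})."""
        out, i = [], 0
        while i < len(tokens):
            # Try the longest phrase starting at position i first.
            for span in range(min(max_len, len(tokens) - i), 0, -1):
                phrase = tuple(tokens[i:i + span])
                if phrase in phrase_table:
                    out.append(phrase_table[phrase])
                    i += span
                    break
            else:  # no phrase matched: copy the word through untranslated
                out.append(tokens[i])
                i += 1
        return ' '.join(out)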

Please note the term 'microsense' coined by Alan Cruse.  He learned from long, 
hard experience that senses vary by small increments even in very similar 
documents.  See http://www.jfsowa.com/talks/goal.pdf    (033)

John    (034)

-------- Original Message --------
Subject: Re: [Corpora-List] WSD / # WordNet senses / Mechanical Turk
Date:   Tue, 16 Jul 2013 14:40:50 +0100
From:   Adam Kilgarriff <adam@xxxxxxxxxxxxxxxxxx>
To:     Benjamin Van Durme <vandurme@xxxxxxxxxx>
CC:     corpora@xxxxxx    (035)

Re: the 0.994 accuracy result reported by Snow et al: there was precisely one 
word used for this task, 'president', with the 3-way ambiguity between    (036)

     1) executive officer of a firm, corporation, or university
     2) head of a country (other than the U.S.)
     3) head of the U.S., President of the United States    (037)

Open a dictionary at random and you'll see that most polysemy isn't like that.  
The result, based on one word, provides no insight into the difficulty of the 
WSD task.    (038)

Adam    (039)

On 16 July 2013 13:32, Benjamin Van Durme <vandurme@xxxxxxxxxx> wrote:    (040)

Rion Snow, Brendan O'Connor, Daniel Jurafsky and Andrew Y. Ng. Cheap and Fast - 
But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. 
EMNLP 2008.
http://ai.stanford.edu/~rion/papers/amt_emnlp08.pdf    (041)

"We collect 10 annotations for each of 177 examples of the noun 
'€œpresident' for the three senses given in SemEval. [...] performing 
simple majority voting (with random tie-breaking) over annotators results in a 
rapid accuracy plateau at a very high rate of
0.994 accuracy.  In fact, further analysis reveals that there was only a single 
disagreement between the averaged non-expert vote and the gold standard; on 
inspection it was observed that the annotators voted strongly against the 
original gold label (9-to-1 against), and that it was in fact found to be an 
error in the original gold standard annotation.  After correcting this error, 
the non-expert accuracy rate is 100% on the 177 examples in this task. This is 
a specific example where non-expert annotations can be used to correct expert 
annotations."    (042)






     Xuchen Yao, Benjamin Van Durme and Chris Callison-Burch. Expectations
     of Word Sense in Parallel Corpora. NAACL Short. 2012.
     http://cs.jhu.edu/~vandurme/papers/YaoVanDurmeCallison-BurchNAACL12.pdf    (043)


     "2 Turker Reliability    (044)

     While Amazon's Mechanical Turk (MTurk) has been considered in the
     past for constructing lexical semantic resources (e.g., (Snow et al.,
     2008; Akkaya et al., 2010; Parent and Eskenazi, 2010; Rumshisky,
     2011)), word sense annotation is sensitive to subjectivity and
     usually achieves low agreement rates even among experts. Thus we
     first asked Turkers to re-annotate a sample of existing gold-standard
     data. With an eye towards cost saving, we also considered how many
     Turkers would be needed per item to produce results of sufficient
     quality.    (045)

     Turkers were presented sentences from the test portion of the word
     sense induction task of SemEval-2007 (Agirre and Soroa, 2007),
     covering 2,559 instances of 35 nouns, expert-annotated with OntoNotes
     (Hovy et al., 2006) senses.  [...]    (046)

     We measure inter-coder agreement using Krippendorff's Alpha
     (Krippendorff, 2004; Artstein and Poesio, 2008), [...]"    (047)
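
For reference, a minimal sketch of Krippendorff's Alpha for nominal labels 
such as sense tags (not the paper's implementation; the data layout assumed 
here -- one list of annotator labels per instance -- is illustrative):

    from collections import Counter
    from itertools import permutations

    def krippendorff_alpha_nominal(items):
        """items: iterable of label lists, one list per annotated instance."""
        coincidences = Counter()             # o_ck: coincidence matrix
        for labels in items:
            m = len(labels)
            if m < 2:
                continue                     # unpairable items are skipped
            for c, k in permutations(labels, 2):
                coincidences[(c, k)] += 1.0 / (m - 1)

        n = sum(coincidences.values())       # total number of pairable values
        marginals = Counter()
        for (c, _k), weight in coincidences.items():
            marginals[c] += weight

        observed = sum(w for (c, k), w in coincidences.items() if c != k)
        expected = sum(marginals[c] * marginals[k]
                       for c in marginals for k in marginals if c != k) / (n - 1)
        return 1.0 - observed / expected if expected else 1.0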




_________________________________________________________________
Message Archives: http://ontolog.cim3.net/forum/ontolog-forum/  
Config Subscr: http://ontolog.cim3.net/mailman/listinfo/ontolog-forum/  
Unsubscribe: mailto:ontolog-forum-leave@xxxxxxxxxxxxxxxx
Shared Files: http://ontolog.cim3.net/file/
Community Wiki: http://ontolog.cim3.net/wiki/ 
To join: http://ontolog.cim3.net/cgi-bin/wiki.pl?WikiHomePage#nid1J    (049)
