Re: [ontolog-forum] Context and Inter-annotator agreement

To: ontolog-forum@xxxxxxxxxxxxxxxx
From: John F Sowa <sowa@xxxxxxxxxxx>
Date: Thu, 01 Aug 2013 09:33:18 -0400
Message-id: <51FA639E.3020007@xxxxxxxxxxx>
Pat,    (01)

>> But it is possible, for texts where clarity and precision are
>> critical, for the author to use tools that can help detect ambiguity,
>> avoid words that could be problematical, and suggest simpler syntax
>> for phrases that are overly complex.    (02)

> Yes, that was the intended implication.  But what I had in mind goes
> beyond use of controlled natural language, useful as that is.    (03)

The Boeing Language Checker was designed for producing controlled NLs.
But I cited it because the same techniques can be adapted to generate
any formal logic or KR language.    (04)

All methods of knowledge acquisition depend on NL input.  A controlled
NL is just a stage on a continuum between language and logic.    (05)

>> what do you mean by "distinguishable" meanings    (06)

> Those meanings that can be reliably distinguished (>98%)
> by motivated (rewarded for accuracy)  human annotators.    (07)

There are no such meanings -- except in very special cases.  The
following note by Adam K. is an example where the human annotators
reached a level of 99.4% -- but the choice is unusually well defined.    (08)

> this [requires] a progressive iterative effort to develop (at least one)
> NLU program and a set of senses that it can understand so as to achieve
> human-level interpretation of a broad range of texts.    (09)

Unfortunately, there is no finite "set of senses" that can be used
to achieve "human-level interpretation of a broad range of texts."    (010)

MT researchers have been working for over 60 years (since 1950)
on the task of designing an Interlingua that could be used for
automated translation from any NL to any other NL.  All such
attempts have failed.    (011)

For the past half century, the most successful MT system has been
Systran, which grew out of the Georgetown Automatic Translator (GAT),
whose research program was terminated in 1963.  It is based on hand-coded
word and phrase pairs for each of the language pairs it handles.    (012)

Over the years, the developers built up millions of such pairs,
but no clearly defined set of "senses".  It achieved its success
purely by brute force, and it is still in use as Babelfish.    (013)

Google uses a similar brute-force method with the word and phrase
pairs chosen by statistics.  They don't have any set of "senses".    (014)

Fundamental principle:  People think in *words*, not in *word senses*.    (015)

The senses found in dictionaries are based on the examples in whatever
citations were used by the lexicographers who wrote the definitions.    (016)

There is a very "long tail" to that distribution:  the more citations
you collect, the more senses you get.  It doesn't converge because
people are constantly using words in new "senses".    (017)

Furthermore, the "senses" of similar words in different languages
don't line up.  That's why Systran and Google match phrases, not
individual words.  Even then, there is a high error rate because
the patterns are often split in different parts of the sentence
or in neighboring sentences.    (018)

Please note the term 'microsense', coined by Alan Cruse.  He learned from
long, hard experience that senses vary by small increments even
in very similar documents.  See http://www.jfsowa.com/talks/goal.pdf    (019)

John    (020)

-------- Original Message --------
Subject: Re: [Corpora-List] WSD / # WordNet senses / Mechanical Turk
Date:   Tue, 16 Jul 2013 14:40:50 +0100
From:   Adam Kilgarriff <adam@xxxxxxxxxxxxxxxxxx>
To:     Benjamin Van Durme <vandurme@xxxxxxxxxx>
CC:     corpora@xxxxxx    (021)

Re: the 0.994 accuracy result reported by Snow et al: there was
precisely one word used for this task, 'president',
with the 3-way ambiguity between    (022)

     1) executive officer of a firm, corporation, or university
     2) head of a country (other than the U.S.)
     3) head of the U.S., President of the United States    (023)

Open a dictionary at random and you'll see that most polysemy isn't
like that.  The result, based on one word, provides no insight into
the difficulty of the WSD task.    (024)

Adam    (025)

On 16 July 2013 13:32, Benjamin Van Durme <vandurme@xxxxxxxxxx> wrote:    (026)

Rion Snow, Brendan O'Connor, Daniel Jurafsky and Andrew Y. Ng. Cheap
and Fast - But is it Good? Evaluating Non-Expert Annotations for
Natural Language Tasks. EMNLP 2008.
http://ai.stanford.edu/~rion/papers/amt_emnlp08.pdf    (027)

"We collect 10 annotations for each of 177 examples of the noun
'president' for the three senses given in SemEval. [...]
performing simple majority voting (with random tie-breaking) over
annotators results in a rapid accuracy plateau at a very high rate of
0.994 accuracy.  In fact, further analysis reveals that there was only
a single disagreement between the averaged non-expert vote and the
gold standard; on inspection it was observed that the annotators voted
strongly against the original gold label (9-to-1 against), and that
it was in fact found to be an error in the original gold standard
annotation.  After correcting this error, the non-expert accuracy rate
is 100% on the 177 examples in this task. This is a specific example
where non-expert annotations can be used to correct expert annotations."    (028)
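The voting scheme Snow et al. describe is simple to state in code.  As a
rough illustration (my own sketch, not code from their paper), majority
voting with random tie-breaking over multiple annotators, scored against
a gold standard, looks like this:

```python
import random
from collections import Counter

def majority_vote(labels, rng=None):
    """Return the most frequent label; break ties at random."""
    rng = rng or random.Random(0)
    counts = Counter(labels)
    top = max(counts.values())
    winners = sorted(lab for lab, c in counts.items() if c == top)
    return rng.choice(winners)

def vote_accuracy(annotations, gold, rng=None):
    """Fraction of items where the voted label matches the gold label.

    `annotations` is a list of per-item label lists (one label per
    annotator); `gold` is the list of gold-standard labels.
    """
    rng = rng or random.Random(0)
    hits = sum(majority_vote(labels, rng) == g
               for labels, g in zip(annotations, gold))
    return hits / len(gold)
```

With 10 annotators per item, as in the study, a 9-to-1 split against the
gold label (their one "disagreement") is exactly the case where this
voting scheme overrules the expert annotation.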

     Xuchen Yao, Benjamin Van Durme and Chris Callison-Burch. Expectations
     of Word Sense in Parallel Corpora. NAACL Short. 2012.
     http://cs.jhu.edu/~vandurme/papers/YaoVanDurmeCallison-BurchNAACL12.pdf    (029)

     "2 Turker Reliability    (030)

     While Amazon’s Mechanical Turk (MTurk) has been considered in the
     past for constructing lexical semantic resources (e.g., (Snow et al.,
     2008; Akkaya et al., 2010; Parent and Eskenazi, 2010; Rumshisky,
     2011)), word sense annotation is sensitive to subjectivity and
     usually achieves low agreement rate even among experts. Thus we
     first asked Turkers to re-annotate a sample of existing gold-standard
     data. With an eye towards costs saving, we also considered how many
     Turkers would be needed per item to produce results of sufficient
     quality.    (031)

     Turkers were presented sentences from the test portion of the word
     sense induction task of SemEval-2007 (Agirre and Soroa, 2007),
     covering 2,559 instances of 35 nouns, expert-annotated with OntoNotes
     (Hovy et al., 2006) senses. [...]    (032)

     We measure inter-coder agreement using Krippendorff’s Alpha
     (Krippendorff, 2004; Artstein and Poesio, 2008), [...]"    (033)
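Since Krippendorff's Alpha is the agreement measure cited above, here is
a minimal sketch of the coefficient for nominal data (my own
illustration, not code from the paper).  It computes alpha = 1 - D_o/D_e
from a coincidence matrix, where D_o is observed and D_e is expected
disagreement:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of lists: the labels assigned to each item by its
    coders.  Items with fewer than two labels are skipped (not pairable).
    """
    coincidences = Counter()  # ordered label-pair counts
    n = 0                     # total number of pairable values
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)
        n += m
    if n <= 1:
        return None
    totals = Counter()        # marginal count per label
    for (a, _), count in coincidences.items():
        totals[a] += count
    d_o = sum(c for (a, b), c in coincidences.items() if a != b)
    d_e = sum(totals[a] * totals[b]
              for a in totals for b in totals if a != b)
    if d_e == 0:
        return 1.0            # only one label ever used
    return 1.0 - (n - 1) * d_o / d_e
```

Perfect agreement yields 1.0, chance-level agreement 0.0, and systematic
disagreement a negative value, which is why the measure is a stricter
test than raw percent agreement.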

Message Archives: http://ontolog.cim3.net/forum/ontolog-forum/  
Config Subscr: http://ontolog.cim3.net/mailman/listinfo/ontolog-forum/  
Unsubscribe: mailto:ontolog-forum-leave@xxxxxxxxxxxxxxxx
Shared Files: http://ontolog.cim3.net/file/
Community Wiki: http://ontolog.cim3.net/wiki/ 
To join: http://ontolog.cim3.net/cgi-bin/wiki.pl?WikiHomePage#nid1J    (034)
