Hi David,
Thanks for your thoughts on the subject. You
and I have discussed the viability of small vocabularies for application
specific linguistic competence before, though not in this depth.
In my view, the small number of key words
in a specific context (situation) satisfies that requirement without requiring
that a plethora of interpretations has to be understood by the software that
processes patents. This simple fact has been demonstrated by the success
of key word search when there is NO meaning represented, interpreted or even stored
regarding the key word vocabularies.
For example, in analyzing patents, I use
the following two-three dozen frequent words to partition the claim sentence
into claim elements that can be further syntactically analyzed:
A
An
The
Comprise
Comprises
Comprising
Comprised
Each
Either
Every
Exist
Exists
All
And
Or
Plurality
Method
Methods
System
Systems
Said
Wherein
Wherefore
Such
That
and a few others (two-three dozen) that I
don’t remember off hand. But that small number of highly frequent
words are used in that context are adequate. I don’t have to
understand the other words in the claim sentence to be able to organize it into
smaller clumps that can be more easily analyzed.
That small vocabulary of two-three dozen
or so words works for ALL known patent claims on the USPTO web site. That
site must have a high hundred thousand patents, perhaps up to a million patents,
that have each been reviewed multiple times, organized into the PTO database
format, and presented to the viewer with a simple web interface. There is
no need to use other words in partitioning claims into elements other than the two-three
dozen or so I use as a personal standard.
After that, additional rare words have
much more specialized uses. It may require a hundred thousand words of
vocabulary, or even more to represent the vocabulary of ALL patents, but the
slice of two dozen or so is adequate for the claim sentence partitioning task. The
total number of words used in patent claims is huge. That is because
there are typically 10 to 50 claims, averaging about 20 claims as typical.
Consider this claim 13 from the 7,209,923:
I claim: ...
13. A method for extracting contextual
information about a plurality of objects and a plurality of activities from a plurality of texts comprising:
annotating each said text with metadata information; transforming each said text
into one or more database rows that reflect the said annotations through tables and columns of the said database; selecting a number P partitioning columns,
said partitioning columns to be used for partitioning said database rows into
partitions; each said partition comprising zero or more rows, wherein if one or more rows exist, each said row having the same values, or value ranges, in each row and partitioning column in said partition selecting a number R processing columns of
said
database to be used in extracting information from said database; said processing columns comprising
structured columns and unstructured columns; said unstructured columns including one or more columns containing words or phrases expressed in a language; modeling classes and relationships among said plurality of objects and said plurality of activities described by entries in said database; searching the said partitions for said contextual information based
on the modeling; and presenting said contextual information to a user; wherein each of P and R is greater than zero.
Context specific words are much more
numerous. The following 34 words are “rare words” which
describe the technology being claimed for this patent claim 13:
activities
annotating
annotations
classes
column
columns
contextual
database
expressed
extracting
language
metadata
modeling
objects
P
partition
partitioning
partitions
phrases
presenting
R
ranges
reflect
relationships
row
rows
searching
selecting
tables
text
texts
transforming
unstructured
user
Organizing that claim into a structured
form based on the twenty or so frequent claim words, it looks more like this:
Note that the “extracting contextual
information” step means only that, not something that DEFINES each word
in specific terms. That role is up to the patent specification, which
explains the meaning of each word by using each one in sentences in the
specification.
Therefore the twenty word vocabulary is
adequate for constructing context even without defining the semantic meaning of
each word, or even of the twenty! All those words are understood by the
people who write patents, people who review and rule on patent interpretations,
and people who invest in the patents. But the context construed in this
way is construed by the VIEWER, not by the vocabulary itself. That small
vocabulary is a way of constraining claim sentences to meet the requirements of
explaining a method or system in form suitable for readers.
Then, further partitioning the claim
elements into specific references to the patent specification, the viewer shows
how the words are actually used in context. Here is a sample claim chart
for one of those claim elements:
in each row and partitioning column in said partition
selecting a
number R processing columns of said database
to be used in extracting information from said database;
The following referenced sentences in the
specification include, among others:
[D9,58] Additional columns can
be defined, one per row, and added to the Columns
table.
[D10,27] A Primary_Key 301 constraint means that
each row in a table contains distinct values in the combined set of columns
which are collectively labeled as Primary_Key 301 columns for
that table.
[D10,30] Each row has a distinct value in this
set of columns from that of all other rows.
[D10,48] Therefore all columns in
the metadata database named "Domain" 223 satisfy the Foreign_Key constraint
shown in row 2 of the Foreign_Keys table 320.
[D11,17] Eventually, after partitioning has
been iterated sufficiently, partitions
result that are empty of information, or that have only one row in
specified processing columns 402.
[D17,63] The stream of phrases
1203 may come from a single row in a single column, or
it may come from the database nodes 503 that are labeled as being on the marked solution
subtree 504.
[D19,16] Embodied functions can add a row to
a table, delete a row from a table, or change a row in a table.
Simply lexically analyzing the claim
vocabulary is sufficient, as shown above, to extract the full context of a
long, wordy specification into just those references that relate to the rare
words used in the claim, after the two-three dozen words have been used to
partition the claim into elements.
But in any case, the VIEWER (a person)
must interpret even the simplest of these words. Therefore, by John’s
definition, those words comprise rehetoric, not logic!
That definition doesn’t hunt!
It is the viewer ALWYS who interprets the words to have useful meanings.
The next question of just how many
meanings a given word must have is not significant to the task of interpreting
patent specifications. Word definitions are settled in court in a Markman
hearing which puts the proper interpretation into the rare words so that the
two sides can debate the technology issues relevant to the patent.
It isn’t necessary to list the multiplicity
of meanings of rare words to get that task completed satisfactorily.
JMHO,
-Rich
Sincerely,
Rich Cooper
EnglishLogicKernel.com
Rich AT EnglishLogicKernel DOT com
9 4 9 \ 5 2 5 - 5 7 1 2
From:
ontolog-forum-bounces@xxxxxxxxxxxxxxxx
[mailto:ontolog-forum-bounces@xxxxxxxxxxxxxxxx] On Behalf Of David Eddy
Sent: Saturday, June 11, 2011 1:59
PM
To: [ontolog-forum]
Subject: Re: [ontolog-forum] Run,
put, and set
Rich -
On 2011-06-11, at 3:31 PM, Rich Cooper wrote:
In my opinion, there is no abstract language; there is only what you have
designated in a previous email as “rhetoric”. ALL language is
messaging from one person to at least one other person, perhaps even the same
person reasoning with himself over past experience and current issues.
Not all messaging/language is person-to-person. There is also person-to-machine/software.
In the example above, there is only a
machine to process the Word file which contains the patent. That is how
the patent gets represented in the database, how patents get searched and
compared with other patents, and how patent litigation and prosecution are
organized.
I believe that in the person-to-machine "dialogue" the only
concern is to have a unique string of characters within a
"namespace."
That is not the ONLY concern IMHO. And
that, specifically, is something I object to – the SW uses lexical
uniqueness to imply meaning, but I think that use is ineffective for most
applications of linguistically competent software.
A person is using software (that they may or may not understand the
purpose of) & the software makes demands for say name length.
One of the more fun ones is the classic 8.3 style off MS-DOS, which one
can still see in play in people's email addresses. It is not unusual for
someone to work at an organization that still has in place some sort of email
(registration?) system that restricts the left hand name to 8 characters.
With only 8 characters available, things can quickly lurch to the
cryptic.
Agreed, but today’s software is not
limited by storage sizes as MSDOS software was, and 8 character name constraints
are a thing of the past.
Such "artificial" constraints are obviously long past... yet
still in active use.
I would argue these are also a form of abstract language... certainly a
form of language that many people have to wrestle with daily.
I disagree that these example comprise
ABSTRACT language. As I explained above, it is only the lexical scoping
of the claim into elements that is needed to partition the basic claim
sentence. The abstraction of Noun, Adjective, Subject, preposition, etc
need not even be considered for this purpose, yet those are the
fundamental objects which linguists use to process language.
Thanks for your inputs,
-Rich