
Re: [ontolog-forum] [ontology-summit] Estimating number of all known fac

To: ontolog-forum@xxxxxxxxxxxxxxxx
From: John Bottoms <john@xxxxxxxxxxxxxxxxxxxx>
Date: Thu, 24 May 2012 15:45:32 -0400
Message-id: <4FBE8FDC.8050506@xxxxxxxxxxxxxxxxxxxx>
MatthewL,

Your original questions were very broad in scope. They appear to be Fermi questions, which can be answered, if at all, only with estimates based on stated assumptions. The DIK pyramid suggests that data, information and knowledge may all be considered facts with some relevance.
(http://en.wikipedia.org/wiki/DIKW)

You seem to have a good idea of what constitutes a fact, and no one can dispute your interests. When you address a community, the answers will vary by participant. Also, your facts below mix physical facts with conceptual facts, which is fine but broadens the scope of your questions. Given how broad your query is, I believe it is appropriate to drill down and qualify what a fact is. If you could share the basis for your questions, it might help.

Your questions have been of significant interest to me, although it has been some time since I looked at them. In response to your question, there is little scholarship in this area as far as I know. But the field is so large that it is difficult to track the work. In addition, the people who know how much information is stored on websites do not share that information freely.

Google's existing dataset is on the order of 1 Petabyte, but I can't verify that number.
(http://www.quirkeysolutions.com/index.php?option=com_content&view=article&id=103:how-big-is-google-s-database&catid=21&Itemid=224)

I know that at Harvard they are working with satellite systems storing 1.4 TB per day. Are those facts? It is not uncommon to see new data numbers of this magnitude for other systems.

Most of the research in this area traces back to a paper by Blair and Maron. Salton also wrote on the subject, but neither of these works has been updated. Blair and Maron's findings were associated with IBM's STAIRS product and related to the query response rates of large document sets. Their corpus was on the order of 40,000 documents. Their finding was that what a search retrieves is a function of how broadly the query is stated. It appears that there is a power curve associated with the distribution of responses to queries of various specificities.
(http://deepblue.lib.umich.edu/bitstream/2027.42/28883/1/0000719.pdf)

You will find references to Salton's papers in this document. His book is "Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer".

Returning to the question of how to respond to the Fermi question, there may be ways to bound estimates based on the size of document sets. If the facts to which you refer are of value at the university level, and you assume universities share a common core set of facts, then you might be able to track the changes in the size or budgets of libraries. This would need to be done for corporate libraries as well. Then it would be necessary to do the same for online documents and private libraries, including those of individual professionals. This would be a significant undertaking. You might check the resulting size estimate against the size of Google's online holdings. You would also need to determine what percentage of that material is facts.

Another approach would be to estimate how many facts are created each day or year. Again, relevance is important to weed out the insignificant and unimportant facts.
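
Both approaches can be sketched as a small Fermi calculation. The Python below is only an illustration of the method: every number in it is a placeholder guess rather than a measurement, and the names (Range, documents, facts_per_document, distinct_fraction, new_documents_per_year) are hypothetical. Each guess carries a lower and an upper bound, the bounds are multiplied through, and the geometric mean of the result serves as the point estimate.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Range:
    """A guessed positive quantity with a lower and an upper bound."""
    low: float
    high: float

    def __mul__(self, other: "Range") -> "Range":
        # Valid for positive quantities only: lows multiply, highs multiply.
        return Range(self.low * other.low, self.high * other.high)

    def geometric_mean(self) -> float:
        # Usual Fermi point estimate when the bounds span orders of magnitude.
        return sqrt(self.low * self.high)


# ASSUMPTIONS: all figures below are placeholders, not data.
documents = Range(1e9, 1e11)            # documents in all libraries and online
facts_per_document = Range(1, 100)      # statements that qualify as facts
distinct_fraction = Range(0.001, 0.1)   # share not duplicated elsewhere

# Approach 1: bound the existing stock of facts from document counts.
stock = documents * facts_per_document * distinct_fraction

# Approach 2: bound the yearly flow of new facts from new documents.
new_documents_per_year = Range(1e8, 1e10)
flow = new_documents_per_year * facts_per_document * distinct_fraction

for label, estimate in (("stock", stock), ("flow per year", flow)):
    print(f"{label}: between {estimate.low:.1e} and {estimate.high:.1e}, "
          f"point estimate {estimate.geometric_mean():.1e}")
```

The width of the resulting interval shows how much the answer depends on the definition of a fact: tightening the bounds on facts_per_document and distinct_fraction narrows the estimate far more than refining the document count does.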

-John Bottoms
 FirstStar Systems
 Concord, MA USA

On 5/24/2012 2:27 PM, matthew lange wrote:
I am really feeling like my thread has been hijacked by people who like to read their own writing:> conjecture. I have purposefully avoided quoting any one person--but you know who you are.

Perhaps folks are afraid to read/respond to my real-world examples of facts, or did my propositions just get lost in the list mud?

Here again are some examples of facts, I would be delighted if someone would attempt to bound factual knowledge so that they could be quantified--or otherwise provide succinct reasons about why my examples are not facts.
Fact examples:
  1. The earth revolves around the sun.
  2. The Greek letter Pi represents the irrational number that is the ratio between a circle's circumference and diameter.
  3. A calorie is the amount of energy it takes to raise the temperature of 1cc of water 1 deg. C at sea level.
  4. Chemical X contains Y calories of available energy. (of course substituting where appropriate)
Are these not facts? Are they not countable?
Again, aside from bending the space-time continuum, or dismissing laws of nature like thermodynamics...I fail to see the need for relativism here...or, what am I missing? If you agree that these are facts, then let's get pragmatic and enumerate the properties/boundaries around the nature of a fact.

Also, I must express my displeasure with several members' netiquette on this list:
1) In addition to Mr. West, my name is also Matthew (this is a FACT)--please use unambiguous identifiers in responses
2) Spell Check is courteous--not a fact, but perhaps an opinion shared by many--one or two misspelled words I can understand...but some of these posts are ridiculous.

Best,

~mc





 
_________________________________________________________________
Message Archives: http://ontolog.cim3.net/forum/ontolog-forum/  
Config Subscr: http://ontolog.cim3.net/mailman/listinfo/ontolog-forum/  
Unsubscribe: mailto:ontolog-forum-leave@xxxxxxxxxxxxxxxx
Shared Files: http://ontolog.cim3.net/file/
Community Wiki: http://ontolog.cim3.net/wiki/ 
To join: http://ontolog.cim3.net/cgi-bin/wiki.pl?WikiHomePage#nid1J
 

