
Re: [ontolog-forum] Average Daily Word Exposure

To: "[ontolog-forum]" <ontolog-forum@xxxxxxxxxxxxxxxx>
From: Ali Hashemi <ali@xxxxxxxxx>
Date: Wed, 18 Aug 2010 10:22:07 -0400
Message-id: <AANLkTikYU9GaNr=VhHc3ng6RtonkHB-euGwHh9UMywi3@xxxxxxxxxxxxxx>

Thank you kindly for that pointer. I should keep in mind in the future that there is likely a listserv dedicated to almost any question I might have :P.

In case anyone is interested,

Here is a collation of the responses. I should note that the experiment I was bandying about is actually being carried out under the moniker "Human Speechome Project" - albeit geared towards babies:

While there doesn't seem to be one conclusive or even complete study on this question, piecing together various studies, I think, yields a decent enough estimate.

See the blog posts below for a fairly thorough (and sourced!) analysis. A rough back-of-the-envelope calculation indicates that, taking conversation, television and reading into account, an average North American adult is exposed to approximately 2 million word tokens per month. This doesn't account for writing, 'ambient' words, etc. But basically, it suggests that we hit the 100 million word token count every ~3-5 years.
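As a quick check on that last figure, here is a minimal sketch of the arithmetic, assuming a constant monthly rate. The function name is only illustrative; the two numbers are the ones quoted above.

```python
def years_to_reach(target_tokens: float, tokens_per_month: float) -> float:
    """Years needed to accumulate target_tokens at a constant monthly rate."""
    return target_tokens / tokens_per_month / 12


# 100 million tokens at roughly 2 million tokens per month
years = years_to_reach(100_000_000, 2_000_000)
print(f"{years:.1f} years")  # 4.2 years
```

At 2 million tokens a month this lands inside the ~3-5 year range; a somewhat higher or lower monthly rate moves it towards either end.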


Brett Reynolds:

All I can offer you is my own back-of-the-envelope calculation here:




A comment on his post also led to this very interesting blog (with relevant information):


On Tue, Aug 17, 2010 at 12:49 PM, Marco Baroni <marco.baroni@xxxxxxxx> wrote:
Hi there.

I asked a similar question a few years ago, without much success. I paste the summary below.

If you find out more, please keep me posted!



Dear all,

Two weeks ago I asked if somebody knew of work reporting estimates of how
many words/sentences/etc. (adult) speakers of a language hear/write.

I paste below the responses I got.

Thanks a lot to all who responded!



Reinhard Rapp

Dear Marco,

I am also interested in the answer to your question. Some discussion
can be found in a Psychological Review paper by Landauer & Dumais
(1997) which is on the web at


This is a citation from the most relevant part, which is footnote 6:

----------- start citation ------------

From his log-normal model of word frequency distribution and the
observations in Carroll et al. (1971), Carroll estimated a total
vocabulary of 609,000 words in the
universe of text to which students through high school might be exposed.
Dahl (1979), whose distribution function agrees with a different but
smaller sample of Howes (1966), found 17,871 word types in 1,058,888 tokens
of spoken American English, compared to 50,406 in the comparable sized
adult sample of Kucera & Francis (1967). By Carroll's (1971) model, Dahl's
data imply a total of roughly 150,000 word types in spoken English, thus
approximately one-fourth the total, less to the extent that there are
spoken words that do not appear in print. Moreover, the ratio of spoken to
printed words to which a particular individual is exposed must be even more
lopsided because local, ethnic and family usage undoubtedly restrict the
variety of vocabulary more than published works intended for the general
school-aged readership.
If we assume that our seventh-grader has met a total of 50 million word
tokens of spoken English (140 minutes a day at 100 words per minute for 10
years) then the expected number of occasions on which she would have
heard a spoken word of mean frequency would be about 370. Carroll's
estimate for the total vocabulary of seventh grade texts is 280,000, and we
estimate below that the typical student would have read about 3.8 million
words of print. Thus, the mean number of times she would have seen a
printed word to which she might be exposed is only about 14. The rest of
the frequency distributions for heard and seen words, while not
proportional, would, at every point, show that spoken words have already
had much greater opportunity to be learned than printed words, so will
profit much less from an additional occurrence.

----------- end citation ------------
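The arithmetic in the footnote can be roughly retraced as follows. This is only a sketch that divides tokens by types, assuming 365 days per year; the constant names are illustrative, and the inputs are the figures quoted above.

```python
# Inputs quoted in the footnote above.
MINUTES_PER_DAY = 140
WORDS_PER_MINUTE = 100
YEARS = 10
SPOKEN_TYPES = 150_000      # estimated word types in spoken English
PRINT_TOKENS = 3_800_000    # words of print read by the typical student
PRINT_TYPES = 280_000       # vocabulary of seventh-grade texts

# Spoken tokens heard over ten years (assumes 365 days per year).
spoken_tokens = MINUTES_PER_DAY * WORDS_PER_MINUTE * 365 * YEARS

# Naive mean exposures per word type: tokens divided by types.
mean_spoken = spoken_tokens / SPOKEN_TYPES
mean_print = PRINT_TOKENS / PRINT_TYPES

print(spoken_tokens)       # 51100000
print(round(mean_spoken))  # 341
print(round(mean_print))   # 14
```

The token total and the figure for print match the footnote (about 50 million and about 14). The naive spoken average comes out somewhat below the quoted 370, so the footnote presumably rests on a calculation that this simple division does not reproduce.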


With kind regards,


Paula Newman

That's an interesting question.  A little googling suggested that a lower
bound might come from data on the average number of hours of TV watching
per adult  (multiplied by  average words per minute on TV broadcasts).

Paul Bennett

Geoffrey Pullum and Barbara Scholz (in Linguistic Review 19, 2002, p. 44) cite
evidence that by the age of three a child in a professional household might
have heard 30 million word tokens (but far fewer for children in other social
classes). I know this relates to children rather than adults, but presumably
the amount of language heard does not differ much by age.

Their source is B. Hart and T. Risley: Meaningful Differences in the Everyday
Experiences of Young Children (Paul H Brookes, 1995). I haven't read this, but
I guess this would be a place to look for more information.

Paul Bennett

Ilana Bromberg


There is some information regarding how much school-age children (up
through HS I think) read in the following article.  It's possible that some
of the sources they cite may have more information about adults.

Landauer, Thomas K and Dumais, Susan T.  1997.  A Solution to Plato's
Problem: The Latent Semantic Analysis Theory of the Acquisition, Induction,
and Representation of Knowledge.  Psychological Review, 104:2, 211-240.

Good luck,

* Then, there was somebody who wanted to remain anonymous, who answered:

I was interested in your query to the list, but had nothing scientific to offer. Nevertheless, for many years I have had to make estimates of how much of a person's experience of language is represented by a corpus of such-and-such a size.  It has been necessary to wow the public by suggesting that a query to EDIT scans several years of an individual's language experience, and, on the other hand, to convince sponsors that even half a billion words is just chickenfeed compared with the amount of text produced in a speech community.

In EDIT 15 years ago we established a monitor corpus with 100 million words of The Times, and discovered that the weekly output of that paper, including The Sunday Times, was over half a million words.  Genuine neologisms, and not just trivial variations or proper names, were coming in at around a dozen every day. But of course not even the most devoted reader gets through anything like the whole paper.

Back when I was doing discourse analysis I read somewhere that speech is produced at an average of 1500 clauses an hour, and in speech, by my calculations at the time, a clause seemed to average 5-6 words.  I imagine that reading is not very different from that, maybe towards the faster end, but I haven't checked. Then you have to guess how many hours, on average, people are engaged in communicative activity, which I put at 12 hours.  1500 x 6 x 12 gives an estimate of 108,000 words daily, or 39,420,000 annually.

If you are suspicious about any of the assumptions, you can just change them.
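To make the assumptions easy to change, the estimate above can be written as a small function. This is a minimal sketch: the function and parameter names are illustrative, and the defaults are the figures given in the message.

```python
def daily_words(clauses_per_hour=1500, words_per_clause=6, hours_per_day=12):
    """Estimated words encountered per day under the stated assumptions."""
    return clauses_per_hour * words_per_clause * hours_per_day


daily = daily_words()
print(daily)        # 108000
print(daily * 365)  # 39420000

# Changing one assumption, e.g. 5 words per clause instead of 6:
print(daily_words(words_per_clause=5))  # 90000
```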


(•`'·.¸(`'·.¸(•)¸.·'´)¸.·'´•) .,.,

On Fri, Aug 13, 2010 at 9:14 AM, John F. Sowa <sowa@xxxxxxxxxxx> wrote:
On 8/12/2010 10:43 AM, Ali Hashemi wrote:
> I'm looking for the average number of words that an average western
> adult is exposed to daily.

Corpora list is the one that is most likely to have subscribers that
have actually investigated questions like that.  To subscribe, see


It's a very active list, which I sometimes look at.  If you choose
to subscribe, it's necessary to shunt their messages off to a special
folder to keep them from overwhelming your mailbox.

If you send your question to that list, it would be likely to
trigger a flurry of responses of various kinds, some of which
might be useful.



(•`'·.¸(`'·.¸(•)¸.·'´)¸.·'´•) .,.,

