I'm looking for a lossless compression text format that finds repeated
words or patterns in a text and stores them in a dictionary. In the body of
the text the words/patterns are 'transcluded,' to use a new word, by reference
to the dictionary. In other words, if you have the word 'ontology' repeated in
a text (or web/wiki page/site) 1000 times - you only write it out once
in the dictionary. The text body holds just the ID# of the word 'ontology.'
When you read the text, the word is there (transcluded), not the ID# of
course. If you change the word in the dictionary, every location of that word
in the text is changed.
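
The scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an existing format: the function names (compress, expand), the word-splitting rule, and the min_count threshold are all hypothetical. Repeated words go into a dictionary once, the body stores integer IDs in their place, and expanding the body substitutes the current dictionary entries back in.

```python
import re

# Assumption: a "word" is a run of \w characters; everything between words
# (spaces, punctuation) is kept as a literal so the round trip is lossless.
TOKEN = re.compile(r"\w+|\W+")
WORD = re.compile(r"\w+")


def compress(text, min_count=2):
    """Return (body, dictionary).

    body is a list whose items are either int IDs into dictionary
    (for words seen at least min_count times) or literal strings.
    """
    tokens = TOKEN.findall(text)
    counts = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1

    dictionary = []  # ID -> word
    ids = {}         # word -> ID
    body = []
    for tok in tokens:
        if WORD.fullmatch(tok) and counts[tok] >= min_count:
            if tok not in ids:
                ids[tok] = len(dictionary)
                dictionary.append(tok)
            body.append(ids[tok])  # store the reference, not the word
        else:
            body.append(tok)       # rare words and separators stay literal
    return body, dictionary


def expand(body, dictionary):
    """Rebuild the text by transcluding dictionary entries for each ID."""
    return "".join(dictionary[t] if isinstance(t, int) else t for t in body)


text = "ontology is fun; ontology is hard"
body, dictionary = compress(text)
assert expand(body, dictionary) == text  # lossless round trip

# Editing the dictionary entry changes every occurrence in the expanded text.
dictionary[dictionary.index("ontology")] = "taxonomy"
print(expand(body, dictionary))
```

Storing IDs only for words that repeat keeps the dictionary small; whether this actually saves space depends on how the IDs are encoded, which the sketch leaves open.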
[snip]
I've got something similar on my to-do list, but I was only thinking about
pulling out proper nouns (as instances). The background is here:
and elsewhere - links here
I personally think there's loads of machine learning/statistical
stuff that could be of benefit in the less scruffy ontology world.
('SemText' is a little side project I've got lined up).
Sorry for the late night thoughts.
Keep 'em coming!
Cheers,
Danny.