To: | "[ontolog-forum]" <ontolog-forum@xxxxxxxxxxxxxxxx> |
---|---|
From: | Simon Spero <sesuncedu@xxxxxxxxx> |
Date: | Thu, 11 Oct 2012 17:58:22 -0400 |
Message-id: | <CADE8KM4-N_gWYOiprcxZotNu+XfXYrcG0rVBvqQUQY5G0Uf10g@xxxxxxxxxxxxxx> |
On Thu, Oct 11, 2012 at 12:50 PM, Kingsley Idehen <kidehen@xxxxxxxxxxxxxx> wrote:
I'm not sure that using the word "complexity" in a thread in which P and NPC results are discussed is good for my poor monkey brain. I'm also not sure that John is allowed to both cite worst-case complexity results *and* criticize the direction taken by DAML in the same text :-P.
I'm also not sure if it's better to avoid the word "instance" when referring to triples/quads; it seems more natural to refer to entities, which I would take to be the number of unique subjects. I would refer to the data set as having 52.4 GigaTrips.
The dataset used in the spreadsheet is interesting, but it has some characteristics that may be problematic for benchmarking purposes. On the other hand, these characteristics may accurately reflect properties of LOD-space.
Reification and triples/entity

From the spreadsheet, there are at most 770,973,451 reified statements. There are 770,973,451 rdf:subjects, 756,797,070 rdf:objects, 755,840,432 rdf:predicates, and 755,190,906 entities with type rdf:Statement.
From this I assume (a) rdfs entailment is not in effect, (b) RDF reification is a mess, (c) data is messy, and (d) RDF reification is a mess. There are 52,381,770,554 triples (not distinct-ed, or necessarily unique, but assume snowflake-completeness).
The dataset has 5,275,200,213 distinct subjects/entities (modulo sameAs). This gives an average of ~9.93 triples per entity. Since RDF statements are the second most common entity type in the data set, are of low arity, and are a bit of a meta-distraction in this context, it's worth netting them out. I'll assume that Statement entities only have S/P/O & type asserted of them; this is probably false, but I would also net out all provenance info that was cloned during triplification-reification.
Without reified statements, we have 52,381,770,554 - (770,973,451 + 756,797,070 + 755,840,432 + 755,190,906) = 49,342,968,695 triples. There are 5,275,200,213 - 755,190,906 = 4,520,009,307 entities.
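This gives a ratio of ~10.92 triples/entity. There are also 122,811,677 instances of Seq, with presumably 3 * 122,811,677 = 368,435,031 triples. Netting those out as well leaves 49,342,968,695 - 368,435,031 = 48,974,533,664 triples and 4,520,009,307 - 122,811,677 = 4,397,197,630 entities, for a ratio of 48,974,533,664 / 4,397,197,630 = ~11.14 triples/entity.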
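The netting-out arithmetic above can be re-run as a short Python sketch. The counts are the ones quoted from the spreadsheet; the variable names are only illustrative, and the figure of 3 triples per Seq is the same presumption made in the text.

```python
# Counts quoted in this message (taken from the spreadsheet).
TOTAL_TRIPLES = 52_381_770_554
TOTAL_ENTITIES = 5_275_200_213   # distinct subjects, modulo sameAs
RDF_SUBJECT = 770_973_451
RDF_OBJECT = 756_797_070
RDF_PREDICATE = 755_840_432
RDF_STATEMENT = 755_190_906      # entities typed rdf:Statement
RDF_SEQ = 122_811_677            # instances of Seq
TRIPLES_PER_SEQ = 3              # presumption, as in the text

# Step 1: net out reification triples and the Statement entities.
reification_triples = RDF_SUBJECT + RDF_OBJECT + RDF_PREDICATE + RDF_STATEMENT
triples_no_reif = TOTAL_TRIPLES - reification_triples
entities_no_reif = TOTAL_ENTITIES - RDF_STATEMENT

# Step 2: additionally net out the Seq instances and their presumed triples.
triples_no_seq = triples_no_reif - TRIPLES_PER_SEQ * RDF_SEQ
entities_no_seq = entities_no_reif - RDF_SEQ

steps = [
    ("raw", TOTAL_TRIPLES, TOTAL_ENTITIES),
    ("without reified statements", triples_no_reif, entities_no_reif),
    ("also without Seq", triples_no_seq, entities_no_seq),
]
for label, triples, entities in steps:
    print(f"{label}: {triples:,} / {entities:,} = {triples / entities:.2f}")
```

Each step only subtracts counts, so the ratios are sensitive to the assumption that Statement and Seq entities carry no triples beyond the ones netted out here.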
Type and Property count distribution

With reified statements, the first 8 values cover 51.8% of all values of rdf:type. Without reified statements, the first 12 values cover 51.4% of all values of rdf:type.
854,401,149 entities have type foaf:Person (13.7%)
543,212,900 entities have type rpi-data-gov:DataEntry (8.7%)
354,334,157 entities have type ncbi_resouce:reference (5.7%)
No other values get over 5%.
The only predicate to occur in more than 5% of all triples is rdf:type (11.9%). The second most common predicate is rdfs:label, which occurs in only 3% of all triples.
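As a rough consistency check on these percentages, here is a small Python sketch. It assumes (this is my reading, not something stated in the spreadsheet) that the per-type percentages are shares of all rdf:type triples, and it estimates that total from the rounded 11.9% figure, so the results are approximate.

```python
TOTAL_TRIPLES = 52_381_770_554
RDF_TYPE_SHARE = 0.119  # rounded share of all triples, as quoted above

# Entity counts per type, as quoted above.
type_counts = {
    "foaf:Person": 854_401_149,
    "rpi-data-gov:DataEntry": 543_212_900,
    "ncbi_resouce:reference": 354_334_157,
}

# Approximate number of rdf:type triples implied by the 11.9% figure.
approx_type_triples = TOTAL_TRIPLES * RDF_TYPE_SHARE

# Share of each type among all rdf:type values, in percent.
shares = {name: 100 * n / approx_type_triples for name, n in type_counts.items()}

for name, pct in shares.items():
    print(f"{name}: {pct:.1f}% of rdf:type values")
```

Under that assumption the three shares round to the quoted 13.7%, 8.7%, and 5.7%.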
DataEntry is a rather general class that is basically used as a place to stuff the values extracted from a row in a spreadsheet. The properties are based on the name of the spreadsheet field used to represent them. There may be some disambiguation going on, given that the properties used are made a subproperty of a property based on the name, and some datasets use properties that come from a related dataset; I didn't explore this fully, and many of the original data.gov links are now dead, so it's harder to check up on. Based on a brief look at a couple of data sets (744 and 784), it looks like the bulk of these entries are from HHS Medicare cost reports.
Interestingly, there is a bug in the property definitions generated by the RPI semantic-mediawiki instance: properties are declared as ObjectProperties (e.g. http://data-gov.tw.rpi.edu/vocab/p/744/rpt_rec_num ), yet used as DataProperties, e.g.:
<http://data-gov.tw.rpi.edu/raw/784/data-784-00001.rdf#entry1041> <http://data-gov.tw.rpi.edu/vocab/p/744/rpt_rec_num> "55927" .
This cannot be handled in OWL 2 DL (the same IRI cannot be used as both an object property and a data property under the OWL 2 Direct Semantics).
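A minimal sketch of how one might flag this kind of clash, in plain Python with no RDF library. The `Literal` class and the function name are hypothetical helpers for illustration, and the declaration triple is my assumed form of the ObjectProperty declaration described above, not something copied from the RPI vocabulary file.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_OBJECT_PROPERTY = "http://www.w3.org/2002/07/owl#ObjectProperty"

class Literal(str):
    """Hypothetical marker type: an RDF literal, as opposed to an IRI."""

PROP = "http://data-gov.tw.rpi.edu/vocab/p/744/rpt_rec_num"
ENTRY = "http://data-gov.tw.rpi.edu/raw/784/data-784-00001.rdf#entry1041"

triples = [
    # Assumed form of the declaration described in the message.
    (PROP, RDF_TYPE, OWL_OBJECT_PROPERTY),
    # The usage quoted in the message: a literal object.
    (ENTRY, PROP, Literal("55927")),
]

def literal_valued_object_properties(triples):
    """Return declared object properties that also occur with a literal object."""
    declared = {
        s for s, p, o in triples
        if p == RDF_TYPE and o == OWL_OBJECT_PROPERTY and not isinstance(o, Literal)
    }
    clashes = {
        p for _, p, o in triples
        if p in declared and isinstance(o, Literal)
    }
    return sorted(clashes)

print(literal_valued_object_properties(triples))
```

On a real dump the same two-pass idea applies: collect the declared object properties first, then scan for any use of them with a literal in object position.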
Simon