
[ontolog-forum] Looking at LOD (was Re: [SMW-devel] [News] Google, Microsoft, Facebook And Others Launch SMW site)

To: "[ontolog-forum]" <ontolog-forum@xxxxxxxxxxxxxxxx>
From: Simon Spero <sesuncedu@xxxxxxxxx>
Date: Thu, 11 Oct 2012 17:58:22 -0400
Message-id: <CADE8KM4-N_gWYOiprcxZotNu+XfXYrcG0rVBvqQUQY5G0Uf10g@xxxxxxxxxxxxxx>
On Thu, Oct 11, 2012 at 12:50 PM, Kingsley Idehen <kidehen@xxxxxxxxxxxxxx> wrote:
On 10/11/12 11:18 AM, John F Sowa wrote:
But note that most vendors of triple stores support SQL as an option for complex queries.

Not disputing that; we do that. But one has to be careful about what "complex" implies. For instance, we support SPARQL, SPASQL (SPARQL inside SQL), and SQL. Each has its own virtues re. complexity handling.

  I noticed that none of the examples on that page use the SPARQL operators FILTER, OPT, or UNION.

That's me trying to keep it simple.

I can make pages with those operators that will hit a live 50 billion+ triple instance of Virtuoso. We are bringing that online as a replacement for the older LOD cloud cache, which could handle any of the aforementioned operators against a 29 billion+ triple instance.

1. http://bit.ly/ONYFDH -- Google Spreadsheet with some benchmark results for the Virtuoso hybrid DBMS engine.

I'm not sure that using the word "complexity" in a thread in which P and NP-completeness results are discussed is good for my poor monkey brain.  I'm also not sure that John is allowed to both cite worst-case complexity results *and* criticize the direction taken by DAML in the same text :-P.
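For concreteness, the operators in question are the same ones that drive those worst-case results. A toy query exercising all three (not one taken from the page in question; the FOAF terms are just placeholders) might look like:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>

    # OPTIONAL (and, to a lesser degree, UNION) is what pushes worst-case
    # SPARQL evaluation into the harder complexity classes.
    SELECT ?s ?label ?name
    WHERE {
      { ?s a foaf:Person } UNION { ?s a foaf:Agent }
      OPTIONAL { ?s rdfs:label ?label }
      OPTIONAL { ?s foaf:name ?name }
      FILTER ( BOUND(?label) || BOUND(?name) )
    }
    LIMIT 10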

I also wonder whether it's better to avoid the word "instance" when referring to counts of triples/quads; it more naturally refers to entities, which I would take to be the number of unique subjects.  I would describe the data set as having 52.4 GigaTriples.
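As a rough sketch, and assuming the endpoint will tolerate aggregates over the whole graph, the two numbers I care about can be asked for directly:

    # Total triples vs. distinct subjects ("entities" in my sense above).
    # On a 50 billion+ triple store this is obviously not a cheap query.
    SELECT (COUNT(*) AS ?triples) (COUNT(DISTINCT ?s) AS ?entities)
    WHERE { ?s ?p ?o }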

The dataset used in the spreadsheet is interesting, but it has some characteristics that may be problematic for benchmarking purposes. On the other hand, those characteristics may accurately reflect properties of LOD-space.

Reification and triples/entity

From the spreadsheet, there are at most 770,973,451 reified statements. There are 770,973,451 rdf:subject triples, 756,797,070 rdf:object triples, 755,840,432 rdf:predicate triples, and 755,190,906 entities with type rdf:Statement.
From this I assume (a) RDFS entailment is not in effect, (b) RDF reification is a mess, (c) the data is messy, and (d) RDF reification is a mess.
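A sketch of the sort of query that would produce those four counts (assuming the data is queryable as a single graph):

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

    # Count each piece of the reification vocabulary separately; each
    # UNION branch binds a different variable, so each COUNT sees only
    # the rows contributed by its own branch.
    SELECT (COUNT(?s1) AS ?subjects)
           (COUNT(?s2) AS ?objects)
           (COUNT(?s3) AS ?predicates)
           (COUNT(?s4) AS ?statements)
    WHERE {
        { ?s1 rdf:subject   ?x }
      UNION
        { ?s2 rdf:object    ?y }
      UNION
        { ?s3 rdf:predicate ?z }
      UNION
        { ?s4 rdf:type rdf:Statement }
    }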

There are 52,381,770,554 triples (not distinct-ed, or necessarily unique, but assume snowflake-completeness).
The dataset has 5,275,200,213 distinct subjects/entities (modulo sameAs). 
This gives an average of ~9.93 triples per entity. 

Since RDF Statements are the second most common entity type in the data set, are of low arity, and are a bit of a meta-distraction in this context, it's worth netting them out.  I'll assume that Statement entities have only S/P/O and rdf:type asserted of them; this is probably false, but since I would also want to net out any provenance info that was cloned onto them during triplification-reification, the approximation will do.
Without reified statements, we have  52,381,770,554 - (770,973,451 + 756,797,070 + 755,840,432 + 755,190,906) = 49,342,968,695 triples.  
There are 5,275,200,213 - 755,190,906  = 4,520,009,307 entities.  
This gives a ratio of ~10.92 triples/entity.   

There are also 122,811,677 instances of rdf:Seq, with presumably 3*122,811,677 = 368,435,031 triples; netting those out as well gives 49,342,968,695 - 368,435,031 = 48,974,533,664 triples over 4,520,009,307 - 122,811,677 = 4,397,197,630 entities, for a ratio of ~11.14 triples/entity.
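That "presumably" is doing some work; a sketch of a query to check both the rdf:Seq count and the actual number of container-membership triples (the rdf:_1, rdf:_2, ... properties) would be:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

    # Count rdf:Seq instances and, where present, their membership
    # triples; the membership properties all live in the rdf: namespace
    # with local names _1, _2, ...
    SELECT (COUNT(DISTINCT ?seq) AS ?seqs)
           (COUNT(?member)       AS ?memberTriples)
    WHERE {
      ?seq rdf:type rdf:Seq .
      OPTIONAL {
        ?seq ?p ?member .
        FILTER ( STRSTARTS(STR(?p),
                 "http://www.w3.org/1999/02/22-rdf-syntax-ns#_") )
      }
    }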

Type and Property count distribution

With reified statements included, the 8 most frequent values account for 51.8% of all rdf:type triples.
Without reified statements, the 12 most frequent values account for 51.4% of all rdf:type triples.

854,401,149 entities have type foaf:Person (13.7% of rdf:type triples)
543,212,900 entities have type rpi-data-gov:DataEntry (8.7%)
354,334,157 entities have type ncbi_resource:reference (5.7%)

No other value exceeds 5%.
The only predicate occurring in more than 5% of all triples is rdf:type (11.9%). The second most common predicate is rdfs:label, which occurs in only 3% of all triples.
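I don't know how the spreadsheet numbers were produced, but a pair of GROUP BY queries of this shape would give the same distributions; here is the type side (the predicate side is the same with ?p in place of rdf:type):

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

    # Frequency of each rdf:type value, most common first.
    SELECT ?type (COUNT(?s) AS ?n)
    WHERE { ?s rdf:type ?type }
    GROUP BY ?type
    ORDER BY DESC(?n)
    LIMIT 12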

DataEntry is a rather general class, basically used as a place to stuff the values extracted from a row in a spreadsheet. The properties are named after the spreadsheet fields they represent (there may be some disambiguation going on, since each generated property is made a subproperty of a property based on the field name, and some datasets reuse properties from a related dataset, but I didn't explore fully, and many of the original data.gov links are now dead, so it's harder to check up on). Based on a brief look at a couple of data sets (744 and 784), it looks like the bulk of these entries come from HHS Medicare cost reports.

Interestingly, there is a bug in the property definitions generated by the RPI Semantic MediaWiki instance: properties are declared as owl:ObjectProperty (e.g. http://data-gov.tw.rpi.edu/vocab/p/744/rpt_rec_num ), yet they are used with literal values, i.e. as datatype properties.

This cannot be handled in OWL 2 DL (the same IRI cannot be used as both an object property and a data property under the OWL 2 direct semantics).
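A sketch of a query that would flush out this sort of punning problem, assuming the property declarations and the instance data sit in the same store:

    PREFIX owl: <http://www.w3.org/2002/07/owl#>

    # Properties declared as object properties but used with literal
    # objects -- exactly the pattern that falls outside OWL 2 DL.
    SELECT DISTINCT ?p
    WHERE {
      ?p a owl:ObjectProperty .
      ?s ?p ?o .
      FILTER ( isLiteral(?o) )
    }
    LIMIT 100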

Simon

_________________________________________________________________
Message Archives: http://ontolog.cim3.net/forum/ontolog-forum/  
Config Subscr: http://ontolog.cim3.net/mailman/listinfo/ontolog-forum/  
Unsubscribe: mailto:ontolog-forum-leave@xxxxxxxxxxxxxxxx
Shared Files: http://ontolog.cim3.net/file/
Community Wiki: http://ontolog.cim3.net/wiki/ 
To join: http://ontolog.cim3.net/cgi-bin/wiki.pl?WikiHomePage#nid1J
