ppy/chat-transcript_unedited_20120209a.txt
-------------
Chat transcript from room: summit_20120209
2012-02-09 GMT-08:00
-------------
[09:22] anonymous morphed into Bob Smith
[09:23] anonymous morphed into Jim Kirby
[09:23] anonymous1 morphed into TomTinsley
[09:25] anonymous morphed into MatthewHettinger
[09:26] anonymous morphed into John Bilmanis
[09:28] PeterYim: Welcome to the
= OntologySummit2012: Session-05, Thursday 2012-02-09 =
Summit Theme: OntologySummit2012: "Ontology for Big Systems"
Track (3) Title: Meeting Big Data Challenges through Ontology
Session Topic: I - Big Data domain experts and ontologists; II - Big Data that would benefit from ontological technology
Session Chairs: Mr. ErnieLucier (NCO/NITRD) and Ms. MaryBrady (NIST)
Panelists:
* Professor BarrySmith (University at Buffalo) - "Big Data that might benefit from ontology technology, but why this usually fails"
* Mr. ChrisMusialek (GSA) (for Dr. JeanneHolm, Evangelist, Data.gov) - "Driving Innovation with Open Data"
* Mr. BryanThompson and Mr. MikePersonick (SYSTAP) - "Big Data Challenges: Managing Scale in Ontological Systems"
* Mr. JamesKirby (Naval Research Laboratory) - "Ontology for Software Production"
Session page: http://ontolog.cim3.net/cgi-bin/wiki.pl?ConferenceCall_2012_02_09
Mute control: *7 to un-mute ... *6 to mute
Can't find Skype Dial pad? ... it's under the "Call" dropdown menu as "Show Dial pad"
== Proceedings: ==
[09:30] anonymous morphed into DougFoxvog
[09:31] anonymous morphed into Adam Montville
[09:31] Adam Montville: VNC active for this today?
[09:31] James Odell: Am getting a lot of background noise. Is it just me?
[09:32] anonymous1 morphed into ChristopherSpottiswoode
[09:32] FrankOlken: Hi, this is Frank Olken now on the call and chat room.
[09:32] LarryLefkowitz: I'm hearing the background noise as well. Someone in a car or under a fan?
[09:32] anonymous morphed into Xavier Lopez
[09:32] anonymous1 morphed into ChrisMusialek
[09:32] LeoObrst: Yes, background noise from someone.
[09:33] anonymous morphed into Bryan Thompson
[09:33] anonymous4 morphed into bill mccarthy
[09:33] ChrisMusialek: I'm hearing background noise too
[09:33] anonymous2 morphed into Tsengdar Lee
[09:33] anonymous morphed into Barry SMith
[09:34] Adam Montville: I hear the noise as well.
[09:34] Barry SMith morphed into Barry Smith
[09:34] FrankOlken: I also hear the background noise, and I am muted.
[09:34] MikeBennett: +n
[09:34] TrishWhetzel: I also hear the background noise, and am muted.
[09:34] DougFoxvog: The noise sounds like a distant vacuum cleaner -- constant, not variable like wind.
[09:35] AmandaVizedom: The noise will go away once the check-ins are done and Peter mutes everyone from his end.
[09:35] Tsengdar Lee: I can hear constant white noise here. I am muted.
[09:35] Adam Montville: VNC activated?
[09:36] anonymous morphed into Rosario Uceda-Sosa
[09:37] JackRing: Also, if you are on Skype and the little mike is red then click on it
[09:38] Jim Kirby: Peter I'm dialed in
[09:40] DougFoxvog: Next slide?
[09:43] AmandaVizedom: FYI, current speaker is GeorgeStrawn http://ontolog.cim3.net/cgi-bin/wiki.pl?GeorgeStrawn
[09:45] Tsengdar Lee: Is VNC the only way to see slides? I can't install VNC on my government controlled laptop.
[09:45] anonymous morphed into BartGajderowicz
[09:46] AmandaVizedom: @Tsengar: All slides are downloadable from the call's wiki page: http://ontolog.cim3.net/cgi-bin/wiki.pl?ConferenceCall_2012_02_09
[09:47] AmandaVizedom: @Tsengdar - sorry for typo on your name!
[09:47] anonymous1 morphed into ChristopherSpottiswoode
[09:48] Tsengdar Lee: @AmandaVizedom: Thanks very much! I have the slides now.
[09:49] AmandaVizedom: You're welcome.
[09:52] PeterYim: == BarrySmith presenting ...
[09:58] BobSchloss: The combination of incentives for vocabulary re-use, and enabling this with new tools and registries, is an important issue. Within a modest-sized scope, a few of us at IBM Research are starting to document, and encouraging use of, a disciplined approach that encourages this. But we won't know this is effective for at least another year while the number of contributing ontologists goes up from our current small group to a large group.
[10:01] JackRing: What happens in GO when a new cognate appears? e.g., in Medlars, 1976 the notion of HIV appeared. More than 1 million documents had to be retroactively tagged.
[10:01] Ram D. Sriram: @Barry: What is it about Big Data that is unique, in terms of Ontology Building?
[10:01] AmandaVizedom: @Bob: I agree, and see such incentives all around. I don't think most people building their own ontologies would rather reuse what they can. It's currently very hard, though, to find ontologies that support such reuse. And it's very hard to assess the reusability of an ontology in a new context or project.
[10:05] PeterYim: == ChrisMusialek presenting ...
[10:06] BobSchloss: One of my questions for Chris has to do with what metrics Data.gov will use to show the cost effectiveness of cataloging and hosting this data, and also how often each dataset type is updated? Are some updated in real time or every hour? Are some on a daily basis? How many on a monthly basis or quarterly basis? How many are updated once per year?
[10:06] James Odell: +1
[10:07] EricChan: @BarrySmith: would you recommend that the ontology community adopt a pattern language? by adapting Christopher Alexander's pattern language and Gamma et al. (gang of four) catalog of OO design patterns, to document ontologies and discuss effectiveness of a proposed ontology, and to evolve it to an authoritative ontology.
[10:08] DougFoxvog: Slide 7 states that data.gov provided 6.7 billion triples as of Nov 2011.
Were all the resources used defined in data dictionaries?
[10:09] Gary Berg-Cross: Building common taxonomies for the metadata vocabularies was part of the enterprise architecture strategy of the last few years.
[10:09] DougFoxvog: Oops. I meant slide 11.
[10:11] BobSchloss: General comment: What is it about the Data.Gov needs that makes them willing to be early adopters of Ontology approaches, compared to commercial data providers, such as Thomson Reuters, Bloomberg, etc.? Is it the heterogeneity of the data feeds?
[10:12] PeterYim: == BryanThompson presenting ...
[10:12] Gary Berg-Cross: Some material on data ecosystems would be useful. Does anyone have good references?
[10:16] BobSchloss: Interesting that some of the social media sites, like Twitter and Facebook, I think are using some kind of NoSQL approach to persistence, but that Bryan doesn't list them as an interested community. They do, in fact, have many Terabytes of data... perhaps not Petabytes yet.
[10:17] ChrisMusialek: @bobschloss: I think it's more that we understand the power of linked data to lower the costs of reusing data more than anything. In addition, government data is used quite widely already, so we feel there are huge opportunities in promoting this in the Federal space.
[10:18] AmandaVizedom: Question regarding Data.gov: If I understand correctly, one way Data.gov wants to help make the data available and usable is to make it easier for data holders to put their data out into the open gov data space in a usable way. Does that include tools for data holders to share their metadata ontologies (if they have them), or to find and evaluate ontologies they might be able to reuse in their metadata?
[10:19] ChrisMusialek: @bobschloss: It depends on the dataset. Some are updated daily, some yearly, and some never. It also at this point, because of our architecture, depends on when data stewards go in to update their dataset in our system, which is also variable.
[10:19] JackRing: Slide 5 reflects a presumption of von Neumann machine architectures. It is possible to have both quick and complete findings.
[10:20] ChrisMusialek: @bobschloss: but we believe in posting as much data about Data.gov metrics as possible. Currently we share a small amount of things, but we want to increase this greatly.
[10:20] anonymous morphed into Ulozas
[10:21] MikeBennett: Surely an ontology built to express business meanings overall, and an ontology built to be reasoned over, are two different use cases? Should the requirements of the one be imposed upon the other?
[10:23] AmandaVizedom: @BryanThompson: Very confused by some of the points on your slide 6. You argue that expressive ontology is not easily partitioned. Why? Can you tell us anything about examples (even an anonymized sketch) where you've experienced a correlation between the two? Your last point says to "[a]void constructs that tell you things you probably already know (e.g. domain/range)." But ontologies are most useful when at least minimal computation is going to be done, or information will be shared across contexts, so the things you already know aren't going to be known by (machine or human) users. Are you talking about applications where this doesn't come into play? Or am I misunderstanding your point?
[10:24] ChrisMusialek: @amandavizedom: It is both, but at this point, a concentration on finding and evaluating ontologies that they might want to reuse in their metadata/data. We think that once these different vocabularies become public, that we will begin to see similarities across distinct agencies that happen to store similar-typed information.
[10:26] AmandaVizedom: @BryanThompson: Provenance is more expensive in some ontology/knowledge base architectures than others. I'm assuming from the SPARQL reference that you are using OWL. Is that correct? Did you look at, or compare, any alternatives?
[10:26] AmandaVizedom: @ChrisMusialek: Thanks. I agree.
[10:26] PeterYim: == JimKirby presenting ...
[10:27] ChrisMusialek: @dougfoxvog: No, I don't think that the resources were defined in data dictionaries, which is another reason we really need the data dictionaries for each of the datasets. At this point, the RDF produced didn't have much context. If we were to publish vocabularies for each dataset, we would be able to produce much better RDF. Not as good as original modeling in RDF, but much better.
[10:28] ChrisMusialek: @amandavizedom: Yes, we hope that academics and researchers can help us out with building the tools. :-)
[10:29] AmandaVizedom: @MikeBennett: Good point about differing use case requirements. Perhaps we are seeing here, again, recommendations that clash because of unstated differences in usage and the requirements derived therefrom.
[10:31] BobSchloss: IBM, with leadership from our Rational brand, is using a community around http://jazz.net and http://open-services.net to have a group of tool vendors used in software development to agree on the essential data that should be captured during the software development lifecycle.
[10:32] LeoObrst: @Bryan: Reification in RDF is syntactic, has no semantic interpretation per se. Why? Because like propositional attitudes (ex: John knows/believes/regrets that Ed is a thief), the semantics of the attitude to the "triple" depends on the specific attitude. So provenance, which uses reification, can be of many kinds, and really the ontology (and thus reasoner) must define what the provenance elements mean.
[10:32] BobSchloss: Many of us think that this should not be limited just to "software production" but also to the PLM of engineered complex systems -- such as aircraft carriers, airplanes, locomotives, telecom switches and routers, custom earthmoving and construction equipment, custom equipment construction for factories, refineries, chemical processing facilities, etc.
[10:33] Bryan Thompson: NoSQL is a very broad category.
Many times these platforms are "row stores" which are hash partitions and provide primary key access *only* to the data. This makes it extremely difficult to do high level query against the row store. NoSQL is also used for analytic databases where query plans are hash partitioned across data sets, but this is most often associated with data warehousing rather than graph processing. I also left out many other types of large scale distributed processing, e.g., MPI. I was not trying to be exhaustive, just illustrative.
[10:33] anonymous morphed into DWiz
[10:34] BobSchloss: @Bryan - Thanks; I understand.
[10:34] JackRing: This does not acknowledge the critical importance of identifying and converging the invariants across multiple chunks of software.
[10:35] Bryan Thompson: There have been some good papers on abox/tbox partitioning for parallelizing RDFS reasoning, which is still within datalog. More expressive ontologies often go beyond datalog.
[10:36] MikeBennett: @Amanda that's my thought. When we come to look at quality assurance aspects, one would expect to see these hitherto unspoken assumptions and requirements formally documented.
[10:36] Bryan Thompson: RDF reification turns into materialized assertions within the database. If you want datum level provenance, you wind up storing 4x to 5x as many statements if you reify the statements. The more data on the disk, the more expensive the solution.
[10:38] PeterYim: @DWiz - we use real names here, can you identify yourself and swap in your real name, if you please
[10:38] LeoObrst: Bryan, we have translated OWL/RDF + SWRL rules into a hybrid logic programming + description logic runtime reasoner, with efficient reasoning. Over huge data stores (and huge ontologies), one needs to use multiple (distributed) reasoners that interact, a difficult architecture.
[10:40] Gary Berg-Cross: Some people use the idea of "small theory" ontologies rather than micro-ontologies. Werner Kuhn is one person who uses the term.
[10:41] AmandaVizedom: @BryanThompson: Your last comment is one of the reasons I was asking about how you made your language choices. People have been working hard to enable efficient provenance with OWL, but it isn't really architected to make that easy. There is less explosion in some other languages and techniques, and I'm wondering whether your language choices were part of your design process, or were in some way forced?
[10:41] RexBrooks: @Barry: Governance is a big hurdle since you sometimes need to get completeness of large datasets and large classification systems that need to be reuse-able as data churns through on some kind of daily, weekly, hourly basis. Coming to significant governance findings mid-stream causes even more churn. So how to separate out the governance from daily operations is important.
[10:41] JackRing: Proposed objective: Demonstrate degree of improvement (quality, parsimony, beauty) in interoperability among humans in diverse disciplines who use computer-based aids. This entails not only an initial ontology but also a convergence method for continually evolving ontologies. Perhaps this amounts to devising a method that fuses (Bohm) dialogue with ontology design (and perpetual re-factoring).
[10:41] ToddSchneider: Leo, where can a description of the work you described be found?
[10:42] LeoObrst: @Bryan: yes, those materialized reification assertions do cost, which is why many vendors go to quad or quint stores or more. Metadata always adds up.
[10:43] DWiz: Who was it who said, I am not a big fan of UML? I didn't know that there are 2 of us.
[10:44] ToddSchneider: DWiz, There are many of us who want to supplant UML with something actually useful.
[10:44] PeterYim: @BarrySmith and @LeoObrst - since we are talking to folks in the "Big Data" community now, maybe we need to take a chance to clarify where "ontologists" are positioned (wrt communities like semantic web, etc.)
[10:44] DougFoxvog: @Amanda: +1 ; @BryanThompson: using quads instead of RDF (or other) triples allows for contexts to be specified with the statements. One can often avoid reification if one is not restricted to triples.
[10:45] LeoObrst: @Todd: I can send you offline. Here is a reference: Samuel, Ken; Leo Obrst; Suzette Stoutenberg; Karen Fox; Paul Franklin; Adrian Johnson; Ken Laskey; Deborah Nichols; Steve Lopez; and Jason Peterson. 2008. Applying Prolog to Semantic Web Ontologies & Rules: Moving Toward Description Logic Programs. The Journal of the Theory and Practice of Logic Programming (TPLP), Massimo Marchiori, ed., Cambridge University Press, Volume 8, Issue 03, May 2008, pp. 301-322.
[10:45] JackRing: Once you see an ontology as a system that acts as interlocutor among the variety of sources and sinks in Big Data then several system principles may help avoid blind alleys.
[10:45] Bryan Thompson: Yes, you can use quads. Bigdata supports triples, triples+provenance, and quads. The right choice is application specific.
[10:46] Bryan Thompson: Our provenance mode can be viewed as a special case of quads where every statement is in its own graph, but it uses much less data on the disk. A difference which makes a difference.
[10:46] BartGajderowicz: I don't have a microphone, so I'll type my comments below. Barry Smith's and Chris Musialek's presentations touched on the data analysis of raw data. Barry's example was search, Chris' included search, visualization. I believe analyses on large datasets such as machine learning can help with this. My main motivation is the following: Instead of statistical interpretation (mean, standard deviation, mode, etc) we can show semantic relations of data, and analyse it at a more abstract level. Barry's work on ontology granulation is related to this. There is work on creating decision trees using OWL (see Reference below).
The Uncertainty and Reasoning for the Semantic Web workshop is also related to this, although dealing with ontologies themselves, not necessarily incorporating data. Proposals to achieve this:
- Extending Ontologies with Data.
- Associating Data with Ontologies.
- Using machine learning on ontologically enhanced data.
Are there any efforts along these lines? Is anyone interested in using this approach to analyse data?
Reference: I have a more detailed message on the ontology-summit mailing list with references: http://ontolog.cim3.net/forum/ontology-summit/2012-02/msg00158.html
[10:47] BartGajderowicz: sorry for the long message
[10:48] JackRing: Caution: Reuse is a chimera pending unification of invariants throughout a system being composed.
[10:48] BobSchloss: I have some colleagues here at IBM T J Watson Research who are also interested in using machine learning and IR approaches to ontologically enhance data and to ontologically categorize data. I don't think they've published their work yet, but I will mention your interest to them.
[10:48] BartGajderowicz: Thank you for reading it out
[10:49] Bryan Thompson: @LeoObrst: Yes, distributed reasoners is an interesting approach - especially if you can partition the reasoning. We have been looking at GPU co-computing for accelerating certain database operations and as a platform for some kinds of reasoning.
[10:49] MaryBrady: You are very welcome
[10:50] PeterYim: @Bart - Barry suggests looking at http://obi-ontology.org/ ...
[10:50] BobSchloss: @Bart - I don't see your name in the Ontolog Wiki page for this meeting. What is your affiliation, what country are you in, and what is your e-mail address?
[10:50] BartGajderowicz: @Bob: see http://ontolog.cim3.net/cgi-bin/wiki.pl?ConferenceCall_2012_02_09#nid34SV
[10:51] BartGajderowicz: The term machine learning hasn't been used that often in the Ontolog community, although I know the IBM Watson team does.
[10:52] TrishWhetzel: @Bart I think that Gil Alterovitz is working on such a project
[10:52] ToddSchneider: Bryan, Partitioning of reasoning can be derived from logical independence.
[10:52] DougFoxvog: @JackRing: Reuse does require standardization. That's what GO does at a wide level and the Dublin Core, FOAF, and other tiny ontologies do at a small level. People who develop their own ontologies can inter-relate their terms with standardized terms to help their own terms to get reused.
[10:53] JackRing: @BobSchloss, Yes, the notion applies far beyond software production. In fact, the act of conceiving, designing, architecting, etc. any system is a process of "stacking" discovered knowledge.
[10:54] BartGajderowicz: @BarrySmith, @TrishWhetzel, @ChrisMusialek: Thanks for the references
[10:55] LeoObrst: We are looking more and more at what design patterns mean for ontologies. A simple design pattern is a rule. More complex design patterns are actually ontologies. For example, if you focus on a clawhammer, an artifact, it has physical properties that correspond to its function(s). But the design pattern behind it (and other objects) can be captured so that many objects (instantiated patterns) apply.
[10:55] ChrisMusialek: I'm sorry, I need to jump off the call now. Thanks for the invitation.
[10:55] JackRing: @DougFoxvog, Standardization is one way, usually onerous and slow to appear. Perhaps proactive semantic converters are relevant.
[10:55] Frank Chum: @Bob I am also interested in ML and Statistical IR techniques in auto categorization with ontologies.
[10:55] Bryan Thompson: @ToddSchneider Yes, if such independence exists. If you want to rely on such partitioning, then it should probably inform your ontology modeling efforts.
[10:56] BartGajderowicz: I have to leave.. Thank you everyone.
[10:58] LarryLefkowitz: I disagree that you need to sacrifice representational power for efficiency. Yes, it takes a more sophisticated inference engine, but this can be done.
That's what we've done with Cyc over the past couple of decades; be happy to talk with folks about this further.
[10:58] MaryBrady: @Bryan: We have been exploring the use of GPUs in cell tracking applications, with the end goal of real-time experimentation -- that is, analyzing cell images through the use of clustering and machine learning approaches, looking specifically for rare events. We have not yet combined this approach with an ontology, but it's an interesting thought.
[10:59] Bryan Thompson: @Leo: Could you expand your comments to comment on how well reasoning scales beyond the machine boundary?
[10:59] AmandaVizedom: I can envision a Grand Challenge like this: Create a tool, of the sort that would work with an ontology repository such as OOR, to support the following activities (make them relatively easy and make them reliable/repeatable): (1) Someone with an ontology registers it, and either adds it to the repository or provides sufficient information for the tool to access it remotely. The tool provides assistance identifying key properties of the ontology that are relevant to its suitability for various types of usage. This assistance includes some manual entry, some automated validation and metrics generation, and some semi-automated generation of information. (2) Someone looking for ontologies comes to the tool and gets help finding ontologies that might meet their needs. The tool assists them in specifying their need, by entering their ontology-specific requirements to the extent that they know them, and by describing their aspects of the intended usage. The tool makes this process also semi-assisted. Key feature of this that makes it a Grand Challenge: It's not just building a tool; it requires the research and testing to establish some of the relationships between ontology characteristics and usage characteristics. It also requires not just implementation of known evaluation techniques, but also research to develop others.
On the other hand, it need not be complete to be valuable. Increments of improvement could be high value advancements over the current state.
[10:59] PeterYim: @Barry - we lost you, can you call back in, please?
[11:00] ToddSchneider: Barry, can you tell us a bit more about the online certificate program?
[11:02] JackRing: The Jazz architecture really helps if your objective is to create systems that behave but how well does it work for building systems that discover?
[11:04] DougFoxvog: I agree with Larry. One can use terms from an ontology that is defined with higher-level concepts while using a more restrictive system. The result is that you cannot make inferences with your limited system, but the more complex system can accept your terms and statements -- and can be used to add more rules or refinements that could not be stated in the more limited system.
[11:05] Bryan Thompson: @MaryBrady: One of the GPU applications that I would like to pursue is non-crisp reasoning over uncertain information in support of human decision making processes. Helping people to identify and deal with incomplete information, unreliable conclusions, and conflicting evidence.
[11:05] BobSchloss: There is much more work to be done to "auto-discover" from the artifacts used in software production (such as source code, or models like UML models or BPMN models) the RDF descriptions that should be produced for use by Jazz, using the vocabularies defined by the Open Services for Lifecycle Collaboration (OSLC) to encode this "discovered information". It is a clear potential, but this is NOT what IBM Rational is now selling.
[11:05] ToddSchneider: Barry, did the development of the certificate courses benefit from the Ontology Summit dedicated to education?
[11:07] DougFoxvog: @BryanThompson: non-crisp reasoning and uncertain information require a more complex logic than OWL-DL or OWL-Lite provide. Other ontological languages provide this.
[11:09] BobSchloss: @Leo or @Peter - please provide a link to this Trento-based course.
[11:09] LeoObrst: @Larry: yes, Cyc uses multiple kinds of reasoners, including some 2nd order, and it's the dispatch among them that gets complicated. E.g., we talked with some of the Cyc folks back a few years about using graph partitioning and then analysis of the partitions to "type" the reasoning required for that partition. And then have an inter-communicating framework.
[11:10] Bryan Thompson: @DougFoxvog: Yes. My interest here is in computational models of cognition.
[11:11] LeoObrst: @Bob: it will shortly be off the IAOA.org main page: iaoa.org.
[11:12] JoelBender: (thank you speakers and panelists and Peter)
[11:12] LarryLefkowitz: @leo: Be happy to resume this discussion with you. Just give me a shout.
[11:13] LeoObrst: Thanks, everyone!
[11:13] LeoObrst: @Larry: sure, let's talk soon.
[11:13] PeterYim: great session ... thanks, everyone!
[11:13] PeterYim: -- session ended: 11:12am PST --
--------------