Ontology for Big Systems

Track 3 - Big Data Challenge

This year's Ontology Summit is titled "Ontology for Big Systems" and seeks to explore, identify, and articulate how ontological methods can bring value to the various disciplines required to engineer a "big system." The term "big system" is intended to cover a large scope that includes many of the terms encountered in the media, "big data" among them.

Established disciplines that fall within the summit scope include (but are not limited to) systems engineering, software engineering, information systems modelling, and data mining.

The principal goal of the summit is to bring together and foster collaboration between the ontology community, systems community, and stakeholders of some of the "big systems." Together, the summit participants will exchange ideas on how ontological analysis and ontology engineering might make a difference when applied in these "big systems." We will aim towards producing a series of recommendations describing how ontologies can create an impact, as well as providing illustrations where these techniques have been, or could be, applied in domains such as bioinformatics, electronic health records, intelligence, the smart electrical grid, manufacturing and supply chains, earth and environmental sciences, e-science, cyber-physical systems, and e-government. As is traditional with the Ontology Summit series, the results will be captured in the form of a communiqué, with expanded supporting material provided on the web.

 
Teleconferences

| Date | Title | Chairs | Panelists |
|------|-------|--------|-----------|
| 2012_02_09 | Track-3: "Meeting Big Data Challenges through Ontology - I & II" | ErnieLucier & MaryBrady | BarrySmith; ChrisMusialek & JeanneHolm; BryanThompson & MikePersonick; JamesKirby |
| 2012_03_15 | Track-3: "Challenge: Ontology and Big Data - III" | MaryBrady & ErnieLucier | TimFinin, KyoungsookKim, MikeFolk, MarioPaolucci, UrsulaKattner, EdinMuharemagic |

Synthesis


The goal of "Meeting Big Data Challenges through Ontology" is to identify challenges that will advance ontology and semantic web technologies, increase applications, and accelerate adoption.   

Current State   

Ontology may tame big data, drive innovation, facilitate the rapid exploitation of information, contribute to long-lived and sustainable software, and improve the modeling of complicated systems.

Ontology might help big data, but such efforts usually fail because:

  1. It is now so easy to create ontologies that myriad incompatible ontologies are being created in ad hoc ways, leading to the creation of new semantic silos.
  2. The Semantic Web framework as currently conceived and governed by the W3C (modeled on HTML) yields minimal standardization.
  3. The more semantic technology succeeds on these terms, the more we fail to achieve our goals.

* Just as it is easier to build a new database than to reuse an existing one, it is easier to build a new ontology for each new project

* You will not get paid for reusing existing ontologies (Let a million ontologies bloom)   

* There are no ‘good’ ontologies, anyway (just arbitrary choices of terms and relations …)   

* Information technology (hardware) changes constantly, not worth the effort of getting things right   

Linked data lowers the costs of reusing data more than anything else. In addition, government data is already used quite widely, so we see huge opportunities in promoting linked data in the Federal space.
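As a concrete illustration of that lowered reuse cost, here is a minimal rdflib sketch; the dataset URL and the example.gov vocabulary are hypothetical placeholders, but the pattern is the point: a standard data model plus a standard query language means no bespoke parser per dataset.

```python
# Minimal sketch of reusing published linked data (hypothetical URLs).
from rdflib import Graph

g = Graph()
# Any RDF-publishing source can be consumed the same way; replace the
# URL with a real linked-data resource.
g.parse("http://example.gov/dataset/agencies.ttl", format="turtle")

# SPARQL is the standard query language, so no per-dataset code is needed.
results = g.query("""
    SELECT ?agency ?name WHERE {
        ?agency a <http://example.gov/ns#Agency> ;
                <http://www.w3.org/2000/01/rdf-schema#label> ?name .
    }
""")
for agency, name in results:
    print(agency, name)
```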

Current Uses / Examples   

Systems Engineering Modeling Languages and Ontology Languages   

Drive Innovation   

Federation and Integration of Systems   

Driving Innovation with Open Data - Creating a Data Ecosystem   

1. Gather data   

* from many places and give it freely to developers, scientists, and citizens   

2. Connect the community   

* in finding solutions to allow collaboration through social media, events, platforms   

3. Provide an infrastructure

* built on standards (a standards-based metadata sketch follows this list)

4. Encourage technology developers   

* to create apps, maps, and visualizations of data that empower people’s choices   

5. Gather more data   

* and connect more people   

6. Energy.Data.gov works with challenges across the nation to integrate federal data and bring government personnel to code-a-thons

7. Data Drives Decisions   

* Apps transform data in understandable ways to help people make decisions   
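One way an infrastructure "built on standards" (step 3 above) can work, sketched with the W3C DCAT dataset-catalog vocabulary and rdflib; every example.gov identifier below is a hypothetical placeholder, not a real dataset.

```python
# Hedged sketch: describing an open dataset with W3C DCAT so that
# standards-based catalog infrastructure can index it.
from rdflib import Graph, Literal, URIRef, Namespace
from rdflib.namespace import RDF, DCTERMS

DCAT = Namespace("http://www.w3.org/ns/dcat#")

g = Graph()
ds = URIRef("http://example.gov/dataset/energy-usage")       # hypothetical
dist = URIRef("http://example.gov/dataset/energy-usage/csv")  # hypothetical

g.add((ds, RDF.type, DCAT.Dataset))
g.add((ds, DCTERMS.title, Literal("Hourly energy usage (example)")))
g.add((ds, DCAT.distribution, dist))
g.add((dist, RDF.type, DCAT.Distribution))
g.add((dist, DCAT.downloadURL, URIRef("http://example.gov/data/energy.csv")))
g.add((dist, DCAT.mediaType, Literal("text/csv")))

print(g.serialize(format="turtle"))
```

Because the metadata is itself linked data, the same tooling that consumes the datasets can discover them.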

Rapid exploitation of information   

1. In this world, the benefit is derived from the rapid pace at which new data and new data sources can be combined and exploited.   

2. High-level reasoning over curated information: in this world, the benefit is derived from non-trivial inferences drawn over highly vetted data.

3. Many times people try to have both expressivity and scale. This is very expensive   

* Don’t be seduced by expressivity   

* Just because you CAN say it doesn’t mean you SHOULD say it. Stick to things that are strictly useful to building your big data application.   

* Computationally expensive   

* Expressivity is not free. It must be paid for either with load throughput or query latency, or both.   

* Not easily partitioned   

* Higher expressivity often involves more than one piece of information from the ABox – meaning you have to cross server boundaries. With lower expressivity you can replicate the ontology everywhere on the cluster and answer questions LOCALLY.

* A little ontology goes a long way   

* There can be a lot of value just getting the data federated and semantically aligned.   
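A minimal sketch of that last point, assuming a Python/rdflib setting: two silos name the same notion differently, and a two-triple mapping "ontology" (small enough to replicate to every node) is all the alignment the query needs. All example.org names are illustrative.

```python
# Hedged sketch: "a little ontology goes a long way."
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDFS, FOAF

A = Namespace("http://example.org/siloA#")  # hypothetical silo vocabularies
B = Namespace("http://example.org/siloB#")

g = Graph()
g.add((A.alice, A.fullName, Literal("Alice Adams")))  # silo A's predicate
g.add((B.bob, B.personName, Literal("Bob Brown")))    # silo B's predicate

# The tiny alignment ontology: both predicates specialize foaf:name.
g.add((A.fullName, RDFS.subPropertyOf, FOAF.name))
g.add((B.personName, RDFS.subPropertyOf, FOAF.name))

# A property path stands in for RDFS entailment, so the query can be
# answered locally with no heavyweight reasoner.
q = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?who ?name WHERE {
        ?p rdfs:subPropertyOf* foaf:name .
        ?who ?p ?name .
    }
"""
for who, name in g.query(q):
    print(who, name)  # both silos' records, semantically aligned
```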


Areas of Use (both current and future) / Areas of non-use   

Ontology Design Patterns for Systems Engineering   

Ontology for Software Production - instantiating the ontology describes the design of a particular system

  • Decisions considered, rejected, made, changed   
    • Rationale   
  • Formal software artifacts   
    • Source and executable code; specifications; machine-readable models   
  • Structured informal artifacts   
    • Pseudo-code, requirements, graphical models, test plans, email addressing information and subject lines
  • Unstructured artifacts   
    • Email body, notes, code comments, etc.   
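A hedged sketch of what instantiating such an ontology could look like; the sp: vocabulary and all identifiers below are invented for illustration, not an actual published ontology.

```python
# Hypothetical software-production ontology instance: a design decision,
# its rationale, and the artifacts it touches.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF

SP = Namespace("http://example.org/software-production#")  # illustrative

g = Graph()
d = SP["decision-042"]
g.add((d, RDF.type, SP.DesignDecision))
g.add((d, SP.status, Literal("made")))  # considered / rejected / made / changed
g.add((d, SP.rationale, Literal("Chose SQLite to avoid a server dependency")))
g.add((d, SP.affects, SP["src-storage-module"]))        # formal artifact
g.add((d, SP.discussedIn, SP["email-2012-03-01"]))      # unstructured artifact

g.add((SP["src-storage-module"], RDF.type, SP.SourceCode))
g.add((SP["email-2012-03-01"], RDF.type, SP.EmailMessage))

print(g.serialize(format="turtle"))
```

Queries over such instances can then answer questions like "which decisions does this source file depend on, and why were they made?"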

Cyber-Physical Social Data Cloud Infrastructure   

A NIST & NICT collaboration project: R&D of a cloud platform specialized for collecting, archiving, organizing, manipulating, and sharing very large (big) cyber-physical social data

Use case 1 - Healthcare data publishing & sharing   

Use case 2 – Location-aware services (e.g., disaster response)

Global monitoring and local geofencing (for safe and rapid evacuation)
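As a rough illustration of the "locally fencing" step, here is a minimal sketch of a circular geofence membership test using the haversine distance; the coordinates and radius are illustrative assumptions, not part of the project described.

```python
# Hedged sketch: is a reported position inside a circular evacuation zone?
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def inside_fence(lat, lon, fence_lat, fence_lon, radius_km):
    """True if (lat, lon) falls inside the circular geofence."""
    return haversine_km(lat, lon, fence_lat, fence_lon) <= radius_km

# Example with made-up coordinates: a 5 km zone around a city centre.
print(inside_fence(35.690, 139.700, 35.681, 139.767, 5.0))
```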

Information and Communication Technology (ICT)

  • Too much data   
  • Too much speed   
  • Too much complexity   

Why a Materials Genome Initiative? Materials are complicated systems, and modeling them is a challenge.

The Materials Genome Initiative is a new, multi-stakeholder effort to develop an infrastructure to accelerate advanced materials discovery and deployment in the United States. Over the last several decades there has been significant Federal investment in new experimental processes and techniques for designing advanced materials. This new focused initiative will better leverage existing Federal investments through the use of computational capabilities, data management, and an integrated approach to materials science and engineering.   

Next steps   

* File repository for first principles calculations   

* File repository for CALPHAD calculations

* General data repository: a prototype repository for data used in CALPHAD assessments

* Evaluation of data storage formats (e.g., markup languages, hierarchical data formats such as HDF)
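To make the storage-format evaluation concrete, here is a minimal h5py sketch that stores made-up CALPHAD-style assessment data in HDF5, one of the candidate formats above; the group names, values, and units are placeholders.

```python
# Hedged sketch: illustrative (not real) assessment data in HDF5.
import h5py
import numpy as np

with h5py.File("assessment.h5", "w") as f:
    grp = f.create_group("Cu-Ni/liquid")          # hypothetical system/phase
    temps = grp.create_dataset("temperature", data=np.linspace(1300, 1800, 6))
    temps.attrs["units"] = "K"
    gibbs = grp.create_dataset("gibbs_energy", data=np.zeros(6))  # placeholders
    gibbs.attrs["units"] = "J/mol"
    grp.attrs["source"] = "illustrative data only"
```

Hierarchical grouping plus self-describing attributes (units, provenance) is what makes such formats candidates for long-lived materials data.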

Accessibility (i.e., ease of use) / Impediments   

Ontology Quality for Large-Scale Systems   

Ontology Tools and Training for Systems Engineers   

Recommendations   

Some needs and desires that big systems and systems engineering have of ontology are:

  • Fast integration of data
  • Integration of heterogeneous data, linked data, and structured data
  • Easy exploitation of data
  • Fine-grained provenance of federated data (see the provenance sketch after this list)
  • An open, transparent platform for everyone

* More opportunities for social, economic and political participation   

* Open platform for everyone, new public good   

* Non-expert system   

* Crowd sourcing, citizen science   

* Establish new information ecosystem to create new opportunities, services and jobs   

* Benefit from cultural diversity   

* Value-sensitive design   
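For the fine-grained provenance item above, a minimal sketch using the W3C PROV-O vocabulary; the example.org identifiers and the integration pipeline are hypothetical.

```python
# Hedged sketch: per-source provenance for a federated record (PROV-O).
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")  # hypothetical identifiers

g = Graph()
merged = EX["record/123"]
g.add((merged, RDF.type, PROV.Entity))
# Each source is recorded separately, so consumers can judge trust
# per source rather than per merged record.
g.add((merged, PROV.wasDerivedFrom, EX["agencyA/raw/123"]))
g.add((merged, PROV.wasDerivedFrom, EX["agencyB/raw/987"]))
g.add((merged, PROV.wasAttributedTo, EX["integration-pipeline"]))

print(g.serialize(format="turtle"))
```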

The European FuturICT (Information and Communication Technology) Paradigm is:   

  • Create a Big Data Commons   
  • Ethical, value-sensitive, culturally fitting ICT (responsive + responsible)   
  • Privacy-respecting data-mining   
  • Platforms for collective awareness   
  • Participatory platforms, new opportunities for everyone   
  • A new information ecosystem   
  • Coevolution of ICT with society   
  • Democratic control   
  • Socio-inspired ICT (socially adaptive, self-organizing, self-regulating, etc.)   
  • A 'trustable web'   

Big data might benefit from ontology technology, but such efforts usually fail.

  • How to do it right   
    • Create an incremental, evolutionary process, where what is good survives and what is bad fails
    • Create a scenario in which people will find it profitable to reuse ontologies, terminologies, and coding systems that have been tried and tested
    • Avoid silo effects, so that the results of investment in semantic technology accumulate effectively
    • Ontologies should mimic the methodology used by the Gene Ontology (GO), following the principles of the OBO Foundry: http://obofoundry.org
    • Ontologies in the same field should be developed in coordinated fashion, to ensure that there is exactly one ontology for each subdomain
    • Ontologies should be developed incrementally, in a way that builds on successful user testing at every stage
  • AmandaVizedom: I can envision a Grand Challenge like this:   
  • Create a tool, of the sort that would work with an ontology repository such as OOR, to support the following activities (make them relatively easy and make them reliable/repeatable):   
    • (a) Someone with an ontology registers it, and either adds it to the repository or provides sufficient information for the tool to access it remotely. The tool provides assistance identifying key properties of the ontology that are relevant to its suitability for various types of usage. This assistance includes some manual entry, some automated validation and metrics generation (a toy metrics sketch follows this list), and some semi-automated generation of information.
    • (b) Someone looking for ontologies comes to the tool and gets help finding ontologies that might meet their needs. The tool assists them in specifying their need, by entering their ontology-specific requirements to the extent that they know them, and by describing aspects of their intended usage; this process, too, is semi-assisted. The key feature that makes this a Grand Challenge: it is not just building a tool; it requires the research and testing to establish some of the relationships between ontology characteristics and usage characteristics. It also requires not just implementation of known evaluation techniques, but also research to develop others. On the other hand, it need not be complete to be valuable: increments of improvement could be high-value advancements over the current state.
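A toy sketch of the "automated metrics generation" step in (a), assuming rdflib; real repository tooling (for OOR or otherwise) would compute far richer suitability metrics than these crude counts.

```python
# Hedged sketch: a few trivial ontology metrics with rdflib.
from rdflib import Graph
from rdflib.namespace import RDFS, RDF, OWL

def basic_metrics(path):
    g = Graph()
    g.parse(path)  # rdflib guesses the serialization from the file suffix
    classes = set(g.subjects(RDF.type, OWL.Class))
    props = set(g.subjects(RDF.type, OWL.ObjectProperty))
    props |= set(g.subjects(RDF.type, OWL.DatatypeProperty))
    # Terms without human-readable labels: one rough usability signal.
    unlabeled = [t for t in classes | props if (t, RDFS.label, None) not in g]
    return {
        "classes": len(classes),
        "properties": len(props),
        "terms_missing_labels": len(unlabeled),
    }

# Example (hypothetical file): print(basic_metrics("my_ontology.owl"))
```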

Need a science of multi-level complex systems!   

Linked Open Data (LOD)   

Linked Open Data (LOD) is hard to create   

  • Linked Open Data is also hard to query (natural-language query systems are a research goal)
  • Two ongoing UMBC dissertations hope to make it easier:
    • Varish Mulwad: generating linked data from tables (Inferring the Semantics of Tables)
    • Lushan Han: querying linked data with a quasi-NL interface (Intuitive Query System for Linked Data)
      • Key idea: reduce problem complexity by having the user (1) enter a simple graph and (2) annotate it with words and phrases

  • Both need statistics on large amounts of LOD data and/or text   
  • Linked Data is an emerging paradigm for sharing structured and semi-structured data   
    • Backed by machine-understandable semantics   
    • Based on successful Web languages and protocols   
  • Generating and exploring Linked Data resources can be challenging   
    • Schemas are large, and there are too many URIs
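To illustrate the generating-linked-data-from-tables idea above, a minimal sketch in which the column-to-property mapping is hand-written; the research cited aims to infer such mappings automatically, so this is not the dissertation system itself, and all example.org names are illustrative.

```python
# Hedged sketch: mapping table rows to RDF triples once column
# semantics are known.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, FOAF

EX = Namespace("http://example.org/")  # hypothetical namespace

table = [
    {"name": "Grace Hopper", "city": "Arlington"},
    {"name": "Alan Turing", "city": "Wilmslow"},
]
# Hand-written here; inferring this mapping is the hard research problem.
column_property = {"name": FOAF.name, "city": EX.city}

g = Graph()
for i, row in enumerate(table):
    subject = EX[f"row/{i}"]
    g.add((subject, RDF.type, FOAF.Person))
    for column, value in row.items():
        g.add((subject, column_property[column], Literal(value)))

print(g.serialize(format="turtle"))
```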