I. Executive Summary (1NYJ)
The science and engineering communities are producing very large data sets that are also increasingly complex and diverse. These data sets are very well suited for particular narrowly-defined, discipline-specific purposes. In principle, these data sets could be used for solving more broadly-defined scientific problems such as understanding whole organisms, ecosystems and human populations. However, incorporating multiple data types from multiple sources to solve these problems remains a significant challenge. For example, a testable macroscopic biological hypothesis might involve the effect of environmental or climatic change on the genomic makeup of a given organism. As another example, a macroeconomic hypothesis concerning the most efficient use of resources to improve the quality of life in a region will depend on cultural and environmental knowledge as well as economic statistics. (1NYK)
While the data sets that are currently being developed typically generate the greatest enthusiasm among the communities that are creating them, data sets created in the past can be of equal importance to related communities. Biodiversity is a case in point. The painstaking observations by generations of biologists over centuries represent an important resource for modern ecology and biodiversity studies, but those observations are locked in old textbooks and monographs that are not easily accessed by modern computing technology. The problem is not just the difference in recording media (paper versus disks) but also the enormous changes in terminology over time. Current data sets run the risk of an even more rapid obsolescence as the meaning of the data fields is forgotten, even by the individuals who introduced them. (1NYL)
We believe in the promise of semantic technologies based on logic, databases and the Semantic Web as a means of addressing the problems of meaningful access to and integration of data over decades and centuries. Such technologies enable distinguishable, computable, reusable, and sharable meaning of information artifacts, including data sets, documents and services. We also believe that making this vision a reality requires additional supporting resources, and that these resources should be open, extensible, and provide common services over the ontologies. Our belief in this vision is based not only on current experience but also on the deep philosophical foundations that underlie modern ontological engineering. (1NYM)
We propose to develop an open ontology repository (OOR) of controlled vocabularies and knowledge models that have been encoded in RDF, OWL, and other knowledge representation languages. More specifically, we propose to develop an open repository for the metadata and data sets of the following communities: (1NYN)
1. Biology, especially the genomics, proteomics and other "omics" communities. This will be based on the highly successful BioPortal repository. (1NYO)
2. Biodiversity, especially the species pages in the Encyclopedia of Life. (1NYP)
3. Climate and environmental communities (including both natural environments and built environments). (1NYQ)
4. Human culture and sociology. (1NYR)
The data sets of these communities share a number of characteristics that make them well suited to the proposed OOR: (1NYS)
1. They represent large data sets that are of considerable importance to their respective communities. (1NYT)
2. Some of the data sets, especially those of the biodiversity community, represent data that is very old, sometimes centuries old, yet still of considerable value. (1NYU)
3. The integration of these data sets opens up exciting research opportunities not only for the natural sciences but also for environmental and social sciences. (1NYV)
4. The data sets have complex semantics, and there is no clear distinction between data and metadata. As a result, modern relational database technology is poorly suited for modeling the data sets. (1NYW)
While these data sets provide a compelling case for the proposed OOR, the prospect of broader impacts is even more compelling. As an integral part of the proposed project, we intend to foster a vigorous educational outreach program to bring other data-intensive research communities into the OOR initiative. Since the OOR will be an open, federated architecture and infrastructure, communities will be able to use it to host their own ontologies as well as to adapt previously established ontologies for their own purposes. (1NYX)
To address the issue of long-term sustainability, we propose to develop a new paradigm for maintaining semantic linkages available through the Internet. Specifically, we will develop a federated knowledge repository that can collectively correct for multiple points of failure and can foster collaborative stewardship of scientific knowledge. Particular emphasis will be given to the development of technological solutions that build on existing, proven architectures for maintaining biological (e.g., BioPortal, OBO Foundry and the International Nucleotide Sequence Data Consortium) and abiotic data (e.g., the National Climatic Data Center), as well as standards for metadata and services (e.g., ISO XMDR, WSDL and UDDI). (1NYY)
II. The OOR Statement of Purpose (1NYZ)
The purpose of an OOR is to provide an architecture and an infrastructure that supports a) the creation, sharing, searching, and management of ontologies, and b) linkage to database and XML Schema structured data and documents. Complementary goals include fostering the ontology community, the identification and promotion of best practices, and the provision of services relevant to ontologies and instance stores. Examples of anticipated services include automated semantic interpretation of content expressed in knowledge representation languages, the creation and maintenance of mappings among disparate ontologies and content, and inference over this content. We believe that the OOR will ultimately support a broad range of semantic services and applications of interest to enterprises and communities. (1NZ0)
Achieving these goals will help reduce semantic ambiguity whenever and wherever information is shared, thereby allowing information to be located, searched, categorized, and exchanged with a more precise expression of its content and meaning. The artifacts of the repository will provide a semantic grounding for diverse formats and domains, ranging from the conceptual domains and specific disciplines of communities to technical schemas such as WSDL, UDDI, RSS, and XML schemas, with the grounding itself expressed in standard ontology languages such as RDF, OWL, Common Logic, and others. Perhaps most importantly, the repository will enable wide-scale knowledge re-use and reduce the need to re-invent the wheel when defining concepts and relationships that are already understood. (1NZ1)
These goals cannot be achieved all at once; their pursuit must track the evolution of best practices as well as of the technology itself. It is also good system development practice to bound complexity by defining a system in terms of a series of short-term, achievable objectives. For this reason, as with other such initiatives, it is envisioned that the OOR will be developed in a series of phases, proceeding from the simple to the complex, with achievable goals that capitalize on previous experience and the emergence of technology over time. It is important to note that in any given phase, planning and prototyping for subsequent phases is always in progress. (1NZ2)
III. State of the Art (1NZ3)
The purpose of this section is to set out the major design decisions and the technology choices which are important to the creation of ontology repositories. (1NZ4)
Ontology repositories support the storage, search, retrieval and interoperation of multiple ontologies. (1NZ5)
Ontology repositories support macro-level storage, query and retrieval (across the collection of ontologies) and micro-level operations (within individual ontologies). At each level we would like to support both text search and semantic search (variously: faceted search, SPARQL, and search that is aware of the ontology and its representation language). Some ontology repositories have used the same technologies for both macro-level and micro-level operations. (1NZ6)
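As an illustration of micro-level semantic search, the following minimal sketch uses the Python rdflib library to run a SPARQL query over a single ontology and find classes by label; the ontology file name and the "membrane" filter are illustrative placeholders, not part of any existing repository interface.

    from rdflib import Graph

    onto = Graph()
    onto.parse("cell_ontology.owl", format="xml")   # hypothetical local ontology file

    # Micro-level semantic search: classes whose labels mention "membrane".
    QUERY = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX owl:  <http://www.w3.org/2002/07/owl#>
    SELECT ?cls ?label WHERE {
      ?cls a owl:Class ;
           rdfs:label ?label .
      FILTER(CONTAINS(LCASE(STR(?label)), "membrane"))
    }
    """
    for cls, label in onto.query(QUERY):
        print(cls, label)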
A key decision is the choice of a representation for the ontologies. Current practice includes text, frames (e.g., OBO), graphs (e.g., RDF), and various types of logic: description logics (e.g., OWL DL), first-order logic (e.g., Common Logic), sorted logics, and possibly higher-order logic (HOL). Other possibilities include the use of UML (e.g., in the OMG Ontology Definition Metamodel). (1NZ7)
Ontologies have been stored in long, narrow relations such as "triple stores" of RDF triples (subject, predicate, object), in relational databases, and in customized data stores. Increasingly, implementers are using "quad stores" in order to support named graphs. "Column stores" such as MonetDB and Vertica have also been used to store ontologies. (1NZ8)
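The following is a minimal sketch of the "quad store" idea using rdflib's Dataset: each ontology is parsed into its own named graph, so the fourth component of every quad records which ontology a triple belongs to. The graph identifiers and file names are illustrative.

    from rdflib import Dataset, URIRef

    store = Dataset()

    # One named graph per ontology (identifiers and sources are placeholders).
    go = store.graph(URIRef("http://example.org/graphs/gene-ontology"))
    go.parse("gene_ontology.owl", format="xml")

    envo = store.graph(URIRef("http://example.org/graphs/environment-ontology"))
    envo.parse("environment_ontology.owl", format="xml")

    # Macro-level view across the repository: which graphs are present and how large.
    for g in store.contexts():
        print(g.identifier, len(g), "triples")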
For the purposes of ontology interoperation it helps to have all of the ontologies in the repository encoded in a common representation. However, this requires the sometimes difficult and lossy translation of ontologies from their various representations into the common one. Some ontology repositories instead store ontologies in their native representation, with metadata to identify the representation language. (1NZ9)
We also need some way to support ontology interoperation by specifying the mappings among entities, e.g., via relationships such as same_as, is_a, and part_of. Other mapping relationships include: see_also, similar_to. Some ontology mapping consistency checking tools check that mappings between partially ordered ontologies, e.g., taxonomies, preserve the partial orders. (1NZA)
Ontology repositories that support partially ordered ontologies (taxonomies and partonomies) may choose to materialize the transitive closure of the partial-order relation. This provides faster query evaluation at the expense of additional ingestion cost, storage, and maintenance. (1NZB)
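A minimal sketch of this materialization step, assuming rdflib and using rdfs:subClassOf as the partial-order relation; a production repository would rely on its native store's inference support rather than this naive fixpoint loop.

    from rdflib import Graph
    from rdflib.namespace import RDFS

    def materialize_subclass_closure(g: Graph) -> None:
        """Add every implied rdfs:subClassOf edge so is-a queries become single lookups."""
        changed = True
        while changed:
            changed = False
            edges = list(g.triples((None, RDFS.subClassOf, None)))
            for a, _, b in edges:
                for _, _, c in list(g.triples((b, RDFS.subClassOf, None))):
                    if (a, RDFS.subClassOf, c) not in g:
                        g.add((a, RDFS.subClassOf, c))
                        changed = True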
The provenance of definitions is important to the credibility, scientific attribution, and regulatory compliance of ontologies. In particular, many definitions are embodied in legislation, administrative regulations, court decisions, and professional society standards. (1NZC)
Provenance and other metadata are distinguishing features of recent ontology repositories. Such metadata ranges from authorship, creation date, and version information to evaluation and usage reports. Other metadata may include intended use (context). (1NZD)
Modularization support is useful for large ontologies, and for facilitating the reuse and mapping of portions of ontologies. (1NZE)
In a distributed setting, ontology repository developers are increasingly adopting Service Oriented Architectures (SOA), providing access, search, and other capabilities via web services. Two major approaches to SOA are REST and SOAP. REST is built on HTTP, with a small set of operations (GET, PUT, POST, DELETE) and the use of URL (or URI) addresses for all objects of interest. SOAP is based on XML remote procedure calls. REST is much simpler to implement and should be adequate for typical ontology repository functions, while SOAP is supported by a wide variety of software tools. Both SOA approaches are currently being used. (1NZF)
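To make the REST option concrete, here is a minimal sketch of an ontology resource exposed through the four HTTP operations, written in Python with Flask; the /ontologies/<name> layout and the in-memory store are illustrative placeholders rather than the OOR interface specification.

    from flask import Flask, abort, request

    app = Flask(__name__)
    ontologies = {}          # in-memory stand-in for the repository back end

    @app.route("/ontologies/<name>", methods=["GET", "PUT", "POST", "DELETE"])
    def ontology(name):
        if request.method in ("PUT", "POST"):
            ontologies[name] = request.get_data(as_text=True)   # store the submitted document
            return "", 201
        if request.method == "DELETE":
            ontologies.pop(name, None)
            return "", 204
        if name not in ontologies:                               # GET
            abort(404)
        return ontologies[name], 200, {"Content-Type": "application/rdf+xml"}

    if __name__ == "__main__":
        app.run()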
Finally, an ontology repository typically facilitates access to a variety of ontology-related tools: creation tools, editors, pretty printers, visualization tools, differencing tools, modularization tools, import/export, version management, access control, inference engines, explanation, and summarization. (1NZG)
IV. Quality and Gatekeeping (1NZH)
We distinguish between gatekeeping and quality control. Gatekeeping criteria are a set of minimal requirements that any ontology within the OOR has to meet. These criteria are intended to enable users of the OOR to quickly find ontologies that fit their needs; they are not supposed to ensure the quality of the ontologies themselves. (1NZI)
Gatekeeping Criteria (1NZJ)
The ontologies in the OOR have to meet the following criteria: (1NZK)
1. The ontology is submitted in a publicly described language and format.
2. The ontology is read accessible.
3. The ontology is expressed in a formal language with a well-defined syntax.
4. The authors of the ontology provide the required metadata as specified under section V.
5. The ontology has a clearly specified and clearly delineated scope.
6. Successive versions of the ontology are clearly identified.
7. The ontology is appropriately named. (1NZL)
It is particularly important that the required metadata include information about the process that is employed to create and maintain the ontology. (Is the ontology maintained in a cooperative and transparent process? Can anybody participate in this process?) Further, the metadata has to include information about the license under which the ontology is submitted. (1NZM)
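The gatekeeping criteria lend themselves to partial automation. The sketch below, assuming rdflib and using placeholder field names for the required metadata, checks that a submission parses in a declared format and that the required metadata fields are present; it is illustrative, not the checker the OOR will adopt.

    from rdflib import Graph

    REQUIRED_METADATA = {"name", "version", "scope", "language",
                         "license", "maintenance_process"}        # placeholder field names
    ACCEPTED_FORMATS = {"xml", "turtle", "n3", "nt"}               # rdflib parser names

    def gatekeeping_problems(document: str, fmt: str, metadata: dict) -> list:
        """Return a list of problems; an empty list means the submission passes."""
        problems = []
        if fmt not in ACCEPTED_FORMATS:
            problems.append(f"format '{fmt}' is not among the accepted, publicly described formats")
        else:
            try:
                Graph().parse(data=document, format=fmt)           # criterion 3: well-defined syntax
            except Exception as exc:
                problems.append(f"document does not parse as {fmt}: {exc}")
        missing = REQUIRED_METADATA - metadata.keys()
        if missing:
            problems.append(f"missing required metadata: {sorted(missing)}")
        return problems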
Quality Control (1NZN)
It is not sufficient for the OOR simply to store ontologies; it also needs to enable the evaluation of the ontologies within it. The OOR will offer functionality like that found on social networking sites, allowing users to comment on ontologies and rank them. Further, the OOR will enable selective views of the repository using tags provided by subcommunities that characterize ontologies with respect to their chosen criteria. For example, such a view might select ontologies for specific fields of research or industries, or ontologies satisfying specific quality criteria or levels of organizational approval. (1NZO)
V. Metadata for Ontologies (1NZP)
V.1 Purpose of the Ontology Metadata (1NZQ)
To support the sharing and reuse of ontologies within the repository, the OOR will store both the ontologies themselves and metadata about them. (1NZR)
The metadata will allow users to: (1NZS)
* determine whether an ontology is suitable for a user's purpose;
* capture the design rationales that underlie the ontology;
* find information about the author, author credentials, and the source of ontology reference material;
* retrieve ontologies for use in domain applications;
* retrieve ontologies to be integrated with other ontologies;
* retrieve ontologies that will be extended to create new ontologies;
* determine whether or not an ontology can be integrated with given ontologies;
* determine whether a set of ontologies retrieved from the repository can be used together;
* determine whether an ontology in the repository can be partially shared. (1NZT)
There will be policies for the creation and modification of metadata, for the documentation of ontologies, and for the management of their persistence and sustainability. (1NZU)
Users (including end-users, ontology and repository developers, subject matter experts, and stakeholders) should participate in the collaborative ontology development life cycle and in decisions regarding what metadata are suitable for ontologies in the repository. (1NZV)
The metadata will include both logical metadata (logical properties of the ontology independent of any implementation or engineering artifact) and engineering metadata (properties of the ontology considered as an engineering artifact). (1NZW)
V.2 Logical Metadata (1NZX)
V.2.1 Language (1NZY)
The first item of logical metadata is the language used to specify the ontology. (1NZZ)
The report "Evaluating Reasoning Systems" contains a classification of formal languages used to specify ontologies. A formal language has a syntax (logical symbols together with a formally specified grammar) and a model theory (which specifies the conditions under which expressions in the language can be given particular truth assignments). (1O00)
A formalizable language has a syntax but no model theory. Examples of such approaches include Topic Maps and folksonomies (which are written in XML) and ISO 15926 (which is written in EXPRESS). (1O01)
Finally, some ontologies are specified only in natural language; examples include WordNet, taxonomies, and thesauri. (1O02)
V.2.2 Modularity (1O03)
A second property of ontologies is modularity: is a particular ontology a monolithic set of axioms, or is it composed of a set of smaller modules? Furthermore, is each module considered to be a separate ontology within the repository? If not, what are the relationships between the modules, and which modules of an ontology can be used separately? (1O04)
For example, the Process Specification Language (PSL) consists of a set of modules which are extensions of a common core theory PSL-Core. Metadata for each module specifies which other modules must also be included when using the module. (1O05)
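A minimal sketch of using such dependency metadata: given a table stating which modules each module requires, compute everything that must be loaded alongside a chosen module. The module names and the requires table below are illustrative, not the actual PSL module structure.

    def required_modules(module: str, requires: dict) -> set:
        """Return the module together with everything it transitively requires."""
        needed, frontier = set(), [module]
        while frontier:
            m = frontier.pop()
            if m not in needed:
                needed.add(m)
                frontier.extend(requires.get(m, []))
        return needed

    # Illustrative dependency table in the spirit of PSL's layered extensions.
    requires = {
        "psl_core": [],
        "occurrence_trees": ["psl_core"],
        "durations": ["occurrence_trees"],
    }
    print(required_modules("durations", requires))
    # {'durations', 'occurrence_trees', 'psl_core'}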
V.2.3 Relationships between ontologies (1O06)
We can also specify various logical relationships between ontologies within the repository, including mutual consistency, extension, entailment, and semantic mappings. (1O07)
V.3 Engineering Metadata (1O08)
In addition to the logical metadata for ontologies, we need to specify metadata for ontologies considered as engineering artifacts. This includes the following (a small illustrative record appears after the list): (1O09)
* provenance
* versioning
* existing applications of the ontology (e.g., interoperability, search, decision support)
* domain-specificity (e.g., biology, supply chain management, manufacturing) (1O0A)
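As a small illustration, the sketch below records engineering metadata as RDF using rdflib and Dublin Core terms; the ontology IRI and the property choices are placeholders for whatever vocabulary (e.g., OMV) the OOR ultimately adopts.

    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DCTERMS, OWL, RDF

    meta = Graph()
    onto = URIRef("http://example.org/ontologies/habitat")          # hypothetical ontology IRI

    meta.add((onto, RDF.type, OWL.Ontology))
    meta.add((onto, DCTERMS.creator, Literal("Example Working Group")))     # provenance
    meta.add((onto, DCTERMS.hasVersion, Literal("1.2")))                    # versioning
    meta.add((onto, DCTERMS.subject, Literal("biodiversity")))              # domain-specificity
    meta.add((onto, DCTERMS.description,
              Literal("Used for habitat search and decision support")))     # existing applications

    print(meta.serialize(format="turtle"))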
V.4 Conclusions regarding Metadata (1O0B)
The Ontology Metadata Vocabulary (OMV), Dublin Core, ISO 11179, ISO 19763, and other existing approaches to provenance and versioning metadata are all candidates for aspects of the metadata for ontologies in the OOR. (1O0C)
We will use an empirical approach to the identification and evaluation of ontology metadata. Proposals for ontology metadata already exist, and we will evaluate them using use-case scenarios. These scenarios both motivate the use of the metadata and help establish best practices. (1O0D)
VI. Repository Architecture (1O0E)
The architecture of a repository for enabling wide-scale searching and sharing of ontologies will be open and extensible. The design will be modular in nature and will provide for ontology storing, sharing, searching, governance, and management of the repository infrastructure and content. (1O0F)
VI.1 Architecture Approach (1O0G)
The core approach for the Open Ontology Repository is a federated, service-oriented architecture. This approach provides for distributed ontology storage, repository management and service support. Metadata will be provided for every ontology in the repository. The repository will also provide connections for logical services, inference engines, etc. (1O0H)
Those who participate in the federation must provide the required metadata, which must include any access constraints. (1O0I)
The repository will be described by an overarching ontology that covers both the metadata of the stored ontologies and the information required for operational use. (1O0J)
VI.2 Core Requirements (1O0K)
The requirements presented here are important for enabling wide-scale knowledge re-use. (1O0L)
1. The repository architecture shall be scalable.
2. The architecture shall be optimized for sharing, collaboration and reuse.
3. The repository shall be capable of supporting ontologies in multiple formats and levels of formalism.
4. The repository architecture shall support distributed repositories.
5. The repository architecture shall support explicit machine usable/accessible formal semantics for the meta-model of the repository.
6. The repository shall provide a mechanism to address intellectual property and related legal issues/problems.
7. The repository architecture shall include a core set of services, such as support for adding, searching and mapping across ontologies and data related to the stored ontologies.
8. The repository architecture shall support additional services both directly within the province of the repository and as external services.
9. The repository should support all phases of the ontology lifecycle. (1O0M)
VI.3 Repository Management (1O0N)
An ontology repository requires mechanisms for effective management. The understanding is that as a repository and its infrastructure evolve, more management support mechanisms will be included. (1O0O)
Required mechanisms will provide the capabilities to: (1O0P)
1. enforce access policies
2. enforce submission policies
3. enforce governance policies
4. enforce change management policies
5. control user and administrator access (1O0Q)
Highly recommended mechanisms will provide the capabilities to: (1O0R)
1. create usage reports
2. validate syntax
3. check logical consistency
4. automatically categorize a submission (one approach is sketched below) (1O0S)
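As one example of the last mechanism, the sketch below guesses domain tags for a submission from the namespaces it reuses, assuming rdflib; the namespace-to-domain table is illustrative only.

    from rdflib import Graph

    DOMAIN_HINTS = {                                     # illustrative namespace-to-tag table
        "http://purl.obolibrary.org/obo/": "biology",
        "http://rs.tdwg.org/dwc/terms/": "biodiversity",
        "http://www.w3.org/2003/01/geo/": "environment",
    }

    def categorize(path: str, fmt: str = "xml") -> set:
        """Guess domain tags for a submitted ontology from the namespaces it uses."""
        g = Graph()
        g.parse(path, format=fmt)
        tags = set()
        for triple in g:
            for term in triple:
                for prefix, tag in DOMAIN_HINTS.items():
                    if str(term).startswith(prefix):
                        tags.add(tag)
        return tags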
VI.4 Service and Application Support (1O0T)
OOR interfaces should support internal and external services and applications including: (1O0U)
* Ontology creation tools
* Ontology editors
* Ontology differencing tools
* Ontology modularization tools (clustering, etc.)
* Ontology export
* Ontology visualization (e.g., graph visualization)
* Version management
* Access control (1O0V)
VI.5 Discovery Support (1O0W)
To facilitate knowledge discovery, the repository shall provide metadata capabilities that support search, the governance process, and management. The repository will support discovery by at least the following facets (a simple lookup over such facets is sketched after the list): (1O0X)
* domain
* author/creator/source
* version
* language
* terminology and controlled vocabularies
* quality
* mapping
* inference (1O0Y)
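A minimal sketch of faceted discovery over ontology metadata records; the record fields mirror the facets above, and the sample records are illustrative.

    def discover(records: list, **facets) -> list:
        """Return the metadata records whose fields match every requested facet."""
        return [r for r in records
                if all(r.get(field) == value for field, value in facets.items())]

    records = [
        {"name": "habitat", "domain": "biodiversity", "language": "OWL", "version": "1.2"},
        {"name": "climate-obs", "domain": "climate", "language": "RDF", "version": "0.9"},
    ]
    print(discover(records, domain="biodiversity", language="OWL"))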
VII. Roles of Key Personnel (1O0Z)
Ken Baclawski is the PI. He is responsible for overall coordination of the project and for directing outreach to other communities, especially those represented by the DataNet Partners. This will specifically provide for active community input and participation in all phases and all aspects of DataNet Partner activities. In his capacity as a faculty member at Northeastern University, he will develop new tools and capabilities for learning that integrate research and education at all levels. (1O10)
Katherine Goodier is a co-PI and consultant. She will assist the PI in project coordination and outreach activities as detailed in the responsibilities of Prof. Baclawski. She is specifically responsible for directing a vigorous and comprehensive assessment and evaluation during all phases of the project. She is responsible for the human culture ontology and data set, and will engage in research to integrate this ontology with the biodiversity, environmental and climatic ontologies. (1O11)
Neil Sarkar will lead a subcontract at MBL. The MBL sub-team is responsible for coordination with the Biodiversity communities, especially those communities involved in the Encyclopedia of Life, the natural environmental communities, and the climate communities. The sub-team will engage in research on a variety of topics, especially methods for dealing with documents and other artifacts as terminology and ontologies evolve over time, and methods for linking terminology in related, but distinct, communities. (1O12)
Peter Yim is a consultant and co-PI. He is responsible for managing team collaboration efforts. Although the team is geographically dispersed, it has been engaging in virtual meetings several times each week. He will not only be providing the collaboration infrastructure for the team, but he will also be engaging in research and development of new methods for improving the collaboration experience. This will be especially important for outreach efforts to DataNet Partners, which we envision to be mainly virtual meetings in a rich virtual environment. (1O13)
Mike Dean is a consultant and co-PI. He will be responsible for developing a set of modular software interfaces (loosely modeled after the Apache Server) allowing OOR instantiations to choose and configure the OOR capabilities, languages, and policies they want to support. He is also responsible for the data management life cycle, including integration, release packaging, and testing. Finally, he will be engaging in research on federation among OOR and non-OOR registries, repositories, and collaborative development environments (including Semantic MediaWiki). (1O14)
Leo Obrst will lead a subcontract at MITRE. The MITRE sub-team will address a number of aspects relevant to an OOR, which will include a repository of ontologies, instance data, rules, and potentially raw or source data linked to instances and ontologies, together with services to support public access, integration, analysis, adaptation, and preservation of knowledge-rich information for science. These aspects are the following: (1O15)
1) Ontology Evaluation. The development of methods, practices, services, and artifacts to support automated and human reviewed evaluation and comparison of ontologies stored in the repository. [2, 7, 8] (1O16)
2) Ontology Architectures, Modularization, and Alignment/Mapping. The development of requisite ontology-based architectures, including ontology lifecycle management, theories and implementations of ontology modularity, upper and middle ontologies, and research and software development of methods for automatically and semi-automatically aligning and mapping ontologies. The use and linking of metadata, controlled vocabularies, and ontologies, for intelligent search and decision support. [1, 11, 12, 17, 21, 22] (1O17)
3) Ontology, Instance, and Rule Reasoning. The development and implementation of efficient logic programming-based reasoning methods that amalgamate Semantic Web-based ontologies and rules with extended Prolog and Answer Set Programming, to be used for reasoning over the ontologies, instances, and rules of the repository. [15, 16, 19] (1O18)
4) Service Orchestration and Optimization to Support OOR Artifacts. Design and implementation of service-oriented architectures and services, including automated and semi-automated service orchestration and parallel optimization to support the repository. [6, 14, 20] (1O19)
5) Outreach to communities in bioinformatics, national command and control, and intelligence. [3, 4, 5, 9, 10, 13, 17, 18, 19, 20] (1O1A)
Bruce Bargmeyer will lead a subcontract at LBNL. The LBNL sub-team will contribute the results of the eXtended Metadata Registry (XMDR) project to date and extend the work as needed for the OOR project. The XMDR project is concerned with the development of improved standards and technology for storing and retrieving the semantics of data elements, terminologies, and concept systems (thesauri, taxonomies, ontologies, etc.) in metadata registries. Existing metadata registry standards include the ISO/IEC 11179 family of Metadata Registry standards. The XMDR project proposes extensions of the ISO/IEC 11179 family to support more diverse types of metadata and enhanced capabilities for semantics specification and queries. The LBNL sub-team has already created and tested a prototype extended metadata registry (see xmdr.org). The primary responsibility of the LBNL sub-team will be to work together with the NCBO sub-team to integrate the capabilities of XMDR and BioPortal. We will also work on standards to make wide deployment of the OOR feasible. (1O1B)
John Graybeal is a consultant. He will be responsible for outreach to the oceanographic and marine science communities. He is currently engaged in the Marine Metadata Interoperability (MMI) project, which is bringing semantic interoperability to marine science through a combination of semantic framework description, ontology registry development, vocabulary discovery, user-centered use cases, workshop hosting, and other community engagement. He will continue this effort in the context of DataNet and the OOR, where he will also be engaging more general environmental and climatic communities. In addition, he will engage in research on vocabulary creation, harmonization and term mapping. (1O1C)
Thomas Lyndon Wheeler is a consultant. He will be working with Katherine Goodier. (1O1D)
Mark Musen will lead a subcontract at the NCBO. The NCBO sub-team will be responsible for enhancing the BioPortal project to serve as the web server and database engine for the OOR. While BioPortal is currently used as a centralized repository for biomedical ontologies, it was designed as a general-purpose ontology repository, so it is an ideal foundation for the OOR. The NCBO sub-team will work with the LBNL sub-team to integrate the XMDR capabilities into BioPortal. (1O1E)