Ontology for Big Systems

The Big Data Story


The Cambrian explosion occurred 530 million years ago, when life on Earth experienced a sudden increase in diversity and in the rate of evolution. Over the past half century, we’ve entered a Cambrian age for information, knowledge and systems, coupled with a constantly evolving technology landscape. The amount of knowledge produced, published and shared by humanity has been growing exponentially. In the past decade more data has been collected, more video has been produced and more information has been published than in all of previous human history.

Greater computing power makes it ever easier to create and track data. Whether it is encoding an organism’s DNA, tracking Internet or credit usage, running the experiments at the Large Hadron Collider or gathering weather satellite data, each of these activities creates a staggering amount of data.

While the sheer size and scale of these data sets present their own challenges, understanding the data, garnering information and knowledge from it, and then intelligently combining it with other data sets all require an accurate representation of (the portion of) the world the data describes. This in turn necessitates that each data source adequately represent itself and make available the information needed to interpret the data out of context: for example, units of measurement, time-stamps, and annotations of data elements with terms from reference ontologies.
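To make this concrete, here is a minimal sketch, in Python with the rdflib library, of what such self-describing data might look like: a single observation published together with its unit of measurement, its time-stamp and links to terms from reference vocabularies. The choice of the SOSA and QUDT vocabularies, the example.org namespace and the values themselves are illustrative assumptions rather than anything prescribed here.

    # A self-describing observation: value, unit, time-stamp and links to
    # reference-ontology terms, expressed as RDF with rdflib.
    # SOSA (W3C sensor/observation vocabulary) and QUDT (units) are one
    # plausible choice; the example.org namespace is hypothetical.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, XSD

    SOSA = Namespace("http://www.w3.org/ns/sosa/")
    QUDT = Namespace("http://qudt.org/schema/qudt/")
    UNIT = Namespace("http://qudt.org/vocab/unit/")
    EX = Namespace("http://example.org/data/")

    g = Graph()
    obs = EX["observation/42"]

    g.add((obs, RDF.type, SOSA.Observation))
    g.add((obs, SOSA.observedProperty, EX.streamTemperature))  # what was measured
    g.add((obs, SOSA.hasSimpleResult, Literal("14.2", datatype=XSD.decimal)))
    g.add((obs, QUDT.unit, UNIT.DEG_C))                        # unit of measurement
    g.add((obs, SOSA.resultTime,
           Literal("2024-05-01T08:30:00Z", datatype=XSD.dateTime)))

    print(g.serialize(format="turtle"))  # publish in a standard, self-describing form

Any consumer that understands these shared vocabularies can interpret the measurement outside of the system that produced it, which is precisely what reuse in a new context requires.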

Imagine a future where intelligent agents play a more prominent role in the doctor-patient relationship. As a patient describes her symptoms to the doctor, an agent cross-references these symptoms with aggregated patient data to find similar patient profiles. Unable to determine the exact ailment, the doctor uses this information to prescribe a series of tests to further narrow the possibilities. Before the tests are carried out, a new paper is published linking a previously unknown gene to a symptom displayed by the patient. An agent monitoring this publication extracts the finding and flags the patient’s file for the doctor’s review. The next day, the doctor is alerted to the change and realizes that a number of the prescribed tests are unnecessary.

In order to realize such a vision, we require not only the ability to combine multiple sources of big data (patient data, research publications, gene information), but also the ability to understand how they are related to one another. Statistical analysis alone has its limits; we need a conceptual framework, or theory, alongside it. To effectively combine multiple data sets and systems, we need to be able to represent the assumptions and conceptualizations that underpin knowledge in those domains.

To use data effectively and combine it for other useful ends, data creators and publishers need to make explicit what their data represents, together with the context of the data and its creation (e.g., the systems that created and transformed it). This requirement necessitates developing theories about those parts of the world relevant to the data and its context. Without such theory, and the practice that follows from it, successful data reuse and adaptability will not be possible.

There are a variety of groups working towards this vision. For example, the Linked Open Data (LOD) initiative seeks to connect distributed data across the net. While there are many data sources available online today, that data is not readily accessible in a machine-interpretable form. The LOD cloud aims to create the requisite infrastructure to enable people to seamlessly build “mash-ups” by combining self-describing data from multiple sources.
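As an illustrative sketch of how such self-describing data is consumed, the query below (Python with the SPARQLWrapper library) pulls a handful of facts from DBpedia, one of the central hubs of the LOD cloud; combining the results with a second LOD source would follow the same pattern. The specific query and properties are assumptions chosen for illustration, not part of the initiative itself.

    # A small Linked Open Data query against DBpedia using SPARQLWrapper.
    # The query (Dutch cities and their populations) is only an example;
    # any LOD endpoint can be consumed in the same way.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        PREFIX dbr: <http://dbpedia.org/resource/>
        SELECT ?city ?population WHERE {
            ?city a dbo:City ;
                  dbo:country dbr:Netherlands ;
                  dbo:populationTotal ?population .
        }
        LIMIT 5
    """)

    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["city"]["value"], row["population"]["value"])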

Similarly, there has been a surge of work in bioinformatics, including the Open Biological and Biomedical Ontologies (OBO), the Gene Ontology and other resources that annotate big data with explicit semantics. These initiatives allow research groups to publish findings on genes, gene expression, proteins and so on in a standardized, consistent manner.
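At its core, such an annotation is simply an explicit, machine-readable link between a biological entity and an ontology term. The sketch below (again with rdflib) states that the human p53 protein participates in the Gene Ontology process “apoptotic process”; real GO annotations also carry evidence codes and provenance, which are omitted here for brevity.

    # A bare-bones semantic annotation: a gene product linked to a Gene
    # Ontology term via a Relation Ontology property.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDFS

    OBO = Namespace("http://purl.obolibrary.org/obo/")
    UNIPROT = Namespace("http://purl.uniprot.org/uniprot/")

    g = Graph()
    p53 = UNIPROT["P04637"]               # human p53 protein (UniProt accession)
    apoptosis = OBO["GO_0006915"]         # Gene Ontology term: apoptotic process
    participates_in = OBO["RO_0000056"]   # Relation Ontology: "participates in"

    g.add((p53, participates_in, apoptosis))
    g.add((apoptosis, RDFS.label, Literal("apoptotic process")))

    print(g.serialize(format="turtle"))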

Another example is the FuturICT project funded by the European Union. Its ultimate goal is to understand and manage complex, global, socially interactive systems, with a focus on sustainability and resilience. FuturICT will build a Living Earth Platform, a simulation, visualization and participation platform to support decision-making by policy-makers, business people and citizens.

Additionally, the Consortium of Universities for the Advancement of Hydrologic Science (CUAHSI) has been developing an information system to manage the semantics of water-related data. This project involves integrating and distributing observations gathered by a myriad of organizations, including 125 universities. The observations covered include water quality and quantity, soil water measurements, meteorology and precipitation, alongside the integration of the models that use these observations.