
Re: [ontology-summit] [VarietyProblem] Tackling the Variety Problem in Big Data

To: Ontology Summit 2014 discussion <ontology-summit@xxxxxxxxxxxxxxxx>
From: Ali SH <asaegyn+out@xxxxxxxxx>
Date: Tue, 11 Feb 2014 13:24:59 -0500
Message-id: <CADr70E32cBZ24EgQRfrsAGTjYHVKFqpK9Cxg=i5W+hqO3oMRjQ@xxxxxxxxxxxxxx>
Dear Ken and colleagues,

I share below some of my experiences in using ontologies and other AI technologies to tackle the variety problem in medium to big data.

On Fri, Jan 24, 2014 at 2:50 PM, kenb <kenb@xxxxxxxxxxx> wrote:
- What can ontologies best contribute to Big Data and how can this be done?

I've found some success in coupling ontologies to machine learning technologies and also to procedural code. The ontology in these cases often functions as the conceptual glue, providing a reference model for the development of these systems. While fragments of the ontology may be realized as computable components, in this role the ontology is primarily used as a design tool, guiding the specification of interfaces and unit tests for the different modules of the broader system implementation.
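
To make this concrete, here is a minimal sketch (in Python, with hypothetical concept and module names) of what I mean by using the ontology to specify an interface and a unit test for an extraction module; it is illustrative only, not a fragment of any production system:

    from dataclasses import dataclass
    from typing import Protocol

    @dataclass(frozen=True)
    class ExtractedConcept:          # concept taken from the reference ontology
        label: str                   # ontology term the extraction is bound to
        source_id: str               # identifier of the originating record
        confidence: float            # extractor's confidence in the binding

    class ConceptExtractor(Protocol):
        """Interface specified against the ontology: any module (ML, NLP or
        rule-based) plugged in here must emit ExtractedConcept instances."""
        def extract(self, text: str) -> list[ExtractedConcept]: ...

    def test_extractor_conforms(extractor: ConceptExtractor) -> None:
        # Unit test derived from the ontology's constraints on the concept.
        for c in extractor.extract("sample input text"):
            assert c.label and c.source_id and 0.0 <= c.confidence <= 1.0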


- What use cases would be the most compelling?

Coupled with machine learning technologies, ontologies can provide a powerful complement, yielding significant insight into big data. As we noted in the 2012 Ontology Summit communiqué, one of the limitations of statistical techniques is that they still require interpretation. In the context of Big Data, I consider machine learning technologies to be a subset of what is often called (Big) Data Analytics, and I find the grouping of ML techniques into supervised, semi-supervised and unsupervised to be a useful way to describe how ontologies can complement them. By no means is the following list meant to be exhaustive, but it outlines some of the ways that I've used ML algorithms and ontologies to tackle big data.

In the supervised and semi-supervised cases, you often know what concepts / patterns / complex concept structures you are looking to extract from your dataset. From a semiotic perspective, you are extracting some subset of signs from your dataset. The ontology provides a mechanism to plug these decontextualized signs into a broader semiotic system, thereby providing meaning and allowing greater use. As an example of this, I developed a global legal update ontology spanning multiple countries, multiple languages and multiple legal systems (covering common-law, civil-law and Napoleonic legal traditions). A combination of machine learning and NLP technologies was then deployed to extract complex-concept-structures from natural language texts. Elements of the extracted structures were then interpreted against the above ontology, triggering procedural code that performed a set of actions based on the interpretation of the signs. A small portion of the ontology was further encoded in OWL, representing a fragment of the semantics of a "legal update", and was used to validate and perform rudimentary reasoning over published sets of RDF triples relating the various versions of legal documents in a knowledge base. In this case, the ontology functioned both as a sort of software specification guide and as the definition of the interfaces between the ML, procedural and database-updating components.
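
As an illustrative sketch only (the namespace, document URIs and the "supersedes" relation below are hypothetical stand-ins, not the actual ontology), the RDF-level checking looked roughly like this, here expressed with rdflib:

    from rdflib import Graph, Namespace, RDF, URIRef

    LEX = Namespace("http://example.org/legal-update#")   # hypothetical namespace
    g = Graph()
    g.bind("lex", LEX)

    v1 = URIRef("http://example.org/doc/act-42/v1")        # prior version
    v2 = URIRef("http://example.org/doc/act-42/v2")        # newly published version

    g.add((v2, RDF.type, LEX.LegalUpdate))
    g.add((v2, LEX.supersedes, v1))

    # Rudimentary validation: every LegalUpdate must supersede some prior version.
    for update in g.subjects(RDF.type, LEX.LegalUpdate):
        assert any(g.objects(update, LEX.supersedes)), f"{update} lacks a prior version"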

In the unsupervised context, there are numerous ways this family of ML algorithms can be deployed (though the line between them and standard statistical correlation and prediction algorithms often gets a bit blurred). One useful way is to let them suggest concepts or complex-concept-structures that humans can then evaluate for human-meaningful semantic content. A simple demonstration of this view is the family of clustering algorithms. One can deploy a clustering algorithm to partition a dataset into smaller chunks, and a human can then inspect the results to see whether the partitions in the resultant set correspond to one or more human-meaningful structures. An ontology can then be used to bind the resulting clusters to a broader domain theory and plug them into other system modules.
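
A rough sketch of that workflow, using scikit-learn's KMeans on placeholder data (the cluster-to-concept bindings are of course hypothetical and stand in for the human inspection step):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(200, 5)                 # placeholder feature matrix
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    # A human inspects each partition and, where it is meaningful, binds it to
    # a concept in the domain ontology; unbound clusters stay unlabeled.
    cluster_to_concept = {0: "HighRiskTransaction", 1: None, 2: "SeasonalPattern"}

    for cluster_id, concept in cluster_to_concept.items():
        size = int((kmeans.labels_ == cluster_id).sum())
        print(f"cluster {cluster_id}: {size} items -> {concept or 'no ontology binding'}")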

Alternatively, unsupervised algorithms are often used for predictive purposes. An example of this could be a recommendation system that, based on user activity across a variety of input data streams, suggests a ranking of additional datasets (imagine Netflix's algorithms). The results of the recommendation system can be pruned to yield higher-quality results, either by deploying a set of heuristics based on domain knowledge about user behaviour, or, more formally, by making the assumptions of those heuristics explicit in the form of an ontology. The benefit of the latter approach is that it is extensible and can be used to bridge the results and data generated by the ML algorithms to other elements of the system. The downside, of course, is that deploying heuristics is much faster and cheaper, and if you are on a tight timeline or sprint, you may not have time for a more "principled" approach.
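
A toy sketch of the heuristic-pruning variant (the dataset names, scores and licensing rule are all made up for illustration; a formal ontology would make such assumptions explicit rather than hard-coding them):

    # Hypothetical output of some recommendation model.
    ranked = [
        {"dataset": "clinical-trials-2013", "score": 0.91, "license": "open"},
        {"dataset": "internal-hr-records",  "score": 0.88, "license": "restricted"},
        {"dataset": "pubmed-abstracts",     "score": 0.55, "license": "open"},
    ]

    def heuristic_prune(items, min_score=0.6):
        # Heuristic encoding of domain knowledge: drop restricted sources and
        # low-confidence suggestions before showing results to the user.
        return [r for r in items
                if r["license"] == "open" and r["score"] >= min_score]

    print(heuristic_prune(ranked))    # keeps only clinical-trials-2013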

Moreover, cutting across the different ML approaches, combining different ML algorithms is often useful. Drawing on an analogy to electrical circuits, you can arrange them in series, in parallel, or in sundry combinations thereof. Many of these algorithms require the definition (or extraction) of feature sets, and may require multiple layers of different types of ML algorithms. For example, in an NLP context, one layer may extract linguistic features, while another layer may use those linguistic features to extract relationships. In these cases, I've found ontologies useful for helping define the feature sets for the different layers of the ML algorithms, or for helping manage the deployment of multiple, possibly overlapping ML algorithms. Both of these can help overcome the variability problem in big data. Moreover, ontologies can often be used in an additional layer on top of the ML outputs to provide better-quality results, by implementing some heuristics (or a formalized theory) to prune the results of the ML algorithms, in a way similar to the Netflix example described above.
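
A minimal sketch of that layering idea, with hypothetical ontology-derived feature and relation sets and stand-in extractors in place of real ML components:

    # Ontology-derived vocabularies constraining each layer (hypothetical).
    LINGUISTIC_FEATURES = ["token", "lemma", "pos"]     # features for layer 1
    RELATION_TYPES = ["amends", "repeals", "cites"]     # relations for layer 2

    def layer1_extract_features(sentence: str) -> list[dict]:
        # Stand-in for a real tagger: emit only the features the ontology sanctions.
        return [{"token": t, "lemma": t.lower(), "pos": "X"} for t in sentence.split()]

    def layer2_extract_relations(features: list[dict]) -> list[tuple]:
        # Stand-in for a relation extractor restricted to the ontology's relations.
        lemmas = [f["lemma"] for f in features]
        return [("act-42", "amends", "act-7")] if "amends" in lemmas else []

    # Serial composition: layer 2 consumes layer 1's ontology-constrained output.
    print(layer2_extract_relations(layer1_extract_features("Act 42 amends Act 7")))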

Ontologies can be useful in defining what to look for in big data sets. They can also be useful in bridging the variability problem, by providing a global view across different sources. Machine learning technologies can also be used to tackle the variability problem in big data (e.g. sentiment analysis over a vastly varying dataset) and, when coupled with an ontology, provide an interpretation for a given purpose.

Hope this helps,
Ali


--


(•`'·.¸(`'·.¸(•)¸.·'´)¸.·'´•) .,.,
