Tackling the Variety Problem in Big Data (451U)
GaryBergCross In terms of things I heard today on Synthesis to include in the communique I would suggest we include the point that Track C (Matthew) made about addressing the level of semantics needed by various types of Application Domains. (4BEP)
GaryBergCross stated: "To your starter list I would offer the idea that Ontologies can also offer a basis for better data/metadata annotation. Work includes semantic tagging to help with discovery. This is an idea going back to things like Weinstein, Peter. "Ontology-based metadata." Proceedings of the Third ACM Digital Library Conference. 1998." (469N)
BartGajderowicz stated: "I can propose two ideas which relate to machine learning, data mining, and their relation to ontologies. 1) My MSc work looked at extending ontologies with machine learning, for the purpose of ontology mapping. This can be applied to merging datasets which have associated ontologies. http://www.scs.ryerson.ca/~bgajdero/msc_thesis/ I'd be happy to provide more information. 2) I'm not sure if this would fall under the Variety Problem domain, but the issue of asking the right questions of the data is an important aspect of data science (there are many other labels for this field, but I'm choosing this one for now). Understanding the underlying semantics of data points, data records, etc., allows a data scientist to: - Ask the right questions, which is important when configuring statistical and data-mining models, as well as the experiments. - Interpret experiment results in various ways. - Draw conclusions which may lead to a better understanding of the data, as well as better questions for the next iteration of experiments." (469O)
Discussion of Track D Questions and Answers (46BU)
Elaborate the list of sources of Big Data variety that have the greatest potential for benefiting from the use of ontologies. (46BV)
I've found some success in coupling ontologies to machine learning technologies and also to procedural code. The ontology in these cases often functions as the conceptual glue, providing a reference model for the development of these systems. While fragments of the ontology may be realized as computable components, in this aspect, the ontology is primarily used as a design tool, leading to the specification of interfaces and unit testing for different modules of the broader system implementation. (46BX)
What use cases would be the most compelling? (46BZ)
Coupled with machine learning technologies, ontologies can provide a powerful complement, yielding significant insight into big data. As we noted in the 2012 Ontology Summit Communique, one of the limitations of statistical techniques is that they still require interpretation. In the context of Big Data, I consider Machine Learning technologies to be a subset of what is often called (Big) Data Analytics, and find the grouping of ML techniques as supervised, semi-supervised and unsupervised to be a useful way to describe how ontologies can complement them. The following list is by no means exhaustive, but it outlines some of the ways that I've used ML algorithms and ontologies to tackle big data. (46C1)
In the supervised and semi-supervised cases, you *often* know what concepts / patterns / complex concept structures you are looking to extract from your dataset. From a semiotic perspective, you are extracting some subset of signs from your dataset. The ontology provides a mechanism to plug these decontextualized signs into a broader semiotic system, hence providing meaning and allowing greater use. As an example of this, I developed a global legal update ontology spanning multiple countries, multiple languages and multiple legal systems (covering common, civil and Napoleonic-based legal traditions). A combination of machine learning and NLP technologies was then deployed to extract complex-concept-structures from natural language texts. Elements of the extracted structures were then interpreted in the above ontology, triggering procedural code that performed a set of actions based on the interpretation of the signs. A small portion of the ontology was further encoded in OWL, representing a fragment of the semantics of a "legal update", and used to validate and perform rudimentary reasoning over published sets of RDF triples relating the various versions of legal documents in a knowledge base. In this case, the ontology functioned as a sort of software specification guide and also defined the interfaces between the ML, procedural and database updating components. (46C2)
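The "interpret, then trigger procedural code" step described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the concept names, the `Extracted` record shape and the handler functions are all hypothetical and are not taken from the legal update ontology itself.

```python
from dataclasses import dataclass

# Hypothetical mini-ontology: concept -> parent concept (is-a links).
IS_A = {
    "Amendment": "LegalUpdate",
    "Repeal": "LegalUpdate",
    "Consolidation": "LegalUpdate",
}


@dataclass
class Extracted:
    """A structure as an ML/NLP extractor might emit it (hypothetical shape)."""
    label: str     # concept label assigned by the extractor
    document: str  # identifier of the affected document


def ancestors(concept):
    """Return the concept followed by all of its is-a ancestors."""
    chain = [concept]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain


def mark_withdrawn(item):
    return f"{item.document} marked as withdrawn"


def log_update(item):
    return f"update logged for {item.document}"


# Procedural actions keyed by ontology concept (hypothetical handlers).
HANDLERS = {"Repeal": mark_withdrawn, "LegalUpdate": log_update}


def interpret(item):
    """Run the handler of the most specific concept that has one."""
    for concept in ancestors(item.label):
        handler = HANDLERS.get(concept)
        if handler is not None:
            return handler(item)
    return None  # the sign has no interpretation in this ontology
```

Here a "Consolidation" has no handler of its own, so it falls back to the handler attached to its parent concept, which is one way the ontology can act as the interface between the extraction and procedural components.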
In the unsupervised context, there are numerous ways that this family of ML algorithms can be deployed (though the line with standard statistical correlation and prediction algorithms often gets a bit blurred). One useful application is to suggest concepts or complex-concept-structures that can then be evaluated by humans as to whether they contain human-meaningful semantic content. A simple demonstration of this view is the family of clustering algorithms. One can deploy a clustering algorithm to partition a dataset into smaller chunks, and a human can then inspect the results to see if partitions in the resultant set correspond to one or more human-meaningful structures. An ontology can then be used to bind the results of the clusters to a broader domain theory and plug into other system modules. (46C4)
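As a rough sketch of the cluster, inspect, bind sequence described above, assuming a toy one-dimensional dataset and a plain k-means loop in place of whatever clustering algorithm is actually used. The data, the starting centers and the `ex:` concept names are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain 1-D k-means: assign each value to its nearest center, then re-average."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ended up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters


# Hypothetical data: contract durations in months.
durations = [1, 2, 3, 24, 26, 30, 36]
clusters = kmeans_1d(durations, centers=[0.0, 40.0])

# Step done by a person: inspect each cluster and decide whether it matches
# a meaningful concept. The concept names below are hypothetical.
REVIEWED = {0: "ex:ShortTermContract", 1: "ex:LongTermContract"}


def bind(clusters, reviewed):
    """Attach the reviewed ontology concept to every value in a cluster."""
    return {v: reviewed[i]
            for i, cluster in enumerate(clusters) if i in reviewed
            for v in cluster}


bound = bind(clusters, REVIEWED)
```

Clusters the reviewer does not recognize are simply left out of `REVIEWED`, so only the partitions judged meaningful are connected to the domain theory.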
Alternatively, unsupervised algorithms are often used for predictive purposes. An example of this could be a recommendation system that, based on user activity across a variety of input data streams, suggests a ranking of additional datasets (imagine Netflix's algorithms). The results of the recommendation system can be pruned, yielding higher quality results, either by deploying a set of heuristics based on domain knowledge about user behaviour in the domain or, more formally, by making the assumptions of the heuristics explicit in the form of an ontology. The benefit of the latter approach is that it is extensible and can be used to bridge the results and data generated by the ML algorithms to other elements of the system. The downside, of course, is that deploying heuristics is much faster and cheaper, and if you are on a tight timeline or sprint, you may not have time for a more "principled" approach. (46C5)
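One possible reading of "making the assumptions of the heuristics explicit" is sketched below: the pruning rule is stated as declared class relations rather than buried in code. The item names, categories, profile and exclusion rule are all hypothetical, and the ranked list stands in for the output of a real recommender.

```python
# Hypothetical ontology fragment: item categories and a small class hierarchy.
CATEGORY_OF = {"film-a": "Documentary", "film-b": "Horror", "film-c": "Animation"}
SUBCLASS_OF = {
    "Horror": "MatureContent",
    "Documentary": "GeneralContent",
    "Animation": "GeneralContent",
}

# The heuristic's assumption, stated explicitly: which classes a profile excludes.
EXCLUDED_FOR = {"child-profile": {"MatureContent"}}


def subsumed_by(category, target):
    """True if category equals target or is a (transitive) subclass of it."""
    while category is not None:
        if category == target:
            return True
        category = SUBCLASS_OF.get(category)
    return False


def prune(ranked, profile):
    """Drop recommendations whose category falls under an excluded class.

    The recommender's original ordering is preserved.
    """
    excluded = EXCLUDED_FOR.get(profile, set())
    return [
        (item, score)
        for item, score in ranked
        if not any(subsumed_by(CATEGORY_OF.get(item), cls) for cls in excluded)
    ]


# Hypothetical recommender output: (item, score), best first.
ranked = [("film-b", 0.9), ("film-a", 0.7), ("film-c", 0.4)]
kept = prune(ranked, "child-profile")
```

Because the exclusion is expressed against a class rather than a list of items, adding a new subclass of `MatureContent` extends the pruning without touching `prune` itself.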
Moreover, cutting across the different ML approaches, combining different ML algorithms is often useful. Drawing on an analogy to electrical circuits, you can arrange them in serial, parallel or sundry combinations thereof. Many of these algorithms require the definition (or extraction) of feature sets, and may require multiple layers of different types of ML algorithms. For example, in an NLP context, one layer may extract linguistic features, while another layer may use those linguistic features to extract relationships. In these cases, I've found ontologies of use in helping define feature sets for different layers of the ML algorithms, or in helping manage the deployment of multiple, possibly overlapping ML algorithms. Both of these can help overcome the variability problem in big data. Moreover, ontologies can often be used in an additional layer on top of the ML outputs to provide better quality results by implementing some heuristics (or formalized theory) to prune the results of the ML algorithms, in a way similar to the Netflix example described above. (46C6)
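The serial arrangement of layers might look something like the sketch below, where simple rule-based functions stand in for the ML components and a small lexicon stands in for the ontology-defined feature set. The lexicon entries, relation cues and sample sentence are hypothetical.

```python
# Hypothetical ontology fragment that supplies the feature set for layer 1.
LEXICON = {"act": "Statute", "regulation": "Statute", "court": "Institution"}
RELATION_CUES = {"amends", "repeals"}


def layer_features(sentence):
    """Layer 1: tag each token with an ontology concept or a relation cue."""
    tagged = []
    for token in sentence.lower().split():
        word = token.strip(".,;")
        if word in LEXICON:
            tagged.append((word, LEXICON[word]))
        elif word in RELATION_CUES:
            tagged.append((word, "REL"))
        else:
            tagged.append((word, None))
    return tagged


def layer_relations(tagged):
    """Layer 2: emit (subject, relation, object) around each relation cue."""
    triples = []
    for i, (word, tag) in enumerate(tagged):
        if tag != "REL":
            continue
        left = [w for w, t in tagged[:i] if t not in (None, "REL")]
        right = [w for w, t in tagged[i + 1:] if t not in (None, "REL")]
        if left and right:
            # Nearest concept on each side of the cue.
            triples.append((left[-1], word, right[0]))
    return triples


def pipeline(sentence):
    """Serial composition: the output of layer 1 feeds layer 2."""
    return layer_relations(layer_features(sentence))


triples = pipeline("The new regulation amends the 1998 act.")
```

In a real system each layer would be a trained model rather than a lookup, but the shape is the same: the ontology fixes what the first layer is allowed to emit, and that in turn fixes the input contract of the next layer.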
Ontologies can be useful in defining what to look for in big data sets. They can also be useful in bridging the variability problem, by providing a global view across different sources. Machine learning technologies can also be used to tackle the variability problem in big data (e.g. sentiment analysis over a vastly varying dataset) and then, coupled with an ontology, provide an interpretation for a given purpose. (46C7)
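A minimal sketch of the "global view" idea, assuming each source's field names have already been mapped by hand to shared ontology properties. The source names, field names and `ex:` properties are hypothetical.

```python
# Hypothetical mappings from source-specific fields to shared ontology properties.
FIELD_MAP = {
    "crm": {"cust_name": "ex:name", "zip": "ex:postalCode"},
    "billing": {"customer": "ex:name", "postcode": "ex:postalCode"},
}


def to_global(source, record):
    """Re-key a source record by shared properties, dropping unmapped fields."""
    mapping = FIELD_MAP[source]
    return {mapping[field]: value
            for field, value in record.items() if field in mapping}


crm_view = to_global("crm", {"cust_name": "Ada", "zip": "12345", "tier": "gold"})
billing_view = to_global("billing", {"customer": "Ada", "postcode": "12345"})
```

Once both records are expressed in the shared vocabulary they can be compared or merged directly, even though the original sources share no field names.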
DominiqueMariko: An example of NLP used for detecting coreferents in a dataset: http://nlp.stanford.edu/pubs/discourse-referent-lifespans.pdf Certainly there are ways to link such tools/processes with ontologies and social network analytics, as suggested above? (4C9V)