[ontology-summit] Track 3 - Big Data, Machine Learning and Ontologies

To:	Ontology Summit 2012 discussion <ontology-summit@xxxxxxxxxxxxxxxx>
From:	Bart Gajderowicz <bgajdero@xxxxxxxxxx>
Date:	Wed, 8 Feb 2012 00:48:48 -0500
Message-id:	<CABw=6A6V7vENVqNOcJEs+_LqYh+AR+N-1VUMX4vJENq-E4jP3w@xxxxxxxxxxxxxx>

During the 02/19 conference call, Lucier Brady presented a sample
problem which can benefit from Automatic Programming. (01)

I'd like to present a similar idea in terms of Ontology Learning, or
Ontology Extension. (02)

*Goal:*
Enable scientists to make maximum use of big data
http://ontolog.cim3.net/cgi-bin/wiki.pl?OntologySummit2012_BigDataChallenge_CommunityInput#nid32TM (03)

*Proposals:*
- Extending Ontologies with Data (04)

- Associating Data with Ontologies (05)

- Using machine learning on ontologically enhanced/extended data. (06)

*Simple Use Case:*
Instead of statistical interpretation (mean, standard deviation, mode,
etc.), we can show semantic relations in the data and analyse it at a
more abstract level.
View data at different levels of abstraction and granularity.
How does height relate to weight?
How does weight relate to weather?
How does weather relate to geographical location? (07)

- How does weight relate to geographical location? (08)

Statistical models reveal correlations, with a degree of certainty.
Semantic models may tell us about causation. (09)

*My Background*
I have been researching the incorporation of ontologies into machine
learning, and have applied this to ontology matching (at the system
level, semantic integration). My MSc thesis [2], as well as a paper
[1] in the 2009 Uncertainty Reasoning for the Semantic Web (URSW)
workshop proceedings, cover my work on these topics. As a well-known
point of reference, this work is related to the 2002 work on GLUE by
Doan et al. [3]. (010)

The following snippet, taken from a paper I'm working on, is a summary
of the latest work on Ontology Learning and utilizing Decision Trees
with OWL. (011)

If anyone is interested in this approach, I would love to hear your
opinions, use cases, and references in this thread. (012)

*Ontology Learning*
Ontology Learning is the area of research that deals with the
construction and management of ontologies in a systematic way. This
includes automatically adjusting ontologies to accommodate changes in
data patterns as well as reflect variations in the data being
represented [4]. Inductive learning [5] specifically uses machine
learning algorithms to generate ontology extensions and refinements.
Traditionally, work in this field concentrated on text-based data
[4][6], such as articles and research papers. Learning ontologies
beyond text is still an open problem. (013)

Inductively derived rules and machine learning have been successfully
applied to many different applications [7]. d’Amato et al. [5] address
how this benefits the Semantic Web, and why it’s important to merge
the worlds of ontologies and machine learning. The authors list
several approaches that use various sources from the Semantic Web,
such as folksonomies and Linked Data [8] in order to construct
ontologies, and call this process Ontology Mining. (014)

For sources that contain information in the form of text, ontology
mining is performed using NLP techniques. However, as the authors
note, semantic rules derived from these sources are not completely
clear, and the language used to represent these
rules is not as expressive as OWL. Sources with annotated observations
and some background knowledge make it possible to deduce additional
information by extending the provided observations and background
knowledge [9]. Existing annotations act as a starting point in the
form of a small set of simple rules stored in an ontology. The
background knowledge acts as an external source and is introduced
during the learning process. As new information becomes available,
incremental changes are made to the existing rules, extending and
refining them in the process. (015)

Another approach learns concept descriptions for an existing taxonomy
by clustering annotated data to create meaningful groups that
represent similar concepts [5]. These types of methods are used for
ontology evolution, pattern recognition within the ontology, scaling
large ontologies by incrementally reducing them to a manageable size,
and finally for building probabilistic ontologies when uncertainty in
the derived rules is unavoidable. (016)

For structured data, such as numeric data and database records,
Stocker et al. [4] apply ontology learning to create domain ontologies
of environmental data that originate as numerical measurements. A
taxonomy of lakes was used to classify various bodies of water. A
major problem with this is that building ontologies based on numerical
data is often biased [10]. For example, classifying a body of water as
being low or high in nitrogen is highly subjective. Stocker et al. [4]
defined two properties, richIn and poorIn, to identify whether a body
of water is rich in nitrogen or poor in nitrogen, respectively. Their
data shows that, based on the variation of nitrogen levels in Finland,
the threshold for a body of water to be classified as richIn nitrogen
is 0.88. They then compared this value to the Spanish threshold for
the richIn property of nitrogen in a body of water, which is 8.36. In
fact, the Spanish threshold for the poorIn property, at 0.78, is close
to the Finnish richIn threshold of 0.88. Clearly, defining a
qualitative property such as richIn or poorIn with quantitative values
can be incredibly misleading. (017)
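
To make the mismatch concrete, here is a minimal Python sketch that applies the thresholds quoted above to a single measurement. Only the three threshold values come from the text. The sample value, the `classify` helper, the "neither" label, and the reading of each threshold as an inclusive bound are assumptions made for illustration.

```python
# Thresholds quoted in the paragraph above (the source gives no units).
# No Finnish poorIn threshold is quoted, so none is listed here.
THRESHOLDS = {
    "Finland": {"richIn": 0.88},
    "Spain": {"richIn": 8.36, "poorIn": 0.78},
}


def classify(total_nitrogen, region):
    """Return the qualitative label a region's thresholds give to a value.

    Assumption: a value at or above the richIn threshold is richIn, and a
    value at or below the poorIn threshold is poorIn.
    """
    limits = THRESHOLDS[region]
    if total_nitrogen >= limits["richIn"]:
        return "richIn"
    if "poorIn" in limits and total_nitrogen <= limits["poorIn"]:
        return "poorIn"
    return "neither"


sample = 1.0  # hypothetical measurement, not taken from the cited data
for region in THRESHOLDS:
    print(region, classify(sample, region))
```

The same hypothetical measurement is labelled richIn under the Finnish threshold and falls between the two Spanish thresholds, which is the kind of disagreement the paragraph describes.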

To remove subjectivity and bias from the models, Stocker et al. [4]
propose creating models by deriving semantic rules such as:
poorIn(?i, Nitrogen) ← totalNitrogen(?i,?x) ∧ lessThanOrEqual(?x,?y) (018)

and representing them in RDF. There are no values present in the rule
above, only relations which define the concept poorIn for nitrogen.
These rules were derived using the k-means clustering algorithm, and a
general purpose rule engine was applied to the RDF rules for
rule-based reasoning which generates a static inference model. The
SPARQL [11] query language was then used to query the inference models
for lakes that are richIn and poorIn nitrogen. (019)
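
As a rough illustration of how a value-free rule can be grounded per region, the Python sketch below derives the bound ?y from the data with a one-dimensional k-means (k = 2) and then applies the poorIn rule. This is a sketch under assumptions, not the procedure of Stocker et al.: taking ?y as the midpoint between the two cluster centres is my own choice, the lake names and measurements are invented, and the RDF, rule engine and SPARQL steps are left out.

```python
def two_means_threshold(values, iterations=100):
    """Cluster 1-D values into a low and a high group with k-means (k=2)
    and return the midpoint between the two cluster centres."""
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("need at least two distinct values")
    for _ in range(iterations):
        midpoint = (low + high) / 2
        # In one dimension, nearest-centre assignment is a split at the midpoint.
        low_cluster = [v for v in values if v <= midpoint]
        high_cluster = [v for v in values if v > midpoint]
        new_low = sum(low_cluster) / len(low_cluster)
        new_high = sum(high_cluster) / len(high_cluster)
        if (new_low, new_high) == (low, high):
            break  # centres stopped moving
        low, high = new_low, new_high
    return (low + high) / 2


def poor_in(total_nitrogen, y):
    """poorIn(?i, Nitrogen) <- totalNitrogen(?i, ?x) AND lessThanOrEqual(?x, ?y)"""
    return total_nitrogen <= y


# Invented measurements for two hypothetical regions.
region_a = {"lake1": 0.2, "lake2": 0.3, "lake3": 1.1, "lake4": 1.3}
region_b = {"lake5": 0.5, "lake6": 0.9, "lake7": 8.0, "lake8": 9.0}

y_a = two_means_threshold(list(region_a.values()))
y_b = two_means_threshold(list(region_b.values()))

# The same reading can be poorIn relative to one region and not the other.
print(poor_in(0.9, y_a), poor_in(0.9, y_b))
```

The rule itself stays the same for both regions; only the bound learned from each region's data differs.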

*OWL as Decision Trees*
Biological and chemical information is increasingly being published
and shared using semantic technologies [12][13]. Much of the analysis
on this type of information has not caught up to the latest
representation languages such as RDF and OWL. For example, the
toxicity of chemical products is often analyzed using statistical
analysis of chemical features. These features focus on a chemical’s
structure and function. A popular method to achieve this is the
development of decision trees by mining empirical toxicology data. It
is beneficial for the representation and analysis to be done in
compatible, or better yet, the same languages. Chepelev et al. [13]
have created such decision trees represented in the OWL language
specifically for toxicity classification. The results are OWL rules
which classify toxicity features. An OWL reasoner was then used to
characterize the toxicity of various chemical products. Datasets were
compared semantically by examining logical equivalences between the
OWL decision trees. However, the underlying decision trees
differentiating between toxic and non-toxic classes were not easily
created due to significant overlap. The addition of chemical product
structure was required to disambiguate the various classification
rules. (020)
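
One way to picture a decision tree expressed as class definitions is to turn every root-to-leaf path into a conjunction of tests and to define each leaf class as the union of its paths. The Python sketch below does this with plain strings in a notation loosely modelled on OWL Manchester syntax. It is not the encoding used by Chepelev et al. [13]: the `hasFeature` property, the placeholder feature names and the tree itself are hypothetical, and no reasoner is involved.

```python
# A binary decision tree as nested tuples:
#   (feature, subtree_if_present, subtree_if_absent)
# Leaves are class labels. Feature names are placeholders, not chemistry.
TREE = ("FeatureA",
        ("FeatureB", "Toxic", "NonToxic"),
        "NonToxic")


def paths_to_expressions(tree, conditions=()):
    """Yield (label, class_expression) for every root-to-leaf path."""
    if isinstance(tree, str):
        yield tree, " and ".join(conditions) or "owl:Thing"
        return
    feature, present, absent = tree
    yield from paths_to_expressions(
        present, conditions + (f"(hasFeature some {feature})",))
    yield from paths_to_expressions(
        absent, conditions + (f"(not (hasFeature some {feature}))",))


def class_definitions(tree):
    """Group path expressions by leaf label and join them with 'or'."""
    grouped = {}
    for label, expression in paths_to_expressions(tree):
        grouped.setdefault(label, []).append(expression)
    return {label: " or ".join(f"({e})" for e in expressions)
            for label, expressions in grouped.items()}


for label, definition in class_definitions(TREE).items():
    print(f"{label} EquivalentTo: {definition}")
```

Once trees are written as class expressions like these, comparing two trees becomes a question of logical equivalence or subsumption between the expressions, which is the sort of check an OWL reasoner can perform.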

Another use of semantic technologies to represent decision trees is
the work of Holford et al. [14], where the Semantic Web Rule
Language (SWRL) [15] was used to create decision trees that classify
human pseudogenes. Specifically, this research focused on the
relationship between pseudogenes and segmental duplications (SD), or
DNA patterns that map to multiple locations on a genome. By representing
these trees in a Semantic Web language, researchers in the biomedical
field can share and extend the derived ontologies. In this work, the
Sequence Ontology (SO) [16], which provides terms and relationships
for sequence annotations, was extended with data. The data was
provided by the http://pseudogene.org website. An SWRL reasoner was
used to ensure consistency and satisfiability of the derived rules.
It was also possible to query these rules for various types of
pseudogene classifications. These queries took advantage of both the
structures in SO and the incorporated data. (021)

As Fanizzi et al. [17] demonstrate, it is desirable not only to
learn ontologies, but also to learn from them. In their work, an
existing OWL ontology is used to generate decision trees, called
terminological decision trees, which are represented as OWL-DL classes.
Like their traditional data-based decision tree counterparts,
terminological decision trees are based on frequent patterns in the
ontology’s defined OWL roles. Unlike traditional decision trees, which
use conditions such as wa:Direction = ‘North’ or wa:Temp = 30, these
rules, called concept descriptions, use the OWL roles
defined in the ontology, such as ∃hasPart.Worn and
∃hasPart.(¬Replaceable). Such concept descriptions take the form: (022)

SendBack ≡ ∃hasPart.(Worn ⊓ ¬Replaceable). (023)

An important distinction from traditional decision tree nodes is that
a concept description is made up of the actual roles defined in the
ontology. As a result, it is not as cryptic and specific as the type
of decision tree traditionally created for data classification and
prediction. This type of tree node is better at reflecting manually
created and human-readable roles [17]. (024)
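
To show how such a concept description reads as a test on individuals, here is a small Python sketch that checks SendBack ≡ ∃hasPart.(Worn ⊓ ¬Replaceable) against a toy set of assertions. The individuals and their types are invented. The sketch also assumes a closed world, so a part that is not asserted to be Replaceable is treated as not Replaceable; an OWL reasoner works under the open-world assumption and would not draw that conclusion without further axioms.

```python
# Invented assertions: which parts each individual has, and each part's types.
HAS_PART = {
    "engine1": ["belt1", "filter1"],
    "engine2": ["belt2"],
    "engine3": [],
}
TYPES = {
    "belt1": {"Worn", "Replaceable"},
    "filter1": {"Worn"},
    "belt2": {"Replaceable"},
}


def is_send_back(individual):
    """Closed-world check of SendBack = EXISTS hasPart.(Worn AND NOT Replaceable)."""
    for part in HAS_PART.get(individual, []):
        part_types = TYPES.get(part, set())
        if "Worn" in part_types and "Replaceable" not in part_types:
            return True  # one qualifying part satisfies the existential
    return False


for individual in HAS_PART:
    print(individual, is_send_back(individual))
```

Here only engine1 qualifies, because filter1 is worn and not listed as replaceable. The test uses the ontology's own vocabulary (hasPart, Worn, Replaceable) rather than attribute-value comparisons.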

* References *
[1] B. Gajderowicz and A. Sadeghian, "Ontology granulation through
inductive decision trees," in URSW, ser. CEUR Workshop Proceedings, F.
Bobillo, P. C. G. da Costa, C. d’Amato, N. Fanizzi, K. B. Laskey, K.
J. Laskey, T. Lukasiewicz, T. Martin, M. Nickles, M. Pool, and P.
Smrz, Eds., vol. 527. CEUR-WS.org, 2009, pp. 39–50. (025)

[2] B. Gajderowicz, "Using decision trees for inductively driven
semantic integration and ontology matching," Master’s thesis, Ryerson
University, 250 Victoria Street, Toronto, Ontario,
Canada, 2011. (026)

[3] A. Doan, J. Madhavan, P. Domingos, and A. Halevy, "Learning to Map
Between Ontologies on the Semantic Web," in Proc 11th International
Conference on World Wide Web (WWW'02), ACM, New York, NY, 2002. (027)

[4] M. Stocker, M. Ronkko, F. Villa, and M. Kolehmainen, "The
relevance of measurement data in environmental ontology learning," in
Environmental Software Systems. Frameworks of eEnvironment, ser. IFIP
Advances in Information and Communication Technology, J. Hřebíček,
G. Schimak, and R. Denzer, Eds. Springer Boston, 2011, vol. 359, pp.
445–453. (028)

[5] C. d’Amato, N. Fanizzi, and F. Esposito, "Inductive learning for
the semantic web: What does it buy?" Semantic Web, vol. 1, no. 1, pp.
53–59, 2010. (029)

[6] L. Zhou, "Ontology learning: state of the art and open issues,"
Information Technology and Management, vol. 8, no. 3, pp. 241–252,
Sep. 2007. (030)

[7] I. Witten and E. Frank, Data Mining: Practical machine learning
tools and techniques, 2nd ed. San Francisco: Morgan Kaufmann
Publishers, 2005. (031)

[8] C. Bizer, T. Heath, K. Idehen, and T. Berners-Lee, "Linked data on
the web (ldow2008)," in Proceeding of the 17th international
conference on World Wide Web, ser. WWW ’08. New York, NY, USA: ACM,
2008, pp. 1265–1266. (032)

[9] J. Lehmann and P. Hitzler, "Concept learning in description logics
using refinement operators," Mach. Learn., vol. 78, pp. 203–250,
January 2010. (033)

[10] M. Brodaric and M. Gahegan, "Experiments to examine the situated
nature of geoscientific concepts," Spatial Cognition and Computation:
An Interdisciplinary Journal, vol. 7, no. 1, pp. 61–95, 2007. (034)

[11] E. Prud’hommeaux and A. Seaborne. (2008, January) SPARQL query
language for RDF. [Online]. Available: http://www.w3.org/TR/rdf-sparql-query/ (035)

[12] F. Belleau, M.-A. Nolin, N. Tourigny, P. Rigault, and J.
Morissette, "Bio2rdf: towards a mashup to build bioinformatics
knowledge systems." Journal of Biomedical Informatics, vol. 41, no. 5,
pp. 706–716, 2008. (036)

[13] L. L. Chepelev, D. Klassen, and M. Dumontier, "Chemical hazard
estimation and method comparison with OWL-encoded toxicity decision
trees," in OWLED 2011 OWL: Experiences and Directions, June 2011. (037)

[14] M. E. Holford, E. Khurana, K.-H. Cheung, and M. Gerstein, "Using
semantic web rules to reason on an ontology of pseudogenes,"
Bioinformatics, vol. 26, pp. i71–i78, June 2010. [Online]. Available:
http://dx.doi.org/10.1093/bioinformatics/btq173 (038)

[15] I. Horrocks, P. F. Patel-Schneider, H. Boley, S. Tabet, B.
Grosof, and M. Dean, "SWRL: A Semantic Web Rule Language Combining OWL
and RuleML," W3C Member Submission, World Wide Web Consortium, Tech.
Rep., May 2004. (039)

[16] K. Eilbeck and S. E. Lewis, "Sequence ontology annotation guide:
Conference papers," Comp. Funct. Genomics, vol. 5, pp. 642–647,
December 2004. (040)

[17] N. Fanizzi, C. d’Amato, and F. Esposito, "Towards the induction
of terminological decision trees," in Proceedings of the 2010 ACM
Symposium on Applied Computing, ser. SAC ’10. New York, NY, USA: ACM,
2010, pp. 1423–1427. (041)

Thanks
--
Bart Gajderowicz, MSc.
Ryerson University
http://www.scs.ryerson.ca/~bgajdero (042)
