Algorithms and Human Phenotype Ontology
In this part of the course we will review ontologies and algorithms and we will present the Human Phenotype Ontology (HPO), which has been developed by our group since 2007.
Introduction to Bio-Ontologies (Chapman & Hall/CRC Mathematical & Computational Biology)
Peter N. Robinson and Sebastian Bauer
2011
Gameplan:
Members of the Gene Ontology and the OBO Foundry have developed a simple ontology language called OBO.
OBO stanzas can refer to universal types [Term], type definitions [Typedef], or instances [Instance].
The stanza begins with the id, which is a unique identifier comparable to an accession number.
Each term also has a name (concise human-readable description):
All other tag:value items are optional.
The is_a keyword indicates subclass links.
Example: mannosylphosphate transferase activity is a subclass of GO:0016740.
Note: Text following an ! is a comment.
Terms may have any number of is_a relations.
For all other relations, the keyword relationship (or specific relation type) is used.
The relation type name must be defined in a [Typedef] stanza.
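The tag:value structure described above can be illustrated with a small Python sketch. It is a simplified reader for a single [Term] stanza, not a full OBO parser: the helper name `parse_term_stanza` and the identifier `EX:0000001` are hypothetical placeholders, and the naive handling of `!` would break on values that contain a literal exclamation mark.

```python
def parse_term_stanza(text):
    """Parse one OBO [Term] stanza into a dict mapping tag -> list of values."""
    term = {}
    for raw in text.strip().splitlines():
        line = raw.split("!", 1)[0].strip()    # text after ! is a comment
        if not line or line.startswith("["):   # skip blanks and the stanza header
            continue
        tag, value = line.split(":", 1)        # split at the first colon only
        term.setdefault(tag.strip(), []).append(value.strip())
    return term


# EX:0000001 is a placeholder id; the is_a target is the one named in the text.
STANZA = """
[Term]
id: EX:0000001
name: mannosylphosphate transferase activity
is_a: GO:0016740 ! transferase activity
"""

term = parse_term_stanza(STANZA)
print(term["id"], term["name"], term["is_a"])
```

Values are stored as lists because a term may carry any number of is_a lines.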
OWL and the Semantic Web
OWL was developed by the W3C to enable richer integration and interoperability of data on the Web.
OWL is built upon RDF and RDFS.
RDF is an abstract model, but often serialized as RDF/XML.
<rdf:RDF xmlns="http://purl.uniprot.org/core/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="http://purl.uniprot.org/enzyme/1.14.11.2">
<rdf:type rdf:resource="http://purl.uniprot.org/core/Enzyme"/>
<name>Procollagen-proline dioxygenase</name>
<activity>L-proline-[procollagen] + 2-oxoglutarate + O(2) = ...</activity>
<cofactor>Iron</cofactor>
</rdf:Description>
</rdf:RDF>

RDF is designed to make computer-processable statements in the context of the Semantic Web.
In the URL http://www.example.org, the location is www.example.org and the protocol is http.
URLs are one of two specific classes of Uniform Resource Identifier (URI).
The other class is the Uniform Resource Name (URN), for example urn:isbn:0691141339 (ISBN for a specific book).
RDF is an abstract data model. The process of representing it as a computer file is called serialization.
<?xml version='1.0' encoding='UTF-8'?>
<rdf:RDF xmlns="http://purl.uniprot.org/core/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

<rdf:Description
rdf:about="http://purl.uniprot.org/enzyme/1.14.11.2">
<rdf:type rdf:resource="http://purl.uniprot.org/core/Enzyme"/>
<name>Procollagen-proline dioxygenase</name>
<name>Procollagen-proline 4-dioxygenase</name>
<name>Prolyl 4-hydroxylase</name>
<activity>L-proline-[procollagen] + 2-oxoglutarate + O(2)
= trans-4-hydroxy-L-proline-[procollagen] +
succinate + CO(2).
</activity>
<cofactor>Iron</cofactor>
<cofactor>L-ascorbic acid</cofactor>
</rdf:Description>
</rdf:RDF>

The URI of the enzyme, http://purl.uniprot.org/enzyme/1.14.11.2, is abbreviated as :1.14.11.2 in the following slides.

rdf:type means that the subject of the triple is an instance of the class represented by the object of the triple. Here, the rdf:type predicate states that :1.14.11.2 is an instance of Enzyme, and the first name element states that the name of :1.14.11.2 is Procollagen-proline dioxygenase.

| Subject | Predicate | Object |
|---|---|---|
| :1.14.11.2 | rdf:type | :Enzyme |
| :1.14.11.2 | :name | Procollagen-proline dioxygenase |
| :1.14.11.2 | :name | Procollagen-proline 4-dioxygenase |
| :1.14.11.2 | :name | Prolyl 4-hydroxylase |
| :1.14.11.2 | :activity | L-proline-[procollagen] + 2-oxoglutarate + O(2) = trans-4-hydroxy-L-proline-[procollagen] + succinate + CO(2). |
| :1.14.11.2 | :cofactor | Iron |
| :1.14.11.2 | :cofactor | L-ascorbic acid |

Each of the predicates indicates a different property of the enzyme 1.14.11.2. The triples with the predicate :name should be read as “:1.14.11.2 has the name Procollagen-proline dioxygenase” etc., and the triples with the predicate :cofactor should be read as “:1.14.11.2 has the cofactor Iron” etc.
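To make the mapping from RDF/XML to triples concrete, here is a minimal Python sketch that uses only the standard library. It handles just the flat pattern of this example (rdf:Description elements with rdf:about, whose children carry either an rdf:resource attribute or literal text) and is not a general RDF/XML parser. The helper name `description_triples` is hypothetical, and the listing is shortened to three properties.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Shortened version of the enzyme record (three properties only).
RDF_XML = """
<rdf:RDF xmlns="http://purl.uniprot.org/core/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://purl.uniprot.org/enzyme/1.14.11.2">
    <rdf:type rdf:resource="http://purl.uniprot.org/core/Enzyme"/>
    <name>Procollagen-proline dioxygenase</name>
    <cofactor>Iron</cofactor>
  </rdf:Description>
</rdf:RDF>
"""


def description_triples(rdf_xml):
    """Return (subject, predicate, object) tuples for flat rdf:Description blocks."""
    root = ET.fromstring(rdf_xml.strip())
    triples = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for child in desc:
            # ElementTree writes tags as "{namespace}local"; join both parts into a URI.
            predicate = child.tag[1:].replace("}", "", 1)
            resource = child.get(f"{{{RDF_NS}}}resource")
            if resource is not None:
                obj = resource                               # object is a resource
            else:
                obj = " ".join((child.text or "").split())   # object is a literal
            triples.append((subject, predicate, obj))
    return triples


triples = description_triples(RDF_XML)
for s, p, o in triples:
    print(s, p, o)
```

Each printed line corresponds to one row of the subject/predicate/object table, with full URIs in place of the abbreviated names.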
RDF models statements as nodes and edges in a graph. For any given triple, there is a node for the subject, a node for the object, and an edge for the predicate, which is directed from the subject node to the object node.
For instance, the triple :1.14.11.2 rdf:type :Enzyme can be represented as shown here:
Figure 1: Graphical Representation of an RDF Triple. The subject, :1.14.11.2, and the object, :Enzyme, are shown as nodes, while the predicate is shown as an edge leading from the subject to the object.
This entire collection of RDF statements corresponds to a labeled directed graph.
Figure 2: Graphical Representation of the entire RDF graph.
RDF provides several syntactic constructs for expressing more complex statements that go beyond what can be expressed by the simple subject/predicate/object form of the triples presented up to now.
The statement in the previous slide means that there exists some \(X\) (i.e., somebody) such that \(X\) has the first name Douglas.
Let’s say we then do a search in our favorite Web search engine and find a Wikipedia article saying that Douglas Adams wrote a book entitled “Hitchhiker’s Guide to the Galaxy.” Furthermore, let’s say we are skeptical about this statement and want to use RDF to express the fact that Wikipedia claims that Douglas Adams wrote a book entitled “Hitchhiker’s Guide to the Galaxy.”
RDF represents a reified statement as four statements using RDF properties and objects: the triple (S, P, O), reified by resource R, is represented by:
R rdf:type rdf:Statement
R rdf:subject S
R rdf:predicate P
R rdf:object O
Essentially, RDF “reifies” a triple, that is, makes it into a “thing” about which other RDF statements can be made. The triple is assigned an identifier and treated as a resource.
Assuming that the appropriate qname prefixes have been defined, we can express our statement about Wikipedia’s claim as follows:
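As a rough illustration, the four reification triples and the claim made about them can be built as plain Python tuples. All names with the `ex:` prefix (`ex:claim1`, `ex:DouglasAdams`, `ex:wrote`, `ex:HitchhikersGuide`, `ex:Wikipedia`, `ex:claims`) are hypothetical identifiers chosen for this sketch. Only the `rdf:` terms come from the RDF vocabulary.

```python
def reify(statement_id, subject, predicate, obj):
    """Return the four triples that reify (subject, predicate, obj) as statement_id."""
    return [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", subject),
        (statement_id, "rdf:predicate", predicate),
        (statement_id, "rdf:object", obj),
    ]


# The reified triple: "Douglas Adams wrote Hitchhiker's Guide" (hypothetical ex: names).
graph = reify("ex:claim1", "ex:DouglasAdams", "ex:wrote", "ex:HitchhikersGuide")

# A statement about the statement: Wikipedia claims it.
graph.append(("ex:Wikipedia", "ex:claims", "ex:claim1"))

for triple in graph:
    print(*triple)
```

Note that the graph does not contain the triple `ex:DouglasAdams ex:wrote ex:HitchhikersGuide` itself: reifying a statement describes it without asserting it, which is what a skeptical reader wants here.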
A schema is a formal definition of the syntax of a language, and a schema language is a language for expressing that definition. In SQL, the schema is the structure of the database that defines the objects in the database.
Probably the most important concept for inference in RDFS is that of the class. Classes provide an abstraction mechanism for grouping resources with similar characteristics.
It is important to distinguish between two different usages of the phrase “is a” in the English language:
As we shall see, RDFS has a different syntax for expressing each relation.
Imagine we have defined an ontology and have provided definitions at http://www.my-example.org, which we assign to the qname prefix x.
However, there is no mechanism built into RDF with which we could express the statement that mouse is a subclass of rodent.
rdfs:Class and rdfs:subClassOf can be used for this purpose.
We first declare that x:Mouse and x:Rodent are RDFS classes, and then state that x:Mouse is a subclass of x:Rodent.
One of the RDFS inference rules states that if there are asserted triples of the form:
then the following triple can be inferred:
Another inference rule states that the subclass relation is transitive.
That is, if there are asserted triples about classes \(X\), \(Y\), and \(Z\):
then the following triple can be inferred:
For example, the class mouse is a subclass of the class rodent! If the class rodent is in turn a subclass of the class mammal, then we can infer that mice are mammals.
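The two class rules discussed here (instances inherit membership in superclasses, and the subclass relation is transitive) can be sketched in Python as a fixed-point computation over a set of triples. This is a toy illustration, not an RDFS reasoner. The names `x:Mammal` and `x:mickey` are assumptions added for the example, and `infer_classes` is a hypothetical helper.

```python
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def infer_classes(asserted):
    """Apply the two subclass rules repeatedly until no new triple appears."""
    facts = set(asserted)
    while True:
        new = set()
        for (a, p1, b) in facts:
            if p1 != SUBCLASS:
                continue
            for (s, p2, o) in facts:
                # Transitivity: a subClassOf b, b subClassOf o  =>  a subClassOf o
                if p2 == SUBCLASS and s == b:
                    new.add((a, SUBCLASS, o))
                # Inheritance: s type a, a subClassOf b  =>  s type b
                if p2 == TYPE and o == a:
                    new.add((s, TYPE, b))
        if new <= facts:          # nothing new was derived: fixed point reached
            return facts
        facts |= new


ASSERTED = {
    ("x:Mouse", SUBCLASS, "x:Rodent"),
    ("x:Rodent", SUBCLASS, "x:Mammal"),   # assumed for the example
    ("x:mickey", TYPE, "x:Mouse"),        # hypothetical individual
}

facts = infer_classes(ASSERTED)
for triple in sorted(facts - ASSERTED):
    print(*triple)
```

The triples printed at the end are only the inferred ones: the transitive subclass link from mouse to mammal, and the two additional types of the individual.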
Similar inference rules apply to properties. For instance, we can define a simple hierarchy of biochemical regulation with the following RDF statements:
@prefix r: <http://www.regulation.org> .
r:regulates rdf:type rdf:Property .
r:positively_regulates rdf:type rdf:Property .
r:negatively_regulates rdf:type rdf:Property .
r:positively_regulates rdfs:subPropertyOf r:regulates .
r:negatively_regulates rdfs:subPropertyOf r:regulates .
If proteins A and B have been defined in a namespace pro, we can state that A positively regulates B by asserting the following triple:
There is an inference rule for RDFS properties stating that if the following triples are asserted:
Then the following triple can be inferred:
Example: Using our regulation hierarchy, we can infer that Protein A regulates Protein B:
Tip
Similarly to the situation with classes, the subproperty relation is transitive.
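The subproperty rule can be sketched in the same style: whenever a triple uses a property that has a declared superproperty, the same subject and object are also related by the superproperty. The sketch below assumes the identifiers `pro:A` and `pro:B` for the two proteins. `infer_superproperties` is a hypothetical helper, not part of any RDF library.

```python
SUBPROPERTY = "rdfs:subPropertyOf"


def infer_superproperties(asserted):
    """Add (s, q, o) for every (s, p, o) where p rdfs:subPropertyOf q."""
    facts = set(asserted)
    while True:
        new = {
            (s, q, o)
            for (s, p, o) in facts
            for (p2, rel, q) in facts
            if rel == SUBPROPERTY and p2 == p
        }
        if new <= facts:          # fixed point: chains of subproperties are covered
            return facts
        facts |= new


ASSERTED = {
    ("r:positively_regulates", SUBPROPERTY, "r:regulates"),
    ("r:negatively_regulates", SUBPROPERTY, "r:regulates"),
    ("pro:A", "r:positively_regulates", "pro:B"),   # assumed protein identifiers
}

facts = infer_superproperties(ASSERTED)
print(("pro:A", "r:regulates", "pro:B") in facts)
```

Because the rule is applied until nothing changes, a chain of subproperties is followed to the top without a separate transitivity rule.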
RDFS provides a syntax for specifying the range and domain of properties. Recall that the domain of a function is the set of values for which a function is defined. The range is the set of all values the function takes for the values in the domain.
P rdfs:domain D means that the subject of any triple using predicate P is inferred to be an instance of class D.
If the following two triples are asserted:
Then the following triple is inferred:
Similarly, the RDFS statement P rdfs:range R for an rdf:Property P and an rdfs:Class R means that the resource denoted by the object of a triple whose predicate is P is an instance of the class R.
If the following two triples are asserted:
Then the following triple can be inferred:
The following highly simplified example was adapted from the National Cancer Institute (NCI) thesaurus.
If we now additionally assert the following triple:
Then we can infer the following two triples:
In fact, Nilotinib is a selective Bcr-Abl tyrosine kinase inhibitor that is used in the treatment of chronic myeloid leukemia.
The figure below displays the graph corresponding to these triples. The inferred triples are shown using dashed lines.
Figure 3: Example of RDFS Inference in an RDF Graph. Because of the domain and range restrictions on the predicate :drug_affects_protein, it can be inferred that :Nilotinib is a drug and :BCR-ABL is a protein.
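The domain and range inference shown in Figure 3 can be reproduced with a short Python sketch. The property and individual names are taken from the figure caption. The class names `:Drug` and `:Protein` are assumptions based on the caption's wording, and `infer_domain_range` is a hypothetical helper.

```python
TYPE = "rdf:type"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"


def infer_domain_range(asserted):
    """Return the rdf:type triples that follow from rdfs:domain and rdfs:range."""
    inferred = set()
    for (s, p, o) in asserted:
        for (prop, rel, cls) in asserted:
            if prop != p:
                continue
            if rel == DOMAIN:
                inferred.add((s, TYPE, cls))   # subject is an instance of the domain
            elif rel == RANGE:
                inferred.add((o, TYPE, cls))   # object is an instance of the range
    return inferred


ASSERTED = {
    (":drug_affects_protein", DOMAIN, ":Drug"),      # assumed class name
    (":drug_affects_protein", RANGE, ":Protein"),    # assumed class name
    (":Nilotinib", ":drug_affects_protein", ":BCR-ABL"),
}

inferred = infer_domain_range(ASSERTED)
for triple in sorted(inferred):
    print(*triple)
```

Nothing in the asserted triples says directly that Nilotinib is a drug. The type follows only from its use as the subject of `:drug_affects_protein`.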
Yet other forms of inference in RDFS resemble to a certain extent conclusions about set intersections and set unions.
If the following triples are asserted:
Then these two triples can be inferred:
For instance, if we state that:
Then we can conclude that a particular pig is both a mammal and a farm animal. The class :Pig can be considered to occupy the intersection between those two classes.
On the other hand, asserting that:
Does not imply that a cow is a pig!
Limits of RDFS
RDFS constructs are not designed to express an exact mathematical equality between sets, such as \(A = B \cap C\). It only expresses “is a” relationships, not “is only and exactly” relationships.
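The one-directional nature of this inference can be checked with a small Python sketch that collects every class an individual belongs to. The class names `:Mammal`, `:FarmAnimal`, and `:Cow` and the individuals `:babe` and `:elsie` are assumptions made for the illustration, and `types_of` is a hypothetical helper.

```python
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

ASSERTED = [
    (":Pig", SUBCLASS, ":Mammal"),
    (":Pig", SUBCLASS, ":FarmAnimal"),
    (":Cow", SUBCLASS, ":Mammal"),
    (":Cow", SUBCLASS, ":FarmAnimal"),
    (":babe", TYPE, ":Pig"),     # hypothetical pig
    (":elsie", TYPE, ":Cow"),    # hypothetical cow
]


def types_of(individual, triples):
    """Collect asserted classes of an individual plus all their superclasses."""
    classes = {o for (s, p, o) in triples if s == individual and p == TYPE}
    frontier = list(classes)
    while frontier:
        current = frontier.pop()
        for (s, p, o) in triples:
            if s == current and p == SUBCLASS and o not in classes:
                classes.add(o)
                frontier.append(o)
    return classes


print(sorted(types_of(":babe", ASSERTED)))
print(sorted(types_of(":elsie", ASSERTED)))
```

The pig ends up in both superclasses, but the cow, which is also a mammal and a farm animal, is never classified as a pig: subclass links are followed upward only, so membership in the intersection does not imply membership in :Pig.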
It is also possible to express something similar to set union in RDFS.
If we assert the following triples:
Then, if either b rdf:type X . or b rdf:type Y . is asserted, it can be inferred that:
The acronym OWL stands for Web Ontology Language. While the most natural acronym would have been “WOL,” OWL was chosen because it is easier to pronounce.
In essence, OWL is an ontology language that builds upon RDF and RDFS but provides more powerful inference rules and built-in constructs.
An important aspect of OWL is its ability to define classes with considerably more expressivity than RDFS.
OWL classes can be defined in two primary ways:
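OWL provides six different mechanisms for defining classes: a class identifier (a URI reference), an exhaustive enumeration of the individuals in the class, a property restriction, and the intersection, union, or complement of other class descriptions.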
Aside from the first type, each of these mechanisms defines classes by placing restrictions on the things that belong to the class.
Perhaps the easiest to understand is the enumeration type, which essentially just provides a list of all individuals that belong to the class.
For instance, the following class definition expressed using RDF/XML syntax creates a class containing exactly the four Beatles:
An A-Microtubule can be defined as “A complete cylindrical microtubule that is part of a microtubule doublet in cilia.”
<owl:Class rdf:about="#A-Microtubule">
<rdfs:label>A-Microtubule</rdfs:label>
<rdfs:subClassOf rdf:resource="#Cilium Microtubule"/>
<rdfs:subClassOf>
<owl:Restriction>
<owl:onProperty rdf:resource="#is Physical Part of"/>
<owl:someValuesFrom rdf:resource="#Cytoskeleton"/>
</owl:Restriction>
</rdfs:subClassOf>
<rdfs:subClassOf>
<owl:Restriction>
<owl:onProperty rdf:resource="#is Physical Part of"/>
<owl:someValuesFrom rdf:resource="#Cilium"/>
</owl:Restriction>
</rdfs:subClassOf>
</owl:Class>
A property restriction is a special kind of class description that describes an anonymous class, namely the class of all individuals that satisfy the restriction.
In plain English: Every A-microtubule is part of a cilium and simultaneously part of a cytoskeleton.
OWL distinguishes two kinds of property restrictions: Value Constraints and Cardinality Constraints. A value restriction on some property \(P\) has the following general form:
owl:allValuesFrom states that for each instance of the class being described, every value for \(P\) must fulfill the constraint.
owl:allValuesFrom does not necessarily mean that an instance of the class must have the property \(P\). It merely states that if an instance does have \(P\), then the value must be from class \(V\).
Classes can be defined by stating what they are not. This is done with the owl:complementOf keyword.
<owl:intersectionOf rdf:parseType="Collection">
<owl:Class rdf:about="CL:cell"/>
<owl:Class>
<owl:complementOf>
<owl:Restriction>
<owl:onProperty>
<owl:TransitiveProperty rdf:about="#has_part"/>
</owl:onProperty>
<owl:someValuesFrom>
<owl:Class rdf:about="#CD8 receptor"/>
</owl:someValuesFrom>
</owl:Restriction>
</owl:complementOf>
</owl:Class>
</owl:intersectionOf>
The “Kitchen Sink” Problem
Unexpected results can be obtained from the naive use of owl:complementOf.
The owl:complementOf of Man is everything in the universe that is not a Man.
For this reason, owl:complementOf is almost always combined with other restrictions (intersections) to narrow the scope.
Example: To define Woman, you would use the intersection of Human Being and the complement of Man.
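The effect can be mimicked with ordinary Python sets over a tiny, closed universe. This is only an analogy: OWL reasons under an open-world assumption and does not enumerate a universe, and all individuals below are hypothetical.

```python
# A hypothetical, closed universe of individuals (OWL itself has no such list).
universe = {"alice", "bob", "fido", "toaster"}
human = {"alice", "bob"}
man = {"bob"}

# Naive complement: everything that is not a man, including pets and appliances.
not_man = universe - man

# Intersecting with Human narrows the complement to the intended class.
woman = human & not_man

print(sorted(not_man))   # ['alice', 'fido', 'toaster']
print(sorted(woman))     # ['alice']
```

The first result is the "kitchen sink": the bare complement sweeps in every non-man in the universe, while the intersection with Human recovers the intended meaning.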
OWL provides a union operator that is analogous to the set union operator (\(A \cup B\)).
Contrast the owl:unionOf construct to the following incorrect attempt to define Human as a subclass of Man and Woman:
Why this fails:
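Requiring Human to be a subclass of both Man and Woman places every human in the intersection of the two classes. If no individual is both a man and a woman, that intersection is the empty set (\(\emptyset\)).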
Important
The formula \(A = C \cap \neg C\) is unsatisfiable because it implies a direct contradiction.
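The contrast between the union and the double-subclass reading can again be sketched with Python sets. This is a finite analogy with hypothetical individuals. It assumes that Man and Woman share no members.

```python
man = {"bob", "carl"}        # hypothetical individuals
woman = {"alice", "dana"}

# Union reading: a human is anything that is a man or a woman.
human_as_union = man | woman

# Double-subclass reading: a human must be a man AND a woman at the same time,
# so the class can contain at most the members of the intersection.
human_as_subclass_of_both = man & woman

print(sorted(human_as_union))              # ['alice', 'bob', 'carl', 'dana']
print(sorted(human_as_subclass_of_both))   # []

# A class and its complement never overlap, whatever the universe is.
universe = human_as_union | {"toaster"}
print(man & (universe - man))              # set()
```

The last line mirrors the unsatisfiable formula: no individual can be a member of a class and of its complement.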
In OWL, rdf:type is the property that connects an individual to a class of which it is a member.
If we have a class of cities:
We can declare individual cities to be members of this class. The following two syntactical forms have identical meanings:
This uses the class name as the XML element tag.
This uses owl:Thing (the root class of everything) and explicitly states the type.
Semantic Web Layer Cake