Human Phenotype Ontology: Phenomizer
Game plan
This lecture introduces the Human Phenotype Ontology (HPO) and the Phenomizer, covering information content, semantic similarity in ontologies, and ontology-based phenotype analysis.
In 1995, Resnik introduced a method for evaluating the semantic similarity between two concepts in an ontology with is_a relations.1
A pebble is a stone, every stone is a natural object, and every natural object is an entity. Knowing that a random term is a descendant of entity provides no added information, because every term is. Conversely, the probability of a specific term like pebble is much lower, meaning its Information Content is correspondingly higher.
Logic of Inheritance
Recall that:
\(x\) instance_of \(A\) AND \(A\) is_a \(B\) \(\implies\) \(x\) instance_of \(B\).
It follows that annotations propagate up the ontology: if \(c_i\) is_a \(c_j\), then \(p(c_i) \leq p(c_j)\). The Information Content (IC) of a concept is the negative logarithm of its probability:
\[\text{IC}(t) = -\log p(t)\]
As the probability of a concept increases, its information content decreases. This mirrors information theory:
Summary
The more specific a term is (lower probability), the higher its Information Content. The more general it is (higher probability), the lower its Information Content.
Mathematical Foundation
Information Content relies heavily on the properties of logarithms. These identities are essential for calculating IC across different branches of an ontology.
The following logarithmic identities are fundamental to the behavior of \(IC(t) = -\log p(t)\):
\[ \begin{aligned} \log 1 &= 0 \\ \log xy &= \log x + \log y \\ \log x^r &= r \log x \\ \log \frac{1}{x} &= - \log x \\ \log_a x &= \frac{\log x}{\log a} \end{aligned} \]
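These identities can be checked numerically; a minimal sketch (the values of \(x\), \(y\), \(r\), and \(a\) below are arbitrary positive test numbers):

```python
import math

# Numerically verify the logarithmic identities used in IC calculations.
x, y, r, a = 3.0, 5.0, 2.0, 7.0

assert math.isclose(math.log(1), 0)                              # log 1 = 0
assert math.isclose(math.log(x * y), math.log(x) + math.log(y))  # product rule
assert math.isclose(math.log(x ** r), r * math.log(x))           # power rule
assert math.isclose(math.log(1 / x), -math.log(x))               # reciprocal rule
assert math.isclose(math.log(x, a), math.log(x) / math.log(a))   # change of base
print("All identities hold")
```

The reciprocal rule is the one used most often below: it lets us write \(\log \frac{1}{p(t)}\) and \(-\log p(t)\) interchangeably.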
Relevance for IC
Shannon defined the information content of an outcome \(x\) as
\[ h(x) = \log_2 \dfrac{1}{p(x)} \tag{1}\]
Note that, by the properties of logarithms, \(\log_2 \frac{1}{p(x)} = -\log_2 p(x)\), which matches the definition of information content given above. The entropy of the random variable \(X\), written \(H(X)\), was defined by Shannon as the average information content of all of the possible outcomes of \(X\).
\[ H(X) = \sum_{i=1}^n p(x_i)\log_2\dfrac{1}{p(x_i)} \tag{2}\]
Entropy and information content are measured in units called bits (not to be confused with the definition of a bit as a zero or one in computer science).
If one outcome \(x'\) is certain, i.e., \(p(x') = 1\), the entropy is zero:

\[ H(X) = p(x')\log_2\dfrac{1}{p(x')} = 1 \times \log_2 1 = 0 \]
It can easily be shown that the entropy of \(X\) is maximized if there is maximal uncertainty about the outcome – this is the case if no outcome is more likely than the others, or stated differently, if \(X\) follows a uniform distribution.
Entropy of a random variable \(X=\left\{\mathtt{0, 1}\right\}\) representing Bernoulli process such as a coin toss for different values of \(p(X=1)\). Maximum entropy is achieved if \(p(\mathtt{0}) = p(\mathtt{1}) = 0.5\), i.e., for a uniform distribution. The minimum entropy of zero occurs if the probability of one of the two outcomes is 1.
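The Bernoulli entropy curve described above can be reproduced directly from Equation 2; a minimal sketch:

```python
import math

def binary_entropy(p):
    """Entropy H(X) in bits of a Bernoulli variable with p(X=1) = p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no information
    q = 1.0 - p
    return p * math.log2(1 / p) + q * math.log2(1 / q)

# Entropy is maximal (1 bit) for a fair coin and falls toward 0
# as one of the two outcomes becomes certain.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p = {p:.1f}  H = {binary_entropy(p):.3f} bits")
```

Note the symmetry of the curve: \(H\) is the same for \(p\) and \(1-p\), and peaks at the uniform distribution \(p = 0.5\).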
The function \(IC(x) = -\log p(x)\) is a natural measure of information content for several reasons. In particular, it is additive for independent events: if two independent outcomes have probabilities \(x\) and \(y\), the IC of observing both is the sum of their individual ICs:

\[ IC(xy) = -\log_2 (xy) = -\log_2 x - \log_2 y = IC(x) + IC(y) \]
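This additivity can be verified numerically; a small sketch with arbitrary example probabilities:

```python
import math

# Information content is additive for independent events:
# IC(x*y) = IC(x) + IC(y), for probabilities x and y (arbitrary test values).
x, y = 0.25, 0.125
lhs = -math.log2(x * y)
rhs = -math.log2(x) + -math.log2(y)
print(lhs, rhs)  # 5.0 5.0
```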
Claude Shannon’s definition of entropy is not arbitrary; it is the only function that satisfies three fundamental postulates.
Theorem: Uniqueness of Entropy
Let \(P = \{p_1, p_2, \dots, p_n\}\) be the probabilities of outcomes \(\{x_1, x_2, \dots, x_n\}\), where \(p_i \geq 0\) and \(\sum p_i = 1\). Suppose a continuous function \(H(P)\) satisfies the following three postulates (informally stated):

1. For a given \(n\), \(H\) is maximal for the uniform distribution \(p_i = 1/n\).
2. Adding an impossible outcome does not change the entropy: \(H(p_1, \dots, p_n, 0) = H(p_1, \dots, p_n)\).
3. The entropy of a compound experiment equals the entropy of its first stage plus the expected entropy of the second stage (the grouping property).
Then \(H\) must take the form: \[H(p_1, p_2, \dots, p_n) = \lambda \sum_{i=1}^n - p_i \log p_i\] where \(\lambda\) is a positive constant (usually \(\lambda=1\)).
A proof of this theorem can be found in Khinchin’s book1. It can also be shown that \(\lambda=1\) if \(\log_2\) is used.
Note
Thus, we conclude that Equation 2 provides a definition of information content that satisfies several important intuitions about the properties that a mathematical definition of information content should have.
In the setting of GO, the probability of a term \(t\) is the probability that a randomly chosen protein is annotated to that term.
Note
This reflects “surprise”: discovering a gene is annotated to the root is expected, while discovering a rare, specific annotation provides significant information.
Assume we have a set of 256 annotated genes:
| Genes Annotated to Term | Probability \(p(t)\) | IC calculation (\(-\log_2 p(t)\)) | IC Value |
|---|---|---|---|
| 256 (All) | \(256/256 = 1\) | \(-\log_2(1)\) | 0 |
| 4 | \(4/256\) | \(-\log_2(4/256) = 8 - 2\) | 6 |
| 2 | \(2/256\) | \(-\log_2(2/256) = 8 - 1\) | 7 |
| 1 (Unique) | \(1/256\) | \(-\log_2(1/256)\) | 8 |
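The table above can be reproduced with a few lines of code; a minimal sketch (the function name is illustrative, and the reciprocal rule \(-\log_2 p = \log_2 \frac{1}{p}\) is used to simplify the calculation):

```python
import math

TOTAL_GENES = 256  # size of the annotated gene set from the table above

def information_content(n_annotated, total=TOTAL_GENES):
    """IC(t) = -log2 p(t), where p(t) = n_annotated / total."""
    return math.log2(total / n_annotated)  # equals -log2(p) by the reciprocal rule

for n in (256, 4, 2, 1):
    print(f"{n:>3} genes -> p = {n}/{TOTAL_GENES}, IC = {information_content(n):.0f} bits")
```

A term annotated to all 256 genes carries 0 bits of information, while a term unique to a single gene carries the maximum of \(\log_2 256 = 8\) bits.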
Resnik defined similarity between two terms based on the Information Content of their Most Specific Ancestor (MSA).
\[\text{IC}(t) = -\log p(t)\]
The similarity is the IC of the most specific ancestor \(a\) in the set of common ancestors \(\mathcal{A}(t_1, t_2)\):
\[\mathrm{sim}(t_1, t_2) = \max_{a \in \mathcal{A}(t_1, t_2)} [-\log p(a)]\]
We refer to this as the IC of the Most Specific Ancestor: \(\text{IC}(a')\).
lion and cheetah are similar; beagle and wildcat are not \(\ldots\)

For two genes \(g_1, g_2\) annotated with term sets \(I_1, I_2\), similarity is calculated as the average, over the terms of \(I_1\), of the best-matching term in \(I_2\):

\[ \begin{eqnarray*} \mathrm{sim}(g_1,g_2)& =& \displaystyle\frac{1}{|I_1|} \sum_{t_1\in I_1} \max_{t_2 \in I_2} \mathrm{sim}(t_1,t_2) \\ &=& \dfrac{1}{2} \left[ \mathrm{IC}(B) + \mathrm{IC}(F) \right] = 0.7 \label{eqn:resnik.avg.max} \end{eqnarray*} \]
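Both levels of the calculation, term-level Resnik similarity via the most specific common ancestor and gene-level avg-max similarity, can be sketched on a toy ontology. All term names and probabilities below are invented for illustration, not real HPO/GO data:

```python
import math

# Toy ontology: ancestors[t] includes t itself, so a term counts
# among its own "common ancestors". Probabilities are made up.
ancestors = {
    "lion":    {"lion", "big_cat", "cat", "animal", "root"},
    "cheetah": {"cheetah", "big_cat", "cat", "animal", "root"},
    "wildcat": {"wildcat", "cat", "animal", "root"},
    "beagle":  {"beagle", "dog", "animal", "root"},
}
prob = {"root": 1.0, "animal": 0.8, "cat": 0.2, "dog": 0.2, "big_cat": 0.05,
        "lion": 0.01, "cheetah": 0.01, "wildcat": 0.02, "beagle": 0.01}

def ic(term):
    """IC(t) = -log2 p(t)."""
    return -math.log2(prob[term])

def resnik_sim(t1, t2):
    """IC of the most specific (most informative) common ancestor."""
    return max(ic(a) for a in ancestors[t1] & ancestors[t2])

def gene_sim(terms1, terms2):
    """Avg over terms of gene 1 of the best-matching term of gene 2 (avg-max)."""
    return sum(max(resnik_sim(s, t) for t in terms2) for s in terms1) / len(terms1)

# lion/cheetah share the specific ancestor big_cat; beagle/wildcat share only animal.
print(resnik_sim("lion", "cheetah"), resnik_sim("beagle", "wildcat"))
print(gene_sim(["lion", "wildcat"], ["cheetah", "beagle"]))
```

Note that the avg-max score is asymmetric in general: averaging over \(I_1\) versus \(I_2\) can give different values, which is why some implementations symmetrize it.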
generalized amyotrophy, generalized muscle atrophy, muscular atrophy, generalized, (etc.)

Conversely, the same word can mean different things in different contexts:

\[ \begin{equation*} \mbox{American Football} = \mbox{Football} \neq \mbox{Football} = \mbox{European Football} = \mbox{Soccer} \end{equation*} \]
A football \(\ldots\)1
A football \(\ldots\)
\[ \begin{equation*} \mbox{muscle fibrillation} = \mbox{fibrillation} \neq \mbox{fibrillation} = \mbox{ventricular fibrillation} \end{equation*} \]
fibrillation \(\ldots\)
fibrillation \(\ldots\)
When you see “fibrillation”, your computer sees: 011001100110100101100010011100100110100101101100011011000110000101110100011010010110111101101110
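The bit string above is simply the 8-bit ASCII code of each character in the word, concatenated; a minimal sketch:

```python
# Reproduce the bit string the computer "sees" for the word "fibrillation":
# concatenate the 8-bit ASCII code of each character.
word = "fibrillation"
bits = "".join(f"{ord(c):08b}" for c in word)
print(bits)

# Decoding the bits 8 at a time recovers the original word.
decoded = "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))
print(decoded)  # fibrillation
```

The point is that without an ontology, the computer has no way to relate this bit pattern to the bit patterns for "muscle fibrillation" or "ventricular fibrillation".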
Philosophy \(\rightarrow\) Artificial Intelligence \(\rightarrow\) Biomedical Research
“An ontology is a specification of a conceptualization.”
Tom Gruber, 1993
To understand what an ontology is, we can view it as a progression of increasing semantic complexity:
Catalog \(\rightarrow\) Glossary \(\rightarrow\) Thesaurus \(\rightarrow\) Subclassing \(\rightarrow\) Instances \(\rightarrow\) Properties \(\rightarrow\) Value Restrictions \(\rightarrow\) Logical Constraints \(\rightarrow\) Formal Relations
Evolution of Complexity
As we move from left to right, we move from simple lists of terms (Catalog/Glossary) to hierarchical relationships (Thesaurus/Subclassing) and finally to formal logic (Restrictions/Constraints).
Core Purpose
Precise language (and thinking), interoperability, and improved database models to reliably capture and interpret phenotype information.
A medical phenotype ontology describes the individual manifestations of diseases:
Robinson PN, Webber C (2014) Phenotype ontologies and cross-species analysis for translational research. PLoS Genet 10:e1004268.
The precise and comprehensive analysis of phenotypic abnormalities.
PN Robinson (2012) Deep phenotyping for precision medicine. Hum Mutat 33: 777–780. Special Issue of Human Mutation on Deep Phenotyping
Core Purpose
Ontologies have two major (interrelated) use cases
Phenotypic descriptions that are very evocative for humans1:
The following descriptions mean the same thing to you: generalized amyotrophy, generalized muscle atrophy, muscular atrophy, generalized, (etc)
Many publications have little2 information about the actual phenotypic features seen in patients with particular diseases and variants
Databases cannot talk to one another about phenotypes
An ontology provides a standardised “Term” for each entity in its domain that can be used for data exchange.
| Field | Description |
|---|---|
| ID | HP:0000256 |
| Name | Macrocephaly |
| Definition | Occipitofrontal (head) circumference greater than the 97th centile compared to appropriate age-matched, sex-matched normal standards. Alternatively, an apparently increased size of the cranium. |
| Synonyms | Large head, big head, large cranium, large calvaria, Increased head circumference, … |
| Xrefs | MeSH:D058627, UMLS:C0221355, SNOMED CT:19410003, ICHPT:T0028, EoM:1d53660e657259f0, … |
| Translations | … |
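A term record like the one above lends itself to a simple structured representation; a minimal sketch in which the class and field names mirror the table, not any real ontology library's API:

```python
from dataclasses import dataclass, field

# Hypothetical model of an ontology term record (fields follow the table above).
@dataclass
class OntologyTerm:
    id: str
    name: str
    definition: str
    synonyms: list = field(default_factory=list)
    xrefs: list = field(default_factory=list)

macrocephaly = OntologyTerm(
    id="HP:0000256",
    name="Macrocephaly",
    definition="Occipitofrontal (head) circumference greater than the 97th centile.",
    synonyms=["Large head", "Big head", "Increased head circumference"],
    xrefs=["MeSH:D058627", "UMLS:C0221355", "SNOMED_CT:19410003"],
)
print(macrocephaly.id, macrocephaly.name)
```

The stable ID is the key: synonyms and translations can vary freely, but any two databases that exchange `HP:0000256` are guaranteed to mean the same concept.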
Ontologies are used for myriad applications in biomedical research and translational applications. The greatest common denominator is: Compute over human knowledge!
Haendel, Chute, Robinson (2018) Classification, Ontology, and Precision Medicine.
N Engl J Med 379:1452-1462
Robinson Am J Hum Genet 2008
Köhler Am J Hum Genet 2009
Robinson Clin Genet. 2010
Schulz BMC Bioinformatics. 2011
Köhler Hum Mutat. 2012
Bauer Bioinformatics 2012
Köhler Nucleic Acids Res 2014
Köhler J Med Genet 2014
Zemojtel Sci Transl Med. 2014
Robinson PLoS Genet. 2014
Robinson Genome Res. 2014
Groza Am J Hum Genet 2015
Groza Database (Oxford). 2015
Smedley Nat Protoc 2015
Smedley Genome Med. 2015
Smedley Am J Hum Genet. 2016
Köhler Nucleic Acids Res 2017
Vasilevsky Nat Genet 2018
Köhler Nucleic Acids Res 2019
Köhler Curr Protoc Hum Genet. 2019
Carmody Orphanet J Rare Dis. 2020
Köhler Nucleic Acids Res 2021
Dhombres Am J Med Genet C 2022
Gargano Nucleic Acids Res 2024
Goal: Calculate a similarity between a set of query terms and computational disease models.
Phenomizer
Köhler et al. (2009) Am J Hum Genet 85:457-64
Exact Score Distribution
Schulz MH et al (2011) BMC Bioinformatics 12:441
BOQA
Bauer et al. (2012) Bioinformatics 28:2502-8
\[\text{sim}(\color{orange}{Q} \rightarrow \color{blue}{d}) = \frac{1}{|\color{orange}{Q}|} \sum_{s \in \color{orange}{Q}} \max_{t \in \color{blue}{d}} \text{sim}(s,t)\]
\(\color{orange}{Q}\): Query terms
\(\color{blue}{d}\): Disease terms
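The query-to-disease score is the same avg-max pattern used for gene similarity, applied to a patient's query terms against each disease's annotated terms; a minimal sketch with hypothetical term IDs and precomputed similarity values (not real HPO data):

```python
# Hypothetical precomputed term-to-term similarities sim(s, t);
# in Phenomizer these come from Resnik similarity over the HPO.
term_sim = {
    ("HP:A", "HP:X"): 3.0, ("HP:A", "HP:Y"): 1.0,
    ("HP:B", "HP:X"): 0.5, ("HP:B", "HP:Y"): 2.0,
}

def phenomizer_score(query_terms, disease_terms):
    """sim(Q -> d) = (1/|Q|) * sum over s in Q of max over t in d of sim(s, t)."""
    return sum(
        max(term_sim[(s, t)] for t in disease_terms) for s in query_terms
    ) / len(query_terms)

print(phenomizer_score(["HP:A", "HP:B"], ["HP:X", "HP:Y"]))  # (3.0 + 2.0) / 2 = 2.5
```

Ranking every disease model by this score against the patient's query terms yields a computational differential diagnosis; the follow-up papers cited above add an exact score distribution (p-values) and a Bayesian variant (BOQA).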