Session #2: Gene Ontology and Overrepresentation Analysis
The Gene Ontology (GO) is being developed with the goal of providing a set of structured vocabularies for the annotation of genes and their products. Since the publication of the original paper in Nature Genetics in 2000, GO has become one of the most widely used and mature bio-ontologies, although it is still very much a work in progress.
The following represents the lineage of the term DNA Ligation (number of children in parentheses):
second path through the graph
The term function is sometimes used in a rather vague way.
figure credit: EMBL-EBI
GO Molecular function
Molecular functions generally correspond to the individual steps in a biological process (like the activity of a single protein within a metabolic pathway).
figure credit: EMBL-EBI
Note
A cellular component is not just a “where”—it can be a functional structure like a protein complex that acts as a unit.
figure credit: EMBL-EBI
GO:0003678
Each entry (term) in the Gene Ontology contains these standardized metadata fields to ensure consistency across different databases.
The “Annotation” vs. “Term”
Remember: A Term is a definition in the ontology (the “dictionary”). An Annotation is the link between a specific gene and that term (the “usage”).
An annotation record typically includes:
| Field | Value | Description |
|---|---|---|
| Accession | O00115 | Unique protein identifier |
| DB Name | DRN2_HUMAN | Standardized database symbol |
| GO ID | GO:0003677 | The GO term assigned (DNA binding) |
| Evidence | TAS | Traceable Author Statement |
Metadata
Records also include synonyms, PubMed IDs (for literature support), Taxon (species), and the date the annotation was last updated.
Not all annotations are created equal. Evidence codes tell us how we know what we know.
Caution
Many high-throughput analyses filter out IEA codes because they are automated and may have a higher false-positive rate compared to experimental data.
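As a minimal sketch of such a filter, the snippet below drops IEA annotations from a small list of records. The `Annotation` class, the `drop_iea` helper, and the accessions starting with `X` are hypothetical and exist only for illustration; real annotation files carry many more fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    accession: str  # protein identifier
    go_id: str      # GO term assigned to the gene product
    evidence: str   # evidence code, e.g. "TAS" or "IEA"


def drop_iea(annotations):
    """Return only the annotations whose evidence code is not IEA."""
    return [a for a in annotations if a.evidence != "IEA"]


# Illustrative records; the X-prefixed accessions are made up.
records = [
    Annotation("O00115", "GO:0003677", "TAS"),
    Annotation("X00001", "GO:0003677", "IEA"),
    Annotation("X00002", "GO:0003678", "IDA"),
]

curated = drop_iea(records)
print([a.accession for a in curated])
```

Whether to filter at all is a design choice: removing IEA annotations trades coverage for reliability, so the size of the annotated population changes as well.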
The Situation
We have performed a high-throughput experiment (e.g., RNA-seq) and identified a list of differentially expressed genes. What are the salient characteristics of the genes in this list?
To calculate the number of unique ways to arrange \(k\) “heads” and \(n-k\) “tails,” we must account for redundant rearrangements:
\[ \binom{n}{k} = \dfrac{n!}{(n-k)! k!} \]
Note
This value, \(\binom{n}{k}\), represents the number of different ways of choosing \(k\) items from a set of \(n\) items where the order of selection does not matter.
This quantity (read “\(n\) choose \(k\)”) is known as the binomial coefficient. We can now use this to calculate the probability of obtaining \(k\) successes in \(n\) trials, whereby the probability of success in one trial is \(p\), and the probability of failure is \(q=1-p\).
To obtain the probability of any one particular order of trials with \(k\) successes and \(n-k\) failures, we multiply the probabilities of the individual trials to obtain \(p^k(1-p)^{n-k}\).
In order to find the total probability of obtaining \(k\) successes, we simply multiply the probability of one particular trial order with \(k\) successes by the number of trial orders resulting in \(k\) successes, which is given by the binomial coefficient. This probability distribution is called the binomial distribution:
\[ P(k,n,p) = \binom{n}{k} p^k(1-p)^{n-k} \]
For instance, the figure shows (A) the probability of \(k=1,\ldots,10\) “heads” in ten coin tosses using a fair coin (\(p=0.5\)) and (B) the probability of \(k=1,\ldots,10\) “heads” in ten coin tosses using a biased coin (\(p=0.1\)).
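The coin-toss probabilities can be tabulated with a few lines of Python. This is a minimal sketch using only the standard library; the function name `binom_pmf` is our own choice, and the loop also includes \(k=0\) so that each row sums to one.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n = 10
for p in (0.5, 0.1):
    probs = [binom_pmf(k, n, p) for k in range(n + 1)]
    print(f"p = {p}: " + ", ".join(f"{x:.4f}" for x in probs))
```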
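GO overrepresentation analysis can be modeled as a binomial distribution in which each gene in the study set is considered to be a “trial,” and “success” occurs if the gene is annotated to a GO term of interest.
For example, for a study set of \(n=250\) genes of which \(k=30\) are annotated to the term, and a proportion \(p=0.06\) of annotated genes in the population, the probability of observing exactly 30 annotated genes is:
\[ P(30, 250, 0.06) = \binom{250}{30} 0.06^{30} (1-0.06)^{250-30} = 1.44 \times 10^{-4} \]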
In order to use this for a statistical test, we now define a null hypothesis (\(H_0\)) to be that the biological function described by GO term \(t\) is not overrepresented among the differentially expressed genes.
In GO overrepresentation analysis, we are looking for terms that annotate a higher than expected proportion of the genes in the study set.
Note that if there were only 100 genes annotated to a term \(t\) in the entire population of genes on the microarray chip, the probability of observing 101 or more genes annotated to \(t\) is zero.
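\[ p(k \geq k^{'},n,p) = \sum_{k=k^{'}}^{\min(p_t,|\textit{DE}|)}\binom{n}{k} p^k\left(1-p\right)^{n-k}. \]
Here the upper limit of the sum is the smaller of the number of genes annotated to \(t\) in the population (\(p_t\)) and the number of differentially expressed genes (\(|\textit{DE}|\)).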
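The tail sum can be sketched in Python as follows. The numbers reuse the example with 250 study genes, 30 annotated genes, and \(p=0.06\); the function names and the optional `k_max` argument are our own, and for simplicity the sum runs up to \(n\) unless a smaller upper limit is supplied.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_upper_tail(k_min, n, p, k_max=None):
    """P(K >= k_min) for K ~ Binomial(n, p), summed up to k_max (default n)."""
    if k_max is None:
        k_max = n
    return sum(binom_pmf(k, n, p) for k in range(k_min, k_max + 1))


n, p, k_obs = 250, 0.06, 30
point = binom_pmf(k_obs, n, p)        # probability of exactly 30 annotated genes
tail = binom_upper_tail(k_obs, n, p)  # probability of 30 or more annotated genes
print(f"P(k = {k_obs})  = {point:.3e}")
print(f"P(k >= {k_obs}) = {tail:.3e}")
```

The tail probability, not the point probability, is the quantity that is compared with the significance level.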
An approximation of the probability mass function for \(k=0,1,\ldots,40\), in which the discrete points are depicted as a smoothed line. The gray region indicates the values of \(k\) which encompass 95% of the probability mass under the null hypothesis.
Although this approach provides a nearly correct answer for terms with many annotations in the population and large DE sets, it is actually only an approximation.
The problem is that the binomial distribution is based on the assumption that the individual trials are independent of one another.
While this is clearly true for coin tosses, it is not for GO overrepresentation analysis, which is rather comparable to a lotto game in which labelled balls are taken from an urn and not replaced.
For instance, if there are 60 balls numbered 1 to 60, the chance of getting a 7 on the first draw is 1/60.
If a 7 was drawn the first time, the chance of getting it the second time is zero.
If another number was drawn the first time, the chance of getting a 7 the second time is 1/59, because some other ball has been removed.
In our microarray experiment, we imagine that the set of all 10,000 genes on the chip is represented by an urn with 10,000 balls.
The statistical analysis basically just counts up all the ways of getting \(k\) genes annotated to \(t\) and \(n-k\) remaining genes not annotated to \(t\), and compares this number to the number of all possible outcomes.
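Let \(m\) be the number of genes on the microarray chip, \(m_t\) the number of these genes annotated to term \(t\), \(n\) the number of differentially expressed genes, and \(k\) the number of differentially expressed genes annotated to \(t\).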
We have: \[ P(k,n,m,m_t) = \dfrac{\binom{m_t}{k}\binom{m-m_t}{n-k}}{\binom{m}{n}} \]
Numerator: \(\binom{m_t}{k}\binom{m-m_t}{n-k}\): The number of ways to choose exactly \(k\) annotated genes AND \(n-k\) non-annotated genes.
Denominator: \(\binom{m}{n}\): The total number of ways to choose any \(n\) genes from the population \(m\).
\[ P(k,n,m,m_t) = \dfrac{\binom{m_t}{k}\binom{m-m_t}{n-k}}{\binom{m}{n}} \]
This equation is known as the hypergeometric distribution.
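A minimal Python sketch of this probability is shown below. The function name `hypergeom_pmf` is our own, and the numbers (10,000 genes on the chip, 100 genes annotated to \(t\), 250 study genes) are chosen only to illustrate the formula. Because `math.comb` evaluates to zero when more items are requested than are available, impossible outcomes such as 101 annotated genes automatically get probability zero.

```python
from math import comb


def hypergeom_pmf(k, n, m, m_t):
    """P(exactly k annotated genes) when n genes are drawn without
    replacement from m genes, of which m_t are annotated."""
    if k < 0 or k > n:
        return 0.0
    return comb(m_t, k) * comb(m - m_t, n - k) / comb(m, n)


m, m_t, n = 10_000, 100, 250  # illustrative values
for k in range(6):
    print(f"P(k = {k}) = {hypergeom_pmf(k, n, m, m_t):.4f}")

total = sum(hypergeom_pmf(k, n, m, m_t) for k in range(min(m_t, n) + 1))
print(f"sum over all possible k = {total:.6f}")
print(f"P(k = 101) = {hypergeom_pmf(101, n, m, m_t)}")
```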
Fisher’s exact test: \[ P(K\geq k^{'},n,m,m_t) = \sum_{k=k^{'}}^{\min(m_t,n)}\dfrac{\binom{m_t}{k}\binom{m-m_t}{n-k}}{\binom{m}{n}} \]
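To get a feeling for how close the binomial approximation is to the exact test, the two tail probabilities can be computed side by side. The sketch below assumes a chip with 10,000 genes, of which 600 (6%) are annotated to \(t\), and a study set of 250 genes containing 30 annotated genes; the value 600 is an assumption made here so that it matches \(p=0.06\), and the function names are our own.

```python
from math import comb


def binom_upper_tail(k_min, n, p):
    """P(K >= k_min) under the binomial model (sampling with replacement)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1)
    )


def hypergeom_upper_tail(k_min, n, m, m_t):
    """P(K >= k_min) under the hypergeometric model (without replacement)."""
    denom = comb(m, n)
    return sum(
        comb(m_t, k) * comb(m - m_t, n - k) / denom
        for k in range(k_min, min(m_t, n) + 1)
    )


m, m_t, n, k_obs = 10_000, 600, 250, 30  # m_t = 600 is an assumed value
p_binom = binom_upper_tail(k_obs, n, m_t / m)
p_exact = hypergeom_upper_tail(k_obs, n, m, m_t)
print(f"binomial approximation: {p_binom:.3e}")
print(f"exact (hypergeometric): {p_exact:.3e}")
```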
We refer to the set of genes which are investigated in an experiment as the population set.
We use the term gene only for simplicity. The items annotated to the ontology terms may also be proteins or other biological entities.
The outcome of the experiment is a set of genes referred to as the study set. It is denoted by \(N\) and has the cardinality \(n\).
The standard approach to identify the most interesting terms is to perform Fisher’s exact test for each term separately. For this reason, we refer to this procedure as the term-for-term (TfT) approach.
The study set is assumed to be a random sample that is obtained by drawing \(n\) genes without replacement from the population.
\[ X_t \sim h(k|m;m_t;n) := P(X_t=k) = \frac{\binom{m_t}{k}\binom{m-m_t}{n-k}}{\binom{m}{n}} \tag{1}\]
Note
Null and Alternative Hypothesis
The number of genes in the study set that are annotated to \(t\) represents the observation. This number is denoted by \(n_t\).
The p-value is the probability of observing a test statistic that is at least as extreme as the one that was observed, given that the null hypothesis is true.
Therefore, the null hypothesis corresponding to a one-sided test is rejected if the probability of observing \(n_t\) or more genes annotated to term \(t\) in the study set by chance is less than \(\alpha\).
By convention, \(\alpha\) is usually set to 0.05.
This is given by:
\[ P(X_t\geq n_t|H_0) = \sum_{k=n_t}^{\min(m_t,n)} \frac{\displaystyle{m_t \choose k}\displaystyle{{m - m_t} \choose {n - k}}}{\displaystyle{m \choose n}}. \]
Suppose that there is a population set of \(m=18\) genes, of which \(m_t=4\) genes are annotated to the term \(t\).
The outcome of an experiment conducted on all 18 genes of the population yields a set of 5 genes. This means that the study set consists of \(n=5\) genes.
Moreover, we observe that a total of \(n_t=3\) genes from the genes of the study set are annotated to term \(t\).
We would now like to analyze whether term \(t\) is significantly overrepresented in the study set and thus can be interpreted to represent an important result of the experiment: \[ P(X_t\geq 3|H_0) = \frac{\displaystyle{4 \choose 3}\displaystyle{{14} \choose {2}}}{\displaystyle{18 \choose 5}} + \frac{\displaystyle{4 \choose 4}\displaystyle{{14} \choose {1}}}{\displaystyle{18 \choose 5}} = 0.044. \]
Since \(P(X_t\geq 3|H_0)=0.044<0.05=\alpha\), the null hypothesis is rejected and the term may be interpreted as being characteristic of the experiment.
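The worked example can be checked with exact rational arithmetic. This is a small sketch using only the Python standard library; the function name `fisher_upper_tail` is our own.

```python
from fractions import Fraction
from math import comb


def fisher_upper_tail(n_t, n, m, m_t):
    """P(X_t >= n_t) under the hypergeometric null, as an exact fraction."""
    return sum(
        Fraction(comb(m_t, k) * comb(m - m_t, n - k), comb(m, n))
        for k in range(n_t, min(m_t, n) + 1)
    )


# Population of 18 genes, 4 annotated to t; study set of 5 genes, 3 annotated.
p_value = fisher_upper_tail(n_t=3, n=5, m=18, m_t=4)
print(p_value, f"{float(p_value):.3f}")  # 3/68, about 0.044
```

The two terms of the sum are \(364/8568\) for \(k=3\) and \(14/8568\) for \(k=4\), which together reduce to \(3/68\).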