Model-based gene-set analysis
\[ \newcommand{\argmax}{\mathop{\mathrm{arg\,max}\;}\limits} \]
We have seen in the previous chapter that a major difficulty of the standard approach to GO overrepresentation analysis is that each term is analyzed in isolation.
MGSA is a completely different approach to GO analysis that seeks to find the combination of terms that best corresponds to an experimental result.
The core idea is to fit a model on all terms simultaneously, rather than testing each term in isolation.
The GenGO algorithm can be explained with the use of several categories of nodes and edges:

- an ON gene node that is connected to at least one active GO term (e.g., genes B and C in the figure),
- an ON gene node that is not connected to any active GO term (e.g., gene A),
- an OFF gene node (e.g., genes D and E),
- an edge connecting an active GO term with a node in \(I\) (e.g., the edge from GO:2 to gene D), and
- an edge connecting an inactive GO term with a node in \(I\) (e.g., the edge from GO:4 to gene D).

An active GO term does not activate all of the genes it annotates; rather, because of noise or other errors, annotated genes are observed to be OFF with a probability of \(1-a\). Genes that are not annotated to any active GO term are observed to be ON with a probability of \(b\). One can then define a scoring function that is to be maximized, whereby \(C\) is the set of active GO terms in the current iteration. \[
\mathcal{L}(C|p,q,G) = |A_g|\log a + |A_n|\log b + \; |S_g|\log (1-a)
+ |S_n| \log (1-b) - \alpha |C|
\tag{1}\]
The first four terms of Equation 1 are equivalent to the logarithm of the following expression
\[ a^{|A_g|}b^{|A_n|}(1-a)^{|S_g|}(1-b)^{|S_n|} \tag{2}\]
It can be seen that the value of Equation 2 is maximized if \(|A_g|>|S_g|\) (because \(a>1-a\)) and if \(|S_n|>|A_n|\) (because \(1-b>b\)). Thus, maximization of Equation 2 would tend to identify sets of active GO terms that more often than not annotate the ON genes (\(|A_g|>|S_g|\)). On the other hand, the inactive GO terms would more often than not annotate OFF genes (\(|S_n|>|A_n|\)).
The final term in Equation 1, \(- \alpha |C|\), reduces the score linearly in the number of active GO terms. The authors of GenGO state that a value of \(\alpha=3\) tends to produce good results. The GenGO algorithm seeks to optimize the score for the current set of active GO terms \(C\).
GenGO performs iterations where all possible one-step changes of \(C\) are considered, both changes that add a term \(t_M\) to the current configuration as well as changes that remove a term \(t_L\). In each iteration, the single-step change with the highest improvement of the score is chosen until no further improvement is possible.
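The scoring function and the greedy search can be sketched in a few lines of Python. This is an illustrative toy, not the original GenGO implementation; the annotation table, gene labels, and parameter values (\(a=0.9\), \(b=0.1\), \(\alpha=3\)) are made up for the example, and a single "toggle" move covers both adding and removing a term.

```python
import math

# Toy annotation data (hypothetical names): term -> set of annotated genes.
ANNOTATIONS = {
    "GO:1": {"A", "B"},
    "GO:2": {"B", "C", "D"},
    "GO:3": {"E"},
}
ALL_GENES = {"A", "B", "C", "D", "E"}
ON_GENES = {"A", "B", "C"}  # genes observed to be ON

def gengo_score(active, a=0.9, b=0.1, alpha=3.0):
    """Score of Equation 1 for a set `active` of active GO terms."""
    covered = set().union(*(ANNOTATIONS[t] for t in active)) if active else set()
    A_g = len(ON_GENES & covered)              # ON genes annotated to an active term
    S_g = len(covered - ON_GENES)              # OFF genes annotated to an active term
    A_n = len(ON_GENES - covered)              # ON genes not covered by any active term
    S_n = len(ALL_GENES - covered - ON_GENES)  # OFF genes not covered
    return (A_g * math.log(a) + A_n * math.log(b)
            + S_g * math.log(1 - a) + S_n * math.log(1 - b)
            - alpha * len(active))

def greedy_gengo():
    """Repeatedly apply the one-step change (toggling a single term) that
    improves the score the most, until no improvement is possible."""
    active, best = set(), gengo_score(set())
    while True:
        scored = [(gengo_score(active ^ {t}), t) for t in ANNOTATIONS]
        top_score, top_term = max(scored)
        if top_score <= best:
            return active, best
        active, best = active ^ {top_term}, top_score
```

In this toy example the search activates GO:1 (which covers the ON genes A and B) but leaves GO:2 inactive, because GO:2 also annotates the OFF gene D and each active term pays the penalty \(\alpha\).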
The idea behind GenGO appears particularly attractive because it represents a novel way of avoiding the statistical dependency problems associated with the overrepresentation methods described in the previous chapter.
Each gene is associated with a state (such as differential expression), which can be ON or OFF. The true state of any gene is hidden. In a differential expression study, the ON state would correspond to differential expression, and the OFF state would correspond to a lack of differential expression of a gene. Our model hence assumes that differential expression is the consequence of the annotation to some terms that are active.

An additional parameter \(p\) represents the prior probability of a term being in the active state. The probability \(p\) is typically low (less than 0.5), which has the effect of introducing a penalization for increasing the number of active terms. This favors results that identify a relatively low number of active terms.
Gene categories, or terms (\(T_i\)), constitute the first layer and can be either active or inactive. Terms that are active activate the hidden state (\(H_j\)) of all genes annotated to them, with all other genes remaining OFF. The observed states (\(O_j\)) of the genes are noisy observations of their true hidden states. The parameters of the model (dashed nodes) are the prior probability \(p\) of each term being active, the false positive rate \(\alpha\), and the false negative rate \(\beta\).
More formally, the model can be described using a Bayesian network with three layers that is augmented with a set of parameters. Each term node \(T_i\) is either active (1) or inactive (0), and each hidden or observed gene node is either ON (1) or OFF (0).
For didactic purposes, we will initially explain a simplified version of MGSA in which the parameters \(\alpha\), \(\beta\) and \(p\) are considered to have known, fixed values.
The state propagation of the nodes can be modeled using various local probability distributions (LPDs), denoted by \(P\). The joint probability distribution for this Bayesian network can be written as
\[ P(T,H,O) \; = \; P(T)P(H|T)P(O|H)\; = P(T)\prod_{i=1}^n P(H_i|T) P(O_i|H_i). \tag{3}\]
If \(m_{1|T}\) denotes the number of active terms and \(m_{0|T}\) the number of inactive terms in configuration \(T\), then
\[ P(T) = p^{m_{1|T}}(1-p)^{m_{0|T}}. \tag{4}\]
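Because the terms are independent a priori, the prior of Equation 4 depends only on how many terms are active. A minimal sketch (the term names are hypothetical), illustrating that each additional active term multiplies the prior by \(p/(1-p) < 1\) when \(p < 0.5\):

```python
def prior_prob(term_states, p=0.1):
    """P(T) of Equation 4: terms are independently active with probability p."""
    m1 = sum(term_states.values())   # m_{1|T}: number of active terms
    m0 = len(term_states) - m1       # m_{0|T}: number of inactive terms
    return p ** m1 * (1 - p) ** m0
```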
In the following, \(T(H_i) \subseteq T\) is used to denote the set of terms to which gene \(i\) is annotated, i.e., the parents of \(H_i\) in the Bayesian network. For the \(T \rightarrow H\) links, any node \(H_i \in H\) is ON (\(H_i=1\)) if at least one of its parents is active; otherwise it is OFF:
\[ P(H_i=1|T) = \begin{cases}1, & \text{if } \exists \; T_j \in T(H_i): T_j=1 \\ 0, & \text{otherwise.} \end{cases} \label{eqn:lpd.hidden.given.t} \tag{5}\]
Note that this transition is deterministic.
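The deterministic propagation of Equation 5 can be made concrete with a short Python sketch (the annotation data and names are made up):

```python
def hidden_states(active_terms, annotations, genes):
    """Equation 5: a gene is ON iff it is annotated to >= 1 active term."""
    return {g: int(any(g in annotations[t] for t in active_terms))
            for g in genes}

# Hypothetical toy annotations: term -> set of annotated genes.
ANNOT = {"GO:1": {"A", "B"}, "GO:2": {"B", "C"}}
```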
In other words, if at least one term to which a gene is annotated is ON, then in the MGSA model the gene is (truly) ON. For the \(H \rightarrow O\) connections, the following two Bernoulli distributions are used:
\[ P(O_i=1|H_i=0) = \alpha \label{eq:mgsa-bernoulli-alpha} \]
and
\[ P(O_i=0|H_i=1) = \beta. \label{eq:mgsa-bernoulli-beta} \]
Thus, \(\alpha\) is the probability that a gene is observed to be ON although its hidden state is OFF (a false positive), and \(\beta\) is the probability that a gene is observed to be OFF although its hidden state is ON (a false negative). Writing \(n_{oh|T}\) for the number of genes whose observed state is \(o\) and whose hidden state is \(h\) under configuration \(T\), the probability of the observations given the term states is
\[ P(O|T) = \alpha^{n_{10|T}} (1 - \alpha)^{n_{00|T}} (1-\beta)^{n_{11|T}} \beta^{n_{01|T}}. \tag{6}\]
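Equation 6 can be evaluated directly from the four counts. A minimal sketch (the function name and parameter values are assumptions; the `hidden` states would come from Equation 5):

```python
def observation_likelihood(hidden, observed, alpha=0.1, beta=0.2):
    """Equation 6: P(O|T), computed from the four (observed, hidden) counts."""
    n = {(o, h): 0 for o in (0, 1) for h in (0, 1)}
    for gene, h in hidden.items():
        n[(observed[gene], h)] += 1       # tally n_{oh|T}
    return (alpha ** n[(1, 0)] * (1 - alpha) ** n[(0, 0)]
            * (1 - beta) ** n[(1, 1)] * beta ** n[(0, 1)])
```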
Note that given \(T\), the hidden states are fully determined: the genes in the ON state are exactly those annotated to at least one active term, because any other assignment has probability zero according to Equation 5. In Bayesian statistics, maximum a posteriori (MAP) estimation is often used to obtain a point estimate at the mode of a probability distribution.
That is, if \(x\) is used to refer to the data (\(x\) can be an arbitrary expression), and \(\theta\) is used to refer to the parameters of a model, then Bayes’ law states that:
\[ P(\theta | x) = \dfrac{P(x|\theta )P(\theta )}{P(x)} \label{eq:bayes-law-for-map} \]
Since \(P(x)\) does not depend on \(\theta\), the MAP estimate is
\[ \argmax_{\theta} P(\theta | x) = \argmax_{\theta} P(x|\theta )P(\theta ) \label{eq:map} \]
In the case of MGSA, however, exact computation of the posterior would require summing over all configurations of active terms as well as values for \(\alpha,\beta\), and \(p\), which is infeasible. These considerations motivate the use of an MCMC algorithm to sample from the posterior distribution.
Consider a situation in which most of the genes annotated to terms \(T_1\) and \(T_2\) are observed in the ON state (e.g., differentially expressed). If \(T_1\) is the only active term, then the observation could be explained by risking an error of one false negative. The same can be noticed if \(T_2\) is the only active term. Rather than committing to one of several nearly equivalent explanations, MGSA reports for each term the marginal probability of its being in the active state. In general, a marginal probability is obtained by summing out the remaining variables,
\[ P(X=x')=\sum_{i} P(X=x',Y=y_i) \]
or, for continuous variables,
\[ P(X=x')=\int_Y P(X=x',Y) \,\mathrm{d}Y. \]
MCMC
One of the best known and most effective algorithms for this purpose is the Metropolis–Hastings algorithm, a Markov chain Monte Carlo (MCMC) method. The algorithm performs a random walk over the term and parameter configurations, which asymptotically provides a random sampler according to the target distribution \(P(T|O)\).
Given the current configuration of the terms denoted by \(T^t\), the algorithm proposes a neighbor state \(T^p\) in accordance to a proposal density function \(Q_T(\cdot|T^t)\). A value \(r\) is sampled uniformly from the range (0,1). Then, if
\[ r < P_{\text{accept}}(T^t,T^p) = \frac{P(T^{p}|O)Q_T(T^{t}|T^{p})}{P(T^{t}|O)Q_T(T^{p}|T^t)} \label{eqn:acceptance} \]
the proposal is accepted, i.e., \(T^{t+1} = T^p\), otherwise it is rejected, i.e., \(T^{t+1} = T^t\).
Using Bayes’ law, we have
\[ P(T^{p}|O) = \dfrac{P(O|T^{p})P(T^{p})}{P(O)} \label{eqn:cond.prob} \]
and similarly for \(T^{t}\). Substituting these expressions for \(P(T^{p}|O)\) and \(P(T^{t}|O)\) cancels out the normalization constant \(P(O)\). The acceptance probability is then:
\[ P_{\text{accept}}(T^t,T^p) = \frac{P(O|T^{p})P(T^{p})Q_T(T^{t}|T^{p})}{P(O|T^{t})P(T^{t})Q_T(T^{p}|T^t)}. \label{eqn:accept.prop.2} \tag{7}\]
Equation 7 is used iteratively to define a random walk through the space of configurations. A burn-in period consisting of a certain number of iterations is used to initialize the MCMC chain (in our implementation of the MGSA algorithm in the Ontologizer, the default is 20,000 iterations).
Let \(l\) denote the number of iterations after the burn-in period, and let \(C(T_i)\) denote the number of sampled configurations in which term \(T_i\) is active. Then
\[ P(T_i|O)\approx \frac{C(T_i)}{l}. \label{eqn:mgsa-marginal} \]
In order to finish the description of the algorithm, one needs to define the classes of operations from which a proposal is chosen, that is, we need to specify \(Q_T(T^p|T^t)\).
Denote by \(T^p \leftrightarrow_T T^t\) the binary relation that states that \(T^p\) can be constructed from \(T^t\) by either

- toggling the active/inactive state of a single term, or by
- exchanging the states of a single active term and a single inactive term.

Denote by \(N(T)\) the neighborhood size of a given configuration \(T\), that is, the number of different operations that can be applied once to \(T\) in order to obtain a new configuration. First, there are \(m\) terms in total, each of which can be toggled. In addition, there are \(m_{0|T}m_{1|T}\) possibilities to exchange the state of an active term with that of an inactive term. Thus, there are a total of \(N(T)=m+m_{0|T}m_{1|T}\) valid state transitions. We would like to sample the valid proposals with equal probability; therefore, the proposal distribution \(Q_T\) is given by
\[ Q_T(T^p|T^t)=\begin{cases} \frac{1}{N(T^t)}, & \text{if } T^p \leftrightarrow_T T^t\\ 0, & \text{otherwise.} \end{cases}, \]
which we can use to rewrite Equation 7 to:
\[ P_{\text{accept}}(T^t,T^p) = \frac{P(O|T^{p})P(T^{p})N(T^t)}{P(O|T^{t})P(T^{t})N(T^p)}. \label{eqn:accept.prop.3} \]
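Putting the pieces together, the sampler can be sketched as follows. This is an illustrative simplification, not the Ontologizer implementation: the parameters \(\alpha\), \(\beta\), and \(p\) are held fixed (as in the didactic version above), only the single-term toggle move is used so that the proposal is symmetric and \(Q_T\) cancels from the acceptance ratio, and the annotation data are made up.

```python
import math
import random

# Hypothetical toy data: the two ON genes are explained well by GO:1 alone.
ANNOT = {"GO:1": {"A", "B"}, "GO:2": {"C", "D"}, "GO:3": {"A", "D"}}
GENES = ["A", "B", "C", "D"]
OBSERVED = {"A": 1, "B": 1, "C": 0, "D": 0}

def log_posterior(active, alpha=0.1, beta=0.2, p=0.2):
    """Unnormalized log posterior log P(O|T) + log P(T) (Equations 4-6)."""
    hidden = {g: int(any(g in ANNOT[t] for t in active)) for g in GENES}
    lp = len(active) * math.log(p) + (len(ANNOT) - len(active)) * math.log(1 - p)
    lpd = {(1, 0): alpha, (0, 0): 1 - alpha, (1, 1): 1 - beta, (0, 1): beta}
    for g in GENES:
        lp += math.log(lpd[(OBSERVED[g], hidden[g])])
    return lp

def mgsa_mcmc(steps=20000, burn_in=2000, seed=1):
    """Metropolis-Hastings restricted to toggle moves: N(T) = m is then
    constant, so the proposal density cancels from the acceptance ratio."""
    rng = random.Random(seed)
    terms = list(ANNOT)
    active = frozenset()
    counts = dict.fromkeys(terms, 0)
    for step in range(steps):
        proposal = active ^ {rng.choice(terms)}        # toggle one term
        dlp = log_posterior(proposal) - log_posterior(active)
        if rng.random() < math.exp(min(0.0, dlp)):     # accept or reject
            active = proposal
        if step >= burn_in:
            for t in active:                           # count active samples
                counts[t] += 1
    kept = steps - burn_in
    return {t: counts[t] / kept for t in terms}        # estimates of P(T_i=1|O)
```

Because only the ratio of posteriors enters the acceptance probability, the normalization constant \(P(O)\) never needs to be computed.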