FuncAssociate: Methods

Let $q$ be the number of genes in the query set $Q$; let $n$ be the total number of genes having attribute $A$; let $m$ be the number of genes in $Q$ having attribute $A$; and let $N$ be the total number of genes in the gene universe. Now, let the null hypothesis $H_0$ be that having attribute $A$ is independent of being in the set of genes $Q$, and let $m_0 = q \frac{n}{N}$ be the expected number of genes in $Q$ having attribute $A$ under $H_0$.

For each GO attribute $A$, FuncAssociate computes the $p$-values $p_+(A)$ and $p_-(A)$ using a one-tailed Fisher's Exact Test:

\begin{displaymath}
p_{+/-}(A) = \sum_{a} \; \frac{{q \choose a} {{N-q} \choose {n-a}}}{{N \choose n}}
= \sum_{a} \; \frac{q! \; (N-q)! \; n! \; (N-n)!}{a! \; (q-a)! \; (n-a)! \; (N-q-n+a)! \; N!}
\end{displaymath} (1)

For $p_+(A)$ the summation in (1) ranges over $m \le a \le \textrm{min}(q, n)$, and for $p_-(A)$ it ranges over $0 \le a \le m$.
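
The two one-tailed sums can be sketched in a few lines of Python. This is a minimal illustration of the hypergeometric sum with the ranges stated above, not FuncAssociate's own code; the function names `hypergeom_term` and `fisher_one_tailed` are hypothetical, and the symbols `m`, `q`, `n`, `N` follow the definitions given earlier.

```python
from math import comb


def hypergeom_term(a, q, n, N):
    """Probability that exactly `a` of the q query genes have the attribute,
    given n attributed genes in a universe of N genes."""
    return comb(q, a) * comb(N - q, n - a) / comb(N, n)


def fisher_one_tailed(m, q, n, N):
    """Return (p_plus, p_minus) for an observed overlap m.

    p_plus sums over m <= a <= min(q, n); p_minus sums over 0 <= a <= m.
    """
    upper = min(q, n)
    p_plus = sum(hypergeom_term(a, q, n, N) for a in range(m, upper + 1))
    p_minus = sum(hypergeom_term(a, q, n, N) for a in range(0, m + 1))
    return p_plus, p_minus


# Toy numbers (assumed for illustration): N = 10 genes, n = 4 with the
# attribute, q = 3 in the query, m = 2 of those with the attribute.
# Expected overlap under H_0 is m_0 = q * n / N = 1.2.
p_plus, p_minus = fisher_one_tailed(m=2, q=3, n=4, N=10)
# The terms for a = 2 and a = 3 are 63/210 and 7/210, so p_plus = 70/210.
```

Both tails include the observed value $a = m$, so $p_+(A) + p_-(A)$ exceeds 1 by the probability of that single term.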

To estimate $p_{\mathrm{adj}}$, FuncAssociate first runs 1000 simulated queries in which query gene sets of $q$ genes are chosen at random from the same gene space as the original query. (Hence, by design, $H_0$ holds for these simulated queries.) When searching for over-represented attributes, for each simulated query $i$ ($1 \le i \le 1000$) and each GO attribute $A$, FuncAssociate computes the $p$-value $p_{+,i}(A)$ using (1). For each $i$ it then determines $p_{+,i,\mathrm{min}} = \mathrm{min}(\{p_{+,i}(A) \vert A \in {\cal F}\})$, where ${\cal F}$ is the set of all GO attributes. Finally, for each GO attribute $A$, it estimates $p_{+,\mathrm{adj}}(A)$ as the fraction of the 1000 values $p_{+,i,\mathrm{min}}$ that are less than or equal to $p_+(A)$, the $p$-value computed using (1) for the user's original query set.

The analogous procedure, with $p_-$ in place of $p_+$, is used when the user is interested in under-represented attributes.

When the user is interested in both over- and under-represented attributes, for each simulated query $i$ and each GO attribute $A$, FuncAssociate computes the $p$-value $p_i(A) = \mathrm{min}(\{p_{+,i}(A), p_{-,i}(A)\})$ using (1). For each $i$ it then determines $p_{i,\mathrm{min}} = \mathrm{min}(\{p_i(A) \vert A \in {\cal F}\})$. Finally, for each GO attribute $A$, it estimates $p_{\mathrm{adj}}(A)$ as the fraction of the values $p_{i,\mathrm{min}}$ that are less than or equal to $p(A) = \mathrm{min}(\{p_+(A), p_-(A)\})$.
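
The resampling step for over-represented attributes can be sketched as follows. This is a hedged illustration under stated assumptions, not FuncAssociate's implementation: the names `p_over`, `min_p_over` and `adjusted_p_values` are hypothetical, attributes are assumed to be given as a dict mapping attribute name to a set of genes, and each simulated query is assumed to contain as many genes as the original query.

```python
import random
from math import comb


def p_over(m, q, n, N):
    """One-tailed p-value for over-representation: sum over m <= a <= min(q, n)."""
    tail = sum(comb(q, a) * comb(N - q, n - a) for a in range(m, min(q, n) + 1))
    return tail / comb(N, n)


def min_p_over(query, attributes, N):
    """Smallest over-representation p-value across all attributes for one query."""
    q = len(query)
    return min(p_over(len(query & genes), q, len(genes), N)
               for genes in attributes.values())


def adjusted_p_values(query, attributes, universe, n_sim=1000, seed=0):
    """Estimate p_adj per attribute as the fraction of simulated queries whose
    minimum p-value is <= the attribute's p-value for the real query."""
    rng = random.Random(seed)      # seeded only to make the sketch reproducible
    pool = sorted(universe)        # fixed order so sampling is deterministic
    N, q = len(pool), len(query)

    # One minimum p-value per simulated (random) query of the same size.
    null_min = [min_p_over(set(rng.sample(pool, q)), attributes, N)
                for _ in range(n_sim)]

    adjusted = {}
    for name, genes in attributes.items():
        p = p_over(len(query & genes), q, len(genes), N)
        adjusted[name] = sum(pm <= p for pm in null_min) / n_sim
    return adjusted
```

Because each attribute's $p$-value is compared against the minimum over all attributes in each simulated query, the estimate accounts for the number of attributes tested without assuming that the tests are independent. The two-sided variant would replace `p_over` with the smaller of the two one-tailed $p$-values.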

Excluded from FuncAssociate's analysis are the GO attributes "obsolete" from each branch of the ontology (GO ids 0008369, 0008370, and 0008371), the attributes "biological_process unknown" (0000004), "molecular_function unknown" (0005554), and "cellular_component unknown" (0008372), as well as any descendants of these six attributes. Also excluded from the analysis are the GO attributes "Gene_Ontology" (0003673), "molecular_function" (0003674), "cellular_component" (0005575), and "biological_process" (0008150).
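
The exclusion rule amounts to removing six attributes together with everything below them in the ontology, plus four top-level attributes on their own. A minimal sketch, assuming the ontology is available as a parent-to-children mapping (the `children` dict, the `GO:`-prefixed id format, and the function names are assumptions made for illustration):

```python
from collections import deque

# Removed together with all of their descendants.
EXCLUDED_WITH_DESCENDANTS = {
    "GO:0008369", "GO:0008370", "GO:0008371",   # "obsolete", one per branch
    "GO:0000004", "GO:0005554", "GO:0008372",   # the three "unknown" attributes
}
# Removed on their own; their descendants stay in the analysis.
EXCLUDED_ALONE = {"GO:0003673", "GO:0003674", "GO:0005575", "GO:0008150"}


def with_descendants(roots, children):
    """Breadth-first walk collecting the roots and every term below them."""
    seen = set(roots)
    queue = deque(roots)
    while queue:
        term = queue.popleft()
        for child in children.get(term, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def analysable_terms(all_terms, children):
    """Return the attributes that remain after applying both exclusion rules."""
    excluded = with_descendants(EXCLUDED_WITH_DESCENDANTS, children) | EXCLUDED_ALONE
    return set(all_terms) - excluded
```

The `seen` set keeps the walk correct when a term has several parents, as is common in GO.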

To show that the null hypotheses in FuncAssociate's analysis are not independent, we can perform the following test. The number of GO attribute pairs in which both member attributes are associated with exactly the same S. cerevisiae genes is 1,488 (out of a possible 5 million), and on average each of these matches spans 6 genes. We can estimate what these numbers would be if the null hypotheses were independent by repeating the same analysis many times on data sets obtained by assigning genes to attributes at random, subject only to the constraint that each attribute has as many genes associated with it in the simulated data as in the real data. We did this 10,000 times and analyzed the results. The maximum number of attribute pairs in any single simulation run whose members were associated with exactly the same genes was 108; the average was 67 pairs. Moreover, each of these matches spanned, on average, little more than 1 gene.
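
The independence check described above can be sketched as follows. This is an illustrative reconstruction under assumptions, not the code used to produce the reported figures: attributes are assumed to be given as a dict mapping attribute name to a set of genes, and the function names are hypothetical.

```python
import random
from collections import Counter


def count_identical_pairs(attr_genes):
    """Count attribute pairs annotated with exactly the same gene set.

    Returns (number of pairs, mean gene-set size over those pairs).
    """
    groups = Counter(frozenset(genes) for genes in attr_genes.values())
    pairs = 0
    total_size = 0
    for genes, k in groups.items():
        c = k * (k - 1) // 2          # pairs among k attributes sharing this set
        pairs += c
        total_size += c * len(genes)
    mean_size = total_size / pairs if pairs else 0.0
    return pairs, mean_size


def randomise(attr_genes, universe, rng):
    """Assign genes to attributes at random, keeping each attribute's gene count."""
    pool = sorted(universe)
    return {attr: frozenset(rng.sample(pool, len(genes)))
            for attr, genes in attr_genes.items()}


def null_pair_counts(attr_genes, universe, n_runs=100, seed=0):
    """Number of identical pairs in each of n_runs randomised data sets."""
    rng = random.Random(seed)
    return [count_identical_pairs(randomise(attr_genes, universe, rng))[0]
            for _ in range(n_runs)]
```

Comparing the count from the real annotations with `max(null_pair_counts(...))` mirrors the comparison of 1,488 observed pairs against a simulated maximum of 108.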