Materials: Gene-Pair Characteristics and Their Data Sources
Chromosomal Distance. Genes located within 1 kb (26,473 pairs), 3 kb (19,838 pairs), 5 kb (13,083 pairs), and 7 kb (6,561 pairs) of each other in the Saccharomyces cerevisiae genome were paired.
Common Regulator. Pairs of genes whose upstream regions are bound by a common transcription factor were collected from a high-throughput data set in which DNA bound by transcription factors was hybridized to an array of upstream sequences (1). We used three P value thresholds, 0.001 (181,767 pairs), 0.005 (502,684 pairs), and 0.01 (896,504 pairs), to generate three binary characteristics.
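The thresholding described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the input layout (a mapping from transcription factor to per-gene binding P values), the function name, and the toy values are all assumptions.

```python
from itertools import combinations


def common_regulator_pairs(binding_pvalues, threshold):
    """Return gene pairs bound by at least one shared factor at P < threshold.

    binding_pvalues is a hypothetical layout: {factor: {gene: p_value}}.
    """
    pairs = set()
    for gene_pvalues in binding_pvalues.values():
        # Genes considered bound by this factor at the given threshold.
        bound = sorted(g for g, p in gene_pvalues.items() if p < threshold)
        # Sorting first gives each unordered pair one canonical tuple.
        pairs.update(combinations(bound, 2))
    return pairs


# Toy values, invented for illustration only.
binding = {
    "TF1": {"geneA": 0.0005, "geneB": 0.004, "geneC": 0.0002},
    "TF2": {"geneB": 0.008, "geneD": 0.009},
}

# One binary characteristic per P value threshold.
characteristics = {
    t: common_regulator_pairs(binding, t) for t in (0.001, 0.005, 0.01)
}
```

Each threshold yields its own set of pairs, and a looser threshold can only add pairs to the set obtained at a stricter one.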
Conserved Gene Neighborhood. Pairs of genes (6,397) that are adjacent to each other on the chromosome in at least 2 of 42 genomes (2, 3) were collected from von Mering et al. (4).
Cooccurrence of Genes. Pairs of genes (997) whose orthologs have correlated appearance across 42 sequenced genomes (5, 6) were collected from von Mering et al. (4).
Gene Fusion. Gene pairs (358) each contain domains that are orthologous to separate domains of a common gene in another species. Gene fusions were detected by the presence of a gene in more than one Cluster of Orthologous Genes (COG) and were collected from von Mering et al. (4).
mRNA Coexpression. Pairs of genes (374,822) with correlated mRNA expression were collected from two data sets: (i) mRNA expression in the yeast mitotic cell cycle measured at 17 time points in synchronized yeast cultures (7), and (ii) the Rosetta compendium of mRNA expression profiles drawn from a variety of growth conditions and strain backgrounds (8). For each gene, all mRNA levels were converted to log ratios. We computed Pearson correlation coefficients between all gene pairs to measure the similarity of their expression profiles. Positive and negative correlations were mapped to a series of binary characteristics according to alternative thresholds: upper thresholds on the correlation coefficient of 0.7, 0.8, and 0.9, and lower thresholds of -0.7, -0.8, and -0.9.
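As a rough sketch of this step, the code below computes Pearson correlation coefficients between log-ratio profiles and maps each pair to one binary flag per threshold. The function names, the flag labels, the use of strict inequalities, and the toy profiles are assumptions made for illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Undefined (division by zero) for a constant profile.
    return cov / math.sqrt(var_x * var_y)


def coexpression_flags(log_ratios, upper=(0.7, 0.8, 0.9),
                       lower=(-0.7, -0.8, -0.9)):
    """Map every gene pair to one binary flag per correlation threshold."""
    result = {}
    for g1, g2 in combinations(sorted(log_ratios), 2):
        r = pearson(log_ratios[g1], log_ratios[g2])
        pair_flags = {f"r>{t}": r > t for t in upper}
        pair_flags.update({f"r<{t}": r < t for t in lower})
        result[(g1, g2)] = pair_flags
    return result


# Toy log-ratio profiles, invented for illustration only.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
flags = coexpression_flags(profiles)
```

In this toy input, geneA and geneB rise together and satisfy every upper threshold, while geneA and geneC are perfectly anticorrelated and satisfy every lower threshold.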
Mutual Clustering Coefficient. Pairs of genes (126,245) with a high mutual clustering coefficient (MCC), a measure of neighborhood cohesiveness around an edge (or pair of vertices) in a graph of physical protein-protein interactions, were collected. The MCC is the negative log of the probability (P value) of obtaining a number of common neighbors between two vertices greater than or equal to the observed number by chance, under the null hypothesis that the neighborhoods are independent and given the neighborhood sizes of the two vertices and the total number of proteins in the organism (9). We used five MCC thresholds: 0, 3, 5, 7, and 10. In addition, we computed MCC separately for networks built on (i) interactions in the Curagen database (http://portal.curagen.com/pathcalling_portal/index.htm) (8,350 pairs); (ii) all yeast two-hybrid (Y2H) interactions reported by Uetz et al. (10) (120,833 pairs) and the high-confidence Y2H interactions ("Ito core") reported by Ito et al. (11); and (iii) all high-throughput Y2H interactions (7,815 pairs) (10, 11).
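One way to read the MCC definition is as a hypergeometric tail probability, sketched below. This is an illustrative interpretation, not a reproduction of the method in ref. 9: the function names are hypothetical, the logarithm base (10) is an assumption because the text says only "negative log", and how the two vertices themselves are counted in each neighborhood is left to the caller.

```python
import math


def common_neighbor_pvalue(total, size_a, size_b, observed):
    """P(X >= observed), where X is the overlap of two independent random
    neighborhoods of sizes size_a and size_b drawn from total proteins
    (hypergeometric tail)."""
    upper = min(size_a, size_b)
    numerator = sum(
        math.comb(size_a, i) * math.comb(total - size_a, size_b - i)
        for i in range(observed, upper + 1)
    )
    return numerator / math.comb(total, size_b)


def mcc(total, size_a, size_b, observed):
    """Negative log10 of the tail probability (log base is an assumption)."""
    p = common_neighbor_pvalue(total, size_a, size_b, observed)
    if p == 0:
        # More shared neighbors than is possible under the null model.
        return math.inf
    return -math.log10(p)


# Toy numbers, invented for illustration: 6,000 proteins, neighborhoods
# of 20 and 30 proteins, 5 of them shared.
score = mcc(6000, 20, 30, 5)
```

With no shared neighbors the P value is 1 and the score is 0; the score grows as the observed overlap becomes less likely under independence.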
Physical Interaction: HMS-PCI. Pairs of proteins (26,649) were detected to physically interact by high-throughput mass spectrometric protein complex identification (HMS-PCI) (12). Two data sets were constructed based on the filtered data obtained from www.mdsp.com/yeast. The "spoke" (13) interactions (3,618 pairs) included interactions between a query protein, or bait, and proteins that complexed with it. The "matrix" (13) interactions (26,742 pairs) included all pairwise interactions represented in a purified complex obtained when using a single bait.
Physical Interaction: TAP. Protein pairs (17,314) were detected to physically interact by the tandem affinity purification and mass spectrometry (TAP) experiments of Gavin et al. (14). We collected two datasets based upon all purifications, excluding proteins that appeared in > 3.5% of the purifications. The "spoke" (13) interactions (3,225 pairs) included interactions between a query protein, or bait, and all proteins that complexed with it. The "matrix" (13) interactions (17,314 pairs) included all pairwise interactions represented in a purified complex obtained when using a single bait.
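The difference between the spoke and matrix representations used for the HMS-PCI and TAP data can be sketched as follows. The input layout (a list of bait and co-purified proteins per purification) and the function names are hypothetical, and including the bait among the members of the matrix model is an assumption.

```python
from itertools import combinations


def spoke_pairs(purifications):
    """Pair each bait with every protein that co-purified with it."""
    pairs = set()
    for bait, preys in purifications:
        for prey in preys:
            if prey != bait:
                pairs.add(tuple(sorted((bait, prey))))
    return pairs


def matrix_pairs(purifications):
    """Pair all proteins found together in a single purification.

    Assumption: the bait is treated as a member of its own complex.
    """
    pairs = set()
    for bait, preys in purifications:
        members = sorted(set(preys) | {bait})
        pairs.update(combinations(members, 2))
    return pairs


# Toy purification, invented for illustration: one bait, three preys.
purifications = [("bait1", ["prey1", "prey2", "prey3"])]
spoke = spoke_pairs(purifications)
matrix = matrix_pairs(purifications)
```

For a single purification with one bait and three preys, the spoke model yields 3 pairs and the matrix model yields 6, which is consistent with the matrix data sets above being several times larger than the spoke data sets.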
Physical Interaction: Same MIPS Complex (Annotated in the Literature). Gene pairs (64,792) were generated from complexes reported in the MIPS complex catalog (15).
Physical Interaction: YTH. Interacting gene pairs (5,126) were assembled from previous high-throughput YTH results. These consisted of two independent YTH data sets, described by Ito et al. (11) and Uetz et al. (10), respectively. The Ito and Uetz data sets were obtained from http://genome.c.kanazawa-u.ac.jp/Y2H and http://depts.washington.edu/sfields/yplm/data/index.html, respectively.
Posterior Probability of Physical Interaction. Pairs of genes (9,826) with a posterior probability of interacting in the protein physical interaction network (given high-throughput YTH data and the MCC measure of local network topology) were derived from Goldberg and Roth (9).
Protein Sequence Homology. Yeast mRNAs were collected (July 2002) from RefSeq (16), translated, and BLASTed (17) against each other. We annotated pairs as having E values below 10^-3 (40,438 pairs), 10^-6 (22,170 pairs), and 10^-12 (13,182 pairs).
Same MIPS Function. Pairs of genes (229,775) that belonged to the same functional category were collected from the MIPS database (15) in January 2003, excluding categories describing subcellular localization and nonspecific categories (i.e., categories with names including the word "other," categories containing > 200 genes, or categories at the least specific level of the hierarchy).
Same MIPS Protein Class. Pairs of genes (53,038) belonging to the same protein class (typically describing biochemical activity or structural role) were collected from the MIPS database (15) in January 2003, excluding nonspecific categories (i.e., those with names including the word "other" and those that contained > 200 genes).
Same Mutant Phenotype. Pairs of genes (74,660) annotated with the same mutant phenotype were collected from the MIPS database (15) in January 2003, excluding nonspecific phenotype categories (e.g., categories with names including the word "other," categories with > 200 genes, or those at the least specific level of the hierarchy).
Same Predicted Protein Complex. Pairs (37,861) consisted of genes whose protein products were predicted to be in the same complex by the MCODE program (18).
Same Subcellular Localization. Pairs of genes (634,039) whose protein products are annotated as belonging to the same subcellular compartment were collected from the MIPS database (15) in January 2003. For our global "same localization" category (50,009 pairs), we excluded nonspecific categories (i.e., those with names including the word "other" and those containing > 200 genes).
Synthetic Sick or Lethal (SSL). Pairs of synthetic sick or lethal genes (4,207) were assembled from interactions reported by Tong et al. (19, 20). In addition, 886 SSL interactions were parsed from the MIPS database by von Mering et al. (4).
2hop Characteristics. The 2hop characteristics describe the topology around a pair of genes in various networks of gene pair characteristics. For example, if protein A is homologous to protein C, and protein B physically interacts with protein C, then the gene pair A-B possesses the 2hop homology-physical characteristic. We used five gene-pair characteristics with the following criteria to generate 11 2hop characteristics.
(i) Homology (H) pairs were defined by BLAST E values <10^-3 (17).
(ii) mRNA coexpression (X) pairs were defined by Pearson correlation coefficients >0.7 (7, 8).
(iii) Physically interacting (P) pairs were defined as the union of the YTH "Ito core" (11) and Uetz (10) data sets and the mass spectrometry HMS-PCI spoke (12, 13) and TAP spoke (13, 14) data sets.
(iv) Common regulator (R) pairs were defined by P values <0.001 in chromatin immunoprecipitation experiments (1).
(v) Synthetic sick or lethal (S) pairs were reported by Tong et al. (19, 20) or in MIPS (15).
The 11 2hop characteristics were: H-S, P-S, S-S, S-X, H-H, H-P, H-R, H-X, P-P, P-R, and X-X. In cross-validation and experimental validation, only SSL interactions of gene pairs in the training set were used to generate annotations for 2hop characteristics.
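The 2hop construction can be illustrated with a short sketch that joins two pair networks through a shared intermediate gene. The function names and toy pairs are hypothetical, and whether pairs that are also directly connected should be excluded is not stated in the text, so no such filter is applied here.

```python
from collections import defaultdict


def neighbors(pairs):
    """Build an undirected adjacency map from a collection of gene pairs."""
    adjacency = defaultdict(set)
    for a, b in pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def two_hop_pairs(first_pairs, second_pairs):
    """Return pairs A-B such that, for some intermediate C, A-C is in the
    first network and C-B is in the second network."""
    first = neighbors(first_pairs)
    second = neighbors(second_pairs)
    result = set()
    # Only genes present in both networks can act as the intermediate C.
    for c in set(first) & set(second):
        for a in first[c]:
            for b in second[c]:
                if a != b:
                    result.add(tuple(sorted((a, b))))
    return result


# The example from the text: A is homologous to C, and B physically
# interacts with C, so A-B has the 2hop homology-physical characteristic.
homology = {("A", "C")}
physical = {("B", "C")}
two_hop_h_p = two_hop_pairs(homology, physical)
```

Passing the same network twice gives the same-type characteristics such as S-S or P-P. Because the resulting pairs are unordered, swapping the two networks returns the same set.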
1. Lee, T. I., Rinaldi, N. J., Robert, F., Odom, D. T., Bar-Joseph, Z., Gerber, G. K., Hannett, N. M., Harbison, C. T., Thompson, C. M., Simon, I., et al. (2002) Science 298, 799-804.
2. Overbeek, R., Fonstein, M., D'Souza, M., Pusch, G. D. & Maltsev, N. (1999) Proc. Natl. Acad. Sci. USA 96, 2896-2901.
3. Huynen, M., Snel, B., Lathe, W., III & Bork, P. (2000) Genome Res. 10, 1204-1210.
4. von Mering, C., Krause, R., Snel, B., Cornell, M., Oliver, S. G., Fields, S. & Bork, P. (2002) Nature 417, 399-403.
5. Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg, D. & Yeates, T. O. (1999) Proc. Natl. Acad. Sci. USA 96, 4285-4288.
6. Huynen, M. A. & Bork, P. (1998) Proc. Natl. Acad. Sci. USA 95, 5849-5856.
7. Cho, R. J., Campbell, M. J., Winzeler, E. A., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T. G., Gabrielian, A. E., Landsman, D., Lockhart, D. J., et al. (1998) Mol. Cell 2, 65-73.
8. Hughes, T. R., Marton, M. J., Jones, A. R., Roberts, C. J., Stoughton, R., Armour, C. D., Bennett, H. A., Coffey, E., Dai, H., He, Y. D., et al. (2000) Cell 102, 109-126.
9. Goldberg, D. S. & Roth, F. P. (2003) Proc. Natl. Acad. Sci. USA 100, 4372-4376.
10. Uetz, P., Giot, L., Cagney, G., Mansfield, T. A., Judson, R. S., Knight, J. R., Lockshon, D., Narayan, V., Srinivasan, M., Pochart, P., et al. (2000) Nature 403, 623-627.
11. Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M. & Sakaki, Y. (2001) Proc. Natl. Acad. Sci. USA 98, 4569-4574.
12. Ho, Y., Gruhler, A., Heilbut, A., Bader, G. D., Moore, L., Adams, S. L., Millar, A., Taylor, P., Bennett, K., Boutilier, K., et al. (2002) Nature 415, 180-183.
13. Bader, G. D. & Hogue, C. W. (2002) Nat. Biotechnol. 20, 991-997.
14. Gavin, A. C., Bosche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A., Schultz, J., Rick, J. M., Michon, A. M., Cruciat, C. M., et al. (2002) Nature 415, 141-147.
15. Mewes, H. W., Frishman, D., Guldener, U., Mannhaupt, G., Mayer, K., Mokrejs, M., Morgenstern, B., Munsterkotter, M., Rudd, S. & Weil, B. (2002) Nucleic Acids Res. 30, 31-34.
16. Pruitt, K. D. & Maglott, D. R. (2001) Nucleic Acids Res. 29, 137-140.
17. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997) Nucleic Acids Res. 25, 3389-3402.
18. Bader, G. D. & Hogue, C. W. (2003) BMC Bioinformatics 4, 2.
19. Tong, A. H., Evangelista, M., Parsons, A. B., Xu, H., Bader, G. D., Page, N., Robinson, M., Raghibizadeh, S., Hogue, C. W., Bussey, H., et al. (2001) Science 294, 2364-2368.
20. Tong, A. H., Lesage, G., Bader, G. D., Ding, H., Xu, H., Xin, X., Young, J., Berriz, G. F., Brost, R. L., Chang, M., et al. (2004) Science 303, 808-813.