The Synergizer

What is The Synergizer?

The Synergizer is a service for translating between sets of biological identifiers. It can, for example, translate Ensembl Gene IDs to Entrez Gene IDs, or IPI IDs to HGNC gene symbols, and much more. Unlike some other tools for this purpose, The Synergizer is simple and easy to learn. The Synergizer works via a web interface (for users who are not programmers) or through a web service (for programmatic access).

Introduction

The Synergizer project aims to collect sets of "synonym relationships" between biological database identifiers that have been published by authoritative sources (or "authorities" in our usage), and to make them freely available to the research community through a range of interfaces, both interactive and programmatic.

The Synergizer project consists of two main components: 1) a backend database of "synonym relationships" gleaned from various authorities; and 2) a web service with a very simple API, which allows programmers to write client applications to access the data in the backend database. These applications can range from one-time programs written by researchers for their individual use to interactive applications for the wider research community.

If you represent an authoritative organization interested in having its synonym relationships served through the Synergizer system, please drop us a line.

Interfaces

The Synergizer service is currently available through two basic interfaces: an interactive web interface and a programmatic web service interface. Both interfaces use the same backend database of synonym relationships and serve the same mappings in different formats.

Interactive Web Interface

The Synergizer's interactive web interface is the quickest way to become acquainted with the Synergizer service. It is available at

http://llama.mshri.on.ca/synergizer/translate

Input Basics

The interactive web interface for the Synergizer looks like this:

To see the tool in action, click on the button that says "Insert sample inputs" and then hit "Submit" near the bottom of the page. Try also changing the selection in the '"TO" namespace' field before hitting the "Submit" button.

Input Fields

Following is a brief description of the fields in this form. Note that the choices available for some fields depend on the choices made in the fields above them. Also, in rare cases only one choice may be available, in which case it will be pre-selected for you.

  1. Authority. We use the term "authority" to refer to any authoritative source of identifier-mapping information. Authorities may differ significantly in the species and the namespaces they support. (They sometimes even differ in the names they give to these namespaces.) Examples: ensembl, ncbi.
  2. Species. Note that the range of species supported depends on the choice of authority. Examples: Homo sapiens, Mus musculus.
  3. "FROM" namespace. This is the "namespace" (naming scheme) of the database identifiers the user wishes to translate. Hence it is also called the "domain namespace", in analogy with the common mathematical terminology for functions and relations. The choices available for this field depend on the choices made for authority and species. For some authorities, the pseudo-namespace "__ANY__" may be used when there's uncertainty about the correct namespace, but this carries some risks (as discussed further below). Next to each choice in this selection box there is a string in square brackets. This is a sample identifier in that namespace, to help users figure out which namespace is the one that contains their identifiers. Examples: embl, ipi.
  4. "TO" namespace. This is the "namespace" (naming scheme) to which the user wishes to translate the input identifiers. Hence it is also called the "range namespace", again, in analogy with the common mathematical terminology for functions and relations. The choices available for this field depend on the choices made for authority, species, and sometimes for the domain namespace as well. As in the previous field, the strings in square brackets are sample identifiers in the corresponding namespaces. Examples: ipi, embl.
  5. File containing IDs to translate. Please make sure that the file has the right permissions to be uploaded by your browser. The Synergizer service expects the identifiers in this file to be separated by whitespace (that is, spaces, tabs, or line separators), and it ignores their case. The Synergizer will preserve the original ordering of the identifiers, and will not remove duplicates. Examples of IDs: U18670, stat1, 6772.
  6. IDs to translate. This field is very similar to the previous one. The only difference is that here the identifiers are entered directly into the input form, rather than being in a file. The same considerations apply: whitespace to separate the identifiers, case-insensitivity, and preservation of the original order and of duplicates, if any.
  7. Spreadsheet output. Optionally, users may request to receive the results in spreadsheet form. If so, the document sent by the Synergizer will be a simple tab-delimited plain text file, tagged with a MIME type that causes many browsers to open the file with a spreadsheet program, such as Microsoft Excel, OpenOffice, or Gnumeric.

Note that the first four fields (that is, all those that have drop-down menus) are mandatory, and that at least one of the two identifier fields ("File containing IDs to translate" and "IDs to translate") must be filled. (If both of these fields are filled, the two sets of identifiers will be concatenated, starting with those contained in the file.)
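
The tokenizing rules described for the two identifier fields can be mimicked on the client side when preparing input. The following Python sketch is only an illustration: `collect_ids` is a hypothetical helper, not part of the Synergizer, and it assumes that plain whitespace splitting matches what the server does.

```python
def collect_ids(file_text="", form_text=""):
    """Split both inputs on whitespace and concatenate them, file first.

    Order and duplicates are preserved, mirroring the behaviour described
    for the web form. Case is left untouched because the service itself
    is documented to ignore it.
    """
    return file_text.split() + form_text.split()


ids = collect_ids("U18670\tstat1\n6772\n", "stat1 STAT1")
print(ids)
```

Duplicates such as `stat1` are kept on purpose, so that the output lines up row for row with the input.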

Output

Upon hitting the "Submit" button, the translation request is sent to the Synergizer server. After a few seconds (the actual time may vary with server load and the number of IDs to translate) a two-column table showing the desired translations will appear to the right of the input fields, as shown in the illustration below. (If your window is too narrow, you may have to scroll to the right, or widen the window, to see this table.)

The results table itself is largely self-explanatory. But note the following.

Spreadsheet output

The spreadsheet version of the same result is illustrated in the accompanying figure. Each row of this output is a tab-separated, newline-terminated list. The first row consists of the names of the domain ("FROM") and range ("TO") namespaces. In all the subsequent rows, the first element is an identifier in the domain namespace, and the second one is a space-separated list of translations.

In contrast to the web browser output, the spreadsheet form does not include any warning messages. Furthermore, input identifiers that are not found in the domain namespace are flagged with the string "!?".
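
A downloaded spreadsheet file can be read back with a few lines of code. The Python sketch below follows the layout described above; `parse_spreadsheet` is a hypothetical helper, the sample text was put together by hand from identifiers used elsewhere in this document (it is not captured server output), and the sketch assumes that the "!?" flag appears in the translation column.

```python
def parse_spreadsheet(text):
    """Parse spreadsheet output into ((domain, range), rows).

    rows is a list of (identifier, translations) pairs in input order;
    translations is a list of strings, or None when the identifier was
    flagged with "!?" (not found in the domain namespace).
    """
    lines = [line for line in text.split("\n") if line]
    domain, rng = lines[0].split("\t")
    rows = []
    for line in lines[1:]:
        ident, _, translations = line.partition("\t")
        if translations.strip() == "!?":
            rows.append((ident, None))
        else:
            rows.append((ident, translations.split()))
    return (domain, rng), rows


sample = (
    "hgnc_symbol\tentrezgene\n"
    "snph\t9751\n"
    "prkdc\t5591 731751\n"
    "actn3\t\n"
    "maybe_a_typo\t!?\n"
)
header, rows = parse_spreadsheet(sample)
print(header)
for ident, translations in rows:
    print(ident, translations)
```

Rows are kept as a list rather than a dictionary so that duplicate input identifiers, which the service preserves, are not collapsed.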

Web Service (JSON) Interface

This is the most convenient interface to use the Synergizer service programmatically. It uses the popular JSON data interchange standard for the communication between server and client, and adheres to the format specified by the proposed JSON-RPC standard. The service is hosted at

http://llama.mshri.on.ca/cgi/synergizer/serv

IMPORTANT: The Synergizer has been designed for bulk translation of identifiers. A typical use of this service should require only one call to the server per execution of a client-side program, and never more than one call per ordered pair of namespaces. If your program needs to make more than one call to the server, please observe the following guidelines.

Failure to follow these guidelines may result in having your host machine banned from using this service.

Examples

For those unfamiliar with JSON or JSON-RPC, the following short Perl script illustrates a simple JSON-encoded interaction with the Synergizer service. The script uses the Perl module LWP::UserAgent (part of the libwww-perl distribution, which is not a core module and may need to be installed from CPAN) to handle the HTTP communication with the Synergizer server.


 1  use constant SERVER_URL =>
 2    'http://llama.mshri.on.ca/cgi/synergizer/serv';

 4  use LWP::UserAgent;

 6  my $req_content = '{"method":"version","params":[],"id":0}';
 7  my $ua = LWP::UserAgent->new;
 8  my $res =
 9    $ua->post(
10               SERVER_URL,
11               'Content' => $req_content,
12               'Content-Type' => 'application/json',
13               'Content-Length' => length( $req_content )
14             );

16  my $res_content;

18  if ( $res->is_success ) {
19    $res_content = $res->content;
20  }

22  if ( defined( $res_content ) ) {
23    # next line should print something like
24    # {"error":null,"id":0,"result":"0.01"}
25    print $res_content . "\n";
26  }
27  else {
28    die 'server error';
29  }

The JSON-encoded method call is defined on line 6. It consists of a JSON object with various members. The member called "method" has as its value the string "version", which is the name of one of the methods in the Synergizer API (see below). The member called "params" must always have an array as its value, containing the arguments to be passed to the remote method. The "version" method happens to take no arguments, so this array is empty in this case. The "id" member is currently not used by our server; it is included only to conform to the JSON-RPC specification, which specifies that its value must be an integer.

The content type of the request is set to application/json on line 12. The content of the response is accessed on line 19. It is a string representing a JSON object, having members "result", "error", and "id". As already mentioned, the "id" member is not currently used by our server. If the remote call was successful, the value of the "result" member will be the returned value (in this case, a string like "0.01"), and the value of the "error" member will be null. Otherwise, the value of the "result" member will be null, and the value of the "error" member will be an Error object. The description of the Error class is given in the JSON-RPC spec proposal page.

For production work, it is best to use existing tools for encoding and decoding between JSON and the language of your choice. For Perl, one can use the excellent JSON::XS module. The following listing is another Perl script; it uses a simple Synergizer client module, Synergizer::DemoClient (which in turn uses JSON::XS), to perform a translation.


use Synergizer::DemoClient ':all';

my $args =
  +{
     authority => 'ensembl',
     species   => 'Homo sapiens',
     domain    => 'hgnc_symbol',
     range     => 'entrezgene',
     ids       => [ qw( c1ql4 scn5A IL1RL1 il1rl1
                        ?test? pxn RORC MYC lnx1 ) ]
   };

my $translated = translate( $args );
my @unrecognized = cull( $translated );

print fmt( $_ ) for @$translated;
print "\nunrecognized:\n";
print "$_\n" for @unrecognized;

As a further example, Dr. Murat Tasan has written a suite that implements a Synergizer client in Java.

API

Every request to this service must be a POST, must have application/json as the value of the Content-type header, and must have as payload a string representing a JSON object with three members: "method", "params", and "id". E.g.:

{"method":"my_method","params":["some","args"],"id":123}

The member called "method" has as its value a string representing the name of one of the methods in the Synergizer API (see below). The member called "params" must always have an array as its value, containing the arguments to be passed to the remote method. The "id" member is currently not used by our server; it is included only to conform to the JSON-RPC specification, which specifies that its value must be an integer.

The output of every method of this service will be a JSON object with three members, "result", "error", and "id". (Again, this structure is in conformance with the current JSON-RPC proposed standard.) E.g.:

{"result":["a","b","c"],"error":null,"id":123}

When the remote call is successful, the value of the "result" member will be the value returned by the call, and the value of the "error" member will be null. Otherwise, the value of the "result" member will be null, and the value of the "error" member will be an Error object. The description of the Error class is given in the JSON-RPC spec proposal page, but briefly, it has at least two members, "code" and "message", and optionally a third one, "data". Of these three members, our server currently uses only "message".
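
For readers who prefer Python, the request and response envelopes just described might be handled roughly as follows. This is a minimal sketch using only the standard library; the helper names (`build_request`, `unwrap`, `call`, `SynergizerError`) are hypothetical and are not part of any published Synergizer client.

```python
import json
import urllib.request

SERVER_URL = "http://llama.mshri.on.ca/cgi/synergizer/serv"


class SynergizerError(Exception):
    """Raised when the "error" member of a response is not null."""


def build_request(method, params=(), request_id=0):
    """Return a POST request whose body is the JSON-encoded method call."""
    payload = json.dumps(
        {"method": method, "params": list(params), "id": request_id}
    ).encode("utf-8")
    return urllib.request.Request(
        SERVER_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def unwrap(response_text):
    """Return the "result" member, or raise if "error" is set."""
    response = json.loads(response_text)
    error = response.get("error")
    if error is not None:
        # The server is documented to fill in only the "message" member.
        message = error.get("message") if isinstance(error, dict) else error
        raise SynergizerError(str(message))
    return response["result"]


def call(method, *params):
    """Send one call to the server and return its result (needs network)."""
    with urllib.request.urlopen(build_request(method, params)) as reply:
        return unwrap(reply.read().decode("utf-8"))


# Building a request does not contact the server.
print(build_request("version").data.decode("utf-8"))
```

With network access, `call("version")` would issue the same request as the first Perl script above.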

The main method in the service's API is translate.

translate( args )

The sole parameter args is a JSON object whose keys correspond to the web interface's fields, namely authority, species, domain, range, and ids. The value for ids is a JSON array of strings; the values for all other keys are simple strings.

Here is an example of a JSON request payload, with extra whitespace added for clarity:

{
  "method": "translate",
  "params": [
              {
                "authority": "ensembl",
                "species": "Homo sapiens",
                "domain": "hgnc_symbol",
                "range": "entrezgene",
                "ids": [ "snph", "chac1", "actn3", "maybe_a_typo",
                         "pja1", "prkdc", "RAD21L1", "Rorc", "kcnk16" ]
              }
            ],
  "id":0
}

The actual payload may look something closer to this (shown here on a single line):

{"method":"translate","params":[{"authority":"ensembl","species":"Homo sapiens","domain":"hgnc_symbol","range":"entrezgene","ids":["snph","chac1","actn3","maybe_a_typo","pja1","prkdc","RAD21L1","Rorc","kcnk16"]}],"id":0}

Valid values for the authority, species, domain, and range parameters are exactly those displayed in the form of the Synergizer's interactive web interface. (More precisely, that form shows all the valid combinations of values for these parameters: the choice of authority determines the possible choices for the species; the authority and the species determine the possible choices for the domain namespace; and the authority, the species, and the domain namespace determine the possible choices for the range namespace.)

When the translate method executes successfully, its output is a JSON array of JSON arrays of strings. For each identifier in the input list (i.e. the ids parameter) there is one inner JSON array, whose first element is this identifier, and whose remaining elements, if any, are either its translations to the range namespace, or the single JSON value null to indicate that the identifier was not found in the domain namespace.
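
The cases just described can be told apart mechanically. The Python sketch below sorts the inner arrays into three groups; `classify` is a hypothetical helper, the sample data is a subset of the response given in this document, and the reading that an inner array with no further elements means "found, but without translations" is an inference from the description above.

```python
def classify(result):
    """Split a translate result into translated, untranslated, unknown.

    result is the decoded "result" member: a list of lists whose first
    element is the input identifier.
    """
    translated, untranslated, unknown = [], [], []
    for ident, *rest in result:
        if rest == [None]:
            unknown.append(ident)        # not found in the domain namespace
        elif rest:
            translated.append((ident, rest))
        else:
            untranslated.append(ident)   # no translations were returned
    return translated, untranslated, unknown


result = [
    ["snph", "9751"],
    ["actn3"],
    ["maybe_a_typo", None],
    ["prkdc", "5591", "731751"],
]
translated, untranslated, unknown = classify(result)
print(translated)
print(untranslated)
print(unknown)
```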

Here's the payload of the server's response for the translate request shown earlier (shown here on a single line):

{"result":[["snph","9751"],["chac1","79094"],["actn3"],["maybe_a_typo",null],["pja1","64219"],["prkdc","5591","731751"],["RAD21L1","9751"],["Rorc","6097"],["kcnk16","83795"]],"error":null,"id":0}

In addition to translate, the API supports five "meta-methods".

version()

Returns the current version of the Synergizer server as a string.

available_authorities()

Returns a JSON array of strings corresponding to the currently available authorities.

available_species( authority )

This method takes as parameter a single string, representing an authority, and returns a JSON array of strings corresponding to the currently available species for the chosen authority.

available_domains( authority, species )

Takes as parameters two strings, representing an authority and a species, and returns a JSON array of strings corresponding to the currently available domain namespaces for the chosen authority and species.

available_ranges( authority, species, domain )

Takes as parameters three strings, representing an authority, a species, and a domain namespace, and returns a JSON array of strings corresponding to the currently available range namespaces for the chosen authority, species, and domain namespace.
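
Because each meta-method narrows the choices for the next, a client can use them to check a combination of parameters before calling translate. The Python sketch below shows one way this might look. `valid_ranges` is a hypothetical helper; it takes as its first argument any function that performs one remote call and returns the decoded "result" member. The `fake_call` stand-in and its canned answers are made up so that the sketch runs without contacting the server.

```python
def valid_ranges(call, authority, species, domain):
    """Validate each choice in turn, then return the available ranges."""
    if authority not in call("available_authorities"):
        raise ValueError(f"unknown authority: {authority!r}")
    if species not in call("available_species", authority):
        raise ValueError(f"{authority!r} does not cover {species!r}")
    if domain not in call("available_domains", authority, species):
        raise ValueError(f"unknown domain namespace: {domain!r}")
    return call("available_ranges", authority, species, domain)


# Stand-in for the server, with made-up answers (illustration only).
CANNED = {
    ("available_authorities",): ["ensembl", "ncbi"],
    ("available_species", "ensembl"): ["Homo sapiens", "Mus musculus"],
    ("available_domains", "ensembl", "Homo sapiens"): ["hgnc_symbol", "ipi"],
    ("available_ranges", "ensembl", "Homo sapiens", "hgnc_symbol"): [
        "entrezgene",
        "ipi",
    ],
}


def fake_call(method, *params):
    """Look up a canned answer instead of contacting the server."""
    return CANNED[(method, *params)]


print(valid_ranges(fake_call, "ensembl", "Homo sapiens", "hgnc_symbol"))
```

Each check is a separate call to the server, so a real client would do well to cache these answers rather than query them repeatedly, in keeping with the request above to keep the number of calls low.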

Synergizer design and methods

Namespaces

What is a namespace? By this term we refer to the collection of identifiers used as keys in some database. The set of all Entrez gene ids, for example, is a namespace. So is the set of all IPI protein IDs, or Affymetrix HG-U133A probe set IDs. For our purposes, an ideal namespace contains one, and only one, identifier for each member in a set of biological entities of interest. For example, if we consider the set of all human genes, we are interested in all collections of database identifiers that assign a single identifier to each one of these genes.

In practice, however, there is no such thing as an ideal namespace. The most important reason for this is simply that we cannot yet speak unequivocally of collections like "the set of all human genes," even though the ability to think about such concepts is arguably what characterizes the young field of genomics. Indeed, even though researchers in genomics routinely refer to such notional collections, there is always a tacit understanding among them that the exact characterization of such collections remains the subject of active research, and could still be far from being definitive. Since our understanding of such collections is still very much in flux, it is not surprising that our schemes for identifying these entities in our databases, i.e. our namespaces, are imperfect. In particular, we can expect that these namespaces will have gaps and redundancies.

In sum, the notion of a namespace, as we use it here, is ultimately somewhat fuzzy, and inherits many of the caveats that we must still attach to concepts such as "the set of all human genes."

But what makes namespaces truly problematic is not this conceptual fuzziness, but rather their great proliferation. The move from thinking about individual genes or proteins to entire genomes or proteomes has been accompanied by an ever-growing number of publicly accessible databases of biological information, and it is common for several of these databases to independently assign identifiers to the same underlying collection of biological entities. This gives rise to the problem of mapping identifiers across these various naming schemes. This is the problem that the Synergizer service aims to address.

Pegs

The Synergizer scheme for performing this mapping is based on the metaphor of a "peg" (a place to hang other objects). A peg is simply an internal Synergizer identifier that is assigned to all the external database identifiers that refer to the same biological entity. The process of mapping one identifier from namespace X to the corresponding identifier (or identifiers) in namespace Y amounts to first determining its "peg id", and then retrieving all the identifiers in namespace Y that have that same peg id.

Generally, for each combination of authority, species, and namespace, the backend database has a table that associates each identifier in the namespace with at least one peg id. The procedure for translation from namespace X to namespace Y consists of finding all the input identifiers in the table for namespace X, and then finding those identifiers in the table for namespace Y that have the same peg ids.
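
The translation procedure can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Synergizer's actual implementation: the per-namespace tables are plain dictionaries mapping each identifier to a list of peg ids, and all identifiers and peg ids below are invented.

```python
from collections import defaultdict


def build_index(id_to_pegs):
    """Invert an identifier -> peg ids table into peg id -> identifiers."""
    index = defaultdict(list)
    for ident, pegs in id_to_pegs.items():
        for peg in pegs:
            index[peg].append(ident)
    return index


def translate(ids, table_x, table_y):
    """Translate ids from namespace X to Y through their shared peg ids."""
    pegs_to_y = build_index(table_y)
    out = []
    for ident in ids:
        pegs = table_x.get(ident)
        if pegs is None:
            out.append([ident, None])    # identifier not in namespace X
            continue
        hits = []
        for peg in pegs:
            for y in pegs_to_y.get(peg, []):
                if y not in hits:        # report each translation once
                    hits.append(y)
        out.append([ident, *hits])
    return out


table_x = {"x1": [1], "x2": [2, 3]}
table_y = {"y1": [1], "y2": [3], "y3": [4]}
print(translate(["x1", "x2", "zz"], table_x, table_y))
```

The output format mimics that of the translate method of the web service, with null (here `None`) marking identifiers absent from the domain namespace.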

Methods

The data for the Synergizer database comes from established authoritative sources, or "authorities" for short, that publish mappings between namespaces. Currently these are Ensembl and NCBI, but the system has been designed to accommodate an unlimited number of such authorities.

In all cases we use a collection of scripts and custom Perl modules to download the data from the authority and load it into the database. However, the specifics of these data loading pipelines vary radically from one authority to another, due to the large differences in the ways these authorities organize and publish their information. (In fact, it could be argued that the most fundamental utility of the Synergizer service is to provide a uniform and simple interface to an otherwise exceedingly heterogeneous, unwieldy, and very large body of information.) In all cases, however, we keep our processing of the downloaded information to a minimum.

We obtain the data from Ensembl using its XML-based BioMart programmatic interface. We first download the table of all Ensembl gene ids, and assign a unique internal peg id to each of them. Then we download one two-column table per additional namespace of interest (X): the first column is always the Ensembl gene id, and the second column is the corresponding identifier in namespace X. We use the mappings encoded in these downloaded tables, together with the prior assignment of peg ids to Ensembl gene ids, to assign peg ids to the identifiers in the second column of these tables.
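
The peg assignment described for Ensembl might be modelled as follows. This Python sketch is an illustration only and does not reproduce the actual loading scripts (which are written in Perl): the function names are hypothetical, numbering pegs from 1 is an arbitrary choice, and the example reuses the gene and HGNC symbols quoted in the section on translation errors.

```python
from itertools import count


def assign_pegs(gene_ids):
    """Give each Ensembl gene id its own internal peg id."""
    next_peg = count(1)
    return {gene: next(next_peg) for gene in gene_ids}


def pegs_for_namespace(pairs, gene_to_peg):
    """Build the identifier -> peg ids table for one extra namespace.

    pairs is the downloaded two-column table, given as
    (Ensembl gene id, identifier in namespace X) tuples.
    """
    table = {}
    for gene, ident in pairs:
        if not ident:                    # gene has no identifier in X
            continue
        pegs = table.setdefault(ident, [])
        peg = gene_to_peg[gene]
        if peg not in pegs:
            pegs.append(peg)
    return table


gene_to_peg = assign_pegs(["ENSG00000101298"])
symbols = [("ENSG00000101298", "SNPH"), ("ENSG00000101298", "RAD21L1")]
print(pegs_for_namespace(symbols, gene_to_peg))
```

Both symbols end up on the same peg because they are attached to the same Ensembl gene id.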

The information from NCBI comes primarily from their file gene2accession, which they publish regularly via their FTP site (ftp://ftp.ncbi.nih.gov/gene/DATA). For most of the NCBI namespaces covered we use a procedure analogous to the one described for Ensembl, but in this case we use the Entrez gene id as the basis for assigning peg ids, and we derive the desired mappings directly from the records in gene2accession. (The mapping between versioned accession numbers and gi numbers is stored and served directly, without the use of a peg id.)

Translation errors

The most general solution to the translation problem would require, for each authority and each species, on the order of N^2 two-column tables, where N is the number of namespaces: one table for each unordered pair of distinct namespaces, or N(N-1)/2 tables in all.

In contrast, the peg-based organization requires only one table per namespace, so its space requirement grows linearly with the number of namespaces.
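
To make the difference concrete, the short Python sketch below compares the two table counts for a few values of N, counting one table per unordered pair of distinct namespaces, N(N-1)/2, in the pairwise case. The values of N are arbitrary and do not reflect the number of namespaces actually served.

```python
def pairwise_tables(n):
    """Tables needed with one table per unordered pair of namespaces."""
    return n * (n - 1) // 2


def peg_tables(n):
    """Tables needed with one peg table per namespace."""
    return n


for n in (5, 20, 50):
    print(n, pairwise_tables(n), peg_tables(n))
```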

There is a price to pay for this efficiency, though, and that is the potential for a loss of resolution. For example, HGNC symbols SNPH and RAD21L1 map to HGNC IDs 15931 and 16271, respectively. However, The Synergizer gives both 15931 and 16271 as translations for SNPH, and also as translations for RAD21L1. This translation is correct at the gene level (both of these correspond to the same Ensembl Gene ID ENSG00000101298). However, the two IDs correspond to two distinct alternatively spliced transcripts of the same gene, and that resolution is lost in translation.

To see how this loss of resolution can occur, imagine that two identifiers in namespace X, say x1 and x2, both map to the same peg id p, a situation we call a collision on p. Now imagine that two identifiers in namespace Y, say y1 and y2, also both map to peg id p. In this case, even if the correct mapping for x1 in Y were y1 alone, the Synergizer would report that x1 maps to both y1 and y2. The same holds for x2.
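
The collision scenario can be reproduced with a toy model in Python. The dictionaries below stand in for the per-namespace peg tables; this is a sketch of the idea, not the Synergizer's actual data structures.

```python
def translate_one(ident, table_x, table_y):
    """Return every identifier in Y that shares a peg id with ident."""
    pegs = set(table_x.get(ident, ()))
    return sorted(y for y, y_pegs in table_y.items() if pegs & set(y_pegs))


# x1 and x2 collide on peg "p", and so do y1 and y2.
table_x = {"x1": ["p"], "x2": ["p"]}
table_y = {"y1": ["p"], "y2": ["p"]}

print(translate_one("x1", table_x, table_y))  # ['y1', 'y2']
print(translate_one("x2", table_x, table_y))  # ['y1', 'y2']
```

Nothing in the tables records that x1 belongs with y1 rather than y2, which is exactly the information lost when several identifiers share one peg.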

If we want to preserve the peg-based scheme, the only way to avoid the problem described above is, in essence, to increase the average number of peg ids per database identifier. This would come at the cost of increased space and time requirements, but would reduce the potential for loss of mapping resolution. We have selected the current number of pegs to provide the fastest service to the greatest number of users given limited computational resources. We will consider increasing the resolution, based on user input, in future releases of The Synergizer.