Annotation Standards:

The following standards are provisional and acknowledge that current annotation tools are not yet optimized for rice. Nevertheless, it is the consensus of the annotation community that some form of annotation should be done on newly sequenced PACs or BACs. Each annotation group should post its standards.

1) Each gene will receive a unique identifying number. The identifying number consists of the clone number/name followed by a decimal and a gene number (Example: PAC#.X). The PAC and BAC names should be those assigned by the RGP and Clemson groups, respectively. The genes do not have to be numbered in order. This unique gene identifier should be used in submissions to databases and should not be changed throughout the course of the genome project.

For genes appearing in their entirety on overlapping BACs/PACs, or spanning multiple BACs/PACs, annotators should use "synonyms" to link the gene models in data submissions. The gene may have two identifiers, since it is physically present on two separate clones, but these two identifiers should be linked in the databases or submissions. Every effort should be made to ensure that the gene is assigned the same common name on each clone and that the gene structure (intron/exon coordinates) is identical on each clone.
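The identifier and synonym rules above can be sketched in Python as follows. This is only an illustration, not an IRGSP tool: the names `make_gene_id`, `GeneModel`, and `link_synonyms`, and the clone names in the usage note, are hypothetical. It also assumes that exon coordinates are stored relative to the start of the gene, so that the same structure on two different clones compares equal.

```python
from dataclasses import dataclass, field


def make_gene_id(clone_name: str, gene_number: int) -> str:
    """Build an identifier of the form <clone name>.<gene number>."""
    if gene_number < 1:
        raise ValueError("gene number must be a positive integer")
    return f"{clone_name}.{gene_number}"


@dataclass
class GeneModel:
    gene_id: str
    common_name: str
    # Assumption: (start, end) pairs relative to the gene start,
    # so they do not depend on where the gene sits on a clone.
    exons: list[tuple[int, int]]
    synonyms: set[str] = field(default_factory=set)


def link_synonyms(a: GeneModel, b: GeneModel) -> list[str]:
    """Cross-link two models of the same gene and report discrepancies.

    The models are always linked; the returned list names anything an
    annotator should reconcile (common name, exon structure).
    """
    problems = []
    if a.common_name != b.common_name:
        problems.append("common names differ")
    if a.exons != b.exons:
        problems.append("exon structures differ")
    a.synonyms.add(b.gene_id)
    b.synonyms.add(a.gene_id)
    return problems
```

For example, a gene seen on two hypothetical clones `PAC_A` and `BAC_B` would keep both identifiers, `PAC_A.3` and `BAC_B.7`, with each recorded as a synonym of the other. An empty return value from `link_synonyms` means the name and structure already agree.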

2) All groups agree on a standard nomenclature for predicted proteins:

Sequences with 100% identity at the amino acid level to known proteins will receive the same, original gene name.

Sequences with less than 100% identity but with significant homology to known proteins will be called "putative" proteins of the same name. The name of the nearest hit will be included as a note. Sequences that are clearly related to a gene family can be called "XXX-like" or "similar to XXX" protein.

Protein matches with BLASTP bit scores of >100, e-values of < e-20, or equivalent criteria, will be regarded as significant homologies.

Sequences with homology to unknown ESTs will be called "unknown." The EST hit will be included in a note. The homology standard is at least 95% identity at the nucleic acid level over ~90% of the length of the entire EST, and the match should cover two adjacent exons.

Sequences predicted by multiple gene prediction programs with no homology to an EST will be called "hypothetical protein." The gene prediction programs will be included in a note.

Homology to proteins with higher, but still significant, e-values or lower bit scores should be examined to estimate the function of as many predicted genes as possible.
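The naming rules in item 2 can be read as a decision cascade, sketched below in Python. This is a minimal illustration under stated assumptions, not an official implementation: the names `ProteinHit`, `EstHit`, and `name_gene` are hypothetical, the bit-score and e-value thresholds are treated as alternatives (either one suffices), "~90%" is taken as a coverage fraction of at least 0.90, and "multiple gene prediction programs" is taken to mean two or more distinct programs. The "XXX-like" and "similar to XXX" family names require curator judgment and are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProteinHit:
    name: str
    identity: float   # percent amino acid identity, 0-100
    bit_score: float  # BLASTP bit score
    e_value: float    # BLASTP e-value


@dataclass
class EstHit:
    accession: str
    identity: float             # percent nucleotide identity, 0-100
    est_coverage: float         # fraction of the EST length aligned, 0-1
    spans_adjacent_exons: bool  # match covers two adjacent exons


def is_significant(hit: ProteinHit) -> bool:
    # Assumption: either threshold alone counts as significant.
    return hit.bit_score > 100 or hit.e_value < 1e-20


def est_supports(hit: EstHit) -> bool:
    return (hit.identity >= 95
            and hit.est_coverage >= 0.90
            and hit.spans_adjacent_exons)


def name_gene(protein_hit: Optional[ProteinHit],
              est_hit: Optional[EstHit],
              predictors: list[str]) -> Optional[tuple[str, str]]:
    """Return (name, note), or None when no rule in item 2 applies."""
    if protein_hit is not None:
        if protein_hit.identity == 100.0:
            return protein_hit.name, ""
        if is_significant(protein_hit):
            return (f"putative {protein_hit.name}",
                    f"nearest hit: {protein_hit.name}")
    if est_hit is not None and est_supports(est_hit):
        return "unknown", f"EST hit: {est_hit.accession}"
    programs = sorted(set(predictors))
    if len(programs) >= 2:
        return "hypothetical protein", "predicted by: " + ", ".join(programs)
    return None
```

A model that returns `None` here, or whose best protein hit falls just short of the thresholds, is the kind of case the last paragraph of item 2 asks annotators to examine by hand.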

3) Coordinates of predicted proteins or other recognizable features such as repeated sequences including simple repeats, transposable elements, and markers will be deposited in the databases.

4) A repeated sequence database will be created at TIGR. All groups are urged to submit their discoveries.

5) An annotator e-mail group will be started. Annotators should submit e-mail addresses to Maria-Ines Benito at mbenito@tigr.org.
 


Posted March 11, 2000, Maria-Ines Benito, Takuji Sasaki, and Ben Burr
Amended by the IRGSP February 5, 2002

Copyright (C) The International Rice Genome Sequencing Project (IRGSP). 2005 All rights reserved.