IRGSP Meeting, Clemson September 19-20, 2000
The following is a synthesis of the major events of the Clemson meeting and the subsequent decisions and actions that have been taken as a result of the meeting. A detailed report of the meeting by Nancy A. Eckardt, including progress reports by the participants, appears in Plant Cell 12: 2011-2017 (November, 2000).
Major points:
1) First look and preliminary assessment of the Monsanto data. Tomoya Baba and his colleagues end-sequenced and fingerprinted the sequenced Monsanto BACs and analyzed all of the Monsanto sequence information assigned to chromosome 1 at the time of the meeting. Their results give an idea of the level of tracking error, the quality of the chromosomal assignments at that time, and the degree to which the sequence will be useful to the RGP effort on chromosome 1. Monsanto has subsequently transferred more data to the RGP and has refined the chromosomal assignments. An important consequence of Baba's analysis is Monsanto's agreement to allow IRGSP members to search the sequence information directly rather than relying solely on chromosomal assignment.
2) Revised finishing standards. As a result of extensive discussions, the finishing standards published in the IRGSP guidelines have been revised. A major change derives from the RGP's demonstration that a base quality value equivalent to phrap 30 will still give an error of less than one base in 10,000. The finishing standards also permit rare, short gaps.
3) New sequence release guidelines. To bring the IRGSP policies more in line with the Bermuda Sequence Agreements, immediate submission of preliminary assemblies - at least the phase2 stage - to the HTG divisions of DDBJ, EMBL, or GenBank will be required as of February 8, 2001.
4) IRGSP access to Monsanto data. In addition to the ability to request BAC by BAC sequence information based on chromosomal assignment, IRGSP members now will have the ability to use a BLAST server at the IRGSP to search by sequence.
5) Revised agreement for public release of Monsanto sequence information. Monsanto expects that the sequence information contributed to the IRGSP will be released to the public when combined with publicly generated sequence. Initially, it was stipulated that this would be at the stage of complete BACs. As a result of the Clemson meeting, and particularly the results presented by Robin Buell, Monsanto has agreed to public release at the HTGS phase2 stage.
Monsanto data:
Takuji Sasaki reported that the following materials had been received at the RGP from Monsanto:
3,416 BAC clones on 38 plates in 96-well format
Sequence .fasta and .qual files for 3,315 clones
Physical map data for 1,985 clones
125,619 STC sequences
Band size fingerprint data for 3,227 clones
Trace files have recently been transferred to the RGP, and scientists at Monsanto are continuing to refine physical map information and communicate these updates to the RGP.
As the RGP receives notice of the signing of transfer agreements, the RGP will send the data pertaining to chromosomal regions claimed by IRGSP members. BACs will be shipped from a third party contracted by Monsanto, and can be requested directly from Gerard Barry.
The Monsanto physical map.
Brad Barbazuk briefly described the sequencing strategy. BACs identified through their STCs to specific chromosomal regions were used as "seeds" and further BACs were chosen by walking off of BAC ends. The starting material was two BAC libraries, a HindIII library of 76,229 clones with an average size of 122 kb, and an SphI library of 9984 clones. Most of the sequenced clones came from the HindIII library. An attempt was made to sequence both ends of each clone. A total of 3434 BACs were sequenced but only 3,391 were used because of either low quality or contamination. The assembled sequence covers 393 Mb and is comprised of 1106 contigs. Within the contigs there is a coverage of about 12 reads/kb.
A physical map scaffold was assembled by identifying a total of 1035 STS matches to the assembled sequence. The data was reanalyzed resulting in the association of 762 markers to 940 clones. As the anchored clones are frequently members of clusters, 1790 clones are anchored with respect to the RGP and Cornell maps. Subsequently, 2804 clones have been anchored to date.
The physical map was further refined by fingerprinting all of the BACs, which were then assembled with FPC software. Brad further showed an integration with the CUGI physical map based on joint assembly through the CUGI STCs.
Preliminary analysis of Monsanto data:
The RGP received the first batch of Monsanto data on July 1 and Tomoya Baba and his colleagues immediately began an evaluation which included resequencing the BAC ends and refingerprinting the 3416 BACs received.
A number of problems were noted with the materials as received. All of these problems have been resolved. Because of duplications, the actual number of uniquely named clones delivered was 3,246. The relationship between BAC number and sequencing project was only transferred for the BACs that had been assigned a chromosomal location. The physical map data file also contained apparent errors for 28 clones where either the clone and its associated sequence file were assigned to two chromosomal locations or the same clone was associated with two locations and two sequence files. A number of BAC contigs are anchored by multiple markers. When these were displayed graphically for chromosome 1, it was evident that there were a number of discontinuous contigs. Most of this problem had a trivial solution in that the Monsanto map order was inconsistent with the RGP genetic map. When the markers were placed in original order most of the contigs were contiguous with the genetic map. There remained only two inconsistencies for chromosome 1.
Independent BAC end sequencing by the RGP was done as an additional check. For the 3,245 non-redundant clones, Monsanto reported 5,886 STCs and the RGP obtained 6,057. Both groups reported identical sequences for 2,676 clones or 82.5% at one or both ends. 260 clones (8%) were discrepant at both ends. The STCs also revealed that 16 contaminant clones with either E. coli or chloroplast sequences at both ends, as noted by Brad Barbazuk, had been included in the collection. The correspondence between the independently obtained STCs and the assembled sequence data assigned to the clone was analyzed for a subset of 59 clones in 13 contigs and singletons on chromosome 1. Twelve of the 59 had discrepancies between two or all three sets of data.
The corrected Monsanto physical map for chromosome 1, with 324 BACs in 87 contigs or singletons, was compared in detail with the RGP PAC/BAC physical map of which about 32% has been completely sequenced. Sixty-three clones in 9 contigs have the potential to extend previously obtained PAC contigs. A further 13 clones in 5 contigs may also fill gaps in the RGP map, but their location is being confirmed. Sixteen BACs in 7 contigs actually can be assigned to other chromosomes. Of the remainder which overlap with the RGP contiguous map, the location of 78 clones in 16 contigs was confirmed and 154 clones in 50 contigs are currently being checked.
A very detailed look at the Monsanto sequence that covers 9.27 Mb of continuous complete sequence on 1S was obtained by screening the 1,672 BACs where there was agreement between the Monsanto and RGP STCs at both ends. A high stringency BLASTN search identified 63 clones that covered 4.12 Mb of this sequence. Of these, 37 had been assigned to chromosome 1, 11 were previously assigned to other chromosomes, and 15 clones had been unassigned.
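The counts reported in this section can be cross-checked with a few lines of arithmetic. The sketch below only re-adds figures quoted above; the category labels are informal shorthand introduced here, not terms used by the RGP.

```python
# Figures copied from the report above; labels are informal shorthand.
non_redundant_clones = 3245
identical_stc_clones = 2676   # identical STCs at one or both ends
discrepant_both_ends = 260

# Chromosome 1 comparison: category -> (clones, contigs)
chr1_categories = {
    "may extend PAC contigs": (63, 9),
    "may fill gaps, location being confirmed": (13, 5),
    "assignable to other chromosomes": (16, 7),
    "overlap with RGP map, confirmed": (78, 16),
    "overlap with RGP map, being checked": (154, 50),
}

# Clones found by the BLASTN screen of the 1S sequence, by prior assignment
blast_screen = {"chromosome 1": 37, "other chromosomes": 11, "unassigned": 15}


def percent(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100.0 * part / whole, 1)


total_clones = sum(clones for clones, _ in chr1_categories.values())
total_contigs = sum(contigs for _, contigs in chr1_categories.values())

print(percent(identical_stc_clones, non_redundant_clones))  # share with identical STCs
print(percent(discrepant_both_ends, non_redundant_clones))  # share discrepant at both ends
print(total_clones, total_contigs)                          # should match 324 BACs, 87 contigs
print(sum(blast_screen.values()))                           # should match 63 clones
print(percent(4.12, 9.27))                                  # share of the 1S sequence covered
```

The five chromosome 1 categories add up to the 324 BACs and 87 contigs or singletons stated for the corrected map, and the three groups from the BLASTN screen add up to the 63 clones identified.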
Resolution:
In the agreement with MAFF, Monsanto directed that the RGP would give members of the IRGSP data associated with their claimed chromosome or chromosome region based on the chromosomal assignment of the data. This will still be the case. However, from the preliminary analysis, it is apparent that the assignments are sometimes incomplete or inaccurate. Subsequent to the meeting, it was agreed that the RGP could provide an additional entry into the data. The RGP has set up a BLAST server for use by IRGSP members against both assembled and unassembled Monsanto BAC sequences to identify Monsanto clones and the associated sequence projects for themselves based on sequence identity.
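As an illustration of how a member might screen search results for candidate Monsanto clones, here is a minimal sketch. It assumes the hits are available in the common 12-column tab-delimited BLAST layout (query, subject, percent identity, alignment length, and so on); the report does not say what output the IRGSP server provides. The clone names, the sample rows, and the identity and length cutoffs are all hypothetical.

```python
import csv
import io

# Hypothetical hits in 12-column tabular form:
# query, subject, %identity, length, mismatches, gap opens,
# q.start, q.end, s.start, s.end, e-value, bit score
SAMPLE_HITS = "\n".join([
    "PAC_query\tMON_clone_A\t99.85\t12000\t18\t0\t1\t12000\t501\t12500\t0.0\t22000",
    "PAC_query\tMON_clone_B\t91.20\t800\t70\t2\t300\t1099\t1\t800\t1e-120\t900",
    "PAC_query\tMON_clone_C\t99.10\t450\t4\t0\t5000\t5449\t10\t459\t0.0\t800",
])


def strong_hits(tabular_text, min_identity=99.0, min_length=1000):
    """Return (query, subject, identity, length) for hits passing both cutoffs.

    The cutoffs are illustrative stand-ins for a 'high stringency' search.
    """
    hits = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        identity, length = float(row[2]), int(row[3])
        if identity >= min_identity and length >= min_length:
            hits.append((row[0], row[1], identity, length))
    return hits


for query, subject, identity, length in strong_hits(SAMPLE_HITS):
    print(f"{query} matches {subject}: {identity}% identity over {length} bp")
```

Requiring both a high identity and a long alignment is one simple way to separate true overlaps from matches to repeats, though suitable cutoffs would have to be chosen by each group.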
Public release of Monsanto sequence data:
Robin Buell presented a simulation for three rice BACs of the quality of data available after 4X and 8X production sequencing coverage. The important point of this demonstration was that an average of 16.7% of the completed sequence was not represented in assemblies of greater than 2 kb at 4X coverage, whereas an average of 2.4% was missing from such assemblies at 8X coverage. The US groups, and most groups following the Bermuda recommendations, practice posting of automated assemblies of greater than 2 kb. The results from the simulation suggest that if publicly generated 4X coverage can be combined with Monsanto coverage - generally 5X - then there would be 14.3% more sequence in GenBank.
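The measure behind these percentages can be sketched as follows. This is not Dr. Buell's simulation; it assumes, for simplicity, that contigs do not overlap and that every contig base corresponds to finished sequence. The contig lengths in the example are invented.

```python
def percent_not_posted(contig_lengths_bp, finished_length_bp, min_contig_bp=2000):
    """Percent of the finished sequence absent from contigs longer than the cutoff.

    Simplifying assumption: contigs are non-overlapping and map entirely
    onto the finished sequence.
    """
    posted = sum(n for n in contig_lengths_bp if n > min_contig_bp)
    return 100.0 * (finished_length_bp - posted) / finished_length_bp


# Invented example: a 100 kb BAC whose draft assembly has two large contigs
# and two contigs below the 2 kb posting cutoff.
example = percent_not_posted([50000, 30000, 1500, 1000], 100000)
print(example)

# Difference between the averages reported for 4X and 8X coverage
gain = round(16.7 - 2.4, 1)
print(gain)
```

The second calculation reproduces the 14.3% figure as the difference between the two reported averages, treating combined public and Monsanto coverage as roughly equivalent to the 8X case.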
At the meeting Dr. Buell estimated that it took her group about three weeks to reach phase2 coverage whereas final closure might take months. Phase2 release to the HTGS section of GenBank occurs when all of the contigs within a BAC are correctly ordered and oriented whether or not there are gaps. This makes for a convincing argument to release the combined public and Monsanto sequences prior to completion of the BACs as it will greatly shorten a lag in releasing reasonably high quality sequence to the public. Monsanto has accepted this argument and has agreed to these terms for public release.
Finishing standards:
We need to balance our stated goal of obtaining a high quality reference sequence for the cereal genomes with the necessity of timely and cost-effective completion of the rice sequence. To that end the finishing standards for completed sequences were discussed at length at the Clemson meeting and in subsequent e-mails.
Prior to the meeting, the Guidelines specified that a base quality value equivalent to phrap 40 was required to obtain less than one error in 10,000 bp. Takuji Sasaki and his colleagues at the RGP suspected that this quality value was unnecessarily high to obtain the stated error level. Kimiko Yamamoto analyzed results for 27 PACs covering 3.7 Mb. She identified all nucleotides with a quality score less than 40 after the first assembly, asked what their original quality score was, and then looked among these for those whose base call had changed after the quality score was raised to 40 or above in the subsequent finishing processes. A total of 229 nucleotides in the 27 PACs had their base call revised during finishing. The bulk of these changes occurred where the original score was 15 or less; the highest was 28. No nucleotide assignment where the original score was 30 or above after the first assembly changed. Therefore it doesn't appear to be useful to spend the extra effort to raise phrap scores above 30 during the finishing process.
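The logic of this comparison can be sketched in a few lines. This is not the RGP's actual procedure or software; it assumes the first-pass and finished consensus share the same coordinates (no insertions or deletions), and the data are invented. The helper for the nominal phred-scale error probability is included only to show why a score of 40 corresponds to the original guideline of one error in 10,000.

```python
def nominal_error_probability(quality):
    """Nominal phred-scale error probability: P = 10 ** (-Q / 10)."""
    return 10 ** (-quality / 10)


def revised_low_quality_calls(first_pass, finished, threshold=40):
    """List positions whose base call changed during finishing.

    first_pass: list of (base, quality) from the first assembly
    finished:   list of bases from the finished sequence
    Assumes both lists use the same coordinates (a simplification).
    Returns (position, original_quality, original_base, final_base) tuples
    for positions that started below `threshold`.
    """
    revised = []
    for pos, ((base, quality), final_base) in enumerate(zip(first_pass, finished)):
        if quality < threshold and base != final_base:
            revised.append((pos, quality, base, final_base))
    return revised


# Invented toy data, not taken from the 27 PACs
first_pass = [("A", 52), ("C", 12), ("G", 35), ("T", 28), ("A", 30)]
finished = ["A", "G", "G", "C", "A"]

changes = revised_low_quality_calls(first_pass, finished)
highest_original_quality = max(quality for _, quality, _, _ in changes)
print(changes)
print(highest_original_quality)
print(nominal_error_probability(40))
```

The quantity of interest is the highest original score among the revised calls: if it stays below 30, as the RGP found, raising scores from 30 to 40 changes no base calls.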
Gaps are another problem and require considerable effort to close. In rice, the biggest problems appear to occur in GC-rich regions. The problem with ignoring these regions is that coding sequences in the grasses are also GC-rich and ignoring these regions might lead to gaps in gene sequences. The RGP also examined this problem for two problem PACs that were subsequently finished. In these two PACs, 67% and 88% of the GC-rich regions overlapped ORFs.
Nevertheless, Robin Buell and her colleagues at TIGR argued persuasively for agreement to leave rare gaps in otherwise completed sequences. It may take a long time for difficult gap regions to be filled at the requisite quality level. It was eventually agreed to permit no more than one small gap for each completed BAC or PAC and that the sequencing groups had to make a good faith effort to continue to close these gaps. It was also agreed that these finishing standards would be revisited in a year to see if further changes were required and to evaluate how they were being applied.
The Revised Finishing Standards, effective immediately, read as follows:
Minimum Standards (exceptions are noted in annotation comments):
(i) A single contig is generated.
(ii) The bulk of the sequence should be derived from multiple subclones sequenced from both strands. Less than 3% of the sequence should be derived from multiple subclones sequenced from the same strand with the same chemistry. These regions must pass manual inspection by the finisher for any sequence problems, but do not need to be annotated unless the sequence quality falls below phred 30. Less than 1% of the sequence should be derived from a single subclone. In the case of a region covered by a single subclone, the clone must be sequenced either on both strands or with two different chemistries, and the region must be annotated.
(iii) More than 99% of the sequence has less than one error in 10,000 base pairs as reported by phrap or other sequence assembly consensus scores. The RGP has empirically determined that a phrap score of 30 or above exceeds the standard of less than one error in 10,000 bp. Exceptions must be manually checked and have passed inspection for possible sequencing problems. These areas must be annotated.
(iv) The assembled sequence is confirmed by restriction enzyme digestion.
Exceptions, all of which require an annotation note:
(i) In instances where gap closure/finishing is difficult to complete, sequences should be submitted to DDBJ/EMBL/GenBank as complete under the following conditions:
a) the sequence within a single BAC or PAC clone contains at most one gap of less than 500 bp
b) the contigs on either side of the gap are oriented and ordered correctly
c) all currently available closure/finishing techniques have been attempted to close the gap
In addition, the sequencing group is strongly encouraged to continue making a good faith effort to close the gap as long as possible, and to revise their submission to DDBJ/EMBL/GenBank if and when they close it.
(ii) In the case of regions consisting only of PCR fragments (including PCR products from subclones), high fidelity polymerase should be used and if the PCR products are cloned before sequencing, at least two PCR clones are necessary.
(iii) In the case of simple repeat sequences, including single nucleotide repeats, where the number of repeats cannot be determined, the length of the repeat region should be estimated by restriction enzyme digestion or PCR.
(iv) Every effort must be made to resolve large repeats, particularly if they contain unique sequence. Should problems persist, the size of the repeat region, confirmed by restriction enzyme digestion or PCR, the nature of the repeats, the size of repeats, and the finishing problem should be indicated.
(v) Sequences of bacterial transposons and other obvious contaminants are screened and deleted from the finished sequence; the size, sequence and position of the deleted region are indicated.
(vi) Where the confirmed sequences of overlapping regions between adjacent PACs or BACs differ, these differences should be indicated.
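The numeric parts of the minimum standards and of the single-gap exception lend themselves to an automated pre-check. The following is a minimal sketch, not an IRGSP tool: the record fields are hypothetical, and the share of bases at phrap 30 or above is used as a stand-in for standard (iii) on the strength of the RGP result. Conditions that need human judgment, such as manual inspection, annotation, and whether all closure techniques have been attempted, are not covered.

```python
from dataclasses import dataclass


@dataclass
class CloneSummary:
    """Hypothetical per-clone summary; field names are illustrative only."""
    n_contigs: int
    gap_sizes_bp: list                      # estimated size of each remaining gap
    contigs_ordered_and_oriented: bool
    pct_same_strand_same_chemistry: float   # standard (ii), must be < 3%
    pct_single_subclone: float              # standard (ii), must be < 1%
    pct_bases_q30_or_better: float          # proxy for standard (iii), must be > 99%
    restriction_digest_confirmed: bool      # standard (iv)


def finishing_issues(clone):
    """Return a list of numeric criteria the clone fails (empty if none)."""
    issues = []
    if clone.n_contigs != 1:
        # Exception (i): at most one gap of less than 500 bp,
        # with flanking contigs correctly ordered and oriented.
        one_small_gap = (
            clone.n_contigs == 2
            and len(clone.gap_sizes_bp) == 1
            and clone.gap_sizes_bp[0] < 500
            and clone.contigs_ordered_and_oriented
        )
        if not one_small_gap:
            issues.append("not a single contig and single-gap exception not met")
    if clone.pct_same_strand_same_chemistry >= 3.0:
        issues.append("3% or more from same strand with same chemistry")
    if clone.pct_single_subclone >= 1.0:
        issues.append("1% or more from a single subclone")
    if clone.pct_bases_q30_or_better <= 99.0:
        issues.append("99% or less of bases at phrap 30 or above")
    if not clone.restriction_digest_confirmed:
        issues.append("assembly not confirmed by restriction digest")
    return issues


finished = CloneSummary(1, [], True, 1.2, 0.3, 99.8, True)
one_gap = CloneSummary(2, [300], True, 1.2, 0.3, 99.8, True)
large_gap = CloneSummary(2, [800], True, 1.2, 0.3, 99.8, True)

for clone in (finished, one_gap, large_gap):
    print(finishing_issues(clone))
```

A clone that passes this pre-check would still require the annotation notes and manual inspections that the standards specify.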
Working Group decisions:
The Working Group, composed of a single representative from each participating country, met for a very productive meeting. The following is a summary of the decisions taken:
1) Avestha Gengraine Technologies located in Bangalore, India and headed by Dr. Villoo Morwalla Patel applied to be an independent member of the IRGSP and wishes to sequence a megabase on chromosome 8 while also doing comparative sequencing on Basmati (indica) rice.
The Working Group asked that Avestha Gengraine integrate its activities with the other Indian groups. The Working Group is reluctant to have independent members from the same country.
2) Canada, once funded, wants to switch its efforts to chromosome 9 from 2. They will coordinate their efforts with Thailand and the two will integrate their work rather than working on separate sections. The Working Group welcomes this level of cooperativity.
3) Dr. Antonio Oliveira from Brazil petitioned to join the IRGSP to sequence on chromosome 12. This move is welcomed by France. The Working Group welcomes Brazil to the IRGSP.
4) The RGP, in anticipation of completion of their current work on chromosomes 1 and 6, would like to claim chromosomes 7 and 8. This claim was approved.
5) The status of inactive members - those that have not obtained funding and those that are not contributing a megabase of complete sequence a year to public databases - was discussed. It is anticipated that the unfunded groups that do not receive funding shortly will be expected to withdraw from the IRGSP. Those groups that are not contributing a megabase a year in the near future will be asked to withdraw although a deadline was not established.
6) The nature of the data that Monsanto has so far sent to the IRGSP was summarized, and Dr. Sasaki described the distribution mechanism. We also talked about the additional data we expect to receive from Monsanto in the near future. Tomoya Baba and his colleagues have done an extraordinary amount of work over the last month analyzing the data. We discussed in detail all that was known about the data received by the RGP to date. Based on this information, the members talked about the best way that the Monsanto data can be used and how it might be distributed. All members agreed that the Monsanto contribution will be very beneficial to the progress of the IRGSP.
7) Rod Wing and his colleagues proposed that the release policy published in the IRGSP guidelines be changed. The revised release policy reads as follows, and this change will become effective February 8, 2001:
The Rice Genome Sequencing Project agrees to the immediate release of finished, but not necessarily annotated, sequence in units of intact BAC or PAC inserts. These finished sequences will conform to the accuracy standards described above. Release means submission to a public database such as DDBJ, EMBL, or GenBank. Immediate submission of preliminary assemblies - at least the phase 2 stage - to the HTG divisions of DDBJ, EMBL, or GenBank is also required. Phase 2 sequences are unfinished BACs or PACs, in ordered, oriented contigs, with or without gaps. Furthermore, members are encouraged to follow the Bermuda guidelines (http://www.gene.ucl.ac.uk/hugo/bermuda.htm) for data release and intellectual property.
The next IRGSP meeting will be held on February 8, 2001 in Tsukuba, Japan.
Posted January 10, 2001 by B. Burr and T. Sasaki.