Nomenclature of Libraries


Sequence Cleaning and clustering protocol

The EST sequences were obtained partly as sequence and quality (qual) files and partly as trace files. The trace files were processed with phred using the command-line option -trim_alt, with the cutoff value set to 0.1. The resulting sequences were then separated according to library name.

  1. The sequences were quality trimmed using an in-house algorithm. The quality values were averaged over a sliding window of 20 bases, with the cutoff set at 25. If the window average fell below 25, the first base of the sequence was removed and the window was moved along; this continued until no window average below 25 was found for 10 consecutive windows.
  2. Vector and adaptor removal was done in two separate steps, because some sequences had a gap between the vector and adaptor sequences. For vector removal the options were set to -minmatch 10 and -minscore 20; for adaptor removal, -minmatch and -minscore were both set to 8, since the adaptor sequence is shorter. The same procedure was repeated for the reverse complement of the vector sequences.
  3. PolyA and polyT trimming were done separately. If a sequence had a polyA stretch (more than 18 As) anywhere within one third of the sequence length, the polyA was trimmed. For a terminal polyA, the length restriction was not applied. The same procedure was followed for polyT trimming.
  4. Chimeric sequences were initially removed on the basis of the presence of an internal adaptor. However, this does not ensure removal of all chimeric sequences. After the first round of clustering, some sequences were found to be chimeric because they formed a bridge between two different contigs. Those sequences were removed after manual inspection.
  5. Contaminated sequences were identified in the datasets as those with very high similarity to the cloning vector used or to some other vector.
  6. The resulting sequences were then matched to their quality scores. An in-house program was used to map the raw data to the cleaned data and to trim the quality scores accordingly, so that each cleaned sequence retained only the quality scores of its remaining bases.
  7. Ribosomal sequences were removed from the dataset after a BLAST search against the ribosomal sequences from NCBI, requiring at least 95% identity over 500 bases, a bit score higher than 500 and an E-value less than e-20.
  8. These data were used for the final EST clustering. Two clustering methods were adopted: d2 cluster and the TIGR clustering method (tgicl), which is based on megablast. The final assembly was done using cap3. The tgicl clustering and assembly was run at two levels, (a) stringent and (b) non-stringent; however, little difference was observed between the two. The assembly parameters were set as follows: minimum percent identity for overlaps, 94; minimum overlap length, 30; maximum length of unmatched overhangs, 30; maximum number of sequences in a clustering slice, 1000.
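
The sliding-window quality trimming in step 1 can be illustrated with a short Python sketch. The in-house program itself is not shown on this page, so this is only one possible reading of the description: the function name trim_5prime_start is hypothetical, and the stopping rule (keep dropping the first base until 10 consecutive windows all average at least 25) is an assumption about how "10 consecutive windows" was applied.

```python
def trim_5prime_start(quals, window=20, cutoff=25, good_run=10):
    """Return the index of the first base kept after 5' quality trimming.

    Assumed interpretation: drop the first base while any of the next
    `good_run` sliding windows (each `window` bases long) has a mean
    quality below `cutoff`. Windows that would run past the end of the
    read are not checked. Reads shorter than one window are trimmed
    completely (the function returns len(quals)).
    """
    n = len(quals)
    start = 0
    while start + window <= n:
        all_good = True
        for k in range(good_run):
            lo = start + k
            hi = lo + window
            if hi > n:
                break  # no more full windows to check
            if sum(quals[lo:hi]) / window < cutoff:
                all_good = False
                break
        if all_good:
            return start
        start += 1  # remove the first base and try again
    return n


if __name__ == "__main__":
    # 15 low-quality bases followed by 45 high-quality bases
    example = [5] * 15 + [40] * 45
    print(trim_5prime_start(example))
```

The same idea could be applied to the 3' end by running the function on the reversed list of quality values; whether the original program did so is not stated here.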

 

Soybean Sequence separation protocol

The soybean sequences were removed from the EST sequences by a combination of approaches.

1. The soybean EST sequences were downloaded from the soybean EST consortium at http://www.genome.wustl.edu/est/index.php?soybean=1.

2. Libraries made from Phytophthora-contaminated material were removed from the soybean EST libraries. The Phytophthora EST sequences were then compared to the soybean ESTs. Sequences with >95% identity over a non-gapped HSP of 100 bases were separated as soybean sequences.

3. The remaining ESTs were compared with the EST libraries sMY, sMA, sMC, sML, sZO, sZS and sZG at the same level of stringency. ESTs that matched these libraries were categorized as P. sojae ESTs.

4. For the remaining ESTs, the G+C/A+T content ratio and the TA/CG dinucleotide ratio were calculated. If the G+C/A+T ratio was < 1 and the TA/CG ratio was > 1, the sequence was grouped as soybean.

5. If the G+C/A+T ratio was > 1 and the TA/CG ratio was < 1, the sequence was labelled as Phytophthora sojae.

6. Sequences that did not fall into either of the above two categories were grouped as "probable" and sorted manually.
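
The composition test in steps 4 to 6 can be sketched in Python as follows. This is a minimal illustration rather than the program actually used: the function name classify_by_composition and the returned labels are hypothetical. Counts are compared directly (for example G+C < A+T) instead of dividing, which gives the same result as the ratio tests while avoiding division by zero for sequences that lack a given base or dinucleotide.

```python
def classify_by_composition(seq):
    """Classify an EST as 'soybean', 'P. sojae' or 'probable'.

    Rules, as described in steps 4 to 6:
      (G+C)/(A+T) < 1 and TA/CG > 1  -> soybean
      (G+C)/(A+T) > 1 and TA/CG < 1  -> P. sojae
      anything else                  -> probable (for manual sorting)
    """
    s = seq.upper()
    gc = s.count("G") + s.count("C")
    at = s.count("A") + s.count("T")
    # "TA" and "CG" cannot overlap themselves, so str.count
    # gives the true number of dinucleotide occurrences.
    ta = s.count("TA")
    cg = s.count("CG")

    if gc < at and ta > cg:
        return "soybean"
    if gc > at and ta < cg:
        return "P. sojae"
    return "probable"


if __name__ == "__main__":
    for est in ("ATATTATAAT", "GCGCCGCGGC", "ACGT"):
        print(est, classify_by_composition(est))
```

Sequences returned as "probable" correspond to the group that was sorted manually in step 6.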

 


Phytophthora Soybean EST Database Version 1.0. For questions and comments email: sutripa@vbi.vt.edu