Abstract | With the advent of sequencing techniques, a deluge of plant genome projects has emerged, all calling for accurate, high-throughput comparative genomic approaches such as orthology prediction. Because most plant genomes are currently incomplete, polyploid, and sequenced at low coverage, orthology prediction needs to be improved with evolution-related information such as sequence variability and gene order. Most orthology prediction approaches for large genome-scale datasets rely on reciprocal best BLAST hits (RBBHs), but they tend to mispredict paralogs as orthologs when genome sequences are incomplete or genes have been lost. In addition, there is increasing interest in identifying the orthologs most likely to have retained similar function. To address these issues, we have developed a high-throughput, multi-threaded computational approach that predicts orthologs from DNA and protein sequences and identifies which orthologs share a similar genomic context and are therefore likely to have similar function. First, we predict putative orthologs as the RBBHs found in common by the DNA-based and protein-based searches. This dual approach is used whenever possible to reduce the number of false positives. Second, conservation of genomic context is used to provide further support for ortholog assignments and to help identify missing orthologs. Orthologs are predicted to be more likely to have similar function if their relative genomic context is conserved. Third, the list of putative orthologs for pairs of plant species (e.g. B. distachyon and S. bicolor) is used to explore pathway similarities for the same biological process and to discover putative enzymes omitted in some plant species. |
---|
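
The first two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the hit tuples `(query, subject, score)`, the function names `best_hits`, `rbbh` and `context_support`, the tie-breaking rule, and the neighbourhood `window` are all assumptions made for illustration. The sketch computes reciprocal best hits separately for DNA and protein searches, keeps only the pairs found by both, and then counts how many other candidate pairs lie near a given pair in both gene orders as a crude measure of genomic context conservation.

```python
from typing import Dict, Iterable, List, Set, Tuple

Hit = Tuple[str, str, float]   # (query gene, subject gene, score); hypothetical layout
Pair = Tuple[str, str]         # (gene in species A, gene in species B)


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each query to its highest-scoring subject (first hit wins on ties)."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def rbbh(a_vs_b: Iterable[Hit], b_vs_a: Iterable[Hit]) -> Set[Pair]:
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}


def context_support(pair: Pair, candidates: Set[Pair],
                    order_a: List[str], order_b: List[str],
                    window: int = 2) -> int:
    """Count other candidate pairs within `window` genes of `pair` in both genomes.

    `window` is an assumed parameter; the abstract does not specify one.
    """
    pos_a = {gene: i for i, gene in enumerate(order_a)}
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    a, b = pair
    if a not in pos_a or b not in pos_b:
        return 0
    support = 0
    for x, y in candidates:
        if (x, y) == pair or x not in pos_a or y not in pos_b:
            continue
        if abs(pos_a[x] - pos_a[a]) <= window and abs(pos_b[y] - pos_b[b]) <= window:
            support += 1
    return support


# Toy data (invented for illustration only).
dna_ab = [("a1", "b1", 200.0), ("a1", "b2", 150.0), ("a2", "b2", 180.0), ("a3", "b3", 90.0)]
dna_ba = [("b1", "a1", 200.0), ("b2", "a2", 180.0), ("b3", "a3", 90.0)]
prot_ab = [("a1", "b1", 300.0), ("a2", "b2", 250.0), ("a3", "b4", 80.0)]
prot_ba = [("b1", "a1", 300.0), ("b2", "a2", 250.0), ("b4", "a3", 80.0)]

dna_pairs = rbbh(dna_ab, dna_ba)
prot_pairs = rbbh(prot_ab, prot_ba)
consensus = dna_pairs & prot_pairs   # pairs supported by both DNA and protein RBBHs

order_a = ["a1", "a2", "a3"]          # gene order along a chromosome of species A
order_b = ["b1", "b2", "b3", "b4"]    # gene order along a chromosome of species B
support = context_support(("a1", "b1"), consensus, order_a, order_b)
```

In this toy example the pair involving `a3` is dropped because the DNA and protein searches disagree on its partner, which is the kind of false positive the dual approach is meant to filter out. A real pipeline would parse the hit tuples from BLAST output and would need a more careful treatment of ties, strand, and scaffold boundaries.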