AGRP

AGRP platform

I. Introduction

What is AGRP platform Database?

Angiosperms are rich in economic uses and ecological functions and are distributed across the globe. They include many important agricultural crops and medicinal plants, and their diversity and adaptability are of great value to ecosystem stability and human livelihoods. We have built 57 tools for polyploidy and collinearity analysis, collectively named IPAP (Integrated Polyploidy Analysis Pipeline). The AGRP also stores 2,028 homologous structure dotplots; hierarchical lists of homologous genes with 27,996 and 28,164 rows, using P. vulgaris and V. vinifera as references, respectively; 70,673 event-related genes and genomic hybrid gene pairs; information on 912,552 Angiosperm genes and 106,111 structural domains; 38,690 orthogroups; 8,470,419 gene function annotation results; and 912,552 pathway map annotation records. In addition, we collected 135 genomic information resources from 58 Angiosperm species and 286,467 Angiosperm-related publications. All of this information can be displayed, searched and downloaded from the platform. We have also built an Angiosperm communication and sharing module to facilitate learning and exchange among researchers. We plan to keep improving and updating the database with newly assembled genomes and comparative genomic studies, and we hope that the AGRP will be a valuable resource for studying Angiosperm genomes and breeding.

II. Datasets and Workflow

Data sources

The AGRP platform Database contains CDS, PEP and GFF3 files for 3 plant species. Genome data and gene annotations are available for download from the Assembly database of NCBI (https://www.ncbi.nlm.nih.gov/assembly/) and/or Phytozome V12.1 (https://phytozome.jgi.doe.gov/pz/portal.html).

Data analysis pipelines

Identification of polyploidy events. To identify polyploidy events, we first performed genome-wide BLASTP searches (E-value < 1e-5, score > 100) within and between the studied genomes using BLAST (Altschul et al., 1990). Then, using CollinearScan (Wang et al., 2006), the best 10 BLASTP matches were used to infer collinear gene regions (blocks) within or between genomes; the maximum gap was set to 50 intervening genes, and large gene families with more than 50 members were removed from the blocks. The median synonymous nucleotide substitution rate (Ks) of the collinear genes was then used to determine the degree of divergence of each identified block. We calculated Ks values between collinear gene pairs using the BioPerl statistical module and the Nei-Gojobori method (Nei & Gojobori, 1986). We further plotted these gene pairs as dot plots based on genomic location, using different colored dots to distinguish whether an anchor gene pair was the best BLAST hit within/between genomes. We then identified the orthologous and paralogous genomic regions within and between genomes based on the generated homology dot plots. Between genomes, a region was identified as orthologous if the median Ks of the gene pairs located in that collinear region was approximately equal to the Ks peak associated with species divergence; within genomes, a region was identified as paralogous if the median Ks of the gene pairs located in that collinear region was approximately equal to the Ks peak associated with a particular polyploidization event. Finally, the history of WGD can be inferred from the ratio of syntenic depths within and between genomes.
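
The hit filtering and block classification described above can be sketched as follows. This is a minimal illustration, not the platform's own code. It assumes BLASTP results in the standard 12-column tabular format (-outfmt 6), interprets "score" as the bit score, and skips self hits; the function names, the `tolerance` parameter and the event labels are hypothetical.

```python
from collections import defaultdict
from statistics import median


def filter_blast_hits(lines, max_evalue=1e-5, min_score=100.0, top_n=10):
    """Keep hits with E-value < max_evalue and bit score > min_score,
    then retain the top_n best-scoring hits per query gene."""
    hits = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        query, subject = cols[0], cols[1]
        evalue, score = float(cols[10]), float(cols[11])
        if query == subject:
            continue  # assumption: self hits are not informative
        if evalue < max_evalue and score > min_score:
            hits[query].append((subject, evalue, score))
    # Sort each query's hits by descending bit score and truncate.
    return {
        q: sorted(h, key=lambda hit: -hit[2])[:top_n]
        for q, h in hits.items()
    }


def classify_block(block_ks, peaks, tolerance=0.1):
    """Assign a collinear block to the event whose Ks peak is closest to
    the block's median Ks; return None if no peak is within tolerance.
    `tolerance` is an illustrative stand-in for "approximately equal"."""
    block_median = median(block_ks)
    name, peak = min(peaks.items(), key=lambda kv: abs(kv[1] - block_median))
    return name if abs(peak - block_median) <= tolerance else None
```

Here `peaks` would map an event label (for example a species divergence or a polyploidization event) to its Ks peak, so that a between-genome block matching the divergence peak is called orthologous and a within-genome block matching a polyploidization peak is called paralogous.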

Identification of event-related genes and dating of key evolutionary events. Plant genomes evolve at different rates (Cui et al., 2006; Wang et al., 2011), making it difficult to determine the timing of key events in their evolutionary history. Here, we constructed a correction algorithm for re-dating key evolutionary events in monocotyledons. First, based on the orthologous and paralogous regions identified within and between genomes, we isolated the sets of orthologous and paralogous genes resulting from species divergence and polyploidy events. Second, we determined the evolutionary rates associated with key evolutionary events in monocotyledons by performing kernel function analysis of the Ks values between these orthologs and paralogs. Finally, we performed several rounds of Ks correction for the evolutionary rates of these events, each using a different basis for correction. The first round of correction set the Ks distribution peaks of the divergence event between monocotyledons and grape to the same value. After the first round of correction, there was still a large divergence between the τ and σ events produced by homologous plants. Therefore, similar to the first round of correction, we performed several more rounds of Ks correction based on the τ and σ events. Details of the correction process can be found in our previous articles (Wang et al., 2017; Wang, J et al., 2018; Wang, J et al., 2019b; Wang et al., 2022), and the computational script of the correction algorithm is stored on GitHub (https://github.com/wangjiaqi206/corrected-evolutionary-dating).
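
The idea behind one round of correction can be illustrated with a simplified sketch. It is not the published algorithm, for which the cited articles and the GitHub repository are authoritative. It assumes a Gaussian kernel with a fixed bandwidth to locate a Ks peak and a single multiplicative factor to move that peak onto a common target value; all function names and default parameters are hypothetical.

```python
import math


def kde_peak(ks_values, bandwidth=0.05, grid_step=0.01, ks_max=3.0):
    """Return the Ks value at which a Gaussian kernel density estimate
    is highest. Normalising constants are omitted because only the
    location of the maximum matters."""
    best_x, best_density = 0.0, -1.0
    steps = int(round(ks_max / grid_step))
    for i in range(steps + 1):
        x = i * grid_step
        density = sum(
            math.exp(-0.5 * ((x - k) / bandwidth) ** 2) for k in ks_values
        )
        if density > best_density:
            best_x, best_density = x, density
    return best_x


def correction_factor(observed_peak, target_peak):
    """Multiplicative factor that moves observed_peak onto target_peak."""
    if observed_peak <= 0:
        raise ValueError("observed_peak must be positive")
    return target_peak / observed_peak


def correct_ks(ks_values, factor):
    """Rescale every Ks value of a lineage by the same factor."""
    return [k * factor for k in ks_values]
```

In this simplified picture, the peak of the same divergence event is estimated separately for each species, each species' Ks values are rescaled so that these peaks coincide, and the procedure is repeated with other events as the basis for further rounds.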

Comparison of genome fractionation. By comparing the rates of gene retention and loss, we can characterize the degree of divergence between subgenomes produced by different polyploidization events. Here, the gene loss rate was calculated by dividing the number of collinear genes lost in the study species by the total number of genes per chromosome in the reference genome. The genome retention rate was calculated by dividing the number of the most conserved collinear genes (orthologs retained in both reference genomes) in the study species by the number of relatively conserved collinear genes (orthologs retained only in the main reference genome). In addition, the degree of divergence between event-produced subgenomes can be inferred with a statistical method we developed previously, the polyploidy index (P-index) (Wang, J et al., 2019a). In this study, the P-index among the subgenomes of the Acorus tatarinowii, Vanilla planifolia, Asparagus officinalis, A. setaceus and Zingiber officinale genomes was calculated using V. vinifera, E. guineensis and M. acuminata as references, with the sliding window set to 95 and disregarding the degree of divergence of subgenomes that are too similar or too different (parameter < 0.08 and > 0.8). Previous studies have shown that a P-index of ~0.3 can be used as a threshold to distinguish auto- and allopolyploidies (Wang, J et al., 2019a). The reason is that known and previously inferred allopolyploids, including Brassica napus, Zea mays, Gossypium hirsutum and Brassica oleracea, always have a P-index > 0.3 (Schnable et al., 2011; Chalhoub et al., 2014; Li et al., 2014; Wang, M et al., 2015; Renny-Byfield et al., 2017), whereas the inferred autopolyploids Glycine max, Populus trichocarpa and Actinidia chinensis (Murat et al., 2017; Wang et al., 2017; Wang, JP et al., 2018) often have a P-index < 0.3.
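
The two rates and the threshold rule above can be written down directly. The sketch below only restates the definitions given in the text; it does not compute the P-index itself, whose formula is given in Wang, J et al. (2019a). The function names are hypothetical, and treating a value exactly at the threshold as autopolyploid is an assumption, since the text only gives an approximate cutoff.

```python
def gene_loss_rate(lost_collinear_genes, reference_chromosome_genes):
    """Collinear genes lost in the study species divided by the total
    number of genes on the reference chromosome."""
    if reference_chromosome_genes <= 0:
        raise ValueError("reference chromosome must contain genes")
    return lost_collinear_genes / reference_chromosome_genes


def gene_retention_rate(retained_in_both_refs, retained_in_main_ref_only):
    """Most conserved collinear genes (orthologs in both references)
    divided by relatively conserved ones (orthologs only in the main
    reference)."""
    if retained_in_main_ref_only <= 0:
        raise ValueError("denominator must be positive")
    return retained_in_both_refs / retained_in_main_ref_only


def classify_polyploidy(p_index, threshold=0.3):
    """Apply the ~0.3 P-index threshold: above it suggests
    allopolyploidy, otherwise autopolyploidy (boundary handling is an
    assumption of this sketch)."""
    return "allopolyploid" if p_index > threshold else "autopolyploid"
```

The counts passed to the two rate functions would come from the hierarchical alignment lists, one reference chromosome at a time.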

The pipeline for inferring ancestral karyotypes and evolution. The inference of ancestral genome structure and paleogenome remodelling trajectories is divided into 7 main steps.
1) Genome-wide comparison of the species involved, using BLAST (Altschul et al., 1990), to confirm conserved homologous genes between and within genomes.
2) The homology information obtained from BLASTP is entered into CollinearScan (Wang et al., 2006) or MCScanX (Wang et al., 2012) for collinearity analysis to identify synteny blocks.
3) Identification of orthologs and paralogs associated with speciation and polyploidy by inter- and intra-genomic comparisons.
4) Identification of conserved ancestral regions (CARs) by combining dotplots and gene collinearity between genomes.
5) Identification of ancient chromosomal rearrangements in conjunction with species trees. For example, if CARs 1 and 2 are adjacent in both study species A and B, then it is reasonable to assume that CARs 1 and 2 were fused in the ancestor of A and B. If CARs 1 and 2 are not adjacent in study species B, it is difficult to determine the ancestral structure of species A and B. A reference species then needs to be introduced: if CARs 1 and 2 are also adjacent in the reference species R, the ancestral structure of A and B would still be CAR1-CAR2. The inference of ancestral chromosome rearrangements also needs to consider the effects of duplication, and we have modelled the possible scenarios.
6) By identifying and collating all the CAR rearrangements, we can infer the ancestral karyotype of the study species and its composition from the bottom up.
7) After determining the ancestral genome, we can identify the fusion patterns and rearrangement trajectories of paleochromosomes by comparing the CARs in the dotplot between the modern and ancestral genomes. For example, if the two chromosomes corresponding to the same ancestral chromosome in the study species are structurally different, for instance through a translocation, then this change should have occurred after the WGD; conversely, if they share the same structure, such as an end-to-end joining fusion (EEJ) or nested chromosome fusion (NCF), the change should have occurred before the WGD.
The actual process of inferring ancestral genomes and paleochromosome remodelling trajectories can be more complex and requires careful and lengthy verification and validation.
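
The adjacency rule in step 5 can be expressed as a small decision function. This is a hedged sketch of that single rule only, not of the full pipeline. It assumes each genome is given as a list of chromosomes, each an ordered list of CAR labels; the function names are hypothetical, and every case the text does not cover is reported as "undetermined".

```python
def car_adjacencies(chromosomes):
    """Collect unordered pairs of neighbouring CARs from a genome given
    as one ordered list of CAR labels per chromosome."""
    pairs = set()
    for cars in chromosomes:
        for left, right in zip(cars, cars[1:]):
            pairs.add(frozenset((left, right)))
    return pairs


def infer_ancestral_adjacency(pair, genome_a, genome_b, reference=None):
    """Apply the rule for one pair of CARs.

    - adjacent in both A and B: fused in their ancestor
    - adjacent in only one of them: consult the reference species R and
      call the pair fused if it is adjacent there as well
    - anything else: undetermined (not covered by the rule)
    """
    key = frozenset(pair)
    in_a = key in car_adjacencies(genome_a)
    in_b = key in car_adjacencies(genome_b)
    if in_a and in_b:
        return "fused"
    if (in_a or in_b) and reference is not None:
        if key in car_adjacencies(reference):
            return "fused"
    return "undetermined"
```

Duplication effects, which the text notes must also be considered, are deliberately left out of this sketch.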

Gene Family Analysis Pipeline. Gene families can be easily identified in IPAP, which offers three sequence matching modes: Blast, Diamond, and Blast match. These three functions can be used to match target sequences against known protein sequences and thus filter for the desired gene families. There is also a structural domain identification function, which allows structural domains of target sequences to be predicted through the Pfam database. After the gene family sequences are identified, researchers can perform multiple sequence alignments and then construct phylogenetic trees. Codon and CpG island prediction can also be performed in IPAP, and the non-synonymous substitution rate (Ka) and synonymous substitution rate (Ks) can be calculated. In addition, researchers can predict and map motifs and gene structures. Together these functions cover the common needs of gene family analysis.

III. Browse

Community and collection of resources

Items	Brief Introduction	Records
Pair-wise dotplots	Homologous structure dotplot related to Angiosperm	2,028
P. vulgaris Hierarchical alignments	Pvu Hierarchical alignments gene pairs	27,996
V. vinifera Hierarchical alignments	Vvi Hierarchical alignments gene pairs	28,164
Event-related genes	Information on event-related gene pairs	70,673
Functional genes	Function-related gene family information	50,178
Transcription factor	Transcription factor gene family information	50,056
ncRNA	These include rRNA, tRNA, snRNA, snoRNA and microRNA	*
Transposable elements	Details of Transposable elements	*
Orthologous Gene	Gene family information for Orthologous Gene	38,690
Gene Function	Information on functionally annotated genes	8,470,419
Pathway	Detailed information on Angiosperm-related Pathway	912,552
Domain	Information on domains identified using Pfam libraries	106,111
Angiosperm Community page	An online community for plant karyotype research	-

IV. FAQ

A. What information does the AGRP platform Database provide for plant karyotype evolution?

We built a user-friendly, web-based comparative and functional genomics platform, an integrated platform for polyploid and paleo-genomic evolutionary analysis in the Angiosperm (AGRP, http://www.angiosperm.cgrpoee.top/). We established 45 tools for polyploidy and collinearity analysis, collectively named IPAP (Integrated Polyploidy Analysis Pipeline). We then selected 25 representative Angiosperm genomes assembled to the chromosome level for systematic bioinformatics analysis. The analysis results are also stored in AGRP, which helps researchers easily query, compare and download these genomic resources and bioinformatics results. For example, the platform stores 2,028 = (25 + 1) * (25 + 1) * 3 homologous structure dot plots. The homologous gene hierarchy tables with P. vulgaris and V. vinifera as references have 27,996 and 28,164 rows of homologous gene pairs, respectively. Based on the homologous gene hierarchy tables, 70,673 event-related genes and genomic hybrid gene pairs were obtained. Paleo-genomic karyotypes of Angiosperm were inferred, and the evolutionary trajectories are animated and displayed. MCScanX was used for gene collinearity analysis, and the results of the 676 (26 * 26) MCScanX runs can be explored using SynVisio, an interactive tool for multi-scale genome visualization. The platform also holds information on 912,552 Angiosperm genes and 106,111 structural domains, 38,690 orthogroups, 8,470,419 gene function annotation results and 912,552 pathway map annotation records.

Using PfamScan and HMM models downloaded from the Pfam database, we identified functionally important genes such as growth hormone genes, anthocyanin genes, flowering-related genes, resistance genes, nitrogen fixation-related genes, oil synthesis genes and m6A genes. cmscan was used to identify ncRNAs. RepeatModeler (v2.0.3), RepeatMasker and DeepTE were used for transposable element (TE) prediction. OrthoFinder (v2.0.9) was used to identify orthologous genes across the 25 genomes. Functional and pathway annotations of genes were analyzed with InterProScan (v5.51) and GhostKOALA, a tool provided by the Kyoto Encyclopedia of Genes and Genomes (http://www.genome.jp/kegg/), respectively. We also collected 135 genomic information resources from 58 Angiosperm species (Table S1).
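
The dot plot and MCScanX counts quoted above follow from simple arithmetic, which can be checked as below. The factors are taken as stated in the text; what the extra genome (the "+ 1") and the factor of 3 stand for is not explained there, so the variable names are only descriptive guesses.

```python
# 25 selected genomes plus one further genome, as written in the text.
selected_genomes = 25
genomes_compared = selected_genomes + 1

# Every genome is compared against every genome, including itself.
pairwise_comparisons = genomes_compared * genomes_compared

# The text multiplies by 3 without saying what the factor represents.
plots_per_comparison = 3
dotplots = pairwise_comparisons * plots_per_comparison

print(pairwise_comparisons)  # number of MCScanX result sets
print(dotplots)              # number of stored dot plots
```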

B. How to download the data in AGRP platform Database?

All data in the Angiosperm Platform database can be downloaded from the appropriate resource page, including genome data, pangenomic data, transcriptome data, Angiosperm pathways, JBrowse, etc.

C. How to contact us?

If you encounter any problems or find any bugs when visiting the AGRP platform Database, please email wangjinpeng@ibcas.ac.cn or yuzijian1010@163.com, submit a request in the Angiosperm Community, or contact us at:

Address: 21 Bohai Road, Caofeidian, Tangshan 063210, Hebei, China

Tel: +86-0315-8805607

V. Citation

Data files contained in the AGRP platform Database are free of all copyright restrictions and made fully and freely available for non-commercial use. Users of the data should cite the following articles:

・Angiosperm genomes bioinformatics platform: A comprehensive database of Angiosperm genomes

・Multi-dimensional reshuffling of ancestral genome during post-polyploid diploidization shaped family Angiosperm