J Plant Biotechnol 2016; 43(3): 293-301
Published online September 30, 2016
https://doi.org/10.5010/JPB.2016.43.3.293
© The Korean Society of Plant Biotechnology
Correspondence to : e-mail: shjo@seeders.co.kr, jhlee@seeders.co.kr
This is an Open-Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
High-quality gene sets are necessary for functional research of genes. Although Perilla is a commonly cultivated oil crop and vegetable crop in Southeast Asia, the quality of its available gene set is insufficient. To construct a high-quality Perilla gene set, we sequenced mRNAs extracted from different tissues of
Keywords
The identification of an organism’s full gene set plays an important role as the foundation for comprehensive genomic and transcriptome studies of the organism (Martin and Wang. 2011). Only a few years ago, our understanding of the transcriptome was dependent on traditional methods such as low-throughput expressed sequence tag (EST)-based or chip-based methods (e.g., DNA microarray), which have several limitations (Wang et al. 2009). Recently, whole- transcriptome profiling based on next-generation sequencing (NGS) has started to provide insights into the large-scale landscape and dynamics of the complex world of the transcriptome (Chen et al. 2011a).
The RNA-sequencing (RNA-seq) approach and powerful bioinformatic tools provide more quantitative and precise measurements of gene expression levels than earlier methods (Marguerat and Bahler et al. 2010), with highly reproducible results and few systematic differences among technical replicates (Zhang et al. 2012). Moreover, many of the latest studies have applied RNA-seq to various biological purposes including the identification of all expressed transcripts (Li et al. 2014), the improvement of genome assembly to find missing information and better understand the structure of the reference genome (Chen et al. 2011b), the detection of novel transcribed regions and alternatively spliced forms (Garber et al. 2011), SNP marker discovery (Iorizzo et al. 2011), and others, both with and without a reference genome (Gongora-Castillo and Buell. 2013). Therefore, it is urgent to gain
Perilla, an oilseed crop belonging to the Lamiaceae family, is commonly cultivated in Asian countries such as Korea, Japan, China, and Nepal. There are commonly considered to be four Perilla species and one variety (Jung et al. 2005).
Perilla seeds contain large amounts of unsaturated fatty acids, conferring various benefits to human health (Bumblauskiene et al. 2009). Hence, most Perilla studies to date have focused on the characterization and biological activities of the Perilla metabolites. The genes involved in the biosynthesis of anthocyanins, flavones, and monoterpenoids have been described (Lee et al. 2014). Various genomic, transcriptomic, and molecular analyses of the cultivated species
In this study, we report the production of a high-quality gene set of
Starting with the total RNA from each of the five samples, mRNA was purified using poly(A) selection or rRNA depletion and then chemically fragmented and converted into single- stranded cDNA using random hexamer priming. Next, the second strand was generated to create double-stranded cDNA. The library construction began with the generation of blunt-end cDNA fragments from the double-stranded cDNA. Then, an ‘A’ base was added to the blunt end in order to make the fragments ready for ligation to sequencing adapters. After the size selection of ligates, the ligated cDNA fragments that contained adapter sequences were enhanced via PCR using adapter-specific primers. Paired-end libraries were prepared using the Illumina TruSeq RNA Sample Preparation Kit v2 (catalog #RS-122-2001, Illumina, San Diego, CA). Then, the libraries were quantified using the KAPA library quantification kit (Kapa Biosystems KK4854) following the manufacturer’s instructions. Each library was then loaded onto the Illumina Hiseq2000 platform (100-bp paired-end reads). High-throughput sequencing was performed to ensure that each sample met the desired average sequencing depth. The sequence data from the leaf tissue has been deposited to the National Agricultural Biotechnology Information Center (NABIC) Sequence Read Archive (SRA) with the accession number NN-1473-000001.
Before assembly, the raw sequence reads with sufficient quality [Phred score (Q) ≥ 20] were selected. Reads < 25 bp in length were then removed using the SolexaQA package (v1.13) (Cox et al. 2010). The remaining high-quality reads were then used for
The
The assembled transcripts were designated as ‘P.citriodora XSLXXXXXXtXXX’. ‘P.citriodora’ is an abbreviation of the species name. The first digit ‘X’ following ‘P.citriodora’ denotes the assembly version, and ‘S’ means a reference from SEEDERS. The ‘L’ is an abbreviation for “locus,” and the following six digits signify the representative transcript (locus) number. The ‘t’ is an abbreviation for “transcript,” and the last three digits signify the transcript number for the locus.
The assembled
To functionally annotate the assembled transcripts, we assessed their sequence similarity with the
We obtained a total of 193,666,492 paired-end reads (19,366,649,200 bp) from the five RNA-seq experiments (Table 1). After strictly filtering by Phred quality score and read length, we obtained 173,911,506 (89.61%) cleaned reads with a total length of 15,584,692,752 bp (80.44%) (Table 1).
Table 1 . Statistics of the short-read sequencing of Perilla
Sample description | Raw reads | Cleaned reads | ||||
---|---|---|---|---|---|---|
Num. of reads | Total length (bp) | Num. of reads | Avg. length | Total length (bp) | %a | |
Leaves | 21,255,739 | 2,125,573,900 | 19,314,024 | 90.28 | 1,743,658,415 | 82.03% |
21,255,739 | 2,125,573,900 | 19,314,024 | 88.62 | 1,711,683,894 | 80.53% | |
Buds | 20,120,794 | 2,012,079,400 | 18,064,027 | 91.21 | 1,647,650,051 | 81.89% |
20,120,794 | 2,012,079,400 | 18,064,027 | 87.71 | 1,584,428,254 | 78.75% | |
Inflorescence (before fertilization) | 19,305,406 | 1,930,540,600 | 16,972,378 | 90.48 | 1,535,591,827 | 79.54% |
19,305,406 | 1,930,540,600 | 16,972,378 | 87.26 | 1,480,965,843 | 76.71% | |
Inflorescence (after fertilization) | 17,354,075 | 1,735,407,500 | 15,393,832 | 91.11 | 1,402,454,377 | 80.81% |
17,354,075 | 1,735,407,500 | 15,393,832 | 87.07 | 1,340,362,115 | 77.24% | |
Seeds | 18,797,232 | 1,879,723,200 | 17,211,492 | 93.11 | 1,602,544,844 | 85.25% |
18,797,232 | 1,879,723,200 | 17,211,492 | 89.21 | 1,535,353,132 | 81.68% | |
Total | 193,666,492 | 19,366,649,200 | 173,911,506 | 89.61 | 15,584,692,752 | 80.44% |
a%: Percentage of the total length of raw reads represented by the total length of the cleaned reads
To obtain a high-quality gene set for
Table 2 . Summary of
Assembler | Transcript type | Total count | Total length (bp) | N50a | AVGb |
---|---|---|---|---|---|
Improved assembly | Total transcripts | 86,396 | 155,964,376 | 2,675 | 1,805 |
Representative transcripts | 38,413 | 49,116,021 | 2,233 | 1,278 | |
Velvet assembly | Total transcripts | 101,855 | 155,276,941 | 2,167 | 1,524 |
Loci of transcripts | 40,455 | 40,783,832 | 1,651 | 1,008 | |
Trinity assembly | Total transcripts | 143,535 | 219,451,507 | 2,437 | 1,528 |
Loci of transcripts | 46,615 | 39,245,806 | 1,564 | 841 |
aN50: a weighted median statistic length such that 50% of the assembled transcripts are longer than the N50 length
bAVG: average length of all assembled transcripts
Length distribution of the assembled
Among the hash lengths used (
The assembled transcripts could be clustered to loci, with an average of 2.2 transcripts per locus (Table 2). An optimized representative transcript for each locus should be recommended for downstream analyses such as gene prediction and gene expression analysis. Velvet produced 40,455 clusters, and Trinity produced 46,615 clusters, but neither assembly program selects representative transcripts for loci. Therefore, we selected representative transcripts via the following steps. We mapped the short reads used in the final assembly back to their respective transcripts using the Bowtie software (Langmead B et al. 2009). We gave the transcripts with the most mapped reads (read depth) at each locus the highest priority for consideration as representative transcripts. Then, we gave the longest transcripts among the translated amino acid sequences containing open reading frames (ORFs) the next priority. If we were unable to select a representative transcript for a given locus based on the first two steps, we selected the candidate transcript with the longest nucleotide sequence as the representative transcript. Thus, we identified 38,413 non-redundant representative transcripts with a mean length of 1,278 bp, an N50 length of 2,233 bp, and a total length of 49,116,021 bp (Table 2).
To closely look at the quality of the assembled transcripts, we identified the assembled transcripts with matches among 638 Arabidopsis genes involved in fatty acid and TAG biosynthesis pathways in the acyl-lipid metabolism database for Arabidopsis (Bates et al. 2014). First, we checked the quality of the assembled transcripts by looking into the matched coverage of the query genes. Among the three assembly outputs, the improved assembly not only had the highest percentage of
Table 3 . Statistics of sequence homology between
Improved assembly | Velvet assembly | Trinity assembly | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Perilla | Arabidopsis | Perilla | Arabidopsis | Perilla | Arabidopsis | |||||||
n | % | n | % | n | % | n | % | n | % | n | % | |
BLAST | 729 | - | 546 | - | 750 | - | 531 | - | 485 | - | 508 | - |
≥C 70% | 416 | 57.06 | 452 | 82.78 | 293 | 39.06 | 365 | 68.74 | 214 | 44.12 | 328 | 64.57 |
≥C 80% | 381 | 52.26 | 417 | 76.37 | 247 | 32.93 | 318 | 59.89 | 171 | 35.26 | 271 | 53.35 |
≥C 90% | 297 | 40.74 | 342 | 62.64 | 188 | 25.06 | 248 | 46.70 | 129 | 26.60 | 214 | 42.13 |
*BLAST: number of transcripts with e-value ≤ 1e-10 and identity ≥ 50% in sequence homology search by BLAST; C: Arabidopsis gene coverage
Table 4 . Details of the assembled
Gene Symbol | AT gene ID | Lengtha (aa) | Perilla transcript ID | Lengthb (aa) | e-valuec | F or Pd | transcripts ne |
---|---|---|---|---|---|---|---|
AT1G01090 | 429 | P.citriodora1SL011651t001 | 438 | 0 | F | 1 | |
AT2G34590 | 407 | P.citriodora1SL012138t001 | 308 | 0 | F | 4 | |
AT1G34430 | 466 | P.citriodora1SL006498t001 | 471 | 1e-156 | F | 6 | |
AT3G25860 | 481 | P.citriodora1SL003106t001 | 463 | 6e-148 | F | 6 | |
AT3G16950 | 624 | P.citriodora1SL006166t006 | 611 | 0 | F | 2 | |
AT2G38040 | 770 | P.citriodora1SL009096t001 | 754 | 0 | F | 4 | |
ATCG00500 | 489 | P.citriodora1SL024624t001 | 511 | 0 | F | 1 | |
AT5G35360 | 556 | P.citriodora1SL003110t001 | 538 | 0 | F | 1 | |
AT5G16390 | 281 | P.citriodora1SL014888t001 | 285 | 1e-54 | F | 3 | |
AT5G15530 | 256 | P.citriodora1SL034420t001 | 279 | 5e-48 | F | 2 | |
AT2G30200 | 394 | P.citriodora1SL010628t002 | 408 | 0 | F | 1 | |
AT1G62640 | 405 | P.citriodora1SL009277t001 | 402 | 0 | F | 2 | |
AT1G24360 | 320 | P.citriodora1SL004327t009 | 320 | 2e-126 | F | 6 | |
AT5G10160 | 220 | P.citriodora1SL017007t002 | 259 | 3e-102 | F | 4 | |
AT2G05990 | 391 | P.citriodora1SL015253t001 | 394 | 0 | F | 4 | |
AT3G25110 | 363 | P.citriodora1SL015493t001 | 238 | 1e-176 | P | 2 | |
AT1G08510 | 413 | P.citriodora1SL000470t003 | 422 | 0 | F | 2 | |
AT2G43710 | 402 | P.citriodora1SL024454t001 | 397 | 0 | F | 8 | |
AT1G43800 | 392 | P.citriodora1SL024454t001 | 397 | 0 | F | 8 | |
AT5G46290 | 490 | P.citriodora1SL026266t001 | 475 | 0 | F | 10 | |
AT1G74960 | 542 | P.citriodora1SL001363t001 | 542 | 0 | F | 10 | |
LACS8 | AT2G04350 | 721 | P.citriodora1SL028662t001 | 697 | 0 | F | 12 |
LACS9 | AT1G77590 | 692 | P.citriodora1SL028662t001 | 697 | 0 | F | 12 |
Endoplasmic reticulum-desaturase | |||||||
FAD2 | AT3G12120 | 384 | P.citriodora1SL000613t001 | 383 | 0 | F | 2 |
FAD3 | AT2G29980 | 387 | P.citriodora1SL002476t004 | 439 | 2e-174 | F | 8 |
FAD8 | AT5G05580 | 436 | P.citriodora1SL002476t004 | 439 | 0 | F | 16 |
Acyl-CoA- dependent TAG synthesis in Kennedy pathway | |||||||
GPAT9 | AT5G60620 | 377 | P.citriodora1SL004403t009 | 372 | 0 | F | 6 |
LPAT2 | AT3G57650 | 390 | P.citriodora1SL004485t001 | 383 | 0 | F | 6 |
DGAT1 | AT2G19450 | 521 | P.citriodora1SL006978t006 | 450 | 2e-176 | P | 2 |
DGAT2 | AT3G51520 | 315 | P.citriodora1SL006419t008 | 511 | 7e-74 | P | 6 |
DGAT3 | AT1G48300 | 285 | P.citriodora1SL003849t001 | 407 | 3e-24 | P | 1 |
PC-mediated TAG synthesis | |||||||
LPCAT | AT1G12640 | 463 | P.citriodora1SL029356t001 | 466 | 0 | F | 6 |
PDAT1 | AT5G13640 | 672 | P.citriodora1SL004897t001 | 662 | 0 | F | 3 |
PDAT2 | AT3G44830 | 666 | P.citriodora1SL004897t001 | 662 | 0 | F | 3 |
DAG-CPT1 | AT1G13560 | 390 | P.citriodora1SL004437t003 | 390 | 0 | F | 12 |
DAG-CPT2 | AT3G25585 | 390 | P.citriodora1SL004437t003 | 390 | 0 | F | 12 |
PDCT | AT3G15820 | 302 | P.citriodora1SL012582t001 | 285 | 2e-109 | P | 1 |
Oil-body protein | |||||||
AT3G01570 | 184 | P.citriodora1SL023891t001 | 183 | 2e-32 | F | 4 | |
AT5G40420 | 200 | P.citriodora1SL023844t001 | 176 | 3e-20 | P | 4 | |
AT3G18570 | 167 | P.citriodora1SL023918t001 | 157 | 1e-32 | F | 1 | |
AT4G25140 | 174 | P.citriodora1SL023850t001 | 143 | 1e-35 | P | 3 | |
Transcription factor | |||||||
WRI1 | AT3G54320 | 439 | P.citriodora1SL001076t003 | 611 | 1e-87 | P | 32 |
athe length of the Arabidopsis gene
bthe length of the
cthe e-value matched between the two sequences (Arabidopsis and
dthe full-length or partial-length transcripts in
ethe number of
We directly compared the assembled transcripts to known plant protein sequences using BLASTX (e-value ≤ 1e-10). Of the 38,413 representative transcripts, 23,667 transcripts (61.61%) matched with 16,143 sesame genes, 24,030 transcripts (62.56%) matched with 19,283 proteins in Phytozome, and 17,752 transcripts (73.87% of 24,030) had highest sequence similarity with
To identify the putative functions of the
GO functional classification of the assembled
KEGG classification of the assembled
This study highlights the utility of next-generation sequencing (RNA-seq) as a basis for gene set assembly and functional annotation in non-model species. By comparing assembly programs and modifying the assembly pipeline, we assembled 86,396 total transcripts and 38,413 representative transcripts and evaluated the quality of the assembled transcripts based on the Arabidopsis gene coverage and the proportion of full-length genes among the assembled transcripts. Our transcriptome analysis of
This work was carried out with the support of the National Agricultural Genome Program (PJ010408) of the RDA (Rural Development Administration), Republic of Korea.
J Plant Biotechnol 2016; 43(3): 293-301
Published online September 30, 2016 https://doi.org/10.5010/JPB.2016.43.3.293
Copyright © The Korean Society of Plant Biotechnology.
Ji-Eun Kim, Junkyoung Choe, Woo Kyung Lee, Sangmi Kim, Myoung Hee Lee, Tae-Ho Kim, Sung-Hwan Jo, and Jeong Hee Lee
SEEDERS Inc., Daejeon, 34015, Republic of Korea,
National Institute of Crop Science, RDA, Miryang 50424, Republic of Korea,
National Academy of Agricultural Science, RDA, Wanju 55365, Republic of Korea,
SEEDERS Inc., Daejeon, 34015, Republic of Korea
Correspondence to: e-mail: shjo@seeders.co.kr, jhlee@seeders.co.kr
This is an Open-Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
High-quality gene sets are necessary for functional research of genes. Although Perilla is a commonly cultivated oil crop and vegetable crop in Southeast Asia, the quality of its available gene set is insufficient. To construct a high-quality Perilla gene set, we sequenced mRNAs extracted from different tissues of
Keywords:
The identification of an organism’s full gene set plays an important role as the foundation for comprehensive genomic and transcriptome studies of the organism (Martin and Wang. 2011). Only a few years ago, our understanding of the transcriptome was dependent on traditional methods such as low-throughput expressed sequence tag (EST)-based or chip-based methods (e.g., DNA microarray), which have several limitations (Wang et al. 2009). Recently, whole- transcriptome profiling based on next-generation sequencing (NGS) has started to provide insights into the large-scale landscape and dynamics of the complex world of the transcriptome (Chen et al. 2011a).
The RNA-sequencing (RNA-seq) approach and powerful bioinformatic tools provide more quantitative and precise measurements of gene expression levels than earlier methods (Marguerat and Bahler et al. 2010), with highly reproducible results and few systematic differences among technical replicates (Zhang et al. 2012). Moreover, many of the latest studies have applied RNA-seq to various biological purposes including the identification of all expressed transcripts (Li et al. 2014), the improvement of genome assembly to find missing information and better understand the structure of the reference genome (Chen et al. 2011b), the detection of novel transcribed regions and alternatively spliced forms (Garber et al. 2011), SNP marker discovery (Iorizzo et al. 2011), and others, both with and without a reference genome (Gongora-Castillo and Buell. 2013). Therefore, it is urgent to gain
Perilla, an oilseed crop belonging to the Lamiaceae family, is commonly cultivated in Asian countries such as Korea, Japan, China, and Nepal. There are commonly considered to be four Perilla species and one variety (Jung et al. 2005).
Perilla seeds contain large amounts of unsaturated fatty acids, conferring various benefits to human health (Bumblauskiene et al. 2009). Hence, most Perilla studies to date have focused on the characterization and biological activities of the Perilla metabolites. The genes involved in the biosynthesis of anthocyanins, flavones, and monoterpenoids have been described (Lee et al. 2014). Various genomic, transcriptomic, and molecular analyses of the cultivated species
In this study, we report the production of a high-quality gene set of
Starting with the total RNA from each of the five samples, mRNA was purified using poly(A) selection or rRNA depletion and then chemically fragmented and converted into single- stranded cDNA using random hexamer priming. Next, the second strand was generated to create double-stranded cDNA. The library construction began with the generation of blunt-end cDNA fragments from the double-stranded cDNA. Then, an ‘A’ base was added to the blunt end in order to make the fragments ready for ligation to sequencing adapters. After the size selection of ligates, the ligated cDNA fragments that contained adapter sequences were enhanced via PCR using adapter-specific primers. Paired-end libraries were prepared using the Illumina TruSeq RNA Sample Preparation Kit v2 (catalog #RS-122-2001, Illumina, San Diego, CA). Then, the libraries were quantified using the KAPA library quantification kit (Kapa Biosystems KK4854) following the manufacturer’s instructions. Each library was then loaded onto the Illumina Hiseq2000 platform (100-bp paired-end reads). High-throughput sequencing was performed to ensure that each sample met the desired average sequencing depth. The sequence data from the leaf tissue has been deposited to the National Agricultural Biotechnology Information Center (NABIC) Sequence Read Archive (SRA) with the accession number NN-1473-000001.
Before assembly, the raw sequence reads with sufficient quality [Phred score (Q) ≥ 20] were selected. Reads < 25 bp in length were then removed using the SolexaQA package (v1.13) (Cox et al. 2010). The remaining high-quality reads were then used for
The
The assembled transcripts were designated as ‘P.citriodora XSLXXXXXXtXXX’. ‘P.citriodora’ is an abbreviation of the species name. The first digit ‘X’ following ‘P.citriodora’ denotes the assembly version, and ‘S’ means a reference from SEEDERS. The ‘L’ is an abbreviation for “locus,” and the following six digits signify the representative transcript (locus) number. The ‘t’ is an abbreviation for “transcript,” and the last three digits signify the transcript number for the locus.
The assembled
To functionally annotate the assembled transcripts, we assessed their sequence similarity with the
We obtained a total of 193,666,492 paired-end reads (19,366,649,200 bp) from the five RNA-seq experiments (Table 1). After strictly filtering by Phred quality score and read length, we obtained 173,911,506 (89.61%) cleaned reads with a total length of 15,584,692,752 bp (80.44%) (Table 1).
Table 1 . Statistics of the short-read sequencing of Perilla.
Sample description | Raw reads | Cleaned reads | ||||
---|---|---|---|---|---|---|
Num. of reads | Total length (bp) | Num. of reads | Avg. length | Total length (bp) | %a | |
Leaves | 21,255,739 | 2,125,573,900 | 19,314,024 | 90.28 | 1,743,658,415 | 82.03% |
21,255,739 | 2,125,573,900 | 19,314,024 | 88.62 | 1,711,683,894 | 80.53% | |
Buds | 20,120,794 | 2,012,079,400 | 18,064,027 | 91.21 | 1,647,650,051 | 81.89% |
20,120,794 | 2,012,079,400 | 18,064,027 | 87.71 | 1,584,428,254 | 78.75% | |
Inflorescence (before fertilization) | 19,305,406 | 1,930,540,600 | 16,972,378 | 90.48 | 1,535,591,827 | 79.54% |
19,305,406 | 1,930,540,600 | 16,972,378 | 87.26 | 1,480,965,843 | 76.71% | |
Inflorescence (after fertilization) | 17,354,075 | 1,735,407,500 | 15,393,832 | 91.11 | 1,402,454,377 | 80.81% |
17,354,075 | 1,735,407,500 | 15,393,832 | 87.07 | 1,340,362,115 | 77.24% | |
Seeds | 18,797,232 | 1,879,723,200 | 17,211,492 | 93.11 | 1,602,544,844 | 85.25% |
18,797,232 | 1,879,723,200 | 17,211,492 | 89.21 | 1,535,353,132 | 81.68% | |
Total | 193,666,492 | 19,366,649,200 | 173,911,506 | 89.61 | 15,584,692,752 | 80.44% |
a%: Percentage of the total length of raw reads represented by the total length of the cleaned reads
To obtain a high-quality gene set for
Table 2 . Summary of
Assembler | Transcript type | Total count | Total length (bp) | N50a | AVGb |
---|---|---|---|---|---|
Improved assembly | Total transcripts | 86,396 | 155,964,376 | 2,675 | 1,805 |
Representative transcripts | 38,413 | 49,116,021 | 2,233 | 1,278 | |
Velvet assembly | Total transcripts | 101,855 | 155,276,941 | 2,167 | 1,524 |
Loci of transcripts | 40,455 | 40,783,832 | 1,651 | 1,008 | |
Trinity assembly | Total transcripts | 143,535 | 219,451,507 | 2,437 | 1,528 |
Loci of transcripts | 46,615 | 39,245,806 | 1,564 | 841 |
aN50: a weighted median statistic length such that 50% of the assembled transcripts are longer than the N50 length
bAVG: average length of all assembled transcripts
Length distribution of the assembled
Among the hash lengths used (
The assembled transcripts could be clustered to loci, with an average of 2.2 transcripts per locus (Table 2). An optimized representative transcript for each locus should be recommended for downstream analyses such as gene prediction and gene expression analysis. Velvet produced 40,455 clusters, and Trinity produced 46,615 clusters, but neither assembly program selects representative transcripts for loci. Therefore, we selected representative transcripts via the following steps. We mapped the short reads used in the final assembly back to their respective transcripts using the Bowtie software (Langmead B et al. 2009). We gave the transcripts with the most mapped reads (read depth) at each locus the highest priority for consideration as representative transcripts. Then, we gave the longest transcripts among the translated amino acid sequences containing open reading frames (ORFs) the next priority. If we were unable to select a representative transcript for a given locus based on the first two steps, we selected the candidate transcript with the longest nucleotide sequence as the representative transcript. Thus, we identified 38,413 non-redundant representative transcripts with a mean length of 1,278 bp, an N50 length of 2,233 bp, and a total length of 49,116,021 bp (Table 2).
To closely look at the quality of the assembled transcripts, we identified the assembled transcripts with matches among 638 Arabidopsis genes involved in fatty acid and TAG biosynthesis pathways in the acyl-lipid metabolism database for Arabidopsis (Bates et al. 2014). First, we checked the quality of the assembled transcripts by looking into the matched coverage of the query genes. Among the three assembly outputs, the improved assembly not only had the highest percentage of
Table 3 . Statistics of sequence homology between
Improved assembly | Velvet assembly | Trinity assembly | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Perilla | Arabidopsis | Perilla | Arabidopsis | Perilla | Arabidopsis | |||||||
n | % | n | % | n | % | n | % | n | % | n | % | |
BLAST | 729 | - | 546 | - | 750 | - | 531 | - | 485 | - | 508 | - |
≥C 70% | 416 | 57.06 | 452 | 82.78 | 293 | 39.06 | 365 | 68.74 | 214 | 44.12 | 328 | 64.57 |
≥C 80% | 381 | 52.26 | 417 | 76.37 | 247 | 32.93 | 318 | 59.89 | 171 | 35.26 | 271 | 53.35 |
≥C 90% | 297 | 40.74 | 342 | 62.64 | 188 | 25.06 | 248 | 46.70 | 129 | 26.60 | 214 | 42.13 |
*BLAST: number of transcripts with e-value ≤ 1e-10 and identity ≥ 50% in sequence homology search by BLAST; C: Arabidopsis gene coverage.
Table 4 . Details of the assembled
Gene Symbol | AT gene ID | Lengtha (aa) | Perilla transcript ID | Lengthb (aa) | e-valuec | F or Pd | transcripts ne |
---|---|---|---|---|---|---|---|
AT1G01090 | 429 | P.citriodora1SL011651t001 | 438 | 0 | F | 1 | |
AT2G34590 | 407 | P.citriodora1SL012138t001 | 308 | 0 | F | 4 | |
AT1G34430 | 466 | P.citriodora1SL006498t001 | 471 | 1e-156 | F | 6 | |
AT3G25860 | 481 | P.citriodora1SL003106t001 | 463 | 6e-148 | F | 6 | |
AT3G16950 | 624 | P.citriodora1SL006166t006 | 611 | 0 | F | 2 | |
AT2G38040 | 770 | P.citriodora1SL009096t001 | 754 | 0 | F | 4 | |
ATCG00500 | 489 | P.citriodora1SL024624t001 | 511 | 0 | F | 1 | |
AT5G35360 | 556 | P.citriodora1SL003110t001 | 538 | 0 | F | 1 | |
AT5G16390 | 281 | P.citriodora1SL014888t001 | 285 | 1e-54 | F | 3 | |
AT5G15530 | 256 | P.citriodora1SL034420t001 | 279 | 5e-48 | F | 2 | |
AT2G30200 | 394 | P.citriodora1SL010628t002 | 408 | 0 | F | 1 | |
AT1G62640 | 405 | P.citriodora1SL009277t001 | 402 | 0 | F | 2 | |
AT1G24360 | 320 | P.citriodora1SL004327t009 | 320 | 2e-126 | F | 6 | |
AT5G10160 | 220 | P.citriodora1SL017007t002 | 259 | 3e-102 | F | 4 | |
AT2G05990 | 391 | P.citriodora1SL015253t001 | 394 | 0 | F | 4 | |
AT3G25110 | 363 | P.citriodora1SL015493t001 | 238 | 1e-176 | P | 2 | |
AT1G08510 | 413 | P.citriodora1SL000470t003 | 422 | 0 | F | 2 | |
AT2G43710 | 402 | P.citriodora1SL024454t001 | 397 | 0 | F | 8 | |
AT1G43800 | 392 | P.citriodora1SL024454t001 | 397 | 0 | F | 8 | |
AT5G46290 | 490 | P.citriodora1SL026266t001 | 475 | 0 | F | 10 | |
AT1G74960 | 542 | P.citriodora1SL001363t001 | 542 | 0 | F | 10 | |
LACS8 | AT2G04350 | 721 | P.citriodora1SL028662t001 | 697 | 0 | F | 12 |
LACS9 | AT1G77590 | 692 | P.citriodora1SL028662t001 | 697 | 0 | F | 12 |
Endoplasmic reticulum-desaturase | |||||||
FAD2 | AT3G12120 | 384 | P.citriodora1SL000613t001 | 383 | 0 | F | 2 |
FAD3 | AT2G29980 | 387 | P.citriodora1SL002476t004 | 439 | 2e-174 | F | 8 |
FAD8 | AT5G05580 | 436 | P.citriodora1SL002476t004 | 439 | 0 | F | 16 |
Acyl-CoA- dependent TAG synthesis in Kennedy pathway | |||||||
GPAT9 | AT5G60620 | 377 | P.citriodora1SL004403t009 | 372 | 0 | F | 6 |
LPAT2 | AT3G57650 | 390 | P.citriodora1SL004485t001 | 383 | 0 | F | 6 |
DGAT1 | AT2G19450 | 521 | P.citriodora1SL006978t006 | 450 | 2e-176 | P | 2 |
DGAT2 | AT3G51520 | 315 | P.citriodora1SL006419t008 | 511 | 7e-74 | P | 6 |
DGAT3 | AT1G48300 | 285 | P.citriodora1SL003849t001 | 407 | 3e-24 | P | 1 |
PC-mediated TAG synthesis | |||||||
LPCAT | AT1G12640 | 463 | P.citriodora1SL029356t001 | 466 | 0 | F | 6 |
PDAT1 | AT5G13640 | 672 | P.citriodora1SL004897t001 | 662 | 0 | F | 3 |
PDAT2 | AT3G44830 | 666 | P.citriodora1SL004897t001 | 662 | 0 | F | 3 |
DAG-CPT1 | AT1G13560 | 390 | P.citriodora1SL004437t003 | 390 | 0 | F | 12 |
DAG-CPT2 | AT3G25585 | 390 | P.citriodora1SL004437t003 | 390 | 0 | F | 12 |
PDCT | AT3G15820 | 302 | P.citriodora1SL012582t001 | 285 | 2e-109 | P | 1 |
Oil-body protein | |||||||
AT3G01570 | 184 | P.citriodora1SL023891t001 | 183 | 2e-32 | F | 4 | |
AT5G40420 | 200 | P.citriodora1SL023844t001 | 176 | 3e-20 | P | 4 | |
AT3G18570 | 167 | P.citriodora1SL023918t001 | 157 | 1e-32 | F | 1 | |
AT4G25140 | 174 | P.citriodora1SL023850t001 | 143 | 1e-35 | P | 3 | |
Transcription factor | |||||||
WRI1 | AT3G54320 | 439 | P.citriodora1SL001076t003 | 611 | 1e-87 | P | 32 |
athe length of the Arabidopsis gene
bthe length of the
cthe e-value matched between the two sequences (Arabidopsis and
dthe full-length or partial-length transcripts in
ethe number of
We directly compared the assembled transcripts to known plant protein sequences using BLASTX (e-value ≤ 1e-10). Of the 38,413 representative transcripts, 23,667 transcripts (61.61%) matched with 16,143 sesame genes, 24,030 transcripts (62.56%) matched with 19,283 proteins in Phytozome, and 17,752 transcripts (73.87% of 24,030) had highest sequence similarity with
To identify the putative functions of the
GO functional classification of the assembled
KEGG classification of the assembled
This study highlights the utility of next-generation sequencing (RNA-seq) as a basis for gene set assembly and functional annotation in non-model species. By comparing assembly programs and modifying the assembly pipeline, we assembled 86,396 total transcripts and 38,413 representative transcripts and evaluated the quality of the assembled transcripts based on the Arabidopsis gene coverage and the proportion of full-length genes among the assembled transcripts. Our transcriptome analysis of
This work was carried out with the support of the National Agricultural Genome Program (PJ010408) of the RDA (Rural Development Administration), Republic of Korea.
Length distribution of the assembled
GO functional classification of the assembled
KEGG classification of the assembled
Table 1 . Statistics of the short-read sequencing of Perilla.
Sample description | Raw reads | Cleaned reads | ||||
---|---|---|---|---|---|---|
Num. of reads | Total length (bp) | Num. of reads | Avg. length | Total length (bp) | %a | |
Leaves | 21,255,739 | 2,125,573,900 | 19,314,024 | 90.28 | 1,743,658,415 | 82.03% |
21,255,739 | 2,125,573,900 | 19,314,024 | 88.62 | 1,711,683,894 | 80.53% | |
Buds | 20,120,794 | 2,012,079,400 | 18,064,027 | 91.21 | 1,647,650,051 | 81.89% |
20,120,794 | 2,012,079,400 | 18,064,027 | 87.71 | 1,584,428,254 | 78.75% | |
Inflorescence (before fertilization) | 19,305,406 | 1,930,540,600 | 16,972,378 | 90.48 | 1,535,591,827 | 79.54% |
19,305,406 | 1,930,540,600 | 16,972,378 | 87.26 | 1,480,965,843 | 76.71% | |
Inflorescence (after fertilization) | 17,354,075 | 1,735,407,500 | 15,393,832 | 91.11 | 1,402,454,377 | 80.81% |
17,354,075 | 1,735,407,500 | 15,393,832 | 87.07 | 1,340,362,115 | 77.24% | |
Seeds | 18,797,232 | 1,879,723,200 | 17,211,492 | 93.11 | 1,602,544,844 | 85.25% |
18,797,232 | 1,879,723,200 | 17,211,492 | 89.21 | 1,535,353,132 | 81.68% | |
Total | 193,666,492 | 19,366,649,200 | 173,911,506 | 89.61 | 15,584,692,752 | 80.44% |
a%: Percentage of the total length of raw reads represented by the total length of the cleaned reads
Table 2 . Summary of
Assembler | Transcript type | Total count | Total length (bp) | N50a | AVGb |
---|---|---|---|---|---|
Improved assembly | Total transcripts | 86,396 | 155,964,376 | 2,675 | 1,805 |
Representative transcripts | 38,413 | 49,116,021 | 2,233 | 1,278 | |
Velvet assembly | Total transcripts | 101,855 | 155,276,941 | 2,167 | 1,524 |
Loci of transcripts | 40,455 | 40,783,832 | 1,651 | 1,008 | |
Trinity assembly | Total transcripts | 143,535 | 219,451,507 | 2,437 | 1,528 |
Loci of transcripts | 46,615 | 39,245,806 | 1,564 | 841 |
aN50: a weighted median statistic length such that 50% of the assembled transcripts are longer than the N50 length
bAVG: average length of all assembled transcripts
Table 3 . Statistics of sequence homology between
Improved assembly | Velvet assembly | Trinity assembly | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Perilla | Arabidopsis | Perilla | Arabidopsis | Perilla | Arabidopsis | |||||||
n | % | n | % | n | % | n | % | n | % | n | % | |
BLAST | 729 | - | 546 | - | 750 | - | 531 | - | 485 | - | 508 | - |
≥C 70% | 416 | 57.06 | 452 | 82.78 | 293 | 39.06 | 365 | 68.74 | 214 | 44.12 | 328 | 64.57 |
≥C 80% | 381 | 52.26 | 417 | 76.37 | 247 | 32.93 | 318 | 59.89 | 171 | 35.26 | 271 | 53.35 |
≥C 90% | 297 | 40.74 | 342 | 62.64 | 188 | 25.06 | 248 | 46.70 | 129 | 26.60 | 214 | 42.13 |
*BLAST: number of transcripts with e-value ≤ 1e-10 and identity ≥ 50% in sequence homology search by BLAST; C: Arabidopsis gene coverage.
Table 4 . Details of the assembled
Gene Symbol | AT gene ID | Lengtha (aa) | Perilla transcript ID | Lengthb (aa) | e-valuec | F or Pd | transcripts ne |
---|---|---|---|---|---|---|---|
AT1G01090 | 429 | P.citriodora1SL011651t001 | 438 | 0 | F | 1 | |
AT2G34590 | 407 | P.citriodora1SL012138t001 | 308 | 0 | F | 4 | |
AT1G34430 | 466 | P.citriodora1SL006498t001 | 471 | 1e-156 | F | 6 | |
AT3G25860 | 481 | P.citriodora1SL003106t001 | 463 | 6e-148 | F | 6 | |
AT3G16950 | 624 | P.citriodora1SL006166t006 | 611 | 0 | F | 2 | |
AT2G38040 | 770 | P.citriodora1SL009096t001 | 754 | 0 | F | 4 | |
ATCG00500 | 489 | P.citriodora1SL024624t001 | 511 | 0 | F | 1 | |
AT5G35360 | 556 | P.citriodora1SL003110t001 | 538 | 0 | F | 1 | |
AT5G16390 | 281 | P.citriodora1SL014888t001 | 285 | 1e-54 | F | 3 | |
AT5G15530 | 256 | P.citriodora1SL034420t001 | 279 | 5e-48 | F | 2 | |
AT2G30200 | 394 | P.citriodora1SL010628t002 | 408 | 0 | F | 1 | |
AT1G62640 | 405 | P.citriodora1SL009277t001 | 402 | 0 | F | 2 | |
AT1G24360 | 320 | P.citriodora1SL004327t009 | 320 | 2e-126 | F | 6 | |
AT5G10160 | 220 | P.citriodora1SL017007t002 | 259 | 3e-102 | F | 4 | |
AT2G05990 | 391 | P.citriodora1SL015253t001 | 394 | 0 | F | 4 | |
AT3G25110 | 363 | P.citriodora1SL015493t001 | 238 | 1e-176 | P | 2 | |
AT1G08510 | 413 | P.citriodora1SL000470t003 | 422 | 0 | F | 2 | |
AT2G43710 | 402 | P.citriodora1SL024454t001 | 397 | 0 | F | 8 | |
AT1G43800 | 392 | P.citriodora1SL024454t001 | 397 | 0 | F | 8 | |
AT5G46290 | 490 | P.citriodora1SL026266t001 | 475 | 0 | F | 10 | |
AT1G74960 | 542 | P.citriodora1SL001363t001 | 542 | 0 | F | 10 | |
LACS8 | AT2G04350 | 721 | P.citriodora1SL028662t001 | 697 | 0 | F | 12 |
LACS9 | AT1G77590 | 692 | P.citriodora1SL028662t001 | 697 | 0 | F | 12 |
Endoplasmic reticulum-desaturase | |||||||
FAD2 | AT3G12120 | 384 | P.citriodora1SL000613t001 | 383 | 0 | F | 2 |
FAD3 | AT2G29980 | 387 | P.citriodora1SL002476t004 | 439 | 2e-174 | F | 8 |
FAD8 | AT5G05580 | 436 | P.citriodora1SL002476t004 | 439 | 0 | F | 16 |
Acyl-CoA- dependent TAG synthesis in Kennedy pathway | |||||||
GPAT9 | AT5G60620 | 377 | P.citriodora1SL004403t009 | 372 | 0 | F | 6 |
LPAT2 | AT3G57650 | 390 | P.citriodora1SL004485t001 | 383 | 0 | F | 6 |
DGAT1 | AT2G19450 | 521 | P.citriodora1SL006978t006 | 450 | 2e-176 | P | 2 |
DGAT2 | AT3G51520 | 315 | P.citriodora1SL006419t008 | 511 | 7e-74 | P | 6 |
DGAT3 | AT1G48300 | 285 | P.citriodora1SL003849t001 | 407 | 3e-24 | P | 1 |
PC-mediated TAG synthesis | |||||||
LPCAT | AT1G12640 | 463 | P.citriodora1SL029356t001 | 466 | 0 | F | 6 |
PDAT1 | AT5G13640 | 672 | P.citriodora1SL004897t001 | 662 | 0 | F | 3 |
PDAT2 | AT3G44830 | 666 | P.citriodora1SL004897t001 | 662 | 0 | F | 3 |
DAG-CPT1 | AT1G13560 | 390 | P.citriodora1SL004437t003 | 390 | 0 | F | 12 |
DAG-CPT2 | AT3G25585 | 390 | P.citriodora1SL004437t003 | 390 | 0 | F | 12 |
PDCT | AT3G15820 | 302 | P.citriodora1SL012582t001 | 285 | 2e-109 | P | 1 |
Oil-body protein | |||||||
AT3G01570 | 184 | P.citriodora1SL023891t001 | 183 | 2e-32 | F | 4 | |
AT5G40420 | 200 | P.citriodora1SL023844t001 | 176 | 3e-20 | P | 4 | |
AT3G18570 | 167 | P.citriodora1SL023918t001 | 157 | 1e-32 | F | 1 | |
AT4G25140 | 174 | P.citriodora1SL023850t001 | 143 | 1e-35 | P | 3 | |
Transcription factor | |||||||
WRI1 | AT3G54320 | 439 | P.citriodora1SL001076t003 | 611 | 1e-87 | P | 32 |
athe length of the Arabidopsis gene
bthe length of the
cthe e-value matched between the two sequences (Arabidopsis and
dthe full-length or partial-length transcripts in
ethe number of
Ho Bang Kim · Chang Jae Oh · Nam-Hoon Kim · Cheol Woo Choi · Minju Kim · Sukman Park · Seong Beom Jin · Su-Hyun Yun · Kwan Jeong Song
J Plant Biotechnol 2022; 49(4): 271-291Hualin Nie ・Sujung Kim・Yongjae Lee ・Hyungjun Park ・Jeongeun Lee ・Jiseong Kim・Doyeon Kim・ Sunhyung Kim
J Plant Biotechnol 2020; 47(3): 194-202Ki-Young Choi · Duck Hwan Park · Eun-Soo Seong · Sang Woo Lee · Jin Hang · Li Wan Yi · Jong-Hwa Kim · Jong-Kuk Na
J Plant Biotechnol 2019; 46(4): 274-281
Journal of
Plant BiotechnologyLength distribution of the assembled
GO functional classification of the assembled
KEGG classification of the assembled