Question

Is TransDecoder predicting the "true" set of protein-coding regions?

1

7.1 years ago

biomonte ▴ 220

Dear everyone,

I have downloaded from TSA (Transcriptome Shotgun Assembly) the contig sequences of a single species from two different BioProjects (same authors, but different studies). One file contains ~800,000 sequences while the other has ~400,000 sequences.

I'm interested in identifying protein-coding regions and I'm using TransDecoder for that purpose. After running TransDecoder I got ~300,000 and ~150,000 protein-coding regions, respectively. I'm aware that TransDecoder looks for possible ORFs in all 6 reading frames, so the initial number of contig sequences is probably correlated with the final number of proteins.
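To make the six-frame point concrete, here is a minimal Python sketch of a naive ORF scan. It is a simplification and not TransDecoder's actual algorithm: it only reports complete ATG-to-stop ORFs, ignores partial ORFs and coding-potential scoring, and the function names are made up for illustration. The 100-codon default mirrors TransDecoder's default minimum protein length.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons):
    """Yield (start, end) for ATG..stop ORFs in one frame.

    `end` is exclusive and includes the stop codon; length is counted
    in codons before the stop.
    """
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOPS:
            if (i - start) // 3 >= min_codons:
                yield (start, i + 3)
            start = None


def six_frame_orfs(seq, min_codons=100):
    """Collect (strand, frame, start, end) for all six reading frames.

    Coordinates on the "-" strand refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                found.append((strand, frame, start, end))
    return found


# Toy contig that carries a short ORF on each strand.
toy = "ATGAAATAGCTATTTCAT"
print(six_frame_orfs(toy, min_codons=2))
```

Because every contig is scanned on both strands, a single contig can contribute more than one candidate, so raw ORF counts grow with the number (and length) of contigs rather than with the number of genes.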

However, I'm wondering how one can infer the "true" (i.e. closest to reality) set of protein-coding regions for a species. For example, the proteome of Xenopus tropicalis currently contains 39,662 sequences (or mRNAs as stated here) and that of Anolis carolinensis 32,230. So why do I get so many proteins, and how can I get a more realistic number?

Thanks!

RNA-Seq TransDecoder ORF • 4.3k views

ADD COMMENT • link updated 6.9 years ago by Biostar 20 • written 7.1 years ago by biomonte ▴ 220

2

I recommend you read the manual, since it describes how to include BLASTP and Pfam searches when selecting coding regions.
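As a rough illustration of how homology evidence can be tabulated, the Python sketch below splits predicted ORFs into those with and without a BLASTP hit. It assumes tabular BLAST output (`-outfmt 6`, query ID in column 1 and e-value in column 11); the ORF IDs, the subject IDs, and the 1e-5 cutoff are hypothetical choices for the example, not values taken from the manual.

```python
def supported_queries(outfmt6_text, max_evalue=1e-5):
    """Return the set of query IDs with at least one hit at or below max_evalue.

    Expects BLAST tabular output (-outfmt 6): tab-separated, 12 columns,
    query ID in column 1 and e-value in column 11.
    """
    supported = set()
    for line in outfmt6_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            supported.add(query_id)
    return supported


def split_by_support(orf_ids, supported):
    """Partition ORF IDs into (with_hit, without_hit), preserving order."""
    with_hit = [o for o in orf_ids if o in supported]
    without_hit = [o for o in orf_ids if o not in supported]
    return with_hit, without_hit


# Hypothetical ORF and subject IDs, for illustration only.
blast_table = "\n".join([
    "orf1\tsubjA\t78.5\t200\t43\t0\t1\t200\t5\t204\t1e-30\t250",
    "orf2\tsubjB\t31.0\t60\t40\t1\t1\t60\t10\t69\t2.0\t30.1",
])
hits = supported_queries(blast_table)
print(split_by_support(["orf1", "orf2", "orf3"], hits))
```

ORFs without a hit are not necessarily spurious; keeping them as a separate list makes it possible to inspect candidate novel proteins instead of discarding them outright.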

ADD REPLY • link 7.1 years ago by biofalconch ★ 1.1k

0

Thanks for your suggestion @biofalconch, you are right. I knew about this optional step but did not use it. I agree I would get fewer sequences by including blastp or pfam searches, but what about novel proteins that are not in the reference databases? That's why I did not use it before... :(

ADD REPLY • link 7.1 years ago by biomonte ▴ 220

0

"about novel proteins that are not in the reference databases"

Unless you are working with an extreme outlier, there should be something with hints of reasonable homology in current protein databases.

ADD REPLY • link 7.1 years ago by GenoMax 141k

0

One reason would be that these sequences include multiple isoforms per gene instead of only the longest isoform. Different splice forms of the same gene can yield multiple CDSs.

ADD REPLY • link 7.1 years ago by Rohit ★ 1.5k

0

Thanks @Rohit, in both datasets there are only unique IDs, so I'm assuming that the authors kept only the longest isoform per gene before publishing the contig sequences in TSA.

ADD REPLY • link 7.1 years ago by biomonte ▴ 220

1

I can't be sure about isoforms from the unique IDs alone. What if the transcript names were changed into unique ones during pre-processing? There is no mention in TSA of keeping only the longest isoform per transcript. If there is a reference genome, mapping the contigs onto it with a splice-aware mapper would definitely help to check this. Otherwise, as @genomax suggested, there wouldn't be a huge difference.
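Once contigs are mapped, a simple check is to group contigs whose alignments overlap on the same chromosome and strand; groups with more than one contig are candidates for isoforms (or fragments) of the same gene. The Python sketch below assumes the alignments have already been reduced to one (contig, chromosome, strand, start, end) tuple per contig; the IDs and coordinates are invented, and real data would also need to handle multi-mapping contigs and overlapping genes.

```python
from collections import defaultdict


def cluster_by_locus(alignments):
    """Group contigs whose genomic spans overlap on the same chromosome and strand.

    alignments: iterable of (contig_id, chrom, strand, start, end),
    with `end` exclusive. Returns a list of clusters (lists of contig IDs).
    """
    by_region = defaultdict(list)
    for contig_id, chrom, strand, start, end in alignments:
        by_region[(chrom, strand)].append((start, end, contig_id))

    clusters = []
    for key in sorted(by_region):
        current, current_end = [], None
        for start, end, contig_id in sorted(by_region[key]):
            if current and start < current_end:
                # Overlaps the running cluster: extend it.
                current.append(contig_id)
                current_end = max(current_end, end)
            else:
                if current:
                    clusters.append(current)
                current, current_end = [contig_id], end
        if current:
            clusters.append(current)
    return clusters


# Invented coordinates, for illustration only.
mapped = [
    ("c1", "chr1", "+", 100, 500),
    ("c2", "chr1", "+", 400, 900),
    ("c3", "chr1", "+", 2000, 2500),
    ("c4", "chr1", "-", 450, 800),
]
loci = cluster_by_locus(mapped)
print(len(mapped), "contigs in", len(loci), "loci:", loci)
```

Comparing the number of loci with the number of contigs gives a rough idea of how much redundancy the assembly contains.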

ADD REPLY • link 7.1 years ago by Rohit ★ 1.5k