What is an acceptable mapping rate on de novo transcriptomes for subsequent differential expression analysis and candidate search?
5.1 years ago

I am looking for sex-biased genes in different species.

RNA-seq data: I am mapping 75 bp SE reads with RSEM using Bowtie2 (very-sensitive) against reference transcriptomes. I have this for 4 species. The data come from a very specific tissue.

For each species, the reference transcriptome was built from different PE reads coming from a mix of different stages and whole-body animals. They should represent a fairly large proportion of all the genes of the animals. I used Trinity to (re)assemble them, followed by TransDecoder.LongOrfs and TransDecoder.Predict to retain only the protein-coding genes, which are what I am mostly interested in. I don't have a reference genome, as these are not model species.

I mapped the 75 bp SE reads (sex-specific and tissue-specific) on:

  • (1) the raw Trinity output, and 70% of the reads map at least once on the reference. They cover around 60% of the reference with a min coverage of 5X. BUSCO completeness assessment of these raw Trinity outputs gives ~95% of BUSCOs (arthropoda database, odb9) for each species.

  • (2) only the longest mRNA with a coding sequence for each gene (~20 000 per reference), and 30-40% of the reads map at least once. They cover around 70% of that reference with a min coverage of 5X. BUSCO completeness assessment of these mRNA references gives ~85% of BUSCOs (arthropoda database, odb9) for each species.
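
For clarity, this is how I understand the "covers X% of the reference with a min coverage of 5X" figures above. It is a minimal sketch, assuming per-base depth is available as tab-separated lines (contig, position, depth) that include zero-depth positions, for example from `samtools depth -a`. The function name and the toy values are hypothetical.

```python
from typing import Iterable


def breadth_of_coverage(depth_lines: Iterable[str], min_depth: int = 5) -> float:
    """Fraction of reference positions with depth >= min_depth.

    Expects tab-separated lines: contig, position, depth.
    Zero-depth positions must be present in the input, otherwise the
    denominator is too small and the breadth is overestimated.
    """
    total = 0
    covered = 0
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _contig, _pos, depth = line.split("\t")[:3]
        total += 1
        if int(depth) >= min_depth:
            covered += 1
    return covered / total if total else 0.0


# Toy input (made-up values, not real data): 2 of 4 positions reach 5X.
toy_depth = ["c1\t1\t0", "c1\t2\t5", "c1\t3\t12", "c2\t1\t4"]
print(breadth_of_coverage(toy_depth, min_depth=5))
```

With real data the lines would be read from a file instead of the toy list.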

Polymorphism might lower the mapping rate because the population used for the references and the SE reads are different.

Incompleteness of the reference might also lower the rate, although the BUSCO scores are fairly good.

Other info:

  • the SE sequencing was done with polyA selector primers “smart-seq cds primer ii a”.

  • the PE sequencing for the de novo references also used a polyA selector primer. Thus, I shouldn't have ribosomal RNA (rRNA).

  • I still have transposable elements in my mRNA references (because they have coding sequences).

  • I am sure I haven’t mixed up species. I map SE reads on the corresponding reference.

  • Each SE read library has at least 30M reads.

  • by mRNA I mean whole transcript; UTRs + CDS

  • Species are insects

I wonder whether I should be concerned about this low mapping rate on the mRNA references (30-40%)?

I would like to use the mapping against the reference containing only the mRNAs, i.e. (2), because I only care about sequences that I can identify.

The ultimate goal is to perform subsequent differential expression analysis and extract candidate genes involved in the development of one trait present in males of certain species but not others.

RNA-Seq alignment • 3.0k views
ADD COMMENT
5.1 years ago

Is there a closely related organism with an annotated genome?

If so, perhaps you could check:

1) What is that genome alignment rate?

2) What percent of reads fall within a cufflinks assembly, guided by the similar genome sequence?

I believe Oases has a reference-guided assembly option (because Columbus would be the reference-guided assembly method in Velvet). However, I think cufflinks may be a little better option for reference-guided assembly of RNA-Seq data.

It's been a little while, but I thought the main issue that I had with Trinity was producing an implausible number of long transcripts with chimeric homology to known genes. However, that would be different from Oases, where I thought I tended to get partial genes (and I would therefore expect a lower overall alignment rate), but I thought they were more accurate. Nevertheless, I'm not sure what to tell you in terms of what alignment rate you should expect, except that a higher alignment rate may not necessarily result in higher quality sequences (so, I would have other metrics to compare for your assemblies).

You could also look if your organism has ESTs and/or RefSeq sequences. That could be a baseline: if your de novo alignment rate is lower than the EST/RefSeq alignment rate, maybe use of those external sequences (and possibly a small number of some of the most highly covered contigs, from either Oases or Trinity) would be a better option?

ADD COMMENT

1) The closest species with a genome is quite distant (~100 million years). Running RSEM with SE reads of the other species on it gives a mapping rate of ~1%. This species with a genome is actually part of my experimental design, but due to this low between-species alignment rate, I decided to build de novo transcriptomes for each of the other species and look for orthologues across all these different references with OrthoFinder.

2) I am not sure a new guided de novo assembly with cufflinks will help, because this closest genome is actually quite distant. Also, my Trinity assemblies are quite good. The raw Trinity assemblies have an N50 from 1000 to 1900, depending on the species. After TransDecoder.LongOrfs and TransDecoder.Predict, I get ~20 000 genes, which is the same as the Official Gene Set (~20 000 genes) of this closest species with a genome. If Oases gives more fragmented transcripts, as you say, it might not help either.
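
In case it helps when comparing numbers, by N50 I mean the usual contig-length statistic. A minimal sketch of the calculation, assuming only a list of contig lengths; the function name and the example lengths are made up for illustration:

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Length L such that contigs of length >= L hold at least half of all assembled bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for positive lengths; kept for safety


# Made-up contig lengths (total 54 bases, so half is 27: 10 + 9 + 8 reaches it).
example_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(example_lengths))
```

Note that N50 only describes contiguity, so it says nothing on its own about how many reads map back.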

My organisms are non-model systems, and they don't have sequences in any database.

Is a 30-40% mapping rate, with RSEM on solely mRNAs with a coding sequence, too low to continue? You said that, anyway, higher mapping rates don't mean better sequences.

ADD REPLY

A genome mapping rate of 1% is low, but I would usually report that with a genome aligner like TopHat2/STAR/HISAT. I would usually expect you were using RSEM with a Bowtie2 alignment for a transcriptome.

Even with a transcriptome alignment, I think you could just use Bowtie2 for your alignment. It seems to me like that alone may noticeably increase your alignment rate. To be honest, I have either not gotten great results with RSEM or I found the run-time to be unacceptable. So, I would actually prefer eXpress (from the Bowtie2 alignment, although you can also start with FASTQ files) over RSEM for your transcriptome quantification.
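
If you try Bowtie2 on its own, it prints an alignment summary to stderr, and the fraction of reads mapping at least once can be read from it. Here is a minimal sketch, assuming the summary layout Bowtie2 prints for unpaired reads. The function name is hypothetical and the numbers in the example are made up, not from your data.

```python
import re


def bowtie2_unpaired_summary(text: str) -> dict:
    """Extract read counts from a Bowtie2 summary for unpaired (SE) reads."""

    def grab(pattern: str) -> int:
        match = re.search(pattern, text, flags=re.MULTILINE)
        if match is None:
            raise ValueError(f"summary line not found: {pattern!r}")
        return int(match.group(1))

    total = grab(r"^(\d+) reads; of these:")
    unaligned = grab(r"^\s*(\d+) \([\d.]+%\) aligned 0 times")
    unique = grab(r"^\s*(\d+) \([\d.]+%\) aligned exactly 1 time")
    multi = grab(r"^\s*(\d+) \([\d.]+%\) aligned >1 times")
    mapped = unique + multi  # reads that map at least once
    return {
        "total": total,
        "unaligned": unaligned,
        "mapped_at_least_once": mapped,
        "mapping_rate": mapped / total if total else 0.0,
    }


# Made-up summary text in the unpaired-read layout.
example_summary = """1000 reads; of these:
  1000 (100.00%) were unpaired; of these:
    600 (60.00%) aligned 0 times
    300 (30.00%) aligned exactly 1 time
    100 (10.00%) aligned >1 times
40.00% overall alignment rate
"""
stats = bowtie2_unpaired_summary(example_summary)
print(stats["mapped_at_least_once"], stats["mapping_rate"])
```

Comparing this rate between the raw Trinity reference and the mRNA-only reference would show how much of the drop comes from the reference rather than from RSEM.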

ADD REPLY

On a side note: Are there actually circumstances, where it would make sense to use the transdecoder-predicted ORFs for quantification, instead of the assembled transcripts? I'm asking, because I noticed quite a few concatenated transcripts, especially for plastids. Transdecoder was able to disentangle these genes, as far as BLAST could tell. However, the mapping rate was quite a bit lower (using Salmon's quasi alignment: 99 % for assembled transcripts, 67 % for ORFs only).
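
Part of that drop may simply be reads that come from the UTRs. As a rough back-of-the-envelope sketch, assuming read start positions are uniform along a transcript and a read only maps if it lies entirely inside the reference sequence (both are simplifications, and the function name and lengths below are hypothetical):

```python
def fraction_of_reads_in_orf(transcript_len: int, orf_len: int, read_len: int) -> float:
    """Expected fraction of reads from a transcript that lie fully inside its ORF.

    Assumes uniformly distributed read starts and that reads overlapping
    the ORF boundary do not map to an ORF-only reference.
    """
    if orf_len > transcript_len:
        raise ValueError("ORF cannot be longer than the transcript")
    if read_len > transcript_len:
        return 0.0  # no read of this length fits in the transcript
    starts_in_transcript = transcript_len - read_len + 1
    starts_in_orf = max(orf_len - read_len + 1, 0)
    return starts_in_orf / starts_in_transcript


# Hypothetical transcript: 2000 nt with a 1200 nt ORF, 75 nt reads.
print(fraction_of_reads_in_orf(2000, 1200, 75))
```

This ignores expression differences between transcripts and any soft-clipping or partial mapping the aligner allows, so it only gives a rough expectation.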

ADD REPLY
