Question

Low mapping percentage after mapping RNA-seq reads to a closely related species

0

6.0 years ago

unawaz ▴ 60

Hi,

I have Illumina 100 bp paired-end RNA-Seq data from a non-model species. I mapped the reads with STAR to the genome of a closely related species. I mainly did this to see if I could use the genome of the organism for a genome guided assembly. It turns out that I got an overall alignment rate of 0.93%. That was with default parameters; adjusting the parameters only raised the rate to 1-2%. The species I'm working with is a cephalopod.

I'm not really interested in increasing the mapping rate at this point (since this was mainly an exploratory analysis). However I wanted to know what other downstream analysis can I do on the reads that actually did map (I'm assuming these would be tRNAs, histones etc). Basically, I want to be able to make plots for the sequences that did align, but I'm not sure which programs I can use to represent my data. Any ideas would be great :)

I also wanted to know whether others have done a similar analysis and got similar results. What conclusions did you draw from this sort of analysis?

RNA-Seq alignment genome • 2.1k views

updated 5.9 years ago by Friederike 8.9k • written 6.0 years ago by unawaz ▴ 60

2

use the genome of the organism for a genome guided assembly

Guided assembly from RNA-seq? What for? However, if the mapping percentage to the related species is low, you can perform a de novo assembly (transcripts), predict ORFs and BLAST them to predict functions, etc. I don't think plotting your current results makes much sense, because the few reads that mapped may just reflect sequencing noise. Still, you can try using the .sam file with htseq or even samtools view.
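To make the "look at the .sam file" suggestion concrete, here is a minimal sketch in plain Python that counts primary mapped alignments per reference sequence, roughly the kind of summary you would otherwise get from samtools. The function name `mapped_reads_per_reference` is made up for illustration, and it assumes an uncompressed SAM file read line by line.

```python
from collections import Counter

def mapped_reads_per_reference(sam_lines):
    """Count primary mapped alignments per reference name (SAM column 3).

    Hypothetical helper for illustration. With paired-end data each mate
    is a separate SAM record, so mates are counted individually.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):          # header lines (@HD, @SQ, ...)
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # SAM has 11 mandatory columns
            continue
        flag = int(fields[1])
        # 0x4 = unmapped, 0x100 = secondary, 0x800 = supplementary
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        counts[fields[2]] += 1
    return counts
```

Calling it as `mapped_reads_per_reference(open("Aligned.out.sam"))` (the file name is only an example) gives a `Counter` whose `most_common()` shows which scaffolds attract the few reads that map, which is a quick way to see whether they pile up on a handful of loci.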

6.0 years ago by Buffo ★ 2.4k

5

I think the first thing to do when you get a very low mapping percentage is to take some of your reads, BLAST them against the NCBI nr database, and see what kind of organisms you get hits to. Your reads may not be what you expected them to be.
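One way to "take some of your reads" is to draw a small random subsample from the FASTQ file and write it out as FASTA, which BLAST accepts as a query. The sketch below uses reservoir sampling so the whole file never has to sit in memory; the function name `sample_fastq_to_fasta` is hypothetical, and it assumes plain four-line FASTQ records (no wrapped sequences).

```python
import random
from itertools import islice

def sample_fastq_to_fasta(fastq_lines, n, seed=0):
    """Reservoir-sample up to n reads from FASTQ lines and return FASTA text.

    Hypothetical helper for illustration; assumes 4 lines per record.
    """
    rng = random.Random(seed)
    reservoir = []
    it = iter(fastq_lines)
    seen = 0
    while True:
        record = list(islice(it, 4))      # header, sequence, '+', qualities
        if len(record) < 4:
            break
        name = record[0].strip()[1:].split()[0]   # drop '@' and description
        entry = (name, record[1].strip())
        if seen < n:
            reservoir.append(entry)
        else:
            # keep each later read with probability n / (seen + 1)
            j = rng.randint(0, seen)
            if j < n:
                reservoir[j] = entry
        seen += 1
    return "".join(f">{name}\n{seq}\n" for name, seq in reservoir)
```

A few hundred reads is usually plenty for a contamination check. The resulting FASTA can be pasted into the NCBI BLAST web page (blastx for the protein nr database, or blastn against nt), and the taxonomy of the top hits tells you what the library actually contains.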

5.9 years ago by mastal511 ★ 2.1k

1

Can't upvote what mastal511 said enough. You literally have less than 1% idea of what your data is. For all you know, it could be contaminated.

On a side note: if there is no reference genome, why don't you attempt a de novo assembly and try to get it published?

5.9 years ago by BioinfGuru ★ 1.7k

score 2 · Accepted Answer · 2018-06-11

However I wanted to know what other downstream analysis can I do on the reads that actually did map (I'm assuming these would be tRNAs, histones etc)

There is no one-size-fits-all solution to the question "what are my genes of interest?". I would probably go about this by pretending you're looking at a normal RNA-seq data set, using the annotation of the related species whose genome you used with STAR. (This would be the GTF file that you presumably used with STAR in addition to the FASTA file that contains the genome sequence.) You could use that with featureCounts (of the Subread package) to get the genes with non-zero coverage. To find out more about the genes (such as GO terms), you could, for example, follow the descriptions in Chapter 7 of this Bioconductor workflow.
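As a rough sketch of the "genes with non-zero coverage" step, the snippet below reads a featureCounts output table and lists the genes whose summed count reaches a threshold. It assumes the standard layout (a `#` comment line, then a header with Geneid, Chr, Start, End, Strand, Length, followed by one count column per BAM file) and integer counts; the function name `expressed_genes` is made up for this example.

```python
import csv

def expressed_genes(lines, min_count=1):
    """Return (gene_id, total_count) pairs with total >= min_count, highest first.

    Hypothetical helper for illustration; assumes integer counts, i.e.
    featureCounts was not run with fractional counting.
    """
    data = (ln for ln in lines if not ln.startswith("#"))
    reader = csv.DictReader(data, delimiter="\t")
    # the first six columns are annotation; the rest are per-sample counts
    sample_cols = (reader.fieldnames or [])[6:]
    totals = {}
    for row in reader:
        total = sum(int(row[col]) for col in sample_cols)
        if total >= min_count:
            totals[row["Geneid"]] = total
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

Passing an open file handle, e.g. `expressed_genes(open("counts.txt"))` (file name is only an example), gives a ranked gene list that can go straight into a bar plot or into a GO-term lookup.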