low unique mapping ratio of RNA-seq data
1
2
Entering edit mode
5.1 years ago
younglin113 ▴ 60

I've been confused of my recent RNA-seq analysis result. Here is the situation: the RNA-seq results were obtained by Illumina pair-end 150bp sequencing, after trimming the adapters and low-quality bases, the clean reads were mapped to GRch38 genome, but the results showed that the unique mapping ratio is pretty low. So following the advises of the forum, I mapped the clean reads to rRNA sequences. However, the situation is totally reverse, the sample with higher rRNA proportion got higher unique mapping ratio instead. Can anyone gets any ideas of this or is there any other way I can try to find what are these multiple mapping reads? Thank in advance. Here shows the mapping result: enter image description here

rna-seq next-gen • 4.9k views
ADD COMMENT
2
Entering edit mode

How are you quantifying rRNA fraction? The process is not necessarily straightforward. These previous discussions may be helpful:

ADD REPLY
0
Entering edit mode

Thanks a lot for your help. I did define the fraction of rRNA by mapping all clean reads to rRNA sequences.

ADD REPLY
1
Entering edit mode

Was this ribo-depleted RNA-seq? Have you looked at the results in IGV? Depending on the experiment you may just have a bunch of repeat expression.

ADD REPLY
0
Entering edit mode

Use these directions to properly post images: How to add images to a Biostars post

Can you tell us what program you used for "cleaning", alignments and for estimating the unique mapping?

ADD REPLY
0
Entering edit mode

Of course, I used the Trimmomatic for cleaning, and Hisat2 for alignment, and then use command samtools view -q 20 to filter the unique mapped reads.

ADD REPLY
0
Entering edit mode

enter image description here

ADD REPLY
0
Entering edit mode

Do you only have 4 samples?

Can you post HiSat2 parameters?

ADD REPLY
0
Entering edit mode

yes, I only have 4 samples. And the hiast2 command is : hisat2 -p 16 -x $hisat_index -1 $R1_fq -2 $R2_fq -S $out_sam

ADD REPLY
0
Entering edit mode

Did you run FastQC and what were the results?

ADD REPLY
0
Entering edit mode

The FastQC result seemed fine, isn't it?

ADD REPLY
0
Entering edit mode

enter image description here

ADD REPLY
0
Entering edit mode

What are the overrepresented sequences that FastQC finds? And how do you determine the non-unique reads?

ADD REPLY
0
Entering edit mode

The overrepresented sequences are :enter image description here

And I filtered the unique mapping reads use command: samtools view -q 20 , the reads with mapping quality lower than 20 were determined as multiple mapping reads.

ADD REPLY
0
Entering edit mode

The reason I asked about how you determined uniquely mapping reads is that the MAPQ value (which you are filtering on if you use samtools view -q 20) is usually not a good indicator of whether a reads is aligning to multiple places in the genome or not. What a MAPQ value of 20 actually means totally depends on the alignment software as this great post shows in great detail. Particularly for HiSat, the MAPQ does not seem to be the best way to identify uniquely aligning reads, based on this post I would guess that filtering for NH:i:1 might be your best bet.

Or use STAR instead which has, IMO, a more transparent handling of multi-mappers.

ADD REPLY
1
Entering edit mode
5.1 years ago
bruce.moran ▴ 960

and then use command samtools view -q 20 to filter the unique mapped reads

From samtools manual

-q INT Skip alignments with MAPQ smaller than INT [0].

So your table shows reads with q > 20.

'Uniquely mapping' is not that helpful a term. Some reads align to more than one position, generally these are called 'multi-mappers'. You can use the SAM flags to find out what reads are the primary alignments, e.g. best scoring alignment if multi-mappers exist for that read. One way is to use samtools view -F 256, this is actually specified in the Hisat2 manual in the 'Reporting Options' section, and it returns reads with flag 256 not set, therefore all secondary alignments are excluded, and you have your primary alignments only.

As to having only 4 samples, that's probably not a good experimental design given what most people use RNAseq for.

ADD COMMENT
0
Entering edit mode

Actually, many papers show that they will use unique mapping reads for downstream analysis. And once I saw a writer said we could use the threshold of Q>20 to filter the unique mapping reads. That how my results came.

ADD REPLY
0
Entering edit mode

OK, can you provide the references to these papers? Would be interested to read them.

Did you try the other samtools filter? Again would be interested to see how many reads are primary alignments vs. the q > 20 'unique' reads.

ADD REPLY
0
Entering edit mode

I tried the samtools view -F 256 command, and the number of rows of the output file equals the clean reads number. So this command actually only output the multiple mapping reads for once, rather than filter out those multiple mapping reads, doesn't it? And here is the exact number of a sample: clean reads: 37996657 *2
"samtools view -F 256" results: 81968964 the q > 20 'unique' reads : 21473543 there exist a big difference.

ADD REPLY
0
Entering edit mode

The samtools view -F 256 command should output primary alignments. The HiSat2 manual states:

Each reported read or pair alignment beyond the first has the SAM 'secondary' bit (which equals 256) set in its FLAGS field.

You have paired data so it is 81,968,964/2 = 40,984,482 'primary' alignmennts. But you specify you have 37,996,657 paired 'clean' reads? And alignments with MAPQ > 20 (N.B. that this is not 'unique', but still interested to see those references that say this) = 21,473,543 .

At this stage I would recommend using another aligner (STAR) which I have never had this kind of issue with. It seems that HiSat2 is outputting multiple primary alignments per pair in some cases, which is confusing.

ADD REPLY

Login before adding your answer.

Traffic: 2648 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6