How to handle RNASeq reads that can map to both human and mouse ref (e.g. conserved gene regions)?
1
0
Entering edit mode
5.4 years ago
DVA ▴ 630

Hello,

I have a few 3' RNA sequencing data sets that have mixed human and mouse samples. As I am trying to split these reads apart, during the alignment (transcript region only) step, I noticed that there are about 3-5% of reads that can be mapped to both references - with VERY similar map-able base pair counts.

One example shows a 66bp (in blat) exact match for both human (CIGAR string: 14S66M) and mouse (CIGAR string: 66M14S) :

(read A in human)

QB5019:183:GJGBGX7:1:111:246:99       16      17      1344539 60     14S66M   *       0       0       TTTTTTTTTTTTTTTTTGGCATTTTTAATTTAGGTTTGTTTTATTTAAGTTTAATGTTAATTCCATGCTGTGTTTCAGTA        EE/E6E/EAEA/EAEAE//EEEEEEAEEEEEE<EEEEEAE/EEEEEEEEEEEEEEEEEEE6EEEEEEEEEEEEEEAAAAA        AS:i:-14        XN:i:0  XM:i:0 XO:i:0   XG:i:0  NM:i:0  MD:Z:66 YT:Z:UU NH:i:1

(read A in mouse)

QB5019:183:GJGBGX7:1:111:246:99       0       11      75765513       60       66M14S  *       0       0       TACTGAAACACAGCATGGAATTAACATTAAACTTAAATAAAACAAACCTAAATTAAAAATGCCAAAAAAAAAAAAAAAAA        AAAAAEEEEEEEEEEEEEE6EEEEEEEEEEEEEEEEEEE/EAEEEEE<EEEEEEAEEEEEE//EAEAE/AEAE/E6E/EE        AS:i:-14        XN:i:0 XM:i:0   XO:i:0  XG:i:0  NM:i:0  MD:Z:66 YT:Z:UU NH:i:1

Another example shows an 80bp (in blat) exact match for human (CIGAR string: 80M), 76bp(in blat) for mouse (CIGAR string still showed 80M using default setting of HISAT2).

(read B in human)

QB501519:1283:GJGBGX7:1:111:2486:1800       0       10      11333759       60       80M     *       0       0       GTTTTATTGATGCATTTGGATTTTGTTGTTTGATGGAATTTGAGCCAAAAAAAAAATACGCAGGCTTTCCTATTTCTACA        AAAAA6EEEAEEEEEEEEEEEEEEEEEEEEEEEEEEA/EEEEEE//EEEEEEEEEA<EEE/EAE<AEEEEE<E<EAEAE/        AS:i:0  XN:i:0  XM:i:0 XO:i:0   XG:i:0  NM:i:0  MD:Z:80 YT:Z:UU NH:i:1

(read B in mouse)

QB501519:1283:GJGBGX7:1:111:2486:1800       16      2       6542373 60     80M      *       0       0       TGTAGAAATAGGAAAGCCTGCGTATTTTTTTTTTGGCTCAAATTCCATCAAACAACAAAATCCAAATGCATCAATAAAAC        /EAEAE<E<EEEEEA<EAE/EEE<AEEEEEEEEE//EEEEEE/AEEEEEEEEEEEEEEEEEEEEEEEEEEAEEE6AAAAA        AS:i:-14        XN:i:0  XM:i:3 XO:i:0   XG:i:0  NM:i:3  MD:Z:21A1T3G52  YT:Z:UU NH:i:1

For cases like read A, I cannot think of an easy way to distinguish them. As for cases like read B, I guess I could try a different aligner and/or change threshold and so the 76bps matches to give a lower MAPQ and/or different CIGAR string... but since it is 80 vs 76, I don't know how sensitive the software can be.

The reason I worry about it is because 1. it counts for 3-5% of the total reads, and 2. Blat showed that they fall into highly conserved gene regions. Ignoring them may as well mean ignoring all conservative genes. Therefore, I would also like to learn if I simply delete these reads from my downstream differential expression and pathway analysis, would it influence the final outcome badly?

Please let me know what you think. Any advice is appreciate it!

alignment hisat2 • 1.0k views
ADD COMMENT
1
Entering edit mode
5.4 years ago
JC 13k

Checking the coordinates for your first pair seems like the human hit is more reliable because it is at the end of a gene, meanwhile the mouse coordinates are in an intergenic region. So, you can filter if the hit clearly comes from a possible expressed region.

ADD COMMENT
0
Entering edit mode

Hi @JC, thanks a lot for the reply! This is an interesting point - for the human case, it falls at the end with only one of the isoforms it seems like. I guess to use this design, that would require a very thorough annotation.

ADD REPLY

Login before adding your answer.

Traffic: 2336 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6