Genomic Read Mapping Biased Towards Coding Regions?
13 months ago
Vitis ♦ 2.1k
New York

I'm trying to map genomic sequencing reads (Illumina HiSeq PE100) to a related reference genome. Coding-region divergence between the organism and the reference is about 1%, so I allowed 5 to 8 mismatches per 100 bp read, plus small indels, hoping this would accommodate the higher divergence expected outside the exons. But in the coverage plot, coding regions still get the most coverage. The bias is so severe that it looks like an mRNA-Seq experiment. There are regions outside the exons with relatively uniform coverage (so these should be true genomic reads), but they are much rarer than the coverage 'deserts' elsewhere. The overall coverage, estimated from k-mers, is about 5X, which may be part of the reason. Is there anything wrong with the way I approached the mapping?
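
One way to put a number on the bias seen in the coverage plot is to compare mean depth inside and outside annotated exons. The following is only a minimal sketch: it assumes the per-base depth for one contig has already been loaded into a list and that exon coordinates are available as 0-based half-open intervals. The function name `mean_depth_by_class` and the toy values are hypothetical, not taken from the actual data.

```python
from typing import Dict, List, Tuple


def mean_depth_by_class(
    depth: List[int], exons: List[Tuple[int, int]]
) -> Dict[str, float]:
    """Return the mean depth inside and outside exon intervals.

    depth: per-base read depth for one contig (index = 0-based position).
    exons: half-open [start, end) intervals, 0-based; overlaps are allowed.
    """
    # Mark every position that falls inside at least one exon.
    in_exon = [False] * len(depth)
    for start, end in exons:
        for pos in range(max(start, 0), min(end, len(depth))):
            in_exon[pos] = True

    exon_depths = [d for d, flag in zip(depth, in_exon) if flag]
    other_depths = [d for d, flag in zip(depth, in_exon) if not flag]

    def mean(values: List[int]) -> float:
        # An empty class gets 0.0 rather than raising.
        return sum(values) / len(values) if values else 0.0

    return {"exon": mean(exon_depths), "non_exon": mean(other_depths)}


# Toy contig (hypothetical values): one exon at positions 2..4.
depth = [0, 0, 6, 7, 5, 0, 0, 1, 0, 0]
exons = [(2, 5)]
summary = mean_depth_by_class(depth, exons)
print(summary)
```

A large ratio between the two means would confirm what the plot suggests, and the same split can be repeated per contig to see whether the 'deserts' are evenly spread.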

Maybe you could just say which organisms you are comparing and how distant they are.

Vitis, is this whole genome shotgun data, or some reduced representation library that you have sequenced?

These are whole genome shotgun sequences, so they shouldn't be biased with respect to genome composition.

3.2 years ago
Cambridge, UK

The problem is that you are mapping to a "related reference genome". Coding regions are much more conserved than intergenic regions or introns, so reads from exons map a lot better. I suspect only a relatively small fraction of your reads map.

You will need some sort of de novo assembly, or to use your related reference genome as a scaffold (but it will not be a one-day task...).

Indeed, this sounds reasonable. Imo the mapping 'bias' is due to conservation. If you had both genome sequences and made a conservation plot, then I would bet that the mapping correlates with the conservation. In a sense this result is not really surprising.
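
As a rough illustration of that bet, one could bin both genomes into windows and correlate per-window identity with per-window coverage. The sketch below assumes those two vectors have already been computed somehow. The `pearson` helper and the window values are made up for illustration and are not from any real alignment.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-window values: sequence identity between the two
# genomes and mean read depth in the same windows.
identity = [0.99, 0.98, 0.90, 0.88, 0.97, 0.85]
coverage = [5.2, 4.8, 0.6, 0.3, 4.1, 0.0]

r = pearson(identity, coverage)
print(f"Pearson r between identity and coverage: {r:.2f}")
```

A value of r close to 1 would support the idea that the coverage pattern simply tracks conservation.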

The idea was to capture sequences outside coding regions, because we already have coding sequences from mRNA-Seq. De novo assembly didn't work well because the overall coverage was relatively low. It makes sense that mapping correlates with conservation, but the point was that allowing more mismatches might relax that correlation, which didn't happen.

5-8% mismatches is very low. It could work for human vs. chimpanzee, but as soon as you go further it does not hold. I suspect that for some species, especially plants, 5-8% divergence could occur within the same species.
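
To get a feel for how quickly a fixed mismatch limit stops working, here is a back-of-the-envelope sketch. It assumes substitutions are independent and uniformly distributed along a read, and it ignores indels and sequencing errors, so it is an optimistic simplification rather than a model of any particular aligner. The function name and the divergence values are illustrative.

```python
from math import comb


def frac_reads_within_limit(
    read_len: int, max_mismatches: int, divergence: float
) -> float:
    """Binomial probability that a read has at most max_mismatches
    substitutions, given a per-base divergence.

    Assumes independent, uniformly distributed substitutions only.
    """
    if not 0.0 <= divergence <= 1.0:
        raise ValueError("divergence must be between 0 and 1")
    return sum(
        comb(read_len, k)
        * divergence**k
        * (1.0 - divergence) ** (read_len - k)
        for k in range(max_mismatches + 1)
    )


# 100 bp reads with up to 8 mismatches, at a few illustrative divergences.
for d in (0.01, 0.05, 0.12, 0.20):
    p = frac_reads_within_limit(100, 8, d)
    print(f"divergence {d:.0%}: {p:.3f} of reads within 8 mismatches")
```

Under these assumptions nearly all reads pass the filter at 1% divergence, while very few do at 20%, so regions that are much more diverged than the coding sequence would be expected to show up as coverage gaps.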

These two are within the same genus, but definitely further away than Human/Chimp. Looks like I underestimated the divergence in the non-coding regions.

I just got some Sanger sequencing results from (the ancient technology of) genome walking, which are very interesting: highly heterogeneous in terms of genomic divergence, ranging from no difference to 12% divergence. I have to think of a good way to accommodate this in mapping.
