Question

mapping to incomplete reference

0

3.6 years ago

issia • 0

Hi everyone!

I have a lot of (short) RNA-seq reads from a metagenome sample. I am only interested in the reads originating from one species in the microbial community, so I would like to filter my FASTQ files accordingly.

I have a reference genome of the species, so my naive approach would be to map all reads to this reference (let's say with bowtie2) and keep all reads that have mapped for further analysis.

However, this approach seems to introduce a bias that is much larger than I expected. I mapped my data twice: first to a bunch of reference genomes, and then only to the single reference that I am interested in. In the second case, the number of reads mapping to my genome of interest is much higher. Can you explain to me why this happens? Am I now also collecting reads from all related species?

Can I overcome the problem by adjusting parameters and making the mapping more strict? Or is the whole approach of mapping to only a single reference just not reasonable?
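To make "more strict" concrete, here is a minimal sketch of a post-mapping filter that keeps an alignment only if its MAPQ and its identity to the reference are both high. It assumes tab-separated SAM records carrying an `NM:i:` edit-distance tag (bowtie2 writes one). The function names and the thresholds (`min_mapq=20`, `min_identity=0.97`) are hypothetical choices for illustration, not recommended values. A filter like this can reduce cross-species hits, but it cannot remove reads from genes that are near-identical between species.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def alignment_identity(cigar, nm):
    """Identity over aligned columns: 1 - NM / (M/=/X + I + D lengths)."""
    columns = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MIDX=")
    return 1.0 - nm / columns if columns else 0.0

def passes_strict_filter(sam_line, min_mapq=20, min_identity=0.97):
    """Return True if a SAM record is mapped, confident, and near-identical.

    Thresholds are illustrative assumptions; tune them for your data.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag, mapq, cigar = int(fields[1]), int(fields[4]), fields[5]
    if flag & 0x4 or cigar == "*":          # unmapped read
        return False
    # Optional tags start at column 12; look for the edit distance.
    nm = next((int(f[5:]) for f in fields[11:] if f.startswith("NM:i:")), None)
    if nm is None:                          # cannot judge identity without NM
        return False
    return mapq >= min_mapq and alignment_identity(cigar, nm) >= min_identity
```

Records that pass could then be written out and converted back to FASTQ for the downstream analysis.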

alignment mapping bowtie2 metagenomics • 618 views

ADD COMMENT • link updated 3.6 years ago by GenoMax 141k • written 3.6 years ago by issia • 0

score 1 · Answer 1 · 2020-09-24

1

3.6 years ago

GenoMax 141k

I mapped my data twice: first to a bunch of reference genomes, and then only to the single reference that I am interested in. In the second case, the number of reads mapping to my genome of interest is much higher. Can you explain to me why this happens?

Aligners are generally designed to be greedy, i.e. they will do their best to align a read. When you use a reduced representation of the reference (e.g. your single reference), the aligner may shoehorn a read in even though it did not originate from that species. Then there is the genuine issue of multi-mapping, where reads from similar genes in multiple genomes will all map to that one reference. So yes, you could indeed be collecting reads from other species when you use a single reference.
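One way to picture the alternative is competitive mapping: align against a combined reference (the genome of interest plus the other genomes) and keep only reads whose primary alignment lands on the target with a reasonable MAPQ. The sketch below assumes plain SAM text as input. The function name, the contig names in the usage note, and the `min_mapq=10` cutoff are hypothetical. Reads with equally good placements in several genomes typically receive a very low MAPQ, so they fall out of the result.

```python
def reads_assigned_to_target(sam_lines, target_contigs, min_mapq=10):
    """Collect names of reads whose primary alignment is on a target contig.

    sam_lines:      iterable of SAM text lines (header lines are skipped)
    target_contigs: set of reference sequence names belonging to the
                    species of interest (an assumption you must supply)
    min_mapq:       illustrative cutoff; ambiguous placements score low
    """
    keep = set()
    for line in sam_lines:
        if line.startswith("@"):            # SAM header
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        contig, mapq = fields[2], int(fields[4])
        # Skip unmapped (0x4), secondary (0x100), supplementary (0x800).
        if flag & (0x4 | 0x100 | 0x800):
            continue
        if contig in target_contigs and mapq >= min_mapq:
            keep.add(name)
    return keep
```

Called as `reads_assigned_to_target(open("combined.sam"), {"target_chr"})`, with a hypothetical file and contig name, it returns the read names to retain. Reads that align better to another genome, or equally well to several, are left out.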

Or is the whole approach of mapping to only a single reference just not reasonable?

It depends. Ultimately, what you want to get out of this analysis decides the answer.

ADD COMMENT • link 3.6 years ago by GenoMax 141k

0

Thank you! I would like to determine differential gene expression for this one species, so a high level of "contamination" with other reads is a huge problem. Can you recommend an aligner that might be a better fit for what I'm trying to do?

ADD REPLY • link 3.6 years ago by issia • 0

1

This issue is not fixable by using a different aligner. If you had UMIs on your reads, that would be one way to establish their origin for sure.

ADD REPLY • link 3.6 years ago by GenoMax 141k

0

The samples are from microbial communities, so there is no opportunity to put different tags on the individual species' DNA. Is the right course of action to assemble everything in the sample, then map the reads, then extract those that mapped to contigs resembling my reference?

ADD REPLY • link 3.6 years ago by issia • 0