Forum: Why not merging?

8

6.8 years ago

finswimmer 16k

Hello,

I have a more general question. I do targeted, paired-end DNA resequencing. Before alignment I merge overlapping read pairs, which improves my variant calls: I get far fewer artefacts and pseudogene variants. I'm happy with it.

But it seems that most people don't merge overlapping read pairs, and I couldn't find any good reason for that. In most posts I have read on the topic, I only found answers like "I wouldn't do that" or "I've never done that before". So why is this step not more widely used? I can see great benefits in merging: longer reads, error correction, increased base quality. But I cannot see any disadvantages.

Thanks for your opinion.

fin swimmer

fastq sequencing • 4.4k views

ADD COMMENT • link updated 11 months ago by Ram 43k • written 6.8 years ago by finswimmer 16k

score 5 · Answer 1 · 2017-07-14

Error correction and increasing base quality are usually taken care of in the variant calling step with paired-end reads (in samtools mpileup, that's the BAQ parameter). If you can guarantee that your PE reads overlap enough to be merged correctly, then you're not hurting anything by doing so. I'm quite skeptical that having a longer read is really beneficial to you. It's not like aligners ignore the fact that the reads are paired during alignment. You're certainly making the aligner's job easier, but with appropriate parameters the difference should be minor.

score 2 · Answer 2 · 2017-07-14

By pre-merging, you may actually be losing information. For example, if the two reads of a pair do not agree on a specific base, the merged fragment would most likely report the base of higher quality, possibly with the lower of the two quality scores. Even that lower score might still be high. When the merged fragment is fed to the variant caller, the caller would treat this base as high quality in its variant calculations. Conversely, when the variant caller is fed the pair, it would consider the base ambiguous, hence low quality, and the base would not be counted. This behavior applies to the "samtools-mpileup-bcftools-call" pipeline and may be the same for other pipelines.

score 2 · Answer 3 · 2017-07-14

I will play devil's advocate here.

1) You may wish to do quality score recalibration. Since recalibration is done per read position (sequencing cycle), a merged base combines a base from, say, cycle 20 with one from, say, cycle 100. Merged reads lose that information from the original reads.
2) You cannot produce reliable quality scores for the overlapping portions. leeHom solves this by using the independence principle, which may or may not be 100% accurate.
3) For large DNA fragments, you might incur more incorrect merges if the overlap is relatively short.
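To make the quality-score point concrete, here is a minimal Python sketch of how two overlapping base calls could be combined if their errors are treated as independent. It is not leeHom's (or any merger's) actual implementation: the function names, the uniform prior over A/C/G/T, the assumption that an error is equally likely to produce any of the three wrong bases, and the cap of 60 are all assumptions made for illustration.

```python
import math


def phred_to_error(q: int) -> float:
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def error_to_phred(p: float, cap: int = 60) -> int:
    """Convert an error probability to a Phred score, capped (cap is an assumption)."""
    if p <= 0:
        return cap
    return min(cap, round(-10 * math.log10(p)))


def merge_overlap_base(b1: str, q1: int, b2: str, q2: int) -> tuple[str, int]:
    """Combine two observations of one base, assuming independent errors.

    Assumptions (hypothetical, for illustration): uniform prior over the
    four bases, and a miscall yields each wrong base with probability e/3.
    """
    e1, e2 = phred_to_error(q1), phred_to_error(q2)
    likelihood = {}
    for true_base in "ACGT":
        p1 = (1 - e1) if b1 == true_base else e1 / 3
        p2 = (1 - e2) if b2 == true_base else e2 / 3
        likelihood[true_base] = p1 * p2
    total = sum(likelihood.values())
    best = max(likelihood, key=likelihood.get)
    posterior_error = 1 - likelihood[best] / total
    return best, error_to_phred(posterior_error)


if __name__ == "__main__":
    print(merge_overlap_base("A", 30, "A", 30))  # agreeing calls
    print(merge_overlap_base("A", 30, "C", 20))  # disagreeing calls
```

Under these assumptions, agreeing calls come out with a higher score than either input and disagreeing calls with a lower one. If the two reads share error sources (for example damage on the original molecule), the independence assumption fails and the merged scores are overconfident.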

To add to jomo018: the 2 reads of a pair are not 2 independent observations of the same locus but rather 2 observations of the same haploid chromosome. At a heterozygous site, 2 independent reads (2X coverage) will show different alleles with probability 1/2. This is impossible to observe with 2 observations of the same haploid chromosome: the bases will always agree in the absence of sequencing errors.
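The 1/2 figure can be checked by enumerating the possible outcomes. This is a small Python sketch under simplifying assumptions that are mine, not the poster's: a diploid heterozygous site, both alleles sampled with equal probability, and no sequencing errors. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import product


def prob_discordant(independent: bool) -> Fraction:
    """Probability that two observations at a heterozygous site disagree.

    independent=True  -> two reads from independently sampled molecules
    independent=False -> two mates sequenced from the same molecule
    Assumes equal allele sampling and no sequencing errors.
    """
    alleles = ("ref", "alt")
    if independent:
        # Each read picks its allele independently: 4 equally likely outcomes.
        outcomes = list(product(alleles, repeat=2))
    else:
        # Both mates come from one haploid molecule: they always match.
        outcomes = [(a, a) for a in alleles]
    discordant = sum(1 for a, b in outcomes if a != b)
    return Fraction(discordant, len(outcomes))


if __name__ == "__main__":
    print(prob_discordant(independent=True))
    print(prob_discordant(independent=False))
```

So a disagreement inside the overlap of one pair points to a sequencing error rather than to heterozygosity, which is why the overlap counts as one observation, not two.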

score 2 · Answer 4 · 2017-07-14

In my opinion, read merging is a good idea. Assuming all reads are merged correctly, it greatly increases the quality of the reads and of the mapping. It also increases your ability to call long indels, since the length of indels you can call is proportional to read length.

The disadvantages are:

1) Not all read mergers are equal. Using a bad read merger can make your results worse.

2) It's kind of a pain to deal with a merged reads file and an unmerged reads file.

That said, I think the optimal approach to variant-calling is always to merge reads, map the merged and unmerged reads independently, and combine the SAM files.

@Gabriel, who wishes to play the Devil's advocate:

1) It's pretty easy to recalibrate the quality scores prior to merging. This increases the successful merging rate, as well. Specifically:

bbmap.sh in=reads.fq out=mapped.sam ref=ref.fa
calctruequality.sh in=mapped.sam callvars
bbduk.sh in=reads.fq out=recal.fq recalibrate
bbmerge.sh in=recal.fq out=merged.fq outu=unmerged.fq

2) Quality score output does not matter much in the overlapping portion. Raw quality scores are inaccurate anyway. Your best effort at assigning a quality score to a base will generally be better than whatever Illumina assigns it since they don't give best effort. As long as matching bases get a higher quality score than raw, and mismatching bases get a lower quality score than raw, then assuming the overlap is correct, you're improving the quality scores. It's obviously best to generate perfect quality scores, but that's not really possible given the input data.

3) True, but as long as you use a good merging program, not an issue.

The main disadvantage I see is that some variant callers like to give more weight to variants seen in properly paired reads. But, as long as the variant caller is crafted to give similar weight to a 250bp read compared to a 2x150bp pair with 250bp insert size, that's not an issue either.

Edit - incidentally, as to the issues with variant calling using both merged and unmerged reads (both the fact that some variant callers prioritize variants indicated by proper pairs, and the fact that it's a pain to map both a merged and unmerged file, then combine them), BBMerge offers "ecco" mode (meaning "error correction by overlap"). This merges the reads, error-corrects the overlapping portion by consensus, and outputs the reads as pairs, just with the overlapping portion error-corrected. It's quite convenient. I still recommend actually merging reads to produce longer reads for optimal indel-calling, but I use ecco mode when, for whatever reason, paired reads are most useful.
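For readers who want to see the idea behind error correction by overlap, the following Python sketch corrects the overlapping portion of a pair by consensus and returns both reads still as a pair. It illustrates the concept only and is not BBMerge's algorithm: the overlap length is passed in rather than detected, the consensus rule (sum the qualities when bases agree, keep the higher-quality base with the quality difference when they disagree) is one simple heuristic, and the function names and the cap of 50 are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq: str) -> str:
    """Reverse-complement a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def correct_pair_by_overlap(r1, q1, r2, q2, overlap, cap=50):
    """Consensus-correct the overlap of a read pair and keep it as a pair.

    r1, r2: reads as sequenced (r2 comes from the opposite strand).
    q1, q2: lists of integer Phred scores, one per base.
    overlap: number of overlapping bases (assumed known here; a real
             merger has to detect it). Assumes no adapter read-through.
    """
    if not 0 <= overlap <= min(len(r1), len(r2)):
        raise ValueError("overlap must fit inside both reads")
    # Bring read 2 into the orientation of read 1.
    r2_fwd, q2_fwd = list(revcomp(r2)), list(q2[::-1])
    r1_new, q1_new = list(r1), list(q1)
    start = len(r1) - overlap  # first overlapping position in read 1
    for i in range(overlap):
        a, qa = r1_new[start + i], q1_new[start + i]
        b, qb = r2_fwd[i], q2_fwd[i]
        if a == b:
            base, qual = a, min(qa + qb, cap)  # agreement raises quality
        else:
            base = a if qa >= qb else b        # keep the better-supported base
            qual = abs(qa - qb)                # disagreement lowers quality
        r1_new[start + i] = r2_fwd[i] = base
        q1_new[start + i] = q2_fwd[i] = qual
    # Return read 2 in its original orientation.
    return "".join(r1_new), q1_new, revcomp("".join(r2_fwd)), q2_fwd[::-1]


if __name__ == "__main__":
    # 10 bp fragment ACGTACGTAC, 6 bp reads, 2 bp overlap, error at end of read 1.
    print(correct_pair_by_overlap("ACGTAT", [30, 30, 30, 30, 30, 10],
                                  "GTACGT", [30] * 6, overlap=2))
```

Both reads keep their original lengths and orientation, so downstream tools that expect proper pairs still see a pair.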

score 0 · Answer 5 · 2017-07-16

0

6.8 years ago

finswimmer 16k

Hello all and thank you for your answers.

But I still see no great disadvantages if I merge overlapping reads :)

Ok, there is a risk that the merge is incorrect. But as Brian stated, it depends on the merger I use and the parameters I choose. So I can lower this risk to a point where the advantages outweigh the risk that some reads are merged incorrectly.

"By pre-merging, you may actually be losing information."

Yes, indeed I lose information such as the position of a variant within a read. But is this information really important? I know that some variant callers check whether a variant appears only at the ends of reads; if so, it is more likely to be an artefact. But if I merge my reads beforehand, I most likely remove this artefact before variant calling is done.

fin swimmer

ADD COMMENT • link 6.8 years ago by finswimmer 16k

4

Well, the consensus here seems to be that pre-merging is good. So I'll just add my two cents on the claim that pre-merging may inherently lose information, irrespective of aligners and mergers. A merger has two pieces of information available to it: the two sequenced reads. When the overlap is short, or the base quality is less than perfect, or the sequences lack complexity, the merger needs to choose between alternatives using some scoring algorithm. The outcome may or may not be correct. A paired-end aligner has three pieces of information: the two sequenced reads AND the genomic landscape. The genomic sequence is actually the most trusted piece of information. For example, assume a low-complexity overlap region with high complexity around it. The merger is bound to get it wrong. The paired-end aligner will probably do a better job.

ADD REPLY • link 6.8 years ago by jomo018 ▴ 720

1

Hello jomo018,

of course you're right that in repetitive regions there is a chance of incorrect merging. But bbmerge, for example, doesn't merge reads if it finds a repeated region in the overlap. Also, choosing a large enough minimum overlap reduces the risk of a false merge, but still gives me the advantages of merging for all other reads.
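A rough back-of-the-envelope sketch in Python of why a larger minimum overlap helps. It assumes random sequence with equal base frequencies and counts only perfect matches, which is exactly the assumption that repeats and low-complexity regions violate. The function names are hypothetical and this is not how any particular merger scores overlaps.

```python
def chance_match_probability(overlap: int) -> float:
    """Probability that `overlap` bases match purely by chance,
    assuming independent bases with equal A/C/G/T frequencies."""
    return 0.25 ** overlap


def expected_spurious_overlaps(min_overlap: int, read_length: int) -> float:
    """Expected number of chance perfect overlaps when every overlap
    length from min_overlap up to read_length is tried once."""
    return sum(chance_match_probability(k)
               for k in range(min_overlap, read_length + 1))


if __name__ == "__main__":
    for min_overlap in (5, 10, 20):
        print(min_overlap, expected_spurious_overlaps(min_overlap, 150))
```

Under this model the chance of a spurious overlap shrinks by a factor of 4 with every extra base required. In a repeat the real probability is far higher, which is where the point about the genomic landscape comes in.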

fin swimmer

ADD REPLY • link 6.8 years ago by finswimmer 16k