What to do when the reads of a pair are almost the same, e.g. read 1 and read 2 overlap by 90-100%?
8.9 years ago
crivenster ▴ 50

I have HLA-based NGS data from a MiSeq. How should I handle paired-end (PE) data in which read 1 and read 2 of a pair overlap by more than 90%, or even contain exactly the same sequence? I am working on a pre-processing script to go with an existing pipeline.

pre-processing NGS next-gen hla • 4.3k views

Why do you want to do something with them? Can't you treat them as normal PE data?


But won't they affect the coverage calculation, since some regions will have more reads (because the two reads of each pair are the same) and other regions will have fewer (because the read pairs there have little or no overlap)?


No, it doesn't matter. The insert size should generally be independent of the genome, but regardless, the coverage is not really affected by the insert size (other than +-1).

The most important thing to do in this case is to adapter-trim reads, as inserts shorter than read length will have adapter sequence that will cause poor mapping.
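
To make the adapter point concrete, here is a minimal sketch of the read-pair geometry. The function name and the 250 bp read length used in the example are assumptions for illustration; the thread only states fragment lengths of 200-500 bp.

```python
def overlap_and_adapter(insert_len: int, read_len: int) -> dict:
    """Geometry of one read pair for a given insert (fragment) length.

    Hypothetical helper for illustration; both reads are assumed to have
    the same length and to be untrimmed.
    """
    if insert_len <= 0 or read_len <= 0:
        raise ValueError("lengths must be positive")
    # Bases of each read that actually come from the insert.
    insert_bases = min(read_len, insert_len)
    # Bases at the 3' end of each read that run past the insert into adapter.
    adapter_bases = max(0, read_len - insert_len)
    # Insert positions that are sequenced by both reads.
    overlap = max(0, 2 * insert_bases - insert_len)
    return {
        "insert_bases": insert_bases,
        "adapter_bases": adapter_bases,
        "overlap": overlap,
        "overlap_fraction": overlap / insert_bases,
    }


# Assumed 2 x 250 bp reads on a 200 bp insert: full overlap plus adapter.
print(overlap_and_adapter(200, 250))
```

With these assumed numbers, any insert shorter than 250 bp gives 100% overlap and leaves adapter sequence at the 3' end of both reads, which is why trimming comes first.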


If read pairs overlap, some tools might double-count coverage in the overlapping portion, which is incorrect because you are just sequencing the same fragment twice (I'm not sure if this is what crivenster meant, though).


I have no idea what I was thinking when I said +-1, you're correct, the difference can be a factor of 2. The point I was trying to make was that this will be evenly distributed everywhere so it shouldn't really affect a coverage analysis much. Once you have sufficient coverage at some location, the insert size will not matter very much.

BBMap has a "physcov" flag that will allow calculation of physical coverage, meaning that with large inserts the unsequenced bases in the middle will be counted, and with short inserts the double-covered bases will only be counted once, rather than twice. But I think if this analysis was done with physical coverage enabled versus disabled the conclusion would be the same.
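
The difference between the two ways of counting can be sketched as follows. This is only an illustration of the idea, not BBMap's implementation; the function name and the half-open interval convention are assumptions.

```python
from collections import Counter


def pair_coverage(read1, read2, physical=False):
    """Per-base coverage contributed by one read pair.

    read1 and read2 are half-open (start, end) reference intervals.
    Hypothetical helper for illustration only.
    """
    cov = Counter()
    if physical:
        # Physical coverage: count the whole fragment once, including any
        # unsequenced middle and without double-counting the overlap.
        start = min(read1[0], read2[0])
        end = max(read1[1], read2[1])
        for pos in range(start, end):
            cov[pos] += 1
    else:
        # Sequenced coverage: each read counts separately, so overlapping
        # bases are counted twice.
        for start, end in (read1, read2):
            for pos in range(start, end):
                cov[pos] += 1
    return cov


overlapping = ((0, 10), (5, 15))
print(pair_coverage(*overlapping)[7], pair_coverage(*overlapping, physical=True)[7])
```

For a short insert the overlapping bases get 2 with per-read counting and 1 with physical coverage; for a long insert the gap between the reads gets 0 and 1 respectively.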

8.9 years ago

Merge them; the merging program will take the nucleotide with the better quality at each position: http://thegenomefactory.blogspot.com/2012/11/tools-to-merge-overlapping-paired-end.html

Is this amplicon sequencing data (sequencing of PCR products)? I think this problem is less likely to occur with DNA fragmentation.
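
The "take the better-quality base" rule mentioned above can be sketched like this. It is a simplified illustration, not the algorithm of any particular merger: it assumes the overlap length is already known, the reads are adapter-trimmed, read 2 has already been reverse-complemented, and qualities are lists of Phred integers. The function name is hypothetical.

```python
def merge_overlap(seq1, qual1, seq2_rc, qual2_rc, overlap):
    """Merge read 1 with reverse-complemented read 2 over a known overlap.

    The last `overlap` bases of seq1 are assumed to align with the first
    `overlap` bases of seq2_rc. Ties in quality are resolved in favour of
    read 1 (an arbitrary choice for this sketch).
    """
    if not 0 <= overlap <= min(len(seq1), len(seq2_rc)):
        raise ValueError("overlap must fit inside both reads")
    split = len(seq1) - overlap
    # Part of read 1 before the overlap is copied unchanged.
    merged_seq = list(seq1[:split])
    merged_qual = list(qual1[:split])
    # In the overlap, keep whichever base has the higher quality.
    for i in range(overlap):
        if qual2_rc[i] > qual1[split + i]:
            merged_seq.append(seq2_rc[i])
            merged_qual.append(qual2_rc[i])
        else:
            merged_seq.append(seq1[split + i])
            merged_qual.append(qual1[split + i])
    # Part of read 2 after the overlap is copied unchanged.
    merged_seq.extend(seq2_rc[overlap:])
    merged_qual.extend(qual2_rc[overlap:])
    return "".join(merged_seq), merged_qual


seq, qual = merge_overlap("ACGTAC", [30, 30, 30, 30, 10, 30],
                          "TGCGG", [30, 35, 20, 30, 30], overlap=3)
print(seq, qual)
```

Real merging tools also have to find the overlap themselves and usually adjust the quality of the merged bases, which this sketch leaves out.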


There have been better tools developed for merging reads via overlap since 2012, but merging is not really necessary in this case.

8.9 years ago

After mapping PE reads, I usually soft clip the overlapping part of one of the two reads. There is a nice program for this: clipOverlap. I think it is better to clip after mapping than to merge the reads as Alvaro suggests. Also take care that if read pairs overlap by 100%, some aligners might not mark them as "mapped in proper pair", whereas I think they are.
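
A rough sketch of the clipping arithmetic, to show what "soft clip the overlapping part of one read" amounts to. This is not how clipOverlap is implemented or how it chooses which mate to clip; the function name is hypothetical, the alignments are assumed to be gapless, and the pair is assumed to be a standard inward-facing one where the forward mate neither starts nor ends after the reverse mate.

```python
def clip_forward_mate(fwd, rev):
    """Soft clip the 3' end of the forward mate where it overlaps its mate.

    fwd and rev are half-open (start, end) reference intervals of the
    forward-strand and reverse-strand mates. Returns the clipped forward
    interval and the number of soft-clipped bases.
    """
    if fwd[0] > rev[0] or fwd[1] > rev[1]:
        raise ValueError("expected forward mate to start and end first")
    # Number of reference positions covered by both mates.
    overlap = max(0, fwd[1] - rev[0])
    # Remove exactly the overlapping bases from the end of the forward mate.
    clipped = (fwd[0], fwd[1] - overlap)
    return clipped, overlap


print(clip_forward_mate((100, 350), (300, 550)))  # partial overlap
print(clip_forward_mate((100, 350), (100, 350)))  # 100% overlap
```

After clipping, every position of the fragment is covered by exactly one mate, so per-read coverage counting no longer doubles the overlap. With 100% overlap the forward mate is clipped away entirely.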


So, when reads overlap 100%, is it better to perform read merging, or soft clipping as you mentioned, before alignment? That way I could avoid aligning the same region twice and maybe get better mapping results. The HLA data I use has fragment lengths between 200 and 500 bp, as some of the sequenced regions are only 250 bp long. Overlap is therefore certain, and 100% overlap is common in the data generated here.


As I said, I prefer to soft clip after mapping rather than merging. I don't take into consideration how much overlap there is between pairs (100% or just 1 base); I just clip whatever is overlapping.
