Samtools: How can I extract properly-paired QC-passed reads instead of extracting only properly-paired?
2
2
Entering edit mode
6.9 years ago
bioinfo8 ▴ 230

Hi,

Here is the flagstat output of my bam file:

37750740 + 352032 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
1965916 + 0 duplicates
20118207 + 184214 mapped (53.29% : 52.33%)
37750740 + 352032 paired in sequencing
18875370 + 176016 read1
18875370 + 176016 read2
15175066 + 138352 properly paired (40.20% : 39.30%)            
16901306 + 153876 with itself and mate mapped
3216901 + 30338 singletons (8.52% : 8.62%)
1538902 + 13814 with mate mapped to a different chr
561025 + 5046 with mate mapped to a different chr (mapQ>=5)

extracted properly-paired reads with samtools view -bf 0x2 1.bam > 1pp.bam and again cross-checked with flagstat:

15175098 + 138352 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
930902 + 0 duplicates
15175066 + 138352 mapped (100.00% : 100.00%)
15175098 + 138352 paired in sequencing
7587550 + 69176 read1
7587548 + 69176 read2
15175066 + 138352 properly paired (100.00% : 100.00%)           
15175066 + 138352 with itself and mate mapped
0 + 0 singletons (0.00% : 0.00%)
2 + 0 with mate mapped to a different chr
2 + 0 with mate mapped to a different chr (mapQ>=5)

How can I extract properly-paired QC-passed reads instead of extracting only properly-paired?

Kindly guide.

Thanks

samtools bam ngs sequencing paired-end • 10k views
ADD COMMENT
0
Entering edit mode

Output with: samtools view -b -f 2 -F 524 1.bam > 1pp.bam seems more satisfactory than samtools view -f 2 -F 512 -b -o 1pp.bam 1.bam What both of you think?

How can I integrate removal of duplicate properly-paired reads and hence finally get properly-paired uniquely mapped QC-passed reads?

ADD REPLY
0
Entering edit mode

How do you want to define "uniquely mapped"?

ADD REPLY
0
Entering edit mode

I mean to avoid duplicate properly paired reads (for e.g. in the output) or would it be useful to keep them for further analysis?

ADD REPLY
1
Entering edit mode

Then use -F 780 instead of -F 524.

ADD REPLY
0
Entering edit mode

Output (duplicates are still showing up):

15175066 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
930902 + 0 duplicates
15175066 + 0 mapped (100.00% : N/A)
15175066 + 0 paired in sequencing
7587533 + 0 read1
7587533 + 0 read2
15175066 + 0 properly paired (100.00% : N/A)
15175066 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
2 + 0 with mate mapped to a different chr
2 + 0 with mate mapped to a different chr (mapQ>=5)
ADD REPLY
1
Entering edit mode

What duplicates? You didn't have any to begin with.

ADD REPLY
0
Entering edit mode

There is same output from -F 780 and -F 524.

What about 930902 + 0 duplicates?

ADD REPLY
1
Entering edit mode

Ah, I missed that line. You can exclude them with -F 1804. Make sure to not exclude those unless it really makes sense.

ADD REPLY
0
Entering edit mode

Would you please elaborate more?

ADD REPLY
1
Entering edit mode

780 + 1024 (PCR duplicates) is 1804.

ADD REPLY
0
Entering edit mode

Is it equivalent to rmdup?

So, I can go as follows:

samtools view -b -f 2 -F 1804 1.bam > 1ppqc_nodup.bam                               
samtools sort 1ppqc_nodup.bam > 1ppqc_nodup.sorted.bam                     
samtools index 1ppqc_nodup.sorted.bam

Do I need to do fixmate before sort ? Is there any step which I am missing?

ADD REPLY
1
Entering edit mode

I think rmdup tries to determine what's a duplicate as well (it's not recommended to use rmdup, especially for PE data).

You don't need to run fixmate, both mates (or neither) will get marked as duplicates.

ADD REPLY
6
Entering edit mode
6.9 years ago

This should do it:

samtools view -b -f 2 -F 524 1.bam > 1pp.bam

You appear to have unmapped entries with assigned chromosomes. Presumably you used bwa, since it's the only tool I know of that produces things like that.

ADD COMMENT
0
Entering edit mode

Thanks. Yes, bwa.

Output:

15175066 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
930902 + 0 duplicates
15175066 + 0 mapped (100.00% : N/A)
15175066 + 0 paired in sequencing
7587533 + 0 read1
7587533 + 0 read2
15175066 + 0 properly paired (100.00% : N/A)
15175066 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
2 + 0 with mate mapped to a different chr
2 + 0 with mate mapped to a different chr (mapQ>=5)
ADD REPLY
1
Entering edit mode

I'm not sure what's going on with the 2 alignments that are marked as properly paired but on different chromosomes. That's obviously incorrect. Can you post what samtools view 1pp.bam | awk '{if($7 != "=") print}'?

ADD REPLY
0
Entering edit mode
HS24_16924:3:2205:8403:6068#38  99     org9_NA954       2524    29      75M     org9_NA955       56      0       AGACAGAAGAGCTGAAGACCCACCGACAGAAGACCACCCAGCAGAGGCACACGCTCCACTGCACACACACACAGA        CCCFFFFFHFHGGIJJIJJJJJIJJGIJJIJJIJJJJJJJIIIGIJIGIJJJHHHFFFEEEEEEDDDDDDDDDD?        X0:i:1  X1:i:0  BC:Z:NCTCACGN   XG:i:0  AM:i:29 SM:i:29 XM:i:1  XO:i:0     QT:Z:!0;@???!   XT:A:U  MD:Z:65G9       NM:i:1  RG:Z:1#38
 HS24_16924:3:2205:8403:6068#38  147    org9_NA955       56      29      27S44M4S        org9_NA954       2524    0 GACATCAGAGATATGCTAACAACACACTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTTACCCAG      ,5((((DCB:@>;5((.(;(;6)7)7;HGJIIIIIJJJJIJJJJJJJIIJJJJJIJJJJIJJHGHHHFEDFFCCC        XG:i:0  AM:i:29 SM:i:29 XM:i:0  XO:i:0  XT:A:M  MD:Z:44 NM:i:0  RG:Z:1#38
ADD REPLY
1
Entering edit mode

Weird, that's a bwa bug I guess.

ADD REPLY
0
Entering edit mode

Is there a way to avoid it or I can process bam file for further analysis (coverage calculations etc.)?

ADD REPLY
1
Entering edit mode

Two alignments isn't enough to bother worrying about.

ADD REPLY
2
Entering edit mode
6.9 years ago

How can I extract properly-paired QC-passed reads instead of extracting only properly-paired?

include properly mapped -f2

exclude failing QC -F 512

samtools view  -f 2 -F 512 -b -o 1pp.bam 1.bam
ADD COMMENT
0
Entering edit mode

Thanks. Output:

15175098 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
930902 + 0 duplicates
15175066 + 0 mapped (100.00% : N/A)
15175098 + 0 paired in sequencing
7587550 + 0 read1
7587548 + 0 read2
15175066 + 0 properly paired (100.00% : N/A)
15175066 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
2 + 0 with mate mapped to a different chr
2 + 0 with mate mapped to a different chr (mapQ>=5)
ADD REPLY

Login before adding your answer.

Traffic: 2906 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6