Question

Samtools: How can I extract properly-paired QC-passed reads instead of extracting only properly-paired?

2

Entering edit mode

6.9 years ago

bioinfo8 ▴ 230

Hi,

Here is the flagstat output of my bam file:

37750740 + 352032 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
1965916 + 0 duplicates
20118207 + 184214 mapped (53.29% : 52.33%)
37750740 + 352032 paired in sequencing
18875370 + 176016 read1
18875370 + 176016 read2
15175066 + 138352 properly paired (40.20% : 39.30%)            
16901306 + 153876 with itself and mate mapped
3216901 + 30338 singletons (8.52% : 8.62%)
1538902 + 13814 with mate mapped to a different chr
561025 + 5046 with mate mapped to a different chr (mapQ>=5)

extracted properly-paired reads with samtools view -bf 0x2 1.bam > 1pp.bam and again cross-checked with flagstat:

15175098 + 138352 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
930902 + 0 duplicates
15175066 + 138352 mapped (100.00% : 100.00%)
15175098 + 138352 paired in sequencing
7587550 + 69176 read1
7587548 + 69176 read2
15175066 + 138352 properly paired (100.00% : 100.00%)           
15175066 + 138352 with itself and mate mapped
0 + 0 singletons (0.00% : 0.00%)
2 + 0 with mate mapped to a different chr
2 + 0 with mate mapped to a different chr (mapQ>=5)

How can I extract properly-paired QC-passed reads instead of extracting only properly-paired?

Kindly guide.

Thanks

samtools bam ngs sequencing paired-end • 10k views

ADD COMMENT • link 6.9 years ago by bioinfo8 ▴ 230

0

Entering edit mode

Output with: samtools view -b -f 2 -F 524 1.bam > 1pp.bam seems more satisfactory than samtools view -f 2 -F 512 -b -o 1pp.bam 1.bam What both of you think?

How can I integrate removal of duplicate properly-paired reads and hence finally get properly-paired uniquely mapped QC-passed reads?

ADD REPLY • link 6.9 years ago by bioinfo8 ▴ 230

0

Entering edit mode

How do you want to define "uniquely mapped"?

ADD REPLY • link 6.9 years ago by Devon Ryan 104k

0

Entering edit mode

I mean to avoid duplicate properly paired reads (for e.g. in the output) or would it be useful to keep them for further analysis?

ADD REPLY • link 6.9 years ago by bioinfo8 ▴ 230

1

Entering edit mode

Then use -F 780 instead of -F 524.

ADD REPLY • link 6.9 years ago by Devon Ryan 104k

0

Entering edit mode

Output (duplicates are still showing up):

15175066 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
930902 + 0 duplicates
15175066 + 0 mapped (100.00% : N/A)
15175066 + 0 paired in sequencing
7587533 + 0 read1
7587533 + 0 read2
15175066 + 0 properly paired (100.00% : N/A)
15175066 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
2 + 0 with mate mapped to a different chr
2 + 0 with mate mapped to a different chr (mapQ>=5)

ADD REPLY • link 6.9 years ago by bioinfo8 ▴ 230

1

Entering edit mode

What duplicates? You didn't have any to begin with.

ADD REPLY • link 6.9 years ago by Devon Ryan 104k

0

Entering edit mode

There is same output from -F 780 and -F 524.

What about 930902 + 0 duplicates?

ADD REPLY • link 6.9 years ago by bioinfo8 ▴ 230

1

Entering edit mode

Ah, I missed that line. You can exclude them with -F 1804. Make sure to not exclude those unless it really makes sense.

ADD REPLY • link 6.9 years ago by Devon Ryan 104k

0

Entering edit mode

Would you please elaborate more?

ADD REPLY • link 6.9 years ago by bioinfo8 ▴ 230

1

Entering edit mode

780 + 1024 (PCR duplicates) is 1804.

ADD REPLY • link 6.9 years ago by Devon Ryan 104k

0

Entering edit mode

Is it equivalent to rmdup?

So, I can go as follows:

samtools view -b -f 2 -F 1804 1.bam > 1ppqc_nodup.bam                               
samtools sort 1ppqc_nodup.bam > 1ppqc_nodup.sorted.bam                     
samtools index 1ppqc_nodup.sorted.bam

Do I need to do fixmate before sort ? Is there any step which I am missing?

ADD REPLY • link 6.9 years ago by bioinfo8 ▴ 230

1

Entering edit mode

I think rmdup tries to determine what's a duplicate as well (it's not recommended to use rmdup, especially for PE data).

You don't need to run fixmate, both mates (or neither) will get marked as duplicates.

ADD REPLY • link 6.9 years ago by Devon Ryan 104k

2

Entering edit mode

6.9 years ago

Pierre Lindenbaum 161k

How can I extract properly-paired QC-passed reads instead of extracting only properly-paired?

include properly mapped -f2

exclude failing QC -F 512

samtools view  -f 2 -F 512 -b -o 1pp.bam 1.bam

ADD COMMENT • link 6.9 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thanks. Output:

15175098 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
930902 + 0 duplicates
15175066 + 0 mapped (100.00% : N/A)
15175098 + 0 paired in sequencing
7587550 + 0 read1
7587548 + 0 read2
15175066 + 0 properly paired (100.00% : N/A)
15175066 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
2 + 0 with mate mapped to a different chr
2 + 0 with mate mapped to a different chr (mapQ>=5)

ADD REPLY • link 6.9 years ago by bioinfo8 ▴ 230

score 6 · Accepted Answer · 2017-06-09

6

Entering edit mode

6.9 years ago

Devon Ryan 104k

This should do it:

samtools view -b -f 2 -F 524 1.bam > 1pp.bam

You appear to have unmapped entries with assigned chromosomes. Presumably you used bwa, since it's the only tool I know of that produces things like that.

ADD COMMENT • link 6.9 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks. Yes, bwa.

Output:

15175066 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
930902 + 0 duplicates
15175066 + 0 mapped (100.00% : N/A)
15175066 + 0 paired in sequencing
7587533 + 0 read1
7587533 + 0 read2
15175066 + 0 properly paired (100.00% : N/A)
15175066 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
2 + 0 with mate mapped to a different chr
2 + 0 with mate mapped to a different chr (mapQ>=5)

ADD REPLY • link 6.9 years ago by bioinfo8 ▴ 230

1

Entering edit mode

I'm not sure what's going on with the 2 alignments that are marked as properly paired but on different chromosomes. That's obviously incorrect. Can you post what samtools view 1pp.bam | awk '{if($7 != "=") print}'?

ADD REPLY • link 6.9 years ago by Devon Ryan 104k

0

Entering edit mode

HS24_16924:3:2205:8403:6068#38  99     org9_NA954       2524    29      75M     org9_NA955       56      0       AGACAGAAGAGCTGAAGACCCACCGACAGAAGACCACCCAGCAGAGGCACACGCTCCACTGCACACACACACAGA        CCCFFFFFHFHGGIJJIJJJJJIJJGIJJIJJIJJJJJJJIIIGIJIGIJJJHHHFFFEEEEEEDDDDDDDDDD?        X0:i:1  X1:i:0  BC:Z:NCTCACGN   XG:i:0  AM:i:29 SM:i:29 XM:i:1  XO:i:0     QT:Z:!0;@???!   XT:A:U  MD:Z:65G9       NM:i:1  RG:Z:1#38
 HS24_16924:3:2205:8403:6068#38  147    org9_NA955       56      29      27S44M4S        org9_NA954       2524    0 GACATCAGAGATATGCTAACAACACACTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTTACCCAG      ,5((((DCB:@>;5((.(;(;6)7)7;HGJIIIIIJJJJIJJJJJJJIIJJJJJIJJJJIJJHGHHHFEDFFCCC        XG:i:0  AM:i:29 SM:i:29 XM:i:0  XO:i:0  XT:A:M  MD:Z:44 NM:i:0  RG:Z:1#38

ADD REPLY • link 6.9 years ago by bioinfo8 ▴ 230

1

Entering edit mode

Weird, that's a bwa bug I guess.

ADD REPLY • link 6.9 years ago by Devon Ryan 104k

0

Entering edit mode

Is there a way to avoid it or I can process bam file for further analysis (coverage calculations etc.)?