Hi,
Here is the flagstat
output of my bam file:
37750740 + 352032 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
1965916 + 0 duplicates
20118207 + 184214 mapped (53.29% : 52.33%)
37750740 + 352032 paired in sequencing
18875370 + 176016 read1
18875370 + 176016 read2
15175066 + 138352 properly paired (40.20% : 39.30%)
16901306 + 153876 with itself and mate mapped
3216901 + 30338 singletons (8.52% : 8.62%)
1538902 + 13814 with mate mapped to a different chr
561025 + 5046 with mate mapped to a different chr (mapQ>=5)
extracted properly-paired reads with samtools view -bf 0x2 1.bam > 1pp.bam
and again cross-checked with flagstat
:
15175098 + 138352 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
930902 + 0 duplicates
15175066 + 138352 mapped (100.00% : 100.00%)
15175098 + 138352 paired in sequencing
7587550 + 69176 read1
7587548 + 69176 read2
15175066 + 138352 properly paired (100.00% : 100.00%)
15175066 + 138352 with itself and mate mapped
0 + 0 singletons (0.00% : 0.00%)
2 + 0 with mate mapped to a different chr
2 + 0 with mate mapped to a different chr (mapQ>=5)
How can I extract properly-paired QC-passed reads instead of extracting only properly-paired?
Kindly guide.
Thanks
Output with:
samtools view -b -f 2 -F 524 1.bam > 1pp.bam
seems more satisfactory thansamtools view -f 2 -F 512 -b -o 1pp.bam 1.bam
What both of you think?How can I integrate removal of duplicate properly-paired reads and hence finally get properly-paired uniquely mapped QC-passed reads?
How do you want to define "uniquely mapped"?
I mean to avoid duplicate properly paired reads (for e.g. in the output) or would it be useful to keep them for further analysis?
Then use
-F 780
instead of-F 524
.Output (duplicates are still showing up):
What duplicates? You didn't have any to begin with.
There is same output from
-F 780
and-F 524
.What about
930902 + 0
duplicates?Ah, I missed that line. You can exclude them with
-F 1804
. Make sure to not exclude those unless it really makes sense.Would you please elaborate more?
780 + 1024 (PCR duplicates) is 1804.
Is it equivalent to
rmdup
?So, I can go as follows:
Do I need to do
fixmate
beforesort
? Is there any step which I am missing?I think
rmdup
tries to determine what's a duplicate as well (it's not recommended to usermdup
, especially for PE data).You don't need to run
fixmate
, both mates (or neither) will get marked as duplicates.