Biostar Test Site

This is site is used for testing only. Visit: https://www.biostars.org to ask a question.

Problem using PicardTools MarkDuplicates: Value Was Put Into Pairinfomap More Than Once
1
1
Entering edit mode
5.2 years ago
anp375 ▴ 180

High, I'm having a problem using PicardTools MarkDuplicates. I had two fastq files for forward and reverse reads. They were both from the same lane, etc. I aligned them with bwa mem using the -M option and got the bam file. I used PicardTools FixMateInformation, SortSam, and samtools index. When I use MarkDuplicates, I get this error:

Exception in thread "main" net.sf.picard.PicardException: Value was put into PairInfoMap more than once. 4: Sample_AD_096:HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 at net.sf.picard.sam.CoordinateSortedPairInfoMap.ensureSequenceLoaded(CoordinateSortedPairInfoMap.java:124) at net.sf.picard.sam.CoordinateSortedPairInfoMap.remove(CoordinateSortedPairInfoMap.java:78) at net.sf.picard.sam.DiskReadEndsMap.remove(DiskReadEndsMap.java:61) at net.sf.picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:407) at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:150) at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177) at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:134) [E::hts_open] fail to open file 'Sample_AD_096_b37_aln-pe.bam'

When I grepped the read name, I got 4 lines back: Primary forward read, primary reverse read, separated by a SAMFlag value of 80 or 16. Secondary forward read, secondary reverse read separated by a SAMFlag value of 80 or 16.

HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 131 2 233666062 6 0 35S66M 5 154316645 0 CCCATCGTCGGAGCTGGTGGCACAGAGGTTGT ACTGACCCTGTCTCTACAAATAATAAAAAAGCCAGGCATGGTGGTATGCACCTGTAGTCCCAGCTACTT BBBFFFFF FFFFFIFFFBFFFFFIFBFFBFFFFFIFFFFFII<ffffiiifffiiiiiiffffffbbbbffffbbbbbbfffffffff ffffff<bbfffb="" sa:z:5,154316645,+,39m62s,60,0;="" md:z:66="" rg:z:sample_ad_096="" n="" m:i:0="" mq:i:60="" as:i:66="" xs:i:44<="" strong="">

HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 115 2 233666062 6 0 32S69M 5 154316648 0 ATCGTCGGAGCTGGTGGCACAGAGGTTGTACT GACCCTGTCTCTACAAATAATAAAAAAGCCAGGCATGGTGGTATGCACCTGTAGTCCCAGCTACTTGGG 77FFBFFF FFFFFFFFFFFFFBBBBBFFFFFFB7BBFIIIFFFFFBFFFFFFFIIFBFFFFFFIIIFBFIF<f<fbfbbfffff<fbf ffffffffffbbb="" sa:z:5,154316648,-,36m65s,60,0;="" md:z:69="" rg:z:sample_ad_096="" n="" m:i:0="" mq:i:60="" as:i:69="" xs:i:46<="" strong="">

HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 387 5 154316645 6 0 39M62H 2 233666062 0 CCCATCGTCGGAGCTGGTGGCACAGAGGTTGT ACTGACC BBBFFFFFFFFFFIFFFBFFFFFIFBFFBFFFFFIFFFF SA:Z:2,233666062,+,35S66M,60,0;M D:Z:39 RG:Z:Sample_AD_096 NM:i:0 MQ:i:60 AS:i:39 XS:i:0

HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 371 5 154316648 6 0 36M65H 2 233666062 0 ATCGTCGGAGCTGGTGGCACAGAGGTTGTACT GACC 77FFBFFFFFFFFFFFFFFFFBBBBBFFFFFFB7BB SA:Z:2,233666062,-,32S69M,60,0;M D:Z:36 RG:Z:Sample_AD_096 NM:i:0 MQ:i:60 AS:i:36 XS:i:0

I tried isolating primary alignments only, with samtools view -b -F 0x4 -F 0x100 -F 0x800 both.bam > primary.bam, but got the same error even though a grep showed only the primary forward and reverse reads separated by a value of 80 or 16. I did remember to index this. I don't know how to solve this. I have two fastq files so I can't use MergeBamAlignment. Should I combine the fastq files into one? My grep output from the primaries only:

HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 131 2 233666062 6 0 35S66M 5 154316645 0 CCCATCGTCGGAGCTGGTGGCACAGAGGTTGT ACTGACCCTGTCTCTACAAATAATAAAAAAGCCAGGCATGGTGGTATGCACCTGTAGTCCCAGCTACTT BBBFFFFF FFFFFIFFFBFFFFFIFBFFBFFFFFIFFFFFII<ffffiiifffiiiiiiffffffbbbbffffbbbbbbfffffffff ffffff<bbfffb="" sa:z:5,154316645,+,39m62s,60,0;="" md:z:66="" rg:z:sample_ad_096="" n="" m:i:0="" mq:i:60="" as:i:66="" xs:i:44<="" strong="">

HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 115 2 233666062 6 0 32S69M 5 154316648 0 ATCGTCGGAGCTGGTGGCACAGAGGTTGTACT GACCCTGTCTCTACAAATAATAAAAAAGCCAGGCATGGTGGTATGCACCTGTAGTCCCAGCTACTTGGG 77FFBFFF FFFFFFFFFFFFFBBBBBFFFFFFB7BBFIIIFFFFFBFFFFFFFIIFBFFFFFFIIIFBFIF<f<fbfbbfffff<fbf ffffffffffbbb="" sa:z:5,154316648,-,36m65s,60,0;="" md:z:69="" rg:z:sample_ad_096="" n="" m:i:0="" mq:i:60="" as:i:69="" xs:i:46<="" strong="">

PicardTools Picard MarkDuplicates Pairinfomap • 3.0k views
ADD COMMENT
2
Entering edit mode
5.2 years ago

It looks like you have actual duplicate reads based on the read name and not the flags. e.g.:

HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 131 2 233666062 6 0 35S66M 5 154316645 0 ...  
HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 115 2 233666062 6 0 32S69M 5 154316648 0 ...  

My guess is that you had some newish data from a HiSeq where the read pair identifiers come after the read names, separated by a space:

HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 1:x:xx:xxxxx 131 2 233666062 6 0 35S66M 5 154316645 0 ...  
HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 2:x:xx:xxxxx 115 2 233666062 6 0 32S69M 5 154316648 0 ...  

One of your pre-processing steps probably stripped everything in the read name after the space, and now Picard thinks these are duplicates.

I forgot what the exact format for the 1:x:xx:xxxxx portion is, but you can find the definition from illumina. It will be something like 1:Y:0:ATCGAT which is read number:passing filter:control flag:barcode.

ADD COMMENT
1
Entering edit mode

Thank you. I think something weird happened with the assignments. The secondary reads seem to be assigned as mates to the primary reads.

ADD REPLY
0
Entering edit mode

Glad you figured it out.

ADD REPLY

Login before adding your answer.

Traffic: 329 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6