Will Picard MarkDuplicates also un-mark duplicates?
1
1
Entering edit mode
9.8 years ago

If I take a bam that's already aligned and has dups marked, remove a bunch of reads, then re-run Picard's mark-duplicates, will it correctly change the flags of reads that are no longer duplicates (but may have been before ditching the reads)?

This question is proving surprisingly hard to google for.

picard duplicates markduplicates ngs bam • 4.0k views
ADD COMMENT
0
Entering edit mode

@Pierre Lindenbaum

Actually I have a question rather than an answer here. I am very hesitating if I should use REMOVE_DUPLICATES=true or REMOVE_DUPLICATES=false. What difference it will make by only marking the flag to duplicates and removing duplicates? Thank you,

Lisa

ADD REPLY
0
Entering edit mode

Depends on what downstream tools you're using. In general, my preference is not to remove them, as changes in software or approach may make that data valuable at some later date. My general inclination is always to avoid throwing away any data, though.

ADD REPLY
6
Entering edit mode
9.8 years ago

let's test it. I use the following BAM ( 3 pairs of reads, same sequence/cigar)

@HD	VN:1.0 SO:coordinate	SO:coordinate
@SQ	SN:seq1	LN:5000
@CO	Example of SAM/BAM file format.
EAS54_61:4:143:69:577	99	seq1	36	98	35M	=	185	184	GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG	===;=====48=844;=;+=5==*57,2+5&,5+5	MF:i:18	Aq:i:35	NM:i:2	UQ:i:38	H0:i:0	H1:i:1
EAS54_61:4:143:69:578	99	seq1	36	98	35M	=	185	184	GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG	===;=====48=844;=;+=5==*57,2+5&,5+5	MF:i:18	Aq:i:35	NM:i:2	UQ:i:38	H0:i:0	H1:i:1
EAS54_61:4:143:69:579	99	seq1	36	98	35M	=	185	184	GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG	===;=====48=844;=;+=5==*57,2+5&,5+5	MF:i:18	Aq:i:35	NM:i:2	UQ:i:38	H0:i:0	H1:i:1
EAS54_61:4:143:69:577	147	seq1	185	98	35M	=	36	-184	ATTGGGAGCCCCTCTAAGCCGTTCTATTTGTAATG	222&<21<<<<12<7<01<<<<<0<<<<<<<20<<	MF:i:18	Aq:i:35	NM:i:1	UQ:i:5	H0:i:1	H1:i:0
EAS54_61:4:143:69:578	147	seq1	185	98	35M	=	36	-184	ATTGGGAGCCCCTCTAAGCCGTTCTATTTGTAATG	222&<21<<<<12<7<01<<<<<0<<<<<<<20<<	MF:i:18	Aq:i:35	NM:i:1	UQ:i:5	H0:i:1	H1:i:0
EAS54_61:4:143:69:579	147	seq1	185	98	35M	=	36	-184	ATTGGGAGCCCCTCTAAGCCGTTCTATTTGTAATG	222&<21<<<<12<7<01<<<<<0<<<<<<<20<<	MF:i:18	Aq:i:35	NM:i:1	UQ:i:5	H0:i:1	H1:i:0

I run markduplicate:

@HD	VN:1.4	SO:coordinate
@SQ	SN:seq1	LN:5000
@PG	ID:MarkDuplicates	PN:MarkDuplicates	VN:1.115(30b1e546cc4dd80c918e151dbfe46b061e63f315_1402927010)	CL:picard.sam.MarkDuplicates INPUT=[sorted.bam] OUTPUT=marked.bam METRICS_FILE=jeter    PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates REMOVE_DUPLICATES=false ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
@CO	Example of SAM/BAM file format.
EAS54_61:4:143:69:577	99	seq1	36	98	35M	=	185	184	GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG	===;=====48=844;=;+=5==*57,2+5&,5+5	H0:i:0	H1:i:1	MF:i:18	PG:Z:MarkDuplicates	NM:i:2	UQ:i:38	Aq:i:35
EAS54_61:4:143:69:578	1123	seq1	36	98	35M	=	185	184	GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG	===;=====48=844;=;+=5==*57,2+5&,5+5	H0:i:0	H1:i:1	MF:i:18	PG:Z:MarkDuplicates	NM:i:2	UQ:i:38	Aq:i:35
EAS54_61:4:143:69:579	1123	seq1	36	98	35M	=	185	184	GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG	===;=====48=844;=;+=5==*57,2+5&,5+5	H0:i:0	H1:i:1	MF:i:18	PG:Z:MarkDuplicates	NM:i:2	UQ:i:38	Aq:i:35
EAS54_61:4:143:69:577	147	seq1	185	98	35M	=	36	-184	ATTGGGAGCCCCTCTAAGCCGTTCTATTTGTAATG	222&<21<<<<12<7<01<<<<<0<<<<<<<20<<	H0:i:1	H1:i:0	MF:i:18	PG:Z:MarkDuplicates	NM:i:1	UQ:i:5	Aq:i:35
EAS54_61:4:143:69:578	1171	seq1	185	98	35M	=	36	-184	ATTGGGAGCCCCTCTAAGCCGTTCTATTTGTAATG	222&<21<<<<12<7<01<<<<<0<<<<<<<20<<	H0:i:1	H1:i:0	MF:i:18	PG:Z:MarkDuplicates	NM:i:1	UQ:i:5	Aq:i:35
EAS54_61:4:143:69:579	1171	seq1	185	98	35M	=	36	-184	ATTGGGAGCCCCTCTAAGCCGTTCTATTTGTAATG	222&<21<<<<12<7<01<<<<<0<<<<<<<20<<	H0:i:1	H1:i:0	MF:i:18	PG:Z:MarkDuplicates	NM:i:1	UQ:i:5	Aq:i:35

I remove some reads. I keep one pair of reads marked as dup :

@HD	VN:1.4	SO:coordinate
@SQ	SN:seq1	LN:5000
@PG	ID:MarkDuplicates	PN:MarkDuplicates	VN:1.115(30b1e546cc4dd80c918e151dbfe46b061e63f315_1402927010)	CL:picard.sam.MarkDuplicates INPUT=[sorted.bam] OUTPUT=marked.bam METRICS_FILE=jeter    PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates REMOVE_DUPLICATES=false ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
@CO	Example of SAM/BAM file format.
EAS54_61:4:143:69:579	1123	seq1	36	98	35M	=	185	184	GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG	===;=====48=844;=;+=5==*57,2+5&,5+5	H0:i:0	H1:i:1	MF:i:18	PG:Z:MarkDuplicates	NM:i:2	UQ:i:38	Aq:i:35
EAS54_61:4:143:69:579	1171	seq1	185	98	35M	=	36	-184	ATTGGGAGCCCCTCTAAGCCGTTCTATTTGTAATG	222&<21<<<<12<7<01<<<<<0<<<<<<<20<<	H0:i:1	H1:i:0	MF:i:18	PG:Z:MarkDuplicates	NM:i:1	UQ:i:5	Aq:i:35

I re-run picard, the flag is removed.

@HD	VN:1.4	SO:coordinate
@SQ	SN:seq1	LN:5000
@PG	ID:MarkDuplicates	PN:MarkDuplicates	VN:1.115(30b1e546cc4dd80c918e151dbfe46b061e63f315_1402927010)	CL:picard.sam.MarkDuplicates INPUT=[sorted.bam] OUTPUT=marked.bam METRICS_FILE=jeter    PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates REMOVE_DUPLICATES=false ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
@PG	ID:MarkDuplicates.1	PN:MarkDuplicates	PP:MarkDuplicates	VN:1.115(30b1e546cc4dd80c918e151dbfe46b061e63f315_1402927010)	CL:picard.sam.MarkDuplicates INPUT=[jeter2.bam] OUTPUT=marked2.bam METRICS_FILE=jeter    PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates REMOVE_DUPLICATES=false ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
@CO	Example of SAM/BAM file format.
EAS54_61:4:143:69:579	99	seq1	36	98	35M	=	185	184	GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG	===;=====48=844;=;+=5==*57,2+5&,5+5	H0:i:0	H1:i:1	MF:i:18	PG:Z:MarkDuplicates.1	NM:i:2	UQ:i:38	Aq:i:35
EAS54_61:4:143:69:579	147	seq1	185	98	35M	=	36	-184	ATTGGGAGCCCCTCTAAGCCGTTCTATTTGTAATG	222&<21<<<<12<7<01<<<<<0<<<<<<<20<<	H0:i:1	H1:i:0	MF:i:18	PG:Z:MarkDuplicates.1	NM:i:1	UQ:i:5	Aq:i:35
ADD COMMENT
2
Entering edit mode

Well played, sir. In retrospect, doing the experiment myself would have been faster than that 10 minutes of searching and typing this post. Thanks!

ADD REPLY

Login before adding your answer.

Traffic: 2096 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6