Question

Mate-Fixing tools outputting identical bams for different samples?

9.7 years ago

rishi.z.sinha ▴ 10

Hello! I'm trying to analyze some RNA-Seq data, and I eventually hope to run a differential expression analysis on it.

So far I've done everything up to alignment with TopHat, plus some primary filtering, but when I tried to run MarkDuplicates from Picard, I got this error:

Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 56030473, Read name HISEQ:157:H9UNAADXX:1:1108:17721:54212, Mate Alignment start should be 0 because reference name = *.
    at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:452)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:643)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:628)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:598)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:514)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:488)
    at picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:413)
    at picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:177)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
    at picard.sam.MarkDuplicates.main(MarkDuplicates.java:161)

To fix that, I tried both samtools fixmate and Picard's FixMateInformation on my samples and then ran MarkDuplicates again. It no longer errors, but it now outputs identical BAM files for almost all of my samples. Some are replicates of a sample, but others are knockouts of a control, so this definitely shouldn't be happening.

Does anyone have any idea why this might be happening, and/or what I can try to fix this? The files post-alignment look absolutely fine in IGV Browser, and are not identical prior to mate-fixing and mark duplicates.

samtools picardtools bam Mate-fixing RNA-Seq • 3.7k views

Ram · Answer 1 · 2014-08-01

9.7 years ago

Devon Ryan 104k

Firstly, you don't need or (typically) want to mark duplicates with RNA-seq data if you plan to look at differential expression. Secondly, you'll find that Picard is often overly strict about standards conformance (e.g., the error you're getting is raised on a SAM file that is actually correct). So, try using VALIDATION_STRINGENCY=LENIENT.
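For anyone landing here later, this is a minimal sketch of what adding that option could look like when scripting the runs. It only builds the command line for each sample. The jar path, the sample names, and the file naming scheme are all hypothetical, and older Picard releases shipped one jar per tool rather than a single `picard.jar`, so adjust the first few arguments to match your install. With LENIENT, validation problems like the one above are reported as warnings instead of stopping the run.

```python
import shlex


def markdup_command(sample, picard_jar="picard.jar"):
    """Build a Picard MarkDuplicates command line for one sample.

    The jar path and the <sample>.bam naming scheme are assumptions
    made for illustration, not something taken from the thread.
    """
    return [
        "java", "-jar", picard_jar, "MarkDuplicates",
        f"INPUT={sample}.bam",
        # Derive the output name from the sample so that no two samples
        # can end up writing to the same file.
        f"OUTPUT={sample}.markdup.bam",
        f"METRICS_FILE={sample}.markdup_metrics.txt",
        # Report validation problems as warnings instead of aborting.
        "VALIDATION_STRINGENCY=LENIENT",
    ]


if __name__ == "__main__":
    # Hypothetical sample names.
    for sample in ["control_rep1", "knockout_rep1"]:
        print(shlex.join(markdup_command(sample)))
        # To actually run it:
        # subprocess.run(markdup_command(sample), check=True)
```

Deriving every output path from the sample name is a deliberate choice here: it makes it easy to see at a glance that each sample gets its own input and output file.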

Oh ok, it did run successfully after that. Thanks!

Also, would you mind explaining why MarkDuplicates wouldn't be necessary/desired for DESeq? I'm still new to learning it, so sorry if that's a basic question...

Reply • 9.7 years ago by rishi.z.sinha ▴ 10

Any highly expressed gene will produce many reads that look like PCR duplicates but are not (false positives). These would end up being marked and excluded, artificially deflating counts and compromising the results. The only common situations wherein marking duplicates is useful are SNP/variant calling and non-targeted bisulfite sequencing.
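To make the deflation argument concrete, here is a toy simulation, not a model of what MarkDuplicates really does. It assumes single-end reads whose start positions are drawn uniformly along one transcript, and it treats any two reads with the same start as duplicates. The real tool also considers strand and the mate's position, so the effect sets in later for paired-end data. The function name and all the numbers are made up for illustration.

```python
import random


def fraction_lost_to_dedup(n_reads, n_start_positions, seed=0):
    """Fraction of independent reads discarded by position-based dedup.

    Toy model: every read comes from a distinct fragment (no PCR
    duplicates at all), and start positions are uniform along the
    transcript. Reads sharing a start position are collapsed to one.
    """
    rng = random.Random(seed)
    starts = [rng.randrange(n_start_positions) for _ in range(n_reads)]
    kept = len(set(starts))  # one read survives per distinct start
    return 1 - kept / n_reads


if __name__ == "__main__":
    # Hypothetical transcript with 1000 possible read start positions,
    # at low, medium, and high expression.
    for n_reads in (100, 10_000, 100_000):
        lost = fraction_lost_to_dedup(n_reads, n_start_positions=1000)
        print(f"{n_reads:>7} reads: {lost:.1%} of counts removed")
```

The point of the sketch is that the number of reads kept can never exceed the number of distinct start positions, so the more highly a gene is expressed, the larger the share of its genuine reads that gets thrown away.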