Mate-Fixing tools outputting identical bams for different samples?
9.7 years ago

Hello! I'm trying to analyze some RNA-Seq data, with the eventual goal of running a differential expression analysis.

So far I've done everything up to alignment with TopHat, plus some primary filtering, but when I tried to run MarkDuplicates from Picard, I got this error:

Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 56030473, Read name HISEQ:157:H9UNAADXX:1:1108:17721:54212, Mate Alignment start should be 0 because reference name = *.
    at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:452)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:643)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:628)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:598)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:514)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:488)
    at picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:413)
    at picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:177)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
    at picard.sam.MarkDuplicates.main(MarkDuplicates.java:161)

To fix that, I tried both samtools' fixmate command and Picard's FixMateInformation on my samples, and then ran MarkDuplicates. It no longer errors, but it's outputting identical BAM files for almost all of my samples. Some are replicates of a sample, but others are knockouts of a control, so this definitely shouldn't be happening.

Does anyone have any idea why this might be happening, and/or what I can try to fix this? The files post-alignment look absolutely fine in IGV Browser, and are not identical prior to mate-fixing and mark duplicates.
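
One quick way to confirm that the outputs really are byte-for-byte identical (rather than just looking alike) is to compare checksums. The following is a minimal sketch using only the Python standard library; the function names are made up for illustration and the file paths passed in would be whatever your outputs are called.

```python
import hashlib
from collections import defaultdict


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks to keep memory low."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_identical(paths):
    """Group paths by checksum and return only the groups with more than one file."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_md5(path)].append(str(path))
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Calling `group_identical(...)` on the list of output BAMs returns the sets of files whose contents match exactly; an empty list means no two files are identical.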

samtools picardtools bam Mate-fixing RNA-Seq • 3.7k views
9.7 years ago

Firstly, you don't need or (typically) want to mark duplicates in RNA-seq data if you plan to look at differential expression. Secondly, you'll find that Picard is often a bit excessive when it comes to standards conformance (e.g., the error you're getting is triggered by a SAM file that is actually correct). So, try using VALIDATION_STRINGENCY=LENIENT.
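
If you do end up running MarkDuplicates over several samples, a rough sketch of how the command could be assembled per sample is below. It assumes Picard's legacy `KEY=VALUE` argument style (`I=`, `O=`, `M=`, `VALIDATION_STRINGENCY=`); the `picard.jar` location, the output directory, and the `.markdup` file naming are placeholders, not anything from the original post.

```python
from pathlib import Path


def markduplicates_command(input_bam, out_dir, picard_jar="picard.jar"):
    """Build a MarkDuplicates command whose outputs are named after the input sample."""
    sample = Path(input_bam).stem  # e.g. "ko1" for "aligned/ko1.bam"
    out_dir = Path(out_dir)
    return [
        "java", "-jar", picard_jar, "MarkDuplicates",
        f"I={input_bam}",
        f"O={out_dir / (sample + '.markdup.bam')}",
        f"M={out_dir / (sample + '.markdup_metrics.txt')}",
        # Report validation problems as warnings instead of aborting.
        "VALIDATION_STRINGENCY=LENIENT",
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Deriving the output name from each input keeps one sample's output from being written over another's.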


Oh ok, it did run successfully after that. Thanks!

Also, would you mind explaining why MarkDuplicates wouldn't be necessary/desired for DESeq? I'm still new to this, so sorry if that's a basic question.


Any highly expressed gene will appear to have PCR duplicates that are really false positives; these would be marked and excluded, artificially deflating counts and compromising the results. The only common situations in which marking duplicates is useful are SNP/variant calling and non-targeted bisulfite sequencing.
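
To put rough numbers on that, here is a toy calculation, not a model of any real dataset. It assumes single-end reads whose start positions fall uniformly at random on a transcript with a fixed number of possible start sites, and it treats any two reads sharing a start site as duplicates. Under those assumptions the expected number of reads kept is P * (1 - (1 - 1/P)^N) for N reads and P start sites.

```python
def expected_reads_kept(n_reads, n_start_sites):
    """Expected number of reads left after collapsing reads that share a start site."""
    return n_start_sites * (1.0 - (1.0 - 1.0 / n_start_sites) ** n_reads)


# Same transcript (1,000 possible start sites) at increasing expression levels.
for n_reads in (100, 1_000, 10_000, 100_000):
    kept = expected_reads_kept(n_reads, n_start_sites=1_000)
    print(f"{n_reads:>7} reads -> {kept:8.1f} kept ({kept / n_reads:.1%})")
```

The count kept can never exceed the number of start sites, so the more highly a gene is expressed, the larger the fraction of genuine reads that gets thrown away.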


+1 for recommending LENIENT over SILENT :)
