Hello! I'm trying to analyze some RNA-Seq data, on which I eventually hope to run a differential expression analysis.
So far I've done everything up to alignment with TopHat, plus some initial filtering, but when I tried to run Picard's MarkDuplicates I got this error:
Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 56030473, Read name HISEQ:157:H9UNAADXX:1:1108:17721:54212, Mate Alignment start should be 0 because reference name = *.
at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:452)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:643)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:628)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:598)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:514)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:488)
at picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:413)
at picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:177)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:183)
at picard.sam.MarkDuplicates.main(MarkDuplicates.java:161)
To fix that, I tried both samtools fixmate and Picard's FixMateInformation on my samples and then ran MarkDuplicates. It no longer errors, but it's outputting identical BAM files for almost all of my samples. Some are replicates of the same sample, but others are knockouts of a control, so this definitely shouldn't be happening.
Does anyone have any idea why this might be happening, and/or what I can try to fix it? The post-alignment files look absolutely fine in IGV, and they are not identical before mate fixing and MarkDuplicates.
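One way to confirm that the outputs really are byte-for-byte identical (rather than just similar in size) is to compare checksums. This is a minimal sketch; the helper names `file_md5` and `group_identical` are made up for illustration, and it only detects exact copies, not files that differ in the header alone.

```python
import hashlib
from collections import defaultdict


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_identical(paths):
    """Group file paths by checksum and return only groups with more than one file."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_md5(path)].append(str(path))
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Calling `group_identical` on the list of MarkDuplicates outputs returns the sets of files that are exact copies of each other. If whole files match exactly, it may be worth checking whether every run was given the same input or output path, though that is only a guess from the symptoms described.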
Oh ok, it did run successfully after that. Thanks!
Also, would you mind explaining why MarkDuplicates wouldn't be necessary/desired for DESeq? I'm still new to learning it, so sorry if that's a basic question...
Any highly expressed gene will produce many reads that look like PCR duplicates but are not (false positives). These would be marked and excluded, artificially deflating counts and compromising the results. The only common situations in which marking duplicates is useful are SNP/variant calling and non-targeted bisulfite sequencing.
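To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes single-end reads whose start positions fall uniformly at random on a transcript with `n_positions` possible starts, and that any read sharing a start with an earlier read would be flagged as a duplicate. Real libraries are not uniform, and paired-end data also uses the mate position, so treat this as an illustration only. The function name is made up.

```python
def expected_duplicate_fraction(n_reads, n_positions):
    """Expected fraction of reads flagged as duplicates purely by chance.

    Under uniform sampling, the expected number of distinct start positions
    hit by n_reads is n_positions * (1 - (1 - 1/n_positions) ** n_reads).
    Every read beyond the first at a given position counts as a "duplicate".
    """
    if n_reads == 0:
        return 0.0
    expected_distinct = n_positions * (1.0 - (1.0 - 1.0 / n_positions) ** n_reads)
    return 1.0 - expected_distinct / n_reads


# Same hypothetical 2,000-position transcript at low and high expression.
low = expected_duplicate_fraction(100, 2000)
high = expected_duplicate_fraction(100_000, 2000)
print(f"low expression:  {low:.1%} of reads flagged")
print(f"high expression: {high:.1%} of reads flagged")
```

Under these assumptions only a couple of percent of reads are flagged for the lowly expressed transcript, while about 98% are flagged for the highly expressed one, even though no PCR duplication was simulated at all. Removing those reads would squash exactly the genes with the highest counts.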
+1 for recommending LENIENT over SILENT :)
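For anyone finding this later, here is a sketch of what a MarkDuplicates call with relaxed validation might look like, built as an argument list in Python. It assumes Picard's legacy `KEY=VALUE` argument style; the jar location and file names are hypothetical placeholders, and the helper name is made up. `VALIDATION_STRINGENCY` accepts `STRICT` (the default), `LENIENT`, and `SILENT`; `LENIENT` still reports validation problems as warnings, whereas `SILENT` suppresses them.

```python
import shlex


def mark_duplicates_command(input_bam, output_bam, metrics_file,
                            picard_jar="picard.jar", stringency="LENIENT"):
    """Build the argument list for a Picard MarkDuplicates run (nothing is executed)."""
    return [
        "java", "-jar", picard_jar, "MarkDuplicates",
        f"INPUT={input_bam}",
        f"OUTPUT={output_bam}",
        f"METRICS_FILE={metrics_file}",
        f"VALIDATION_STRINGENCY={stringency}",
    ]


# Hypothetical sample names; each sample gets its own input, output and metrics file.
for sample in ["control_rep1", "knockout_rep1"]:
    command = mark_duplicates_command(
        f"{sample}.bam", f"{sample}.dedup.bam", f"{sample}.metrics.txt"
    )
    print(shlex.join(command))
```

The printed lines can be inspected before being run, which makes it easy to spot a loop that accidentally reuses the same input or output path for every sample.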