Issues with Marking Duplicates in Picard
5.4 years ago

Hi everyone!

So I have been tasked with analyzing some sequence data even though I have no clue what I'm doing. I was given data from 11 samples (S1, S2, etc.), each with a singles file as well as a forward read file (R1) and a reverse read file (R2). In addition, each sample was run in two different lanes (L001 and L002), so for every file there is a corresponding file from the other lane. I was given this data after quality control had already been done using Scythe and Sickle. The files were in FASTQ format.

So my first step was to map these files to the reference. I did this using BWA mem. I aligned both the R1 and R2 files of a given sample and lane to the reference, then did the same for the R1 and R2 of the same sample from the other lane, then did it for the singles files for each lane. Therefore, for every sample, I got 4 SAM files that were mapped to the reference (e.g., S1 L001 R1&R2, S1 L002 R1&R2, S1 L001 single, S1 L002 single), for all 11 samples.

Next I used samtools to convert the SAM files to BAM, as well as to restrict each BAM file to only the reads that mapped to the reference genome.
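For anyone following along, the conversion and the mapped-only filter can be done in a single samtools call. This is a minimal sketch, assuming samtools 1.x; the `.sam` file names are guesses based on the BAM names used later in this post, and `sam_to_mapped_bam_cmd` is a made-up helper name. It only prints the commands so they can be checked before anything is run.

```shell
# Build the samtools command that converts SAM to BAM and drops unmapped reads.
#   -b   : write BAM output
#   -F 4 : exclude reads that have the "unmapped" flag (0x4) set
sam_to_mapped_bam_cmd() {
    local sam="$1"
    local bam="${sam%.sam}.bam"   # S1_L001.sam -> S1_L001.bam
    printf 'samtools view -b -F 4 -o %s %s\n' "$bam" "$sam"
}

# Print one command per file of a sample (file names are assumptions).
for sam in S1_L001.sam S1_L002.sam singleS1_L001.sam singleS1_L002.sam; do
    sam_to_mapped_bam_cmd "$sam"
done
```

Once the printed commands look right, they can be pasted into the terminal or piped to `sh`.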

Now here is where the trouble begins - I next used samtools to merge all of a sample's files together, for instance anything from S1 including both lanes for both the singles and the merged R1&R2 files. I used

samtools merge S1_merged.bam singleS1_L001.bam singleS1_L002.bam S1_L001.bam S1_L002.bam

Then I tried to use MarkDuplicatesWithMateCigar in Picard to mark the duplicates in the single merged file (S1_merged.bam). But when I did it gave me the error "this program requires inputs in coordinate SortOrder." It seems as though my headings weren't sorted correctly.
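As far as I know, the Picard message is about the order of the reads in the file, which is recorded in the `SO:` field of the `@HD` header line. BWA writes alignments in the order of the input reads, so a BAM that has never been through `samtools sort` would not normally be coordinate-sorted. Here is a small sketch for checking what the header says; `header_sort_order` is a hypothetical helper name, not a samtools command.

```shell
# Print the sort order recorded in a SAM/BAM header that is read from stdin.
# The @HD line carries it as a tab-separated field of the form SO:<value>.
header_sort_order() {
    awk -F '\t' '
        $1 == "@HD" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SO:/) { print substr($i, 4); found = 1 }
        }
        END { if (!found) print "unknown" }
    '
}

# Typical use (needs samtools on the PATH):
#   samtools view -H S1_merged.bam | header_sort_order
```

If this prints anything other than `coordinate`, the file likely needs to be sorted before duplicate marking.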

So I tried to sort the merged bam file using samtools sort. I did

samtools sort S1_merged.bam -o S1_sorted.bam

which gave me a ton of files. I tried redoing it using the "-m 20G" command and it gave me 6 files instead.

So then I merged these six sorted files into "S1_sorted.bam" using samtools merge and tried MarkDuplicatesWithMateCigar again. I did

java -jar $PICARD MarkDuplicatesWithMateCigar I=S1_sorted.bam O=S1_marked.bam M=S1_marked_metrics.txt

And it told me: "Exception in thread "main" picard.PicardException: Found a samRecordWithOrdinal with sufficiently large clipping that we may have missed including it in an early duplicate marking iteration. Please increase the minimum distance to at least 120bp." So I tried again with "MINIMUM_DISTANCE=120" added to the command, and it didn't even give me an error; it just spit me back out a list of a bunch of commands. I tried using MarkDuplicates instead of MarkDuplicatesWithMateCigar and it did the same thing.

I'm really at a loss here guys. Should I have sorted before I merged all the lanes and singles? Should I have merged my sorted files after sorting? Am I missing something?

Any help would be greatly appreciated.

Picard GATK sorting BWA samtools • 2.8k views

"it just spit me back out a list of a bunch of commands"

Which ones?


I linked a screenshot of my terminal showing what it gives me: https://imgur.com/pUmF8lh


There is a problem with your command line: a parameter is missing or wrong.
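To make that concrete: with the `KEY=VALUE` argument style used in this thread, each option is one word with no spaces around the `=`, and Picard tends to print its usage text when it cannot parse an argument. Below is a sketch of how the full call might be assembled, assuming the file names from the original post; `markdup_cmd` is a made-up helper that only prints the command for inspection.

```shell
# Print a MarkDuplicatesWithMateCigar call for one sample.
# Every option is a single KEY=VALUE word (no spaces around "=").
# $PICARD is the path to the Picard jar, as in the original post.
markdup_cmd() {
    local sample="$1"
    local min_dist="$2"
    printf 'java -jar "$PICARD" MarkDuplicatesWithMateCigar I=%s_sorted.bam O=%s_marked.bam M=%s_marked_metrics.txt MINIMUM_DISTANCE=%s\n' \
        "$sample" "$sample" "$sample" "$min_dist"
}

markdup_cmd S1 120
```

Comparing the printed line character by character against what was typed is a quick way to spot a stray space or a misspelled option name.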

5.4 years ago
goodez ▴ 640

First, I don't totally understand what the "singles" file is. I would just stick with the forward and reverse fastq files (R1 and R2).

Now for combining samples from multiple lanes... I usually merge the fastq files before aligning. I first check their quality using FastQC. It is also okay to combine after alignment as you have.

"So I tried to sort the merged bam file using samtools sort. . . which gave me a ton of files. I tried redoing it using the "-m 20G" command and it gave me 6 files instead."

This is the most troubling part. samtools sort should output one sorted BAM, not multiple files. These may have been intermediate files; did you let the program finish running completely? Also, the manual says to run samtools sort this way:

samtools sort -o out.bam in.bam

You did this in the wrong order (I don't know if that actually affects how it runs).

Perhaps that will fix your issues.
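On the question of whether to sort before or after merging: `samtools merge` is documented as merging already-sorted files, so one reasonable order is to sort each BAM, merge the sorted files, then index the result. Sorting once after the merge should also work, as long as the final file is coordinate-sorted. A sketch that prints the commands in that order, assuming samtools 1.3 or later and the file names from the question; `sort_then_merge_cmds` is a hypothetical helper.

```shell
# Print the commands for one sample: sort each BAM, merge them, then index.
# Usage: sort_then_merge_cmds <sample> <in1.bam> [<in2.bam> ...]
sort_then_merge_cmds() {
    local sample="$1"
    shift
    local sorted=""
    local bam
    for bam in "$@"; do
        # Coordinate sort is the default for "samtools sort".
        printf 'samtools sort -o %s.sorted.bam %s\n' "${bam%.bam}" "$bam"
        sorted="$sorted ${bam%.bam}.sorted.bam"
    done
    # Merging coordinate-sorted inputs yields a coordinate-sorted output.
    printf 'samtools merge %s_merged.bam%s\n' "$sample" "$sorted"
    printf 'samtools index %s_merged.bam\n' "$sample"
}

sort_then_merge_cmds S1 S1_L001.bam S1_L002.bam singleS1_L001.bam singleS1_L002.bam
```

The merged file can then go to duplicate marking without a second sort.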


Oh dang, okay. Let me try doing the sort the way you listed. Hopefully that helps. Thanks!


I tried doing

samtools sort -o S1_sorted.bam S1_merged.bam

and it told me "fail to open file S1_sorted.bam"


Weird. Maybe it's because that output file already exists and samtools doesn't want to overwrite it?

Try this as well. You shouldn't have to specify bam format, but I don't know what version of software you're using.

samtools sort -O bam -o S1_sorted.bam S1_merged.bam

Run exactly that. It shouldn't give errors like that.


So while I was waiting for your response, I tried doing

samtools sort -@ 4 -m 30G S1_merged.bam S1_sorted.bam

And it worked and only gave me one file!

But then I tried doing the MarkDuplicates and it still did the same thing as before. I must be doing something wrong at this step.


Sorry to hear that. I personally have had issues every time I've tried to use any Picard tool... Do you require duplicate removal in your analysis? Removing duplicates is often unnecessary and can even falsely remove unique reads.


Which version of samtools are you using? Getting this error message for the command given makes me think it is 0.1.19, because in that version the -o parameter and the positional arguments have a different meaning than they do now.

If you really are using this very, very old version, please upgrade before continuing.

fin swimmer
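To illustrate the difference: in the 0.1.x releases the sort command took an input file and an output prefix (`.bam` was appended automatically) and `-o` meant "write to stdout", whereas current releases take the output file name through `-o`. A sketch that picks the form from a version string; `sort_cmd_for_version` is a made-up helper, and it assumes anything that is not 0.x is 1.3 or newer (the 1.0 to 1.2 releases are not handled here).

```shell
# Print the "samtools sort" call that matches a given samtools version.
# Usage: sort_cmd_for_version <version> <in.bam> <out.bam>
sort_cmd_for_version() {
    local version="$1"
    local in_bam="$2"
    local out_bam="$3"
    case "$version" in
        0.*)
            # Legacy form: second argument is a prefix, ".bam" gets appended.
            printf 'samtools sort %s %s\n' "$in_bam" "${out_bam%.bam}"
            ;;
        *)
            # Current form (assumed 1.3+): output file is given with -o.
            printf 'samtools sort -o %s %s\n' "$out_bam" "$in_bam"
            ;;
    esac
}

sort_cmd_for_version 0.1.19 S1_merged.bam S1_sorted.bam
sort_cmd_for_version 1.9 S1_merged.bam S1_sorted.bam
```

On 1.x installs the version string can be read from the first line of `samtools --version`; the 0.1.x builds I have seen print it in the usage text shown when `samtools` is run with no arguments.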
