Approximately following the GATK recommended workflow for DNA sequence data, my mpileup output files are empty
1
0
Entering edit mode
5.1 years ago

Hello!

I've googled for hours but haven't found a solution that worked for me. I have some raw human DNA paired-end data in fastq format, then I use these programs in this order:

Trimmomatic (trim the reads) BWA (map the reads to hg19) SAMtools (sorting and indexing) Picard (duplicate marking) GATK (indel realignment) GATK (base recalibration) SAMtools (mpileup) VarScan

Everything looks good until the mpileup step. The output files are empty. I did a flagstat on the bam files that I input to mpileup, and they look mostly like this:

2136672 + 0 in total (QC-passed reads + QC-failed reads)
12002 + 0 secondary
0 + 0 supplementary
1370901 + 0 duplicates
2131283 + 0 mapped (99.75% : N/A)
2124670 + 0 paired in sequencing
1062335 + 0 read1
1062335 + 0 read2
2032962 + 0 properly paired (95.68% : N/A)
2115336 + 0 with itself and mate mapped
3945 + 0 singletons (0.19% : N/A)
5880 + 0 with mate mapped to a different chr
3537 + 0 with mate mapped to a different chr (mapQ>=5)

This looks good right? What am I missing? I tried adding the -A flag to mpileup because that should include improperly mated read pairs (but judging by the flagstat output, I should not have to do this), it changes nothing.

Commands in order:

java -jar ${EBROOTTRIMMOMATIC}/trimmomatic-0.38.jar PE "$f1" "$f2" "TRIMMED/${sample}.fp.gz" "TRIMMED/${sample}.fu.gz" "TRIMMED/${sample}.rp.gz" "TRIMMED/${sample}.ru.gz" "$illuclip"

bwa mem -t 8 -M -R "@RG\tID:${sample}\tSM:${sample}\tPL:ILLUMINA" "$refGenome" "$read1" "$read2" | samtools view -Sb - > "MAPPED/${sample}.bam"

samtools sort "MAPPED/${sample}.bam" > "SORTED/${sample}.bam"

samtools index "SORTED/${sample}.bam" "SORTED/${sample}.bai"

java -jar "${EBROOTPICARD}/picard.jar" MarkDuplicates I="SORTED/${sample}.bam" O="DUPLICATES/MARKED/${sample}.bam" M="QC/PICARD/${sample}.txt"

samtools index "DUPLICATES/MARKED/${sample}.bam" "DUPLICATES/MARKED/${sample}.bai"

java -jar "$EBROOTGATK/GenomeAnalysisTK.jar" -T RealignerTargetCreator -R "$refGenome" "${knownI[@]}" -I "$i" --filter_reads_with_N_cigar -o "$t"

java -jar "$EBROOTGATK/GenomeAnalysisTK.jar" -T IndelRealigner -R "$refGenome" "${knownI[@]}" -I "$i" -targetIntervals "$t" -o "INDELSREALIGNED/${sample}.bam"

samtools index "INDELSREALIGNED/${sample}.bam" "INDELSREALIGNED/${sample}.bai"

java -jar "$EBROOTGATK/GenomeAnalysisTK.jar" -T BaseRecalibrator -R "$refGenome" "${knownS[@]}" -I "$i" -o "$o"

java -jar "$EBROOTGATK/GenomeAnalysisTK.jar" -T BaseRecalibrator -R "$refGenome" "${knownS[@]}" -I "$i" -BQSR "$o" -o "$p"

 java -jar "$EBROOTGATK/GenomeAnalysisTK.jar" -T AnalyzeCovariates -R "$refGenome" -before "$o" -after "$p" -plots "BASERECALIBRATED/${sample}.pdf"

java -jar "$EBROOTGATK/GenomeAnalysisTK.jar" -T PrintReads -R "$refGenome" -I "$i" -BQSR "$o" -o "BASERECALIBRATED/${sample}.bam"    

samtools mpileup -A -B -d 1000 -f "$refGenome" -o "PILEDUP/${sample}.mup" "BASERECALIBRATED/${sample}.bam"

java -Xmx3G -jar "${EBROOTVARSCAN}/VarScan.v2.4.1.jar" mpileup2snp "PILEDUP/${sample}.mup" --min-var-freq 0.001 --strand-filter 0 --p-value 0.05 > "VARCALLS/$sample.snp"

I'm newly graduated so don't have much experience. I just like to program, heh.

Big thanks in advance!!

sequencing SNP genome next-gen • 1.9k views
ADD COMMENT
0
Entering edit mode

Hi Joel Wallenius, show the relevant command please ( Brief Reminder On How To Ask A Good Question )

ADD REPLY
0
Entering edit mode

Changed the title and added command!

ADD REPLY
0
Entering edit mode

GATK (indel realignment) GATK (base recalibration) SAMtools (mpileup) VarScan

That doesn't look like the (current?) GATK best practices to me.

ADD REPLY
0
Entering edit mode

Changed the title and added command!

ADD REPLY
0
Entering edit mode

It will be good if you are able to paste all your commands as we can see the parameters used. what are the commands for Varscan ?

ADD REPLY
1
Entering edit mode

Also, generally, it is not a great idea to mix and match up these programs that were developed by independent groups. If you want to use GATK, then follow their best practices guide and use HaplotypeCaller to call variants. As far as I am aware, the indel realignment step is neither now necessary as a separate step in the GATK 4 release - you should check.

As an example of where this may go wrong, GATK does base recalibration separately, but SAMtools can also do it during the mpileup command. One should not be doing it twice (?).

ADD REPLY
0
Entering edit mode

The super computer I'm working with does not support the latest GATK so I have to use 3.6. The best practices guide includes the indel realignment and base recalibrations as far as I can see on their web page. Unless SAMtools mpileup performs the recalibration automatically without being asked to, I am surely not doing it twice. Advice noted though, and thank you!

ADD REPLY
0
Entering edit mode

conda is your friend. https://anaconda.org/bioconda/gatk4

Or what do you mean by "does not support"?

ADD REPLY
0
Entering edit mode

They use a module system, you load modules more or less like you'd activate virtual envs. The latest GATK module is 3.7 or something. I could maybe ask the admins to install the latest one but that would take a while, probably. How much do I stand to gain in practice from an update? The problem I'm having is with mpileup, after all (though I guess maybe mpileup misbehaves because of the output from GATK's baserecalibration?).

ADD REPLY
0
Entering edit mode

Yes, modules are common on HPCs but you have a $HOME folder, no? There you can install whatever you like locally and link it to your $PATH. conda is a package manager to do exactly that.

ADD REPLY
0
Entering edit mode

Yes true but this script is supposed to be used by the rest of my research group and they aren't very savvy, if you know what I mean. I'd like to make it as easy as possible for them, and installing a bunch of packages as a first step is... well, it's not a priority right now at least. Perhaps in a few weeks if I don't receive new tasks...

ADD REPLY
0
Entering edit mode

To avoid to install tools locally in your home (and also to avoid that your colleagues install themselves tools in their home) you may try to build a singularity image with all the tools you want. Then share the image with your colleagues. I'm sure your HPC has a module to use singularity images.

ADD REPLY
0
Entering edit mode

The commands for VarScan do not matter as it's a later step. I will paste the commands tomorrow when I'm at work again!

ADD REPLY
0
Entering edit mode

Added all the commands in order!

ADD REPLY
0
Entering edit mode

How was the library done? Is this amplicon based? I'm asking because you have a very high duplication rate and in case of amplicon based data, you cannot mark duplicates.

ADD REPLY
0
Entering edit mode

I'm not sure actually, I only know that the input at the start of my script is in .fastq.gz format, that the reads are paired, and produced by an Illumina machine. FWIW, all the reads and all the samples have the same adapter sequences. Does that supply enough information?

ADD REPLY
0
Entering edit mode

So is it WGS or WES or targeted ?

ADD REPLY
0
Entering edit mode

Your question gave me an idea. I am using a file with targets in a BED-format, and the chromosome names in that file were like 1,2,3... rather than chr1, chr2, and so on... that's why the output was empty. :[ The shame is real. At least the problem is solved now...

ADD REPLY
0
Entering edit mode
5.1 years ago

SOLVED! ( Sorry I couldn't write this earlier, my post limit had been reached )

The problem was that my targets file (BED-format) used chromosome names 1,2,3,4,5,6..., while my alignment files used names like chr1,ch2,chr3...

Big thanks to everyone for you patience. I'll remember your advice!

ADD COMMENT

Login before adding your answer.

Traffic: 2676 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6