Concat VCFs with some shared samples
15 months ago
USA SoCal

Say I have 24 VCFs (one per chromosome, 1-Y) and I want to concat them all into one VCF. But 16 of the VCFs contain 100 samples and the remaining 8 contain 95. Those 95 samples are all present in the 100-sample VCFs.

So is there an easy way to concat (not merge) these VCFs into one?


So is there an easy way to concat (not merge) these VCFs into one?

Why don't you want to merge? Merging (GATK/bcftools) is the only way to combine VCFs that contain different samples.


Because the merge tools do not like duplicate sample names. And if you use --force-samples, you get a second column for each duplicated sample (NA12878 becomes 2:NA12878), which is more of a headache.

I'm looking for a quick fix out of laziness. This can be done with a quick script, but I have split the genome into 10Mb intervals for over 6000 whole genomes. So yeah, if someone else wrote a tool I would prefer that.
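
For reference, the "quick script" route could look roughly like the sketch below. It assumes it is acceptable to keep only the samples shared by every file (the 95 here) and that the VCFs are uncompressed; for bgzipped files, `bcftools query -l` lists the sample names instead. The function names `vcf_samples` and `shared_samples` and the file names in the comments are made up for illustration.

```shell
#!/usr/bin/env bash

# List the sample names of an uncompressed VCF, one per line.
# Samples are columns 10 and up of the "#CHROM" header line.
vcf_samples() {
    awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }' "$1"
}

# Print the samples present in every VCF given as an argument,
# in the order they appear in the first file.
shared_samples() {
    local keep f
    keep="$(vcf_samples "$1")"
    shift
    for f in "$@"; do
        # Keep only names that match a whole line of this file's sample list.
        keep="$(printf '%s\n' "$keep" | grep -Fx -f <(vcf_samples "$f"))"
    done
    printf '%s\n' "$keep"
}

# Possible usage (hypothetical file names):
#   shared_samples chr*.vcf > shared.txt
#   bcftools view -S shared.txt -Oz -o chr1.shared.vcf.gz chr1.vcf   # repeat per file
#   bcftools concat -Oz -o all.vcf.gz chr1.shared.vcf.gz chr2.shared.vcf.gz
```

Subsetting every file with the same sample list also puts the columns in the same order, which `bcftools concat` needs. The cost is that the 5 extra samples are dropped from the 100-sample files.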


GATK has a tool, CombineVariants, that does it. It has an option, --genotypemergeoption, for when sample names are repeated.

21 months ago
Buenos Aires, Argentina

I was struggling with a similar issue and CombineVariants was the answer, as @RamRS suggested. (Thanks!)

I had chr8.vcf.gz and chrY.vcf.gz files with some shared samples. More specifically, male samples were shared by both files, but female samples were absent from chrY.vcf.gz. After indexing both files with tabix <filename>, this is the command that merged both sets perfectly:

gatk3 -T CombineVariants \
-R /path/to/reference.fasta \
--variant:A chr8.vcf.gz \
--variant:B chrY.vcf.gz \
--genotypemergeoption PRIORITIZE \
-priority A,B \
-o out.vcf


Of course, gatk3 should be java plus the path to the GATK jar. The PRIORITIZE option is meant to prioritize redundant genotypes from one file over the other, with -priority A,B referring to the labels assigned in the --variant parameters. However, if the files cover non-overlapping regions (e.g. different chromosomes, as in my case), the chosen priority does not affect the output. The result will have missing genotypes for samples that are present in one file but not the other, as expected (in my case, female samples have missing genotypes for chromosome Y variants).
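
With 24 per-chromosome files instead of two, typing the labels by hand gets tedious. The sketch below generates the `--variant:LABEL` arguments and a matching `-priority` list; the helper name `combine_args` and the `v1`, `v2`, ... labels are arbitrary placeholders, and I have only checked the generated text, not a 24-file GATK run.

```shell
#!/usr/bin/env bash

# Print CombineVariants arguments for a list of VCFs: one
# "--variant:vN file" pair per input, followed by the merge option
# and a -priority list naming the same labels in the same order.
combine_args() {
    local i=0 priority="" f
    for f in "$@"; do
        i=$((i + 1))
        printf -- '--variant:v%d %s\n' "$i" "$f"
        priority="${priority:+$priority,}v$i"
    done
    printf -- '--genotypemergeoption PRIORITIZE\n-priority %s\n' "$priority"
}

# Possible usage (relies on word splitting, so file names must not
# contain spaces):
#   gatk3 -T CombineVariants -R /path/to/reference.fasta \
#       $(combine_args chr1.vcf.gz chr2.vcf.gz chrY.vcf.gz) -o out.vcf
```

As noted above, the priority order should not matter when each file covers a different chromosome; the list is there only because PRIORITIZE requires one.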


Did you type <pre><code> for the code block? This post is weird because it combines old markup formatting with the newer markdown formatting!


Yep, I wanted a "block of code", but ... did not preserve newlines. I just tried the [101|010] edit button, but the result seems to be the same, right?

What's the usual way to insert a code block here?


The code button is the way to go. Select the content and use the 101010 button for a code block. For inline code formatting, use back-ticks (`content`).

but ... did not preserve newlines

This is a bug: the preview box that pops up while you're typing doesn't render content surrounded by triple back-ticks properly. Once you click Submit, though, there is virtually no difference between content surrounded by triple back-ticks and content where each line is prefixed with 4 spaces and separated by an empty line from the content before and after it (which is what the 101010 button does).


<triple-back-ticks>
content
<triple-back-ticks>


is the same as

<empty-line>
<space><space><space><space>content
<empty-line>