Question

Most effective way to merge individual VCF files into VCF files merged by chromosome

0

Entering edit mode

9.9 years ago

vlaufer ▴ 290

Hello biostars,

The VCF files I have are currently 1 file per individual. I want to convert them to 1 file per chromosome, with many individuals.

Normally, I would simply split them, then merge them; or merge them, then split them, but I have quite a few (large) files.

So, I was wondering if there is a best way to carry out these two steps.

Or, if not, I assume it would be a lot smarter (in terms of RAM usage) to split each individual file first, then merge them after each is split?

Which tool and options would you use?

VCF VCFmerge VCFtools merge Plinkseq • 9.1k views

ADD COMMENT • link updated 5.4 years ago by Biostar 20 • written 9.9 years ago by vlaufer ▴ 290

1

Entering edit mode

The problem I see with what you are doing is that unless you have a line in your vcf for every single position, there will be a lot of positions that are missing. When you combine samples, and have one sample with a variant, and another sample is blank, you won't know if the other sample really is homozygous reference, or if it's too poor quality to know.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.9 years ago by swbarnes2 14k

0

Entering edit mode

Hi swbarnes2 -

that's a really, really good point. So, let me ask you this... provided that I only ever intend to use this particular set of files for phasing, do you think this would still be an issue?

I care about lots of parts of the analysis, but currently with this subset of files, my only goal is to phase.

The reason I had elected to do this is because that is how the example dataset at BEAGLE is formatted...I was following their lead:

http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase1_vcf/

So, let me ask you this...

If you wanted to phase a few hundred whole genomes, what exactly would you do? Would you not generate multisample VCFs by chromosome? I now have the files in both formats, so I can use either ... my plan was to group my samples together by ethnicity, then phase them based on the 1kG reference files provided through the link, above.

But I fully admit I am not experienced in doing this and would love to have your feedback.

Regards,
vlaufer

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.9 years ago by vlaufer ▴ 290

1

Entering edit mode

9.9 years ago

matted 7.8k

I would try the htslib versions of vcftools (options here), doing bcftools view first to split things by chromosome, and then bcftools merge to merge across individuals.

I don't know how fast it would be for your files, but they claim to be speedy:

"HTSlib is a C library for high-throughput sequencing data formats. It is designed for speed and works with both VCF and BCFv2."

ADD COMMENT • link updated 4.5 years ago by Ram 43k • written 9.9 years ago by matted 7.8k

0

Entering edit mode

Thanks! I was looking through vcftools at the moment but had looked past this.

I appreciate your time and help!

ADD REPLY • link 9.9 years ago by vlaufer ▴ 290

Ram · Accepted Answer · 2014-05-15

Handling headers might be a complication, but after converting VCF files to BED, one could use the bedextract tool to very efficiently split BED files by chromosome, take the multiset union of the per-chromosome BED files with bedops --everything - and then convert each per-chromosome BED file to VCF.

Using bash, here's one way to convert person_{1..n}.vcf to BED files:

$ for vcfFn in `ls person_*.vcf`; do \
    vcf2bed < ${vcfFn} > ${vcfFn%.*}.bed; \
done

Then split each BED file:

$ for bedFn in `ls person_*.bed`; do \
    for chrName in `bedextract --list-chr ${bedFn}`; do \
        bedextract ${chrName} ${bedFn} > ${bedFn%.*}.${chrName}.bed;
    done \
done

Note: To speed this up further, note that the "outer" (per-BED) and "inner" (per-chromosome) loop steps are each parallelizable, either with a computational grid or a software approach like GNU Parallel.

Pick one person to get a list of chromosomes, using that list to build the union of the per-chromosome BED files:

$ for chrName in `bedextract --list-chr person_1.bed`; do \
    bedops --everything *.${chrName}.bed > everyone_${chrName}.bed; \
done

Next use awk to re-arrange columns and re-index the coordinates from 0-based, half-open BED into 1-based, closed VCF form. It's the inverse of vcf2bed - I'll leave that to the reader. Additionally, for each chromosome file result, it would also be necessary to append a header, perhaps built from the unique INFO fields of one or all inputs - a hash table in Python or Perl would deal with that second case pretty easily.