Fastest way to sort a 340 GB BCF by chromosome and position?
3.9 years ago
curious ▴ 750

Right now I am running:

bcftools sort --max-mem 30G --temp-dir {temp_dir} {in_bcf} -Ob > {out_bcf}

It has been running for several days and is still not done. Is there a faster way to sort a BCF? As far as I can tell, bcftools does not have an option to parallelize this step. Could sorting like this even be done in parallel?
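On the question of whether a sort like this can be parallelized at all: in principle yes, because chunks can be sorted independently and then merged in one pass. Below is a minimal sketch of that idea using only coreutils on toy data. The three-column lines stand in for header-less VCF records (CHROM, POS, ID); the file names, chunk size, and data are made up for illustration, and this is not something bcftools does for you.

```shell
# Toy stand-in for the header-less body of a VCF (CHROM, POS, ID); hypothetical data.
workdir=$(mktemp -d)
TAB=$(printf '\t')
printf '2\t500\tc\n2\t100\ta\n2\t900\te\n2\t300\tb\n2\t700\td\n2\t200\tf\n' > "$workdir/body.txt"

# 1. Split into chunks of 2 lines each (a real run would use far larger chunks).
split -l 2 "$workdir/body.txt" "$workdir/chunk_"

# 2. Sort every chunk independently, one background job per chunk.
for f in "$workdir"/chunk_*; do
    LC_ALL=C sort -t "$TAB" -k1,1 -k2,2n -o "$f.sorted" "$f" &
done
wait

# 3. Merge the sorted chunks in a single pass (-m merges, it does not re-sort).
LC_ALL=C sort -m -t "$TAB" -k1,1 -k2,2n "$workdir"/chunk_*.sorted > "$workdir/sorted.txt"
cat "$workdir/sorted.txt"
```

GNU sort already does something along these lines internally when given --parallel, so the sketch is mainly meant to show why the problem is parallelizable, not to replace that option.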

Would something like this be even faster, even though I think there would be several days of overhead in converting back and forth between BCF and VCF?

(bcftools view -h bcf_unsorted.bcf; bcftools view -H bcf_unsorted.bcf | sort --buffer-size=80% --parallel=24 -k1,1 -k2,2n --temporary-directory=temp_dir) | bcftools view -Ob > bcf_sorted.bcf
bcf bcftools • 1.9k views

just curious: what was your upstream process that led to such an unsorted big file?

It's just a lot of samples with a lot of sites. I had to add a few sites, and I used bcftools concat to do that. Now I am just sorting to get them in the right order.

Did you include all reference bases, too?

It is imputed, so it has all sites in the imputation panel, which is very dense, and I have many samples. So it does not include every position in the reference, if I understand your meaning correctly.

I wondered if your input is not already sorted.

It was originally sorted by chromosome and position; there are on average about 10M variants per chromosome. I added a few thousand with bcftools concat, and now I have to sort again by chromosome and position, since it appears concat just appends the new variants at the end of the file.

My problem is basically just to add a few thousand new sites to an already sorted bcf with millions of sites and a number of samples that makes sorting unwieldy. Originally I was using concat and sort to do this.

Another idea I just had was to use bcftools view -R to get all sites leading up to the first new site that I want to add, add the new site with concat, then use bcftools view -R again to get everything between that site and the next site to add, concatenate that, and repeat. This would end up with me running concat with thousands of BCFs as arguments, though. I would put them in a file list, but this is the general idea:

bcftools concat {first_region.bcf} {first_new_variant.bcf} {second_region.bcf} {second_new_variant.bcf}...
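The interleaving idea above amounts to merging two inputs that are each already sorted, which needs only a single linear pass rather than a full re-sort. A minimal sketch of that merge on toy data, using coreutils only: the three-column lines stand in for header-less VCF records, and the file names and IDs (base.txt, new.txt, new1, new2) are hypothetical.

```shell
workdir=$(mktemp -d)
TAB=$(printf '\t')

# Already-sorted large body (toy data) and the few new sites, sorted the same way.
printf '2\t100\ta\n2\t300\tb\n2\t500\tc\n2\t900\te\n' > "$workdir/base.txt"
printf '2\t700\tnew2\n2\t200\tnew1\n' \
    | LC_ALL=C sort -t "$TAB" -k1,1 -k2,2n > "$workdir/new.txt"

# sort -m merges pre-sorted inputs without re-sorting them.
LC_ALL=C sort -m -t "$TAB" -k1,1 -k2,2n \
    "$workdir/base.txt" "$workdir/new.txt" > "$workdir/merged.txt"

# sort -c exits non-zero if the result is out of order.
LC_ALL=C sort -c -t "$TAB" -k1,1 -k2,2n "$workdir/merged.txt" && cat "$workdir/merged.txt"
```

With real data the two inputs would presumably be bcftools view -H streams, with the header written first and the merged body piped back into bcftools view -Ob. The bcftools manual also describes an -a/--allow-overlaps option for concat on indexed, individually sorted inputs; whether it fits this case is worth checking against the manual before trying it on a 340 GB file.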

How big is your BCF?

Chromosome 2 is 340 GB.
