Biostar Beta. Not for public use.
How Can I Merge A Large Amount Of Vcf Files?
9
Entering edit mode
6.2 years ago
Matt W • 240

I have about one thousand vcf files, and this is a ridiculous amount to type by hand into vcf-merge. Is there a way to merge all of these files without typing each name?

I tried putting all of the files into a text file and using stdin, but vcf-merge doesn't support this, i.e.

vcf-merge < list.txt > merge.vcf

Maybe this problem stems from creating this many vcf files to begin with. When using vcf-tools, I need to select multiple locations (i.e.--from-bp 100 --to-bp 200) which means I need to run vcf-tools for each location. Is there a better way to do this?

Thanks in advance for any help!

ADD COMMENTlink
13
Entering edit mode
14 months ago
Washington University in St. Louis, MO

If the problem is merely getting the list into vcf-merge, it's as easy as:

vcf-merge $(ls -1 *.vcf | perl -pe 's/\n/ /g') > merge.vcf

(assumes all your vcfs to be merged are in the same directory)

ADD COMMENTlink
1
Entering edit mode

This is definitely along the track of what I'm trying to achieve. However, I keep getting an error: http://pastebin.com/H5AQZEtp Any ideas?

ADD REPLYlink
5
Entering edit mode

A quick guess: the argument to ls is the number one (1), not the letter L (l).

ADD REPLYlink
0
Entering edit mode

That did the trick! Thanks for your help!

ADD REPLYlink
8
Entering edit mode
14 months ago
France/Nantes/Institut du Thorax - INSE…

deprecated : do not use this

I wrote a tool named scanvcf : it concatenates the VCF, adds an extra column with the sample name.

$ head -n3 input.txt

#Sample VCF
Sample1 data/sample1.vcf.gz
Sample2 data/sample1.vcf.gz
Sample2 data/sample1.vcf.gz


$ cat input.txt |scanvcf 
#CHROM POS ID REF ALT QUAL FILTER . FORMAT Call SAMPLE
1 879317 rs7523549 C T 71 0 . GT:GQ:DP:FLT 0/1:34:8:0 Sample1
1 880238 rs3748592 A G 51 0 . GT:GQ:DP:FLT 1/1:51:8:0 Sample1
1 880390 rs3748593 C A 99 0 . GT:GQ:DP:FLT 1/0:99:30:0 Sample1
1 881627 rs2272757 G A 99 0 . GT:GQ:DP:FLT 1/0:59:20:0 Sample1
(...)
Y 13524507 . C T 99 0 . GT:GQ:DP:FLT 1/1:99:233:0 Sample20
Y 21154323 rs10465459 G A 99 0 . GT:GQ:DP:FLT 1/1:99:215:0 Sample20
Y 21154426 rs52812045 G A 99 0 . GT:GQ:DP:FLT 1/0:99:143:0 Sample20
Y 21154466 rs10465460 T A 99 0 . GT:GQ:DP:FLT 1/1:99:134:0 Sample20
Y 21154529 . G A 51 0 . GT:GQ:DP:FLT 1/1:51:8:0 Sample20

You can later filter the result using samplespersnp.

  $ cat list.tsv | scanvcf  |\
  sort -t'  ' -k1,1 -k2,2n -k4,4 -k5,5 -k11,11 |\
  samplespersnp --sample 11 | awk '($8=".")'
1 753269 rs61770172 C G 99 0 . GT:GQ:DP:FLT 1/1:99:116:0 Sample16 1
1 753405 rs61770173 C A 99 0 . GT:GQ:DP:FLT 1/1:63:31:0 Sample10 7
1 753405 rs61770173 C A 81 0 . GT:GQ:DP:FLT 1/1:51:19:0 Sample12 7
1 753405 rs61770173 C A 35 0 . GT:GQ:DP:FLT 1/0:35:66:0 Sample19 7
1 753405 rs61770173 C A 99 0 . GT:GQ:DP:FLT 1/1:99:35:0 Sample3 7
1 753405 rs61770173 C A 90 0 . GT:GQ:DP:FLT 1/1:90:21:0 Sample5 7
1 753405 rs61770173 C A 99 0 . GT:GQ:DP:FLT 1/1:99:36:0 Sample6 7
1 753405 rs61770173 C A 90 0 . GT:GQ:DP:FLT 1/1:90:21:0 Sample9 7
1 876499 rs4372192 A G 39 0 . GT:GQ:DP:FLT 1/1:39:4:0 Sample12 6
1 876499 rs4372192 A G 42 0 . GT:GQ:DP:FLT 1/1:42:5:0 Sample16 6
1 876499 rs4372192 A G 39 0 . GT:GQ:DP:FLT 1/1:39:4:0 Sample17 6
1 876499 rs4372192 A G 45 0 . GT:GQ:DP:FLT 1/1:45:6:0 Sample18 6
1 876499 rs4372192 A G 45 0 . GT:GQ:DP:FLT 1/1:45:6:0 Sample4 6
1 876499 rs4372192 A G 42 0 . GT:GQ:DP:FLT 1/1:42:5:0 Sample6 6
1 877831 rs6672356 T C 42 0 . GT:GQ:DP:FLT 1/1:42:5:0 Sample14 2
1 877831 rs6672356 T C 39 0 . GT:GQ:DP:FLT 1/1:39:4:0 Sample4 2
1 878601 . C T 98 0 . GT:GQ:DP:FLT 0/1:50:11:0 Sample14 1

This tool also takes an option -e for a query over the samples. e.g: "variation must contains Sample11, Sample12 BUT NOT Sample1 to Sample5:"

-e '(Sample11  && Sample12 && (!(Sample1 || Sample2 || Sample3 || Sample4 || Sample5)))'
ADD COMMENTlink
0
Entering edit mode

2.5 years later: my solution is deprecated. This is not what you're looking for.

ADD REPLYlink
0
Entering edit mode

what do you use now?

ADD REPLYlink
0
Entering edit mode
ADD REPLYlink
6
Entering edit mode
4.0 years ago
matted 7.0k
Boston, United States

If I understand your task correctly, xargs would be a quick way to go (similar to the perl example above):

ls *vcf | xargs vcf-merge > merged.vcf

Or if it's a bit more complicated and you want to use the file list as input:

cat files.txt | xargs vcf-merge > merged.vcf
ADD COMMENTlink
0
Entering edit mode

Definitely what I am looking for, but each time I try there is some error. When I had plain vcf files it asked for the tabix index. So I used bgzip | tabix and tried this again doing ls *.vcf.gz and another error reading columns comes up: http://pastebin.com/aDpgkqjw Any ideas? Thanks for your help! EDIT: I also know vcf-merge works correctly because I have used it successfully on a small number of files.

ADD REPLYlink
0
Entering edit mode

Resolved. My blank output was being picked up by the *vcf and throwing an error. Thanks for the help!

ADD REPLYlink
4
Entering edit mode
16 months ago
Washington University School of Medicin…

Another option is: joinx

joinx vcf-merge [OPTIONS] file1.vcf file2.vcf [file3.vcf ...]
ADD COMMENTlink
2
Entering edit mode
18 months ago
Vivek ♦ 2.3k
Denmark

I _think_ GATK CombineVariants walker would support a list of files as input.

ADD COMMENTlink
1
Entering edit mode
15 months ago
Netherlands

What about, time while read line; do vcf-merge file1.vcf $line >> merge.cvf; done < list.txt and your list.txt should have filenames except the first one.

Cheers

ADD COMMENTlink
0
Entering edit mode

Hello,

The script help to merge all vcf files, but how can you make it possible to add columns with sample name.?

ADD REPLYlink
0
Entering edit mode
10 weeks ago
ATpoint 17k
Germany

Years later, only using Unix tool, producing a coord-sorted and bgzipped VCF:

ls *.vcf | head -n 1 | xargs grep '#' > ${BASENAME}_header.txt
cat *.vcf | mawk '$1 ~ /^#/ {next} {print $0 | "sort -k1,1 -k2,2n --parallel=8"}' | \
  cat ${BASENAME}_header.txt /dev/stdin | bgzip -@ 8 > out.catted.vcf.gz
ADD COMMENTlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1