How to sort a VCF file lexicographically by chromosome number?
2
2
Entering edit mode
9.1 years ago
Parth Patel ▴ 50

Hello,

I have a large 155mb vcf file that I am trying to sort lexicographically by chromosome number (in the order 1,10, 11, 12 .. etc ) and then by position. I am trying to use vcf-sort from the vcf-tools perl package.

I have been running the command :

perl vcf-sort myfile.vcf > output.vcf

however, the output file loses all chromosome, position, and ref information for some reason and sorts variants by ID. According to the help page, this is suppose to sort the variants in lexicographic order.

I tried the command:

perl -c vcf-sort myfile.vcf > output.vcf

and the output file is sorted in chromosomally chronological order (1,2,3,4..) and all of the necessary information is retained.

Any idea on how to obtain a lexicographical sort by chromosome number (and then position)?

  • Parth
vcf • 18k views
ADD COMMENT
0
Entering edit mode

Could you please post the command again...I am having some issues with the following command. Thanks.

grep -v '^#' in.vcf | LC_ALL=C sort -t '\t' -k1,1 -k2,2n >> out.vcf

error as follows:

sort: multi-character tab `\\t'
ADD REPLY
0
Entering edit mode

Use $'\t'

grep '^#' in.vcf > out.vcf && grep -v '^#' in.vcf | LC_ALL=C sort -t $'\t' -k1,1 -k2,2n >> out.vcf
ADD REPLY
0
Entering edit mode

Oops... I got it. Thanks.

ADD REPLY
0
Entering edit mode

But the random chromosomes are really giving some problems...

ADD REPLY
8
Entering edit mode
9.1 years ago
grep '^#' in.vcf > out.vcf && grep -v '^#' in.vcf | LC_ALL=C sort -t $'\t' -k1,1 -k2,2n >> out.vcf
ADD COMMENT
1
Entering edit mode

Running on Debian 8 is slight different

grep '^#' in.vcf > out.vcf && grep -v '^#' in.vcf |  sort -k1,1 -k2,2n >> out.vcf

Great solution Pierre

ADD REPLY
0
Entering edit mode

Hi Pierre,

Thanks for the quick response. I tried your grep method and it yielded the same results as vcf-sort. The variants are sorted lexicographically by id, but the chromosome, position, and ref information is gone:

##fileformat=VCFv4.1
#CHROM POS   ID                REF   ALT   QUAL  FILTER INFO
             rs10110847_dbSNP        C    100   PASS
             rs10130831_dbSNP        G    100   PASS
             rs10132024_dbSNP        T    100   PASS
             rs10133030_dbSNP        A    100   PASS
             rs10135051_dbSNP        T    100   PASS
             rs10136467_dbSNP        C    100   PASS
             rs10147247_dbSNP        T    100   PASS
             rs10149947_dbSNP        T    100   PASS
             rs10152678_dbSNP        A    100   PASS
             rs10188587_dbSNP        T    100   PASS
             rs10277675_dbSNP        A    100   PASS

Is vcf-sort supposed to be doing this? I am using grep off of cygwin, so I am beginning to wonder if there is something wrong with my grep installation..

Parth

ADD REPLY
1
Entering edit mode

If

awk '/^\t/ {print}' in.vcf

produces any ouput then your VCF is corrupted.

ADD REPLY
0
Entering edit mode

That was really helpful -- a formatting issue was in fact giving me a corrupt VCF file. Thanks!

ADD REPLY
0
Entering edit mode

Hi Pierre,

Thanks. Sorry, I am bit lost. Where to give tab (control+v+tab) as I still get the following error if cut and paste the aforesaid command. Thanks.

sort: multi-character tab `\\t'
ADD REPLY
0
Entering edit mode

You don't really to write the word '\t' but under linux: single-quote CTRL-V TAB single-quote

ADD REPLY
0
Entering edit mode

Pierre, in bash, you can use $'\t' which is the same thing as "Ctrl-V tab" but can be copy/pasted from stackoverflow.

ADD REPLY
2
Entering edit mode
6.5 years ago
ATpoint 81k

A solution without any grep:

cat in.vcf | awk '$1 ~ /^#/ {print $0;next} {print $0 | "LC_ALL=C sort -k1,1 -k2,2n"}' > sorted.vcf
ADD COMMENT
0
Entering edit mode

Is there any reason for using cat first and then awk instead of using directly awk and send the file as input?

ADD REPLY
0
Entering edit mode

no, but people often do this when they're exploring the data, inserting some tools in the pipeline. Imagine it could be gunzip -c in.vcf.gz instead of cat in.vcf

ADD REPLY

Login before adding your answer.

Traffic: 1932 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6