Question

tabix to exclude chromosomes from vcf

4.9 years ago

cubellis ▴ 10

Dear friends,

I would like to split a VCF file into smaller files.

I would like to obtain one VCF file that has all the data BUT NOT the records for chr1, chr2, and chr3.

The reason is that I would like to split the VCF into:

a file for chr1,
a file for chr2,
a file for chr3,

and a fourth file with all the rest, including unlocalized and unknown sequences.

Is that possible?

Thank you

mvc

snp vcf tabix • 2.7k views

ADD COMMENT • link updated 4.9 years ago by zx8754 11k • written 4.9 years ago by cubellis ▴ 10

is that possible?

use grep

ADD REPLY • link 4.9 years ago by Pierre Lindenbaum 161k

score 2 · Answer 1 · 2019-06-11

Hello,

Running a bioinformatics tool with --help, or with no options at all, is always a good first step.

$ tabix
Version: 1.9
Usage:   tabix [OPTIONS] [FILE] [REGION [...]]

Indexing Options:
   -0, --zero-based           coordinates are zero-based
   -b, --begin INT            column number for region start [4]
   -c, --comment CHAR         skip comment lines starting with CHAR [null]
   -C, --csi                  generate CSI index for VCF (default is TBI)
   -e, --end INT              column number for region end (if no end, set INT to -b) [5]
   -f, --force                overwrite existing index without asking
   -m, --min-shift INT        set minimal interval size for CSI indices to 2^INT [14]
   -p, --preset STR           gff, bed, sam, vcf
   -s, --sequence INT         column number for sequence names (suppressed by -p) [1]
   -S, --skip-lines INT       skip first INT lines [0]

Querying and other options:
   -h, --print-header         print also the header lines
   -H, --only-header          print only the header lines
   -l, --list-chroms          list chromosome names
   -r, --reheader FILE        replace the header with the content of FILE
   -R, --regions FILE         restrict to regions listed in the file
   -T, --targets FILE         similar to -R but streams rather than index-jumps

REGION [...] tells you that you can specify multiple regions at once. You can also supply a file that lists the regions of interest using -R.

So getting the variants on chr1 to chr3 is straightforward:

$ tabix -h input.vcf.gz chr1 chr2 chr3 > output.vcf

To get the rest of the chromosomes, we can create a file listing every sequence name except chr1 to chr3:

$ tabix -l input.vcf.gz | grep -vw "chr[123]" > regions.txt

Then use this file with tabix (the input file still has to be named on the command line):

$ tabix -h -R regions.txt input.vcf.gz > output2.vcf
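
The commands above write chr1 to chr3 into a single file, while the question asks for one file per chromosome plus a fourth file for the rest. A minimal sketch of that four-way split follows. The function name split_vcf and the output file names are made up for illustration, and it assumes, as this answer does, that tabix accepts a regions file containing bare sequence names and that the contigs are named with a chr prefix.

```shell
#!/bin/sh
# Sketch only: split_vcf and all output file names are illustrative
# assumptions, not part of tabix.
split_vcf() {
    in="$1"    # bgzip-compressed VCF with its tabix index alongside

    # One file per named chromosome; -h keeps the VCF header in each.
    for chrom in chr1 chr2 chr3; do
        tabix -h "$in" "$chrom" > "$chrom.vcf"
    done

    # Every other sequence name, including unlocalized/unknown contigs.
    # -x matches whole lines, so chr10 or chr1_..._random are kept.
    tabix -l "$in" | grep -vxE 'chr[123]' > rest_regions.txt

    # Fourth file: all records on the remaining sequences.
    tabix -h -R rest_regions.txt "$in" > rest.vcf
}

# Usage (input.vcf.gz is a placeholder):
#   split_vcf input.vcf.gz
```

If the file uses 1, 2, 3 instead of chr1, chr2, chr3, both the loop and the grep pattern would need adjusting.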

fin swimmer

score 1 · Answer 2 · 2019-06-10

If you know which chromosomes you want, and you have their bounds (if you need a subset of a chromosome), you can query tabix for them directly:

https://bioinformatics.stackexchange.com/questions/8650/tabix-queries-spanning-chromosomes

So you'd have four queries, one each for chr1, chr2, and chr3, and a fourth query with a list of the other chromosomes, which you'd build with a trivial script.
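
The "trivial script" for the list of other chromosomes might be a one-line filter. The sketch below uses a hypothetical helper, other_contigs, and made-up contig names standing in for the output of tabix -l, so that the filtering can be checked without an indexed file.

```shell
#!/bin/sh
# Hypothetical helper: read sequence names on stdin, one per line,
# and drop exactly chr1, chr2 and chr3.
other_contigs() {
    # -x requires the whole line to match, -v inverts the selection.
    grep -vxE 'chr[123]'
}

# Made-up names standing in for `tabix -l input.vcf.gz`.
printf '%s\n' chr1 chr2 chr3 chr4 chr10 \
    chr1_KI270706v1_random chrUn_KI270302v1 | other_contigs
```

With a real file the input would come from tabix -l, and the resulting names would be passed to the fourth query.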

You could also use grep, but VCF files tend to be very large and sweeping through an entire file a few times could take a while. If you're using tabix then your data are indexed, and you may as well take advantage of that index.
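
For comparison, the streaming approach could look roughly like the sketch below. It uses awk rather than grep so that only the CHROM column is tested. The helper name drop_chr1_to_3 is made up, and the whole file is read once, so the index is not used at all.

```shell
#!/bin/sh
# Hypothetical helper: stream an uncompressed VCF on stdin, keep header
# lines and every record whose CHROM column is not chr1, chr2 or chr3.
drop_chr1_to_3() {
    awk -F '\t' '/^#/ || ($1 != "chr1" && $1 != "chr2" && $1 != "chr3")'
}

# Usage (file names are placeholders):
#   gzip -dc input.vcf.gz | drop_chr1_to_3 > rest.vcf
```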