tabix to exclude chromosomes from vcf
4.9 years ago
cubellis ▴ 10

Dear friends

I would like to split a VCF file into smaller files.

I would like to obtain one VCF file that has all the data BUT NOT those of chr1, chr2, chr3.

The reason is that I would like to split the VCF into

a file for chr1,
a file for chr2,
a file for chr3,

and a fourth file with all the rest, including unlocalized and unknown sequences.

is that possible?

thank you

mvc

snp vcf tabix • 2.7k views

is that possible?

use grep

4.9 years ago

Hello,

Running a bioinformatics application with the --help parameter, or without any options, is always a good first step.

$ tabix
Version: 1.9
Usage:   tabix [OPTIONS] [FILE] [REGION [...]]

Indexing Options:
   -0, --zero-based           coordinates are zero-based
   -b, --begin INT            column number for region start [4]
   -c, --comment CHAR         skip comment lines starting with CHAR [null]
   -C, --csi                  generate CSI index for VCF (default is TBI)
   -e, --end INT              column number for region end (if no end, set INT to -b) [5]
   -f, --force                overwrite existing index without asking
   -m, --min-shift INT        set minimal interval size for CSI indices to 2^INT [14]
   -p, --preset STR           gff, bed, sam, vcf
   -s, --sequence INT         column number for sequence names (suppressed by -p) [1]
   -S, --skip-lines INT       skip first INT lines [0]

Querying and other options:
   -h, --print-header         print also the header lines
   -H, --only-header          print only the header lines
   -l, --list-chroms          list chromosome names
   -r, --reheader FILE        replace the header with the content of FILE
   -R, --regions FILE         restrict to regions listed in the file
   -T, --targets FILE         similar to -R but streams rather than index-jumps

REGION [...] tells you that you can define multiple regions at once. You can also supply a file that holds the regions of interest using -R.

So getting the variants of chr1 to chr3 is straightforward:

$ tabix -h input.vcf.gz chr1 chr2 chr3 > output.vcf
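
The command above writes chr1 to chr3 into a single file. If one file per chromosome is wanted, as in the question, a loop over the names is one option. This is only a minimal sketch: the function name `split_by_chrom` and the `<chrom>.vcf` output names are made up for illustration, and it assumes `input.vcf.gz` is bgzip-compressed and already indexed.

```shell
# Hypothetical helper: write one VCF per requested chromosome.
# Usage: split_by_chrom input.vcf.gz chr1 chr2 chr3
split_by_chrom() {
  input=$1
  shift
  for chrom in "$@"; do
    # -h keeps the header so each output is a complete VCF
    tabix -h "$input" "$chrom" > "${chrom}.vcf"
  done
}
```

Called as `split_by_chrom input.vcf.gz chr1 chr2 chr3`, this should leave chr1.vcf, chr2.vcf and chr3.vcf in the current directory.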

To get the rest of the chromosomes, we can create a file listing all chromosomes except chr1 to chr3:

$ tabix -l input.vcf.gz | grep -vw "chr[123]" > regions.txt

Then use this file with tabix (the input file has to be given here as well):

$ tabix -h -R regions.txt input.vcf.gz > output2.vcf
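
An alternative that avoids the regions file is to pass the remaining names directly as REGION arguments. This may be the safer route, because the tabix manual describes the -R file as BED or tab-delimited CHROM/POS columns, so whether a names-only file is accepted could depend on the version. The sketch below is not from the original answer: `exclude_chroms` and `extract_rest` are hypothetical helper names, and it assumes contig names contain no whitespace, colons or shell glob characters.

```shell
# Hypothetical helper: read contig names on stdin and print those
# that are NOT exactly equal to one of the arguments.
exclude_chroms() {
  awk -v drop="$*" '
    BEGIN { n = split(drop, a, " "); for (i = 1; i <= n; i++) skip[a[i]] = 1 }
    !($0 in skip)'
}

# Hypothetical helper: extract every contig except the listed ones.
# Usage: extract_rest input.vcf.gz rest.vcf chr1 chr2 chr3
extract_rest() {
  input=$1
  output=$2
  shift 2
  # Word splitting of the $(...) result is intended: one REGION per contig name.
  tabix -h "$input" $(tabix -l "$input" | exclude_chroms "$@") > "$output"
}
```

Because the comparison is an exact match on the whole name, chr10 or chr1_KI270706v1_random would stay in the "rest" file while chr1 itself is dropped.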

fin swimmer

4.9 years ago

If you know what chromosomes you want, and you have their bounds (if you need a subset of a chromosome), you can query tabix for them directly:

https://bioinformatics.stackexchange.com/questions/8650/tabix-queries-spanning-chromosomes

So you'd have four queries, one each for chr1, chr2, and chr3, and a fourth query with a list of the other chromosomes, which you'd build with a trivial script.

You could also use grep, but VCF files tend to be very large and sweeping through an entire file a few times could take a while. If you're using tabix then your data are indexed, and you may as well take advantage of that index.
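
If a streaming approach is still preferred, for example because the file is not indexed, the cost of repeated sweeps can be avoided by producing all four files in a single pass. The following is a rough sketch with awk rather than grep; `split_vcf_stream` and the `rest.vcf` name are hypothetical, and it assumes all header lines come before the first record.

```shell
# Hypothetical helper: split a VCF read from stdin in one pass.
# Usage: zcat input.vcf.gz | split_vcf_stream outdir chr1 chr2 chr3
split_vcf_stream() {
  outdir=$1
  shift
  awk -F '\t' -v outdir="$outdir" -v keep="$*" '
    BEGIN { n = split(keep, names, " "); for (i = 1; i <= n; i++) own[names[i]] = 1 }
    /^#/ { header = header $0 "\n"; next }   # collect the header lines
    {
      # listed chromosomes get their own file, everything else goes to rest.vcf
      file = outdir "/" (($1 in own) ? $1 : "rest") ".vcf"
      if (!(file in seen)) { printf "%s", header > file; seen[file] = 1 }
      print > file
    }'
}
```

A file is only created once the first record for it is seen, so a chromosome without variants produces no output file.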
