Question

Retrieve Subset Positions Vcf File

2

Entering edit mode

11.8 years ago

Rubal7 ▴ 830

Hello all,

What is the most efficient way to retrieve many subsets of regions from a vcf.gz file? I have about 1000 10kb regions that I need to extract from a whole genome vcf file. I think tabix is the best way to do this but I haven't been able to understand to use it on a large scale to retrieve hundreds of regions. For instance can one input a file with a list of regions (eg 1:11345-112345 for a position on chrom1) and automate the process?

Thanks very much for any help and comments in advance!

vcf tabix • 15k views

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 11.8 years ago by Rubal7 ▴ 830

Ram · Answer 1 · 2012-06-18

10

Entering edit mode

11.8 years ago

brentp 24k

Tabix can take a bed file of regions and pull those out. So:

tabix k1000g.vcf.gz -B regions.bed

and if your regions are formatted like you indicate "1:11345-112345". You can get them into a bed file quickly with awk:

awk 'BEGIN{OFS="\t"; FS=":"} { split($2, a, "-"); print $0, a[1] - 1, a[2] }' regions.txt > regions.bed

You can also intersect a bed file with a VCF using bedtools.

ADD COMMENT • link 11.8 years ago by brentp 24k

0

Entering edit mode

Thanks, very helpful

ADD REPLY • link 11.8 years ago by Rubal7 ▴ 830

0

Entering edit mode

-B option is dropped in the latest version(1.X) of tabix

ADD REPLY • link 9.4 years ago by Rm 8.3k

0

Entering edit mode

@Rm: can you tell us where you have seen this? The online manual is outdated and from version 0.1.1. I have version 0.2.5 (r964) with the option. It wouldn't make much sense to me to remove that option...

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.4 years ago by Leonor Palmeira 3.8k

0

Entering edit mode

Dropped from latest version ( to read the locations from a file)

Version: 1.1
Usage:   tabix [OPTIONS] [FILE] [REGION [...]]
Options:
   -0, --zero-based        coordinates are zero-based
   -b, --begin INT         column number for region start [4]
   -c, --comment CHAR      skip comment lines starting with CHAR [null]
   -e, --end INT           column number for region end (if no end, set INT to -b) [5]
   -f, --force             overwrite existing index without asking
   -h, --print-header      print also the header lines
   -H, --only-header       print only the header lines
   -l, --list-chroms       list chromosome names
   -m, --min-shift INT     set the minimal interval size to 1<<INT; 0 for the old tabix index [0]
   -p, --preset STR        gff, bed, sam, vcf, bcf, bam
   -r, --reheader FILE     replace the header with the content of FILE
   -s, --sequence INT      column number for sequence names (suppressed by -p) [1]
   -S, --skip-lines INT    skip first INT lines [0]

Program: tabix (TAB-delimited file InderXer)
Version: 0.2.5 (r964)

Usage:   tabix <in.tab.bgz> [region1 [region2 [...]]]

Options: -p STR     preset: gff, bed, sam, vcf, psltbl [gff]
         -s INT     sequence name column [1]
         -b INT     start column [4]
         -e INT     end column; can be identical to '-b' [5]
         -S INT     skip first INT lines [0]
         -c CHAR    symbol for comment/meta lines [#]
         -r FILE    replace the header with the content of FILE [null]
         -B         region1 is a BED file (entire file will be read) # see this
         -0         zero-based coordinate
         -h         print the header lines
         -l         list chromosome names
         -f         force to overwrite the index

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.4 years ago by Rm 8.3k

0

Entering edit mode

@Rm: You are right. Now that's a strange move... The latest 1.1 version I have just downloaded (tabix is now part of HTSlib) doesn't allow for the -B option although the manual (again out-of-date) still indicates it does: http://www.htslib.org/doc/tabix-1.1.html

ADD REPLY • link 9.4 years ago by Leonor Palmeira 3.8k

0

Entering edit mode

Possible solution:

head -1 gene.bed
1    55505220    55530525     PCSK9    -    +    ENSG00000169174     protein_coding

cat gene.bed | awk -v IVCF="input.vcf" 'BEGIN {FS=OFS="\t"} {bed=$1":"$2"-"$3; out=$4".vcf"; print IVCF, bed ; system("tabix " "-h -p vcf " IVCF " " bed " > " out)}'

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by Rm 8.3k

0

Entering edit mode

I don't understand this solution..Me too I have the same problem with Tabix v1.1 and I was not able to install previous versions such as v0.2.5, make function doesn't work. Did you find any substitutive (in Tabix v1.1) for -B option?

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.0 years ago by raveancic • 0

0

Entering edit mode

Since version 1.2, tabix has -R option instead. See my answer for more info.

ADD REPLY • link 3.8 years ago by mg ▴ 250

score 2 · Answer 2 · 2012-06-19

This is a quick-and-dirty script that I wrote to call tabix on multiple regions, and download genotypes from 1000genomes:

get genotypes from 1000genomes

It requires a file containing the gene coordinates, as following:

DOLK N-glycan substrates chr9 131707808 131710012
RPN1 N-glycan ost_complex chr3 128338812 128399918
ALG8 N-glycan precursor_biosynthesis chr11 77811987 77850699
ALG9 N-glycan precursor_biosynthesis chr11 111652918 111750181
UAP1 N-glycan substrates chr1 162531295 162569633
ST6GAL1 N-glycan branching2 chr3 186648314 186796341
ALG2 N-glycan precursor_biosynthesis chr9 101978706 101984246
ALG3 N-glycan precursor_biosynthesis chr3 183960116 183967313
ALG1 N-glycan precursor_biosynthesis chr16 5083702 5137380
ALG6 N-glycan precursor_biosynthesis chr1 63833260 63904233
GANAB N-glycan cnx_crt chr11 62392297 62414104

But I guess that you can easily modify the code to accept other formats.

It's not much of a good script, because I usually don't have to download genotypes frequently. I used it only a few times. In any case, it has some useful functions, like an option to skip downloading files if they already exist, and it compresses them after downloading.