Retrieve Subset Positions Vcf File
2
2
Entering edit mode
11.8 years ago
Rubal7 ▴ 830

Hello all,

What is the most efficient way to retrieve many subsets of regions from a vcf.gz file? I have about 1000 10kb regions that I need to extract from a whole genome vcf file. I think tabix is the best way to do this but I haven't been able to understand to use it on a large scale to retrieve hundreds of regions. For instance can one input a file with a list of regions (eg 1:11345-112345 for a position on chrom1) and automate the process?

Thanks very much for any help and comments in advance!

vcf tabix • 15k views
ADD COMMENT
10
Entering edit mode
11.8 years ago
brentp 24k

Tabix can take a bed file of regions and pull those out. So:

tabix k1000g.vcf.gz -B regions.bed

and if your regions are formatted like you indicate "1:11345-112345". You can get them into a bed file quickly with awk:

awk 'BEGIN{OFS="\t"; FS=":"} { split($2, a, "-"); print $0, a[1] - 1, a[2] }' regions.txt > regions.bed

You can also intersect a bed file with a VCF using bedtools.

ADD COMMENT
0
Entering edit mode

Thanks, very helpful

ADD REPLY
0
Entering edit mode
-B option is dropped in the latest version(1.X) of tabix
ADD REPLY
0
Entering edit mode

@Rm: can you tell us where you have seen this? The online manual is outdated and from version 0.1.1. I have version 0.2.5 (r964) with the option. It wouldn't make much sense to me to remove that option...

ADD REPLY
0
Entering edit mode

Dropped from latest version ( to read the locations from a file)

Version: 1.1
Usage:   tabix [OPTIONS] [FILE] [REGION [...]]
Options:
   -0, --zero-based        coordinates are zero-based
   -b, --begin INT         column number for region start [4]
   -c, --comment CHAR      skip comment lines starting with CHAR [null]
   -e, --end INT           column number for region end (if no end, set INT to -b) [5]
   -f, --force             overwrite existing index without asking
   -h, --print-header      print also the header lines
   -H, --only-header       print only the header lines
   -l, --list-chroms       list chromosome names
   -m, --min-shift INT     set the minimal interval size to 1<<INT; 0 for the old tabix index [0]
   -p, --preset STR        gff, bed, sam, vcf, bcf, bam
   -r, --reheader FILE     replace the header with the content of FILE
   -s, --sequence INT      column number for sequence names (suppressed by -p) [1]
   -S, --skip-lines INT    skip first INT lines [0]

Program: tabix (TAB-delimited file InderXer)
Version: 0.2.5 (r964)

Usage:   tabix <in.tab.bgz> [region1 [region2 [...]]]

Options: -p STR     preset: gff, bed, sam, vcf, psltbl [gff]
         -s INT     sequence name column [1]
         -b INT     start column [4]
         -e INT     end column; can be identical to '-b' [5]
         -S INT     skip first INT lines [0]
         -c CHAR    symbol for comment/meta lines [#]
         -r FILE    replace the header with the content of FILE [null]
         -B         region1 is a BED file (entire file will be read) # see this
         -0         zero-based coordinate
         -h         print the header lines
         -l         list chromosome names
         -f         force to overwrite the index
ADD REPLY
0
Entering edit mode

@Rm: You are right. Now that's a strange move... The latest 1.1 version I have just downloaded (tabix is now part of HTSlib) doesn't allow for the -B option although the manual (again out-of-date) still indicates it does: http://www.htslib.org/doc/tabix-1.1.html

ADD REPLY
0
Entering edit mode

Possible solution:

head -1 gene.bed
1    55505220    55530525     PCSK9    -    +    ENSG00000169174     protein_coding
cat gene.bed | awk -v IVCF="input.vcf" 'BEGIN {FS=OFS="\t"} {bed=$1":"$2"-"$3; out=$4".vcf"; print IVCF, bed ; system("tabix " "-h -p vcf " IVCF " " bed " > " out)}'
ADD REPLY
0
Entering edit mode

I don't understand this solution..Me too I have the same problem with Tabix v1.1 and I was not able to install previous versions such as v0.2.5, make function doesn't work. Did you find any substitutive (in Tabix v1.1) for -B option?

ADD REPLY
0
Entering edit mode

Since version 1.2, tabix has -R option instead. See my answer for more info.

ADD REPLY
2
Entering edit mode
11.8 years ago

This is a quick-and-dirty script that I wrote to call tabix on multiple regions, and download genotypes from 1000genomes:

It requires a file containing the gene coordinates, as following:

DOLK N-glycan substrates chr9 131707808 131710012
RPN1 N-glycan ost_complex chr3 128338812 128399918
ALG8 N-glycan precursor_biosynthesis chr11 77811987 77850699
ALG9 N-glycan precursor_biosynthesis chr11 111652918 111750181
UAP1 N-glycan substrates chr1 162531295 162569633
ST6GAL1 N-glycan branching2 chr3 186648314 186796341
ALG2 N-glycan precursor_biosynthesis chr9 101978706 101984246
ALG3 N-glycan precursor_biosynthesis chr3 183960116 183967313
ALG1 N-glycan precursor_biosynthesis chr16 5083702 5137380
ALG6 N-glycan precursor_biosynthesis chr1 63833260 63904233
GANAB N-glycan cnx_crt chr11 62392297 62414104

But I guess that you can easily modify the code to accept other formats.

It's not much of a good script, because I usually don't have to download genotypes frequently. I used it only a few times. In any case, it has some useful functions, like an option to skip downloading files if they already exist, and it compresses them after downloading.

ADD COMMENT

Login before adding your answer.

Traffic: 2451 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6