Use tabix with a list of regions
6
5
Entering edit mode
11.8 years ago
win ▴ 970

Hi all,

I need to use tabix for a project and the following is the syntax that is used, for e.g.

tabix -fhftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/ALL.chr12.phase1.projectConsensus.genotypes.vcf.gz 12:2345295-2345295 > genotypes.vcf

The question I have is if i have a lot of regions in the same chromosome, is there a way that I can supply a file with the regions as input, what would the syntax be? In other words can I just replace the chromosome number and coordinate with a text file name with each region on a new line?

Thanks
Ashwin

tabix • 19k views
ADD COMMENT
9
Entering edit mode
11.8 years ago
JC 13k

Put your regions in a text file (one per line) and stream to tabix:

xargs -a regions.txt -I {} tabix -fh  ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/ALL.chr12.phase1.projectConsensus.genotypes.vcf.gz {} >> genotypes.vcf
ADD COMMENT
0
Entering edit mode

i just realized another issue, the vcf files i have are annotated with rs#, whereas the above approach will search using chromosomal coordinates, i wonder is there was a way to search by variant name using the above approach.

ADD REPLY
0
Entering edit mode

you can convert the rs# to coordinates in dbSNP

ADD REPLY
0
Entering edit mode

hi, i just tried to run in this command in a command window in ubuntu 12 and i get a message

[kftpconnectfile] 350 Restarting at 0. Send STORE or RETRIEVE to initiate transfer [main] fail to open the data file.

the command i typed was

xargs -a Chr12.txt -I {} tabix -fh ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/ALL.chr12.phase1.projectConsensus.genotypes.vcf.gz {} >> genotypes.vcf

any ideas?

ADD REPLY
0
Entering edit mode

yes it does, remove that damn < > (I removed them from the response)

ADD REPLY
0
Entering edit mode

hi jc, all of your input has certainly helped, but i had another question. When i run the aboe command i get a header line for each of the variant that was found. is it not possible to have just one header line accross the entire VCF file?

ADD REPLY
0
Entering edit mode

you can remove the header removing the -h flag, but that removes all the headers, to obtain just one header, first generate the output file with a null base:

# generating the header
tabix -fh  ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/ALL.chr12.phase1.projectConsensus.genotypes.vcf.gz chr1:1 > genotypes.vcf
# then you can call all your positions
xargs -a regions.txt -I {} tabix -f  ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/ALL.chr12.phase1.projectConsensus.genotypes.vcf.gz {} >> genotypes.vcf
ADD REPLY
6
Entering edit mode
9.5 years ago

Tabix has a -B option which allows you to input a .bed file of regions of interest:

tabix -fhB ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/ALL.chr12.phase1.projectConsensus.genotypes.vcf.gz regions.bed > genotypes.vcf
ADD COMMENT
0
Entering edit mode

I'm trying to get tabix to read in a .bed file as you say, so far without success. What should the format of the .bed lines be? I have tried multiple formats, including this specification at UCSC as well as the format suggested in an answer to similar Biostars question. Also, what about the comment by Rm on that answer, claiming that the -B option has been dropped from tabix? Is that out of date?

ADD REPLY
0
Entering edit mode

ask a new question for your data please + show us a few lines of your bed.

ADD REPLY
0
Entering edit mode

I apologize, my only problem is that I was not careful in reading what people write. I thought that the .bed filename was part of the -B option. Of course, that's not how it works. The -B option just tells tabix to look for a bed file instead of 'region1'.

ADD REPLY
0
Entering edit mode

@Greg P: I think Rm is talking about the outdated manual available here, I haven't seen that the -B option has been dropped (wouldn't make much sense to me either).

ADD REPLY
0
Entering edit mode

-B has been dropped, and since version 1.2, tabix has -R option instead. See my answer for more info.

ADD REPLY
4
Entering edit mode
11.8 years ago
cl.parallel ▴ 140

Yet another way would be a BASH script as:

while read line;do
    tabix -fh ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/ALL.chr12.phase1.projectConsensus.genotypes.vcf.gz $line >> genotypes.vcf
done < coord_file

but I like the simplicity of the use of xargs in JC's answer otherwise.

ADD COMMENT
0
Entering edit mode

thanks! I see you're in Seattle, where? I'm at ISB

ADD REPLY
0
Entering edit mode

UW Microbiology

ADD REPLY
0
Entering edit mode

well no so far, maybe we can chat sometime

ADD REPLY
0
Entering edit mode

Hi,

Thanks for this. I suppose this will generate as many files as the coord_file has right? Please let me know if I'm wrong. I'm trying to use tabix to extract information from vcf files for the coordinates I want, based on a text file with all my regions. I need to have as many files as my region file has. e.g if I have 100 regions, I will have 100 files as well. So in the example above, I will have 100 genotypes.$line.vcf.

Thanks

ADD REPLY
1
Entering edit mode
11.8 years ago

I would rather curl the vcf and pipe the output with bedtools intersect

something like (I cannot check my command I'm away from my lab):

curl -s "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/ALL.chr12.phase1.projectConsensus.genotypes.vcf.gz"  | 
gunzip -c |
intersectBed -a stdin  -u -b list.bed
ADD COMMENT
0
Entering edit mode

I think it's better to use tabix. Curl forces you to download the whole file locally; while tabix can take advantage of the tabix indexes (which are already in the 1000genomes ftp), and download only the regions of interest.

ADD REPLY
0
Entering edit mode

I don't think there is any magic on the server side. Tabix download the data and he knows how many bits he should 'skip' and when he should 'stop' scanning. But if all your regions are at the bottom (e.g chrY), it will have to fetch everything each time.

ADD REPLY
0
Entering edit mode

The magic is in the .tbi files. These contains indexes for positions in the vcf files, so tabix doesn't have to download the whole file and scan it. The 1000genomes website already contains some .tbi files, which are detected automatically when you call the tool. Look at ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/

ADD REPLY
0
Entering edit mode

Giovanni, I agree. But once tabix has downloaded the index, it must download the data itself from the server until it get the right offset.

ADD REPLY
1
Entering edit mode
11.8 years ago

This is a python script that I wrote as a wrapper to call tabix on a list of gene coordinates:

However, the xargs solution is more elegant. The only advantage of my code is that it doesn't download files if they are already present in the directory, so you can interrupt the download and continue it later, or add genes to the list after downloading them all.

ADD COMMENT
1
Entering edit mode
3.8 years ago
mg ▴ 250

Since version 1.2, tabix uses -R option to accept file with regions of interest. From tabix's commandline help message:

-R, --regions FILE restrict to regions listed in the file

And tabix's doc has more info on how this file should look like:

-R, --regions FILE.      Restrict to regions listed in the FILE. The FILE can be BED file 
      (requires .bed, .bed.gz, .bed.bgz file name extension) or a TAB-delimited file with 
      CHROM, POS, and, optionally, POS_TO columns, where positions are 1-based and inclusive. 
      When this option is in use, the input file may not be sorted.

Note that, unlike typical bed file (where START is zero-based), both POS and POS_TO are one-based.

ADD COMMENT

Login before adding your answer.

Traffic: 2820 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6