Question

Extracting Variants For Select Individuals Overlapping Pre-Specified Regions From 1000 Genomes

11.9 years ago

Ryan D ★ 3.4k

We have a list of a few thousand CNVs/SVs identified using array-based CNV calling methods. They are from five individuals sequenced by the 1000 Genomes Project. We would like to compare the breakpoints of the CNVs identified by our calling with those identified by the 1000 Genomes Project.

So the question: how can I generate a VCF file (presumably using tabix and vcftools) for a select set of individuals, restricted to variants overlapping a list of specified CNVs like the ones below? Moreover, can I do this for thousands of regions in a single step? Should the regions I query be smaller or larger than the CNVs we identified? And can we filter the results to keep only those above a threshold size?

Note: it appears BrentP has an answer about how to get multiple regions at once from 1kG here, but if I am pulling all of these from the 1kG FTP server, must I do it by chromosome? And how would pulling regions handle imperfect overlaps as mentioned above?

Individuals:
NA10851
NA18505

Regions:
chr22:22680529-22726814
chr22:22613016-22670785
chr22:41234550-41276824

Thanks, Ryan

1000genomes vcftools tabix • 4.2k views

updated 11.8 years ago by Laura ★ 1.8k • written 11.9 years ago by Ryan D ★ 3.4k

A little tip: you will get better answers if you ask only one question per topic. Otherwise, it is difficult for people to reply, and you will get only general answers.

11.9 years ago by Giovanni M Dall'Olio 28k

score 2 · Answer 1 · 2012-06-19

I will try to answer the second part of your question.

The manuscript that describes 1000 Genomes data management and community access provides extensive details on accessing 1000 Genomes data.
You can access the data iteratively using Samtools, Data Slicer or Tabix; refer to the 1000 Genomes site for a detailed tutorial. AFAIK, the tabix files are indexed by chromosome, so you can pull out data as you need it from the 1000 Genomes FTP. See the FAQ section "How do I get a slice of your vcf files".
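Because the files are split by chromosome, one practical first step is to group the regions of interest by chromosome so that each per-chromosome file is queried only once. The following is a minimal Python sketch of that grouping, not an official 1000 Genomes tool. The function names are hypothetical, and stripping the "chr" prefix is an assumption based on the phase 1 files using bare chromosome names such as 17; check the contig names in the file you actually query.

```python
from collections import defaultdict


def group_regions_by_chromosome(regions):
    """Group 'chrN:start-end' strings by chromosome.

    Returns a dict mapping the bare chromosome name (no 'chr' prefix,
    an assumption about the target VCF) to a sorted list of (start, end).
    """
    grouped = defaultdict(list)
    for region in regions:
        chrom, span = region.strip().split(":")
        if chrom.startswith("chr"):
            chrom = chrom[3:]
        start, end = (int(value) for value in span.split("-"))
        grouped[chrom].append((start, end))
    return {chrom: sorted(spans) for chrom, spans in grouped.items()}


def tabix_region_args(chrom, spans):
    """Format the spans of one chromosome as tabix region arguments."""
    return [f"{chrom}:{start}-{end}" for start, end in spans]


if __name__ == "__main__":
    regions = [
        "chr22:22680529-22726814",
        "chr22:22613016-22670785",
        "chr22:41234550-41276824",
    ]
    for chrom, spans in group_regions_by_chromosome(regions).items():
        print(chrom, " ".join(tabix_region_args(chrom, spans)))
```

tabix accepts several region arguments in one invocation, so the strings printed for a chromosome can be appended to a single tabix call against that chromosome's file.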

score 2 · Answer 2 · 2012-06-19

Unfortunately, there are no batch-processing tools currently available for the 1000 Genomes data sets. We provide genotypes on a per-chromosome basis, as a whole-genome file would simply be too big to provide.

Something like this would be fairly straightforward to script using tabix and Perl or a similar scripting language, though.

The basic tabix command is provided in the FAQ Khader linked to:

http://www.1000genomes.org/faq/how-do-i-get-sub-section-vcf-file

tabix -h http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.chr17.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz 17:1471000-1472000 | perl vcf-subset -c HG00098 | bgzip -c > HG00098.chr17_1471000-1472000.20110521.genotypes.vcf.gz

You need to write a script that iterates through your BED file and constructs commands like the one above.
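As a rough illustration of such a script, here is a Python sketch that turns BED lines into tabix | vcf-subset | bgzip command strings. It only prints the commands and does not run them. The URL template is copied from the example command and applies only to that release; the function names and the output filename pattern are hypothetical. It assumes standard BED coordinates (0-based start, end-exclusive), which are converted to the 1-based inclusive regions tabix expects, and it passes several samples to vcf-subset as a comma-separated list.

```python
# Assumption: URL pattern taken from the example command for the 20110521
# release; other releases use different paths and file names.
URL_TEMPLATE = (
    "http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/"
    "ALL.chr{chrom}.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz"
)


def bed_line_to_command(line, samples):
    """Build one shell pipeline for a single BED record."""
    chrom, start, end = line.split()[:3]
    if chrom.startswith("chr"):
        chrom = chrom[3:]  # the example file uses bare names such as 17
    start = int(start) + 1  # BED start is 0-based; tabix regions are 1-based
    end = int(end)
    region = f"{chrom}:{start}-{end}"
    # Hypothetical output name: samples, then the region queried.
    out = f"{'_'.join(samples)}.chr{chrom}_{start}-{end}.genotypes.vcf.gz"
    return (
        f"tabix -h {URL_TEMPLATE.format(chrom=chrom)} {region}"
        f" | perl vcf-subset -c {','.join(samples)}"
        f" | bgzip -c > {out}"
    )


def commands_from_bed(lines, samples):
    """Yield one command per BED record, skipping headers and blank lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        yield bed_line_to_command(line, samples)


if __name__ == "__main__":
    bed = ["chr22\t22680528\t22726814", "chr22\t22613015\t22670785"]
    for command in commands_from_bed(bed, ["NA10851", "NA18505"]):
        print(command)
```

The printed lines can be reviewed and then piped to a shell. Each command opens the remote file separately, so for thousands of regions it may be faster to download the relevant chromosome files once and point the template at local copies.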

Our VCF filename convention is quite consistent; generally it is Pop.chr.description.YYYYMMDD. When the file is whole genome, the chromosome field is sometimes missing.
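For scripting against that convention, a small parser can split a filename into its fields. This is only a sketch in Python: the regular expression, the field names, and the treatment of everything after the date as a single trailing field are assumptions made for illustration, and since the convention is described as "generally" holding, files that do not match simply return None.

```python
import re

# Assumed reading of Pop.chr.description.YYYYMMDD, with an optional
# chromosome field and whatever follows the date kept as "rest".
FILENAME_PATTERN = re.compile(
    r"^(?P<pop>[^.]+)\."
    r"(?:(?P<chrom>chr[^.]+)\.)?"
    r"(?P<description>.+?)\."
    r"(?P<date>\d{8})\."
    r"(?P<rest>.+)$"
)


def parse_vcf_filename(name):
    """Return the filename fields as a dict, or None if the name does not fit."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return None
    return match.groupdict()


if __name__ == "__main__":
    example = "ALL.chr17.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz"
    print(parse_vcf_filename(example))
```

When the chromosome field is absent, the chrom entry is None, which a calling script can use to tell whole-genome files apart from per-chromosome ones.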