Extracting Variants For Select Individuals Overlapping Pre-Specified Regions From 1000 Genomes
11.9 years ago
Ryan D ★ 3.4k

We have a list of a few thousand CNVs/SVs identified using array-based CNV calling methods. They come from five individuals sequenced by the 1000 Genomes Project. We would like to compare the breakpoints of the CNVs from our calls with those identified by 1000 Genomes.

So the question: how can I generate a VCF file (presumably using tabix and vcftools) for a select set of individuals, restricted to variants overlapping a list of specified CNVs like the ones below? Moreover, can I do this for thousands of regions in a single step? Should the regions I query be smaller or larger than the CNVs we identified? And can we filter the results to those above a threshold size?

Note: it appears BrentP has an answer about how to get multiple regions at once from 1kG here, but if I am pulling all of these from the 1kG FTP server, must I do it by chromosome? And how would pulling regions handle the imperfect overlaps mentioned above?

Individuals:
NA10851
NA18505

Regions:
chr22:22680529-22726814
chr22:22613016-22670785
chr22:41234550-41276824

Thanks, Ryan

1000genomes vcftools tabix • 4.2k views

A little tip: you will get better answers if you ask only one question per topic. Otherwise it is difficult for people to reply, and you will only get general answers.

11.9 years ago

I will try to answer the second part of your question.

The manuscript describing 1000 Genomes data management and community access provides extensive details on accessing 1000 Genomes data.
You can iteratively access data using Samtools, Data Slicer or Tabix; refer to the 1000 Genomes site for a detailed tutorial. AFAIK, the tabix-indexed files are provided per chromosome, and you can pull out data as you need from the 1000 Genomes FTP. See the FAQ section "How do I get a slice of your vcf files".

11.9 years ago
Laura ★ 1.8k

Unfortunately there are no batch processing tools currently available for the 1000 Genomes data sets. We provide genotypes on a per-chromosome basis, as a whole-genome file would simply be too big to provide.

Something like this would be fairly straightforward to script using tabix and perl or a similar scripting language, though.

The basic tabix command is provided in the FAQ Khader linked to:

http://www.1000genomes.org/faq/how-do-i-get-sub-section-vcf-file

tabix -h http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.chr17.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz 17:1471000-1472000 | perl vcf-subset -c HG00098 | bgzip -c > HG00098.chr17_1471000-1472000.20110521.genotypes.vcf.gz

You need to write a script which iterates through your BED file and constructs commands like the one above.
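As a rough illustration of such a script, here is a minimal Python sketch that groups the regions from the question by chromosome and builds one tabix pipeline per chromosome. It rests on several assumptions: the URL template is simply the chr17 URL above with the chromosome swapped in, `vcf-subset` is assumed to be on the PATH, and `parse_region`, `build_commands`, the `pad` parameter and the output file name are hypothetical names made up for this example. tabix accepts several regions in one call, and the "chr" prefix is dropped because the command above queries "17:...".

```python
import shlex
from collections import defaultdict

# Assumption: every per-chromosome file is named like the chr17 example above.
URL_TEMPLATE = (
    "http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/"
    "ALL.chr{chrom}.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz"
)


def parse_region(text):
    """Turn 'chr22:22680529-22726814' into ('22', 22680529, 22726814)."""
    chrom, span = text.strip().split(":")
    start, end = span.split("-")
    # The example command queries "17:...", so strip a leading "chr".
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return chrom, int(start), int(end)


def build_commands(regions, samples, pad=0):
    """Return one shell pipeline per chromosome covering all of its regions."""
    by_chrom = defaultdict(list)
    for text in regions:
        chrom, start, end = parse_region(text)
        # pad widens each query on both sides; positions are 1-based, so clamp at 1.
        by_chrom[chrom].append((max(1, start - pad), end + pad))

    commands = []
    for chrom, spans in by_chrom.items():
        # Query regions in coordinate order so the output stays roughly sorted.
        queries = [f"{chrom}:{start}-{end}" for start, end in sorted(spans)]
        out_name = f"subset.chr{chrom}.genotypes.vcf.gz"  # hypothetical output name
        commands.append(
            "tabix -h {url} {queries} | vcf-subset -c {samples} | bgzip -c > {out}".format(
                url=shlex.quote(URL_TEMPLATE.format(chrom=chrom)),
                queries=" ".join(shlex.quote(q) for q in queries),
                samples=shlex.quote(",".join(samples)),
                out=shlex.quote(out_name),
            )
        )
    return commands
```

Calling `build_commands(["chr22:22680529-22726814", "chr22:22613016-22670785"], ["NA10851", "NA18505"])` returns a single pipeline for chromosome 22, which can be printed and reviewed before running. If padded or neighbouring query regions overlap each other, the same record may be printed more than once, so the output may need de-duplicating.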

Our VCF filename convention is quite consistent: generally it is Pop.chr.description.YYYYMMDD. When the file is whole genome, the chromosome field is sometimes missing.
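On the size-threshold part of the question, one possible approach is to post-filter the extracted VCF. The sketch below is not an official 1000 Genomes tool; `sv_span` and `filter_by_size` are hypothetical helper names. It assumes the structural variant records carry an `END=` entry in the INFO column, as the VCF specification defines for structural variants, and it measures size as END minus POS. Records without `END=` (SNPs and small indels) are dropped.

```python
def sv_span(vcf_line):
    """Return END - POS for a VCF record with an END= INFO entry, else None."""
    fields = vcf_line.rstrip("\n").split("\t")
    pos = int(fields[1])  # column 2 is POS
    for entry in fields[7].split(";"):  # column 8 is INFO
        if entry.startswith("END="):
            return int(entry[4:]) - pos
    return None


def filter_by_size(lines, min_size):
    """Yield header lines plus records whose END-based span is >= min_size."""
    for line in lines:
        if line.startswith("#"):
            # Keep the header so the output is still a valid VCF.
            yield line
            continue
        span = sv_span(line)
        if span is not None and span >= min_size:
            yield line
```

The generator accepts any iterable of lines, so it can be fed an open file handle or standard input holding the decompressed output of the tabix pipeline.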
