Extract vcf from 1000 genomes based on population and region
2
0
Entering edit mode
7.1 years ago
ammarsabir15 ▴ 70

I want to see the mutation in SCN5A (which is on chromosome 3:38548066-38649628) in Punjabi population and for this purpose I have downloaded the vcf file for chromosome 3 from here.

The file is very large and contain variants (on chromosome 3) for all the populations in 1000 genomes project and I do not know how to filter it based on population and region that's why I am not able to get variants for the region of interest.

How this vcf file can be filtered based on region and population ?

alignment sequencing snp • 6.0k views
ADD COMMENT
1
Entering edit mode

There are a lot of tools to subset a VCF file by region/sample - this is a routine requirement. Did you try searching the forum? GATK has a couple of tools, and so does vcftools/tabix/plinkseq.

You will need to generate the list of samples that form the population of your choice - that is, the samples being a part of a population is more of a logical grouping than something an algorithm can figure out.

ADD REPLY
0
Entering edit mode

But there is still a problem. I want to compare these blast results part1 of blast results and part2 of blast results i.e the G shown in yellow in part2 with the variants file for punjabi population from 1000 genomes project. The variant file is here. I want to check that whether at the same position this mutation exists in punjabi population or not. But I am not getting the right way to do it because the blast file has different numbers for position of bases than the vcf file. So what can be the way to compare these?

ADD REPLY
0
Entering edit mode

Do not add answers unless you're answering the question. This deserves to be a comment reply on an appropriate post. I'm moving it to a comment on the top level post now.

ADD REPLY
0
Entering edit mode

I tried to add it as a comment but it gave error that's why I used this option.

ADD REPLY
0
Entering edit mode

What was the exact error you faced?

ADD REPLY
1
0
Entering edit mode

@Emily thanks for the tool its great. As I said I want to compare mutations from above given blast results with one here.. But in above blast results I am confused about the right position of mutation on the subject (the mutation which is highlighted). I think it is on position 38666125 of chromosome # 3 , is it right ?

ADD REPLY
0
Entering edit mode
7.1 years ago
aham ▴ 40

You can do this with vcftools and bcftools:
First extract the variants in Punjabi population using 'vcftools':

vcf-subset -e -c list_of_Punjabi_persons.txt 1000_genomes_chr3.vcf > PJL_chr3.vcf

Then, extract the variants in specified region using bcftools:

bcftools view -R target_region.bed -o PJL_SCN5A.vcf PJL_chr3.vcf.gz
ADD COMMENT
0
Entering edit mode

I was successful in getting list of variants for punjabi population i.e PJL_chr3.vcf. The command used for this purpose was

zcat genotypes.vcf.gz | vcf-subset -e -c list1.txt > PJL_chr3.vcf

But when I tried to extract desired region i.e chr3:38540899-38649799 from thenvcf file then I get a vcf file containing only headers for the vcf file. The command used for this was

bcftools view -r chr3:38540899-38649799 -o PJL_SCN5A.vcf PJL_chr3.vcf.gz

Is there any mistake in this command?

ADD REPLY
0
Entering edit mode

You may want to double check the chromosome naming scheme - it may be 3:start-end and not chr3:start-end

ADD REPLY
0
Entering edit mode

Your solution worked in this way.

   bcftools view -r 3:38548066-38649628  -o PJL_SCN5A.vcf PJL_chr3.vcf.gz
ADD REPLY
0
Entering edit mode

I suggest you are using the wrong coordinates of SCN5A gene. If your 'PJL_chr3.vcf.gz' file is on hg19 build, which I am sure so, then the correct coordinates of this gene are: '3:38589548-38691164' (from gencode v.19). Please try using these coordinates. Also check the chromosome naming scheme in input file, as suggested by @Ram.

ADD REPLY
0
Entering edit mode

mshakeel yes you are right, the vcf is on hg19 build as it states in header file of vcf

##reference=ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz

Should I only use '3:38589548-38691164' or a list of coordinates for SCN5A? Also for the same gene there are listed many coordinates so which to use? For example the first one is '3:38674526-38687267' and the second one is which you have told.

ADD REPLY
0
Entering edit mode

Please see here to find the coordinates of this gene according to hg19/hs37d5 assembly. The correct coordinates of the gene '3:38589548-38691164'.

ADD REPLY
0
Entering edit mode

@aham your solution worked thanks for help.

ADD REPLY
0
Entering edit mode

in case you need to perform this 2-step process, it'd be much faster first to filter by region and then to select the samples. anyway, you can merge both commands in one:

bcftools view -r 3:38589548-38691164 -S list_of_Punjabi_persons.txt -o PJL_SCN5A.vcf 1000_genomes_chr3.vcf
ADD REPLY
0
Entering edit mode

following your suggestions ,use "vcf-subset -e -c ceu_sample.txt ALL.chr7.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz > ceu_chr7.vcf" I can't get the vcf files for CEU, ceu_sample.txt including all the sample name of cue,could you give me some suggestions

ADD REPLY
0
Entering edit mode

Why not use the Data Slicer posted by Emily above in this post ??

ADD REPLY

Login before adding your answer.

Traffic: 2514 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6