Question

How to extract chrX, chrY and chrM from a vcf file?

0

Entering edit mode

7.6 years ago

MAPK ★ 2.1k

I have a large vcf file. I need to split them by chromosome for which I have used the following command below. I was able to extract and the chromosomes chr1 to chr22, but could extract chrX,chrY and chrM. How can I get the split the file for those three chromosomes?

command I used :

bgzip -c myvcf.vcf > myvcf.vcf.gz

tabix -p vcf myvcf.vcf.gz

tabix myvcf.vcf.gz chr1 > chr1.vcf

VCF • 7.0k views

ADD COMMENT • link updated 4.1 years ago by yiqiangz ▴ 10 • written 7.6 years ago by MAPK ★ 2.1k

0

Entering edit mode

Any luck with this problem? Same for me with tabix version: 1.9, when I am trying to extract variants from chromosome X, no outputs.

I found the same problem for bedtools2 intersectBed, no chrX outputs.

ADD REPLY • link 4.1 years ago by yiqiangz ▴ 10

0

Entering edit mode

Either no X present or the name you use is different from what is actually used in the file. Is it like chrX or rather X? Try to find that out.

ADD REPLY • link 4.1 years ago by ATpoint 81k

0

Entering edit mode

Thanks.

It is not a chrX or X problem. The vcf files are from "http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/". Genomic regions other than X work fine.

Here is the all command I used (sorry I don't know how to use code mode):

1, ls ALL.chr*.gz |sort -V > file

2, for a in ls ALL.chr*.gz; do bcftools index $a; done

3, bcftools concat -f file -o 1kg38.vcf --threads 10

4, sort bed -k1V -k2n > extract.bed The content of extract.bed looks like:

1 59814786 59876370 ENSG00000134709 +
2 218264123 218270257 ENSG00000127837 -
3 10285675 10292947 ENSG00000157017 -
11 118657316 118679690 ENSG00000118094 -
15 34229996 34338060 ENSG00000140199 -
17 63477061 63498380 ENSG00000159640 +
19 18434388 18438301 ENSG00000105655 -
X 15627318 15675012 ENSG00000147003 -
X 30653359 30731456 ENSG00000198814 +

5, bgzip 1kg38.vcf -@ 10

6, tabix -p vcf 1kg38.vcf.gz

7, tabix -h -R extract.bed 1kg38.vcf.gz > extract.vcf or bcftools view -R extract.bed 1kg38.vcf.gz > extract.vcf

ADD REPLY • link updated 4.1 years ago by ATpoint 81k • written 4.1 years ago by yiqiangz ▴ 10

0

Entering edit mode

Please stop using the answer field for comments, I already moved your previous one to comment for that reason. Also please highlight code with the 10101 button. The code itself is fine, the VCF simply does not contain the variants you look for. If you want to visualize it simply transform the chrX file to BED using

awk 'OFS="\t" {print "chr"$1, $2-1, $2}' <(zcat ALL.chrX.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz) | grep -v '#' > chrX.bed

and then load into the IGV browser. There is nothing in the regions you are querying.

ADD REPLY • link 4.1 years ago by ATpoint 81k

0

Entering edit mode

Hi, I probably figure it out. It is not the problem of tabix, but the problem of the 1000G vcf records. There is a huge gap in the chrX vcf, the coordinate jumps from 2781457 to 155703812, leaving no info in the middle.

ADD REPLY • link 4.1 years ago by yiqiangz ▴ 10

score 4 · Accepted Answer · 2016-08-31

Using vcftools you could pass the chromosomes you want to keep (or exclude) with the position filtering option:

POSITION FILTERING

--chr <chromosome> 
--not-chr <chromosome>

Includes or excludes sites with indentifiers matching <chromosome>.
These options may be used multiple times to include or exclude more
than one chromosome.

you have also options for bed-based filtering.

https://vcftools.github.io/man_latest.html