Biostar Beta. Not for public use.
How to split vcf file by chromosome?
3
Entering edit mode
12 months ago
MAPK ♦ 1.4k
United States

I have tried several options on the web including a few post her on Biostar to split my VCF file by chromosome, but could not do it properly. Say I have a vcfile called myvcf.vcf.gz and want to split that per chromosome, what would be the best way to split it by chromosome?

vcf • 13k views
ADD COMMENTlink
0
16
Entering edit mode
15 months ago
venu 6.2k
Germany
bgzip -c myvcf.vcf > myvcf.vcf.gz
tabix -p vcf myvcf.vcf.gz
tabix myvcf.vcf.gz chr1 > chr1.vcf

It will give chr1.vcf file containing variants for chr1. You can loop the last command over all the chromosomes. If you need vcf header also, use -h flag with last command.

ADD COMMENTlink
2
Entering edit mode

How do you extract chrX, chrY and chrM? Doesn't seem to work for those.

ADD REPLYlink
0
Entering edit mode

Then your chromosomes are either called something else or missing.

ADD REPLYlink
0
Entering edit mode

Same problem here for the chromosome Y.

ADD REPLYlink
5
Entering edit mode
23 months ago
cartoonist • 50
Germany

If the VCF file is tabix-indexed, extracting a subset by region name, let say chromosome 1, can be done by bcftools:

bcftools view -r 1 file.vcf.gz
ADD COMMENTlink
3
Entering edit mode
2.1 years ago
ricardo • 30
Brazil/Fiocruz/Minas Gerais

You can use the snpSift splitchr command. It separates a vcf file according to the chromosomes.

The command is simple:

java -jar snpSift splitChr file.vcf
ADD COMMENTlink
2
Entering edit mode

I dont know what version you used but right now this function is used like this :

  java -jar SnpSift.jar split file.vcf

great set of tools btw.

ADD REPLYlink
0
Entering edit mode

Yeah! You tell him!!

ADD REPLYlink
1
Entering edit mode
20 months ago
pyjiang2 • 30
United States

You can use vcftools. For example, if total 16 chromosomes

for i in {1..16};
do vcftools  --vcf  VCF_FILE  --chr $i  --recode --recode-INFO-all --out  VCF_$i;
done
ADD COMMENTlink
0
Entering edit mode
2.6 years ago
willgilks • 260
United Kingdom

The question can be answered in a single command, using a bash loop to feed chromosome names into GATK's "select variants" command as shown below, whereby "-L" specifies which chromosome to select (https://software.broadinstitute.org/gatk/). This has the advantage over other methods in that index files are generated "on-the-fly". Also GATK is usually pretty robust. If using a genome with many chromosomes named e.g 1-22, users should modify the loop parameters, to something like "for i in seq 1 22;do"

for i in chr2L chr2R chr3L chr3R chr4 chrX;do
        GenomeAnalysisTK -R ${ref_seq} \
            -T SelectVariants \
            -V my_flies.vcf \
            -L $i \
                -o my_flies.${i}.vcf
                        done;
ADD COMMENTlink
0
Entering edit mode

Is this a python script?

ADD REPLYlink
0
Entering edit mode

That's a bash loop executing a GATK java program.

ADD REPLYlink
0
Entering edit mode

Thanks much ! Btw, I just stumbled upon https://gigabaseorgigabyte.wordpress.com/2017/05/02/an-orphan-bioinformatician/ while following your profile in biostars and then wordpress.

I am mainly a evolutionary biologist with a huge transition into Bioinformatics. If you would like to discuss about your problem I hope I can help, though I am not a full fledged programmer at this point. I have though managed to prepare a pipeline (or programme, whatever you may call it, lol) using python which is going to need a lots of cleaning and making it efficient, so I am learning. https://github.com/everestial/pHASE-Stitcher

Interesting fact is that I also consider myself a orphan bioinformatician (I like this word !). My situation was even terrible; I started PhD using genome and RNASeq data analyses, then came upon this problem of haplotype phasing in F1 hybrids, which wasn't solvable using any tools available. Additionally, there was null support on the matters of programming in my department and lab, but learning in a hardway has opened my pathway to writing my own program.

Your Biostars rep is quite high so I am thinking you might be quite ahead of me, but I think it would not hurt to discuss.

Thanks, - Bishwa K.

ADD REPLYlink
0
Entering edit mode

Thanks for reading. Feel free to comment on my blog post if you have suggestions, remarks or something else to add.

I have though managed to prepare a pipeline (or programme, whatever you may call it, lol) using python which is going to need a lots of cleaning and making it efficient, so I am learning. https://github.com/everestial/pHASE-Stitcher

That's the thing, you always keep learning and improving. Good luck writing your script!
Every now and then I have a look at old code and give it a makeover, it's surprising how much you learn in a few months.

ADD REPLYlink
1
Entering edit mode

I think the best way to help and discuss is to put your tool on github (not matter how much small it might be). I just helps me or other people to create a branch and suggest the changes. Depending on how it works out for you or how it achieves the goal of the pipeline, you can merge it into the master branch.

Are you on github? Else, I have found these to be quite helpful :)

http://product.hubspot.com/blog/git-and-github-tutorial-for-beginners

https://git-scm.com/docs/git-push

https://github.com/cubeton/git101/tree/master/TurtorialInfo

ADD REPLYlink
0
Entering edit mode

We are going fairly off topic here...
Thanks for the links, but I already have a github account and a few repositories.

ADD REPLYlink
0
Entering edit mode
10 weeks ago
ATpoint 17k
Germany

You can use this code to split any VCF by all chromosomes present in the VCF, using parallel and tabix:

#!/bin/bash

## Assumes uncompressed VCF, extracts entries by chromosomes using tabix and parallel:
if [[ ${1: -4} != ".vcf" ]]
  then
  echo '[ERROR]: Please provide an uncompressed vcf file'
  exit; fi

## Get chromosomes that are present in the VCF:
cat $1 | mawk '$1 ~ /^#/ {next} {print $1 | "sort -k1,1 -u"}' > ${1%.vcf}_chrs.txt

## Compress & index file with tabix:
cat $1 | mawk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1 -k2,2n"}' | bgzip -c > ${1%.vcf}_sorted.vcf.gz
tabix -p vcf ${1%.vcf}_sorted.vcf.gz

## Extract files by chr:
cat ${1%.vcf}_chrs.txt | parallel "tabix -h ${1%.vcf}_sorted.vcf.gz {} > ${1%.vcf}_{}.vcf"

## Usage: ./split_vcf_by_chr.sh input.vcf
ADD COMMENTlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1