VCF file processing
5.9 years ago

I have a VCF file and have run SnpEff on it for annotation. I need to group the SNPs by the gene they belong to, e.g. SNPs x, y and z belong to gene w, and so on for every gene.

SNP next-gen gene • 1.4k views

Are you trying to extract them into separate files per gene or are you trying to run a burden test or something sophisticated?


Thanks, Vivek, for the quick response. I have a VCF file and a BED/GFF file as input. What I actually want is a separate file per gene.


There are more elegant solutions if you can do some scripting but here's a crude workflow:

If you have one line per gene in the bed file, you can initially split the BED file into one file per gene like this:

split -l 1 Genes.bed Genes-

Depending on the number of genes, you might produce a lot of files here.

Rename the pieces so they carry a .bed extension:

for file in Genes-*; do mv "$file" "$file.bed"; done

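Then use tabix to pull the variants for each gene out of the VCF. tabix works on a bgzip-compressed, indexed file, so compress and index the VCF first (bgzip variants.vcf, then tabix -p vcf variants.vcf.gz). Older tabix releases take the BED file with -B; current htslib-based releases use -R instead:

for bed in Genes-*.bed; do tabix -h -R "$bed" variants.vcf.gz > "variants-${bed%.bed}.vcf"; done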
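If the BED file carries the gene name in its fourth column (an assumption, the thread does not show the file), the split and rename steps could be folded into one pass that names each piece after its gene. The function name split_bed_by_gene below is made up for illustration; this is a rough sketch, not a tested pipeline:

```shell
# Hypothetical helper: write one BED file per gene, named after BED column 4.
# Assumes tab-separated BED with the gene name in column 4 and gene names
# that are safe to use as file names (no slashes or spaces).
# Usage: split_bed_by_gene Genes.bed outdir   (outdir should start out empty)
split_bed_by_gene() {
    bed=$1
    outdir=$2
    mkdir -p "$outdir"
    awk -F'\t' -v dir="$outdir" '
        # Skip track/browser/comment lines and lines without a name column.
        /^(track|browser|#)/ || NF < 4 { next }
        {
            out = dir "/" $4 ".bed"
            print >> out     # append, so several intervals of one gene share a file
            close(out)       # avoid exhausting file handles when there are many genes
        }
    ' "$bed"
}
```

Unlike split -l 1, this keeps all intervals of a gene together when a gene spans several BED lines. Each resulting file can then be fed to the tabix loop in place of the Genes-* pieces.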

It always helps if you can post some example data. Use datamash to group by gene and collapse all SNPs.

input:

$ cat snps.txt
gene    snp
x   a
x   b
x   c
y   d
y   e
z   f
z   g
z   h

output:

$ datamash -H -g 1 collapse 2 < snps.txt
GroupBy(gene)   collapse(snp)
x   a,b,c
y   d,e
z   f,g,h

Install datamash either from here or from your distribution's repositories (Debian-based: sudo apt install datamash -y; conda: conda install datamash -y).
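To get from a SnpEff-annotated VCF to a two-column gene/snp table like snps.txt, something along these lines might work. It assumes the annotations sit in SnpEff's ANN INFO field, where the gene name is the fourth '|'-separated subfield of each entry; the function name vcf_to_gene_snp is hypothetical, and variants without an ID are labelled chrom:pos as an arbitrary choice:

```shell
# Hypothetical helper: print "gene<TAB>snp" pairs from a SnpEff-annotated VCF.
# Assumes an uncompressed VCF whose INFO column contains ANN=... entries.
# Usage: vcf_to_gene_snp annotated.vcf > snps.txt
vcf_to_gene_snp() {
    awk -F'\t' '
        BEGIN { OFS = "\t"; print "gene", "snp" }
        /^#/ { next }                          # skip VCF header lines
        {
            # Use the ID column when present, otherwise fall back to chrom:pos.
            id = ($3 != "." && $3 != "") ? $3 : $1 ":" $2
            n = split($8, info, ";")
            for (i = 1; i <= n; i++) {
                if (info[i] !~ /^ANN=/) continue
                sub(/^ANN=/, "", info[i])
                m = split(info[i], ann, ",")   # one entry per annotated effect
                for (j = 1; j <= m; j++) {
                    split(ann[j], f, "[|]")
                    gene = f[4]                # Gene_Name subfield
                    # Report each gene/variant pair only once.
                    if (gene != "" && !((gene, id) in seen)) {
                        seen[gene, id] = 1
                        print gene, id
                    }
                }
            }
        }
    ' "$1"
}
```

The output is not sorted by gene, and datamash groups only adjacent rows, so either sort it first or add datamash's -s option when collapsing. A variant annotated against several genes appears once under each of them.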


Neeraj, can you post a few lines of the data? I know it should be a standard VCF, but it still helps!
