Question

Finding SNPs per gene for SNP Density in UNIX

8.5 years ago

crawlthrutacos • 0

I am attempting to find out which gene harbors the most variation in a particular isolate by comparing SNP density. I am at a loss, however, as to how to count SNPs per gene. There are 445 SNPs present in the isolate (called with VarScan) across 9 genes. I know the exact length of each gene, but working out the number of SNPs per gene is tripping me up. I can supposedly grep for patterns in UNIX (ssh shell) and then use an option to count the number of occurrences, but so far I can't see which command to use or what pattern to grep for. Can anyone suggest the grep pattern I'm looking for, or point me to an alternative way of counting SNPs per gene in UNIX?

grep SNP VarScan UNIX • 3.0k views

updated 20 months ago by Ram • written 8.5 years ago by crawlthrutacos

score 2 · Answer 1 · 2016-08-18

Via a combination of BEDOPS vcf2bed conversion and bedmap --count operation:

$ vcf2bed < snps.vcf | bedmap --echo --count --delim '\t' genes.bed - > genes_with_number_of_overlapping_snps.bed

You'll need a file called snps.vcf containing your SNPs, and a sorted file called genes.bed containing gene annotations of interest.

The last column of that output file is the number of SNPs overlapping each gene. To calculate the mean number of SNPs per gene, sum all the counts and divide by the number of genes:

$ awk 'BEGIN { s = 0; } \
             { s += $NF; } \
         END { print (s / NR); }' genes_with_number_of_overlapping_snps.bed
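Since the question is about SNP density, one further step could be to divide each gene's count by its length. The following is only a minimal sketch: the file example_counts.bed and its values are made up to stand in for genes_with_number_of_overlapping_snps.bed, and it assumes the first three columns are chromosome, start, and end, the fourth is a gene name, and the last is the SNP count.

```shell
# Toy stand-in for genes_with_number_of_overlapping_snps.bed (values are made up):
# chrom, start, end, gene name, SNP count in the last column.
printf 'chr1\t100\t1100\tgeneA\t20\n'   > example_counts.bed
printf 'chr1\t2000\t2500\tgeneB\t25\n' >> example_counts.bed
printf 'chr1\t3000\t7000\tgeneC\t40\n' >> example_counts.bed

# Density = SNP count / gene length. BED intervals are 0-based and half-open,
# so the length is end - start. Fixed-point output keeps "sort -n" reliable.
awk 'BEGIN { FS = "\t" }
     { len = $3 - $2
       if (len > 0) printf "%s\t%d\t%d\t%.6f\n", $4, $NF, len, $NF / len }' \
    example_counts.bed \
  | sort -t "$(printf '\t')" -k4,4nr > example_density.tsv

# Columns: gene, SNP count, length, SNPs per base; densest gene first.
cat example_density.tsv
```

With real data, the first line of the sorted output would be the gene with the highest SNP density, which need not be the gene with the most SNPs.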

score 0 · Answer 2 · 2015-11-04

I guess you have a table/text file containing the information you need, but not in the SNPs-per-gene format you are seeking. Use data aggregation. If you are not that into command-line approaches, use pivot tables in Microsoft Excel. Learning pivot tables (or any other way to do data aggregation) will make your life easier.
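For anyone who does want to stay on the command line, this kind of aggregation can be sketched with standard UNIX tools. The example assumes a hypothetical tab-delimited table, example_snps.tsv, with one row per SNP and the gene name in the second column; the file name, layout, and rows are invented for illustration and are not VarScan's actual output format.

```shell
# Hypothetical annotated SNP table (made-up rows): SNP position, then gene name.
printf '101\tgeneA\n150\tgeneA\n2100\tgeneB\n2200\tgeneB\n2300\tgeneB\n' > example_snps.tsv

# Pull out the gene column, group identical names together, count each group,
# then order the genes by count, largest first.
cut -f2 example_snps.tsv | sort | uniq -c | sort -k1,1nr > example_snps_per_gene.txt

cat example_snps_per_gene.txt
```

This is the command-line counterpart of a pivot table that counts rows per gene.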