Finding SNPs per gene for SNP Density in UNIX
2
0
Entering edit mode
8.5 years ago

I am attempting find out which gene harbors the most variation by using SNP density in a particular isolate. I am at a loss to figure out SNPs per gene, however. There's 445 SNPs present in the isolate with 9 genes. I used VarScan for that. I know their exact lengths but figuring out SNPs per gene is tripping me up. I can supposedly grep for patterns in UNIX (ssh shell) and then use an option to count the number of occurrences, but so far I'm not seeing what command to use/what pattern to grep for. Can anyone make sense of what grep pattern I'm looking for or lead me to an alternative way of figuring out SNPs per gene in unix...

grep SNP VarScan UNIX • 3.0k views
ADD COMMENT
2
Entering edit mode
7.7 years ago

Via a combination of BEDOPS vcf2bed conversion and bedmap --count operation:

$ vcf2bed < snps.vcf | bedmap --echo --count --delim '\t' genes.bed - > genes_with_number_of_overlapping_snps.bed

You'll need a file called snps.vcf containing your SNPs, and a sorted file called genes.bed containing gene annotations of interest.

To calculate SNPs per gene, you sum all the counts and divide by the number of genes:

$ awk 'BEGIN { s = 0; } \
             { s += $NF; } \
         END { print (s / NR); }' genes_with_number_of_overlapping_snps.bed
ADD COMMENT
0
Entering edit mode
8.5 years ago
Irsan ★ 7.8k
I guess you have a table/text file containing the information you need but not in the snp per gene format you are seekong. Use data aggregation. If you are not that into command line approaches, use pivot tables from microsoft excel. Learning pivot tables (or any other way to do data aggregation) will make your life easier
ADD COMMENT

Login before adding your answer.

Traffic: 1940 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6