Biostar Beta. Not for public use.
bash, how can I distinguish synonymous from non-synonymous variants in VCF files?
Entering edit mode
14 months ago

My goal was to find the variants related to a specific gene in a VCF file, through bash in a terminal shell. The wonderful BioStars community helped me understand that in a previous discussion.

For example, to detect the variants of the ARIH1 gene in the inputFile.vcf.gz file, I can run the following commands:

bgzip -c inputFile.vcf > inputFile.vcf.gz
tabix input.vcf.gz
bcftools view inputFile.vcf.gz $ARIH1_genomic_coordinates

Now I have a new task: among the selected output variants, I have to select the non-synonymous variants. How can I do that on a shell terminal?

I thought that one of the VCF file variant fields contained that information, but I cannot find it. The fields are: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SQT_743_100


R vcf variant • 255 views
Entering edit mode

search this site for snpEff and snpSift, VEP ...

Entering edit mode
19 months ago
United States

That kind of information is calculated based on the genomic reference and the gene's coding frame. There may be some transcripts of the gene that do not include Exon2, for example, for which the variant would be intronic. In another transcription frame, the same variant may be a stop-gain.

In a table of ref/alt bases like a VCF represents, this information does not fit very well. So third party tools like snpEff and VEP have their own solutions, and pack the data into the INFO column. Your VCF may not have the information in it yet, and you would have to run an annotation tool on the VCF, producing a larger VCF, with more stuff in the INFO column.

Then, something like grep can pick out the lines of interest. A tool designed to filter vcf mightbe better suited. vcftools can pick out certain rows based on contents of the INFO column.

So take the results of your bcftools view, run it through an annotator, take that resutling VCF, and do another filter on the findings.

I'm being wordy about this because the contents of the INFO column will be some new format, possibly comma delimited, and definitely gene transcript specific. Maybe your gene ARIH1 doesn't have alternative splicing and there's only one transcript, but in general this is a complicated problem genome-wide.


Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.3