Biostar Beta. Not for public use.
Filter VCF with bash commands
1
Entering edit mode
16 months ago
ajuiwl • 30

Hi, I have been trying to manually filter a vcf file by QD score, but I have not been able to do it properly. I thought of using awk and indicate that a specific field has a value greater than 8 (that is the value I want to filter it), but I have records where the QD variable does not occupy the same position.

1       2422614 .       G       A       3711.77 .       AF=1;DP=120;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=30.93;SOR=1.773      GT:AD:ADO:DP:GQ:PL      1/1:0,120:0:120:99:0

1       2430617 .       C       T       2052.76 .       AF=0.5;BaseQRankSum=-1.791;ClippingRankSum=0;DP=292;ExcessHet=3.0103;FS=6.978;MLEAC=1;MLEAF=0.5;MQ=60;MQRankSum=0;QD=7.03;ReadPosRankSum=-1.813;SOR=0.652      GT:AD:ADO:DP:GQ:PL  0/1:144,148:0:292:99:3740

Thank you very much

ADD COMMENTlink
2
Entering edit mode

Hello ajuiwl ,

why do you want to use awk and not a program that is already specilized on parsing and filtering a vcf like bcftools view?

If you realy need to use awk you will have to use something like match() and define a regex pattern.

fin swimmer

ADD REPLYlink
0
Entering edit mode

Some programs change the way variants are normalized, the header, require installations, require dependencies, and other issues that bash code can avoid, moreover I have more control over what I do.

ADD REPLYlink
1
Entering edit mode

Hello again,

I like bash as well. But when it comes to specific tasks where already widely used program exists, I strongly recommend using this. If reinventing the wheel, there is a high chance that you miss something very important. See the discussion about the perl solution below. The problems discussed there are valid for your solution as well.

I would suggest you take a look at bioconda. Using this you will have (nearly) no problems with installations and dependencies. (See also the first part of this tutorial a have written: Creating workflows with snakemake and conda )

fin swimmer

ADD REPLYlink
1
Entering edit mode

That is true, I just found that there are some variants in the vcf that didn't have INFO about QD nor DP, and the commands I did didn't considered this. So I agree with you using already made software in that sense. In addition I will add that sometimes you are working in a server that requires the software to be installed by an admin and it may take some time to get it installed. And that doing it by yourself, you spend more time but you can discover things that you had not considered before, increasing your knowledge about the topic. Thank you for your time!!

ADD REPLYlink
0
Entering edit mode

Hi, try this line; perl -ne 'print $_ if /QD=(\d+)/ && $1>8' test.txt |less -S

ADD REPLYlink
0
Entering edit mode

This answer does not consider decimal positions, in the way that 8.1 won't be kept.

ADD REPLYlink
0
Entering edit mode

Sorry, but this is easy to solve. perl -ne 'print $_ if /QD=(\d+(\.\d+)?)/ && $1>8' test.txt |less -S

ADD REPLYlink
0
Entering edit mode

Just a couple of notes, if I'm not mistaken... the perl script will not interpret correctly tags ending in QD, e.g. a line with XQD=10;QD=2 will be incorrectly returned. Also, negative numbers are not parsed correctly. If you filter for QD > -10, a line with QD=-1 will be missed.

ADD REPLYlink
0
Entering edit mode

OK, let's modify this further: perl -ne 'print $_ if /;QD=(-?\d+(\.\d+)?)/ && $1>8' test.txt |less -S

ADD REPLYlink
0
Entering edit mode

Sorry to be a pain... This fails if the QD tag is the first one in the INFO field since in such case it will not be preceded by ;.

ADD REPLYlink
0
Entering edit mode

Never mind. If so, you can add another '?' : perl -ne 'print $_ if /;?QD=(-?\d+(\.\d+)?)/ && $1>8' test.txt |less -S

It is just a solution for simple data as supplied in the post, with more complex data, we can do some modification of this, or use other better methods as below.

ADD REPLYlink
2
Entering edit mode
16 months ago
ajuiwl • 30

Thank you for your answers, I found the way of doing it with:

grep -Po 'QD=.*' $FILE | cut -d";" -f1 | sed -r 's/[QD\=]//g' | paste -d'\t' $FILE  - | awk 'BEGIN{OFS="\t"}{if ($11>8) {$11=""; print}}'
ADD COMMENTlink
1
Entering edit mode

Hello ajuiwl,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

code_formatting

Thank you!

ADD REPLYlink
2
Entering edit mode
13 months ago
WCIP | Glasgow | UK

If you want a pure awk solution, what about this?

awk -v FS="\t" '{
    qd=$8; 
    sub(/(.*;QD=)|(QD=)/, "", qd); 
    sub(/;.*/, "", qd); 
    qd= qd+0; 
    if($1 ~ "^#" || qd > 8) {
        print $0
    }
}' $FILE

However, I would go for bcftools view as suggested by @finswimmer.

ADD COMMENTlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1