Biostar Beta. Not for public use.
strand bias filtering with bcftools
0
Entering edit mode
14 months ago
user31888 • 60
United States

I called SNPs with samtools/bcftools from RNASeq data as follows, and then try to keep variants showing a strand bias p-value < 0.01 (Fischer's exact test).

#CALLING
samtools mpileup \
--fasta-ref my_ref.fasta \
--min-MQ 1\
--min-BQ 30 \
--uncompressed \
--output-tags AD,INFO/AD,DP,SP \
--skip-indels \
input.bam \
| \
bcftools call \
--consensus-caller \
--variants-only \
--skip-variants indels \
--output-type v \
--output call.vcf

#FILTERING
bcftools filter \
--include INFO/PV4[0] < 0.01 \
--mode x \
--output call_filtered.vcf
  • Since some records are missing the INFO/PV4 tag, would it be the same to filter the variants based on the FORMAT/SP value?

According to the vcf header, SP is the "Phred-scaled strand bias P-value", whereas PV4[0] is the "P-values for strand bias".

Would it mean that PV4[0] < 0.01 is equivalent to SP > 20 (Q = -10*log10(0.01) = 20)?

Are PV4[0] and SP both calculated with Fischer's exact test?

  • You can also read here that GATK keeps SNPs with strand bias p-value < 60 for filtering SNP (equivalent to p-value > 0.000001). But they also mention here that > A higher score indicates a higher probability that a particular > decision is correct, while conversely, a lower score indicates a > higher probability that the decision is incorrect.

In this case why they do not keep SNPs > 60 instead of < 60 (in my case above p-val < 0.01 should be equivalent to a strand bias Phred > 20 instead of < 20) ?

ADD COMMENTlink
0
Entering edit mode
13 months ago
Republic of Ireland

Would it mean that PV4[0] < 0.01 is equivalent to SP > 20 (Q = -10*log10(0.01) = 20)?

In bioinformatics, never assume anything without concrete evidence. So, I would not make this assumption without seeing the exact calculations that are being performed in order to generate both of these metrics.

I do believe that Fisher's test is being used, but I cannot confirm. Your better option may be to look in the manual for samtools or to contact Heng Li. who developed BWA and SAMtools / BCFtools.

--------------------------------------------

You can also read here that GATK keeps SNPs with strand bias p-value < 60 for filtering SNP (equivalent to p-value > 0.000001). But they also mention here that > A higher score indicates a higher probability that a particular > decision is correct, while conversely, a lower score indicates a > higher probability that the decision is incorrect.

These statements do not contradict each other. By keeping variants that are < Phred-scaled P value 60, only variants showing low evidence of strand bias are retained. The other statement is then just saying that a higher Phred-scaled P value score indicates that the "decision" that strand bias exists is indeed correct.

Kevin

ADD COMMENTlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1