Question

filtering SNPs by samples?

2

Entering edit mode

4.9 years ago

jingjin2203 ▴ 60

Hi all,

I have a bcf file containing SNPs called across 12 different samples. I am specifically interested in the SNPs that occurred in the first 6 samples but are absent in the second 6 samples.

I was wondering if there is any tool that would allow me to do this?

Any help is appreciated! Thanks a lot!

snp sequencing bcftools • 1.5k views

ADD COMMENT • link updated 4.9 years ago by guillaume.rbt ★ 1.0k • written 4.9 years ago by jingjin2203 ▴ 60

score 1 · Answer 1 · 2019-05-16

1

Entering edit mode

4.9 years ago

guillaume.rbt ★ 1.0k

You can use SnpSift to filter a vcf ( http://snpeff.sourceforge.net/SnpSift.html ).

For your problem it would look something like :

cat snps.vcf | java -Xmx20000m -jar SnpSift.jar filter "GEN[0].GT = 1 & GEN[1].GT = 1 & GEN[2].GT = 1 & GEN[3].GT = 1 & GEN[4].GT = 1 & GEN[5].GT = 1 & GEN[6].GT = 1 & GEN[7].GT = 0 & GEN[8].GT = 0 & GEN[9].GT = 0 & GEN[10].GT = 0 & GEN[11].GT = 0 & GEN[12].GT = 0"  > filtered_snps.vcf

(it would keep SNPs that are all genotyped as variant for all first 6 samples, and genotyped as reference for the last 6 samples)

ADD COMMENT • link 4.9 years ago by guillaume.rbt ★ 1.0k

1

Entering edit mode

Thanks a lot! Really appreciated it! I am currently trying your recommendation, but having some trouble. Hope you could further help me out. 1. The SnpSift manual says I can create an expression using sample names instead of genotype numbers, so I put sample1name in GEN[], am I doing it right? 2. The organism I am working on is diploid, so I changed the genotype type to 1/1 instead of 1.

cat snps.vcf | java -Xmx20000m -jar SnpSift.jar filter "GEN[sample1name].GT = 1/1"

It gives me the following output. Could you please correct me if I did anything wrong? Thanks again!

Exception in thread "main" java.lang.RuntimeException: INFO field 'H' not found in VCF header
at org.snpsift.lang.expression.Field.getReturnType(Field.java:242)
at org.snpsift.lang.expression.Field.eval(Field.java:63)
at org.snpsift.lang.expression.ExpressionBinary.eval(ExpressionBinary.java:26)
at org.snpsift.lang.expression.ExpressionBinary.eval(ExpressionBinary.java:25)
at org.snpsift.lang.expression.FieldSub.evalIndex(FieldSub.java:35)
at org.snpsift.lang.expression.FieldSub.evalIndex(FieldSub.java:27)
at org.snpsift.lang.expression.FieldGenotype.evalGenotype(FieldGenotype.java:24)
at org.snpsift.lang.expression.FieldGenotype.getFieldString(FieldGenotype.java:49)
at org.snpsift.lang.expression.Field.eval(Field.java:76)
at org.snpsift.lang.expression.ExpressionBinary.eval(ExpressionBinary.java:25)
at org.snpsift.SnpSiftCmdFilter.evaluate(SnpSiftCmdFilter.java:142)
at org.snpsift.SnpSiftCmdFilter.annotate(SnpSiftCmdFilter.java:91)
at org.snpsift.SnpSiftCmdFilter.run(SnpSiftCmdFilter.java:355)
at org.snpsift.SnpSiftCmdFilter.run(SnpSiftCmdFilter.java:331)
at org.snpsift.SnpSift.run(SnpSift.java:588)
at org.snpsift.SnpSift.main(SnpSift.java:76)

ADD REPLY • link 4.9 years ago by jingjin2203 ▴ 60

0

Entering edit mode

You're welcome! It seems that there is something wrong with the header of your vcf.

To use samples names instead of their position you have to be sure to have a properly formated header in your vcf.

It must be like described here : http://www.internationalgenome.org/wiki/Analysis/vcf4.0/

With the samples names like this (it's a tabulation between each fields):

#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO  FORMAT NA00001  NA00002  NA00003

Also the syntax of diploid genotypes requires simples quotes in the SnpSift command :

cat cancer.vcf | java -jar SnpSift.jar filter "GEN[Somatic].GT = '2/1'"

ADD REPLY • link 4.9 years ago by guillaume.rbt ★ 1.0k

0

Entering edit mode

Hi guillaume.rbt, thank you for your kind reply!! After modifying the code, I had the same error as I posted above. I am not aware of anything wrong in the header of my vcf file. I attached a small section of my vcf file, it would be great if you could help me take a look at it. Thanks again!

##fileformat=VCFv4.2

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT H3-H-3.sorted.bam H3-H-5.sorted.bam H3-H-6.sorted.bam H3-WZ-1.sorted.bam H3-WZ-2.sorted.bam H3-WZ-5.sorted.bam WZ3-H-1.sorted.bam WZ3-H-2.sorted.bam WZ3-H-6.sorted.bam WZ3-WZ-1.sorted.bam WZ3-WZ-2.sorted.bam WZ3-WZ-3.sorted.bam

7000000185249020 4947 . C T 47.8636 . DP=133;VDB=0.109869;SGB=3.0073;RPB=0.997443;MQB=0.493166;BQB=0.964275;MQ0F=0.0225564;ICB=0.0072327;HOB=0.00347222;AC=1;AN=24;DP4=125,0,6,0;MQ=43 GT:PL 0/0:0,93,183 0/0:0,12,62 0/0:0,18,120 0/0:0,9,18 0/0:0,12,40 0/0:0,36,164 0/0:0,3,36 0/0:0,27,157 0/0:0,51,170 0/0:0,18,63 0/0:0,60,138 0/1:87,0,146

ADD REPLY • link 4.9 years ago by jingjin2203 ▴ 60

0

Entering edit mode

Indeed your VCF header seems ok, have you tried without the samples names but with their position number to see if it would work?

(for example, a test for just your first sample, H3-H-3.sorted.bam, being alternative homozygote : SnpSift.jar filter "GEN[0].GT=='1/1'" )

ADD REPLY • link 4.9 years ago by guillaume.rbt ★ 1.0k

0

Entering edit mode

It actually worked when I tried "GEN[0].GT=='1/1'" instead of using sample name! I guess I will go with that. Sorry for keeping bugging you, I just had another question. Does the 0 in GEN[0] refer to the first sample in VCF file? In my case, I have 12 samples, so I would put GEN[0] to GEN[11] in the code, is that correct? I guess it is a little bit counter intuitive to me using 0 to refer to the first sample.

Thanks a lot! I really appreciated your time and help!

ADD REPLY • link 4.9 years ago by jingjin2203 ▴ 60

0

Entering edit mode

Sorry I couldnt't help you with the sample name issue! Yes indeed the 0 correspond to the first sample, and 11 would be the 12th sample. Best of luck with your analysis

ADD REPLY • link 4.9 years ago by guillaume.rbt ★ 1.0k